
How to Transform Sensitive Data Files with AI Without Sending the Data to the Model

Scott Delia
security, privacy, sensitive-data, ai

Somewhere in your company this week, someone had a deadline, a data file, and a chatbot open in the next tab. Maybe it was a claims export that needed flattening for a clearinghouse. Maybe it was a customer list that needed deduplicating before an import. The fastest path was obvious: paste the file in, describe the output, copy the result.

This habit is common. Cyberhaven found that 40% of workplace AI interactions involve sensitive data, often through unmanaged personal accounts.

This post is about a different way to get the same speed: use the AI to write the transformation scripts, not to perform the transformation. The AI writes the scripts, the scripts process your data. Done right, that means no data sent to AI at run-time, while the work still takes minutes instead of a sprint ticket.

Why pasting data into a chatbot does not scale

You lose control of the data the moment you paste it. A general-purpose chatbot account, especially a personal one, sits outside your governance. Retention is set by the provider's terms, which vary by plan and change over time. There is no audit trail, no retention schedule you control, and no way to un-paste.

The exposure repeats. Data work recurs: the vendor file arrives every Monday, and the claims export runs every month. Pasting is not a one-time event; it is a habit, and every repetition puts a fresh copy of your data on someone else's infrastructure.

Models are not built to move rows. A language model transforms data by generating it, token by token. On 40 rows you can eyeball the result. On 40,000 rows you cannot, and a model asked to produce that much output can truncate, summarize, and occasionally invent rows. A missing row in a claims file is not a quirky AI story; it is a reconciliation incident.

None of this means the instinct is wrong. The instinct is exactly right: reshaping files is the kind of tedious, structural work AI is genuinely good at. The mistake is in what gets handed to the model.

Separate the thinking from the doing

A data transformation is two jobs that usually get conflated:

  1. Deciding how to transform. Understanding the input layout, the target shape, the type coercions, the edge cases. This is the judgment-heavy part, and modern AI is excellent at it.
  2. Actually transforming. Applying those decisions to every row. This part is mechanical, and it is what ordinary code has been perfect at for fifty years.

The model only needs to be involved in the first job. To write the scripts that rename columns, reformat dates, and flatten nested records, the AI needs to understand the shape of your data: field names, types, formats, quirks. It does not need your million rows.

Once a script exists, the model's job is over. A script is deterministic: the same input produces the same output, on row 3 and on row 3 million, this week and next year. Nothing about running it requires AI, so nothing about running it should send data to AI.
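
To make that concrete, here is a minimal sketch of the kind of script the build phase might produce. The column names (MemberID, SvcDate, Amt) and the target layout are hypothetical, invented for illustration, and the script uses only the Python standard library.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

# Hypothetical target layout; a real script would take this from your spec.
OUTPUT_FIELDS = ["member_id", "service_date", "amount"]


def transform_row(row: dict) -> dict:
    """Rename columns, reformat the date, and normalize the amount for one row."""
    return {
        "member_id": row["MemberID"].strip(),
        # Assumes the source uses MM/DD/YYYY; output is ISO 8601.
        "service_date": datetime.strptime(row["SvcDate"], "%m/%d/%Y").date().isoformat(),
        # Decimal avoids float rounding surprises on money values.
        "amount": str(Decimal(row["Amt"]).quantize(Decimal("0.01"))),
    }


def transform(source_text: str) -> str:
    """Apply transform_row to every row of a CSV string and return CSV text."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=OUTPUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(transform_row(row))
    return out.getvalue()
```

Nothing in this script calls a model or the network. Running it twice on the same file gives the same bytes, and a malformed date raises an error instead of being silently guessed at.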

That is the whole idea: the AI writes the scripts, the scripts process your data.

What the build phase needs to see (and what it does not)

Honesty first: writing a useful script requires showing the model something. You have three options, in increasing order of caution:

A sample you choose. Upload a representative slice of the file and describe what you need. You control what goes in that sample: subset the rows, mask the identifiers, or fabricate values that keep the real structure. The model sees the sample and your instructions, and you get a Python script you can read line by line.
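
One way to prepare such a sample is sketched below: keep the first few rows and replace the identifier columns with salted-hash tokens. The function names, the "ID_" token format, and the choice of columns are assumptions for illustration. Salted hashing is pseudonymization, not anonymization, so treat the salt as a secret and review free-text columns separately.

```python
import csv
import hashlib
import io


def mask(value: str, salt: str) -> str:
    """Replace an identifier with a short token derived from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "ID_" + digest[:10]


def build_sample(source_text: str, sensitive_columns: list, max_rows: int, salt: str) -> str:
    """Return a CSV sample with at most max_rows rows and masked sensitive columns."""
    reader = csv.DictReader(io.StringIO(source_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        for column in sensitive_columns:
            row[column] = mask(row[column], salt)
        writer.writerow(row)
    return out.getvalue()
```

The same input value always maps to the same token, so duplicates and joins in the sample still behave like the real file. The tokens do not preserve the original format, though. If the script needs to parse the identifier itself, fabricated values in the real format are the better choice.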

A spec or schema. Vendor layout documents, API schemas, EDI implementation guides. The model can write transformation scripts from a precise description of the shape without seeing any real records.

A data dictionary alone. When no data can leave your environment at all, define the fields, types, formats, and rules, and build from that. No rows, real or fabricated, ever reach the model. Data Shepherd supports this directly: upload a spec and it extracts the dictionary for you, or define one by hand.
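
As a rough illustration of a dictionary-only build, the sketch below defines a small data dictionary, renders it as text that could be shared in place of rows, and uses the same dictionary to check real rows locally. The field names, the dictionary layout, and both helper functions are hypothetical and are not Data Shepherd's format.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical data dictionary: field name -> type plus a format or pattern.
DATA_DICTIONARY = {
    "member_id": {"type": "string", "pattern": r"[A-Z]\d{6}"},
    "service_date": {"type": "date", "format": "%Y-%m-%d"},
    "amount": {"type": "decimal"},
}


def describe(dictionary: dict) -> str:
    """Render the dictionary as plain text; this is all the model would see."""
    lines = []
    for name, spec in dictionary.items():
        detail = spec.get("pattern") or spec.get("format") or ""
        lines.append(f"{name}: {spec['type']} {detail}".rstrip())
    return "\n".join(lines)


def validate_row(row: dict, dictionary: dict) -> list:
    """Return a list of problems for one row; an empty list means the row conforms."""
    problems = []
    for name, spec in dictionary.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        if spec["type"] == "string" and not re.fullmatch(spec["pattern"], value):
            problems.append(f"{name}: does not match pattern")
        elif spec["type"] == "date":
            try:
                datetime.strptime(value, spec["format"])
            except ValueError:
                problems.append(f"{name}: not a valid date")
        elif spec["type"] == "decimal":
            try:
                Decimal(value)
            except InvalidOperation:
                problems.append(f"{name}: not a decimal number")
    return problems
```

The validation step runs entirely on your side. It gives you a way to confirm that the real file matches the description the script was built from, without any row leaving your environment.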

You will notice the careful claim in this post. We say "no data sent to AI at run-time," not "the AI never sees your data," because the second claim would only be true for dictionary-only builds. During the build phase, the model sees whatever you choose to share, and you should choose deliberately. A vendor that hand-waves this distinction is asking you not to think about it.

What a trustworthy run phase looks like

The build happens once. The run happens every Monday for the next three years, so the run phase is where the security properties matter. This is the checklist we built Data Shepherd's execution path around, and it is a reasonable bar for any tool you evaluate:

Sealed. The script is stored server-side. The browser can never hand a script back to be executed, so a compromised laptop or a malicious extension cannot inject code into your pipeline.

Integrity-verified. Every script is hashed with SHA-256, and the hash is checked before execution. The script that runs is provably the script that was reviewed.
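
In outline, a hash check of this kind can be as small as the sketch below. It uses Python's standard hashlib, the function names are made up for illustration, and it is not a description of Data Shepherd's internal implementation.

```python
import hashlib
import hmac


def script_hash(source: str) -> str:
    """Return the SHA-256 hex digest of a script's source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def verify_before_run(source: str, reviewed_hash: str) -> None:
    """Raise if the script about to run differs from the one that was reviewed."""
    if not hmac.compare_digest(script_hash(source), reviewed_hash):
        raise RuntimeError("script does not match the reviewed version; refusing to run")
```

The check is only as strong as the place the reviewed hash is stored. If whoever can change the script can also change the recorded hash, the comparison proves nothing.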

Security-screened. Before any script can run, a validator screens it for things a data transformation never needs: network calls, filesystem access, process spawning, dynamic code execution. AI-written scripts are still generated code, and they get screened like untrusted code.
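
A simplified sketch of static screening, using Python's standard ast module, might look like this. The blocked lists are illustrative assumptions, not a complete policy and not Data Shepherd's actual validator.

```python
import ast

# Illustrative denylists (assumptions); a real policy would be broader.
BLOCKED_MODULES = {"socket", "urllib", "http", "requests", "subprocess", "os", "shutil", "ctypes"}
BLOCKED_CALLS = {"eval", "exec", "compile", "open", "__import__"}


def screen(source: str) -> list:
    """Return a list of findings; an empty list means nothing blocked was found."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    findings.append(f"line {node.lineno}: import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                findings.append(f"line {node.lineno}: import from {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                findings.append(f"line {node.lineno}: call to {node.func.id}()")
    return findings
```

A denylist scan like this catches the obvious cases, but a determined script can get around it, for example by reaching a blocked function through getattr. Screening is one layer that works alongside runtime isolation, not a replacement for it.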

Sandboxed. Scripts execute in an isolated environment with no outbound network and enforced resource limits. A script can read its input and write its output. That is all it can do.

Transparent. You can review every line of your scripts read-only, with full version history. A script you cannot inspect is a black box with extra steps, which is the thing you were trying to escape.

No AI at run-time. The full dataset is processed by the sealed script, in infrastructure you can point your security team at. It is never sent to a model, on any run, at any size.

The honest footnotes: what happens when a run fails

A clean data-flow story should survive its edge cases, so here is the full picture for Data Shepherd, failure paths included:

| When | What reaches the AI model |
| --- | --- |
| Building or editing a transformation | Your instructions, the sample you choose to provide, and any spec or schema you upload |
| Running a saved transformation | Nothing. No data sent to AI at run-time |
| A run fails: plain-English diagnosis | The error and the transformation context, not your dataset |
| A run fails: optional auto-heal (Pro and Max) | The failing script and a small sample of the failing rows, so the model can propose a fix |

Auto-heal deserves the footnote treatment. When a saved transformation hits data it has never seen, say the vendor quietly adds a column, auto-heal sends the failing script plus a small sample of the failing rows to the model so it can propose a fix. The proposed script is still security-screened, and heals are capped per job. It is genuinely useful for pipelines that need to recover overnight, and it is optional: if you never want failing rows seen by a model, leave it off and fix failures in the build phase instead, where you control the sample.

The full data-flow table, sub-processor list, and retention schedule are published on our Trust Center.

Could you build this workflow yourself?

Yes, and if you have engineering time to spare, it is a fine answer. Ask a coding assistant to write pandas scripts from your column descriptions, review them, and run them locally. That is the same principle, and it is dramatically better than pasting records into a chat window.

What you take on is everything around the scripts: screening generated code before it runs (with full local permissions, generated code can do surprising things), versioning as vendor specs drift, scheduling, retries, an audit trail, and above all the human factor. The DIY workflow holds up right until a deadline Friday, when someone skips it and pastes the file because pasting was easier. The real product is making the safe path the easy path.

A note on compliance

If you handle regulated data, attestations matter, so here is our status stated plainly: audit logging is live today; SOC 2 Type II and HIPAA readiness with a BAA are in progress, not complete. We publish the current status, alongside the architecture described above, on the Trust Center. For a young vendor, the architecture is the substance; the paperwork is catching up to it.

The takeaway

You do not have to choose between AI speed and data control. Split the work: let the model do the thinking once, in a build phase where you decide exactly what it sees, and let sealed, verified, screened scripts do the doing every time after, with no data sent to AI at run-time.

The next time a file is due and the chatbot tab is already open, that is the moment this matters. Set up the transformation once, review the script, and run it every week after without re-exposing a single row.
