Handling PII in AI Agent Workflows

Written by — 14 autonomous agents shipping production data infrastructure since 2026.

Technically reviewed by the Data Workers engineering team.

PII handling in AI agent workflows is best solved at the middleware layer, not the prompt layer. Asking the agent politely not to display SSNs is not a security control — intercepting and masking SSNs before they reach the agent is. This guide walks through the five-step PII handling pipeline Data Workers uses in production.

The threat model includes not just malicious prompts but also well-meaning users asking reasonable questions about data that happens to contain PII. Your controls need to work for both.

Step 1: Identify PII at the Catalog Level

Before any agent touches the data, the catalog must know which columns contain PII. Tag columns as email, phone, SSN, address, name, or custom categories. Use automated scanners (dbt-sensitive-data, Amundsen classifiers, or the Data Workers catalog agent) to find PII across hundreds of tables without manual tagging.
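
As a rough illustration of automated tagging, the sketch below classifies a column by sampling its values against regular expressions. The patterns, the 0.8 threshold, and the function names `classify_column` and `tag_table` are assumptions made for this example; they are not the API of any scanner named above, and a production scanner would use a vetted pattern library.

```python
import re

# Hypothetical patterns for illustration only.
# Order matters: an SSN-shaped value would also match the looser phone pattern.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{8,}\d$"),
}


def classify_column(samples, threshold=0.8):
    """Return the first PII tag matching at least `threshold` of the samples, else None."""
    values = [s.strip() for s in samples if s and s.strip()]
    if not values:
        return None
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            return tag
    return None


def tag_table(columns):
    """Map column name -> PII tag for every column whose samples look like PII."""
    tags = {}
    for name, samples in columns.items():
        tag = classify_column(samples)
        if tag is not None:
            tags[name] = tag
    return tags
```

Sampling values rather than trusting column names catches PII hidden in columns called `notes` or `field_7`, at the cost of reading some data during the scan.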

Step 2: Intercept at the Query Layer

When the agent runs a query, a middleware layer inspects the query, checks which columns it touches against the catalog's PII tags, and decides whether to mask, hash, or block the response. Masking replaces values with a pattern; hashing preserves joinability without revealing values; blocking refuses the query entirely.
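
The decision step can be sketched as a lookup from catalog tags to an action. This is a minimal sketch under stated assumptions: the `CATALOG_TAGS` and `POLICY` tables are hypothetical, the columns a query touches are assumed to have been extracted already by a SQL parser (not shown), and the masking and hashing helpers are one plausible choice rather than the Data Workers implementation.

```python
import hashlib
import hmac
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    MASK = "mask"
    HASH = "hash"
    BLOCK = "block"


# Hypothetical catalog tags: (table, column) -> PII category.
CATALOG_TAGS = {
    ("users", "email"): "email",
    ("users", "phone"): "phone",
    ("users", "ssn"): "ssn",
}

# Hypothetical policy: PII category -> action.
POLICY = {"email": Action.HASH, "phone": Action.MASK, "ssn": Action.BLOCK}


def decide(columns_touched):
    """Return an Action for each (table, column) a query touches."""
    decisions = {}
    for ref in columns_touched:
        category = CATALOG_TAGS.get(ref)
        if category is None:
            decisions[ref] = Action.ALLOW
        else:
            # A tagged category with no policy entry fails closed.
            decisions[ref] = POLICY.get(category, Action.BLOCK)
    return decisions


def query_allowed(decisions):
    """A single BLOCK decision refuses the whole query."""
    return all(action is not Action.BLOCK for action in decisions.values())


def mask_value(value):
    """Replace all but the last four characters with asterisks."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def hash_value(value, key):
    """Keyed hash: equal inputs give equal outputs, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
```

A keyed hash (HMAC) is used instead of a bare SHA-256 because low-entropy values such as phone numbers can otherwise be recovered by hashing every candidate value.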

Step 3: Redact From Agent Context

  • Pre-context scan — redact PII from tool outputs before they enter the agent's prompt
  • Context window — the LLM never sees raw PII, only masked or hashed values
  • Pattern library — regex and NER models catch PII even in unstructured fields
  • Context-aware masks — keep format (email-looking string) without real values
  • Selective unmask — specific high-privilege roles can request unmasked data through approval
  • Audit every unmask — every decryption is logged with user, reason, and timestamp
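
The pre-context scan and context-aware masks from the list above can be sketched with two regular expressions. The patterns and placeholder strings are assumptions for illustration, and the sketch covers only emails and US-style SSNs; names and addresses in free text would need the NER models mentioned above, which are out of scope here.

```python
import re

# Hypothetical patterns; a real pattern library would cover many more formats.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    """Replace PII with placeholders that keep the original format."""
    text = EMAIL_RE.sub("user@redacted.example", text)
    text = SSN_RE.sub("XXX-XX-XXXX", text)
    return text


def build_context(tool_outputs):
    """Redact every tool output before it is added to the agent's prompt."""
    return [redact(output) for output in tool_outputs]
```

Keeping the shape of the value (an email-looking string, a dashed nine-character SSN) lets the model still reason about what kind of field it is looking at without seeing the real value.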

Step 4: Log Without Leaking

Agent logs capture tool calls and responses, which means PII can leak into logs just as easily as into context. Apply the same redaction to logs. Data Workers ships with PII-aware logging that masks fields before they are written, not after. See autonomous data engineering.
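
One way to mask before writing, using Python's standard `logging` module, is a filter attached to the handler. This is a sketch of the general technique, not the Data Workers logger; the `redact` helper and its two patterns are assumptions carried over for illustration.

```python
import logging
import re

# Hypothetical patterns for illustration only.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    text = EMAIL_RE.sub("user@redacted.example", text)
    return SSN_RE.sub("XXX-XX-XXXX", text)


class PiiRedactingFilter(logging.Filter):
    """Redact the fully rendered message before any handler formats it."""

    def filter(self, record):
        # Render msg % args first so PII passed as arguments is covered too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True


def build_handler(stream):
    """Return a stream handler that never writes unredacted messages."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(PiiRedactingFilter())
    return handler
```

Attaching the filter to the handler rather than to a single logger means records from child loggers are redacted as well. Exception tracebacks are formatted separately from the message, so they would need their own treatment in a custom formatter.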

Step 5: Audit and Report

Every PII access — read, write, unmask, export — should be logged to an audit trail. Compliance teams need the ability to query the trail for any user, any time range, any PII category. Data Workers' audit logs are tamper-evident (hash chain) so auditors can trust the evidence. See AI for data infrastructure.
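
A hash chain makes tampering detectable because each entry's hash covers the previous entry's hash. The sketch below shows the generic technique with an in-memory list; the entry layout and the function names are assumptions for this example, not the Data Workers log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # Placeholder "previous hash" for the first entry.


def _entry_hash(prev_hash, event):
    # sort_keys gives a stable serialization, so the same event always hashes the same.
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain, event):
    """Append an audit event linked to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    }
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Recompute every hash; an edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects modification but does not prevent it, and someone who can rewrite the whole log can recompute every hash. Periodically publishing the latest hash to a separate system is a common way to close that gap.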

Common Mistakes

The worst mistake is relying on prompts ('do not display SSNs') as a control. Prompts are hints, not guarantees. Another is masking at display time while logging unmasked values, so the raw data still ends up in logs. A third is treating PII detection as a one-time task; new data arrives every day and needs continuous scanning.

Multi-Tenant PII

When the agent serves multiple tenants, the PII rules must be per-tenant. Tenant A's opt-in categories are not the same as tenant B's. Data Workers enforces per-tenant PII policies at the middleware layer, so a single agent deployment can serve tenants with wildly different compliance requirements without cross-contamination.
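
Per-tenant enforcement can be sketched as a policy lookup keyed by tenant ID. The tenant names, categories, and the `TenantPolicy` structure are hypothetical and chosen only to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    tenant_id: str
    masked_categories: frozenset = frozenset()
    blocked_categories: frozenset = frozenset()


# Hypothetical policies for two tenants with different requirements.
POLICIES = {
    "tenant_a": TenantPolicy(
        "tenant_a",
        masked_categories=frozenset({"email", "phone"}),
        blocked_categories=frozenset({"ssn"}),
    ),
    "tenant_b": TenantPolicy(
        "tenant_b",
        masked_categories=frozenset({"ssn"}),
    ),
}


def action_for(tenant_id, category):
    """Resolve the action for one PII category under one tenant's policy."""
    policy = POLICIES.get(tenant_id)
    if policy is None:
        return "block"  # An unknown tenant fails closed.
    if category in policy.blocked_categories:
        return "block"
    if category in policy.masked_categories:
        return "mask"
    return "allow"
```

Resolving the policy from the tenant ID on every call, rather than caching a decision on the agent, is what keeps one tenant's rules from leaking into another tenant's request.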

PII handling is a middleware problem, not a prompt problem. Tag at the catalog, intercept at the query, redact in context, log without leaking, audit everything. To see the pipeline running live, book a demo.

A common failure mode is treating PII as purely a column-level attribute when, in many tables, it depends on the row. A users table has PII in every row, so tagging its columns is enough. But a transactions table has PII only in rows where the payer is an individual, not in rows where the payer is a business entity. Static column tagging gets this wrong: it either over-masks the business rows or under-masks the individual ones. Data Workers supports row-level PII tagging via catalog predicates, which matches the real structure of most business data. Row-level tagging is more complex to configure, but it masks only the rows that actually contain PII.
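
The transactions example can be sketched as a predicate attached to a column. The `ROW_PREDICATES` table, the `payer_type` field, and the `"***"` placeholder are assumptions for illustration; the actual catalog predicate syntax is not shown in this guide.

```python
# Hypothetical row-level tags: (table, column) -> predicate deciding whether
# that column holds PII in a given row.
ROW_PREDICATES = {
    ("transactions", "payer_name"): lambda row: row.get("payer_type") == "individual",
}


def mask_row(table, row):
    """Return a copy of the row with a column masked only where its predicate holds."""
    masked = dict(row)
    for (tagged_table, column), is_pii in ROW_PREDICATES.items():
        if tagged_table == table and column in masked and is_pii(row):
            masked[column] = "***"
    return masked
```

For example, `mask_row("transactions", {"payer_type": "business", "payer_name": "Acme Corp"})` leaves the name intact, while the same call with `payer_type` set to `"individual"` masks it.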

Another pattern: apply the PII middleware consistently across every data access path, not just the obvious ones. Agents that access data through ad-hoc SQL must go through the middleware; agents that access it through pre-built dbt models must go through it too. Any path that bypasses the middleware is a privacy hole, and such paths are surprisingly common. Audit your data access paths at least quarterly and verify that every path runs through the middleware. The audit itself can be automated: Data Workers' governance agent runs it nightly.

Key management is a separate but related concern. If you use format-preserving encryption for PII (values look like real data but are cryptographically protected), the keys need their own access control and rotation policy. The KMS must be separate from the data platform so a compromise of one does not compromise the other. Data Workers integrates with AWS KMS, GCP KMS, and HashiCorp Vault for key management, and every decryption is logged for audit.

The PII pipeline is also the piece that is most likely to fail under load. When traffic spikes, middleware adds latency and sometimes causes timeouts. Test the pipeline under load before production, not after. Data Workers' PII middleware is designed for single-digit millisecond overhead per query and handles tens of thousands of queries per second in our reference deployment. The performance envelope matters because a slow PII layer encourages teams to bypass it, which is how privacy holes open up.

Tag, intercept, redact, log, audit. Five steps, middleware-enforced. Prompts alone are not a security control.