Guide · Last updated Mar 23, 2026 · 8 min read

CLAUDE.md as Your Data Stack's Persistent Memory Layer

Project memory that compounds instead of resetting

CLAUDE.md is a markdown file that Claude Code reads at the start of every session — making it a persistent memory layer for your project. For data teams, it stores schema conventions, metric definitions, tribal knowledge, and operational rules in a format that compounds across every interaction, instead of treating Claude as a stateless assistant.

Most data engineers treat CLAUDE.md as a project README, a place to describe what the project does. That undersells it. Because Claude Code loads the file at the start of every session, it is the most direct way to carry your data stack's tribal knowledge, schema conventions, metric definitions, and operational rules into every conversation.

This matters because data engineering is a domain where context is everything. The difference between a helpful AI assistant and a dangerous one is whether it knows that revenue means net revenue post-refund, that the orders table must be filtered by is_deleted = false, and that the payments pipeline has a known latency issue every Monday morning. CLAUDE.md is where you store this context so you never have to repeat it. Data Workers' 15 agents build on this same principle -- persistent context that makes every agent interaction smarter than the last.

What CLAUDE.md Actually Does

Claude Code reads CLAUDE.md automatically at the start of every session and loads its contents into context. Unlike a wiki page that may or may not get read, the file is present in every session, so each prompt is processed with your conventions already in view.

Claude Code supports CLAUDE.md files at three levels, each with a different scope:

• Project-level (`./CLAUDE.md`). Context specific to this project. For a dbt project, this includes model naming conventions, warehouse connection details, testing requirements, and deployment procedures.
• User-level (`~/.claude/CLAUDE.md`). Context that applies across all projects for a specific user: personal preferences, common workflows, frequently used SQL patterns.
• Enterprise-level. Context that applies across all users and projects in an organization: company-wide data governance policies, security requirements, naming standards.

The layering is key. A data engineering team can define organization-wide rules (never query production tables directly, always use staging models) at the enterprise level, project-specific conventions (this dbt project uses stg_, int_, fct_, dim_ prefixes) at the project level, and individual preferences (I prefer CTEs over subqueries) at the user level.
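To make the layering concrete, here is a minimal Python sketch that lists which CLAUDE.md files exist for a given project, broadest scope first. The project and user paths come from the list above. The enterprise path varies by installation, so it is passed in as a parameter rather than assumed. The function name `memory_files` is hypothetical, and the ordering is meant for auditing what is in scope; it does not claim to reproduce how Claude Code itself combines the files.

```python
from pathlib import Path


def memory_files(project_dir, home_dir, enterprise_path=None):
    """Return (scope, path) pairs for CLAUDE.md files that exist, broadest first.

    enterprise_path is an assumption supplied by the caller; the article
    does not say where the enterprise-level file lives.
    """
    candidates = []
    if enterprise_path is not None:
        candidates.append(("enterprise", Path(enterprise_path)))
    candidates.append(("user", Path(home_dir) / ".claude" / "CLAUDE.md"))
    candidates.append(("project", Path(project_dir) / "CLAUDE.md"))
    # Keep only the files that are actually present on disk.
    return [(scope, path) for scope, path in candidates if path.is_file()]
```

Calling `memory_files(".", Path.home())` from a repository root shows at a glance which layers will contribute context to a session.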

The Five Categories of Data Engineering Context for CLAUDE.md

After working with dozens of data teams, we have identified five categories of context that belong in CLAUDE.md for data engineering projects. Each category reduces a specific type of error or inefficiency.

1. Schema Conventions and Naming Standards

This is the highest-value context category. Schema conventions determine how Claude Code names models, columns, tables, and tests. Without explicit conventions, Claude Code will use reasonable defaults -- but 'reasonable' may not match your team's standards.

Example CLAUDE.md content:
• Model naming: staging models use stg_{source}_{table}, intermediate models use int_{domain}_{description}, fact tables use fct_{domain}_{event}, dimension tables use dim_{entity}.
• Column naming: timestamps end in _at, dates end in _date, booleans start with is_ or has_, foreign keys end in _id.
• All models must define a primary key column named {table_name}_id.

This context eliminates the back-and-forth where Claude Code generates a model called staging_orders and you correct it to stg_shopify_orders. Every session starts with the right conventions already loaded.
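The same conventions can also be checked mechanically, for example in a pre-commit hook. The sketch below is one possible encoding of the example rules as regular expressions. The exact character classes (lowercase letters, digits, underscores) and the function names are assumptions, since the conventions above only specify prefixes and suffixes.

```python
import re

# Patterns transcribed from the example naming conventions.
# The allowed characters are an assumption; adjust to your team's rules.
MODEL_PATTERNS = {
    "staging": re.compile(r"^stg_[a-z0-9]+_[a-z0-9_]+$"),       # stg_{source}_{table}
    "intermediate": re.compile(r"^int_[a-z0-9]+_[a-z0-9_]+$"),  # int_{domain}_{description}
    "fact": re.compile(r"^fct_[a-z0-9]+_[a-z0-9_]+$"),          # fct_{domain}_{event}
    "dimension": re.compile(r"^dim_[a-z0-9_]+$"),               # dim_{entity}
}

COLUMN_RULES = {
    "timestamp": lambda name: name.endswith("_at"),
    "date": lambda name: name.endswith("_date"),
    "boolean": lambda name: name.startswith(("is_", "has_")),
    "foreign_key": lambda name: name.endswith("_id"),
}


def classify_model(name):
    """Return the layer a model name belongs to, or None if it matches no convention."""
    for layer, pattern in MODEL_PATTERNS.items():
        if pattern.match(name):
            return layer
    return None


def column_ok(name, kind):
    """Check a column name against the rule for its kind (a key of COLUMN_RULES)."""
    return COLUMN_RULES[kind](name)
```

With these rules, `classify_model("stg_shopify_orders")` returns `"staging"`, while `classify_model("staging_orders")` returns `None` and can be flagged for renaming.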

2. Metric Definitions and Business Logic

Metric definitions are where hallucinations cause the most damage. If Claude Code does not know your organization's definition of 'revenue,' it will guess -- and guesses about business metrics lead to wrong numbers in dashboards.

Encode your critical metrics in CLAUDE.md:
• Revenue: net revenue, post-refund, in USD. Source of truth: finance.fct_revenue. Do not use raw.orders.total_amount for revenue calculations -- it includes tax and refunded orders.
• Customer LTV: sum of net revenue over the customer's lifetime, starting from first purchase date.
• Churn: customer is churned if they have not made a purchase in 90 days. This was previously 60 days; the definition changed in January 2026.

This context grounds every query Claude Code generates. When someone asks for a revenue report, Claude Code uses the correct table, applies the correct filters, and uses the correct definition -- every time, in every session.
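A definition written this precisely translates directly into code. As an illustration, the churn rule from the example could be sketched in Python as follows. The function name and the boundary choice (a customer counts as churned once exactly 90 days have passed) are assumptions; the example definition does not say which side of the boundary day 90 falls on.

```python
from datetime import date

# From the example definition: 90 days, previously 60 before January 2026.
CHURN_WINDOW_DAYS = 90


def is_churned(last_purchase, as_of):
    """True when at least CHURN_WINDOW_DAYS have passed since the last purchase.

    Treating day 90 itself as churned is an assumption, not part of the
    example definition.
    """
    return (as_of - last_purchase).days >= CHURN_WINDOW_DAYS


# 2026-01-01 to 2026-04-01 is 31 + 28 + 31 = 90 days.
print(is_churned(date(2026, 1, 1), date(2026, 4, 1)))   # True
print(is_churned(date(2026, 1, 1), date(2026, 3, 31)))  # False
```

Writing the boundary case down in CLAUDE.md removes the last ambiguity that a generated query would otherwise have to guess at.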

3. Data Quality Rules and Known Issues

Every data stack has quirks. Tables that must be filtered a specific way. Columns with known data quality issues. Sources with latency patterns. Encoding these in CLAUDE.md prevents Claude Code from generating queries that look correct but produce wrong results.

Example known issues:
• The payments table has duplicate rows for transactions processed through the legacy system (before March 2025). Always deduplicate by transaction_id and take the most recent updated_at.
• The users table includes test accounts. Filter by is_test_user = false for any user-facing metrics.
• The events table has a 4-hour latency from the source system. Do not use it for real-time reporting; use events_realtime instead.
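The deduplication and test-account rules describe small, well-defined transformations. Here is a minimal in-memory Python sketch of both, assuming rows are dictionaries keyed by the column names from the example and that `updated_at` values are mutually comparable (ISO-8601 strings or datetimes). In a real warehouse the same logic would normally live in SQL or a dbt model; the function names are hypothetical.

```python
def deduplicate_payments(rows):
    """Keep one row per transaction_id: the one with the most recent updated_at."""
    latest = {}
    for row in rows:
        key = row["transaction_id"]
        # Replace the stored row only when this one is strictly newer.
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())


def exclude_test_users(users):
    """Drop test accounts, mirroring the filter is_test_user = false."""
    return [user for user in users if not user["is_test_user"]]
```

When two rows for the same transaction share an identical `updated_at`, this sketch keeps the first one it sees. That tie-break is an arbitrary choice the example rule does not specify, so it is worth stating explicitly in CLAUDE.md if it matters.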

4. Environment Configuration

Data engineering involves multiple environments with different access levels and purposes. CLAUDE.md should encode which environment is which and what actions are allowed in each.

Example environments:
• Development warehouse is ANALYTICS_DEV (safe to run any query).
• Staging is ANALYTICS_STAGING (safe to read, DDL requires approval).
• Production is ANALYTICS_PROD (read-only in Claude Code -- never run DDL or DML against production).
• dbt target: use dev for development, staging for PR validation, prod for production deployments only through CI/CD.
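Rules like these are instructions to Claude, not enforcement, so teams may also want a programmatic guard in front of the warehouse. The following Python sketch encodes the example policy as a lookup table. Several things here are assumptions: the function and table names, the treatment of DML in staging (the example only mentions DDL), and the classifier itself, which looks only at the first keyword and would misread a `WITH ... INSERT` statement as a read.

```python
# Policy transcribed from the example environments.
# Staging DML is not specified there; treating it like DDL is an assumption.
POLICY = {
    "ANALYTICS_DEV": {"read": "allow", "ddl": "allow", "dml": "allow"},
    "ANALYTICS_STAGING": {"read": "allow", "ddl": "needs_approval", "dml": "needs_approval"},
    "ANALYTICS_PROD": {"read": "allow", "ddl": "deny", "dml": "deny"},
}

# Naive classification by the statement's first keyword.
STATEMENT_KINDS = {
    "select": "read", "with": "read", "show": "read", "describe": "read",
    "create": "ddl", "alter": "ddl", "drop": "ddl", "truncate": "ddl",
    "insert": "dml", "update": "dml", "delete": "dml", "merge": "dml",
}


def decide(warehouse, sql):
    """Return 'allow', 'needs_approval', or 'deny' for a statement in a warehouse."""
    words = sql.strip().split(None, 1)
    kind = STATEMENT_KINDS.get(words[0].lower()) if words else None
    rules = POLICY.get(warehouse)
    # Fail closed: unknown warehouses and unrecognized statements are denied.
    if kind is None or rules is None:
        return "deny"
    return rules[kind]
```

The guard fails closed by design. Anything it cannot classify is denied, which is the safer default when the caller is an automated agent.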

5. Team Conventions and Operational Procedures

This category captures the tribal knowledge that experienced engineers carry in their heads: how to respond to common incidents, which Slack channels to notify, what the deployment procedure is, and how to handle edge cases.

Example:
• Deployment: All dbt model changes require at least one not_null and one unique test on the primary key. Models touching financial data require a second reviewer. Incremental models should use unique_key for idempotency.
• Incident response: When a pipeline fails in production, first check the Airflow logs, then check if the upstream source has a known outage (status page: status.sourcesystem.com), then check for schema changes in the source system.
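The testing requirement in the deployment rule is easy to verify automatically. Below is a small Python sketch that inspects one model entry, shaped like a parsed dbt schema file, and reports which of the two required tests are missing on the primary key. It assumes the primary key follows the {table_name}_id convention from the naming example earlier, and it reads tests from either the `tests` key or the `data_tests` key that recent dbt versions also accept. The function name is hypothetical.

```python
REQUIRED_PK_TESTS = {"not_null", "unique"}


def missing_pk_tests(model):
    """Return the required tests not declared on the model's primary key column.

    `model` is a dict shaped like one entry under `models:` in a dbt schema
    file, for example {"name": "fct_orders", "columns": [...]}.
    """
    pk_name = f"{model['name']}_id"  # assumes the {table_name}_id convention
    for column in model.get("columns", []):
        if column["name"] != pk_name:
            continue
        declared = column.get("data_tests", column.get("tests", []))
        # A test is either a plain string or a single-key dict with arguments.
        names = {t if isinstance(t, str) else next(iter(t)) for t in declared}
        return sorted(REQUIRED_PK_TESTS - names)
    # The primary key column is not declared at all, so both tests are missing.
    return sorted(REQUIRED_PK_TESTS)
```

An empty list means the model satisfies the rule. A CI step could run this over every changed model and fail the build when the list is non-empty.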

CLAUDE.md as Compounding Knowledge

The most powerful property of CLAUDE.md is that it compounds. Every time you correct Claude Code -- 'no, we use stg_ prefix, not staging_' -- you can add that correction to CLAUDE.md. Next session, the correction is automatic. Over weeks and months, CLAUDE.md evolves from a sparse set of conventions into a comprehensive knowledge base that encodes your team's entire data engineering practice.

This compounding effect is why CLAUDE.md is a persistent memory layer, not just documentation. Documentation is written once and decays. CLAUDE.md is actively maintained through daily use. When you discover a new edge case, you add it. When a convention changes, you update it. The file stays current because it is part of your daily workflow, not a separate maintenance task.
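Recording a correction can be a one-line habit if a small helper does the file editing. The sketch below is one hypothetical way to do it in Python. It assumes the file is organized under second-level markdown headings (`## Section`) with one rule per bullet; that layout is a convention chosen for this example, not something Claude Code requires. The function operates on text, so the caller decides how to read and write the file.

```python
def add_rule(memory_text, section, rule):
    """Return memory_text with `rule` added as a bullet under `## section`.

    Creates the section if it does not exist and leaves the text unchanged
    if the exact bullet is already present.
    """
    heading = f"## {section}"
    bullet = f"- {rule}"
    lines = memory_text.splitlines()
    if bullet in lines:
        return memory_text  # already recorded, nothing to do
    if heading not in lines:
        prefix = memory_text.rstrip("\n")
        separator = "\n\n" if prefix else ""
        return f"{prefix}{separator}{heading}\n\n{bullet}\n"
    start = lines.index(heading)
    # The section ends at the next second-level heading or at end of file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if lines[i].startswith("## "):
            end = i
            break
    # Step back over blank lines so the bullet joins the existing list.
    insert_at = end
    while insert_at > start + 1 and lines[insert_at - 1].strip() == "":
        insert_at -= 1
    lines.insert(insert_at, bullet)
    return "\n".join(lines) + "\n"
```

For example, after correcting a generated model name, `add_rule(text, "Naming", "Use the stg_ prefix, not staging_")` returns the updated contents, which can then be written back and committed like any other change.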

Data Workers amplifies this compounding effect. Our agents automatically discover schema conventions, metric definitions, and data quality patterns from your stack and can suggest additions to CLAUDE.md based on what they observe. The result: a persistent memory layer that grows from both human input and automated agent observation.

CLAUDE.md Across the Data Team

CLAUDE.md is most powerful when shared across a team. A project-level CLAUDE.md in your dbt repository means that every data engineer on the team gets the same conventions, the same metric definitions, and the same operational knowledge -- regardless of their experience level.

This is particularly valuable for onboarding. A new data engineer joins the team, opens Claude Code in the dbt project, and immediately has access to every convention, every business definition, and every operational procedure the team has encoded. They do not need to read through hundreds of pages of wiki documentation. They do not need to ask senior engineers basic questions. The knowledge is in CLAUDE.md, and Claude Code applies it automatically.

Start building your data stack's persistent memory today. Add a CLAUDE.md to your dbt project with schema conventions, metric definitions, and quality rules. Then connect Data Workers' 15 agents to compound that knowledge with automated observation. Book a demo to see how persistent memory transforms data engineering workflows, or read the docs for CLAUDE.md best practices.
