Engineering · 7 min read

What Martin Kleppmann's Evolutionary Compatibility Method Taught Our Schema Agent

The rolling-upgrade insight at the heart of DDIA — and how we encoded it into the agent that guards every schema change

By The Data Workers Team

If you work in data engineering long enough, you will eventually break a pipeline by changing a schema. Not because you were careless, but because you thought the change was safe — and it was, for readers on the new version. What you missed was the reader still running the old version on the other side of the cluster.

Martin Kleppmann has spent much of his career making this class of problem legible. His 2017 book Designing Data-Intensive Applications (DDIA) is one of the most-referenced books in the data engineering field, and his 2012 blog essay on schema evolution in Avro, Protocol Buffers, and Thrift is still the clearest comparative treatment of the topic that exists. He is currently an Associate Professor at the University of Cambridge, where his research focuses on local-first software and distributed systems. He does not run a data platform company and has no commercial stake in how you manage your schemas.

What Is Actually Worth Learning From His Work

Kleppmann's schema method is built on one foundational observation: in any real system, you cannot flip every producer and consumer to a new schema version at exactly the same instant. Deployments are staggered. Services restart on different schedules. Old data written under a previous schema lives in storage indefinitely. The problem is not 'how do I change the schema' — it is 'how do I change the schema while the system keeps running with mixed versions everywhere.'

From that observation, two compatibility requirements fall out, and both must hold simultaneously for a schema change to be safe:

  • Backward compatibility: newer reader code can read data written under the older schema. This protects you when you deploy new consumers before old producers are retired, and whenever a new consumer reads records that were stored under an earlier schema.
  • Forward compatibility: older reader code can safely process data written under the newer schema — typically by ignoring unknown fields. This protects you when new producers ship before all consumers have upgraded.

Most teams check one of these, not both. The Kleppmann method insists on both, because which direction matters depends on rollout order — and rollout order in distributed systems is rarely in your control.
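The two directions can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical JSON record that gains an optional `locale` field in version 2; the reader functions and field names are invented for the example and are not taken from Kleppmann's writing or from the skill itself.

```python
import json

# Hypothetical schemas: v1 has user_id and email; v2 adds an optional "locale".
DEFAULT_LOCALE = "en"  # assumed default, chosen for the example


def read_v1(payload: str) -> dict:
    """Old reader: knows only the v1 fields and ignores anything else."""
    record = json.loads(payload)
    return {"user_id": record["user_id"], "email": record["email"]}


def read_v2(payload: str) -> dict:
    """New reader: fills in a default for the field old writers never sent."""
    record = json.loads(payload)
    return {
        "user_id": record["user_id"],
        "email": record["email"],
        "locale": record.get("locale", DEFAULT_LOCALE),
    }


old_data = json.dumps({"user_id": 1, "email": "a@example.com"})
new_data = json.dumps({"user_id": 2, "email": "b@example.com", "locale": "de"})

# Backward compatibility: new reader, old data.
backward = read_v2(old_data)
# Forward compatibility: old reader, new data (the unknown field is dropped).
forward = read_v1(new_data)
```

Both calls succeed only because the new field is optional with a default. Had `locale` been required, `read_v2(old_data)` would be the call that fails.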

The second insight is that your encoding format is not neutral on this question. In Protocol Buffers and Thrift, fields are identified by integer tag numbers, not names. Renaming a field is safe at the encoding level because the binary never contained the name. Removing a field is safe only if the field was optional and its tag number is never reused. In Avro, there are no tag numbers: fields are matched by name between the writer's schema and the reader's schema, so a rename needs an alias on the reader's side, and the reader must know the exact schema each record was written with, typically through a schema registry, a version number stored with the record, or a schema embedded in the file. These are not implementation details; they determine which evolution operations are physically possible and which will silently corrupt data.
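To make the tag-number point concrete, here is a toy encoding in which a record is a list of (tag, value) pairs. It is deliberately not the real Protocol Buffers or Thrift wire format, and the schemas and field names are hypothetical; it only illustrates why a rename is harmless and a reused tag is not.

```python
# Toy encoding, NOT a real wire format: a record is a list of (tag, value) pairs.
SCHEMA_V1 = {1: "user_name", 2: "email"}
SCHEMA_V2 = {1: "display_name", 2: "email"}      # field 1 renamed, tag unchanged
SCHEMA_V3_BAD = {1: "display_name", 2: "age"}    # "email" removed, tag 2 reused


def encode(record: dict, schema: dict) -> list:
    """Replace each field name with its tag; names never reach the wire."""
    name_to_tag = {name: tag for tag, name in schema.items()}
    return [(name_to_tag[name], value) for name, value in record.items()]


def decode(pairs: list, schema: dict) -> dict:
    """Map tags back to names, skipping tags this schema does not know."""
    return {schema[tag]: value for tag, value in pairs if tag in schema}


wire = encode({"user_name": "ada", "email": "ada@example.com"}, SCHEMA_V1)

# A rename is invisible on the wire: the v2 reader decodes v1 bytes correctly.
renamed = decode(wire, SCHEMA_V2)

# A reused tag silently reinterprets old data: an email shows up as an "age".
corrupted = decode(wire, SCHEMA_V3_BAD)
```

The skip-unknown-tags branch in `decode` is also what gives this style of encoding its forward compatibility.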

The third insight, from his writing on log-based data infrastructure, is that immutability is what makes the compatibility guarantee durable. When you model data as an append-only log rather than in-place mutation, the schema version a given record was written under is a fixed fact. Readers can always be given that version alongside the record. If you allow overwriting stored data, you lose the ability to reason about version provenance — and compatibility guarantees become a social contract rather than a technical one.

As Kleppmann writes in his 2015 essay on logs for data infrastructure: 'The application only appends the data to a log... All the different representations of this data are constructed by consuming the log in sequential order.' The log does not just solve throughput — it preserves the version history that makes safe evolution tractable.
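One way to picture the log argument is the sketch below, which assumes a hypothetical in-memory log where every entry carries the schema version it was written under. The `AppendOnlyLog` class and the `UPGRADERS` table are illustrative names, not part of any real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    offset: int
    schema_version: int  # fixed at write time and never rewritten
    payload: dict


class AppendOnlyLog:
    """Hypothetical in-memory log: entries can be appended but never updated."""

    def __init__(self):
        self._entries = []

    def append(self, schema_version: int, payload: dict) -> int:
        offset = len(self._entries)
        self._entries.append(LogEntry(offset, schema_version, dict(payload)))
        return offset

    def read_from(self, offset: int = 0):
        return iter(self._entries[offset:])


# One upgrade function per historical version, each producing the latest shape.
UPGRADERS = {
    1: lambda p: {**p, "locale": "en"},  # v1 had no locale; assume a default
    2: lambda p: p,                      # v2 is the current shape
}

log = AppendOnlyLog()
log.append(1, {"user_id": 1})
log.append(2, {"user_id": 2, "locale": "de"})

# A derived view is built by consuming the log in order, using each entry's
# recorded version to pick the right upgrade path.
view = [UPGRADERS[e.schema_version](e.payload) for e in log.read_from()]
```

Because entries are never overwritten, the version tag on each one stays trustworthy, and the view can be rebuilt at any time.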

How a Method Becomes a Skill

The dw-schema agent handles schema change assessment and migration planning. Its default behavior is to check whether a proposed change is 'structurally compatible' — and structural compatibility, it turns out, is necessary but not sufficient. A column rename can pass a structural check and still break three downstream models that hard-reference the old name. A type widening can look safe and still break a consumer that casts explicitly to the narrow type.

The evolutionary-schema-compatibility skill encodes Kleppmann's dual-direction test into an eight-step procedure. It starts by snapshotting the live schema — never an assumed baseline, because schema drift between what you think is deployed and what actually is deployed produces false-safe verdicts. It then classifies the change type (additive, subtractive, type-widening, type-narrowing, constraint change), runs both backward and forward compatibility checks, and then validates against the real consumer registry — not just the structural schema. Consumer registries capture hard expectations (a NOT NULL assumption, a required field) that the schema definition itself does not always encode.
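The classification, dual-direction check, and consumer-registry validation might look roughly like the following. This is a simplified sketch, not the skill's actual implementation: the `Field` model, the `WIDENINGS` table, and the registry shape (consumer name mapped to the fields it hard-references) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    type: str
    has_default: bool = False  # readers can fill the field in when it is absent


# Assumed widening pairs (writer type, reader type); real formats define their own.
WIDENINGS = {("int", "long"), ("float", "double")}


def classify(old: dict, new: dict) -> list:
    """Label a change with the categories it falls into."""
    kinds = set()
    if new.keys() - old.keys():
        kinds.add("additive")
    if old.keys() - new.keys():
        kinds.add("subtractive")
    for name in old.keys() & new.keys():
        o, n = old[name], new[name]
        if (o.type, n.type) in WIDENINGS:
            kinds.add("type-widening")
        elif o.type != n.type:
            kinds.add("type-narrowing")  # simplification: any other type change
        if o.has_default != n.has_default:
            kinds.add("constraint change")
    return sorted(kinds)


def can_read(reader: dict, writer: dict) -> bool:
    """True if code built against `reader` can consume data written as `writer`."""
    for name, wanted in reader.items():
        written = writer.get(name)
        if written is None:
            if not wanted.has_default:
                return False  # reader needs a field the writer never sends
        elif written.type != wanted.type and (written.type, wanted.type) not in WIDENINGS:
            return False
    return True


def assess(old: dict, new: dict) -> dict:
    return {"backward": can_read(reader=new, writer=old),
            "forward": can_read(reader=old, writer=new)}


def broken_consumers(new: dict, registry: dict) -> list:
    """Consumers that hard-reference a field the new schema no longer has."""
    return sorted(c for c, fields in registry.items() if not fields.issubset(new))


old = {"id": Field("int"), "amount": Field("int")}
new = {"id": Field("int"), "amount": Field("long"),
       "currency": Field("string", has_default=True)}
kinds = classify(old, new)
verdict = assess(old, new)

renamed = {"id": Field("int"), "total": Field("int", has_default=True)}
broken = broken_consumers(renamed, {"billing_model": {"id", "amount"},
                                    "audit_log": {"id"}})
```

In this example the widening from `int` to `long` passes the backward check but fails the forward one, since an old reader expecting `int` cannot safely take a `long`. That asymmetry is the reason both directions are run.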

A key step is rollout-order assessment. For a backward-compatible-only change, readers must be deployed before writers, because only upgraded readers can handle both the old and the new data. For a forward-compatible-only change, the order reverses: old readers already tolerate the new data, so writers go first. For a change that is neither, the skill blocks until a deprecation window is agreed: the old field and new field coexist, consumers migrate to the new one, and the old field is removed only after all readers have updated. This is the canonical additive-migration pattern (adding rather than replacing), which is the operational translation of Kleppmann's core insight.
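The rollout-order decision reduces to a small truth table over the two compatibility verdicts. A minimal sketch, with a hypothetical function name and return strings:

```python
def rollout_plan(backward: bool, forward: bool) -> str:
    """Map the two compatibility verdicts to a deployment order."""
    if backward and forward:
        return "any order"
    if backward:
        # New readers understand old data, but old readers may choke on new data.
        return "readers first"
    if forward:
        # Old readers tolerate new data, but new readers may choke on old data.
        return "writers first"
    return "blocked: agree a deprecation window"


plans = {
    (b, f): rollout_plan(b, f)
    for b in (True, False)
    for f in (True, False)
}
```

Only the fully compatible case leaves the order free. Every other case either fixes the order or stops the change until old and new fields can coexist.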

One of More Than 400

The evolutionary-schema-compatibility skill is one of more than 400 method-named skills across 19 agents in the Data Workers swarm. Each skill names a method, not a person, and cites its intellectual source in a Provenance block. The goal is to encode the actual working logic — the step sequence, the decision points, the things that go wrong — rather than a summary of principles.

Kleppmann's work is particularly well-suited to this encoding because it is unusually concrete. He does not just argue that compatibility matters; he specifies which operations are safe in which encoding formats, in what order, and under what conditions. That specificity is what makes the method teachable to an agent.

A note on this post: This is independent commentary and homage. It distills publicly available writing and talks by Martin Kleppmann to illustrate a working method, and every quote is drawn from and verified against the primary sources linked above — specifically his blog essays at martin.kleppmann.com and his book Designing Data-Intensive Applications (O'Reilly, 2017). The skill it describes is named for the method, not the person, and contains no marketing claims attributed to them. Data Workers is not affiliated with, sponsored by, or endorsed by Martin Kleppmann. If you are Martin Kleppmann and would like anything adjusted or removed, email hello@dataworkers.io and we will respond promptly.
