What Jay Kreps's Log-Centric Architecture Taught Our Streaming Agent
The co-creator of Apache Kafka spent a decade arguing that an append-only log is the unifying abstraction under all distributed data systems. His method is all public — so we studied it and built a streaming skill around it.
By The Data Workers Team
In 2013, Jay Kreps published an essay on the LinkedIn Engineering blog that has since become required reading for anyone building distributed data systems. The title was deliberately understated: 'The Log: What every software engineer should know about real-time data's unifying abstraction.' The essay argued that the humble append-only log — the kind of structure that has been inside databases and operating systems for decades — was not a low-level implementation detail but the single most important organizing principle for modern data infrastructure.
Kreps and his colleagues at LinkedIn had already built Apache Kafka to solve a real problem: the data integration graph was approaching what he described as O(N²) pipelines, with every source system talking directly to every destination system via a custom connector. The log was the answer. Each system integrates once, writing to or reading from a shared, ordered, replayable log, rather than talking to every other system directly. The combinatorial explosion of pipelines collapses into a single integration point.
A year later he followed with 'Questioning the Lambda Architecture,' which extended the argument into stream processing: if your stream processing framework is good enough, you do not need two separate code paths — a batch layer and a speed layer — to produce the same result. You need one replayable log and one processing path. The maintenance burden of keeping two systems synchronized disappears.
Both essays are still online, still precise, and still worth reading in full. We did exactly that — and then built a streaming skill around the method.
What Is Actually Worth Learning
It is easy to focus on the technology — Kafka, topics, partitions, consumer groups. Those are the implementation. The method is something Kreps has articulated directly in his own writing, and three ideas do the real work.
The log as state-machine input. Kreps's core principle: 'If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.' A totally-ordered log enforces that input order across distributed replicas. Every downstream index, view, or query-optimised store is just a projection — a materialized view — of the log. If you have the log, you can rebuild any projection from scratch. This is what makes the log the source of truth rather than just a message bus.
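To make the replay principle concrete, here is a minimal in-memory sketch. The operation names (`set`, `incr`) and the `apply` and `replay` helpers are our own illustrative assumptions, not anything from Kreps's essay or from Kafka. It shows two replicas converging when they consume the same log in the same order, and diverging when the order changes.

```python
def apply(state, entry):
    """Apply one log entry to a state dict. Deterministic: no clocks, no randomness."""
    op, key, value = entry
    if op == "set":
        state[key] = value
    elif op == "incr":
        state[key] = state.get(key, 0) + value
    else:
        raise ValueError(f"unknown op: {op}")
    return state


def replay(log, state=None, start=0):
    """Fold log entries from offset `start` onward into `state`, in log order."""
    state = {} if state is None else state
    for entry in log[start:]:
        apply(state, entry)
    return state


# Hypothetical log: order matters because "set" and "incr" do not commute.
log = [("set", "x", 10), ("incr", "x", 5), ("set", "y", 1), ("incr", "x", -3)]

replica_a = replay(log)                      # rebuilt from scratch
replica_b = replay(log[:2])                  # a lagging replica that saw two entries
replica_b = replay(log, replica_b, start=2)  # catches up from its own offset

out_of_order = replay(list(reversed(log)))   # same inputs, different order

print(replica_a == replica_b)     # True
print(replica_a == out_of_order)  # False
```

The only thing a replica needs to remember is its offset; everything else can be rebuilt from the log.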
Tables and logs are dual. One of the quieter insights in the essay: 'In this sense you can see tables and events as dual: tables support data at rest and logs capture change.' This duality is practically powerful. A database changelog becomes a stream. A stream of events materializes into a table. Any system that can read a log can derive the state it needs; any system that can emit a changelog can participate in the streaming fabric. The consequence is that the log is not a special new thing sitting beside your existing systems — it is the other face of what they already contain.
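The duality can be sketched in a few lines. This is an illustrative toy under our own assumptions: the helper names `materialize` and `diff_to_changelog` are hypothetical, values are assumed never to be `None`, and `None` is used as a delete marker, similar in spirit to the tombstones used by compacted Kafka topics.

```python
def materialize(changelog):
    """Log -> table: keep the latest value per key; a None value deletes the key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def diff_to_changelog(before, after):
    """Table -> log: emit the change events that turn `before` into `after`."""
    events = [(key, None) for key in sorted(before.keys() - after.keys())]
    for key in sorted(after):
        if before.get(key) != after[key]:
            events.append((key, after[key]))
    return events


# Hypothetical changelog of user plan changes.
changelog = [("u1", "free"), ("u2", "pro"), ("u1", "pro"), ("u2", None)]
table = materialize(changelog)

# A later snapshot of the same table, turned back into change events.
later = {"u1": "team", "u3": "free"}
new_events = diff_to_changelog(table, later)

print(table)       # {'u1': 'pro'}
print(new_events)  # [('u1', 'team'), ('u3', 'free')]
```

Appending `new_events` to the original changelog and materializing again yields `later`, which is the round trip the duality promises.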
Reprocessing as a first-class deployment primitive. The Lambda Architecture, in which a separate batch layer and real-time layer produce the same result, was a widely adopted pattern when Kreps critiqued it. His diagnosis was exact: 'The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be.' His alternative: one stream processing path, backed by a replayable log with enough retention to cover the reprocessing windows you actually need. When a bug is fixed or a new downstream consumer is added, reprocessing is routine: 'start a second instance of your stream processing job that starts processing from the beginning of the retained data.' The new instance writes to its own output; once it has caught up, the application cuts over to that output and the old job is retired. No batch/speed reconciliation is required.
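A toy simulation of that reprocessing flow, assuming an in-memory list stands in for the retained log. The `Job` class, the job names, and the `abs` bug are all hypothetical; a real deployment would use separate consumer groups and separate output tables rather than Python objects.

```python
class Job:
    """A stream job with its own offset and its own output table."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.offset = 0   # next log position to read
        self.table = {}   # this job's private output

    def poll(self, log, max_records):
        """Process up to max_records entries starting at this job's offset."""
        batch = log[self.offset:self.offset + max_records]
        for key, amount in batch:
            self.table[key] = self.table.get(key, 0) + self.transform(amount)
        self.offset += len(batch)

    def lag(self, log):
        return len(log) - self.offset


# Retained log of (account, amount) events; negative amounts are refunds.
log = [("acct-1", 100), ("acct-1", -40), ("acct-2", 25)]

# The live job has a bug: abs() counts refunds as charges.
live = Job("totals-v1", abs)
live.poll(log, max_records=len(log))

# The fix is deployed as a second instance that starts from the beginning.
candidate = Job("totals-v2", lambda amount: amount)
while candidate.lag(log) > 0:
    candidate.poll(log, max_records=2)

# Cut over only once the candidate has caught up; the live offset was never touched.
serving = candidate.table if candidate.lag(log) == 0 else live.table
print(serving)  # {'acct-1': 60, 'acct-2': 25}
```

The point of the design is that the live job keeps serving, untouched, for the whole catch-up period, so a bad reprocessing run can simply be discarded.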
- The N² problem: N sources × N sinks = N² custom pipelines. The log collapses this to N integrations, one per system.
- Every downstream index is a projection of the log. Rebuild any of them by replaying from the beginning.
- Tables and logs are dual faces of the same data: one captures state, the other captures change.
- Reprocessing should be a deployment step, not an incident. Retain enough log to make it routine.
- A single code path beats two synchronized code paths. If your stream processor is good enough, drop the batch layer.
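The arithmetic behind the first point is easy to check. The sketch below assumes every system acts as both a source and a sink and excludes self-pairs, so the direct count is N(N-1), which grows on the order of N². The system names are made up for illustration.

```python
from itertools import permutations


def direct_pipelines(systems):
    """One custom pipeline per ordered (source, sink) pair of distinct systems."""
    return list(permutations(systems, 2))


def log_integrations(systems):
    """One integration per system: each connects only to the shared log."""
    return [(system, "log") for system in systems]


# Hypothetical systems, each both producing and consuming data.
systems = ["orders-db", "search-index", "warehouse", "cache", "metrics", "recommendations"]

print(len(direct_pipelines(systems)))  # 30
print(len(log_integrations(systems)))  # 6
```

Adding a seventh system adds twelve point-to-point pipelines but only one log integration.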
How a Method Becomes a Skill
The distillation approach we use was inspired by Mimeo, the open-source project from K-Dense AI: turn a body of public expertise into a SKILL.md. The non-negotiable rule is provenance — every principle in the skill traces to something the practitioner actually published, with the quote verified against the source. No paraphrase-drift. No invented method. And the skill is named for the method, not the person.
For Kreps, both primary sources are still live and citable: the LinkedIn Engineering essay and the O'Reilly Radar post. Every quote in the skill below was fetched from those URLs and confirmed word-for-word before being included.
The skill we built, log-centric-streaming, runs in our streaming agent and encodes the method as a procedure with explicit decision points:
- Map the integration graph first. If the connection count is approaching N², that is the signal to centralise around a log: each system integrates once, not to every other system directly.
- Partition by how consumers need to order events, not by convenience. The log's value depends on its ordering guarantees holding for the consumers that need them.
- Treat every downstream store as a projection of the log. If a sink is receiving a direct database extract instead of a topic subscription, that is an N² pipeline risk. Flag it.
- Reprocessing defaults to an isolated consumer group, never a reset of the live consumer's offsets. Parallel catch-up, then cut over after validation.
- If two consumers receive the same logical event via different code paths (one from the log, one from a batch export), that is the Lambda Architecture dual-path problem. Flag it and recommend converging.
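The partitioning rule in the list above can be illustrated with a small sketch: events that must be ordered relative to each other share a key, and the key decides the partition. The `partition_for` and `route` helpers are hypothetical, and CRC32 is only a stand-in for a stable hash; it is not the partitioner Kafka clients actually use.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, key_fn, num_partitions):
    """Append each event to the partition chosen by its ordering key."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(key_fn(event), num_partitions)].append(event)
    return partitions


# Hypothetical order events; consumers need per-order ordering, so key by order id.
events = [
    ("order-7", "created"),
    ("order-9", "created"),
    ("order-7", "paid"),
    ("order-9", "cancelled"),
    ("order-7", "shipped"),
]

partitions = route(events, key_fn=lambda event: event[0], num_partitions=3)

for index, partition in enumerate(partitions):
    print(index, partition)
```

Whichever partition an order lands in, all of its events land there together and in their original order. Keying by something else, such as a timestamp bucket, would scatter one order's events across partitions and lose that guarantee.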
The decision points are where the method's judgment lives. If retention has expired for the requested reprocessing window, the skill stops rather than replaying a partial window and calling it complete. If a sink's idempotency is unknown, the skill requires explicit confirmation before any offset reset touches a live consumer. These are the failure modes Kreps's writing anticipates, made explicit and executable.
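As a rough sketch of how those two stop conditions might be expressed, here is a pure decision function. Every name in it (`ReprocessRequest`, `Decision`, `decide`, and the field names) is a hypothetical illustration, not the skill's actual code, and the timestamps are arbitrary example values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    STOP_RETENTION_EXPIRED = "stop: requested window starts before retained data"
    NEEDS_CONFIRMATION = "pause: sink idempotency unknown, confirm before resetting live offsets"
    PROCEED = "proceed"


@dataclass(frozen=True)
class ReprocessRequest:
    window_start_ms: int              # earliest event time the caller wants replayed
    earliest_retained_ms: int         # earliest event time still in the log
    sink_idempotent: Optional[bool]   # None means nobody has verified it
    resets_live_offsets: bool = False # default path is an isolated consumer group
    confirmed: bool = False           # explicit human confirmation


def decide(req):
    """Return the first decision point that applies, checked in order of severity."""
    if req.window_start_ms < req.earliest_retained_ms:
        # A partial replay must never be reported as complete.
        return Decision.STOP_RETENTION_EXPIRED
    if req.resets_live_offsets and req.sink_idempotent is None and not req.confirmed:
        return Decision.NEEDS_CONFIRMATION
    return Decision.PROCEED


expired = ReprocessRequest(window_start_ms=1_000, earliest_retained_ms=5_000, sink_idempotent=True)
risky = ReprocessRequest(10_000, 5_000, sink_idempotent=None, resets_live_offsets=True)
isolated = ReprocessRequest(10_000, 5_000, sink_idempotent=None)

print(decide(expired).name)   # STOP_RETENTION_EXPIRED
print(decide(risky).name)     # NEEDS_CONFIRMATION
print(decide(isolated).name)  # PROCEED
```

Keeping the gate as a pure function of the request makes each decision point reviewable and testable on its own, separate from any tool calls.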
Why a Skill, Not a Clever Prompt
You could paste a summary of the log-centric method into a chat window. It would help occasionally and evaporate at the end of the session. A skill is durable in a way a prompt is not: it has explicit triggers, so the agent reaches for it on the right kind of question; it is wired to specific tools — configure_stream, get_streaming_lineage, monitor_lag, get_stream_health — so 'check the integration graph' means actually calling the lineage API, not hand-waving; and it declares its handoffs, so when retention has expired the work routes to the connector agent rather than silently replaying a partial window. It is version-controlled, reviewable, and composable with the rest of the swarm.
One of More Than 400
This is not a one-off. The log-centric-streaming skill is one of more than 400 skills we have authored across 19 specialized agents — covering connectors, catalog and context, cost, governance, incidents, analytics, migration, ML, observability, orchestration, pipelines, quality, schema, search, streaming, and usage intelligence. Some are built from first principles. Some, like this one, are distilled from the public work of the best practitioners in the field. All of them are version-controlled, validated against the tools they call, and ready for an agent to run.
The reason we build this way is the same reason Kreps's essays are still worth reading twelve years after publication. The hard, durable value in data infrastructure is not any single pipeline — it is the architectural instinct for what belongs in a log, what belongs in a table, and when reprocessing is the right answer. Most of that instinct is trapped in individual heads and lost on every team reorganization. A skill library is a bet that you can capture a published method, name it, verify it, and hand it to an agent that does not forget it.
Primary sources: 'The Log' — https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying | 'Questioning the Lambda Architecture' — https://www.oreilly.com/radar/questioning-the-lambda-architecture/
The distillation approach we drew on: github.com/K-Dense-AI/mimeo
A note on this post: This is independent commentary and homage. It distills publicly available writing and talks by Jay Kreps to illustrate a working method, and every quote is drawn from and verified against the primary sources linked above. The skill it describes is named for the method, not the person, and contains no marketing claims attributed to them. Data Workers is not affiliated with, sponsored by, or endorsed by Jay Kreps. If you are Jay Kreps and would like anything adjusted or removed, email hello@dataworkers.io and we will respond promptly.