What John Allspaw's Blameless Postmortem Method Taught Our Incident Analysis Agent
How a decade-old insight from Etsy's engineering culture — ask 'how?' not 'why?' — changed the way we built the dw-incidents agent's analysis logic
By The Data Workers Team
In May 2012, John Allspaw published a post on Etsy's engineering blog titled 'Blameless PostMortems and a Just Culture.' More than a decade later, it remains one of the most linked-to pieces in site reliability engineering. That staying power is not nostalgia. The post endures because its core idea is still violated every day: by templates, by tools, and now by AI agents.
Allspaw is the cofounder of Adaptive Capacity Labs and former CTO of Etsy. He holds an MSc in Human Factors and Systems Safety from Lund University. His work sits at the intersection of cognitive systems engineering, resilience engineering, and software operations, a rare combination that makes his writing both theoretically rigorous and immediately applicable.
When we built the dw-incidents agent, we had a choice: build a postmortem template-filler, or build something that actually learns. Allspaw's body of work made that choice obvious. The hard part was figuring out what 'actually learns' means in code.
What Is Actually Worth Learning
Allspaw's method rests on a small number of ideas that do heavy lifting. They are deceptively simple to state and surprisingly hard to operationalize.
The first is that causes are constructed, not found. From 'The Infinite Hows' (kitchensoap.com, 2014): 'Cause is something we construct, not find. And how we construct causes depends on the accident model that we believe in.' This is not a philosophical point — it is a practical warning. If your accident model says 'find the broken part,' you will always find a broken part, usually a person. If your model says 'understand the conditions,' you find conditions you can actually improve.
The second is local rationality. Allspaw's formulation: 'we want to know how it made sense for someone to do what they did. And make no mistake: they thought what they were doing made sense.' This is the analytical move that separates learning from blame. You are not asking whether the action was correct in hindsight. You are asking what the world looked like from inside the decision, with the information actually available.
The third is the shift from First Stories to Second Stories. First Stories treat human error as the cause of failure. Second Stories treat human error as the effect of systemic vulnerabilities. The same incident, two entirely different documents — and only one of them produces action items that make the system safer.
The fourth is the question swap. Allspaw is precise: 'Asking why? too easily gets you to an answer to the question who? which in almost every case is irrelevant.' Asking 'how?' instead — 'how did it make sense to do this?' — surfaces the conditions that allowed the event to take place. It keeps the analysis in the system, not the individual.
The fifth, from his 2021 post 'Understanding Incidents: Three Analytical Traps,' is a checklist of ways postmortems quietly reintroduce blame: counterfactual framing ('did not initiate'), normative language ('mismanaged,' 'insufficient'), and mechanistic reasoning (declaring a single broken component). Each one closes down learning while appearing to do analysis.
How a Method Becomes a Skill
Encoding this into a swarm agent required making several implicit moves in Allspaw's method explicit enough to be executed consistently.
The first was ordering. The dw-incidents agent now uses diagnose_incident and get_incident_history before calling get_root_cause — not after. The sequence matters. If you fetch the technical root cause first, every subsequent analysis unconsciously anchors to it. Multiple perspectives become rationalizations of a conclusion already reached. Allspaw's method demands breadth before depth.
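The ordering constraint can be sketched as a small guard around tool calls. This is an illustration only: the tool names come from the description above, but the `OrderedToolRunner` class, the single `incident_id` argument, and the return values are assumptions, not the agent's actual interface.

```python
from typing import Any, Callable, Dict, List

# Tool names are taken from the post; their signatures are assumed.
PERSPECTIVE_TOOLS = ("diagnose_incident", "get_incident_history")
ROOT_CAUSE_TOOL = "get_root_cause"


class OrderedToolRunner:
    """Hypothetical wrapper that refuses get_root_cause until breadth is gathered."""

    def __init__(self, tools: Dict[str, Callable[[str], Any]]) -> None:
        self._tools = tools
        self.calls: List[str] = []  # names of tools that have completed

    def call(self, name: str, incident_id: str) -> Any:
        if name == ROOT_CAUSE_TOOL:
            missing = [t for t in PERSPECTIVE_TOOLS if t not in self.calls]
            if missing:
                # Fail loudly instead of letting the analysis anchor early.
                raise RuntimeError(f"run {missing} before {ROOT_CAUSE_TOOL}")
        result = self._tools[name](incident_id)
        self.calls.append(name)
        return result
```

Raising an error, rather than silently reordering the calls, keeps the sequence visible in the agent's trace, which is one reasonable way to make the constraint auditable.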
The second was a language filter. Any draft postmortem text that contains counterfactual framing, normative adjectives, mechanistic single-cause claims, or person-first causal claims gets flagged and rewritten. The filter is not cosmetic. It is the mechanism that keeps the Second Story intact from first draft to final document.
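The flagging half of such a filter might look like the sketch below. The phrase lists are illustrative guesses seeded from the examples quoted earlier ('did not initiate,' 'mismanaged,' 'insufficient'); the agent's real patterns are not published, and the rewrite step is left out entirely.

```python
import re
from typing import Dict, List, Tuple

# Illustrative patterns only; these word lists are assumptions.
TRAP_PATTERNS: Dict[str, re.Pattern] = {
    "counterfactual": re.compile(
        r"\b(did not|didn't|failed to|should have|could have)\b", re.IGNORECASE
    ),
    "normative": re.compile(
        r"\b(mismanaged|insufficient|inadequate|careless)\b", re.IGNORECASE
    ),
    "mechanistic": re.compile(
        r"\b(the root cause was|was caused by)\b", re.IGNORECASE
    ),
}


def flag_traps(text: str) -> List[Tuple[str, str]]:
    """Return (trap name, matched phrase) pairs, grouped by trap."""
    hits: List[Tuple[str, str]] = []
    for trap, pattern in TRAP_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((trap, match.group(0)))
    return hits
```

A sentence such as "The engineer did not initiate failover and mismanaged the rollout" would be flagged twice, while "Failover began at 14:02 after the replica lag alert fired" passes untouched. Keyword matching is a blunt instrument, so a production filter would likely need something more context-aware.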
The third was the local rationality prompt. For each decision point in the timeline, the agent constructs a context snapshot: what signals were visible via monitor_metrics at that exact moment, what the runbook said (if anything), what the escalation state was. Then it asks: given these conditions, what would a reasonable engineer with this knowledge have done? The answer almost never matches the post-hoc narrative.
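A context snapshot of this kind can be sketched with two small data classes. Everything here is hypothetical: the `Signal` shape stands in for whatever monitor_metrics actually returns, and `snapshot_at` and `rationality_prompt` are names invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Signal:
    """One metric reading; an assumed shape for monitor_metrics output."""
    name: str
    observed_at: datetime
    value: float


@dataclass(frozen=True)
class ContextSnapshot:
    decision_at: datetime
    visible_signals: List[Signal]
    runbook_step: Optional[str]
    escalation_state: str


def snapshot_at(decision_at: datetime, signals: List[Signal],
                runbook_step: Optional[str], escalation_state: str) -> ContextSnapshot:
    # Drop anything observed after the decision: hindsight is not context.
    visible = [s for s in signals if s.observed_at <= decision_at]
    return ContextSnapshot(decision_at, visible, runbook_step, escalation_state)


def rationality_prompt(snapshot: ContextSnapshot) -> str:
    signals = ", ".join(
        f"{s.name}={s.value}" for s in snapshot.visible_signals
    ) or "none"
    runbook = snapshot.runbook_step or "no runbook guidance"
    return (
        f"At {snapshot.decision_at.isoformat()} the visible signals were: {signals}. "
        f"Runbook: {runbook}. Escalation state: {snapshot.escalation_state}. "
        "Given these conditions, what would a reasonable engineer "
        "with this knowledge have done?"
    )
```

The time filter is the important design choice: a signal that arrived five minutes after the decision never reaches the prompt, so the question is answered from inside the moment rather than from the finished timeline.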
The fourth was audience framing. Allspaw's point that a postmortem serves three distinct audiences — those absent, those present, and future engineers not yet at the company — changed how the agent structures its output. Context-richness is not optional padding. It is the mechanism by which institutional learning actually travels.
- Perspectives before root cause: diagnose_incident runs before get_root_cause to prevent anchoring
- Local rationality reconstruction: monitor_metrics surfaces what was visible at decision time, not in retrospect
- Three-trap language filter: counterfactual, normative, and mechanistic language is flagged and rewritten
- Three-audience output: postmortem structure is designed for the absent, the present, and the future
One of More Than 400
The full Data Workers swarm of 19 agents carries more than 400 method-named skills, and the dw-incidents agent's blameless postmortem skill is one of them. Each skill is distilled from a practitioner's public method: a framework, a paper, a body of writing that has proven its value in the field. The skill is named for the method, not the person, because the goal is to institutionalize the practice: to make it available to any agent invocation, at any hour, without requiring the analyst to have read everything Allspaw has written.
That is the aspiration behind this approach to skill-building: take what the best practitioners have figured out in public, verify it against primary sources, and encode it precisely enough that an agent can execute it consistently.
A note on this post: This is independent commentary and homage. It distills publicly available writing and talks by John Allspaw to illustrate a working method, and every quote is drawn from and verified against the primary sources linked above. The skill it describes is named for the method, not the person, and contains no marketing claims attributed to them. Data Workers is not affiliated with, sponsored by, or endorsed by John Allspaw. If you are John Allspaw and would like anything adjusted or removed, email hello@dataworkers.io and we will respond promptly.