Most organizations we audit do not lack postmortems. They have a template, a meeting, and a folder of well-written documents - and the same incidents keep happening. The program produces artifacts instead of change, everyone senses it, and attendance quietly decays.
We have rebuilt this loop for our own team and for clients. The fixes are unglamorous mechanics, not cultural slogans.
Blameless does not mean consequence-free
"Blameless" is the most misunderstood word in reliability. It does not mean nobody is accountable; it means the analysis assumes people acted reasonably on the information they had, so the interesting question becomes why the system made the wrong action look right.
The tell of a broken program is a root cause written as "engineer ran the script against production". That sentence describes a trigger, not a cause. The causal questions are: why did the script accept a production target without confirmation, why did credentials for production work from that context at all, and why did nothing visually distinguish the two environments. A person-shaped root cause always hides a system-shaped one underneath - and person-shaped fixes ("be more careful", "added a checklist item") have a half-life measured in weeks.
The litmus test we apply to every draft: would this fix survive the author leaving the company? Training and vigilance leave with people. Guardrails stay.
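To make the difference between vigilance and a guardrail concrete, here is a minimal sketch of the first causal question above: a script that refuses a production target unless the operator retypes it. The names (`PROTECTED_TARGETS`, `check_target`, `TargetNotConfirmed`) are hypothetical, and a real tool would wire this into its own argument parsing and credential handling.

```python
from typing import Optional

# Assumption: environment names are plain strings; adjust to your own naming.
PROTECTED_TARGETS = {"production"}


class TargetNotConfirmed(Exception):
    """Raised when a protected target is used without explicit confirmation."""


def check_target(target: str, confirmation: Optional[str] = None) -> str:
    """Return the target if it is safe to proceed, otherwise raise.

    Unprotected targets pass straight through. A protected target requires
    the caller to repeat the target name exactly, so a stale default or a
    copied command line does not reach production by accident.
    """
    if target not in PROTECTED_TARGETS:
        return target
    if confirmation != target:
        raise TargetNotConfirmed(
            f"refusing to run against {target!r}: "
            f"repeat the target name as confirmation to proceed"
        )
    return target
```

Calling `check_target("staging")` returns immediately, while `check_target("production")` raises until the caller passes `confirmation="production"`. The check lives in the script, so it passes the litmus test: it is still there after its author has left.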
Action items are where programs go to die
Every failed postmortem program fails the same way: action items are created in the meeting, assigned to nobody in particular, and evaporate. In one client's backlog we found over two hundred open postmortem action items, the oldest three years old, including two that would have prevented the incident that triggered our engagement.
The mechanics that fixed it:
- Every action item has one named owner and a due date at creation time. "The platform team" is not an owner. If nobody will own it, the honest move is to say so in the document and consciously accept the risk - silence is how risk gets accepted by accident.
- Action items live in the normal work tracker, not in the postmortem document. Work that is not in the backlog does not exist.
- A monthly review walks every open item. Not a metrics dashboard - a meeting where each overdue item is rescheduled with a reason, re-scoped, or explicitly cancelled by its owner in front of peers. This meeting is a forcing function, and skipping it for one quarter reliably undoes the program.
- Items expire. Anything open past two review cycles is either promoted to a planned project or closed as accepted-risk, with the acceptance recorded. A backlog of stale intentions is not safety work; it is a place to hide from decisions.
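The rules in the list above are simple enough to encode, which helps when the monthly review is prepared from a tracker export. The following is a rough sketch, not a description of any particular tracker: the `ActionItem` fields, the `review_outcome` function, and the outcome labels are all hypothetical, and `cycles_open` is assumed to count the monthly reviews an item has already been through.

```python
from dataclasses import dataclass
from datetime import date

# From the text: anything open past two review cycles must be decided on.
MAX_REVIEW_CYCLES = 2


@dataclass
class ActionItem:
    title: str
    owner: str            # one named person; code cannot tell a team from a person
    due: date             # required at creation time
    cycles_open: int = 0  # assumption: number of monthly reviews already survived

    def __post_init__(self) -> None:
        # An empty owner is rejected up front rather than discovered later.
        if not self.owner.strip():
            raise ValueError("action item needs one named owner at creation")


def review_outcome(item: ActionItem, today: date) -> str:
    """Return the decision the monthly review has to force for this item."""
    if item.cycles_open > MAX_REVIEW_CYCLES:
        # Expired: promote to a planned project or close as accepted risk.
        return "promote-or-accept-risk"
    if today > item.due:
        # Overdue: reschedule with a reason, re-scope, or cancel.
        return "reschedule-rescope-or-cancel"
    return "on-track"
```

The function deliberately returns the decision that is owed rather than making it. The point of the review is that the owner makes that call in front of peers.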
Measure repeats, not documents
Postmortem count measures activity. The metric that measures the program is the repeat rate: what fraction of incidents share a causal class with an earlier one. We tag every incident against a small taxonomy - config change without validation, expired credential, untested failover, dependency exhaustion - and review the distribution quarterly.
When we started tagging on one client engagement, 40% of incidents were repeats of a known class. A year of the mechanics above brought it under 15%. Nothing else we track correlates as directly with "the organization actually learns".
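For readers who want to compute the same number, one plausible reading of the repeat rate is sketched below: walk the incidents in chronological order and count each one whose causal class has already appeared. The function name `repeat_rate` and the tag strings are illustrative, and the sample data is made up for the example rather than taken from any engagement.

```python
from typing import Iterable


def repeat_rate(causal_classes: Iterable[str]) -> float:
    """Fraction of incidents whose causal class was seen in an earlier incident.

    Expects one class tag per incident, in chronological order.
    Returns 0.0 for an empty input.
    """
    seen = set()
    repeats = 0
    total = 0
    for causal_class in causal_classes:
        total += 1
        if causal_class in seen:
            repeats += 1
        seen.add(causal_class)
    return repeats / total if total else 0.0


# Made-up sample: the third incident repeats the class of the first.
sample = [
    "expired-credential",
    "untested-failover",
    "expired-credential",
    "dependency-exhaustion",
]
print(repeat_rate(sample))  # 0.25
```

Two choices here are assumptions worth checking against your own definition: the first incident of a class never counts as a repeat, and the history is unbounded, so a class seen years ago still counts. A quarterly review may prefer a rolling window.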
Keep the review meeting about the system
Two rules keep the meeting itself healthy. The timeline is agreed in writing before the meeting, so the room spends its time on causes rather than reconstructing chronology from memory. And the incident responder does not chair their own review - a peer facilitates, which keeps the session from sliding into self-defense or, just as unhelpfully, self-flagellation.
A postmortem program is not a documentation habit. It is a control loop: incident, analysis, change, verification. If the output is documents, the loop is open - and an open loop, however well-written, learns nothing.