A client had a production outage caused by a developer deploying a configuration change that hadn't been tested in staging. The post-mortem conclusion: "Developer X should have tested in staging first." Action item: "Remind all developers to test in staging."

Six weeks later, a different developer deployed a different configuration change without testing in staging. Same failure mode, different person. The "fix" — reminding humans to follow a process — failed because it relied on human memory and discipline instead of system design.

The right conclusion wasn't "Developer X made a mistake." It was "our deployment system allows configuration changes to bypass staging." The right action item wasn't "remind people." It was "add a CI pipeline gate that blocks production deployments of configuration changes that haven't been deployed to staging first."
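As a rough illustration of that kind of gate, here is a minimal Python sketch that could run as a required CI job before a production deploy. It assumes a hypothetical internal deployments API (DEPLOY_API_URL) that records which commit SHAs have reached each environment; the endpoint, response fields, and environment variables are illustrative, not an existing tool.

```python
#!/usr/bin/env python3
"""CI gate: block a production deploy of a config change that never reached staging.

Sketch only. Assumes a hypothetical internal deployments API (DEPLOY_API_URL)
that records which commit SHAs have been deployed to each environment.
"""
import json
import os
import sys
import urllib.request

DEPLOY_API_URL = os.environ["DEPLOY_API_URL"]   # e.g. an internal deployment-tracking service
CHANGE_SHA = os.environ["GIT_COMMIT"]           # SHA of the config change being deployed

def deployed_shas(environment: str) -> set:
    """Ask the deployments API which SHAs have reached the given environment."""
    url = f"{DEPLOY_API_URL}/deployments?env={environment}"
    with urllib.request.urlopen(url) as resp:
        return {d["sha"] for d in json.load(resp)}

def main() -> int:
    if CHANGE_SHA in deployed_shas("staging"):
        print(f"OK: {CHANGE_SHA} was deployed to staging; production deploy allowed.")
        return 0
    print(f"BLOCKED: {CHANGE_SHA} has not been deployed to staging first.", file=sys.stderr)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```

Make the check a required job in the production deploy pipeline so the deploy proceeds only when it exits 0.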

The Blameless Principle

Blameless doesn't mean "no accountability." It means asking "why did the system allow this?" instead of "who messed up?" Every production incident has a human action at the center. But the human action only caused an incident because the system didn't have adequate guardrails.

A developer deployed broken code? Why did the deployment pipeline allow broken code to reach production? A DBA ran a migration that locked the database? Why was it possible to run a locking migration during business hours without a warning? An engineer misconfigured a load balancer? Why was manual load balancer configuration possible in the first place?

When you trace every incident back to a system gap instead of a human gap, you get action items that actually prevent recurrence. Humans make mistakes at a roughly constant rate. Systems can be improved to prevent those mistakes from causing incidents.

The 48-Hour Rule

Run the review within 48 hours of the incident. Not next sprint. Not during the quarterly retrospective. Within 48 hours, while the details are fresh in everyone's memory and the emotional weight of the incident drives urgency to fix things.

The meeting should include: everyone who was involved in the incident response (on-call engineer, the person who made the change, the person who detected the issue), the engineering manager or team lead, and anyone from other teams who was impacted (customer support, sales, leadership — they add context about business impact that the engineering team often doesn't see).

The Review Template

I use a simple template that forces the conversation toward action:

Timeline. What happened, minute by minute, from the first sign of trouble to resolution. This isn't about blame — it's about understanding the sequence of events so you can identify where intervention would have been most effective.

Detection. How did we learn about the incident? If the answer is "a customer emailed support," that's a monitoring gap. If it's "our alerting fired within 2 minutes," that's a strength.

Response. What did the response team do? What worked well? What took longer than it should have? Were there missing runbooks, unclear escalation paths, or tools that didn't work?

Root cause analysis. Not "who caused it" but "what conditions allowed this to happen." Use the "5 Whys" technique: Why did the deployment fail? → Because the migration locked the table. → Why did the migration lock the table? → Because it was an ALTER TABLE on a large table. → Why was a locking migration run during business hours? → Because there's no deployment window policy. → Why is there no deployment window policy? → Because we've never had a migration large enough to cause problems before. → Action item: implement a deployment window policy for schema changes that affect tables over 1M rows.
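To show what that action item could turn into, here is a hedged sketch of a pre-migration gate, assuming PostgreSQL (using the planner's reltuples estimate rather than a full count), the psycopg2 driver, and a 9-to-6 business-hours window; the 1M-row threshold and the regex-based table extraction are simplifying assumptions, not a prescribed policy.

```python
"""Pre-migration gate: block ALTER TABLE on large tables during business hours.

Sketch only, assuming PostgreSQL and psycopg2; the row threshold and the
business-hours window are illustrative assumptions.
"""
import os
import re
import sys
from datetime import datetime, time

import psycopg2

LARGE_TABLE_ROWS = 1_000_000
BUSINESS_HOURS = (time(9, 0), time(18, 0))  # locking schema changes blocked in this window

def estimated_rows(conn, table: str) -> int:
    """Planner's row estimate for the table (cheap; no full COUNT(*))."""
    with conn.cursor() as cur:
        cur.execute("SELECT reltuples::bigint FROM pg_class WHERE relname = %s", (table,))
        row = cur.fetchone()
        return int(row[0]) if row else 0

def check_migration(conn, migration_sql: str) -> None:
    """Fail the deploy if a locking change on a large table would run in business hours."""
    now = datetime.now().time()
    if not (BUSINESS_HOURS[0] <= now <= BUSINESS_HOURS[1]):
        print("Outside business hours; migration allowed.")
        return
    for match in re.finditer(r"ALTER\s+TABLE\s+(\w+)", migration_sql, re.IGNORECASE):
        table = match.group(1)
        rows = estimated_rows(conn, table)
        if rows >= LARGE_TABLE_ROWS:
            sys.exit(f"BLOCKED: ALTER TABLE on {table} (~{rows} rows) during business hours. "
                     "Run this inside the scheduled deployment window.")
    print("Migration check passed.")

if __name__ == "__main__":
    # Usage sketch: python check_migration.py migration.sql
    connection = psycopg2.connect(os.environ["DATABASE_URL"])
    check_migration(connection, open(sys.argv[1]).read())
```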

Action items. 3-5 concrete, specific items with an owner and a deadline. Not "improve monitoring" — instead "add a database lock duration alert that fires when any query holds a lock for more than 30 seconds (Owner: Sarah, Due: March 28)."
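For the lock duration alert in that example, a minimal polling sketch follows, assuming PostgreSQL (joining pg_locks to pg_stat_activity, with query_start as an approximation of how long a lock has been held) and a hypothetical send_alert() notifier; in practice this would more likely live as a rule in your monitoring system, but the query logic would be similar.

```python
"""Alert when any session has held a granted lock for more than 30 seconds.

Sketch only, assuming PostgreSQL and psycopg2; send_alert() is a hypothetical
notifier, and query_start approximates how long the lock has been held.
"""
import time

import psycopg2

LOCK_THRESHOLD_SECONDS = 30
POLL_INTERVAL_SECONDS = 10

LONG_LOCK_QUERY = """
SELECT DISTINCT a.pid, a.query, now() - a.query_start AS held_for
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.granted
  AND a.query_start IS NOT NULL
  AND now() - a.query_start > %s * interval '1 second'
"""

def send_alert(message: str) -> None:
    """Hypothetical notifier; wire this to PagerDuty, Slack, etc."""
    print(f"ALERT: {message}")

def poll(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # keep now() and pg_stat_activity fresh between polls
    while True:
        with conn.cursor() as cur:
            cur.execute(LONG_LOCK_QUERY, (LOCK_THRESHOLD_SECONDS,))
            for pid, query, held_for in cur.fetchall():
                send_alert(f"pid {pid} has held a lock for {held_for}: {query[:120]}")
        time.sleep(POLL_INTERVAL_SECONDS)
```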

The 3-5 Action Item Limit

The biggest mistake in post-incident reviews is generating 15 action items that never get implemented. By the next sprint, the urgency has faded, and the list sits in a document nobody reads.

Limit yourself to 3-5 action items per incident. Prioritize the ones that would have either prevented the incident entirely or reduced the time to detection and recovery. Everything else goes on a "future improvements" list that gets reviewed quarterly.

Track action item completion. If your team consistently generates action items and doesn't complete them, the review process is theater. Either reduce the number of items to something achievable, or allocate explicit sprint capacity for incident prevention work.
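One lightweight way to track completion, assuming action items are filed as issues in a single GitHub repo under a label such as incident-action-item (the repo and label here are hypothetical conventions), is a small report script run each month:

```python
"""Report completion rate for incident action items tracked as GitHub issues.

Sketch only; assumes action items live in one repo under the hypothetical
label 'incident-action-item'. Unauthenticated GitHub API calls are rate-limited.
"""
import json
import urllib.request

REPO = "your-org/your-repo"        # hypothetical repo
LABEL = "incident-action-item"     # hypothetical label convention

def count(state: str) -> int:
    """Count labeled issues in the given state via the GitHub search API."""
    url = ("https://api.github.com/search/issues"
           f"?q=repo:{REPO}+label:{LABEL}+state:{state}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["total_count"]

open_items = count("open")
closed_items = count("closed")
total = open_items + closed_items
if total:
    print(f"{closed_items}/{total} incident action items completed "
          f"({100 * closed_items / total:.0f}%); {open_items} still open.")
else:
    print("No incident action items found.")
```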

Publish Internally

The learning from a post-incident review should extend beyond the incident response team. Publish a readable summary with the key learnings and action items (not the full timeline) to your internal engineering channel. Other teams will learn from your experience, and the visibility creates healthy social pressure to complete the action items.

Some companies go further and publish post-incident reviews externally. Cloudflare, PagerDuty, and Atlassian all publish detailed incident reports. This builds customer trust by demonstrating transparency and accountability. If you're comfortable with external publication, the trust benefit is significant — customers who've seen how you handle incidents are more confident in your reliability than customers who've never seen a failure.


Related: Engineering Metrics That Actually Matter | DevOps Fundamentals for Growing Teams | Quantifying Technology Risk