Six hours down. Customers angry. The engineering team scrambling on a bridge call, sharing half-formed theories about what broke and why. Someone finally found the fix. The platform came back up. And now you’re asking the right question: how do we make sure this never happens again?

Here’s the honest answer: you can’t guarantee it never happens again. What you can do is make it much less likely, catch it faster, and recover in minutes instead of hours. That’s a realistic goal. “Never again” is not.

Let me tell you what the companies that achieve that goal do differently.

Start With a Real Post-Incident Review

The instinct after an outage is to fix the immediate cause and move on. Someone deployed a bad config? Add a review step. The database hit its connection limit? Raise the limit. These are reasonable quick fixes, and you should absolutely make them. But if you stop there, you’ve treated the trigger and left the underlying conditions in place.

A real post-incident review — sometimes called a post-mortem, though that word implies blame, which is counterproductive — asks a different set of questions.

Not “who broke it?” but “what conditions made it possible to break?”

Not “what was the fix?” but “why did it take us three hours to identify the cause?”

Not “what failed?” but “what gaps in our observability did this incident expose?”

The goal is a written document that captures: what happened, the timeline of detection and response, the contributing factors (there are almost always more than one), and specific action items with owners and deadlines. Not a retrospective you write and file. One that you review in 30 days to see which action items are actually done.
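If it helps to have a skeleton, the smallest useful version looks something like this (adapt it freely; the sections just mirror the list above):

  • Summary: two or three sentences anyone in the company can understand
  • Timeline: trigger, first alert, first human response, mitigation, full recovery, with timestamps
  • Contributing factors: every condition that had to hold, not just the trigger
  • Action items: each with a named owner and a deadline
  • 30-day follow-up: which action items shipped, which stalled, and why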

Blameless is not optional. Engineers who expect to be blamed for outages hide information, downplay contributing factors, and stop raising concerns about fragile systems. The companies that have the most reliable systems are the ones where engineers feel safe saying “this thing we built has a scary failure mode” before the failure happens.

Understand Why You Didn’t Know Faster

Six hours is a long time to be down. But equally telling: how long before you knew you were down? Was it a customer complaint? A monitoring alert? A developer who happened to check the dashboard?

If your answer is a customer complaint, you have a monitoring problem. You should know about production failures before your users do.

This is the observability gap most growing teams have. They built the product and added monitoring later, or they have monitoring but it’s noisy and the team has alert fatigue from too many false positives. The result is the same: by the time anyone is paying attention, the damage is done.

What you actually need is straightforward: metrics on the things that matter to users (request latency, error rate, availability), alerts that fire when those metrics cross meaningful thresholds, and someone — a person, with a phone, who is on call — who is expected to respond when that alert fires.
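To make that concrete, here is a deliberately minimal sketch of what “a metric, a threshold, and a page” reduces to. The five-minute window, the 2% error threshold, and the latency limit are placeholder assumptions, not recommendations, and in practice this logic lives in your monitoring system rather than your application code:

    # Minimal sketch: user-facing metrics, meaningful thresholds, a page.
    # All numbers here are illustrative assumptions.
    from dataclasses import dataclass

    WINDOW_S = 300             # evaluate the last 5 minutes of traffic
    ERROR_RATE_LIMIT = 0.02    # page when more than 2% of requests fail
    P99_LATENCY_LIMIT_MS = 500

    @dataclass
    class Request:
        timestamp: float
        latency_ms: float
        failed: bool

    def page_reasons(requests: list[Request], now: float) -> list[str]:
        """Return the reasons (if any) a human should be paged right now."""
        recent = [r for r in requests if now - r.timestamp <= WINDOW_S]
        if not recent:
            return ["no traffic observed: silence is also a signal"]
        reasons = []
        error_rate = sum(r.failed for r in recent) / len(recent)
        if error_rate > ERROR_RATE_LIMIT:
            reasons.append(f"error rate {error_rate:.1%}")
        p99 = sorted(r.latency_ms for r in recent)[int(0.99 * (len(recent) - 1))]
        if p99 > P99_LATENCY_LIMIT_MS:
            reasons.append(f"p99 latency {p99:.0f}ms")
        return reasons

Notice that everything the sketch alerts on is a symptom users actually feel. That is most of the cure for the alert fatigue described above.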

On-call doesn’t have to be brutal. But it does have to be real. If nobody is responsible for production at 2am, your recovery time is however long it takes for someone to notice at 9am.

Find the Real Contributing Factors

The proximate cause of your outage was whatever the engineer fixed. But six-hour outages almost never have a single cause. They have a chain of conditions, all of which had to hold at once. That matters, because breaking any one link in that chain would have prevented the outage or shortened it.

Typical contributing factors I see in post-incident reviews:

  • A deployment that wasn’t rolled back quickly because rollback wasn’t tested and nobody knew the exact procedure under pressure
  • A database or infrastructure component that had no redundancy, because it was never worth the cost — until it was
  • Monitoring that existed but wasn’t alerting on the right signal
  • An on-call rotation that was theoretical — someone was technically responsible, but nobody escalated when they didn’t respond
  • A change management process that allowed a risky deploy to go out during peak traffic hours
  • Runbooks that were out of date or missing for the specific failure mode

Each of those is fixable. None of them require rebuilding your system from scratch. But you have to surface them honestly in the post-incident review, and you have to assign someone to actually fix them.

Build Reliability Into Your Process, Not Just Your Architecture

The most common mistake companies make after a major outage is investing entirely in technical solutions — redundancy, failover, better hardware — while ignoring process. Architecture can absorb certain kinds of failures. Process is what determines whether your team can respond effectively to the ones that happen anyway.

Two practices matter more than any single technical change:

Define your reliability targets explicitly. What uptime does your business actually need? 99.9% allows roughly 8.8 hours of downtime per year. 99.95% allows about 4.4 hours. 99.99% allows about 53 minutes. Each of those targets demands a different level of investment. If you don’t have a target, you can’t make a coherent investment decision, and your engineers can’t make good tradeoffs between speed and reliability.
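The arithmetic behind those numbers is worth having at your fingertips; a few lines of Python reproduce the whole table:

    # Downtime budget implied by an availability target.
    HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

    for target in (0.999, 0.9995, 0.9999):
        budget_h = (1 - target) * HOURS_PER_YEAR
        print(f"{target:.2%} uptime -> {budget_h:.1f} hours "
              f"({budget_h * 60:.0f} minutes) of downtime per year")

Run it and one more fact falls out: a single six-hour outage blows an entire year’s budget at 99.95% and most of it at 99.9%. That is the kind of statement that makes the investment conversation concrete.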

Run chaos or failure exercises before production runs them for you. These don’t have to be elaborate game days. Start with a simple question in your next engineering meeting: “What would happen if this database went down right now?” If the honest answer is “we’re not sure,” that’s the next thing to test. Find out in a controlled way, not at 2am.
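The smallest version of that test can be a script you run on purpose against a staging environment. This sketch only probes a TCP connection, and every name, host, and port in it is a hypothetical placeholder:

    # A first, controlled failure exercise: what actually happens when
    # the database is unreachable? (Hypothetical names throughout.)
    import socket

    DB_HOST, DB_PORT = "127.0.0.1", 5599  # assumed: nothing listens here
    TIMEOUT_S = 2.0

    def fetch_orders() -> str:
        """Stand-in for a real data-access call in your service."""
        try:
            conn = socket.create_connection((DB_HOST, DB_PORT), timeout=TIMEOUT_S)
            conn.close()
            return "ok"
        except OSError:
            # The question the exercise answers: do callers get a fast,
            # explicit failure like this, or do they hang until something
            # upstream times out?
            return "degraded: orders unavailable"

    if __name__ == "__main__":
        print(fetch_orders())  # find out in daylight, not at 2am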

If you’re looking at your post-incident notes from this week and trying to figure out which action items are actually going to prevent the next outage versus which ones are theater, that’s a 15-minute conversation. I’ve reviewed more post-incident reports than I can count — at companies that kept having outages and companies that stopped. The difference is usually three or four specific practices. Book time at go.nebari.cc/15-min and we’ll figure out which ones apply to you.


Related: Post-Incident Reviews That Work | Observability and Monitoring for Growing Teams | What Is an SLO?