Chaos engineering is the practice of deliberately injecting failures into your production systems to discover weaknesses before they cause real outages. Kill a server. Drop network connections between two services. Corrupt a cache. Introduce latency to a database query. Then observe: does the system degrade gracefully, or does it fall over?
Netflix created this discipline with Chaos Monkey — a tool that randomly killed production servers to ensure their systems could handle instance failures without customer impact. The idea spread across the industry and became a formal discipline with its own tools, practices, and conferences.
Why It Exists
Traditional testing verifies that things work correctly. Chaos engineering verifies that things fail correctly. These are fundamentally different questions.
Your unit tests confirm that the payment service processes a valid transaction. Chaos engineering asks: what happens when the payment service can’t reach the database for 30 seconds? Does it retry? Queue the request? Return an error to the user? Hang indefinitely and take down the checkout flow?
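A hedged sketch of what "failing correctly" can look like in Python. Every name here (the exception class, the retry counts, the queue fallback) is an illustrative assumption, not any particular payment system: the handler retries with bounded backoff, then degrades gracefully instead of hanging and taking the checkout flow with it.

```python
import time

class DatabaseUnavailable(Exception):
    """Stand-in for whatever error a real driver raises on timeout."""
    pass

def process_payment(db_call, max_retries=3, backoff_s=0.01):
    """Retry with exponential backoff, then queue instead of hanging."""
    for attempt in range(max_retries):
        try:
            return ("ok", db_call())
        except DatabaseUnavailable:
            time.sleep(backoff_s * 2 ** attempt)   # brief, bounded waits
    # Retries exhausted: degrade gracefully rather than block the
    # checkout flow indefinitely.
    return ("queued", None)

# Simulate a database that is unreachable for the first two calls.
state = {"calls": 0}
def flaky_db():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise DatabaseUnavailable("unreachable")
    return "txn-123"

print(process_payment(flaky_db))   # → ('ok', 'txn-123')
```

The point is that each of those questions (retry? queue? error? hang?) has an explicit answer in the code, which is exactly what a chaos experiment would verify.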
Most teams discover the answers to these questions during production outages — at 2 AM, with customers complaining, while an engineer scrambles through logs. Chaos engineering is the practice of discovering those answers proactively, during business hours, with the team prepared.
How It Works
A well-structured chaos experiment follows this pattern:
- Form a hypothesis. “If we kill one of three application servers, the load balancer will route traffic to the remaining two and users won’t notice.”
- Define the blast radius. Start small. Test against a subset of traffic or a non-critical environment.
- Inject the failure. Kill the server, introduce the latency, drop the packets.
- Observe the results. Did the system behave as hypothesized? Check error rates, latency, user impact.
- Fix what broke. When the hypothesis is wrong — and it often is — you’ve found a real weakness before your customers did.
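The steps above can be sketched as a small harness. Everything in this toy is an assumption for illustration — the latency numbers, the hook names, the 50-request sample — not any real tool’s API. The shape is what matters: measure steady state, inject, observe, always roll back.

```python
import random
import statistics

def chaos_experiment(steady_state, inject, observe, stop):
    """Minimal hypothesis → inject → observe → roll back loop."""
    baseline = steady_state()
    inject()                      # e.g. add latency, kill a server
    try:
        result = observe()
    finally:
        stop()                    # always undo the failure, even on error
    return baseline, result

# Toy system: a "service" whose latency we can inflate.
extra_latency_ms = {"value": 0}

def request_latency_ms():
    return 20 + extra_latency_ms["value"] + random.random() * 5

def steady_state():
    return statistics.mean(request_latency_ms() for _ in range(50))

def inject():
    extra_latency_ms["value"] = 200   # the injected failure

def observe():
    return statistics.mean(request_latency_ms() for _ in range(50))

def stop():
    extra_latency_ms["value"] = 0

baseline, during = chaos_experiment(steady_state, inject, observe, stop)
hypothesis_held = during < baseline + 50   # "users won't notice"
print(f"baseline={baseline:.0f}ms during={during:.0f}ms held={hypothesis_held}")
```

Here the hypothesis fails — the injected 200 ms is very visible — which is the productive outcome: a falsified hypothesis is a weakness found on your schedule.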
Tools like Gremlin, LitmusChaos, and AWS Fault Injection Simulator make this easier to execute in a controlled way.
Who Should Be Doing It
This is where I differ from the conference circuit. Most companies are not ready for chaos engineering, and that’s fine.
Under 20 engineers: you probably don’t have the observability, the automated recovery, or the redundancy that makes chaos experiments safe and useful. If you can’t see the impact of a failure in real time (monitoring, alerting, distributed tracing), injecting failures is just creating outages with extra steps. Focus on building the foundation first.
At 20-50 engineers: you can start with lightweight chaos practices. Test your disaster recovery procedures. Kill a staging server and see what happens. Run a game day where the team practices incident response against a simulated failure. These give you most of the value without the risk.
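A game day can start even smaller than infrastructure tooling. Here is a deliberately toy simulation — three hypothetical workers behind a trivial router, nothing like a real load balancer — that exercises the same hypothesis: kill one of three servers, and traffic should still be served.

```python
# Toy "cluster": three workers behind a trivial first-alive router.
workers = {"app-1": True, "app-2": True, "app-3": True}   # alive flags

def route_request():
    """Send the request to the first live worker; fail on total outage."""
    for name, alive in workers.items():
        if alive:
            return f"served by {name}"
    raise RuntimeError("no workers available")

# Game-day drill: kill one of three workers (the injected failure),
# then verify the hypothesis that requests still succeed.
workers["app-1"] = False
responses = [route_request() for _ in range(10)]
survived = all(r.startswith("served by") for r in responses)
workers["app-1"] = True   # restore the worker after the drill

print(survived)   # → True
```

The staging-server version of this drill is the same loop with real infrastructure: kill the instance, watch the dashboards, restore it, write down what surprised you.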
At 50+ engineers with mature infrastructure: formal chaos engineering becomes valuable. At this scale, the system is complex enough that nobody fully understands all the failure modes. Automated chaos experiments running against a subset of production traffic can surface issues that no amount of code review or testing would catch.
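One common shape for limiting blast radius against live traffic is a wrapper that injects the fault into only a small, configurable fraction of requests. The function names and the 1% rate below are assumptions for illustration, not a specific tool’s interface:

```python
import random

def with_fault_injection(handler, rate):
    """Wrap a request handler so that roughly `rate` (0.0-1.0) of calls
    experience an injected failure; the rest pass through untouched."""
    def wrapped(request):
        if random.random() < rate:
            raise TimeoutError("chaos: injected fault")
        return handler(request)
    return wrapped

# Limit the blast radius to ~1% of traffic.
echo = with_fault_injection(lambda req: f"ok:{req}", rate=0.01)
```

In a real deployment the rate would sit behind a feature flag or kill switch so the experiment can be stopped instantly when error budgets start burning.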
Common Mistakes
Running chaos experiments without observability. If you can’t see what happened after you inject a failure, you’ve learned nothing. You need monitoring, alerting, and distributed tracing in place before chaos engineering makes sense.
Starting in production before proving it in staging. Your first chaos experiment should never be in production. Build confidence in staging, establish runbooks for the failures you might trigger, and make sure your team knows how to stop the experiment if things go wrong.
Treating it as a one-time exercise. A single game day is useful but limited. Chaos engineering delivers the most value as a continuous practice — systems change, new services get added, and failure modes that were handled six months ago might not be handled after the last refactor.
Confusing chaos engineering with breaking things for fun. Real chaos engineering is scientific: hypothesis, controlled experiment, observation, learning. Randomly killing servers without a hypothesis or observation framework is just causing problems.
The Verdict
Chaos engineering is a powerful practice that most companies aren’t ready for yet. And that’s okay. The principles — think about failure modes proactively, test your recovery procedures, don’t assume your system handles failures gracefully just because you designed it to — are valuable at any scale. Start with game days and disaster recovery drills. Graduate to automated chaos experiments when your infrastructure, observability, and team are mature enough to make them productive rather than destructive.
Related: Observability and Monitoring for Growing Teams | Post-Incident Reviews That Work
