The $150 Million Typo: AWS S3 and the Command That Broke the Internet
How a single mistyped command took down a third of the internet for 4 hours - a real-time problem-solving session for operational safety.
The Problem Statement
You’re an AWS engineer debugging a billing system that’s running slower than it should. It’s a routine Tuesday morning: February 28, 2017. To fix the slowdown, you need to remove a small number of servers from one of the S3 subsystems used by the billing process.
You have an established playbook. You’ve done this operation many times before. It’s a simple command to remove capacity from the system.
You type the command and hit enter.
In the next 4 hours, you’ll cause:
- Complete outage of Amazon S3 in the largest AWS region
- Disruption to thousands of websites and services
- An estimated $150 million in losses across S&P 500 companies
- AWS’s own status dashboard going dark (it relied on S3)
This is the real story of the AWS S3 outage, based on Amazon’s official post-incident summary available at https://aws.amazon.com/message/41926/
Let’s work through this problem together and understand how one typo created one of the internet’s worst outages.
Setting the Scene: What We Know
The Business Context
- AWS S3 is the backbone of internet storage
- US-EAST-1 region serves massive traffic (default region for many services)
- Tuesday morning, minimal AWS staff on site
- Billing system running slower than normal (not critical, but needs attention)
The Technical Setup
- S3 has multiple subsystems handling different functions
- Index subsystem: manages metadata and location of all S3 objects
- Placement subsystem: manages allocation of new storage
- Both systems designed to be resilient and scalable
- Established procedures for removing capacity during maintenance
The Operational Context
- Authorized engineer with proper access
- Following documented playbook for capacity reduction
- Operation performed many times successfully in the past
- System designed to handle capacity changes gracefully
Problem-Solving Session: What Would You Do?
Before we reveal what happened, let’s work through the decision points an engineer faces:
Decision Point 1: Command Construction
You need to remove servers from the subsystem used by the billing process. Your command needs to specify:
- Which subsystem to target
- How many servers to remove
- Which specific servers to remove
Think about this: what could go wrong when constructing this command? (A bare-bones example follows the list.)
- Wrong subsystem targeted?
- Wrong number of servers specified?
- Wrong servers selected?
- Command syntax error?
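AWS has never published the exact command or tooling involved, so here is a purely hypothetical sketch of what an unguarded capacity-removal wrapper might look like. The point is how little separates the intended input from a catastrophic one when nothing checks it:
#!/usr/bin/env bash
# Hypothetical wrapper, for illustration only; the real AWS command and its
# arguments are not public.
pool="$1"     # e.g. the pool supporting the billing process
count="$2"    # intended: a small number, such as 3

echo "Removing $count servers from pool '$pool'"
# Nothing here distinguishes an intended "3" from a mistyped "30" or "300"
# before the operation runs. The safety framework later in this post closes
# that gap.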
Decision Point 2: Validation Before Execution
What checks would you perform before running the command?
- Review the command syntax?
- Verify the target environment?
- Check current system capacity?
- Confirm the server count?
Decision Point 3: Execution Safety
What safety measures would you expect in the tooling?
- Confirmation prompts?
- Dry-run mode?
- Capacity minimums enforcement?
- Rollback capabilities?
Decision Point 4: Monitoring During Execution
How would you monitor the impact as the command runs?
- System health dashboards?
- Error rate monitoring?
- Performance metrics?
- User impact tracking?
What Actually Happened: The Cascade of Errors
The Fatal Command
The engineer intended to remove a small number of servers from the subsystem used by the S3 billing process. Instead, one of the command’s inputs was entered incorrectly, and a much larger set of servers was removed. Those servers supported two other critical subsystems:
- Index subsystem - manages metadata for all S3 objects in the region
- Placement subsystem - manages storage allocation for new objects
The Technical Cascade
9:37 AM PST: Command executed, removes significant capacity
9:40 AM PST: Index subsystem begins failing (can’t serve GET, LIST, PUT, DELETE requests)
9:45 AM PST: Placement subsystem fails (can’t allocate storage for new objects)
9:50 AM PST: S3 in US-EAST-1 completely unavailable
The Business Impact Cascade
10:00 AM PST: Websites start failing globally
- Slack goes down
- Quora becomes unavailable
- Docker delays major announcements
- Trello stops working
- Thousands of other services affected
10:30 AM PST: The irony becomes apparent
- AWS’s own status dashboard can’t update (uses S3)
- AWS forced to communicate via Twitter
- Even the tools to report the outage are down
The Recovery Challenge
The problem: S3 systems hadn’t been fully restarted in years due to massive growth
The complexity: Safety checks and metadata validation take longer at scale
The timeline:
- 12:26 PM PST: Index subsystem begins partial recovery
- 1:18 PM PST: GET, LIST, DELETE operations fully restored
- 1:54 PM PST: PUT operations restored, S3 fully operational
Total outage time: 4 hours and 17 minutes
Root Cause Analysis: Why This Happened
1. Tool Design Failure
- Problem: Command allowed removing more capacity than safe
- Missing: Validation of minimum capacity requirements
- Missing: Confirmation prompts for large operations
2. Human Factors
- Problem: Easy to make parameter input errors
- Missing: Command preview or dry-run mode
- Missing: Clear feedback about operation scope
3. System Architecture
- Problem: Removing capacity triggered full system restart
- Missing: Graceful degradation for capacity reduction
- Missing: Faster restart procedures for large-scale systems
4. Operational Procedures
- Problem: Procedure assumed tool would prevent dangerous operations
- Missing: Manual validation steps
- Missing: Impact assessment before execution
The Framework: Operational Command Safety
Based on this incident, here’s a framework for preventing similar disasters:
1. Command Validation Layer
Pre-execution checks:
# Example: Validate before dangerous operations
if [ "$servers_to_remove" -gt "$safe_threshold" ]; then
  echo "ERROR: Attempting to remove $servers_to_remove servers"
  echo "Safe threshold is $safe_threshold"
  echo "Current capacity: $current_capacity"
  echo "Resulting capacity: $((current_capacity - servers_to_remove))"
  exit 1
fi
Required validations:
- Command syntax verification
- Parameter range checking
- System state prerequisites
- Impact estimation
2. Operational Safety Controls
Implement multiple safety layers; a sketch combining them follows the list:
- Dry-run mode: Show what would happen without executing
- Confirmation prompts: Require explicit confirmation for high-impact operations
- Capacity minimums: Prevent removing below safe operational levels
- Rate limiting: Prevent removing too much capacity too quickly
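Here is a minimal sketch of what these layers can look like when combined in a shell wrapper. The thresholds and the get_current_capacity / remove_servers helpers are placeholders, not real AWS tooling; substitute your own infrastructure’s equivalents:
#!/usr/bin/env bash
# Sketch of layered safety controls for a capacity-removal wrapper.
# get_current_capacity and remove_servers are placeholder helpers; the
# thresholds are examples, not recommendations.
set -euo pipefail

POOL="$1"
COUNT="$2"
DRY_RUN="${3:-}"      # pass --dry-run as the third argument to preview only
MIN_CAPACITY=50       # capacity minimum: never drop the pool below this
MAX_BATCH=5           # rate limit: remove at most this many servers per run

current=$(get_current_capacity "$POOL")
remaining=$((current - COUNT))

# Capacity minimum: refuse operations that would leave the pool unsafe
if (( remaining < MIN_CAPACITY )); then
  echo "REFUSED: removing $COUNT servers would leave $remaining (< $MIN_CAPACITY)" >&2
  exit 1
fi

# Rate limiting: cap the size of any single operation
if (( COUNT > MAX_BATCH )); then
  echo "REFUSED: at most $MAX_BATCH servers per run (requested $COUNT)" >&2
  exit 1
fi

# Dry-run mode: show what would happen, then stop
echo "Pool $POOL: $current servers now, $remaining after removing $COUNT"
if [[ "$DRY_RUN" == "--dry-run" ]]; then
  echo "Dry run only; no changes made."
  exit 0
fi

# Confirmation prompt: make the operator restate the scope of the change
read -r -p "Type the number of servers to confirm removal: " confirm
if [[ "$confirm" != "$COUNT" ]]; then
  echo "Confirmation mismatch; aborting." >&2
  exit 1
fi

remove_servers "$POOL" "$COUNT"
Note that the confirmation prompt asks the operator to retype the count rather than just press “y”; restating the scope is what catches a transposed digit.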
3. Impact Assessment Process
Before any infrastructure command (a minimal pre-flight gate is sketched after this list):
- Estimate impact: How many services/users affected?
- Assess risks: What could go wrong?
- Plan recovery: How to undo if things go bad?
- Validate timing: Is this the right time for this operation?
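One lightweight way to enforce this, assuming an interactive terminal, is a pre-flight gate that refuses to continue until each question has an answer. It’s a sketch, not a substitute for a real change-management process:
preflight_check() {
  # Block execution until the operator has answered every impact question.
  local questions=(
    "Estimated impact (services/users affected)?"
    "Top risk if this goes wrong?"
    "Rollback plan?"
    "Why is now the right time?"
  )
  local q answer
  for q in "${questions[@]}"; do
    read -r -p "$q " answer
    if [[ -z "$answer" ]]; then
      echo "Unanswered question; stopping before execution." >&2
      return 1
    fi
  done
}

preflight_check || exit 1
# ...the actual infrastructure command would run here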
4. Monitoring and Circuit Breakers
Real-time monitoring during operations (see the sketch after this list):
- System health metrics
- Error rate tracking
- User impact measurement
- Automatic rollback triggers
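Here is a sketch of a simple circuit breaker that could run alongside a capacity change. operation_in_progress, fetch_error_rate, and rollback_capacity_change stand in for whatever status checks, metrics queries, and rollback procedures your environment actually has:
# Watch error rates while the operation runs and roll back automatically
# if they cross a threshold. All three helper commands are placeholders.
ERROR_RATE_LIMIT=2.0   # percent; pick a threshold that matches your SLOs
CHECK_INTERVAL=15      # seconds between checks

while operation_in_progress; do
  rate=$(fetch_error_rate)   # e.g. query your metrics API for the 5xx rate
  if awk -v r="$rate" -v lim="$ERROR_RATE_LIMIT" 'BEGIN { exit !(r > lim) }'; then
    echo "Error rate ${rate}% exceeds ${ERROR_RATE_LIMIT}%; rolling back" >&2
    rollback_capacity_change
    break
  fi
  sleep "$CHECK_INTERVAL"
done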
Red Flags for High-Risk Operations
Command-Level Red Flags:
- Operations affecting core infrastructure
- Commands that remove capacity or resources
- Bulk operations across multiple systems
- Commands that can’t be easily undone
Timing Red Flags:
- Operations during business hours
- Changes during high-traffic periods
- Maintenance without sufficient backup coverage
- Operations when key personnel are unavailable
System Red Flags:
- Systems that haven’t been restarted recently
- Infrastructure with unknown restart times
- Systems without graceful degradation
- Operations affecting multiple dependent systems
Questions to Ask Before High-Risk Commands
About the operation:
- “What’s the maximum impact if this goes wrong?”
- “How quickly can we detect if something’s wrong?”
- “What’s our rollback plan?”
- “Is this the right time for this operation?”
About the tooling:
- “Does our tooling prevent dangerous operations?”
- “Do we have dry-run mode for this command?”
- “Are there confirmation prompts for high-impact operations?”
- “How do we validate command parameters?”
About the system:
- “What are the dependencies of what we’re changing?”
- “How will the system behave with reduced capacity?”
- “What’s our expected recovery time if something goes wrong?”
- “Who needs to be notified before we proceed?”
AWS’s Response and Industry Changes
Immediate fixes AWS implemented:
- Modified capacity removal tool to prevent removing too much too quickly
- Added minimum capacity enforcement to prevent dangerous operations
- Improved restart procedures for large-scale systems
- Made status dashboard multi-region to avoid self-dependency
Industry-wide changes:
- Operational safety tooling became standard practice
- Chaos engineering gained broader adoption
- Multi-region architecture became table stakes
- Operational rehearsals became more common
Your Operational Safety Checklist
For Infrastructure Commands:
- Command has dry-run mode
- Impact assessment completed
- Confirmation prompts required
- Rollback plan tested
- Monitoring ready for execution
For High-Risk Operations:
- Multiple people review the plan
- Operations team on standby
- Customer communication prepared
- Executive notification sent
- Post-operation review scheduled
For Critical Systems:
- Graceful degradation tested
- Recovery procedures documented
- Dependencies mapped and notified
- Timing optimized for minimal impact
- Success criteria defined
The Lesson: Operational Discipline Matters
AWS’s $150 million typo teaches us that:
- Human error is inevitable - design tools to prevent it
- Small mistakes can have massive impact - implement proportional safeguards
- Even experts make errors - systematic safety measures matter more than expertise
- Recovery planning is critical - know how to fix things when they break
The engineer who made this mistake was authorized, experienced, and following procedures. The failure was in the system design, not the person.
Your Next Steps
- Audit your high-risk operational commands - What safeguards do they have?
- Implement dry-run modes for infrastructure operations
- Add confirmation prompts for commands with high business impact
- Practice recovery procedures for your critical systems
Remember: AWS had world-class engineers and sophisticated systems. If it can happen to them, it can happen to anyone.
The question isn’t whether you’ll make operational mistakes - it’s whether your systems are designed to catch them before they become disasters.
Need help implementing operational safety frameworks for your infrastructure? I’ve worked with companies to design command safety systems and operational procedures that prevent these kinds of disasters. Contact me to discuss how we can make your operations both efficient and safe.
Want more real-time problem-solving sessions like this? Subscribe to our technical leadership newsletter for monthly deep-dives into how engineering leaders handle crisis situations.