The $150 Million Typo: AWS S3 and the Command That Broke the Internet
How a single mistyped command took down a third of the internet for 4 hours - a real-time problem-solving session for operational safety.
The Problem Statement
You’re an AWS engineer debugging a billing system that’s running slower than it should. It’s a routine Tuesday morning: February 28, 2017. To fix the slowdown, you need to remove a small number of servers from one of the S3 subsystems used by the billing process.
You have an established playbook. You’ve done this operation many times before. It’s a simple command to remove capacity from the system.
You type the command and hit enter.
In the next 4 hours, you’ll cause:
- Complete outage of Amazon S3 in the largest AWS region
- Disruption to thousands of websites and services
- An estimated $150 million in losses across S&P 500 companies
- AWS’s own status dashboard going dark (it relied on S3)
This is the real story of the AWS S3 outage, based on Amazon’s official post-incident summary available at https://aws.amazon.com/message/41926/
Let’s work through this problem together and understand how one typo created one of the internet’s worst outages.
Setting the Scene: What We Know
The Business Context
- AWS S3 is the backbone of internet storage
- US-EAST-1 region serves massive traffic (default region for many services)
- Tuesday morning, minimal AWS staff on site
- Billing system running slower than normal (not critical, but needs attention)
The Technical Setup
- S3 has multiple subsystems handling different functions
- Index subsystem: manages metadata and location of all S3 objects
- Placement subsystem: manages allocation of new storage
- Both systems designed to be resilient and scalable
- Established procedures for removing capacity during maintenance
The Operational Context
- Authorized engineer with proper access
- Following documented playbook for capacity reduction
- Operation performed many times successfully in the past
- System designed to handle capacity changes gracefully
Problem-Solving Session: What Would You Do?
Before we reveal what happened, let’s work through the decision points an engineer faces:
Decision Point 1: Command Construction
You need to remove servers from the subsystem used by the billing process. Your command needs to specify:
- Which subsystem to target
- How many servers to remove
- Which specific servers to remove
Think about this: what could go wrong when constructing this command? (A bare-bones example follows the list.)
- Wrong subsystem targeted?
- Wrong number of servers specified?
- Wrong servers selected?
- Command syntax error?
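AWS has never published the exact command or tooling involved, so here is a purely hypothetical sketch of what an unguarded capacity-removal wrapper might look like. The point is how little separates the intended input from a catastrophic one when nothing checks it:
#!/usr/bin/env bash
# Hypothetical wrapper, for illustration only; the real AWS command and its
# arguments are not public.
pool="$1"     # e.g. the pool supporting the billing process
count="$2"    # intended: a small number, such as 3

echo "Removing $count servers from pool '$pool'"
# Nothing here distinguishes an intended "3" from a mistyped "30" or "300"
# before the operation runs. The safety framework later in this post closes
# that gap.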
Decision Point 2: Validation Before Execution
What checks would you perform before running the command?
- Review the command syntax?
- Verify the target environment?
- Check current system capacity?
- Confirm the server count?
Decision Point 3: Execution Safety
What safety measures would you expect in the tooling?
- Confirmation prompts?
- Dry-run mode?
- Capacity minimums enforcement?
- Rollback capabilities?
Decision Point 4: Monitoring During Execution
How would you monitor the impact as the command runs?
- System health dashboards?
- Error rate monitoring?
- Performance metrics?
- User impact tracking?
What Actually Happened: The Cascade of Errors
The Fatal Command
The engineer intended to remove a small number of servers from the subsystem used by the S3 billing process. Instead, one of the command’s inputs was entered incorrectly, and a much larger set of servers was removed. Those servers supported two other critical subsystems:
- Index subsystem - manages metadata for all S3 objects in the region
- Placement subsystem - manages storage allocation for new objects
The Technical Cascade
9:37 AM PST: Command executed, removes significant capacity
9:40 AM PST: Index subsystem begins failing (can’t serve GET, LIST, PUT, DELETE requests)
9:45 AM PST: Placement subsystem fails (can’t allocate storage for new objects)
9:50 AM PST: S3 in US-EAST-1 completely unavailable
The Business Impact Cascade
10:00 AM PST: Websites start failing globally
- Slack goes down
- Quora becomes unavailable
- Docker delays major announcements
- Trello stops working
- Thousands of other services affected
10:30 AM PST: The irony becomes apparent
- AWS’s own status dashboard can’t update (uses S3)
- AWS forced to communicate via Twitter
- Even the tools to report the outage are down
The Recovery Challenge
The problem: S3 systems hadn’t been fully restarted in years due to massive growth
The complexity: Safety checks and metadata validation take longer at scale
The timeline:
- 12:26 PM PST: Index subsystem begins partial recovery
- 1:18 PM PST: GET, LIST, DELETE operations fully restored
- 1:54 PM PST: PUT operations restored, S3 fully operational
Total outage time: 4 hours and 17 minutes
Root Cause Analysis: Why This Happened
1. Tool Design Failure
- Problem: Command allowed removing more capacity than safe
- Missing: Validation of minimum capacity requirements
- Missing: Confirmation prompts for large operations
2. Human Factors
- Problem: Easy to make parameter input errors
- Missing: Command preview or dry-run mode
- Missing: Clear feedback about operation scope
3. System Architecture
- Problem: Removing capacity triggered full system restart
- Missing: Graceful degradation for capacity reduction
- Missing: Faster restart procedures for large-scale systems
4. Operational Procedures
- Problem: Procedure assumed tool would prevent dangerous operations
- Missing: Manual validation steps
- Missing: Impact assessment before execution
The Framework: Operational Command Safety
Based on this incident, here’s a framework for preventing similar disasters:
1. Command Validation Layer
Pre-execution checks:
# Example: Validate before dangerous operations
if [ "$servers_to_remove" -gt "$safe_threshold" ]; then
  echo "ERROR: Attempting to remove $servers_to_remove servers"
  echo "Safe threshold is $safe_threshold"
  echo "Current capacity: $current_capacity"
  echo "Resulting capacity: $((current_capacity - servers_to_remove))"
  exit 1
fi
Required validations:
- Command syntax verification
- Parameter range checking
- System state prerequisites
- Impact estimation
2. Operational Safety Controls
Implement multiple safety layers; a sketch combining them follows the list:
- Dry-run mode: Show what would happen without executing
- Confirmation prompts: Require explicit confirmation for high-impact operations
- Capacity minimums: Prevent removing below safe operational levels
- Rate limiting: Prevent removing too much capacity too quickly
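Here is a minimal sketch of what these layers can look like when combined in a shell wrapper. The thresholds and the get_current_capacity / remove_servers helpers are placeholders, not real AWS tooling; substitute your own infrastructure’s equivalents:
#!/usr/bin/env bash
# Sketch of layered safety controls for a capacity-removal wrapper.
# get_current_capacity and remove_servers are placeholder helpers; the
# thresholds are examples, not recommendations.
set -euo pipefail

POOL="$1"
COUNT="$2"
DRY_RUN="${3:-}"      # pass --dry-run as the third argument to preview only
MIN_CAPACITY=50       # capacity minimum: never drop the pool below this
MAX_BATCH=5           # rate limit: remove at most this many servers per run

current=$(get_current_capacity "$POOL")
remaining=$((current - COUNT))

# Capacity minimum: refuse operations that would leave the pool unsafe
if (( remaining < MIN_CAPACITY )); then
  echo "REFUSED: removing $COUNT servers would leave $remaining (< $MIN_CAPACITY)" >&2
  exit 1
fi

# Rate limiting: cap the size of any single operation
if (( COUNT > MAX_BATCH )); then
  echo "REFUSED: at most $MAX_BATCH servers per run (requested $COUNT)" >&2
  exit 1
fi

# Dry-run mode: show what would happen, then stop
echo "Pool $POOL: $current servers now, $remaining after removing $COUNT"
if [[ "$DRY_RUN" == "--dry-run" ]]; then
  echo "Dry run only; no changes made."
  exit 0
fi

# Confirmation prompt: make the operator restate the scope of the change
read -r -p "Type the number of servers to confirm removal: " confirm
if [[ "$confirm" != "$COUNT" ]]; then
  echo "Confirmation mismatch; aborting." >&2
  exit 1
fi

remove_servers "$POOL" "$COUNT"
Note that the confirmation prompt asks the operator to retype the count rather than just press “y”; restating the scope is what catches a transposed digit.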
3. Impact Assessment Process
Before any infrastructure command (a minimal pre-flight gate is sketched after this list):
- Estimate impact: How many services/users affected?
- Assess risks: What could go wrong?
- Plan recovery: How to undo if things go bad?
- Validate timing: Is this the right time for this operation?
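One lightweight way to enforce this, assuming an interactive terminal, is a pre-flight gate that refuses to continue until each question has an answer. It’s a sketch, not a substitute for a real change-management process:
preflight_check() {
  # Block execution until the operator has answered every impact question.
  local questions=(
    "Estimated impact (services/users affected)?"
    "Top risk if this goes wrong?"
    "Rollback plan?"
    "Why is now the right time?"
  )
  local q answer
  for q in "${questions[@]}"; do
    read -r -p "$q " answer
    if [[ -z "$answer" ]]; then
      echo "Unanswered question; stopping before execution." >&2
      return 1
    fi
  done
}

preflight_check || exit 1
# ...the actual infrastructure command would run here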
4. Monitoring and Circuit Breakers
Real-time monitoring during operations (see the sketch after this list):
- System health metrics
- Error rate tracking
- User impact measurement
- Automatic rollback triggers
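Here is a sketch of a simple circuit breaker that could run alongside a capacity change. operation_in_progress, fetch_error_rate, and rollback_capacity_change stand in for whatever status checks, metrics queries, and rollback procedures your environment actually has:
# Watch error rates while the operation runs and roll back automatically
# if they cross a threshold. All three helper commands are placeholders.
ERROR_RATE_LIMIT=2.0   # percent; pick a threshold that matches your SLOs
CHECK_INTERVAL=15      # seconds between checks

while operation_in_progress; do
  rate=$(fetch_error_rate)   # e.g. query your metrics API for the 5xx rate
  if awk -v r="$rate" -v lim="$ERROR_RATE_LIMIT" 'BEGIN { exit !(r > lim) }'; then
    echo "Error rate ${rate}% exceeds ${ERROR_RATE_LIMIT}%; rolling back" >&2
    rollback_capacity_change
    break
  fi
  sleep "$CHECK_INTERVAL"
done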
Red Flags for High-Risk Operations
Command-Level Red Flags:
- Operations affecting core infrastructure
- Commands that remove capacity or resources
- Bulk operations across multiple systems
- Commands that can’t be easily undone
Timing Red Flags:
- Operations during business hours
- Changes during high-traffic periods
- Maintenance without sufficient backup coverage
- Operations when key personnel are unavailable
System Red Flags:
- Systems that haven’t been restarted recently
- Infrastructure with unknown restart times
- Systems without graceful degradation
- Operations affecting multiple dependent systems
Questions to Ask Before High-Risk Commands
About the operation:
- “What’s the maximum impact if this goes wrong?”
- “How quickly can we detect if something’s wrong?”
- “What’s our rollback plan?”
- “Is this the right time for this operation?”
About the tooling:
- “Does our tooling prevent dangerous operations?”
- “Do we have dry-run mode for this command?”
- “Are there confirmation prompts for high-impact operations?”
- “How do we validate command parameters?”
About the system:
- “What are the dependencies of what we’re changing?”
- “How will the system behave with reduced capacity?”
- “What’s our expected recovery time if something goes wrong?”
- “Who needs to be notified before we proceed?”
AWS’s Response and Industry Changes
Immediate fixes AWS implemented:
- Modified capacity removal tool to prevent removing too much too quickly
- Added minimum capacity enforcement to prevent dangerous operations
- Improved restart procedures for large-scale systems
- Made status dashboard multi-region to avoid self-dependency
Industry-wide changes:
- Operational safety tooling became standard practice
- Chaos engineering gained broader adoption
- Multi-region architecture became table stakes
- Operational rehearsals became more common
Your Operational Safety Checklist
For Infrastructure Commands:
- Command has dry-run mode
- Impact assessment completed
- Confirmation prompts required
- Rollback plan tested
- Monitoring ready for execution
For High-Risk Operations:
- Multiple people review the plan
- Operations team on standby
- Customer communication prepared
- Executive notification sent
- Post-operation review scheduled
For Critical Systems:
- Graceful degradation tested
- Recovery procedures documented
- Dependencies mapped and notified
- Timing optimized for minimal impact
- Success criteria defined
The Lesson: Operational Discipline Matters
AWS’s $150 million typo teaches us that:
- Human error is inevitable - design tools to prevent it
- Small mistakes can have massive impact - implement proportional safeguards
- Even experts make errors - systematic safety measures matter more than expertise
- Recovery planning is critical - know how to fix things when they break
The engineer who made this mistake was authorized, experienced, and following procedures. The failure was in the system design, not the person.
Your Next Steps
- Audit your high-risk operational commands - What safeguards do they have?
- Implement dry-run modes for infrastructure operations
- Add confirmation prompts for commands with high business impact
- Practice recovery procedures for your critical systems
Remember: AWS had world-class engineers and sophisticated systems. If it can happen to them, it can happen to anyone.
The question isn’t whether you’ll make operational mistakes - it’s whether your systems are designed to catch them before they become disasters.
Need help implementing operational safety frameworks for your infrastructure? I’ve worked with companies to design command safety systems and operational procedures that prevent these kinds of disasters. Contact me to discuss how we can make your operations both efficient and safe.
Want more real-time problem-solving sessions like this? Subscribe to our technical leadership newsletter for monthly deep-dives into how engineering leaders handle crisis situations.