The Kubernetes Cluster That Nearly Killed Black Friday

July 25, 2024

How a seemingly small configuration change brought down our entire e-commerce platform during the biggest shopping day of the year - and what we learned about capacity planning.


The Business Context

This is a composite scenario based on multiple real incidents I’ve encountered. Details have been anonymized and combined to illustrate common patterns.

A mid-sized e-commerce company was preparing for their biggest Black Friday sale, projecting 10x normal traffic. The stakes were high - this single day represented a significant portion of their annual revenue.

Two weeks before the sale, we made what seemed like a routine optimization to their Kubernetes cluster configuration. We increased the CPU requests for their main application pods to improve performance during load testing.

The Technical Decision

The change was simple: bump CPU requests from 100m to 500m per pod. In our staging environment with 50 pods, this worked perfectly. Performance improved, and our load tests showed we could handle the projected traffic.

What we didn’t account for: production was running 200 pods across multiple nodes. The math was simple but devastating (see the sketch after this list):

  • Before: 200 pods × 100m CPU = 20 CPU cores reserved
  • After: 200 pods × 500m CPU = 100 CPU cores reserved
  • Available: Our cluster only had 80 CPU cores total
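
That back-of-the-envelope check is worth automating before any requests change ships. Here is a minimal sketch of the same arithmetic; the numbers are the ones from this incident, hard-coded for illustration:

    # Back-of-the-envelope check: will the new requests even fit on the cluster?
    # All values below are illustrative, not pulled from a live cluster.
    POD_COUNT = 200
    OLD_REQUEST_MILLICORES = 100
    NEW_REQUEST_MILLICORES = 500
    CLUSTER_ALLOCATABLE_CORES = 80

    def reserved_cores(pods: int, request_millicores: int) -> float:
        """Total CPU the scheduler will set aside, in cores."""
        return pods * request_millicores / 1000

    for label, request in (("before", OLD_REQUEST_MILLICORES), ("after", NEW_REQUEST_MILLICORES)):
        reserved = reserved_cores(POD_COUNT, request)
        headroom = CLUSTER_ALLOCATABLE_CORES - reserved
        print(f"{label}: {reserved:.0f} cores reserved, {headroom:+.0f} cores of headroom")
    # before: 20 cores reserved, +60 cores of headroom
    # after: 100 cores reserved, -20 cores of headroom -> new pods will not schedule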

What Went Wrong

Black Friday morning, 6 AM EST. Traffic started ramping up. At 8 AM, when the sale officially launched, new pods stopped scheduling: the Kubernetes scheduler couldn’t find nodes with enough unreserved CPU to satisfy our resource requests, so every new replica sat in Pending.

The cascade failure:

  1. Pods couldn’t scale to meet demand
  2. Existing pods became overwhelmed
  3. Response times spiked from 200ms to 30+ seconds
  4. Users abandoned their carts
  5. Revenue dropped 80% during peak hours

The Recovery

Immediate response (10 minutes):

  • Rolled back CPU requests to 100m (a patch sketch follows this list)
  • Manually scaled pod replicas to 300+
  • Added emergency worker nodes
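
For reference, the rollback itself was a one-field change to the Deployment spec. Here is a hedged sketch of that kind of patch using the official Kubernetes Python client; the "storefront" name and "prod" namespace are placeholders, not the real workload:

    # Assumes the `kubernetes` client library and a working kubeconfig.
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        # Strategic merge matches the container by name, so this
                        # must equal the container name in the Deployment.
                        {"name": "storefront", "resources": {"requests": {"cpu": "100m"}}}
                    ]
                }
            }
        }
    }
    # Only the CPU request changes; everything else in the Deployment stays intact.
    apps.patch_namespaced_deployment(name="storefront", namespace="prod", body=patch)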

Damage control (30 minutes):

  • Increased cluster capacity by 50%
  • Implemented emergency rate limiting
  • Set up real-time monitoring dashboards

Post-incident (weeks):

  • Built automated capacity planning tools
  • Implemented staging environment that mirrors production scale
  • Created runbooks for traffic surge scenarios

The Lessons Learned

1. Staging Must Mirror Production Scale

Our 50-pod staging environment couldn’t reveal the reservation pressure that 200+ pods would put on the scheduler. Now we maintain a staging cluster that can scale to production levels.

2. Resource Requests vs. Reality

There’s a massive difference between what your application needs and what Kubernetes reserves. Always calculate total resource reservations across your entire cluster.
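
One way to keep that calculation honest is to script it. The following is a minimal sketch using the Kubernetes Python client (the `kubernetes` package), assuming a kubeconfig that can list pods and nodes; it sums every CPU request in the cluster and compares the total to what the nodes can actually offer:

    # Sums CPU requests across every pod and compares them to node allocatable CPU.
    from kubernetes import client, config

    def cpu_millicores(quantity: str) -> float:
        """Parse Kubernetes CPU quantities like '100m' or '2' into millicores."""
        return float(quantity[:-1]) if quantity.endswith("m") else float(quantity) * 1000

    config.load_kube_config()
    v1 = client.CoreV1Api()

    requested = 0.0
    for pod in v1.list_pod_for_all_namespaces().items:
        for c in pod.spec.containers:
            req = (c.resources.requests or {}).get("cpu") if c.resources else None
            if req:
                requested += cpu_millicores(req)

    allocatable = sum(cpu_millicores(n.status.allocatable["cpu"]) for n in v1.list_node().items)

    print(f"Requested:   {requested / 1000:.1f} cores")
    print(f"Allocatable: {allocatable / 1000:.1f} cores")
    print(f"Reserved:    {100 * requested / allocatable:.0f}% of the cluster")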

3. Monitoring Resource Allocation

We were monitoring CPU usage but not CPU allocation. These are different metrics with different implications for scaling.
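
Concretely: allocation lives in the pod spec (the requests the scheduler honors), while usage comes from metrics-server, and the two can diverge wildly. Here is a rough sketch that prints both per node, assuming metrics-server is installed and the same Python client as above:

    # Allocation = what the scheduler has promised away (sum of requests per node).
    # Usage = what pods are actually burning right now (metrics.k8s.io).
    from collections import defaultdict
    from kubernetes import client, config

    def cpu_millicores(q: str) -> float:
        """Handle '250m', '2', and metrics-server nanocore values like '123456789n'."""
        if q.endswith("n"):
            return int(q[:-1]) / 1_000_000
        if q.endswith("m"):
            return float(q[:-1])
        return float(q) * 1000

    config.load_kube_config()
    v1 = client.CoreV1Api()

    requested = defaultdict(float)
    for pod in v1.list_pod_for_all_namespaces().items:
        for c in pod.spec.containers:
            req = (c.resources.requests or {}).get("cpu") if c.resources else None
            if req and pod.spec.node_name:
                requested[pod.spec.node_name] += cpu_millicores(req)

    metrics = client.CustomObjectsApi().list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
    usage = {m["metadata"]["name"]: cpu_millicores(m["usage"]["cpu"]) for m in metrics["items"]}

    for node in v1.list_node().items:
        name = node.metadata.name
        alloc = cpu_millicores(node.status.allocatable["cpu"])
        print(f"{name}: usage {100 * usage.get(name, 0) / alloc:.0f}%  "
              f"allocation {100 * requested[name] / alloc:.0f}%")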

4. Load Testing Under Constraints

Load testing should include resource constraint scenarios. What happens when you can’t scale? How does your application degrade?
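
One low-tech way to rehearse this is to cap the load-test namespace with a ResourceQuota so scaling hits a wall on purpose. A sketch with the Python client follows; the "loadtest" namespace and the 20-core cap are placeholder values, not our real setup:

    # Caps total CPU requests in the "loadtest" namespace so autoscaling runs out
    # of headroom mid-test, reproducing the failure mode we hit in production.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="cpu-ceiling"),
        spec=client.V1ResourceQuotaSpec(hard={"requests.cpu": "20"}),  # 20 cores, illustrative
    )
    v1.create_namespaced_resource_quota(namespace="loadtest", body=quota)
    # Run the load test, then watch how the app degrades once new pods stay Pending.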

How to Recognize This Pattern

Warning signs you might hit this:

  • Resource requests that look small per pod but multiply across many pods
  • Staging environments that don’t match production scale
  • Monitoring that focuses on usage rather than allocation
  • Scaling policies that assume infinite resources

Questions to ask your team:

  • “What’s our total resource reservation across all pods?”
  • “Can our cluster handle a 5x increase in pod count?”
  • “How do we test scaling under resource constraints?”
  • “What happens when the scheduler can’t place new pods?”

The Business Impact

Cost of the incident:

  • $2.1M in lost revenue during 4-hour degradation
  • $500K in emergency cloud infrastructure costs
  • Immeasurable damage to customer experience during peak shopping

Cost of prevention:

  • Better staging environment: $10K/month
  • Improved monitoring: $2K/month
  • Capacity planning tools: 2 weeks of engineering time

The math is clear: a full year of prevention costs less than a tenth of what this single incident did.

Your Next Steps

  1. Audit your resource requests - Calculate total reservations across your cluster
  2. Build realistic staging - Match production scale, not just production code
  3. Monitor allocation, not just usage - Track what Kubernetes reserves vs. what pods consume
  4. Practice scaling under pressure - Load test with artificial resource constraints

Ready to dive deeper? This incident led us to develop a comprehensive capacity planning framework that’s helped dozens of companies avoid similar disasters. Contact me to learn about our Kubernetes scaling assessment and how we can help you avoid the next Black Friday catastrophe.


Want more war stories like this? Subscribe to our technical leadership newsletter for monthly deep-dives into real production incidents and the lessons they teach.