The GitLab Disaster: When 5 Backup Methods Failed and What It Teaches About Validation

October 18, 2024

How GitLab lost 6 hours of production data despite having 5 backup methods - and the framework for ensuring your backups actually work when you need them.


The Business Context

This analysis is based on GitLab’s public postmortem of their January 31, 2017 database incident. All details are from their transparent reporting available at https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/

On January 31, 2017, GitLab experienced what every engineering leader fears: a production database accidentally deleted during what should have been routine maintenance. The incident lasted 18 hours and resulted in the permanent loss of 6 hours of user data - affecting 5,000 projects, 5,000 comments, and 700 new user accounts.

But here’s the shocking part: GitLab had 5 different backup methods in place. And when disaster struck, all 5 failed them.

What GitLab Had in Place (The Backup Strategy That Failed)

GitLab thought they were well-prepared with multiple backup layers:

  1. Regular PostgreSQL dumps - Daily pg_dump uploads to S3
  2. LVM snapshots - Regular disk-level snapshots for staging refreshes
  3. Database replication - Secondary database server for failover
  4. Azure disk snapshots - Cloud-level backup system
  5. GitLab.com replica - Additional secondary server

On paper, this looks comprehensive. Multiple methods, different technologies, various restore points. What could go wrong?

How Every Backup Method Failed

The Primary Failure

At 9:00 PM UTC, database replication stopped due to high load. An engineer started the process to rebuild the secondary database by deleting its data directory and re-syncing from the primary.

When the initial attempt failed, a second engineer repeated the process. In the pressure of the moment, they ran the delete command against the primary database instead of the secondary.

The rm -rf command deleted 300GB of data in seconds before being stopped.

When They Turned to Backups, Reality Hit

Backup Method #1: PostgreSQL dumps to S3

  • Status: Failing silently for months
  • Why: Version mismatch between pg_dump (9.2) and database (9.6)
  • Last good backup: None recent enough

Backup Method #2: LVM snapshots

  • Status: Only had one 6-hour-old snapshot
  • Why: Used for staging refreshes, not disaster recovery
  • Data loss: 6 hours of user data

Backup Method #3: Database replication

  • Status: The secondary was empty (it was being rebuilt when the incident occurred)
  • Why: They were in the middle of fixing replication when disaster struck

Backup Method #4: Azure disk snapshots

  • Status: Only configured for the NFS server, not the database servers
  • Why: Misconfigured scope of backup coverage

Backup Method #5: GitLab.com replica

  • Status: Had stale data, not current
  • Why: Also affected by the replication issues

The only usable backup was a 6-hour-old LVM snapshot that happened to exist because an engineer was testing something earlier that day. It was luck, not planning, that prevented 24 hours of data loss.

The Real Lessons: It’s Not About Having Backups

GitLab’s experience reveals the critical difference between having backup systems and having validated backup systems.

Lesson 1: Silent Failures Are the Deadliest

The PostgreSQL dumps had been failing for months without anyone noticing. The backup process reported success, uploaded empty files, and continued its schedule.

What this teaches: Monitor backup completion AND content validation. A successful backup process doesn’t mean you have recoverable data.
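
One way to act on this, sketched in shell below, is to make the backup job itself refuse to report success when its output looks wrong. The backup path, the 100 MB threshold, and the connection-string variable are placeholders for illustration, not GitLab’s actual configuration:

# Sketch: treat a "successful" dump as failed if the output file looks empty
# BACKUP_FILE, MIN_BYTES, and DATABASE_URL are hypothetical values
BACKUP_FILE="/backups/db-$(date +%F).dump"
MIN_BYTES=$((100 * 1024 * 1024))
pg_dump --format=custom --file="${BACKUP_FILE}" "${DATABASE_URL:?set a connection string}" || exit 1
ACTUAL_BYTES=$(wc -c < "${BACKUP_FILE}")
if [ "${ACTUAL_BYTES}" -lt "${MIN_BYTES}" ]; then
    echo "Backup ${BACKUP_FILE} is only ${ACTUAL_BYTES} bytes - flagging as failed" >&2
    exit 1
fi

Even a crude size check like this is often enough to surface the kind of silent failure described above long before a restore is needed.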

Lesson 2: Backup Testing Must Mirror Real Recovery Scenarios

GitLab had never performed a full disaster recovery test at scale. They knew their individual backup methods worked in isolation but had never validated the complete recovery process under pressure.

What this teaches: Regular disaster recovery drills should simulate actual failure conditions, not just happy-path testing.

Lesson 3: Backup Coverage Assumptions Are Dangerous

Multiple backup methods gave GitLab false confidence. They assumed their various systems provided overlapping coverage, but in reality, they had gaps and single points of failure.

What this teaches: Map your backup coverage explicitly. Don’t assume multiple methods equal redundancy.

The Backup Validation Framework

Based on GitLab’s painful lessons and subsequent improvements, here’s a framework for ensuring your backups work when you need them:

1. Validate Backup Content, Not Just Process

The Problem: Success logs don’t guarantee recoverable data.

The Solution:

  • Automated integrity checks on backup files
  • Regular test restores to verify data completeness
  • Content validation scripts that check critical data structures
  • Alerts for backup file size anomalies

Implementation:

# Example: Validate PostgreSQL backup integrity (custom-format dump)
# Count the TABLE DATA entries in the archive's table of contents
TABLE_COUNT=$(pg_restore --list backup.dump | grep -c "TABLE DATA")
echo "Tables with data in backup: ${TABLE_COUNT}" >> restore_validation.log
# Compare the count to the expected number of tables and fail loudly on a shortfall
[ "${TABLE_COUNT}" -ge "${EXPECTED_TABLE_COUNT:?set to your expected table count}" ] || exit 1
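
Going a step further, a scheduled test restore into a throwaway database exercises the restore path itself, which a table-of-contents check cannot. A minimal sketch, assuming a custom-format dump and an illustrative users table:

# Sketch: test restore into a scratch database with a basic row-count check
# "restore_check" and the "users" table are illustrative names, not a prescribed schema
createdb restore_check
pg_restore --dbname=restore_check --jobs=4 backup.dump || exit 1
ROWS=$(psql -At -d restore_check -c "SELECT count(*) FROM users;")
[ "${ROWS}" -gt 0 ] || { echo "Test restore produced an empty users table" >&2; exit 1; }
dropdb restore_check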

2. Test Recovery Under Realistic Conditions

The Problem: Testing individual components doesn’t validate the complete recovery process.

The Solution:

  • Monthly full disaster recovery drills
  • Test recovery on production-sized datasets
  • Practice recovery under time pressure
  • Validate performance of restored systems

Implementation:

  • Schedule quarterly “surprise” recovery tests
  • Document recovery time for each backup method (a timing sketch follows this list)
  • Test recovery to different environments (cloud, on-premise)
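
To make “document recovery time for each backup method” routine rather than aspirational, each drill can be timed and logged. A minimal sketch, where restore_from_dump.sh and drill_times.csv stand in for your own documented restore procedure and record:

# Sketch: time a restore drill and append the result for trend tracking
# restore_from_dump.sh and drill_times.csv are hypothetical names
START=$(date +%s)
./restore_from_dump.sh backup.dump || exit 1
END=$(date +%s)
echo "$(date -u +%FT%TZ),pg_dump,$((END - START))s" >> drill_times.csv

Tracking these numbers over time also shows when a growing dataset has quietly pushed recovery beyond your objectives.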

3. Map and Monitor Coverage Gaps

The Problem: Assuming multiple backup methods provide complete coverage.

The Solution:

  • Document exactly what each backup method covers
  • Identify and address coverage gaps
  • Monitor dependencies between backup systems
  • Regular audits of backup scope vs. actual system coverage

Implementation: Create a backup coverage matrix:

System Component | Backup Method | Frequency   | Last Tested | Coverage Gaps
User database    | pg_dump + LVM | Daily + 6hr | Last week   | None
File uploads     | S3 sync       | Hourly      | Yesterday   | File permissions
Configuration    | Git repo      | On change   | Last month  | Environment variables
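
A matrix like this only helps if it stays current. One lightweight option, sketched below, is to keep it as a CSV and have a scheduled job flag rows whose last test is older than a threshold; the column layout here is an assumption for illustration:

# Sketch: flag coverage-matrix rows not tested in the last 30 days
# Assumes coverage.csv columns: component,method,frequency,last_tested(YYYY-MM-DD),gaps
CUTOFF=$(date -d "30 days ago" +%Y-%m-%d)   # GNU date; on BSD/macOS use: date -v-30d +%Y-%m-%d
awk -F, -v cutoff="$CUTOFF" 'NR > 1 && $4 < cutoff {
    printf "STALE: %s (%s) last tested %s\n", $1, $2, $4
}' coverage.csv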

4. Assign Ownership and Accountability

The Problem: No one is specifically responsible for backup system health.

The Solution:

  • Single owner for each backup system
  • Authority to “stop the line” if backups are at risk
  • Regular backup health reports to leadership
  • Backup success as part of team metrics

Red Flags That Your Backups Might Fail You

Based on GitLab’s experience, watch for these warning signs:

Process Red Flags:

  • Backup success measured only by process completion
  • No regular testing of restore procedures
  • Backup validation is manual or infrequent
  • Different backup methods managed by different teams

Technical Red Flags:

  • Backup files not tested for integrity
  • Restore procedures not documented or practiced
  • Backup systems with unmonitored dependencies
  • No alerting for backup content validation failures

Organizational Red Flags:

  • No single owner responsible for backup system health
  • Backup testing postponed due to “more urgent” work
  • Recovery time objectives not defined or tested
  • Leadership not regularly briefed on backup system health

Questions to Ask Your Team

Before you experience your own GitLab moment, validate your backup strategy:

About your current backups:

  • “When did we last perform a complete disaster recovery test?”
  • “What’s our actual, tested recovery time for each backup method?”
  • “How do we know our backup files contain complete, recoverable data?”
  • “Who has the authority to stop deployments if backups are failing?”

About your coverage:

  • “What exactly does each backup method cover?”
  • “What are the dependencies between our backup systems?”
  • “How do we monitor for silent backup failures?”
  • “What’s our process for validating backup content?”

About your testing:

  • “How often do we test recovery under realistic conditions?”
  • “What’s the largest dataset we’ve successfully restored?”
  • “How long does it take to restore from each backup method?”
  • “What’s our plan if multiple backup methods fail simultaneously?”

The Business Impact of Getting This Right

GitLab’s transparency about their incident provides real numbers:

Cost of the incident:

  • 18 hours of complete service unavailability
  • 6 hours of permanent data loss affecting thousands of users
  • Immeasurable damage to user trust and company reputation
  • Months of engineering time implementing proper backup validation

Cost of prevention:

  • Regular backup validation testing: ~4 hours per month
  • Automated backup integrity checking: 1-2 weeks of initial setup
  • Disaster recovery drills: ~8 hours per quarter
  • Proper monitoring and alerting: 1 week of setup

The prevention cost is a small fraction of what an incident like this consumes, and that’s before considering the reputation damage and customer churn.

Your Next Steps

  1. Audit your backup content validation - Can you prove your backups contain recoverable data?
  2. Schedule a disaster recovery drill - Test your complete recovery process under realistic conditions
  3. Map your backup coverage - Document exactly what each method covers and identify gaps
  4. Assign backup ownership - Give someone explicit responsibility for backup system health

GitLab’s painful experience became a gift to the entire industry through their transparent postmortem. Their lesson is simple but critical: having backups and having working backups are two completely different things.

Don’t wait for your own disaster to learn this lesson.


Need help implementing a robust backup validation strategy? I’ve helped dozens of companies audit their backup systems and implement validation frameworks that actually work under pressure. Contact me to discuss how we can ensure your backups will be there when you need them most.

Want more war stories and lessons from real production incidents? Subscribe to our technical leadership newsletter for monthly deep-dives into how things break and how to prevent it.