These three acronyms get used interchangeably, and they shouldn’t. They represent three different layers of service reliability management, and understanding the distinction matters for how you operate your systems and how you communicate reliability to your customers.

SLI: What You Measure

A Service Level Indicator (SLI) is a quantitative measurement of some aspect of your service’s behavior. It’s a metric. Common SLIs include:

  • Availability: percentage of requests that return a successful response
  • Latency: how long requests take (usually measured at the 50th, 95th, and 99th percentiles)
  • Error rate: percentage of requests that return errors
  • Throughput: number of requests processed per second

The key requirement: an SLI has to be measurable, objective, and meaningful to users. “Server CPU usage” is a metric but not a great SLI — users don’t care about your CPU. “Percentage of page loads that complete in under 2 seconds” is a good SLI because it directly correlates with user experience.
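To make the measurement side concrete, here is a minimal sketch of computing an availability SLI and a latency-percentile SLI from raw request data. The `Request` record and the "non-5xx means success" rule are illustrative assumptions, not a standard definition — what counts as "successful" is a decision you make per service.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code (hypothetical example record)
    latency_ms: float  # observed request latency

def availability_sli(requests):
    """Fraction of requests answered successfully (here: any non-5xx status)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)

def latency_percentile(requests, pct):
    """Latency at the given percentile, using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

In practice a monitoring system computes these continuously over sliding windows rather than over a static list, but the arithmetic is the same.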

SLO: What You Target

A Service Level Objective (SLO) is your internal target for an SLI. It’s a goal, not a contract. Examples:

  • 99.9% of API requests will return a successful response (availability SLO)
  • 95% of requests will complete in under 200ms (latency SLO)
  • 99.95% of checkout transactions will succeed (business-critical SLO)

The SLO is the number your engineering team manages toward. It drives decisions: do we invest in redundancy? Is this latency regression worth fixing? Should we slow down feature development to focus on reliability?

Google popularized the concept of an “error budget” alongside SLOs. If your SLO is 99.9% availability, you have a 0.1% error budget — roughly 43 minutes of downtime per month. As long as you’re within budget, ship features. When the budget is exhausted, focus on reliability. This converts the abstract concept of “reliability vs. velocity” into a concrete, measurable trade-off.
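The error-budget arithmetic above is simple enough to sketch directly. This is a generic illustration of the calculation, not any particular vendor's implementation; the 30-day window is an assumption.

```python
def error_budget_minutes(slo, window_days=30):
    """Total downtime allowed in the window for a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent; negative means overspent."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes out to about 43.2 minutes per 30-day window — the figure quoted above. Once `budget_remaining` approaches zero, the framework says to shift effort from features to reliability.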

SLA: What You Promise

A Service Level Agreement (SLA) is a contractual commitment to your customers, backed by financial consequences if you miss it. If your SLA says 99.9% uptime and you deliver 99.5%, the contract specifies a remedy — typically service credits, penalty payments, or contract termination rights.

Critical distinction: your SLA should always be less aggressive than your internal SLO. If your SLO is 99.9%, your SLA might be 99.5%. The gap between them is your safety margin — it gives you room to have a bad month without triggering contractual penalties.

Not every service needs an SLA. Internal services, early-stage products, and free tiers typically don’t have contractual commitments. But if you’re selling to enterprises, an SLA is usually a procurement requirement.

How They Work Together

Think of it as a stack:

SLI (bottom): the raw measurements. Your monitoring system collects these continuously.

SLO (middle): the internal targets based on those measurements. Your engineering team uses these to make prioritization decisions.

SLA (top): the external commitments based on your confidence in meeting the SLOs. Your sales and legal teams use these in contracts.

The flow goes: you measure SLIs, set SLOs based on what your users need and what your architecture can deliver, and publish SLAs that you’re confident you can meet based on your SLO track record.
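One way to see the stack is as a small data structure that ties the three layers together and enforces the safety margin between SLO and SLA. The `ServiceLevel` type here is a hypothetical illustration, not an established convention:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ServiceLevel:
    sli_name: str                # what is measured (bottom of the stack)
    slo: float                   # the internal target (middle)
    sla: Optional[float] = None  # the contractual commitment, if any (top)

    def __post_init__(self):
        # The SLA, when present, must be looser than the SLO:
        # the gap between them is the safety margin.
        if self.sla is not None and self.sla >= self.slo:
            raise ValueError("SLA must be less aggressive than the SLO")
```

Not every `ServiceLevel` carries an SLA — which mirrors the point above that internal services and free tiers typically stop at the SLO layer.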

Who Needs What

Every team running a production service should have SLIs. You can’t manage what you don’t measure. Even a two-person startup should know their error rate and latency distribution.

SLOs become valuable at 10+ engineers or when you’re running multiple services. They formalize the reliability targets that are otherwise implicit (“we just try to keep things up”) and create a framework for making trade-off decisions.

SLAs are a business concern, not an engineering concern. You need them when your customers require contractual reliability commitments — typically enterprise sales, regulated industries, or platform services that other businesses depend on.

Common Mistakes

Setting SLOs without data. Your SLO should be based on what your system actually delivers and what your users actually need — not an aspirational number pulled from thin air. Measure first, set targets second.

Making SLOs too aggressive. A 99.99% SLO sounds impressive but allows only 4.3 minutes of downtime per month. Every additional nine costs exponentially more in engineering investment. Most services should start at 99.5% or 99.9% and tighten only when the business demands it.
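To make the cost of each additional nine concrete, a quick sketch of allowed monthly downtime per availability target (assuming a 30-day month):

```python
def downtime_minutes_per_month(target, days=30):
    """Minutes of downtime allowed per month at a given availability target."""
    return (1 - target) * days * 24 * 60

for target in (0.99, 0.995, 0.999, 0.9999):
    print(f"{target:.2%}: {downtime_minutes_per_month(target):6.1f} min/month")
```

Each step from 99% to 99.9% to 99.99% divides the allowance by ten, while the engineering cost of meeting it tends to grow much faster than that.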

Confusing SLOs with SLAs. Your engineering team should never be managing toward an SLA target. They should manage toward the SLO, which should be tighter than the SLA. If your SLO and SLA are the same number, you have no margin for error.

Measuring the wrong SLIs. Infrastructure metrics (CPU, memory, disk) are operational data, not service level indicators. SLIs should reflect user experience. “The server is running” doesn’t mean “the user can complete a purchase.”

The Verdict

SLIs, SLOs, and SLAs are one of those frameworks that sounds like bureaucratic overhead until you need it — usually the first time you have a serious reliability incident and nobody agrees on how bad it was or what “good enough” looks like. The framework forces clarity: what do we measure, what do we target, and what do we promise? Get the SLIs right first. Set SLOs that match what your users need. And only commit to SLAs that you have a track record of exceeding.


Related: Engineering Metrics That Actually Matter | Observability and Monitoring for Growing Teams