
SLOs and Error Budgets: A Tier-1-Sourced Field Guide (2026)

In short

An SLO is the internal reliability target a service commits to; an SLI is the metric that measures it; an SLA is the customer-facing contractual variant with consequences attached. The SRE Workbook canon (sre.google/workbook/implementing-slos and alerting-on-slos) is the operational reference. The 2026 canonical alert pattern is multi-window multi-burn-rate: a fast-burn page (2% of monthly budget consumed in 1 hour, sustained across both a 5-minute and a 1-hour window) plus a slow-burn ticket (10% in 3 days, sustained across both a 6-hour and a 3-day window). The error-budget policy — what actually happens when the budget is exhausted — is the load-bearing artifact, not the SLO number itself.

Key takeaways

  • The SRE Workbook (sre.google/workbook/implementing-slos and alerting-on-slos) is the canonical 2026 reference. The original SRE Book chapters 3 ('Embracing Risk') and 4 ('Service Level Objectives') established the framework; the Workbook ships the operational mechanics.
  • SLI is the measurement (e.g. 'fraction of HTTP requests returning 2xx in under 200ms'); SLO is the internal target (99.9% of those over 28 days); SLA is the customer-facing contract with monetary or service-credit consequences. SLA targets are always strictly looser than SLOs — the gap is the operational margin.
  • The error budget = 1 - SLO. A 99.9% SLO over 30 days yields ~43 minutes of budget; a 99.99% SLO yields ~4.3 minutes. The order-of-magnitude jump from three-nines to four-nines is the most expensive engineering decision in the field — typically 10x infrastructure cost and a categorically different operational posture.
  • The 2026 canonical alert is multi-window multi-burn-rate per SRE Workbook 'Alerting on SLOs.' A single threshold (e.g. 'page if error rate > 1%') is a defect: it pages on transient noise and misses slow burns. The right pattern uses two windows per severity (a long window for confidence, a short window for recency).
  • SLI choice categories per the Workbook: availability, latency, throughput, correctness, and freshness. The single most common defect is choosing availability when the user experience is actually latency-bound — a user flow with a 200ms expectation that returns 200 OK after 30 seconds is, from the user's perspective, a failure.
  • The error-budget policy — the document that says what actually happens when the budget is exhausted (feature freeze, mandatory reliability work, on-call escalation) — is the load-bearing artifact. Charity Majors's framing on Honeycomb's blog: an SLO without a written error-budget policy is decoration.
  • Customer-facing SLAs should be set at one or two nines below internal SLOs to preserve operational margin. A service with a 99.9% internal SLO might publish a 99.5% SLA. The gap is the buffer that absorbs measurement error, edge cases, and one-off incidents without triggering customer credits.

SLI vs SLO vs SLA — the SRE Workbook hierarchy

The SLI / SLO / SLA hierarchy is the foundation of the Google SRE practice and the canonical 2026 framework. The original definitions come from the SRE Book chapter 4 ('Service Level Objectives') and were operationalized in the SRE Workbook chapter on implementing SLOs. The three terms are often conflated in practice, and that conflation is the root cause of most failed SLO programs.

  • SLI (Service Level Indicator). A carefully defined quantitative measure of some aspect of the service. Per the Workbook: 'an SLI is a ratio of two numbers — the number of good events divided by the total number of events.' Concrete example: 'the fraction of HTTP requests to the /api/profile endpoint that return 2xx status codes with a server-measured latency under 200ms.' The SLI is a metric, not a target.
  • SLO (Service Level Objective). A target value or range of values for an SLI, usually expressed as a percentage over a time window. Concrete: 'the SLI defined above will be at or above 99.9% over a rolling 28-day window.' The SLO is internal — it is the engineering team's commitment to itself and to product partners. The SLO drives error budgets, alerting, and the engineering-vs-reliability prioritization.
  • SLA (Service Level Agreement). A contractual commitment to customers, typically in the terms of service, with explicit consequences (service credits, refunds, escalation paths) when the target is missed. SLAs are external. SLAs are always strictly looser than the corresponding internal SLO. Per Workbook chapter 2: 'the gap between SLO and SLA is your safety margin.'
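
To make the three definitions concrete, here is a minimal Python sketch (the counts and targets are illustrative, not from the Workbook) that computes the SLI as the good-over-total ratio and checks it against an internal SLO and a looser external SLA:

# Illustrative numbers; in practice good/total come from your metrics or event store.
good_events = 9_993_000     # requests to /api/profile returning 2xx in under 200 ms
total_events = 10_000_000   # all requests to /api/profile over the 28-day window

sli = good_events / total_events   # the measurement
slo = 0.999                        # internal target over the same window
sla = 0.995                        # external contractual target, strictly looser

print(f"SLI = {sli:.4%}")                                   # 99.9300%
print(f"meets SLO: {sli >= slo}, meets SLA: {sli >= sla}")  # True, True
print(f"error budget spent: {(1 - sli) / (1 - slo):.0%}")   # 70%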

The hierarchy maps directly to the operational stack:

  1. SLIs are the dashboards. Every team should have an SLI dashboard with the ratio computed continuously, the target line drawn on the graph, and the historical 28-day window visible.
  2. SLOs drive alerting. The multi-window multi-burn-rate alerts (next section) are computed against the SLO, not against arbitrary thresholds. An alert that fires when error rate exceeds 1% with no reference to the SLO is a defect.
  3. SLAs drive customer communication and contractual obligations. The SLA dashboard is what the customer success team and the legal team see. The internal SLO dashboard is what the engineering team sees. They should rarely be the same.

The order-of-magnitude framing matters. A 99.9% SLO over 30 days yields ~43 minutes of budget per month. A 99.99% SLO yields ~4.3 minutes. A 99.999% SLO yields ~26 seconds — less time than most on-call engineers take to acknowledge a page. The cost of incremental nines is roughly 10x per nine in infrastructure, redundancy, and operational maturity. Most consumer-facing services should be at 99.9%; payments and authentication may justify 99.99%; almost nothing in non-critical infrastructure justifies 99.999%. The SRE Book chapter 3 ('Embracing Risk') frames this directly: chasing more nines than the user can perceive or the business needs is engineering waste.
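
The arithmetic behind those figures, as a quick sketch (a 30-day window is assumed, matching the monthly framing):

# Error budget = (1 - SLO) x window length.
WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day window

for slo in (0.999, 0.9999, 0.99999):
    budget_min = (1 - slo) * WINDOW_MINUTES
    print(f"{slo:.3%} SLO -> {budget_min:.1f} min of budget")   # 43.2 / 4.3 / 0.4 min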

Choosing SLIs that align to user experience

The single most common defect in early SLO programs is choosing SLIs that measure the wrong thing. The SRE Workbook chapter on implementing SLOs categorizes SLI types into five families and gives specific guidance on when each applies:

  • Availability SLIs. 'What fraction of requests succeeded?' The simplest family and the one teams default to. Defect mode: a service that returns 200 OK with a 30-second latency is 'available' by this metric but is, from the user's perspective, broken. Availability SLIs work best for write-path APIs and webhook-style endpoints where success is binary.
  • Latency SLIs. 'What fraction of requests completed within X ms?' The right SLI for any user-facing flow where the user is waiting on a result. Workbook recommendation: define latency SLIs at multiple percentiles (p50, p95, p99) rather than a single threshold; the long tail is where user-experience failures hide. A 99.9% SLO at p95 < 200ms is a much stronger commitment than 'average latency under 200ms.'
  • Throughput SLIs. 'What fraction of time the service handled at least N RPS?' Used for batch and pipeline systems where the user-experience question is 'are we keeping up?' Less common for synchronous HTTP services.
  • Correctness SLIs. 'What fraction of responses returned the right answer?' Critical for any system with derived data, ML inference, or eventual consistency. Often skipped because it requires ground-truth comparison; when it can be measured (e.g. shadow-traffic comparison, downstream verification), it is the highest-signal SLI for the system.
  • Freshness SLIs. 'What fraction of data is younger than T seconds?' The right SLI for caches, replicas, search indexes, and any system where stale-but-served is a real failure mode. The Workbook flags freshness as the most-undermeasured SLI in modern systems.
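
To show that every family reduces to the same good-over-total ratio, here is a sketch computed from event-level records; the field names (status, duration_ms, age_s, customer, region) are hypothetical stand-ins for whatever your instrumentation emits:

# One record per request; only the "good" predicate changes between SLI families.
events = [
    {"status": 200, "duration_ms": 120, "age_s": 2,  "customer": "acme", "region": "eu-west-1"},
    {"status": 200, "duration_ms": 950, "age_s": 45, "customer": "acme", "region": "us-east-1"},
    {"status": 503, "duration_ms": 30,  "age_s": 1,  "customer": "blue", "region": "eu-west-1"},
]

def sli(events, good):
    return sum(1 for e in events if good(e)) / len(events)

availability = sli(events, lambda e: e["status"] < 500)
latency      = sli(events, lambda e: e["status"] < 500 and e["duration_ms"] < 200)
freshness    = sli(events, lambda e: e["age_s"] < 30)   # served data younger than 30 s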

The selection mechanic from the Workbook: walk through the user's flow step by step, and for each step, ask 'what does the user experience as a failure here?' If the answer is 'I get an error' — availability. If 'it takes too long' — latency. If 'I get the wrong answer' — correctness. If 'I see stale data' — freshness. The SLI follows the user's failure mode, not the system's instrumentation convenience.

Honeycomb's observability-2.0 framing (honeycomb.io/blog) extends this: the modern high-cardinality observability stack lets you compute SLIs over arbitrary slices (per-customer, per-endpoint, per-region) without re-instrumenting. The Workbook treatment predates wide-cardinality and assumes pre-aggregated metrics; modern practice computes the SLI from event-level data and slices on demand. The unit of work is the same — good_events / total_events — but the slice can be 'requests from customer X to endpoint Y in region Z' rather than 'all requests.'
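
Reusing the events list from the sketch above, slicing is just a filter applied before the same ratio; the customer and region labels are hypothetical:

# Per-slice SLI: filter the event stream, then compute good/total over the subset.
def sliced_sli(events, good, **labels):
    subset = [e for e in events if all(e.get(k) == v for k, v in labels.items())]
    return sum(1 for e in subset if good(e)) / len(subset) if subset else None

# e.g. availability for one customer's traffic to one region
sliced_sli(events, lambda e: e["status"] < 500, customer="acme", region="eu-west-1")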

The most common SLI design errors per Charity Majors's writing on charity.wtf and the Honeycomb engineering blog: (1) defining 'success' as 'no 5xx' when the system has structured error responses that return 200 (a defect at the application layer, masked by the SLI); (2) measuring at the load-balancer rather than at the user (missing CDN failures, DNS issues, TLS errors); (3) including health-check traffic in the denominator (artificially inflating the success rate); (4) choosing too short a measurement window (a 1-hour window is too noisy to be a stable target — 28 days is the canonical default).
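
A sketch of guards against defects (1) and (3) above: treat a 200 that carries a structured error body as a failure, and keep health-check traffic out of the denominator. The path set and the app_error field are hypothetical:

HEALTH_CHECK_PATHS = {"/healthz", "/readyz"}

def countable(event):
    # Defect (3): synthetic health checks almost always succeed and inflate the SLI.
    return event.get("path") not in HEALTH_CHECK_PATHS

def is_good(event):
    # Defect (1): a 200 response wrapping a structured application error is still bad.
    return event["status"] < 500 and not event.get("app_error", False)

def availability_sli(events):
    eligible = [e for e in events if countable(e)]
    return sum(1 for e in eligible if is_good(e)) / len(eligible) if eligible else None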

Multi-window multi-burn-rate alerting (the canonical 2026 pattern)

The multi-window multi-burn-rate (MWMBR) alert pattern is the canonical 2026 alerting design for SLOs, established in the SRE Workbook chapter 'Alerting on SLOs.' It replaces both the naive 'page if error rate exceeds threshold' alert (too noisy, fires on transient blips) and the single-window 'fast-burn / slow-burn' two-tier pattern (better, but its long windows keep alerts firing well after the burn has stopped). The MWMBR pattern uses two time windows per severity tier — a long window for confidence that the burn is real, and a short window for recency that ensures the alert is timely.

The canonical recipe per the SRE Workbook, for a 99.9% SLO with a 28-day window (error budget = 0.1%):

  • Page (fast burn): 2% of budget in 1 hour. Burn-rate threshold = 14.4. Long window 1 hour, short window 5 minutes. Fires when both windows exceed the threshold simultaneously.
  • Page (medium burn): 5% of budget in 6 hours. Burn-rate threshold = 6. Long window 6 hours, short window 30 minutes.
  • Ticket (slow burn): 10% of budget in 3 days. Burn-rate threshold = 1. Long window 3 days, short window 6 hours.

The two-window-per-severity pattern is the load-bearing detail. The long window gives you confidence that the burn is sustained, not a transient. The short window ensures the alert is timely — without it, a recently-resolved incident would still trigger because the long window includes the burn. The alert fires only when both conditions hold.
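
The fire condition reduces to a few lines. A sketch, assuming the err_* variables hold the error ratio measured over each window (however your metrics backend computes it):

SLO = 0.999

def burn_rate(error_ratio):
    # Burn rate = observed error ratio / allowed error ratio (1 - SLO).
    return error_ratio / (1 - SLO)

def fires(err_long, err_short, threshold):
    # Sustained (long window) AND current (short window): both must exceed the threshold.
    return burn_rate(err_long) > threshold and burn_rate(err_short) > threshold

err_1h, err_5m = 0.020, 0.025        # a 2%+ error rate, still ongoing
err_3d, err_6h = 0.0005, 0.0004

fires(err_1h, err_5m, threshold=14.4)   # True  -> fast burn, page
fires(err_3d, err_6h, threshold=1.0)    # False -> slow burn not sustained, no ticket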

A production Prometheus alert config implementing the canonical MWMBR pattern (per the SRE Workbook recipe and the Prometheus alerting-rules documentation):

groups:
  - name: slo_alerts
    rules:
      # Fast burn — 2% of monthly budget in 1 hour. Page on-call.
      - alert: ProfileAPIErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_errors_total{service="profile-api"}[1h]))
            / sum(rate(http_requests_total{service="profile-api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_errors_total{service="profile-api"}[5m]))
            / sum(rate(http_requests_total{service="profile-api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          slo: profile_api_availability
        annotations:
          summary: "Profile API burning 2% of monthly error budget in 1 hour"
          runbook: "https://runbooks/profile-api-fast-burn"
      # Slow burn — 10% of monthly budget in 3 days. File ticket, do not page.
      - alert: ProfileAPIErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_errors_total{service="profile-api"}[3d]))
            / sum(rate(http_requests_total{service="profile-api"}[3d]))
          ) > (1 * 0.001)
          and
          (
            sum(rate(http_requests_errors_total{service="profile-api"}[6h]))
            / sum(rate(http_requests_total{service="profile-api"}[6h]))
          ) > (1 * 0.001)
        for: 15m
        labels:
          severity: ticket
          slo: profile_api_availability

The threshold computation: burn rate = (error rate) / (1 - SLO). A burn rate of 1 means the budget will be exhausted exactly at the end of the SLO window. A burn rate of 14.4 means the entire budget burns in 1/14.4 of the window, roughly two days (about 50 hours for a 30-day window). The thresholds in the Workbook table (14.4, 6, 1) are calibrated to give roughly the same false-positive and false-negative rates across burn rates while preserving useful detection time.
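
The published thresholds fall out of one line of arithmetic. A sketch (a 30-day, 720-hour window is assumed):

WINDOW_HOURS = 30 * 24   # 720 hours in the SLO window

def burn_rate_threshold(budget_fraction, alert_window_hours):
    # The burn rate that spends budget_fraction of the whole budget in alert_window_hours.
    return budget_fraction * WINDOW_HOURS / alert_window_hours

burn_rate_threshold(0.02, 1)        # 14.4 -> fast-burn page
burn_rate_threshold(0.05, 6)        # 6.0  -> medium-burn page
burn_rate_threshold(0.10, 3 * 24)   # 1.0  -> slow-burn ticket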

The Prometheus alerting-rules documentation (prometheus.io/docs/practices/rules) gives the technical mechanics: use recording rules to precompute the burn rate as a single time-series, then alert on the precomputed series. This avoids recomputing the long-window query on every evaluation. The recording-rule pattern also makes the alert query inspectable in Grafana — operators can see the burn rate trending toward threshold before the alert fires, which is itself a useful operational signal.

The MWMBR pattern's empirical effectiveness: per the SRE Workbook, MWMBR catches roughly 80% of real SLO burns at full severity within 5 minutes, while reducing false positives by an order of magnitude versus single-threshold alerts. The pattern is the canonical 2026 baseline; modern SLO platforms (Nobl9, Sloth, Pyrra, Datadog SLOs, Honeycomb SLOs) all implement variants of MWMBR by default.

Error-budget policies and engineering freezes

The SLO number itself is not the load-bearing artifact of an SLO program. The load-bearing artifact is the error-budget policy — the written document that says what actually happens when the budget is exhausted. Charity Majors's framing on the Honeycomb blog and in conference talks is direct: 'an SLO without a written error-budget policy is decoration.' The SRE Workbook's implementing-SLOs chapter and the SRE Book's chapter 3 ('Embracing Risk') both make the same point with less editorial: the budget is a tool for making prioritization decisions, and the decisions only get made if the policy is written down and agreed to in advance.

A canonical error-budget policy template, derived from the SRE Workbook examples and adapted for typical 2026 SaaS engineering teams:

# Error Budget Policy: Profile API
# SLO: 99.9% availability over 28-day rolling window
# Budget: ~40 minutes per 28-day window

When error budget is healthy (>25% remaining):
  - Normal feature development continues
  - Reliability work prioritized at >= 20% of team capacity (Workbook default)

When error budget is degraded (10-25% remaining):
  - Engineering manager and on-call lead review the burn cause
  - Reliability work increases to >= 40% of team capacity
  - Net-new feature launches require explicit director sign-off

When error budget is exhausted (0-10% remaining):
  - HARD FREEZE on net-new feature releases
  - 100% of engineering capacity goes to reliability work and SLO burn root-cause
  - On-call incident review with director within 48 hours
  - Resume normal cadence only when budget recovers above 25%

The mechanics that make the policy work in practice:

  1. Pre-agreement. The policy is signed by the engineering director, the product manager, and the on-call lead before the SLO program starts. The freeze trigger is not a negotiation that happens during an incident; it is a pre-committed decision. This is the single most important property of the policy. A policy that is renegotiated each time the budget is exhausted has no force.
  2. The freeze is structural, not punitive. The Workbook is explicit on this. The freeze is not a punishment for shipping bad code; it is a forcing function that reroutes engineering capacity to fix the underlying reliability issue. The conversation with the product manager is 'we cannot ship the next feature on top of a service that is failing — let's fix the foundation first,' not 'engineering messed up.'
  3. The freeze applies to net-new features, not to all changes. Bug fixes, security patches, and reliability work are explicitly exempt. The freeze targets the work that adds new product surface, not the work that maintains existing surface. Some teams also exempt experiments behind feature flags as long as the flag is off by default.
  4. The 'silver bullet' exception. Some policies allow a single director-level override per quarter for a launch that is genuinely time-critical (a regulatory deadline, a contractually-committed customer go-live). The override is logged. The Workbook recommends keeping the override count visible at the leadership tier; if it is being used routinely, the SLO is wrong.
  5. Budget reset boundaries. The 28-day rolling window is the modern default; some teams use calendar-month windows for simpler reporting. Rolling windows are operationally better — they don't 'reset to full' on the first of the month and create a perverse incentive to exhaust the budget before the reset.
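
A sketch of the tier logic as a periodic check; the thresholds mirror the template above, the SLO platform integration is left out, and the names are illustrative:

def budget_remaining(error_ratio_28d, slo=0.999):
    # Fraction of the error budget still unspent over the rolling 28-day window.
    return 1 - error_ratio_28d / (1 - slo)

def policy_tier(remaining):
    if remaining > 0.25:
        return "healthy"     # normal feature work, >= 20% reliability capacity
    if remaining > 0.10:
        return "degraded"    # >= 40% reliability capacity, launches need sign-off
    return "exhausted"       # hard freeze on net-new feature releases

policy_tier(budget_remaining(0.0004))   # 'healthy'   (60% of budget left)
policy_tier(budget_remaining(0.0011))   # 'exhausted' (budget overspent)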

The SRE Workbook gives an operational framing for the social mechanics of error-budget policies: 'reliability is a feature, and like any feature it competes for engineering capacity.' The policy is the mechanism by which reliability wins that competition predictably when the data says it should, rather than as an ad-hoc plea from the on-call team. The teams that succeed at SLO programs treat the policy as a contract between engineering and product; the teams that fail treat the SLO as a metric to report on.

The Charity Majors / Honeycomb extension of the framing: in the modern observability stack, the budget should drive not only feature freezes but also the depth of post-incident investigation. A small budget burn (5–10%) gets a 1-page postmortem; a large burn (50%+) or budget exhaustion gets a full retrospective with cross-functional attendance and a written set of action items tracked to closure. The investigation depth scales to the user-experience cost.

Frequently asked questions

What's the right starting SLO for a new user-facing service?
99.9% availability over a 28-day rolling window is the SRE Workbook's recommended default for general-purpose user-facing services. The math: about 40 minutes of budget per 28-day window, which is enough to absorb realistic incident patterns (one major incident plus several small blips) without constant freezes, but tight enough to drive real engineering investment in reliability. Don't start at 99.99% — the operational maturity required (multi-region failover, chaos engineering, sub-minute detection) is rarely worth it for a non-critical service.
How do I pick between latency and availability as the primary SLI?
Walk the user's flow. If the user is waiting on a result (page load, API response, search query), the failure mode is latency — pick a latency SLI at p95 or p99 with a threshold the user would notice (typically 200–500ms). If the user fires and forgets (webhook, async job submission, log ingest), the failure mode is availability — did the request succeed? Most teams need both, but the primary alerting SLI should match the dominant user failure mode. The SRE Workbook chapter on SLI selection has a decision tree.
What threshold values should I use in multi-window multi-burn-rate alerts?
The SRE Workbook publishes a calibrated table for the canonical recipe at sre.google/workbook/alerting-on-slos. For a 28-day window with the standard '2% of budget in 1 hour' fast-burn page: long window 1 hour, short window 5 minutes, threshold 14.4 × (1 - SLO). For the slow-burn ticket at '10% in 3 days': long window 3 days, short window 6 hours, threshold 1 × (1 - SLO). Don't reinvent the thresholds — use the published values, which are calibrated for sane false-positive and false-negative rates.
How do I handle SLOs for a service with bursty traffic?
Compute the SLI as a ratio (good_events / total_events) rather than as a count. A ratio is invariant to traffic volume — a service with 10 errors per 1000 requests has the same SLI as a service with 1000 errors per 100000. The SRE Workbook is explicit: never define the SLI as 'fewer than N errors per minute' — that breaks under traffic spikes. For services with very low traffic (e.g. < 10 RPS), consider a longer SLI window (1 hour rather than 5 minutes) to reduce statistical noise; the Workbook discusses the bottom-of-traffic edge cases.
Should I have one SLO per service, or many SLOs per service?
The SRE Workbook recommends one user-journey SLO per critical user flow, not one per endpoint. A service with three critical flows (sign-up, search, checkout) might have three SLOs, each composed from the SLIs of the endpoints involved in that flow. Per-endpoint SLOs are an anti-pattern: they create alerting noise without aligning to user experience. The exception: a service that serves heterogeneous workloads (batch + interactive) can split SLOs by workload because the failure modes differ.
How does an error-budget policy differ from a runbook?
The runbook tells the on-call engineer what to do during an incident; the error-budget policy tells the engineering team what to do across the quarter when the budget is depleted. Different time horizons, different audiences, different decisions. The runbook is a tactical artifact owned by the on-call team. The error-budget policy is a strategic artifact owned by engineering leadership and signed by product. The SRE Workbook treats them as separate documents with separate review cadences.
What's the relationship between SLOs and customer-facing SLAs?
SLAs should be one or two nines below the internal SLO to preserve operational margin. A 99.9% internal SLO might map to a 99.5% public SLA. The gap absorbs measurement error, edge cases (a customer's network, a CDN issue, a region failover), and the occasional bad week. SLAs typically have monetary or service-credit consequences attached and trigger contractual review processes; treating the SLA as the engineering target rather than the SLO is a common defect that undercuts the operational margin.
When should I use a 99.99% SLO instead of 99.9%?
When the user-experience cost of a failure is genuinely 10x higher than at 99.9%, and when the team has the operational maturity to support it. Concrete: payments, authentication, financial-transaction APIs, and life-safety systems. The cost of the additional nine is roughly 10x: multi-region active-active, redundancy at every tier, sub-minute detection, formal chaos engineering, and a 24/7 dedicated on-call rotation. The SRE Book chapter 3 ('Embracing Risk') is direct: 'a service can be too reliable.' If users can't perceive the additional reliability, you've spent the engineering budget on something the user doesn't value.
How does observability-2.0 (high-cardinality, event-level) change SLO practice?
It moves SLI computation from pre-aggregated metrics to event-level data, which lets you slice the SLI on demand (per-customer, per-region, per-endpoint, per-feature-flag). The math is unchanged — still good_events / total_events — but the slice can match the actual user experience rather than a pre-decided aggregation. Honeycomb's observability-2.0 essays cover this directly. Operationally, the multi-window multi-burn-rate alerts still apply; the burn rate is computed over the slice the user cares about, not a service-wide rollup.
How do I run the conversation with product when the budget is exhausted?
The conversation is structural, not punitive. Per the SRE Workbook: 'reliability is a feature competing for capacity.' The framing: we have a pre-agreed policy, the budget says foundations need attention, the next feature builds on those foundations, the freeze is the mechanism that fixes the foundation before we add weight. If product is surprised by the freeze, the policy was not pre-agreed correctly and the program needs to be redone. Charity Majors's writing on this is direct: the freeze conversation should never start with 'we have a problem' — it should start with 'we agreed in advance that this is what we do, and the data says we do it now.'

Sources

  1. Google SRE Workbook — 'Implementing SLOs.' The canonical operational reference for the SLI/SLO/SLA hierarchy and SLI selection.
  2. Google SRE Workbook — 'Alerting on SLOs.' The canonical 2026 multi-window multi-burn-rate alerting recipe with published threshold tables.
  3. Google SRE Book chapter 3 — 'Embracing Risk.' The 'a service can be too reliable' argument and the cost-of-nines framing.
  4. Google SRE Book chapter 4 — 'Service Level Objectives.' The original SLI/SLO/SLA definitions and the canonical framework.
  5. Honeycomb engineering blog — observability-2.0 essays on event-level SLI computation and modern high-cardinality SLO practice.
  6. Prometheus documentation — recording rules and alerting rules best practices, including the precomputed-burn-rate pattern referenced by the SRE Workbook.

About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.