SLOs, SLIs, and Error Budgets

February 13, 2026 | SRE, Observability, Reliability

Practical examples and budget policies.

SLOs, SLIs, and Error Budgets: A Practical Guide

Site Reliability Engineering (SRE) principles help teams balance reliability with feature velocity. SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets provide a data-driven framework for making reliability decisions.

Key Definitions

  • SLI (Service Level Indicator) — A quantitative measurement of service behavior (e.g., latency, error rate, availability)
  • SLO (Service Level Objective) — A target value for an SLI (e.g., "99.9% of requests complete in under 200ms")
  • SLA (Service Level Agreement) — A contractual commitment with penalties for missing it (typically set looser than the internal SLO, so the team has room to react before the contract is breached)
  • Error Budget — The allowed unreliability (100% - SLO). A 99.9% SLO has a 0.1% error budget.
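
To make the error budget definition concrete, here is a minimal Python sketch that turns an SLO target into a count of requests that may fail. The function name `allowed_failures` and the request volume are illustrative assumptions, not part of any library.

```python
def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Return how many requests may fail in the window without breaching the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
    error_budget = 1.0 - slo_target  # allowed unreliability (100% - SLO)
    return error_budget * total_requests


# Hypothetical volume: 10 million requests in the SLO window.
print(round(allowed_failures(0.999, 10_000_000)))  # roughly 10,000 failed requests
```

Expressing the budget as a request count is often easier to reason about than a percentage, since it can be compared directly with the number of errors seen so far.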

Choosing SLIs

Service Type      | Recommended SLIs
User-facing API   | Availability, latency (P50, P95, P99), error rate
Data pipeline     | Freshness (time since last update), correctness, throughput
Storage system    | Availability, durability, latency
Batch job         | Completion rate, runtime, data quality

Practical SLO Examples

# API Service SLOs
Availability: 99.9% of requests return non-5xx responses (30-day window)
Latency: 99% of requests complete in < 200ms (P99)
Latency: 99.9% of requests complete in < 1000ms (P99.9)

# Translation to error budget:
# 99.9% availability = 0.1% of requests may fail, equivalent to 43.2 minutes of full downtime per 30 days
# 99% under 200ms = 1% of requests can exceed 200ms
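
The downtime figure above is simple arithmetic, and the same calculation works for any target. The following Python sketch shows it; `downtime_budget` is a hypothetical helper name, and treating a request-based SLO as wall-clock downtime assumes traffic is spread evenly over the window.

```python
from datetime import timedelta


def downtime_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the amount of full downtime that fits inside the error budget."""
    return window * (1.0 - slo_target)


window = timedelta(days=30)
for target in (0.99, 0.999, 0.9999):
    minutes = downtime_budget(target, window).total_seconds() / 60
    print(f"{target:.2%} over 30 days allows {minutes:.2f} minutes of downtime")
```

Running this for a few candidate targets is a quick way to see how fast the allowance shrinks with each extra nine.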

Measuring SLIs with Prometheus

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency: observed P99 in seconds over the window; the SLO holds while this stays below 0.2 (200ms)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))

# Error budget remaining
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))))
  /
  (1 - 0.999)  # 0.999 is the SLO target
)
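
The error budget query can be easier to follow when the same arithmetic is written out step by step. Below is a minimal Python sketch of that calculation, assuming the good and total request counts have already been fetched from the monitoring system; the function name and the sample counts are made up for illustration.

```python
def error_budget_remaining(good: float, total: float, slo_target: float) -> float:
    """Fraction of budget left: 1.0 = untouched, 0.0 = exhausted, negative = overspent."""
    if total <= 0:
        return 1.0  # no traffic in the window, so nothing has been consumed
    observed_error_ratio = 1.0 - good / total
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return 1.0 - observed_error_ratio / budget


# Hypothetical counts: 500 failures out of 1,000,000 requests against a 99.9% SLO.
remaining = error_budget_remaining(good=999_500, total=1_000_000, slo_target=0.999)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the SLO has already been missed for the window, which is worth surfacing rather than clamping to zero.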

Error Budget Policy

Define what happens as the error budget is consumed:

  • >50% budget remaining: Normal feature development velocity
  • 25-50% remaining: Increase code review rigor, add more testing
  • Under 25% remaining (but not exhausted): Freeze non-critical changes, focus on reliability improvements
  • Exhausted (0% remaining or overspent): Complete feature freeze until budget replenishes; all engineering effort on reliability
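
A policy like this is easiest to apply consistently when it is encoded rather than remembered. The sketch below maps a remaining-budget fraction to a tier. The function name and return strings are hypothetical, and it assumes that any positive budget under 25% falls in the change-freeze tier, that exactly 50% and exactly 25% belong to the stricter-review tier, and that zero or negative values mean a full freeze.

```python
def budget_policy(remaining: float) -> str:
    """Map the fraction of error budget remaining (1.0 = full) to a policy tier."""
    if remaining > 0.50:
        return "normal feature velocity"
    if remaining >= 0.25:
        return "stricter review and testing"
    if remaining > 0.0:
        # Assumption: everything between 0% and 25% is treated as one tier.
        return "freeze non-critical changes"
    return "full feature freeze"


for value in (0.8, 0.4, 0.15, 0.0):
    print(f"{value:.0%} remaining -> {budget_policy(value)}")
```

The output of such a function could feed a dashboard annotation or a deploy gate, though how it is wired in depends on the team's tooling.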

SLO Dashboard

Create a Grafana dashboard showing:

  • Current SLI values vs SLO targets
  • Error budget remaining (percentage and time)
  • Error budget burn rate (how fast are we consuming it?)
  • 30-day trend of SLI performance
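
Burn rate is commonly defined as the observed error ratio divided by the error budget, so a value of 1 means the budget would last exactly one SLO window. A minimal Python sketch under that definition follows; both function names are illustrative, and the projection assumes the current error ratio stays constant.

```python
def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1.0 - slo_target)


def days_until_exhausted(budget_remaining: float, rate: float,
                         window_days: float = 30.0) -> float:
    """Project how long the remaining budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # nothing is being consumed
    return budget_remaining * window_days / rate


# Hypothetical reading: 0.2% of requests failing against a 99.9% SLO.
rate = burn_rate(0.002, 0.999)
print(f"burn rate {rate:.1f}x, full budget lasts {days_until_exhausted(1.0, rate):.1f} days")
```

Showing the projected days alongside the raw burn rate tends to make the dashboard panel easier to act on.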

Common Mistakes

  1. Setting SLOs too high — 99.99% sounds good but means only 4.3 minutes of downtime per month. Is your team resourced for that?
  2. Not having error budget policies — SLOs without consequences are just dashboards
  3. Measuring the wrong SLIs — Focus on user-facing metrics, not infrastructure metrics
  4. Not involving stakeholders — Product managers should help set SLOs based on business impact

Eazy SaaS Tip: We help our clients start with 3 SLOs (availability, P99 latency, error rate) and build error budget dashboards. Teams that adopt error budgets report 40% fewer incidents because reliability work gets prioritized before budgets run out, not after outages happen.