SLOs, SLIs, and Error Budgets

February 13, 2026 | SRE Observability Reliability

Practical examples and budget policies.

SLOs, SLIs, and Error Budgets: A Practical Guide

Site Reliability Engineering (SRE) principles help teams balance reliability with feature velocity. SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets provide a data-driven framework for making reliability decisions.

Key Definitions

SLI (Service Level Indicator) — A quantitative measurement of service behavior (e.g., latency, error rate, availability)
SLO (Service Level Objective) — A target value for an SLI (e.g., "99.9% of requests complete in under 200ms")
SLA (Service Level Agreement) — A contractual commitment with penalties for missing it (its targets are typically looser than the internal SLOs, so the team has a safety margin)
Error Budget — The allowed unreliability (100% - SLO). A 99.9% SLO has a 0.1% error budget.

Choosing SLIs

Service Type	Recommended SLIs
User-facing API	Availability, latency (P50, P95, P99), error rate
Data pipeline	Freshness (time since last update), correctness, throughput
Storage system	Availability, durability, latency
Batch job	Completion rate, runtime, data quality

Practical SLO Examples

# API Service SLOs
Availability: 99.9% of requests return non-5xx responses (30-day window)
Latency: 99% of requests complete in < 200ms (P99)
Latency: 99.9% of requests complete in < 1000ms (P99.9)

# Translation to error budget:
# 99.9% availability = 0.1% of requests may fail, equivalent to 43.2 minutes of full downtime per 30 days
# 99% latency target = 1% of requests may exceed 200ms
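
The arithmetic behind these translations can be sketched in a few lines of Python. This is a minimal illustration, not part of any library: the function names `allowed_downtime_minutes` and `allowed_bad_requests` are hypothetical, and the SLO is assumed to be passed as a fraction (0.999 rather than 99.9).

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based budget: minutes of full downtime the SLO tolerates per window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def allowed_bad_requests(slo: float, total_requests: int) -> float:
    """Request-based budget: how many requests may miss the target."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * total_requests


if __name__ == "__main__":
    # Compare how quickly the time budget shrinks as the target tightens.
    for slo in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo:.2%} SLO -> {minutes:.2f} minutes per 30 days")
```

The request-based form is often the more useful one for an API, because a partial outage that fails 10% of requests for an hour spends far less budget than an hour of full downtime.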

Measuring SLIs with Prometheus

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Latency SLI: observed P99 in seconds (the 200ms target is met when this value is below 0.2)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))

# Error budget remaining
1 - (
  (1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))))
  /
  (1 - 0.999)  # 0.999 is the SLO target
)
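
To make the error budget formula easier to check by hand, here is the same calculation as a small Python sketch working on raw request counts. The function names and the sample counts are illustrative assumptions; in practice the counts would come from the Prometheus queries above.

```python
def availability_sli(good_requests: float, total_requests: float) -> float:
    """Fraction of requests that succeeded (non-5xx) in the window."""
    if total_requests <= 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """1 - (observed error ratio / allowed error ratio).

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been violated.
    """
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - sli) / budget


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo=0.999)
    print(f"SLI = {sli:.4%}, budget remaining = {remaining:.1%}")
```

With a 99.9% target, 500 failures out of 1,000,000 requests is half of the 1,000 failures the budget allows, so about half the budget remains.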

Error Budget Policy

Define what happens as the error budget is consumed:

>50% budget remaining: Normal feature development velocity
25-50% remaining: Increase code review rigor, add more testing
10-25% remaining: Freeze non-critical changes, focus on reliability improvements
0% remaining: Complete feature freeze until budget replenishes; all engineering effort on reliability
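
A policy like this is easier to enforce when it is encoded somewhere a deploy pipeline or chat bot can read it. The sketch below maps the remaining budget to the tiers listed above. The function name `budget_policy` is hypothetical, and two choices are assumptions not stated in the policy: a value that sits exactly on a boundary falls into the stricter tier, and the band between 0% and 10% is treated like the 10-25% tier.

```python
def budget_policy(remaining: float) -> str:
    """Map remaining error budget (as a fraction, e.g. 0.4 for 40%) to an action."""
    if remaining > 0.50:
        return "Normal feature development velocity"
    if remaining > 0.25:
        return "Increase code review rigor, add more testing"
    if remaining > 0.0:
        # Assumption: the 0-10% band is handled like the 10-25% tier.
        return "Freeze non-critical changes, focus on reliability improvements"
    # Budget exhausted or overspent.
    return "Complete feature freeze until budget replenishes"


if __name__ == "__main__":
    for remaining in (0.80, 0.40, 0.15, 0.0):
        print(f"{remaining:>4.0%} remaining: {budget_policy(remaining)}")
```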

SLO Dashboard

Create a Grafana dashboard showing:

Current SLI values vs SLO targets
Error budget remaining (percentage and time)
Error budget burn rate (how fast are we consuming it?)
30-day trend of SLI performance
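
Burn rate is the least self-explanatory panel in that list, so a small sketch may help. It uses the common definition of burn rate as the observed error ratio divided by the allowed error ratio, where a rate of 1 spends the budget exactly over the SLO window. The function names and the 30-day default are assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def days_until_exhausted(
    remaining_fraction: float, rate: float, window_days: float = 30.0
) -> float:
    """Days until the budget runs out if the current burn rate continues."""
    if rate <= 0.0:
        return float("inf")  # nothing is being spent
    # A burn rate of 1 spends the whole budget in exactly one window.
    return max(remaining_fraction, 0.0) * window_days / rate


if __name__ == "__main__":
    # Hypothetical reading: 0.2% of recent requests failed against a 99.9% SLO.
    rate = burn_rate(error_ratio=0.002, slo=0.999)
    days = days_until_exhausted(remaining_fraction=0.5, rate=rate)
    print(f"burn rate = {rate:.1f}x, budget exhausted in about {days:.1f} days")
```

Measuring the error ratio over a short recent window (for example the last hour) rather than the full 30 days makes the burn rate responsive enough to show that an incident is under way.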

Common Mistakes

Setting SLOs too high — 99.99% sounds good but means only 4.3 minutes of downtime per month. Is your team resourced for that?
Not having error budget policies — SLOs without consequences are just dashboards
Measuring the wrong SLIs — Focus on user-facing metrics, not infrastructure metrics
Not involving stakeholders — Product managers should help set SLOs based on business impact

Eazy SaaS Tip: We help our clients start with 3 SLOs (availability, P99 latency, error rate) and build error budget dashboards. Teams that adopt error budgets report 40% fewer incidents because reliability work gets prioritized before budgets run out, not after outages happen.
