SLOs, SLIs, and Error Budgets
February 13, 2026 | SRE, Observability, Reliability
Practical examples and budget policies.
SLOs, SLIs, and Error Budgets: A Practical Guide
Site Reliability Engineering (SRE) principles help teams balance reliability with feature velocity. SLOs (Service Level Objectives), SLIs (Service Level Indicators), and error budgets provide a data-driven framework for making reliability decisions.
Key Definitions
- SLI (Service Level Indicator) — A quantitative measurement of service behavior (e.g., latency, error rate, availability)
- SLO (Service Level Objective) — A target value for an SLI (e.g., "99.9% of requests complete in under 200ms")
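- SLA (Service Level Agreement) — A contractual commitment to customers, with penalties if it is missed. SLA targets are typically looser than the internal SLOs, so the team gets a warning before the contract is breached.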
- Error Budget — The allowed unreliability (100% - SLO). A 99.9% SLO has a 0.1% error budget.
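As a rough illustration of the error budget definition above, the sketch below converts an SLO percentage into a budget, both as a percentage and as minutes of full downtime. The function name `error_budget` and the 30-day default window are assumptions made for this example, and treating the budget as total downtime is a simplification (partial outages spend it more slowly).

```python
def error_budget(slo_percent: float, window_days: int = 30) -> tuple[float, float]:
    """Return (budget as a percentage, budget as minutes of full downtime)."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    budget_percent = 100.0 - slo_percent          # error budget = 100% - SLO
    window_minutes = window_days * 24 * 60        # length of the SLO window
    return budget_percent, window_minutes * budget_percent / 100.0


budget_pct, budget_min = error_budget(99.9)
print(f"{budget_pct:.2f}% budget = {budget_min:.1f} minutes per {30} days")
```

For a 99.9% SLO over 30 days this works out to roughly 43.2 minutes; tightening the target to 99.99% shrinks it to about 4.3 minutes.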
Choosing SLIs
| Service Type | Recommended SLIs |
|---|---|
| User-facing API | Availability, latency (P50, P95, P99), error rate |
| Data pipeline | Freshness (time since last update), correctness, throughput |
| Storage system | Availability, durability, latency |
| Batch job | Completion rate, runtime, data quality |
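To make the user-facing API row more concrete, here is a minimal sketch of how availability and latency SLIs might be computed from raw request records. The `Request` record, its field names, and the sample data are hypothetical; in practice these numbers usually come from a metrics system rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the client
    duration_ms: float   # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx response."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("need at least one request")
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(503, 950.0),   # server error, also slow
    Request(404, 90.0),    # client error still counts as "available"
]
print(availability_sli(sample))
print(latency_sli(sample, 200.0))
```

Both SLIs are expressed as a ratio of good events to total events, which makes them directly comparable to a percentage target.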
Practical SLO Examples
# API Service SLOs
Availability: 99.9% of requests return non-5xx responses (30-day window)
Latency: 99% of requests complete in < 200ms (P99)
Latency: 99.9% of requests complete in < 1000ms (P99.9)
# Translation to error budget:
# 99.9% availability = 43.2 minutes of downtime per 30 days
# 99% P99 latency = 1% of requests can exceed 200ms
Measuring SLIs with Prometheus
# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
# Latency: observed P99 in seconds (the SLO holds while this stays below 0.2)
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))
# Error budget remaining
1 - (
(1 - (sum(rate(http_requests_total{status!~"5.."}[30d])) / sum(rate(http_requests_total[30d]))))
/
(1 - 0.999) # 0.999 is the SLO target
)
Error Budget Policy
Define what happens when the error budget is consumed:
- >50% budget remaining: Normal feature development velocity
- 25-50% remaining: Increase code review rigor, add more testing
- 10-25% remaining: Freeze non-critical changes, focus on reliability improvements
- 0% remaining: Complete feature freeze until budget replenishes; all engineering effort on reliability
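The thresholds above could be encoded roughly as follows. This is a sketch, not a prescribed implementation: the function names are hypothetical, `budget_remaining` mirrors the "error budget remaining" query using plain request counts, and two boundary choices are assumptions (exactly 25% falls in the upper tier, and the unlisted 0-10% range is treated like the 10-25% tier).

```python
def budget_remaining(good: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total <= 0:
        raise ValueError("total must be positive")
    error_rate = 1 - good / total
    return 1 - error_rate / (1 - slo)


def policy_action(remaining: float) -> str:
    """Map remaining budget (as a fraction, e.g. 0.6 = 60%) to a policy tier."""
    if remaining > 0.50:
        return "normal feature velocity"
    if remaining >= 0.25:
        return "more review rigor and testing"
    if remaining > 0.0:
        # Assumption: the policy does not list 0-10%, so it shares this tier.
        return "freeze non-critical changes"
    return "full feature freeze"


# 400 failed requests out of 1,000,000 against a 99.9% SLO
remaining = budget_remaining(good=999_600, total=1_000_000)
print(f"{remaining:.0%} remaining: {policy_action(remaining)}")
```

Keeping the mapping in code (or in an alert rule) makes the policy hard to ignore when the budget runs low.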
SLO Dashboard
Create a Grafana dashboard showing:
- Current SLI values vs SLO targets
- Error budget remaining (percentage and time)
- Error budget burn rate (how fast are we consuming it?)
- 30-day trend of SLI performance
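For the burn-rate panel, one common way to express the figure is the observed error rate divided by the error rate the SLO allows, so a value of 1 means the budget lasts exactly one window. The sketch below follows that definition; the function names are hypothetical, and the projection assumes the current burn rate and traffic level hold steady.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1 - slo
    if allowed <= 0:
        raise ValueError("slo must be below 1.0")
    return error_rate / allowed


def days_until_exhausted(remaining: float, rate: float, window_days: int = 30) -> float:
    """Days until the budget runs out if the current burn rate continues."""
    if rate <= 0:
        return float("inf")   # nothing is being consumed
    return remaining * window_days / rate


# 0.5% of requests failing against a 99.9% SLO, with half the budget left
rate = burn_rate(error_rate=0.005, slo=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"days left: {days_until_exhausted(0.5, rate):.1f}")
```

A burn rate well above 1 is usually a better paging signal than the raw error rate, because it already accounts for how strict the SLO is.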
Common Mistakes
- Setting SLOs too high — 99.99% sounds good but means only 4.3 minutes of downtime per month. Is your team resourced for that?
- Not having error budget policies — SLOs without consequences are just dashboards
- Measuring the wrong SLIs — Focus on user-facing metrics, not infrastructure metrics
- Not involving stakeholders — Product managers should help set SLOs based on business impact
Eazy SaaS Tip: We help our clients start with 3 SLOs (availability, P99 latency, error rate) and build error budget dashboards. Teams that adopt error budgets report 40% fewer incidents because reliability work gets prioritized before budgets run out, not after outages happen.