On-Call Runbook Design Guide
February 13, 2026
SRE
Incident Response
Structure, templates, and scenarios.
A runbook is only useful if someone can follow it at 3 AM while half-asleep. Good runbooks are concise, prescriptive, and tested regularly. They can cut incident response time from hours to minutes by giving the on-call engineer a clear, step-by-step path to resolution.
Runbook Structure Template
## Alert: [Alert Name]
### Severity: P1/P2/P3
### Last Updated: YYYY-MM-DD
### Owner: [Team Name]
## Overview
Brief description of what this alert means and its impact.
## Quick Check (30 seconds)
1. Check [dashboard link] for current status
2. Check [service health endpoint]
3. If resolved, close the incident
## Diagnosis Steps
Step-by-step investigation...
## Resolution Steps
Clear actions to fix the issue...
## Escalation
When and who to escalate to...
## Post-Incident
Checklist after resolution...
Example: API High Error Rate Runbook
## Alert: API Error Rate > 1%
### Severity: P2
### Owner: Platform Team
## Overview
The API service is returning more than 1% 5xx errors.
Customer impact: Some API calls are failing.
## Quick Check
1. Open Grafana dashboard: [link]
2. Check if error rate is still elevated
3. If auto-recovered, monitor for 10 min then close
## Diagnosis
1. **Check deployment timeline**
- Was there a recent deployment? `kubectl rollout history deployment/api`
- If yes, consider rollback (see Resolution step 1)
2. **Check database connectivity**
- `kubectl exec deploy/api -- pg_isready -h postgres`
- If database is down, see [Database Runbook]
3. **Check resource exhaustion**
- `kubectl top pods -l app=api`
- Check for OOM kills: `kubectl describe pod [pod-name]` and look for `OOMKilled` under Last State
- If CPU is saturated: scale up (see Resolution step 3)
4. **Check logs for error patterns**
- `kubectl logs -l app=api --tail=100 | grep ERROR`
- Common patterns:
  - "connection refused" → database or dependency down
  - "timeout" → downstream service slow
  - "out of memory" → increase memory limits
## Resolution
1. **Rollback recent deployment**
```
kubectl rollout undo deployment/api
kubectl rollout status deployment/api
```
2. **Restart pods (temporary fix)**
```
kubectl rollout restart deployment/api
```
3. **Scale up if resource constrained**
```
kubectl scale deployment/api --replicas=5
```
## Escalation
- If not resolved in 30 minutes: page Senior SRE
- If data loss suspected: page Engineering Manager
- If customer-facing outage >1 hour: page VP Engineering
Runbook Best Practices
- Keep it prescriptive — "Run this command" not "investigate the logs"
- Include copy-pasteable commands — Minimize typing at 3 AM
- Link to dashboards — Every runbook should link to the relevant Grafana dashboard
- Version control — Store runbooks in Git, review changes in PRs
- Test regularly — Run Game Days where on-call engineers practice using runbooks
- Keep them short — If a runbook is longer than 2 pages, split it into sub-runbooks
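To make the "prescriptive" and "copy-pasteable" points concrete, the log-pattern table from the example runbook's Diagnosis step 4 could be turned into a small helper, so the on-call engineer gets a suggested cause instead of a wall of log lines. This is a minimal sketch: the function names `classify_line` and `summarize_errors` are hypothetical, and the three patterns are only the ones listed in the example runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of kubectl): map a log line to the likely
# cause from the example runbook's "Common patterns" list.

classify_line() {
  local lower
  # Lowercase first so "Timeout" and "timeout" are treated the same.
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *"connection refused"*) echo "database or dependency down" ;;
    *"out of memory"*)      echo "increase memory limits" ;;
    *timeout*)              echo "downstream service slow" ;;
    *)                      echo "unclassified" ;;
  esac
}

# Read log lines on stdin and print each suggested cause with a count,
# most frequent first.
summarize_errors() {
  local line
  while IFS= read -r line; do
    classify_line "$line"
  done | sort | uniq -c | sort -rn
}
```

With these functions sourced into the shell, the runbook step becomes a single paste: `kubectl logs -l app=api --tail=100 | grep ERROR | summarize_errors`.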
Game Day Testing
Quarterly Game Days validate your runbooks:
- Inject a controlled failure (kill a pod, throttle a database)
- On-call engineer follows the runbook to diagnose and resolve
- Measure time to detect, diagnose, and resolve
- Update the runbook based on gaps discovered
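The measurement step is easier to repeat if the timestamps are recorded the same way every quarter. The sketch below assumes four Unix epoch timestamps noted during the exercise (failure injected, alert acknowledged, cause identified, service restored) and assumes that time to resolve is counted from the moment of injection. The function name `gameday_timings` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Game Day intervals in seconds.
# Arguments are epoch seconds: injected detected diagnosed resolved
gameday_timings() {
  local injected="$1" detected="$2" diagnosed="$3" resolved="$4"

  # Reject timestamps that are out of order rather than print negative durations.
  if [ "$detected" -lt "$injected" ] || [ "$diagnosed" -lt "$detected" ] \
     || [ "$resolved" -lt "$diagnosed" ]; then
    echo "timestamps must be in chronological order" >&2
    return 1
  fi

  echo "time_to_detect=$((detected - injected))"
  echo "time_to_diagnose=$((diagnosed - detected))"
  # Assumption: resolve time is measured from injection, not from diagnosis.
  echo "time_to_resolve=$((resolved - injected))"
}
```

During the exercise, each timestamp can be captured with `date +%s` at the relevant moment and passed in afterwards, for example `gameday_timings "$t_inject" "$t_alert" "$t_cause" "$t_fixed"`.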
Runbook Review Checklist
- Last updated within 90 days?
- All links still working?
- Commands tested on current infrastructure?
- Escalation contacts still accurate?
- Covers the most common failure modes?
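Two of these checks lend themselves to automation, for instance as a CI step on the Git repository that holds the runbooks. The sketch below is one possible approach, not a standard tool: `check_runbook` and `days_from_date` are hypothetical names, the required sections are the five headings shared by the template and the example above, and the date is expected in the template's `### Last Updated: YYYY-MM-DD` form. Links, commands, and escalation contacts still need a human or a separate check.

```shell
#!/usr/bin/env bash
# Hypothetical runbook lint: required sections present, Last Updated within 90 days.

# Days since 1970-01-01 for a YYYY-MM-DD date, using integer arithmetic only
# so it does not depend on GNU-specific or BSD-specific `date` options.
days_from_date() {
  local y m d
  y=${1%%-*}; m=${1#*-}; m=${m%%-*}; d=${1##*-}
  m=${m#0}; d=${d#0}   # strip a leading zero so 08 and 09 are not read as octal
  # Treat January and February as months 13 and 14 of the previous year.
  if [ "$m" -le 2 ]; then y=$((y - 1)); m=$((m + 12)); fi
  echo $(( 365*y + y/4 - y/100 + y/400 + (153*(m - 3) + 2)/5 + d - 719469 ))
}

# Print one line per problem; print nothing if the runbook passes.
# $1 = runbook file, $2 = today's date as YYYY-MM-DD (defaults to the system date)
check_runbook() {
  local file="$1" today="${2:-$(date +%F)}" section updated age

  for section in "Overview" "Quick Check" "Diagnosis" "Resolution" "Escalation"; do
    grep -q "^## $section" "$file" || echo "missing section: $section"
  done

  updated=$(sed -n 's/^### Last Updated: *\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$file" | head -n 1)
  if [ -z "$updated" ]; then
    echo "missing or unparseable Last Updated date"
    return
  fi

  age=$(( $(days_from_date "$today") - $(days_from_date "$updated") ))
  [ "$age" -le 90 ] || echo "stale: last updated $age days ago"
}
```

Running `check_runbook runbooks/api-error-rate.md` (a hypothetical path) in a pull-request check would flag a runbook that has lost a section or has not been touched in the last quarter.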
Eazy SaaS Tip: We create runbooks alongside every monitoring alert we deploy. No alert goes live without a corresponding runbook. Our clients report a 60% reduction in mean time to resolve (MTTR) after adopting structured runbooks with quarterly Game Day testing.