On-Call Runbook Design Guide

February 13, 2026 | SRE Incident Response

Structure, templates, and scenarios.

A runbook is only useful if someone can follow it at 3 AM while half-asleep. Good runbooks are concise, prescriptive, and regularly tested. By giving the responder a clear, step-by-step resolution path, they can cut incident response time from hours to minutes.

Runbook Structure Template

## Alert: [Alert Name]
### Severity: P1/P2/P3
### Last Updated: YYYY-MM-DD
### Owner: [Team Name]

## Overview
Brief description of what this alert means and its impact.

## Quick Check (30 seconds)
1. Check [dashboard link] for current status
2. Check [service health endpoint]
3. If resolved, close the incident

## Diagnosis Steps
Step-by-step investigation...

## Resolution Steps
Clear actions to fix the issue...

## Escalation
When and who to escalate to...

## Post-Incident
Checklist after resolution...
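
One way to keep runbooks consistent with this template is to lint them automatically, for example as a CI check on the runbook repository. The sketch below is a minimal, hypothetical Python example: the function name `missing_sections` and the heading strings it looks for are assumptions taken from the template above, not part of any standard tool.

```python
import re

# Headings the template above expects; adjust to match your own template.
REQUIRED_SECTIONS = [
    "Overview",
    "Quick Check",
    "Diagnosis",
    "Resolution",
    "Escalation",
    "Post-Incident",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required template sections that have no '## ' heading."""
    # Collect the text of every level-2 heading, e.g. "Quick Check (30 seconds)".
    # Level-3 lines such as "### Severity" do not match this pattern.
    headings = re.findall(r"^##\s+(.+)$", runbook_text, flags=re.MULTILINE)
    missing = []
    for section in REQUIRED_SECTIONS:
        # Prefix match so "Diagnosis Steps" satisfies "Diagnosis".
        if not any(h.strip().startswith(section) for h in headings):
            missing.append(section)
    return missing
```

A check like this only confirms that the sections exist; whether their content is actually usable at 3 AM still needs a human review.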

Example: API High Error Rate Runbook

## Alert: API Error Rate > 1%
### Severity: P2
### Last Updated: YYYY-MM-DD
### Owner: Platform Team

## Overview
The API service is returning more than 1% 5xx errors.
Customer impact: Some API calls are failing.

## Quick Check
1. Open Grafana dashboard: [link]
2. Check if error rate is still elevated
3. If auto-recovered, monitor for 10 min then close

## Diagnosis
1. **Check deployment timeline**
   - Was there a recent deployment? `kubectl rollout history deployment/api`
   - If yes, consider rollback (see Resolution step 1)

2. **Check database connectivity**
   - `kubectl exec deploy/api -- pg_isready -h postgres` (assumes `pg_isready` is installed in the API image)
   - If database is down, see [Database Runbook]

3. **Check resource exhaustion**
   - `kubectl top pods -l app=api`
   - To check for OOM kills: `kubectl describe pod [pod-name]` and look for `Reason: OOMKilled` under `Last State`
   - If CPU is throttled: add replicas (see Resolution step 3)

4. **Check logs for error patterns**
   - `kubectl logs -l app=api --tail=100 | grep ERROR`
   - Common patterns:
     - "connection refused" → database or dependency down
     - "timeout" → downstream service slow
     - "out of memory" → increase memory limits

## Resolution
1. **Rollback recent deployment**
   ```
   kubectl rollout undo deployment/api
   kubectl rollout status deployment/api
   ```

2. **Restart pods (temporary fix)**
   ```
   kubectl rollout restart deployment/api
   ```

3. **Scale up if resource constrained**
   ```
   kubectl scale deployment/api --replicas=5
   ```

## Escalation
- If not resolved in 30 minutes: page Senior SRE
- If data loss suspected: page Engineering Manager
- If customer-facing outage >1 hour: page VP Engineering
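
Looking back at Diagnosis step 4 of this runbook, the "Common patterns" list lends itself to a small lookup table. The following Python sketch is illustrative only: `PATTERN_HINTS` and `triage` are hypothetical names, the substrings and hints are copied from the runbook above, and real log formats will likely need more specific matching.

```python
from collections import Counter

# Substring -> hint, copied from the "Common patterns" list in Diagnosis step 4.
PATTERN_HINTS = {
    "connection refused": "database or dependency down",
    "timeout": "downstream service slow",
    "out of memory": "increase memory limits",
}


def triage(log_lines):
    """Count how many log lines match each known pattern.

    Returns a list of (hint, count) tuples, most frequent first.
    """
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()  # match case-insensitively
        for pattern, hint in PATTERN_HINTS.items():
            if pattern in lowered:
                counts[hint] += 1
    # most_common() sorts hints by descending count.
    return counts.most_common()
```

Fed the output of the `kubectl logs` command from step 4 (for example, by iterating over `sys.stdin`), this gives a ranked list of hints rather than a wall of log lines to read.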

Runbook Best Practices

- Keep it prescriptive: "Run this command", not "investigate the logs"
- Include copy-pasteable commands: minimize typing at 3 AM
- Link to dashboards: every runbook should link to the relevant Grafana dashboard
- Version control: store runbooks in Git, review changes in PRs
- Test regularly: run Game Days where on-call engineers practice using runbooks
- Keep them short: if a runbook is longer than 2 pages, split it into sub-runbooks

Game Day Testing

Quarterly Game Days validate your runbooks:

1. Inject a controlled failure (kill a pod, throttle a database)
2. On-call engineer follows the runbook to diagnose and resolve
3. Measure time to detect, diagnose, and resolve
4. Update the runbook based on gaps discovered
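
The three measurements can be derived from four timestamps recorded during the exercise. The sketch below is a minimal Python illustration; the `GameDayRun` class and its field names are hypothetical, and it assumes each phase is timed from the end of the previous one (some teams instead measure every phase from the moment of injection).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class GameDayRun:
    injected_at: datetime   # failure injected
    detected_at: datetime   # alert fired or engineer noticed
    diagnosed_at: datetime  # cause identified
    resolved_at: datetime   # service healthy again

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.injected_at

    def time_to_diagnose(self) -> timedelta:
        return self.diagnosed_at - self.detected_at

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.diagnosed_at

    def total(self) -> timedelta:
        # End-to-end duration of the exercise.
        return self.resolved_at - self.injected_at
```

Recording these per Game Day makes it easier to see which phase a runbook update actually shortened.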

Runbook Review Checklist

- Last updated within 90 days?
- All links still working?
- Commands tested on current infrastructure?
- Escalation contacts still accurate?
- Covers the most common failure modes?
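
The first checklist item is easy to automate if runbooks carry the template's `### Last Updated: YYYY-MM-DD` line. Here is a minimal Python sketch, assuming that exact header format; `is_stale` is a hypothetical helper, and a runbook with no parsable date is treated as stale.

```python
import re
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=90)  # review window from the checklist above
LAST_UPDATED = re.compile(
    r"^#+\s*Last Updated:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)


def is_stale(runbook_text: str, today: Optional[date] = None) -> bool:
    """Return True if the runbook has no valid date or is older than MAX_AGE."""
    today = today or date.today()
    match = LAST_UPDATED.search(runbook_text)
    if match is None:
        return True  # missing or unfilled "Last Updated" line counts as stale
    try:
        updated = date.fromisoformat(match.group(1))
    except ValueError:
        return True  # an impossible date such as 2026-13-45
    return today - updated > MAX_AGE
```

The other checklist items (working links, tested commands, accurate contacts) are harder to verify mechanically and are usually better covered by Game Days.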

Eazy SaaS Tip: We create runbooks alongside every monitoring alert we deploy. No alert goes live without a corresponding runbook. Our clients report a 60% reduction in mean time to resolve (MTTR) after adopting structured runbooks with quarterly Game Day testing.
