On-Call Runbook Design Guide

February 13, 2026 | SRE Incident Response

Structure, templates, and scenarios.


A runbook is only useful if someone can follow it at 3 AM while half-asleep. Good runbooks are concise, prescriptive, and tested regularly. They can cut incident response time from hours to minutes by giving the on-call engineer a clear, step-by-step path to resolution.

Runbook Structure Template

## Alert: [Alert Name]
### Severity: P1/P2/P3
### Last Updated: YYYY-MM-DD
### Owner: [Team Name]

## Overview
Brief description of what this alert means and its impact.

## Quick Check (30 seconds)
1. Check [dashboard link] for current status
2. Check [service health endpoint]
3. If resolved, close the incident

## Diagnosis Steps
Step-by-step investigation...

## Resolution Steps
Clear actions to fix the issue...

## Escalation
When and who to escalate to...

## Post-Incident
Checklist after resolution...
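
A template only helps if runbooks actually follow it. The sketch below checks a runbook file for the section headings listed in the template above. The function name `lint_runbook` is hypothetical, and matching on heading prefixes is an assumption, chosen so that both "## Diagnosis" and "## Diagnosis Steps" pass.

```shell
# lint_runbook FILE
# Prints one "missing: <heading>" line per absent template section.
# Returns 0 if every section is present, 1 otherwise.
lint_runbook() {
  local file="$1" missing=0 heading
  for heading in \
    "## Alert:" "### Severity:" "### Last Updated:" "### Owner:" \
    "## Overview" "## Quick Check" "## Diagnosis" "## Resolution" \
    "## Escalation" "## Post-Incident"; do
    # Anchor at line start so a heading mentioned mid-sentence does not count
    if ! grep -q -e "^${heading}" "$file"; then
      echo "missing: $heading"
      missing=1
    fi
  done
  return "$missing"
}
```

A check like this could run in CI on every runbook change, for example `lint_runbook runbooks/api-error-rate.md` (the path is illustrative).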

Example: API High Error Rate Runbook

## Alert: API Error Rate > 1%
### Severity: P2
### Owner: Platform Team

## Overview
The API service is returning more than 1% 5xx errors.
Customer impact: Some API calls are failing.

## Quick Check
1. Open Grafana dashboard: [link]
2. Check if error rate is still elevated
3. If auto-recovered, monitor for 10 min then close
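
To make "still elevated" unambiguous, the check can be reduced to arithmetic on two counters. This is a minimal sketch: `error_rate_elevated` is a hypothetical helper, and it assumes you can read the total request count and the 5xx count for the same time window from your metrics system.

```shell
# error_rate_elevated TOTAL_REQUESTS ERRORS_5XX [THRESHOLD_PERCENT]
# Exits 0 if the error rate is strictly above the threshold (default 1%),
# exits 1 otherwise, including when there was no traffic.
error_rate_elevated() {
  local total="$1" errors="$2" threshold="${3:-1}"
  # Compare errors*100 against threshold*total to avoid dividing by zero
  awk -v t="$total" -v e="$errors" -v p="$threshold" \
    'BEGIN { if (t > 0 && e * 100 > p * t) exit 0; exit 1 }'
}
```

For example, `error_rate_elevated 10000 150` succeeds (1.5%), while `error_rate_elevated 10000 100` does not, because exactly 1% is not "more than 1%".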

## Diagnosis
1. **Check deployment timeline**
   - Was there a recent deployment? `kubectl rollout history deployment/api`
   - If yes, consider rollback (see Resolution step 1)

2. **Check database connectivity**
   - `kubectl exec [pod-name] -- pg_isready -h postgres`
   - If database is down, see [Database Runbook]

3. **Check resource exhaustion**
   - `kubectl top pods -l app=api`
   - If memory usage is near the limit, check for OOMKilled in the pod's last state: `kubectl describe pod [pod-name]`
   - If CPU throttled: scale up (see Resolution step 3)

4. **Check logs for error patterns**
   - `kubectl logs -l app=api --tail=100 | grep ERROR`
   - Common patterns:
     - "connection refused" → database or dependency down
     - "timeout" → downstream service slow
     - "out of memory" → increase memory limits
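
Scanning 100 log lines by eye is slow at 3 AM. As a rough sketch, the three patterns listed above can be counted in one pass. `classify_errors` is a hypothetical helper, and the case-insensitive matching is an assumption about how the application formats its messages.

```shell
# classify_errors: read log lines on stdin and count the three patterns
# from the Diagnosis step, with the likely cause next to each count.
classify_errors() {
  awk '
    { line = tolower($0) }                 # match regardless of case
    line ~ /connection refused/ { refused++ }
    line ~ /timeout/            { timeout++ }
    line ~ /out of memory/      { oom++ }
    END {
      printf "connection refused (database or dependency down): %d\n", refused
      printf "timeout (downstream service slow): %d\n", timeout
      printf "out of memory (increase memory limits): %d\n", oom
    }'
}
```

Usage with the command from step 4: `kubectl logs -l app=api --tail=100 | grep ERROR | classify_errors`.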

## Resolution
1. **Rollback recent deployment**
   ```
   kubectl rollout undo deployment/api
   kubectl rollout status deployment/api
   ```

2. **Restart pods (temporary fix)**
   ```
   kubectl rollout restart deployment/api
   ```

3. **Scale up if resource constrained**
   ```
   # Set --replicas above the current count; 5 is an example value
   kubectl scale deployment/api --replicas=5
   ```

## Escalation
- If not resolved in 30 minutes: page Senior SRE
- If data loss suspected: page Engineering Manager
- If customer-facing outage >1 hour: page VP Engineering
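
The escalation rules above are simple enough to encode, which removes any judgment call about who to page. The sketch below is illustrative only: `escalation_targets` and its yes/no arguments are hypothetical, and it assumes "in 30 minutes" means 30 minutes or more while ">1 hour" means strictly more than 60 minutes.

```shell
# escalation_targets MINUTES_ELAPSED [DATA_LOSS yes|no] [CUSTOMER_FACING yes|no]
# Prints one line per role that should be paged under the rules above.
escalation_targets() {
  local minutes="$1" data_loss="${2:-no}" customer_facing="${3:-no}"
  if [ "$minutes" -ge 30 ]; then
    echo "Senior SRE"
  fi
  if [ "$data_loss" = "yes" ]; then
    echo "Engineering Manager"
  fi
  if [ "$customer_facing" = "yes" ] && [ "$minutes" -gt 60 ]; then
    echo "VP Engineering"
  fi
}
```

For instance, `escalation_targets 61 no yes` prints "Senior SRE" and "VP Engineering", while `escalation_targets 5` prints nothing.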

Runbook Best Practices

  1. Keep it prescriptive — "Run this command" not "investigate the logs"
  2. Include copy-pasteable commands — Minimize typing at 3 AM
  3. Link to dashboards — Every runbook should link to the relevant Grafana dashboard
  4. Version control — Store runbooks in Git, review changes in PRs
  5. Test regularly — Run Game Days where on-call engineers practice using runbooks
  6. Keep them short — If a runbook is longer than 2 pages, split it into sub-runbooks

Game Day Testing

Quarterly Game Days validate your runbooks:

  • Inject a controlled failure (kill a pod, throttle a database)
  • On-call engineer follows the runbook to diagnose and resolve
  • Measure time to detect, diagnose, and resolve
  • Update the runbook based on gaps discovered
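
Recording four timestamps during the exercise is enough to derive the three measurements. This is a minimal sketch under two assumptions: timestamps are Unix epoch seconds (for example from `date +%s`), and the phases are measured back to back, so time to resolve starts when diagnosis ends. The function name `game_day_timings` is hypothetical.

```shell
# game_day_timings INJECTED DETECTED DIAGNOSED RESOLVED
# All arguments are Unix epoch seconds. Prints each phase duration in seconds.
game_day_timings() {
  local injected="$1" detected="$2" diagnosed="$3" resolved="$4"
  echo "time_to_detect_s=$((detected - injected))"
  echo "time_to_diagnose_s=$((diagnosed - detected))"
  echo "time_to_resolve_s=$((resolved - diagnosed))"
  echo "total_s=$((resolved - injected))"
}
```

Comparing these numbers across quarters shows whether runbook updates are actually shortening each phase.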

Runbook Review Checklist

  • Last updated within 90 days?
  • All links still working?
  • Commands tested on current infrastructure?
  • Escalation contacts still accurate?
  • Covers the most common failure modes?
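
The first checklist item can be automated from the "Last Updated" field in the template. The sketch below assumes GNU `date` (the `-d` option is not available in BSD or macOS `date`) and dates in YYYY-MM-DD form. `runbook_is_stale` is a hypothetical helper, and a runbook updated exactly 90 days ago is treated as still current.

```shell
# runbook_is_stale LAST_UPDATED TODAY [MAX_AGE_DAYS]
# Dates are YYYY-MM-DD. Exits 0 if the runbook is older than MAX_AGE_DAYS
# (default 90), 1 if it is current, 2 if a date cannot be parsed.
runbook_is_stale() {
  local updated="$1" today="$2" max_days="${3:-90}"
  local updated_s today_s age_days
  # -u keeps both dates in UTC so daylight saving shifts do not skew the result
  updated_s=$(date -u -d "$updated" +%s) || return 2
  today_s=$(date -u -d "$today" +%s) || return 2
  age_days=$(( (today_s - updated_s) / 86400 ))
  [ "$age_days" -gt "$max_days" ]
}
```

In practice the second argument would be `"$(date -u +%F)"`; passing it explicitly keeps the function easy to test.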

Eazy SaaS Tip: We create runbooks alongside every monitoring alert we deploy. No alert goes live without a corresponding runbook. Our clients report a 60% reduction in mean time to resolve (MTTR) after adopting structured runbooks with quarterly Game Day testing.