Alerting Strategy That Reduces Noise
February 13, 2026 | Monitoring, Alerting, SRE
Severity levels, routing, and quality review.
Building an Alerting Strategy That Reduces Noise
Alert fatigue is real. When engineers receive hundreds of alerts daily, they start ignoring them — and that's when outages happen. A well-designed alerting strategy routes critical issues to the right people while filtering out noise. Here's how to build one.
The Alert Quality Problem
Common symptoms of poor alerting:
- Engineers mute alert channels
- On-call rotations are dreaded rather than manageable
- Real incidents are missed because they're buried in noise
- Every metric has an alert, but nobody knows which ones matter
Severity Classification
| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| P1 - Critical | Customer-facing outage | Immediate | Page on-call + escalation |
| P2 - High | Degraded service, workaround exists | 30 minutes | Page on-call |
| P3 - Medium | Non-critical component down | Next business day | Slack channel |
| P4 - Low | Informational, trend warning | Next sprint | Dashboard only |
Alert Design Principles
- Alert on symptoms, not causes — Alert on "API error rate > 1%" not "CPU > 80%". High CPU that doesn't impact users doesn't need a page.
- Every alert must be actionable — If you can't take action, it's a metric, not an alert.
- Include runbook links — Every alert should link to a runbook with investigation steps.
- Set appropriate thresholds — Use percentile-based thresholds (P95/P99) not averages.
- Use multi-window evaluation — Require sustained threshold breach (3 consecutive periods) to avoid transient spikes.
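The principles above could be expressed as Prometheus alerting rules along the following lines. This is a sketch rather than a drop-in config: the metric names (`http_requests_total`, `http_request_duration_seconds_bucket`), the `job="api"` and `code` labels, the 500 ms latency threshold, the alert names, and the runbook URLs are all assumptions for illustration. With a 1-minute evaluation interval, `for: 3m` roughly corresponds to the "3 consecutive periods" requirement.

```yaml
groups:
  - name: api-symptom-alerts          # hypothetical group name
    interval: 1m                      # evaluate every minute
    rules:
      # Symptom-based: the share of requests that fail, not CPU or memory.
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        for: 3m                       # breach must be sustained, not a single spike
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 3 minutes"
          runbook_url: "https://runbooks.example.com/api-high-error-rate"  # placeholder URL

      # Percentile-based: P99 latency from a histogram instead of an average.
      - alert: ApiHighP99Latency
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
          ) > 0.5
        for: 3m
        labels:
          severity: high
        annotations:
          summary: "API P99 latency above 500 ms for 3 minutes"
          runbook_url: "https://runbooks.example.com/api-high-latency"     # placeholder URL
```

The `severity` label values here line up with the ones matched in the routing example that follows, so each rule lands on the intended receiver.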
Alertmanager Routing Example
```yaml
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # P1: Page immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 15m
    # P2: Page on-call (as written, this pages at all hours)
    - match:
        severity: high
      receiver: 'pagerduty-high'
      repeat_interval: 1h
    # P3: Slack only
    - match:
        severity: medium
      receiver: 'slack-warnings'
      repeat_interval: 12h

# Inhibit downstream alerts when upstream is down
inhibit_rules:
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'Pod.*|Node.*'
    equal: ['cluster']
```
Alert Quality Review Process
Review alerts quarterly using this framework:
- Volume: How many alerts fired? Target: <20 per week per team
- Actionable rate: What % required human action? Target: >80%
- False positive rate: What % were noise? Target: <10%
- Time to acknowledge: How quickly were alerts acknowledged?
- Missing alerts: Were there incidents without corresponding alerts?
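Part of this review can be pulled from metrics that Alertmanager and Prometheus already expose. The sketch below assumes Prometheus scrapes Alertmanager's own metrics endpoint and that alerts carry a `team` label (an assumption; the routing example above does not set one). The group and recording rule names are hypothetical. It only approximates volume: actionable rate and false positive rate need someone to label each alert after the fact, so they cannot be derived this way.

```yaml
groups:
  - name: alert-quality-review        # hypothetical group name
    interval: 1h
    rules:
      # Notifications sent over the past 7 days, per integration (slack, pagerduty, ...).
      # This counts notifications, not distinct alerts, so grouping lowers the number.
      - record: integration:alertmanager_notifications:increase7d
        expr: sum by (integration) (increase(alertmanager_notifications_total[7d]))

      # Alerts firing right now, per team. Relies on the assumed `team` label.
      - record: team:alerts_firing:count
        expr: count by (team) (ALERTS{alertstate="firing"})
```

Graphing the first series week over week gives a rough trend to compare against the volume target before the quarterly review.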
Reducing Alert Noise
- Inhibition rules — Suppress downstream alerts when the root cause is known
- Grouping — Aggregate related alerts into a single notification
- Dead man's switch — Alert on missing data (the absence of expected heartbeat signals)
- Maintenance windows — Silence alerts during planned maintenance
- Alert deduplication — Prevent multiple teams from investigating the same issue
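A dead man's switch is commonly built from an alert that always fires, so that the absence of its notifications signals a broken alerting pipeline. Here is a minimal sketch as Prometheus rules; the alert names, the `job="api"` label, and the `severity: none` value are assumptions, and the external heartbeat monitor that expects the `Watchdog` notification has to be set up separately.

```yaml
groups:
  - name: heartbeat                   # hypothetical group name
    rules:
      # Always firing. Route it to an external heartbeat monitor that pages
      # when the notifications stop arriving.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Alerting pipeline heartbeat"

      # Fires when no `up` series exists for the job at all,
      # which a threshold on the metric's value would never catch.
      - alert: ApiTargetsMissing
        expr: absent(up{job="api"})
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "No scrape targets reporting for job api"
```

With the routing example above, `Watchdog` would fall through to `default-slack`, so it needs a dedicated route with a short `repeat_interval` pointing at the heartbeat receiver.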
Eazy SaaS Tip: We implement a "new alert probation" policy for our clients. New alerts run in informational mode for 2 weeks before going live. This allows tuning thresholds based on real data, eliminating false positives before they cause alert fatigue.
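One way to sketch a probation period like this in Alertmanager is a label that routes new alerts to a receiver with no notification integrations. The `probation` label and the `probation-null` receiver name are hypothetical and are not part of the policy described above; the alert remains visible in Prometheus and on dashboards while its thresholds are tuned.

```yaml
route:
  receiver: 'default-slack'
  routes:
    # Must come before the severity routes: the first matching route wins
    # unless `continue: true` is set.
    - match:
        probation: 'true'
      receiver: 'probation-null'
    - match:
        severity: critical
      receiver: 'pagerduty-critical'

receivers:
  # A receiver with only a name sends no notifications.
  - name: 'probation-null'
  - name: 'default-slack'             # integration config omitted in this sketch
  - name: 'pagerduty-critical'        # integration config omitted in this sketch
```

Graduating an alert is then a matter of removing the `probation: 'true'` label from its rule once the two weeks are up.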