Alerting Strategy That Reduces Noise

February 13, 2026 | Monitoring Alerting SRE

Severity levels, routing, and quality review.

Building an Alerting Strategy That Reduces Noise

Alert fatigue is real. When engineers receive hundreds of alerts daily, they start ignoring them — and that's when outages happen. A well-designed alerting strategy routes critical issues to the right people while filtering out noise. Here's how to build one.

The Alert Quality Problem

Common symptoms of poor alerting:

Engineers mute alert channels
On-call rotations are dreaded rather than manageable
Real incidents are missed because they're buried in noise
Every metric has an alert, but nobody knows which ones matter

Severity Classification

Severity	Definition	Response Time	Notification
P1 - Critical	Customer-facing outage	Immediate	Page on-call + escalation
P2 - High	Degraded service, workaround exists	30 minutes	Page on-call
P3 - Medium	Non-critical component down	Next business day	Slack channel
P4 - Low	Informational, trend warning	Next sprint	Dashboard only

Alert Design Principles

Alert on symptoms, not causes: Alert on "API error rate > 1%", not "CPU > 80%". High CPU that doesn't affect users doesn't need a page.
Every alert must be actionable: If nobody can act on it, it's a metric, not an alert.
Include runbook links: Every alert should link to a runbook with investigation steps.
Set appropriate thresholds: For latency and similar distributions, alert on percentiles (P95/P99) rather than averages, which hide the slow tail.
Use multi-window evaluation: Require a sustained threshold breach (for example, 3 consecutive evaluation periods) so transient spikes don't fire the alert.
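
As a rough illustration of these principles, here is a sketch of two Prometheus alerting rules. The metric names (http_requests_total, http_request_duration_seconds_bucket), the job and code labels, the 500 ms latency threshold, and the runbook URLs are all assumptions made for the example, not values from a real system.

```yaml
groups:
  - name: api-symptoms
    interval: 1m                      # evaluate these rules once a minute
    rules:
      # Symptom-based: the share of requests failing, not CPU or memory.
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api"}[5m]))
            > 0.01
        for: 3m                       # must hold for 3 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate above 1%"
          runbook_url: "https://runbooks.example.com/api-high-error-rate"

      # Percentile-based: P99 latency rather than the average.
      - alert: ApiSlowP99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
          ) > 0.5
        for: 3m
        labels:
          severity: high
        annotations:
          summary: "API P99 latency above 500 ms"
          runbook_url: "https://runbooks.example.com/api-slow-p99"
```

The "for" clause keeps an alert in the pending state until the condition has held across consecutive evaluations, so a single bad scrape never pages anyone. The severity labels use the same names as the classification table.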

Alertmanager Routing Example

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  # P1: Page immediately
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    group_wait: 0s
    repeat_interval: 15m

  # P2: Page on-call
  - match:
      severity: high
    receiver: 'pagerduty-high'
    repeat_interval: 1h

  # P3: Slack only
  - match:
      severity: medium
    receiver: 'slack-warnings'
    repeat_interval: 12h

# Inhibit downstream alerts when upstream is down
inhibit_rules:
- source_match:
    alertname: 'ClusterDown'
  target_match_re:
    alertname: 'Pod.*|Node.*'
  equal: ['cluster']
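
The routing tree refers to receivers by name, so the same file also needs a receivers section. A minimal sketch, assuming PagerDuty Events API v2 integration keys and Slack incoming webhooks; every key, webhook URL, and channel name below is a placeholder, and the dashboard-only receiver is a hypothetical addition:

```yaml
receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'  # placeholder
        channel: '#alerts'                                                # hypothetical channel
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'  # placeholder
        channel: '#alerts-warnings'                                       # hypothetical channel

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_P1_INTEGRATION_KEY'                    # placeholder

  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_P2_INTEGRATION_KEY'                    # placeholder

  # A receiver with no integrations sends nothing; alerts routed here
  # stay visible in dashboards but notify nobody.
  - name: 'dashboard-only'
```

As written, the routing example has no route for low-severity alerts, so they fall through to default-slack. Adding a route that matches severity: low and points at a receiver like dashboard-only would bring the config in line with the "Dashboard only" row of the severity table.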

Alert Quality Review Process

Review alerts quarterly using this framework:

Volume: How many alerts fired? Target: <20 per week per team
Actionable rate: What % required human action? Target: >80%
False positive rate: What % were noise? Target: <10%
Time to acknowledge: How quickly were alerts responded to?
Missing alerts: Were there incidents without corresponding alerts?
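
Volume is the easiest of these to measure automatically. One possible proxy, assuming Prometheus scrapes Alertmanager's own metrics, is a recording rule over alertmanager_notifications_total, which counts notifications sent per integration. The rule name is an invented convention, and notifications are not the same thing as distinct alerts (grouping and repeats both change the count), so treat the result as a trend indicator only.

```yaml
groups:
  - name: alert-quality
    rules:
      # Notifications sent over the past 7 days, per integration
      # (for example pagerduty or slack).
      - record: integration:alertmanager_notifications:increase7d
        expr: sum by (integration) (increase(alertmanager_notifications_total[7d]))
```

Actionable rate and false positive rate can't be derived from these metrics; they still need data from your incident tracker or on-call tool.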

Reducing Alert Noise

Inhibition rules — Suppress downstream alerts when the root cause is known
Grouping — Aggregate related alerts into a single notification
Dead man's switch — Alert on missing data (the absence of expected heartbeat signals)
Maintenance windows — Silence alerts during planned maintenance
Alert deduplication — Prevent multiple teams from investigating the same issue
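
The dead man's switch is the least obvious item on this list, so here is a sketch of what it could look like as Prometheus rules. The job name, durations, and runbook URL are assumptions for the example.

```yaml
groups:
  - name: heartbeat
    rules:
      # Always firing. Route it to an external heartbeat service that
      # pages when the notification STOPS arriving, which means the
      # alerting pipeline itself is broken.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none

      # Fires when no "up" series exists for the job at all, i.e. the
      # expected signal has gone missing rather than gone bad.
      - alert: ApiMetricsMissing
        expr: absent(up{job="api"})
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "No metrics received from the api job for 5 minutes"
          runbook_url: "https://runbooks.example.com/api-metrics-missing"
```

The Watchdog alert needs its own route with a short repeat_interval; under the routing example above it would otherwise fall through to default-slack and add to the noise.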

Eazy SaaS Tip: We implement a "new alert probation" policy for our clients. New alerts run in informational mode for 2 weeks before going live. This allows tuning thresholds based on real data, eliminating false positives before they cause alert fatigue.
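
One way such a probation policy could be wired into Alertmanager is with an extra label on new rules and a route that catches it first. This is a sketch, not a description of any particular setup: the probation label and the slack-probation receiver are hypothetical names.

```yaml
route:
  receiver: 'default-slack'
  routes:
  # Probation: catch new alerts first, whatever their severity
  - match:
      probation: 'true'
    receiver: 'slack-probation'
    repeat_interval: 24h

  # Existing severity routes follow. Probation alerts never reach
  # them, because the first matching route wins unless "continue"
  # is set.
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    group_wait: 0s
    repeat_interval: 15m
```

A new rule would carry probation: "true" under its labels for the first two weeks; removing the label is what takes the alert live. Recent Alertmanager versions prefer the "matchers" syntax over "match", which is kept here for consistency with the routing example.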
