Alerting Strategy That Reduces Noise

February 13, 2026 | Monitoring Alerting SRE

Severity levels, routing, and quality review.

Alert fatigue is real. When engineers receive hundreds of alerts daily, they start ignoring them — and that's when outages happen. A well-designed alerting strategy routes critical issues to the right people while filtering out noise. Here's how to build one.

The Alert Quality Problem

Common symptoms of poor alerting:

  • Engineers mute alert channels
  • On-call rotations are dreaded rather than manageable
  • Real incidents are missed because they're buried in noise
  • Every metric has an alert, but nobody knows which ones matter

Severity Classification

Severity      | Definition                          | Response Time     | Notification
P1 - Critical | Customer-facing outage              | Immediate         | Page on-call + escalation
P2 - High     | Degraded service, workaround exists | 30 minutes        | Page on-call
P3 - Medium   | Non-critical component down         | Next business day | Slack channel
P4 - Low      | Informational, trend warning        | Next sprint       | Dashboard only

Alert Design Principles

  1. Alert on symptoms, not causes: Alert on "API error rate > 1%", not "CPU > 80%". High CPU that doesn't affect users doesn't need a page.
  2. Every alert must be actionable: If nobody can act on it, it's a metric, not an alert.
  3. Include runbook links: Every alert should link to a runbook with investigation steps.
  4. Set appropriate thresholds: Use percentile-based thresholds (p95/p99) rather than averages, which hide the slow tail that users actually experience.
  5. Use multi-window evaluation: Fire only after the threshold has been breached for several consecutive evaluation periods (for example, 3), so that transient spikes don't page anyone.
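
The principles above can be sketched as a Prometheus alerting rule file. This is a minimal illustration, not a drop-in config: the metric names (http_requests_total with a status label, http_request_duration_seconds_bucket), the job name, the 0.5s latency threshold, the runbook URLs, and the 1-minute evaluation interval are all assumptions to replace with your own.

```yaml
groups:
  - name: api-symptoms          # hypothetical group name
    interval: 1m                # assumed evaluation interval
    rules:
      # Principle 1: alert on a user-visible symptom (error rate), not on CPU.
      - alert: ApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="api", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="api"}[5m])) > 0.01
        # Principle 5: the condition must hold for 3m (about three
        # consecutive evaluations at a 1m interval) before the alert fires.
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx error rate above 1%"
          # Principle 3: link straight to the runbook (placeholder URL).
          runbook_url: "https://runbooks.example.com/api-high-error-rate"

      # Principle 4: threshold on a percentile (p99), not on the average.
      - alert: ApiHighLatencyP99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m]))
          ) > 0.5
        for: 3m
        labels:
          severity: high
        annotations:
          summary: "API p99 latency above 500ms"
          runbook_url: "https://runbooks.example.com/api-high-latency"
```

The severity labels here are the values the Alertmanager routing tree matches on, so the rule author decides which severity class an alert lands in.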

Alertmanager Routing Example

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  # P1: Page immediately
  # `matchers` replaces the deprecated `match`/`match_re` fields (Alertmanager 0.22+)
  - matchers:
      - severity="critical"
    receiver: 'pagerduty-critical'
    group_wait: 0s
    repeat_interval: 15m

  # P2: Page on-call
  - matchers:
      - severity="high"
    receiver: 'pagerduty-high'
    repeat_interval: 1h

  # P3: Slack only
  - matchers:
      - severity="medium"
    receiver: 'slack-warnings'
    repeat_interval: 12h

# Inhibit downstream alerts when upstream is down
inhibit_rules:
- source_matchers:
    - alertname="ClusterDown"
  target_matchers:
    - alertname=~"Pod.*|Node.*"
  equal: ['cluster']
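
The routing tree refers to four receivers that the excerpt does not define. A minimal sketch of what they might look like follows; the channel names, webhook URL, and PagerDuty routing keys are placeholders, not values from the original setup.

```yaml
receivers:
  - name: 'default-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<your-webhook-path>'  # placeholder
        channel: '#alerts'            # assumed channel name
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<your-webhook-path>'  # placeholder
        channel: '#alerts-warnings'   # assumed channel name
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<events-api-v2-key-for-p1-service>'   # placeholder

  - name: 'pagerduty-high'
    pagerduty_configs:
      - routing_key: '<events-api-v2-key-for-p2-service>'   # placeholder
```

Using separate PagerDuty services for P1 and P2 is one way to give each severity its own escalation policy; a single service with different urgency rules would also work.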

Alert Quality Review Process

Review alerts quarterly using this framework:

  • Volume: How many alerts fired? Target: <20 per week per team
  • Actionable rate: What % required human action? Target: >80%
  • False positive rate: What % were noise? Target: <10%
  • Time to acknowledge: How quickly were alerts responded to?
  • Missing alerts: Were there incidents without corresponding alerts?

Reducing Alert Noise

  • Inhibition rules — Suppress downstream alerts when the root cause is known
  • Grouping — Aggregate related alerts into a single notification
  • Dead man's switch — Alert on missing data (the absence of expected heartbeat signals)
  • Maintenance windows — Silence alerts during planned maintenance
  • Alert deduplication — Prevent multiple teams from investigating the same issue
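
Two of the techniques above are easy to misconfigure, so here is a hedged sketch of each. First, a dead man's switch is commonly built from a Prometheus rule that always fires; the alert name and label below follow a widespread convention but are assumptions here.

```yaml
# Prometheus rule file
groups:
  - name: meta-monitoring       # hypothetical group name
    rules:
      - alert: Watchdog
        expr: vector(1)         # always true, so this alert fires continuously
        labels:
          severity: none
        annotations:
          summary: "Heartbeat proving the alerting pipeline is alive"
```

Route this alert to an external heartbeat service that raises an incident when the notifications stop arriving. The absence of the signal is what triggers the page.

Second, recurring maintenance windows can be expressed as Alertmanager time intervals. The interval name, schedule, and the affected route are illustrative, and the top-level time_intervals key assumes a reasonably recent Alertmanager release.

```yaml
# Alertmanager config fragment
time_intervals:
  - name: weekly-maintenance    # hypothetical name
    time_intervals:
      - weekdays: ['sunday']
        times:
          - start_time: '02:00' # UTC unless a location is configured
            end_time: '04:00'

route:
  receiver: 'default-slack'
  routes:
    - matchers:
        - severity="medium"
      receiver: 'slack-warnings'
      # Notifications on this route are muted during the window.
      mute_time_intervals: ['weekly-maintenance']
```

For one-off planned work, a manually created silence with an explicit expiry is usually a better fit than a recurring interval.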

Eazy SaaS Tip: We implement a "new alert probation" policy for our clients. New alerts run in informational mode for 2 weeks before going live. This allows tuning thresholds based on real data, eliminating false positives before they cause alert fatigue.
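
One way to implement such a probation policy, sketched under assumptions: new rules carry a probation="true" label (a hypothetical label name), and a route placed ahead of the severity routes sends them to a review channel instead of paging. The receiver name is also hypothetical.

```yaml
route:
  receiver: 'default-slack'
  routes:
    # Must come first: Alertmanager stops at the first matching route
    # unless `continue: true` is set, so probation alerts never reach
    # the paging routes below.
    - matchers:
        - probation="true"
      receiver: 'slack-alert-probation'   # hypothetical low-noise receiver
      repeat_interval: 24h

    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
```

Promoting an alert after the two weeks is then a one-line change in the rule file: remove the probation label once the threshold has been tuned against real data.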