Kubernetes Monitoring Metrics Guide

February 13, 2026 | Kubernetes Monitoring Grafana

Essential metrics with dashboards.

Essential Kubernetes Monitoring Metrics

Monitoring Kubernetes effectively requires understanding which metrics matter at each layer: cluster, node, pod, and container. This guide identifies the essential metrics for each layer and provides ready-to-use PromQL queries for your Grafana dashboards.

Cluster-Level Metrics

| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Node count | Below minimum | Cluster capacity issues |
| Pod scheduling failures | > 0 for 5 min | Resource exhaustion |
| API server latency | P99 > 1s | Control plane health |
| etcd leader changes | > 0 | etcd stability |

# Cluster CPU allocation percentage
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"}) * 100

# Pending pods (includes unschedulable pods as well as pods still pulling images or starting)
kube_pod_status_phase{phase="Pending"} > 0
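
The cluster-level table also lists node count, API server latency, and etcd leader changes, which the two queries above do not cover. A minimal sketch of how they might be expressed, assuming kube-state-metrics is installed and the API server and etcd metrics endpoints are scraped. The metric names come from those components rather than from this guide, and the floor of 3 nodes is a placeholder:

```promql
# Ready nodes below a minimum (3 is a placeholder; substitute your own floor)
sum(kube_node_status_condition{condition="Ready", status="true"}) < 3

# API server P99 request latency per verb, excluding long-lived WATCH/CONNECT requests
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1

# etcd leader changes over the last hour
increase(etcd_server_leader_changes_seen_total[1h]) > 0
```

On managed control planes the etcd endpoint is often not exposed, so the last query may return no data there.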

Node-Level Metrics

| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| CPU utilization | > 85% for 15 min | Compute pressure |
| Memory utilization | > 90% for 10 min | OOM risk |
| Disk utilization | > 85% | Eviction threshold |
| Network errors | > 0 sustained | Network issues |

# Node CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory utilization
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Disk utilization (root filesystem)
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
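
The node-level table lists network errors, but none of the queries above covers them. One possible query, assuming node_exporter's network collector is enabled. The metric names and the choice to skip the loopback device are assumptions, not something the original guide specifies:

```promql
# Receive plus transmit errors per second, per node and interface (loopback excluded)
rate(node_network_receive_errs_total{device!="lo"}[5m])
+ rate(node_network_transmit_errs_total{device!="lo"}[5m]) > 0
```

To match the "sustained" wording in the table, an alerting rule would typically pair this expression with a `for:` duration rather than fire on a single sample.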

Pod-Level Metrics

| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Restart count | > 3 in 1 hour | CrashLoopBackOff |
| CPU throttling | > 25% | CPU limits too tight |
| Memory vs limit | > 90% | OOMKill imminent |
| Ready status | Not ready > 2 min | Health check failures |

# Pod restart rate
increase(kube_pod_container_status_restarts_total[1h]) > 3

# CPU throttling percentage
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
/ sum(rate(container_cpu_cfs_periods_total[5m])) by (pod) * 100

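# Memory usage vs limit (both sides aggregated to shared labels so cAdvisor and kube-state-metrics series match)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/ sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"}) * 100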
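
The pod-level table also lists ready status, which has no query above. A sketch of one way to find pods that are running but not ready, assuming kube-state-metrics. The metric names are taken from that project, and the filter on running pods is an added assumption to keep completed Job pods out of the result:

```promql
# Pods in the Running phase whose Ready condition is false
(sum by (namespace, pod) (kube_pod_status_ready{condition="false"}) == 1)
and
(sum by (namespace, pod) (kube_pod_status_phase{phase="Running"}) == 1)
```

The "> 2 min" part of the threshold is not expressed in the query itself; in a Prometheus alerting rule it would normally be set with `for: 2m`.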

Application-Level Metrics (RED Method)

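The RED method (Rate, Errors, Duration) captures the user experience. The queries below assume your services expose an http_requests_total counter with status and service labels and an http_request_duration_seconds histogram; adjust the names to match your instrumentation: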

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100

# Duration (P95 latency)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
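
A percentile is easier to interpret next to the mean. As a companion to the P95 query, the average request duration can be derived from the same histogram's `_sum` and `_count` series. This is a sketch that assumes the same metric and label names as the queries above:

```promql
# Duration (mean latency per service, in seconds)
sum by (service) (rate(http_request_duration_seconds_sum[5m]))
/ sum by (service) (rate(http_request_duration_seconds_count[5m]))
```

A P95 far above the mean usually indicates a slow tail that the average alone would hide.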

Dashboard Layout

Organize your Grafana dashboard in layers:

  1. Row 1: Cluster Overview — Total CPU/Memory allocation, pod count, node count
  2. Row 2: Node Health — Per-node CPU, memory, disk, network
  3. Row 3: Workload Metrics — Deployment replica status, pod restarts, HPA activity
  4. Row 4: Application RED — Request rate, error rate, latency per service
  5. Row 5: Alerts — Active alerts panel and recent alert history
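
Several panels in this layout have no query earlier in the guide: memory allocation and pod count in Row 1, and replica status and HPA activity in Row 3. The following sketch shows what they might look like, assuming kube-state-metrics v2 metric names, which are not confirmed by the original guide:

```promql
# Row 1: cluster memory allocation percentage
sum(kube_pod_container_resource_requests{resource="memory"}) / sum(kube_node_status_allocatable{resource="memory"}) * 100

# Row 1: pod count by phase
sum by (phase) (kube_pod_status_phase)

# Row 3: Deployments with fewer available replicas than desired
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0

# Row 3: HPAs currently at their maximum replica count
kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
```

For Row 2 and Row 4, the node-level and RED queries from the earlier sections can be reused as panel queries directly.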

Eazy SaaS Tip: We provide a Grafana dashboard template with all these metrics pre-configured. Import it once, and you get comprehensive K8s monitoring in 5 minutes. We update the template quarterly as Kubernetes evolves.