Kubernetes Monitoring Metrics Guide

February 13, 2026 | Kubernetes Monitoring Grafana

Essential metrics with dashboards.

Essential Kubernetes Monitoring Metrics

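Monitoring Kubernetes effectively requires understanding which metrics matter at each layer: cluster, node, pod, and application. This guide identifies the essential metrics for each layer and provides ready-to-use PromQL queries for your Grafana dashboards. The queries rely on metrics exposed by kube-state-metrics (kube_*), node_exporter (node_*), and cAdvisor (container_*), so all three need to be scraped by Prometheus.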

Cluster-Level Metrics

Metric	Alert Threshold	Why It Matters
Node count	Below minimum	Cluster capacity issues
Pod scheduling failures	> 0 for 5 min	Resource exhaustion
API server latency	P99 > 1s	Control plane health
etcd leader changes	> 0	etcd stability

# Cluster CPU allocation percentage
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"}) * 100

# Pending pods (includes pods the scheduler cannot place)
kube_pod_status_phase{phase="Pending"} > 0
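
The other rows of the cluster table can be approximated with queries along these lines. This is a sketch, assuming kube-state-metrics is installed and that the API server and etcd metrics endpoints are scraped. The minimum node count of 3 is a placeholder, not a recommendation. Durations such as "for 5 min" belong in the alerting rule's `for` field rather than in the query itself.

```promql
# Ready node count; alert when below your minimum (3 is a placeholder)
sum(kube_node_status_condition{condition="Ready", status="true"}) < 3

# Pods the scheduler has marked unschedulable (narrower than the Pending phase)
sum(kube_pod_status_unschedulable) > 0

# API server P99 request latency above 1s, excluding long-lived requests
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 1

# etcd leader changes seen in the last hour
increase(etcd_server_leader_changes_seen_total[1h]) > 0
```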

Node-Level Metrics

Metric	Alert Threshold	Why It Matters
CPU utilization	> 85% for 15 min	Compute pressure
Memory utilization	> 90% for 10 min	OOM risk
Disk utilization	> 85%	Eviction threshold
Network errors	> 0 sustained	Network issues

# Node CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Node memory utilization
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))

# Root filesystem utilization
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
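
The network errors row of the table has no query above. A possible version, assuming node_exporter's default network device collector is enabled, looks like this. Excluding the loopback device is an assumption about what you want to alert on.

```promql
# Receive errors per second, per node and interface
rate(node_network_receive_errs_total{device!="lo"}[5m]) > 0

# Transmit errors per second, per node and interface
rate(node_network_transmit_errs_total{device!="lo"}[5m]) > 0
```

To match the "sustained" threshold, pair each expression with a `for` duration in the alerting rule so that a single burst of errors does not fire an alert.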

Pod-Level Metrics

Metric	Alert Threshold	Why It Matters
Restart count	> 3 in 1 hour	CrashLoopBackOff
CPU throttling	> 25%	CPU limits too tight
Memory vs limit	> 90%	OOMKill imminent
Ready status	Not ready > 2 min	Health check failures

# Pod restart rate
increase(kube_pod_container_status_restarts_total[1h]) > 3

# CPU throttling percentage
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
/ sum(rate(container_cpu_cfs_periods_total[5m])) by (pod) * 100

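# Memory usage vs limit (aggregate both sides to shared labels so the series match)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/ sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"}) * 100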
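
The Ready status row can be covered with kube-state-metrics as well. This is a minimal sketch: restricting the result to Running pods is an assumption made here so that completed Job pods do not show up as not ready.

```promql
# Running pods whose Ready condition is currently false
kube_pod_status_ready{condition="false"} == 1
and on (namespace, pod) kube_pod_status_phase{phase="Running"} == 1
```

The "> 2 min" part of the threshold would be expressed as `for: 2m` in the alerting rule.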

Application-Level Metrics (RED Method)

The RED method (Rate, Errors, Duration) captures the user experience:

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100

# Duration (P95 latency)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Dashboard Layout

Organize your Grafana dashboard in layers:

Row 1: Cluster Overview — Total CPU/Memory allocation, pod count, node count
Row 2: Node Health — Per-node CPU, memory, disk, network
Row 3: Workload Metrics — Deployment replica status, pod restarts, HPA activity
Row 4: Application RED — Request rate, error rate, latency per service
Row 5: Alerts — Active alerts panel and recent alert history
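
Some panels in this layout have no query earlier in the guide. The following sketch fills in a few of them for Rows 1 and 3, assuming kube-state-metrics v2 metric names (older releases expose HPA metrics under a `kube_hpa_` prefix).

```promql
# Row 1: cluster memory allocation percentage
sum(kube_pod_container_resource_requests{resource="memory"}) / sum(kube_node_status_allocatable{resource="memory"}) * 100

# Row 1: running pod count
sum(kube_pod_status_phase{phase="Running"})

# Row 3: deployments with fewer available replicas than desired
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0

# Row 3: HPAs running at their configured maximum
kube_horizontalpodautoscaler_status_current_replicas >= kube_horizontalpodautoscaler_spec_max_replicas
```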

Eazy SaaS Tip: We provide a Grafana dashboard template with all these metrics pre-configured. Import it once, and you get comprehensive K8s monitoring in 5 minutes. We update the template quarterly as Kubernetes evolves.
