Kubernetes Monitoring Metrics Guide
February 13, 2026
Kubernetes
Monitoring
Grafana
Essential metrics with dashboards.
Essential Kubernetes Monitoring Metrics
Monitoring Kubernetes effectively requires understanding which metrics matter at each layer: cluster, node, pod, and container. This guide identifies the essential metrics for each layer and provides ready-to-use PromQL queries for your Grafana dashboards.
Cluster-Level Metrics
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Node count | Below minimum | Cluster capacity issues |
| Pod scheduling failures | > 0 for 5 min | Resource exhaustion |
| API server latency | P99 > 1s | Control plane health |
| etcd leader changes | > 0 | etcd stability |
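Several rows in the table above can be expressed directly as queries. The following is a minimal sketch, assuming your Prometheus scrapes the API server and etcd metrics endpoints (managed Kubernetes services often do not expose etcd) and that kube-state-metrics is installed. The 15-minute window for leader changes is an illustrative choice, not a value from the table.

```promql
# API server P99 request latency per verb, in seconds.
# WATCH and CONNECT are excluded because they are long-running by design.
histogram_quantile(
  0.99,
  sum by (le, verb) (
    rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
  )
) > 1

# etcd leader changes seen in the last 15 minutes (window is an assumption)
max(increase(etcd_server_leader_changes_seen_total[15m])) > 0

# Nodes currently reporting Ready, to compare against your own minimum
sum(kube_node_status_condition{condition="Ready", status="true"})

# Pods the scheduler has marked unschedulable
sum(kube_pod_status_unschedulable) > 0
```

Pair each expression with a hold duration in your alerting rules (for example, 5 minutes for scheduling failures) so brief spikes do not page anyone.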
# Cluster CPU allocation percentage
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"}) * 100
# Pending pods (includes pods the scheduler cannot place)
kube_pod_status_phase{phase="Pending"} > 0
Node-Level Metrics
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| CPU utilization | > 85% for 15 min | Compute pressure |
| Memory utilization | > 90% for 10 min | OOM risk |
| Disk utilization | > 85% | Eviction threshold |
| Network errors | > 0 sustained | Network issues |
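The network errors row can be covered with node_exporter's interface counters. A sketch, assuming node_exporter's default network device collector is enabled. The device filter that drops loopback and veth interfaces is an illustrative assumption and should be adapted to your CNI plugin.

```promql
# Receive plus transmit errors per second, per node and interface
rate(node_network_receive_errs_total{device!~"lo|veth.*"}[5m])
+ rate(node_network_transmit_errs_total{device!~"lo|veth.*"}[5m]) > 0
```

Because `+` binds more tightly than `>`, the comparison applies to the combined error rate rather than to the transmit side alone.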
# Node CPU utilization
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Node memory utilization
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
# Root filesystem utilization
100 * (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
Pod-Level Metrics
| Metric | Alert Threshold | Why It Matters |
|---|---|---|
| Restart count | > 3 in 1 hour | CrashLoopBackOff |
| CPU throttling | > 25% | CPU limits too tight |
| Memory vs limit | > 90% | OOMKill imminent |
| Ready status | Not ready > 2 min | Health check failures |
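Readiness and OOM kills are both reported by kube-state-metrics. The queries below are a sketch, assuming kube-state-metrics v2 metric names. Multiplying by the Running phase is one way to keep completed Job pods, which also report as not ready, out of the result.

```promql
# Pods that are Running but not Ready
sum by (namespace, pod) (kube_pod_status_ready{condition="false"})
* on (namespace, pod)
sum by (namespace, pod) (kube_pod_status_phase{phase="Running"}) > 0

# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```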
# Container restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3
# CPU throttling percentage (grouped by namespace too, so same-named pods are not merged)
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
/ sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod) * 100
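# Memory usage vs limit (both sides aggregated to shared labels so cAdvisor and kube-state-metrics series can be matched)
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/ sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"}) * 100
Application-Level Metrics (RED Method)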
The RED method (Rate, Errors, Duration) captures the user experience:
# Request Rate
sum(rate(http_requests_total[5m])) by (service)
# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100
# Duration (P95 latency)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Dashboard Layout
Organize your Grafana dashboard in layers:
- Row 1: Cluster Overview — Total CPU/Memory allocation, pod count, node count
- Row 2: Node Health — Per-node CPU, memory, disk, network
- Row 3: Workload Metrics — Deployment replica status, pod restarts, HPA activity
- Row 4: Application RED — Request rate, error rate, latency per service
- Row 5: Alerts — Active alerts panel and recent alert history
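Row 3 refers to replica status and HPA activity, which can be read from kube-state-metrics. A sketch of possible panel queries, assuming kube-state-metrics v2 (v1 releases exposed the HPA series under a `kube_hpa_` prefix instead):

```promql
# Deployments with fewer available replicas than desired
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0

# HPAs currently scaled to their configured maximum
kube_horizontalpodautoscaler_status_current_replicas
>= kube_horizontalpodautoscaler_spec_max_replicas
```

An HPA that stays at its maximum for long periods usually means the ceiling is too low for the workload, so this panel works well next to the CPU throttling and request rate graphs.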
Eazy SaaS Tip: We provide a Grafana dashboard template with all these metrics pre-configured. Import it once, and you get comprehensive K8s monitoring in 5 minutes. We update the template quarterly as Kubernetes evolves.