Prometheus + Grafana K8s Monitoring

February 13, 2026 | Kubernetes Prometheus Grafana

kube-prometheus-stack with alerts.

Prometheus + Grafana: Complete K8s Monitoring

The kube-prometheus-stack Helm chart is a widely used starting point for Kubernetes monitoring. It deploys the Prometheus Operator, Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notifications, together with node-exporter and kube-state-metrics, all pre-configured with dashboards and alerts for Kubernetes components.

Installation with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# SecureP@ss is an example only. Replace it before running; values passed
# with --set can end up in your shell history.
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword='SecureP@ss' \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi
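
The same settings are usually easier to review and version-control as a values file. The following is a minimal sketch of the equivalent file, assuming it is saved as values.yaml (a file name chosen here for illustration) and passed to helm install with -f values.yaml in place of the --set flags:

```yaml
# values.yaml: mirrors the --set flags used in the install command
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi

grafana:
  adminPassword: "SecureP@ss"  # example value only; replace before use
```

Keeping the password out of this file (for example by supplying it from a secret at deploy time) is worth considering if the file is committed to a repository.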

What You Get Out of the Box

  • 20+ Grafana dashboards: cluster overview, node metrics, pod metrics, namespace metrics
  • 100+ Prometheus recording rules: pre-computed series that keep dashboards fast
  • 50+ alerting rules: KubePodCrashLooping, KubeNodeNotReady, TargetDown, etc.
  • ServiceMonitor and PodMonitor CRDs: declarative scrape configuration that discovers metrics endpoints automatically

Adding Custom ServiceMonitors

Scrape metrics from your own applications:

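apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: monitoring
  labels:
    # With the chart's default selectors, Prometheus only picks up
    # ServiceMonitors labeled with the Helm release name.
    release: kube-prometheus
spec:
  # Selects Services (not pods) carrying this label
  selector:
    matchLabels:
      app: api-service
  # Namespaces in which to look for matching Services
  namespaceSelector:
    matchNames:
    - production
  endpoints:
  - port: metrics      # name of the Service port, not its number
    interval: 30s
    path: /metrics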
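
A ServiceMonitor selects Services by label and refers to the scrape port by name, so the target Service has to carry both. Here is a sketch of a matching Service, assuming the application pods are labeled app: api-service and expose metrics on port 8080 (both are assumptions made for this example):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production        # listed in the ServiceMonitor's namespaceSelector
  labels:
    app: api-service           # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: api-service           # assumed pod label
  ports:
  - name: metrics              # matched by "port: metrics" in the ServiceMonitor
    port: 8080                 # hypothetical port number
    targetPort: 8080
```

If the target does not show up in Prometheus, the usual suspects are a label mismatch on the Service or a port that is numbered but not named.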

Configuring Alertmanager

Route alerts to Slack, PagerDuty, or email:

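# Helm values for the chart (pass with -f or merge into your values file)
alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/xxx'  # placeholder webhook URL
    route:
      # Default receiver for alerts that match no child route
      receiver: 'slack-critical'
      routes:
      - matchers:
        - 'severity="critical"'
        receiver: 'pagerduty'
      - matchers:
        - 'severity="warning"'
        receiver: 'slack-warnings'
    receivers:
    - name: 'slack-critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    # Every receiver referenced by a route must be defined, or Alertmanager
    # rejects the configuration. The channel name here is an example.
    - name: 'slack-warnings'
      slack_configs:
      - channel: '#alerts-warnings'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'pagerduty'
      pagerduty_configs:
      # service_key is for Events API v1 integrations; Events API v2 uses routing_key
      - service_key: 'your-pagerduty-key'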
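
Email is mentioned as a target but not shown in the configuration. Below is a sketch of what an email receiver could look like, to be merged into the same alertmanager.config block. The SMTP host, addresses, credentials, and the receiver name email-oncall are all placeholders invented for this example:

```yaml
alertmanager:
  config:
    global:
      smtp_smarthost: 'smtp.example.com:587'       # placeholder SMTP relay
      smtp_from: 'alertmanager@example.com'        # placeholder sender address
      smtp_auth_username: 'alertmanager'           # placeholder credentials
      smtp_auth_password: 'change-me'
    receivers:
    - name: 'email-oncall'                         # hypothetical receiver name
      email_configs:
      - to: 'oncall@example.com'                   # placeholder recipient
        send_resolved: true                        # also notify when the alert clears
```

The receiver only takes effect once a route points at it, and it should be appended to the existing receivers list rather than replacing it.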

Essential Dashboards

  • Cluster Overview — CPU, memory, disk usage across all nodes
  • Namespace Metrics — Resource consumption per namespace for cost allocation
  • Pod Metrics — Individual pod CPU, memory, network, restarts
  • Node Exporter — Detailed host metrics (disk I/O, network, filesystem)
  • CoreDNS — DNS query rates, errors, and latency

Custom PromQL Queries

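# CPU usage (cores) by namespace
# container!="" skips the aggregated pod-level cgroup series that would double-count usage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)

# Memory usage percentage per node
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Containers that restarted in the last hour (a restart count, not a rate)
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Request rate by service
# Assumes your applications export http_requests_total with a service label
sum(rate(http_requests_total[5m])) by (service)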
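
Any of these queries can be turned into an alert by wrapping it in a PrometheusRule resource. The sketch below builds on the restart query; the resource name, alert name, threshold of 3 restarts, and 10-minute hold time are illustrative choices, and the release label assumes the chart's default rule selector and the kube-prometheus release name used during installation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restart-alerts               # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus             # assumed to match the chart's default rule selector
spec:
  groups:
  - name: pod-restarts
    rules:
    - alert: ContainerRestartingFrequently   # hypothetical alert name
      # Fires when a container restarted more than 3 times in the last hour
      expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
      for: 10m                           # condition must hold this long before firing
      labels:
        severity: warning                # routed by the severity matchers in Alertmanager
      annotations:
        description: 'Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in the last hour.'
```

The description annotation is the field the Slack message template reads, so setting it keeps custom alerts readable in notifications.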

Storage Sizing

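Prometheus storage requirements depend mainly on cardinality (the number of unique time series), the scrape interval, and the retention period. A common rule of thumb is: disk space ≈ retention in seconds × ingested samples per second × 1 to 2 bytes per sample. The figures below are rough estimates that include some headroom: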

  • Small cluster (10 nodes, 100 pods): ~50,000 series, 20GB for 30 days
  • Medium cluster (50 nodes, 500 pods): ~250,000 series, 100GB for 30 days
  • Large cluster (200 nodes, 2000 pods): ~1M series, 400GB for 30 days

For long-term storage, integrate with Thanos or Cortex, which offload data to object storage such as S3 so that retention is no longer bound by local disk size.

Eazy SaaS Tip: We deploy kube-prometheus-stack in every Kubernetes cluster with custom dashboards for the four golden signals (latency, traffic, errors, saturation) plus cost allocation by namespace. This gives teams immediate visibility and reduces mean time to detection from hours to minutes.