Prometheus + Grafana: Complete K8s Monitoring
The kube-prometheus-stack Helm chart is a widely used starting point for Kubernetes monitoring. It deploys Prometheus for metrics collection, Grafana for visualization, and Alertmanager for notifications, all pre-configured with dashboards and alerts for Kubernetes components.
Installation with Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace the example Grafana password before running this.
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword='SecureP@ss' \
  --set alertmanager.alertmanagerSpec.storage.volumeClaimTemplate.spec.resources.requests.storage=10Gi

What You Get Out of the Box
- 20+ Grafana dashboards: cluster overview, node metrics, pod metrics, namespace metrics
- 100+ Prometheus recording rules: pre-computed metrics that keep dashboards fast
- 50+ alerting rules: KubePodCrashLooping, KubeNodeNotReady, TargetDown, etc.
- ServiceMonitor CRDs: declarative scrape targets, so Prometheus discovers metrics endpoints from labeled Services
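The --set flags from the install command can also be kept in a values file, which is easier to review and version than a long command line. The following is a minimal sketch that mirrors those flags; the file name values.yaml is an assumption, and the password is a placeholder to replace.

```yaml
# values.yaml (file name is an assumption)
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 50Gi
    # Optional: uncomment so Prometheus selects ServiceMonitors even when
    # they do not carry the Helm release label.
    # serviceMonitorSelectorNilUsesHelmValues: false

grafana:
  adminPassword: change-me   # placeholder, not a real credential

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi
```

It would be applied with: helm upgrade --install kube-prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace -f values.yaml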
Adding Custom ServiceMonitors
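A ServiceMonitor selects Kubernetes Services by label and scrapes a named port on them, so the application needs a Service that exposes its metrics port by name. Below is a minimal sketch of a Service that the ServiceMonitor in this section would match. It assumes the pods are labeled app: api-service and serve metrics on port 8080; the port number is hypothetical.

```yaml
# Hypothetical Service for the api-service application.
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production        # must be listed in the ServiceMonitor's namespaceSelector
  labels:
    app: api-service           # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: api-service           # assumes the pods carry this label
  ports:
    - name: metrics            # referenced by name in the ServiceMonitor's endpoints.port
      port: 8080               # assumed port; use whatever the app serves /metrics on
      targetPort: 8080
```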
Scrape metrics from your own applications:
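apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-service
  namespace: monitoring
  labels:
    # With the chart defaults, Prometheus only picks up ServiceMonitors
    # labeled with the Helm release name (kube-prometheus in the install command).
    release: kube-prometheus
spec:
  selector:
    matchLabels:
      app: api-service        # label on the Service, not the pod
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: metrics           # name of the Service port
      interval: 30s
      path: /metrics

Configuring Alertmanager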
Route alerts to Slack, PagerDuty, or email by overriding alertmanager.config in the chart's Helm values:
alertmanager:
  config:
    global:
      slack_api_url: 'https://hooks.slack.com/services/xxx'
    route:
      receiver: 'slack-critical'    # default when no child route matches
      routes:
        - matchers:
            - severity = "critical"
          receiver: 'pagerduty'
        - matchers:
            - severity = "warning"
          receiver: 'slack-warnings'
    receivers:
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: '{{ .CommonLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      # Every receiver named in a route must be defined, or Alertmanager rejects the config.
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'    # example channel name
      - name: 'pagerduty'
        pagerduty_configs:
          - service_key: 'your-pagerduty-key'

Essential Dashboards
- Cluster Overview — CPU, memory, disk usage across all nodes
- Namespace Metrics — Resource consumption per namespace for cost allocation
- Pod Metrics — Individual pod CPU, memory, network, restarts
- Node Exporter — Detailed host metrics (disk I/O, network, filesystem)
- CoreDNS — DNS query rates, errors, and latency
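Custom dashboards can sit alongside the bundled ones. With the chart's default Grafana settings, a sidecar container loads dashboard JSON from ConfigMaps that carry the grafana_dashboard label. The following is a minimal sketch, assuming those defaults are unchanged; the ConfigMap name, dashboard title, uid, and panel are all hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: namespace-cpu-dashboard      # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"           # label the Grafana sidecar watches for (chart default)
data:
  namespace-cpu.json: |
    {
      "title": "Namespace CPU (example)",
      "uid": "namespace-cpu-example",
      "panels": [
        {
          "type": "timeseries",
          "title": "CPU cores by namespace",
          "gridPos": { "h": 8, "w": 24, "x": 0, "y": 0 },
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"\"}[5m])) by (namespace)"
            }
          ]
        }
      ]
    }
```

Keeping dashboards in ConfigMaps means they can be versioned and deployed with the rest of the cluster manifests instead of being edited by hand in the Grafana UI.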
Custom PromQL Queries
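# CPU usage (cores) by namespace; container!="" skips the pod-level series that would double-count
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)

# Memory usage percentage per node
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Containers that restarted at least once in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 0

# Request rate by service (assumes your application exports http_requests_total with a service label)
sum(rate(http_requests_total[5m])) by (service)

Storage Sizing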
Prometheus storage requirements depend mainly on cardinality (the number of unique time series), along with scrape interval and retention:
- Small cluster (10 nodes, 100 pods): ~50,000 series, 20GB for 30 days
- Medium cluster (50 nodes, 500 pods): ~250,000 series, 100GB for 30 days
- Large cluster (200 nodes, 2000 pods): ~1M series, 400GB for 30 days
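These figures are consistent with the capacity formula in the Prometheus documentation, which estimates disk use from retention time, ingestion rate, and bytes per sample (typically 1 to 2 bytes after compression). Here is a worked check for the small-cluster row, assuming a 15-second scrape interval and 2 bytes per sample; both values are assumptions, not stated in the list above.

```latex
\begin{aligned}
\text{disk} &\approx \text{retention (s)} \times \text{samples/s} \times \text{bytes/sample} \\
\text{samples/s} &= \frac{50{,}000\ \text{series}}{15\ \text{s}} \approx 3{,}333 \\
\text{disk} &\approx 2{,}592{,}000\ \text{s} \times 3{,}333\ \text{s}^{-1} \times 2\ \text{B}
  \approx 1.7 \times 10^{10}\ \text{B} \approx 17\ \text{GB}
\end{aligned}
```

Under the same assumptions, 250,000 series come to roughly 86 GB and 1M series to roughly 346 GB, so the listed sizes leave some headroom. At a 30-second scrape interval the estimates halve.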
For long-term storage, integrate with Thanos or Cortex. Both keep data in object storage such as S3, so retention is bounded by bucket cost rather than by local disk size.
Eazy SaaS Tip: We deploy kube-prometheus-stack in every Kubernetes cluster with custom dashboards for the 4 golden signals (latency, traffic, errors, saturation) plus cost allocation by namespace. This gives teams instant visibility and reduces mean time to detection from hours to minutes.
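As one way to back golden-signal dashboards, the underlying queries can be precomputed with recording rules in a PrometheusRule resource. This is a sketch under stated assumptions, not necessarily the setup described above. It assumes the application exports http_requests_total (with service and status labels) and http_request_duration_seconds_bucket, and the resource and rule names are hypothetical. The release label follows the chart's default rule selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: golden-signals            # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus      # matches the Helm release so the chart's Prometheus loads it
spec:
  groups:
    - name: golden-signals.rules
      rules:
        # Traffic: requests per second by service
        - record: service:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (service)
        # Errors: share of requests answered with a 5xx status
        - record: service:http_requests_errors:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              /
            sum(rate(http_requests_total[5m])) by (service)
        # Latency: 95th percentile request duration in seconds
        - record: service:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
        # Saturation: CPU usage as a fraction of CPU limits, by namespace
        - record: namespace:container_cpu_usage:limit_ratio
          expr: |
            sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
              /
            sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace)
```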