Kubernetes HPA Setup and Tuning
HPA v2 with custom Prometheus metrics.
Horizontal Pod Autoscaler (HPA) Overview
The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas based on observed metrics. The autoscaling/v2 API supports CPU, memory, and custom metrics (for example, metrics served from Prometheus), so scaling can follow your application's actual load patterns.
Basic CPU-Based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
This scales the api-service Deployment between 2 and 10 replicas, adding pods when average CPU utilization, measured against each pod's CPU request, exceeds 70%.
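To make the 70% target concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes documentation: desired = ceil(current replicas × current value / target value), skipped when the ratio is within a tolerance (0.1 by default) and clamped to minReplicas and maxReplicas. The function name and arguments are illustrative only, and the sketch ignores details such as not-ready pods, missing metrics, and behavior policies.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 90% CPU against a 70% target: 4 * 90/70 = 5.14 -> 6
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=10))
```

With the same settings, an average of 72% leaves the count at 4 because the ratio (about 1.03) falls inside the tolerance band.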
Multi-Metric Scaling
Combine CPU and memory metrics for more accurate scaling decisions:
spec:
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
When multiple metrics are specified, HPA calculates the desired replica count for each metric and uses the highest value.
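The "highest value wins" rule can be sketched in a few lines of Python. This is a simplified illustration, not the controller's code: the helper name is hypothetical, and tolerance handling and min/max clamping are left out to keep the focus on how the per-metric proposals are combined.

```python
import math

def replicas_for_metrics(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs, one per metric."""
    proposals = [math.ceil(current_replicas * current / target)
                 for current, target in metrics]
    # The largest per-metric proposal is the one that takes effect.
    return max(proposals)

# 5 replicas: CPU at 50% (target 70%) proposes 4, memory at 95% (target 80%) proposes 6.
print(replicas_for_metrics(5, [(50, 70), (95, 80)]))  # 6
```

In this example memory drives the decision even though CPU alone would have allowed a scale-down.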
Custom Metrics with Prometheus
Scale based on application-specific metrics like request queue depth or active connections:
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
This requires the Prometheus Adapter to be installed, which exposes Prometheus metrics as Kubernetes custom metrics:
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=9090
Scaling Behavior Configuration
Control how aggressively HPA scales up and down:
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
This configuration allows fast scale-up (at most doubling the replica count every 60 seconds) but slow scale-down (removing at most 10% of pods per minute, after a 5-minute stabilization window). This prevents flapping during traffic fluctuations.
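To get a feel for how slow a 10% scale-down policy is, the following Python sketch steps through the replica counts period by period. It is a simplified model under stated assumptions: one scale-down step per period, removals rounded up as the Kubernetes documentation describes for Percent policies, and the stabilization window already elapsed. The function name is hypothetical.

```python
def scale_down_path(start, target, percent):
    """Replica counts after each period under a Percent scale-down policy."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    path = [start]
    current = start
    while current > target:
        # Integer ceiling of current * percent / 100: removals are rounded up.
        removable = -(-current * percent // 100)
        current = max(target, current - removable)
        path.append(current)
    return path

path = scale_down_path(10, 2, 10)
print(path)           # [10, 9, 8, 7, 6, 5, 4, 3, 2]
print(len(path) - 1)  # 8 periods of 60 seconds each
```

Under these assumptions, dropping from 10 replicas to the minimum of 2 takes about 8 minutes on top of the 5-minute stabilization window.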
HPA + Cluster Autoscaler
HPA and Cluster Autoscaler work together:
- HPA detects high CPU and creates new pod replicas
- New pods enter Pending state if no node has capacity
- Cluster Autoscaler detects Pending pods and provisions new nodes
- Once nodes are ready, Pending pods are scheduled
The total scale-up time is typically about 2.5-5 minutes: HPA reaction (15-30s) + node provisioning (2-4min) + pod startup (10-30s).
Monitoring HPA
# Check HPA status
kubectl get hpa
kubectl describe hpa api-hpa
# Watch scaling events
kubectl get events --field-selector reason=SuccessfulRescale
Key metrics to alert on:
- HPA at maxReplicas — You've hit your scaling ceiling; increase maxReplicas or optimize the application
- Frequent scaling events — May indicate flapping; adjust stabilization windows
- Unschedulable pods — Cluster Autoscaler may need configuration or node pool expansion
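As one possible starting point for the first alert, the Python sketch below checks whether an HPA object is pinned at its ceiling. It assumes the object has been fetched as JSON (for example with kubectl get hpa api-hpa -o json) and relies only on the spec.maxReplicas and status.currentReplicas fields; the helper name and the trimmed sample document are hypothetical.

```python
import json

def hpa_at_max(hpa):
    """Return True when an HPA reports at least maxReplicas running replicas."""
    max_replicas = hpa["spec"]["maxReplicas"]
    current = hpa.get("status", {}).get("currentReplicas", 0)
    return current >= max_replicas

# Hypothetical, trimmed-down HPA object for illustration.
sample = json.loads("""
{
  "metadata": {"name": "api-hpa"},
  "spec": {"minReplicas": 2, "maxReplicas": 10},
  "status": {"currentReplicas": 10, "desiredReplicas": 10}
}
""")
print(hpa_at_max(sample))  # True
```

A production alert would more likely live in your monitoring stack; the point here is only which fields to compare.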
Eazy SaaS Tip: We recommend starting with CPU-based HPA and gradually adding custom metrics as you understand your application's scaling characteristics. Over-engineering autoscaling from day one often creates more problems than it solves.