Kubernetes HPA Setup and Tuning

February 13, 2026 | Kubernetes Autoscaling HPA

HPA v2 with custom Prometheus metrics.

Horizontal Pod Autoscaler (HPA) Overview

The Horizontal Pod Autoscaler automatically adjusts the number of pod replicas in a workload based on observed metrics. The autoscaling/v2 API supports resource metrics (CPU and memory) as well as custom and external metrics, such as Prometheus metrics exposed through an adapter, so scaling can follow your application's actual load patterns.

Basic CPU-Based HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

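This scales api-service between 2 and 10 replicas, adding pods when average CPU utilization across the pods rises above 70% and removing them when it falls back below the target. Utilization is measured as a percentage of each container's CPU request, so the target pods must set resources.requests.cpu, and the cluster needs a resource metrics source such as metrics-server.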
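
To make the 70% target concrete, the sketch below applies the replica formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and then clamps the result to minReplicas and maxReplicas. The function name is hypothetical, and the 10% tolerance is assumed to be the controller default (clusters can change it). The real controller also accounts for unready pods and missing metrics, which this ignores.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left alone (assumed 10%).
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_value / target_value)
    # The result never leaves the [minReplicas, maxReplicas] range.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 90, 70))   # 4 pods at 90% CPU -> 6
print(desired_replicas(4, 72, 70))   # inside the tolerance band -> 4
print(desired_replicas(8, 200, 70))  # would be 23, capped at maxReplicas -> 10
```

Because the formula is proportional, a target of 70% leaves roughly 30% headroom per pod for traffic that arrives before new replicas are ready.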

Multi-Metric Scaling

Combine CPU and memory metrics for more accurate scaling decisions:

spec:
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

When multiple metrics are specified, HPA calculates the desired replica count for each metric and uses the highest value.
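
The "highest value wins" rule can be sketched as follows, using the CPU (70%) and memory (80%) targets from the fragment above. The helper names and the sample utilization figures are made up for illustration, and the 10% tolerance is again an assumed default.

```python
import math

def replicas_for(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica proposal for one metric, ignoring min/max bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_value / target_value)

def desired_for_all(current_replicas, metrics):
    """metrics maps a metric name to (current value, target value)."""
    proposals = {name: replicas_for(current_replicas, cur, tgt)
                 for name, (cur, tgt) in metrics.items()}
    # The HPA acts on the largest proposal across all metrics.
    return max(proposals.values()), proposals

desired, proposals = desired_for_all(4, {"cpu": (60, 70), "memory": (95, 80)})
print(proposals)  # {'cpu': 4, 'memory': 5}
print(desired)    # 5
```

Here CPU alone would hold the workload at 4 replicas, but memory pressure pushes it to 5. The same rule means a single noisy metric can keep a workload scaled up.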

Custom Metrics with Prometheus

Scale based on application-specific metrics such as request rate, queue depth, or active connections. The example below targets an average of 100 requests per second per pod:

spec:
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

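This requires the Prometheus Adapter, which serves Prometheus metrics through the Kubernetes custom metrics API. The adapter also needs a rule that exposes a per-pod series under the metric name used in the HPA (here http_requests_per_second). Install it with Helm: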

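# Add the chart repository first if it is not already configured
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-server.monitoring.svc \
  --set prometheus.port=9090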
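
For the Pods metric shown earlier, an AverageValue target means the HPA compares the per-pod average with the target, which works out to dividing the total across pods by the target value. A minimal sketch, assuming made-up request rates and a hypothetical function name, and leaving out the tolerance band and the min/max replica bounds:

```python
import math

def replicas_for_average_value(per_pod_values, target_average):
    """Replica count that brings the per-pod average down to the target."""
    total = sum(per_pod_values)
    return math.ceil(total / target_average)

# Three pods currently serving 150, 120 and 180 requests per second,
# with averageValue set to 100 as in the HPA spec above.
print(replicas_for_average_value([150, 120, 180], 100))  # 450 / 100 -> 5
```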

Scaling Behavior Configuration

Control how aggressively HPA scales up and down:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60

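This configuration allows fast scale-up (the replica count can at most double every 60 seconds) but slow scale-down (at most 10% of pods are removed per minute, and only once the lower recommendation has held for the 5-minute stabilization window). The 60-second scaleUp stabilization window adds a short delay before scaling up; the default for scaleUp is 0. Together these settings prevent flapping during traffic fluctuations.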
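
To see what these policies mean in practice, the sketch below steps a replica count toward a new target one 60-second period at a time, applying the 100% scale-up and 10% scale-down limits. It is a simplified model with hypothetical function names: stabilization windows are ignored, and the rounding (up when growing, down when shrinking, so every period makes progress) is an assumption about controller behavior rather than something stated above.

```python
def scale_up_limit(replicas, percent):
    """Largest replica count reachable in one period (rounded up)."""
    return -(-replicas * (100 + percent) // 100)

def scale_down_limit(replicas, percent):
    """Smallest replica count reachable in one period (rounded down)."""
    return replicas * (100 - percent) // 100

def trajectory(start, desired, up_percent=100, down_percent=10):
    """Replica count after each period until the desired count is reached."""
    if start < 1 or desired < 1 or up_percent <= 0 or down_percent <= 0:
        raise ValueError("replica counts and percentages must be positive")
    steps = [start]
    current = start
    while current != desired:
        if desired > current:
            current = min(desired, scale_up_limit(current, up_percent))
        else:
            current = max(desired, scale_down_limit(current, down_percent))
        steps.append(current)
    return steps

print(trajectory(2, 10))  # [2, 4, 8, 10]
print(trajectory(10, 2))  # [10, 9, 8, 7, 6, 5, 4, 3, 2]
```

Under these assumptions, going from 2 to 10 replicas takes three periods, while returning from 10 to 2 takes eight, which is the asymmetry the configuration is designed to produce.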

HPA + Cluster Autoscaler

HPA and Cluster Autoscaler work together:

  1. HPA detects high CPU and raises the Deployment's replica count, so new pods are created
  2. New pods enter Pending state if no node has capacity
  3. Cluster Autoscaler detects Pending pods and provisions new nodes
  4. Once nodes are ready, Pending pods are scheduled

When new nodes are needed, the total scale-up time is typically about 2.5 to 5 minutes: HPA reaction (15-30s) + node provisioning (2-4min) + pod startup (10-30s).
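
The estimate can be checked by summing the three phase ranges quoted above. This is plain arithmetic on those figures, not a measurement; actual times depend on the cloud provider, image size, and readiness probes.

```python
# (min_seconds, max_seconds) for each phase, taken from the estimate above.
phases = {
    "hpa_reaction": (15, 30),
    "node_provisioning": (120, 240),
    "pod_startup": (10, 30),
}

best_case = sum(low for low, _ in phases.values())
worst_case = sum(high for _, high in phases.values())

print(f"{best_case}s to {worst_case}s")                      # 145s to 300s
print(f"{best_case / 60:.1f} to {worst_case / 60:.1f} min")  # 2.4 to 5.0 min
```

Node provisioning dominates the total, so keeping spare node capacity helps far more than tuning the HPA's reaction time.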

Monitoring HPA

# Check HPA status
kubectl get hpa
kubectl describe hpa api-hpa

# Watch scaling events
kubectl get events --field-selector reason=SuccessfulRescale

Key metrics to alert on:

  • HPA at maxReplicas — You've hit your scaling ceiling; increase maxReplicas or optimize the application
  • Frequent scaling events — May indicate flapping; adjust stabilization windows
  • Unschedulable pods — Cluster Autoscaler may need configuration or node pool expansion
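
The first two conditions can be expressed as a small check over data you already collect, for example the replica counts from kubectl get hpa and the timestamps of SuccessfulRescale events. The sketch below is one possible shape for that check; the function name, alert names, and thresholds (more than 4 rescales in 10 minutes) are hypothetical and should be tuned to your workload.

```python
def hpa_alerts(current_replicas, max_replicas, rescale_times, now,
               window_seconds=600, max_rescales=4):
    """Return alert names for one HPA; thresholds are illustrative."""
    alerts = []
    # Scaling ceiling reached: the HPA cannot add more pods.
    if current_replicas >= max_replicas:
        alerts.append("at_max_replicas")
    # Many rescale events in a short window suggest flapping.
    recent = [t for t in rescale_times if now - t <= window_seconds]
    if len(recent) > max_rescales:
        alerts.append("frequent_scaling")
    return alerts

# Timestamps are seconds; five rescales within the last 10 minutes.
print(hpa_alerts(10, 10, [0, 100, 200, 300, 400], now=450))
# ['at_max_replicas', 'frequent_scaling']
print(hpa_alerts(4, 10, [0], now=450))  # []
```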

Eazy SaaS Tip: We recommend starting with CPU-based HPA and gradually adding custom metrics as you understand your application's scaling characteristics. Over-engineering autoscaling from day one often creates more problems than it solves.