Chaos Engineering: Testing Resilience

February 13, 2026 | Reliability, Chaos Engineering, SRE

FIS, Litmus, and Game Days.

Chaos Engineering: Testing Resilience in Production

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. Instead of waiting for failures to happen, you proactively inject failures in a controlled manner to discover weaknesses before they cause outages.

Chaos Engineering Principles

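Start with a hypothesis: state the expected steady state in measurable terms, for example "If we kill one pod, the service will continue operating with no user impact"
Minimize blast radius: start with a single pod, instance, or small percentage of tasks, and expand gradually
Run in production: staging environments don't have real traffic patterns, so production is the eventual target once experiments pass in staging
Automate experiments: manual chaos doesn't scale
Stop on unexpected results: define abort conditions before the experiment starts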
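
The principles above can be sketched as a small experiment loop. This is only an illustration under stated assumptions, not a real chaos tool: `Experiment`, `run_experiment`, the thresholds, and the in-memory "service" are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str                      # expected steady state, in plain words
    inject: Callable[[], None]           # introduces the fault (hypothetical hook)
    rollback: Callable[[], None]         # undoes the fault
    steady_state: Callable[[], float]    # returns the current error rate (0.0 to 1.0)
    abort_threshold: float               # stop at once above this error rate
    hypothesis_threshold: float          # hypothesis holds at or below this rate
    checks: int = 5                      # number of observations to take


def run_experiment(exp: Experiment) -> dict:
    """Inject a fault, observe the steady state, and always roll back."""
    observations: List[float] = []
    aborted = False
    exp.inject()
    try:
        for _ in range(exp.checks):
            rate = exp.steady_state()
            observations.append(rate)
            if rate > exp.abort_threshold:
                aborted = True           # abort condition met: stop observing
                break
    finally:
        exp.rollback()                   # runs even if a probe raises
    held = (not aborted) and all(r <= exp.hypothesis_threshold for r in observations)
    return {"hypothesis": exp.hypothesis, "held": held,
            "aborted": aborted, "observations": observations}


# Simulated service with three replicas; the experiment removes one.
replicas = {"count": 3}

exp = Experiment(
    hypothesis="Killing one pod causes no user-visible errors",
    inject=lambda: replicas.update(count=replicas["count"] - 1),
    rollback=lambda: replicas.update(count=3),
    steady_state=lambda: 0.0 if replicas["count"] >= 2 else 0.5,
    abort_threshold=0.05,
    hypothesis_threshold=0.01,
)
result = run_experiment(exp)
print(result["held"], result["aborted"])
```

In a real setup the probe would read an error-rate metric from monitoring and the injection would call a tool such as FIS or Litmus; the point here is only the shape: hypothesis first, small fault, abort check on every observation, rollback in all cases.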

AWS Fault Injection Service (FIS)

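FIS is AWS's managed fault injection service. The command below creates an experiment template that stops 30% of the ECS tasks tagged Environment=production and halts the experiment if the HighErrorRate CloudWatch alarm fires: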

aws fis create-experiment-template \
  --description "Terminate 30% of ECS tasks" \
  --targets '{
    "ecsTask": {
      "resourceType": "aws:ecs:task",
      "resourceTags": {"Environment": "production"},
      "selectionMode": "PERCENT(30)"
    }
  }' \
  --actions '{
    "stopTasks": {
      "actionId": "aws:ecs:stop-task",
      "parameters": {},
      "targets": {"Tasks": "ecsTask"}
    }
  }' \
  --stop-conditions '[{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:...:alarm:HighErrorRate"
  }]' \
  --role-arn arn:aws:iam::xxx:role/FISRole
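
To make the blast radius easy to vary between runs, the same template can be generated from a few parameters. The sketch below only builds the request body as a plain dictionary; `ecs_stop_task_template` is a hypothetical helper, and the ARNs are left elided exactly as in the CLI example. The camelCase keys mirror the FIS API request fields, so the result should be usable as keyword arguments to boto3's `create_experiment_template`, though that call is not shown or tested here.

```python
import json


def ecs_stop_task_template(percent: int, alarm_arn: str, role_arn: str,
                           tags: dict) -> dict:
    """Build an FIS experiment template body that stops a share of ECS tasks."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "description": f"Terminate {percent}% of ECS tasks",
        "targets": {
            "ecsTask": {
                "resourceType": "aws:ecs:task",
                "resourceTags": dict(tags),
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopTasks": {
                "actionId": "aws:ecs:stop-task",
                "parameters": {},
                "targets": {"Tasks": "ecsTask"},
            }
        },
        # The experiment halts when this CloudWatch alarm goes into ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }


template = ecs_stop_task_template(
    percent=10,  # assumption: start below 30% to limit the blast radius
    alarm_arn="arn:aws:cloudwatch:...:alarm:HighErrorRate",  # elided, as above
    role_arn="arn:aws:iam::xxx:role/FISRole",                # elided, as above
    tags={"Environment": "production"},
)
print(json.dumps(template, indent=2))
```

Starting at 10% and raising the percentage only after a clean run is one way to apply the "minimize blast radius" principle to this template.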

Common FIS Experiments

Experiment	What It Tests	Expected Outcome
Terminate EC2 instance	ASG self-healing	New instance launches, traffic rebalances
Stop ECS tasks (30%)	Service redundancy	Remaining tasks handle load, no errors
CPU stress on nodes	Autoscaling and throttling	HPA scales up, latency stays acceptable
Network latency injection	Timeout handling	Circuit breakers trigger, graceful degradation
AZ failure simulation	Multi-AZ resilience	Traffic shifts to healthy AZ automatically
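
For the network latency row, the expected outcome depends on the client failing fast once a dependency becomes slow. The following is a minimal circuit breaker sketch that shows what "circuit breakers trigger, graceful degradation" can look like in code. The class, its parameters, and the simulated dependency are assumptions made for illustration; production services would normally rely on an established resilience library.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then fails fast
    until `reset_after` seconds have passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # open: skip the dependency entirely
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()            # degrade gracefully on failure
        self.failures = 0                # success closes the breaker
        return result


# Simulated dependency that always times out, as if latency had been injected.
calls = {"n": 0}


def slow_dependency():
    calls["n"] += 1
    raise TimeoutError("simulated: injected latency exceeded the client timeout")


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(max_failures=3, reset_after=30.0, clock=lambda: now[0])
responses = [breaker.call(slow_dependency, lambda: "cached") for _ in range(5)]
print(responses, calls["n"])
```

Five requests are served from the fallback, but only the first three actually reach the slow dependency. That difference is what a latency experiment should confirm on the dashboards.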

Litmus Chaos for Kubernetes

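# Install Litmus (add the chart repository first)
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace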

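# Pod delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  engineState: active                  # set to "stop" to halt the experiment
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin    # needs RBAC permissions for pod-delete
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION   # total experiment length, in seconds
          value: "60"
        - name: CHAOS_INTERVAL         # seconds between successive pod deletions
          value: "10"
        - name: FORCE                  # "true" deletes pods without a grace period
          value: "true"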
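
The two timing values determine how many times pods are deleted during the run. As a rough sketch, assuming one deletion round at the start of each interval (actual timing depends on the Litmus version and on how long each deletion takes), the schedule can be estimated like this. `pod_delete_rounds` is a hypothetical helper, not part of Litmus.

```python
def pod_delete_rounds(total_chaos_duration: int, chaos_interval: int) -> list:
    """Approximate start offsets, in seconds, of each pod deletion round."""
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("both values must be positive")
    return list(range(0, total_chaos_duration, chaos_interval))


rounds = pod_delete_rounds(total_chaos_duration=60, chaos_interval=10)
print(len(rounds), rounds)  # 6 [0, 10, 20, 30, 40, 50]
```

With the values in the manifest above, the deployment should expect about six deletion rounds in one minute, so replicas need to restart and pass readiness checks in well under ten seconds for the service to stay healthy throughout.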

Game Day Framework

Organize chaos experiments as scheduled Game Days:

Planning (1 week before)
- Define experiments and hypotheses
- Identify observability requirements (dashboards, alerts)
- Define abort conditions and rollback procedures
- Notify stakeholders and customer support
Execution (Game Day)
- Run experiments one at a time
- Monitor dashboards in real-time
- Document observations and surprises
- Stop immediately if abort conditions are met
Review (next day)
- Document findings and action items
- Prioritize remediation work
- Update runbooks based on observations
- Schedule follow-up experiments
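
The planning checklist can be turned into a simple readiness check that runs before the Game Day starts. This is a sketch under assumptions: `PlannedExperiment`, `readiness_gaps`, and the example dashboard and alarm names are hypothetical and stand in for whatever planning document a team actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlannedExperiment:
    name: str
    hypothesis: str = ""
    dashboards: List[str] = field(default_factory=list)
    abort_condition: str = ""
    rollback: str = ""


def readiness_gaps(plan: List[PlannedExperiment],
                   stakeholders_notified: bool) -> List[str]:
    """Return the planning items still missing; an empty list means ready."""
    gaps = []
    if not stakeholders_notified:
        gaps.append("stakeholders and customer support not notified")
    for exp in plan:
        if not exp.hypothesis:
            gaps.append(f"{exp.name}: missing hypothesis")
        if not exp.dashboards:
            gaps.append(f"{exp.name}: no dashboards or alerts identified")
        if not exp.abort_condition:
            gaps.append(f"{exp.name}: missing abort condition")
        if not exp.rollback:
            gaps.append(f"{exp.name}: missing rollback procedure")
    return gaps


plan = [
    PlannedExperiment(
        name="stop-ecs-tasks",
        hypothesis="Remaining tasks handle load with no errors",
        dashboards=["api-latency", "error-rate"],    # hypothetical dashboard names
        abort_condition="HighErrorRate alarm fires",
        rollback="Let the ECS service scheduler replace stopped tasks",
    ),
    PlannedExperiment(name="network-latency"),        # not planned yet
]
gaps = readiness_gaps(plan, stakeholders_notified=True)
for gap in gaps:
    print(gap)
```

Running the experiments one at a time only makes sense once this list is empty; an experiment without an abort condition or rollback procedure is better postponed than improvised.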

Progressive Chaos Maturity

Level	Activities
Level 1: Basic	Manual pod kills, instance termination in staging
Level 2: Automated	Scheduled experiments in staging with FIS/Litmus
Level 3: Production	Controlled production experiments with abort conditions
Level 4: Continuous	Chaos experiments integrated into CI/CD pipelines
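
One way to enforce this progression is a guard that refuses experiments a team is not yet ready for. The sketch below encodes one reading of the table (production from Level 3, scheduled runs from Level 2, pipeline runs at Level 4); the `Maturity` enum and `experiment_allowed` function are hypothetical names, and the exact rules are an assumption rather than a standard.

```python
from enum import IntEnum


class Maturity(IntEnum):
    BASIC = 1
    AUTOMATED = 2
    PRODUCTION = 3
    CONTINUOUS = 4


def experiment_allowed(level: Maturity, environment: str,
                       scheduled: bool = False,
                       in_pipeline: bool = False) -> bool:
    """Check a proposed experiment against the team's current maturity level."""
    if environment == "production" and level < Maturity.PRODUCTION:
        return False                     # production needs Level 3 or higher
    if scheduled and level < Maturity.AUTOMATED:
        return False                     # scheduled runs need Level 2 or higher
    if in_pipeline and level < Maturity.CONTINUOUS:
        return False                     # CI/CD integration is Level 4
    return True


print(experiment_allowed(Maturity.BASIC, "staging"))                        # True
print(experiment_allowed(Maturity.AUTOMATED, "production"))                 # False
print(experiment_allowed(Maturity.PRODUCTION, "production", scheduled=True))  # True
```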

Eazy SaaS Tip: We start every client at Level 1 — simple pod kills and instance terminations in staging. Most teams discover at least 3 critical resilience gaps in their first Game Day: missing health checks, incorrect timeout settings, or inadequate replica counts. Fixing these before they cause production outages is invaluable.
