Chaos Engineering: Testing Resilience

February 13, 2026 | Reliability, Chaos Engineering, SRE

FIS, Litmus, and Game Days.

Chaos Engineering: Testing Resilience in Production

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. Instead of waiting for failures to happen, you proactively inject failures in a controlled manner to discover weaknesses before they cause outages.

Chaos Engineering Principles

  1. Start with a hypothesis: "If we kill one pod, the service will continue operating with no user impact"
  2. Minimize blast radius: start small, expand gradually
  3. Run in production: staging environments don't have real traffic patterns, so work toward production once experiments have proven safe at smaller scale
  4. Automate experiments: manual chaos doesn't scale
  5. Stop on unexpected results: define abort conditions before the experiment starts
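
The principles above can be read as a small experiment loop. The sketch below is illustrative only: `Experiment`, `run_experiment`, and the callables passed in are hypothetical names, not part of FIS, Litmus, or any other tool. It assumes the caller supplies functions that check steady state, inject the fault, roll it back, and evaluate the abort condition.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                    # principle 1: the expected outcome, stated up front
    steady_state: Callable[[], bool]   # returns True while the system looks healthy
    inject: Callable[[], None]         # principle 2: keep the injected fault small
    rollback: Callable[[], None]       # undoes the injected fault
    should_abort: Callable[[], bool]   # principle 5: the abort condition


def run_experiment(exp: Experiment, checks: int = 3) -> str:
    """Run one experiment; return 'skipped', 'aborted', 'failed', or 'passed'."""
    if not exp.steady_state():
        return "skipped"               # never inject into an already unhealthy system
    exp.inject()
    try:
        for _ in range(checks):
            if exp.should_abort():
                return "aborted"
            if not exp.steady_state():
                return "failed"        # hypothesis disproved: a weakness was found
        return "passed"
    finally:
        exp.rollback()                 # always clean up, whatever the outcome
```

A "failed" result is the useful one here: it means the hypothesis was wrong and the experiment surfaced a weakness before a real outage did.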

AWS Fault Injection Service (FIS)

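FIS is AWS's native chaos engineering service. The command below creates an experiment template that stops 30% of the ECS tasks tagged Environment=production and halts the experiment if the HighErrorRate CloudWatch alarm fires: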

aws fis create-experiment-template \
  --description "Terminate 30% of ECS tasks" \
  --targets '{
    "ecsTask": {
      "resourceType": "aws:ecs:task",
      "resourceTags": {"Environment": "production"},
      "selectionMode": "PERCENT(30)"
    }
  }' \
  --actions '{
    "stopTasks": {
      "actionId": "aws:ecs:stop-task",
      "parameters": {},
      "targets": {"Tasks": "ecsTask"}
    }
  }' \
  --stop-conditions '[{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:...:alarm:HighErrorRate"
  }]' \
  --role-arn arn:aws:iam::xxx:role/FISRole
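
To start with a smaller blast radius and raise the percentage gradually, it can help to generate the template body rather than hand-edit JSON. The sketch below is a hypothetical helper (`ecs_stop_task_template` is not an AWS function) that only builds a dictionary mirroring the fields used in the CLI call above. The camelCase top-level keys are an assumption about the API request shape and should be checked against the FIS API reference before passing the result to an SDK.

```python
def ecs_stop_task_template(percent: int, alarm_arn: str, role_arn: str,
                           environment: str = "production") -> dict:
    """Build an experiment-template body like the one in the CLI example."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "description": f"Terminate {percent}% of ECS tasks",
        "targets": {
            "ecsTask": {
                "resourceType": "aws:ecs:task",
                "resourceTags": {"Environment": environment},
                "selectionMode": f"PERCENT({percent})",
            }
        },
        "actions": {
            "stopTasks": {
                "actionId": "aws:ecs:stop-task",
                "parameters": {},
                # "Tasks" must point at the target name defined above
                "targets": {"Tasks": "ecsTask"},
            }
        },
        # The experiment halts when this CloudWatch alarm fires
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "roleArn": role_arn,
    }
```

Calling it with `percent=10` first and moving to 30 only after a clean run follows the "start small, expand gradually" principle.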

Common FIS Experiments

| Experiment | What It Tests | Expected Outcome |
|---|---|---|
| Terminate EC2 instance | ASG self-healing | New instance launches, traffic rebalances |
| Stop ECS tasks (30%) | Service redundancy | Remaining tasks handle load, no errors |
| CPU stress on nodes | Autoscaling and throttling | HPA scales up, latency stays acceptable |
| Network latency injection | Timeout handling | Circuit breakers trigger, graceful degradation |
| AZ failure simulation | Multi-AZ resilience | Traffic shifts to healthy AZ automatically |
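
For the network latency experiment, "circuit breakers trigger, graceful degradation" means that after repeated timeouts the caller stops hitting the slow dependency and serves a fallback instead. The following is a minimal sketch of that behavior, not a production library; `CircuitBreaker` and its parameters are illustrative names chosen for this example.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock             # injectable so tests can control time
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: degrade gracefully, skip the slow call
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # a success closes the circuit again
        return result
```

During a latency injection, the hypothesis to verify is that calls switch to the fallback quickly and recover on their own once the fault is removed.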

Litmus Chaos for Kubernetes

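# Install Litmus (add the chart repository first)
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace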

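# Pod delete experiment (a ChaosEngine manifest; save it to a file and apply it with kubectl apply -f)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  engineState: active              # set to "stop" to halt the experiment
  appinfo:
    appns: production              # namespace of the target application
    applabel: app=api-service      # label selector for the target pods
    appkind: deployment
  chaosServiceAccount: litmus-admin  # must exist with the RBAC the experiment needs
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION   # total chaos window, in seconds
          value: "60"
        - name: CHAOS_INTERVAL         # seconds between successive pod deletions
          value: "10"
        - name: FORCE                  # "true" deletes pods without a graceful shutdown
          value: "true"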
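
As a rough sanity check on blast radius, the two timing values determine how many deletion rounds to expect. The helper below is a hypothetical back-of-the-envelope estimate, not Litmus code: it assumes one deletion round per interval within the chaos window, while the real count depends on the Litmus version and on how long each deletion takes.

```python
def estimated_delete_rounds(total_chaos_duration: int, chaos_interval: int) -> int:
    """Approximate number of pod-delete rounds in one experiment run."""
    if total_chaos_duration <= 0 or chaos_interval <= 0:
        raise ValueError("duration and interval must be positive")
    # At least one round runs even if the interval exceeds the window
    return max(1, total_chaos_duration // chaos_interval)


# With the values from the manifest above: a 60 s window, 10 s apart
print(estimated_delete_rounds(60, 10))  # → 6
```

If roughly six forced deletions in a minute is more than the service's replica count can absorb, lengthen `CHAOS_INTERVAL` before the first run.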

Game Day Framework

Organize chaos experiments as scheduled Game Days:

  1. Planning (1 week before)
    • Define experiments and hypotheses
    • Identify observability requirements (dashboards, alerts)
    • Define abort conditions and rollback procedures
    • Notify stakeholders and customer support
  2. Execution (Game Day)
    • Run experiments one at a time
    • Monitor dashboards in real time
    • Document observations and surprises
    • Stop immediately if abort conditions are met
  3. Review (next day)
    • Document findings and action items
    • Prioritize remediation work
    • Update runbooks based on observations
    • Schedule follow-up experiments
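
The execution phase can be sketched as a loop that runs experiments strictly one at a time, records an observation for each, and stops as soon as an abort condition is met. This is an illustrative sketch; `GameDayLog` and `run_game_day` are hypothetical names, and each experiment is assumed to be a function that returns a short observation string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GameDayLog:
    completed: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)  # input for the review phase
    aborted_before: str = ""   # name of the first experiment that was not started


def run_game_day(experiments: List[Tuple[str, Callable[[], str]]],
                 abort_condition_met: Callable[[], bool]) -> GameDayLog:
    """Run (name, experiment) pairs in order, stopping when the abort condition holds."""
    log = GameDayLog()
    for name, run in experiments:          # one at a time, never in parallel
        if abort_condition_met():
            log.aborted_before = name      # stop before starting the next experiment
            break
        log.observations.append(f"{name}: {run()}")
        log.completed.append(name)
    return log
```

The returned log doubles as raw material for the review: completed experiments, what was observed, and where the day was cut short.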

Progressive Chaos Maturity

| Level | Activities |
|---|---|
| Level 1: Basic | Manual pod kills, instance termination in staging |
| Level 2: Automated | Scheduled experiments in staging with FIS/Litmus |
| Level 3: Production | Controlled production experiments with abort conditions |
| Level 4: Continuous | Chaos experiments integrated into CI/CD pipelines |

Eazy SaaS Tip: We start every client at Level 1, with simple pod kills and instance terminations in staging. Most teams discover at least 3 critical resilience gaps in their first Game Day, such as missing health checks, incorrect timeout settings, or inadequate replica counts. Fixing these before they cause production outages is invaluable.
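
Two of those gaps, missing health checks and low replica counts, can often be spotted before any fault is injected. The sketch below is a hypothetical pre-check (`find_resilience_gaps` is an illustrative name) that inspects a Kubernetes Deployment manifest already parsed into a dictionary. It does not cover timeout settings, which usually live in application or proxy configuration rather than in the manifest.

```python
def find_resilience_gaps(deployment: dict, min_replicas: int = 2) -> list:
    """Return human-readable findings for a parsed Deployment manifest."""
    gaps = []
    spec = deployment.get("spec", {})
    replicas = spec.get("replicas", 1)   # Kubernetes defaults to 1 replica when unset
    if replicas < min_replicas:
        gaps.append(f"replicas={replicas} is below {min_replicas}")
    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    for container in containers:
        for probe in ("readinessProbe", "livenessProbe"):
            if probe not in container:
                gaps.append(f"container {container.get('name', '?')} has no {probe}")
    return gaps
```

Running a check like this before a first Game Day leaves the experiments free to surface the less obvious problems.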