Chaos Engineering: Testing Resilience
February 13, 2026
Reliability
Chaos Engineering
SRE
FIS, Litmus, and Game Days.
Chaos Engineering: Testing Resilience in Production
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions. Instead of waiting for failures to happen, you proactively inject failures in a controlled manner to discover weaknesses before they cause outages.
Chaos Engineering Principles
- Start with a hypothesis: for example, "If we kill one pod, the service will continue operating with no user impact."
- Minimize blast radius: start small and expand gradually.
- Run in production: staging environments don't have real traffic patterns, so work up to production once experiments pass in staging.
- Automate experiments: manual chaos doesn't scale.
- Stop on unexpected results: define abort conditions before you start.
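As a rough illustration of how these principles fit together, the sketch below widens the blast radius step by step and aborts as soon as the hypothesis (error rate stays under a threshold) stops holding. The names `run_progressive_experiment`, `inject_failure`, `measure_error_rate`, and `rollback` are hypothetical placeholders for whatever tooling you actually use; they are not part of FIS or Litmus.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    blast_radius_pct: int  # share of targets affected in this stage
    error_rate: float      # error rate observed during the stage
    aborted: bool          # True if the abort condition fired


def run_progressive_experiment(
    inject_failure: Callable[[int], None],
    measure_error_rate: Callable[[], float],
    rollback: Callable[[], None],
    stages: List[int],
    max_error_rate: float,
) -> List[StageResult]:
    """Inject failure at growing blast radii and stop when the hypothesis breaks."""
    results: List[StageResult] = []
    try:
        for pct in stages:
            inject_failure(pct)                       # start small, expand gradually
            error_rate = measure_error_rate()         # observe the steady-state metric
            aborted = error_rate > max_error_rate     # abort condition, defined up front
            results.append(StageResult(pct, error_rate, aborted))
            if aborted:
                break                                 # stop on unexpected results
    finally:
        rollback()                                    # always remove the injected fault
    return results
```

Calling it with stages such as `[1, 10, 25, 50]` and a `max_error_rate` of `0.01` would run each stage in order and skip the remaining ones once a stage exceeds the threshold.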
AWS Fault Injection Service (FIS)
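FIS is AWS's native chaos engineering service. The command below creates an experiment template that stops 30% of the ECS tasks tagged for production and halts the experiment if a CloudWatch alarm fires: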
aws fis create-experiment-template \
  --description "Terminate 30% of ECS tasks" \
  --targets '{
    "ecsTask": {
      "resourceType": "aws:ecs:task",
      "resourceTags": {"Environment": "production"},
      "selectionMode": "PERCENT(30)"
    }
  }' \
  --actions '{
    "stopTasks": {
      "actionId": "aws:ecs:stop-task",
      "parameters": {},
      "targets": {"Tasks": "ecsTask"}
    }
  }' \
  --stop-conditions '[{
    "source": "aws:cloudwatch:alarm",
    "value": "arn:aws:cloudwatch:...:alarm:HighErrorRate"
  }]' \
  --role-arn arn:aws:iam::xxx:role/FISRole
Common FIS Experiments
| Experiment | What It Tests | Expected Outcome |
|---|---|---|
| Terminate EC2 instance | ASG self-healing | New instance launches, traffic rebalances |
| Stop ECS tasks (30%) | Service redundancy | Remaining tasks handle load, no errors |
| CPU stress on nodes | Autoscaling and throttling | HPA scales up, latency stays acceptable |
| Network latency injection | Timeout handling | Circuit breakers trigger, graceful degradation |
| AZ failure simulation | Multi-AZ resilience | Traffic shifts to healthy AZ automatically |
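Before running the "Stop ECS tasks (30%)" experiment, it can help to check the hypothesis on paper. The sketch below is a back-of-the-envelope capacity check, assuming you know a rough per-task throughput and your peak request rate. The function name and parameters are illustrative, and rounding the stopped-task count up is a deliberately conservative planning assumption, not a description of how FIS evaluates `PERCENT(n)`.

```python
import math


def survives_task_loss(
    total_tasks: int,
    stop_percent: float,
    per_task_rps: float,
    peak_rps: float,
) -> bool:
    """Return True if the tasks left after the experiment can serve peak load."""
    # Round the number of stopped tasks UP so the estimate errs on the safe side.
    # This is a planning assumption, not a statement about FIS rounding behavior.
    stopped = math.ceil(total_tasks * stop_percent / 100)
    remaining = total_tasks - stopped
    return remaining * per_task_rps >= peak_rps


# Example: 10 tasks at ~100 requests/s each, losing 30%, against a 650 requests/s peak.
print(survives_task_loss(10, 30, 100, 650))
```

If the check fails on paper, the experiment is likely to surface an inadequate replica count, so it may be worth raising the desired task count before injecting the failure.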
Litmus Chaos for Kubernetes
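# Install Litmus (add the chart repository first)
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace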
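# Pod delete experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  # "active" starts the experiment; set to "stop" to halt it
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Total duration of the chaos, in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            # Seconds between successive pod deletions
            - name: CHAOS_INTERVAL
              value: "10"
            # Force-delete pods instead of terminating them gracefully
            - name: FORCE
              value: "true"
Game Day Framework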
Organize chaos experiments as scheduled Game Days:
- Planning (1 week before)
  - Define experiments and hypotheses
  - Identify observability requirements (dashboards, alerts)
  - Define abort conditions and rollback procedures
  - Notify stakeholders and customer support
- Execution (Game Day)
  - Run experiments one at a time
  - Monitor dashboards in real time
  - Document observations and surprises
  - Stop immediately if abort conditions are met
- Review (next day)
  - Document findings and action items
  - Prioritize remediation work
  - Update runbooks based on observations
  - Schedule follow-up experiments
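The execution and review steps above can be sketched as a small runner that executes experiments one at a time, records what happened, and stops as soon as an abort condition is met. Everything here (`Experiment`, `GameDayLog`, `run_game_day`, and the `abort_condition_met` callback) is a hypothetical structure for illustration; in practice each `run` callable would wrap an FIS or Litmus experiment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    hypothesis: str
    run: Callable[[], bool]  # returns True if the hypothesis held


@dataclass
class GameDayLog:
    observations: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)
    aborted: bool = False


def run_game_day(
    experiments: List[Experiment],
    abort_condition_met: Callable[[], bool],
) -> GameDayLog:
    """Run experiments sequentially and collect notes for the next-day review."""
    log = GameDayLog()
    for exp in experiments:
        held = exp.run()  # one experiment at a time
        outcome = "held" if held else "failed"
        log.observations.append(f"{exp.name}: hypothesis {outcome}")
        if not held:
            # A failed hypothesis becomes an action item for the review
            log.action_items.append(f"Investigate: {exp.hypothesis}")
        if abort_condition_met():
            # Stop immediately; remaining experiments are not run
            log.aborted = True
            log.observations.append(f"Aborted after {exp.name}")
            break
    return log
```

The returned log gives the review meeting a starting point: observations feed the findings document, and action items feed the remediation backlog.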
Progressive Chaos Maturity
| Level | Activities |
|---|---|
| Level 1: Basic | Manual pod kills, instance termination in staging |
| Level 2: Automated | Scheduled experiments in staging with FIS/Litmus |
| Level 3: Production | Controlled production experiments with abort conditions |
| Level 4: Continuous | Chaos experiments integrated into CI/CD pipelines |
Eazy SaaS Tip: We start every client at Level 1 — simple pod kills and instance terminations in staging. Most teams discover at least 3 critical resilience gaps in their first Game Day: missing health checks, incorrect timeout settings, or inadequate replica counts. Fixing these before they cause production outages is invaluable.