AWS Disaster Recovery Planning
Four DR patterns with cost comparison.
Disaster recovery is not optional: it's a question of when, not if, you'll need it. AWS documents four DR patterns, each trading cost against recovery time. The right pattern depends on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Key Definitions
- RTO (Recovery Time Objective) — Maximum acceptable time from disaster to full recovery
- RPO (Recovery Point Objective) — Maximum acceptable data loss measured in time
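To make the two definitions concrete, here is a minimal sketch that derives a worst-case RPO from a backup interval and an estimated RTO from the durations of sequential recovery steps. The function names and all sample numbers are illustrative assumptions, not AWS figures.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta,
                   replication_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data loss: a failure just before the next backup loses one
    full interval, plus any delay in copying that backup to the DR region."""
    return backup_interval + replication_lag


def estimated_rto(step_durations: list[timedelta]) -> timedelta:
    """Recovery time when the recovery steps run one after another."""
    return sum(step_durations, timedelta(0))


# Hypothetical numbers: snapshots every 6 hours, 30 minutes to copy cross-region.
rpo = worst_case_rpo(timedelta(hours=6), timedelta(minutes=30))

# Hypothetical step timings: deploy infrastructure, restore database, switch DNS.
rto = estimated_rto([timedelta(hours=1), timedelta(hours=3), timedelta(minutes=10)])

print(rpo)  # 6:30:00
print(rto)  # 4:10:00
```

If either number exceeds what the business can tolerate, that is a signal to back up more often or to move to a pattern with less restore work.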
Four DR Patterns
| Pattern | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ (lowest) | Low |
| Pilot Light | Minutes-Hours | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds | $$$ | Medium-High |
| Multi-Site Active-Active | Near-zero | Near-zero | $$$$ (highest) | High |
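
One way to read the table is as a lookup: pick the cheapest pattern that still meets both targets. The sketch below does that. The numeric bounds are assumptions that loosely follow the table and the timings quoted in this article (for example, daily backups for Backup & Restore); replace them with figures measured in your own drills.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    rto_minutes: float  # assumed upper bound on recovery time
    rpo_minutes: float  # assumed upper bound on data loss
    cost_rank: int      # 1 = cheapest, 4 = most expensive


# Illustrative bounds only; they are not AWS guarantees.
PATTERNS = [
    Pattern("Backup & Restore", rto_minutes=480, rpo_minutes=1440, cost_rank=1),
    Pattern("Pilot Light", rto_minutes=30, rpo_minutes=5, cost_rank=2),
    Pattern("Warm Standby", rto_minutes=5, rpo_minutes=1, cost_rank=3),
    Pattern("Multi-Site Active-Active", rto_minutes=1, rpo_minutes=0.1, cost_rank=4),
]


def cheapest_pattern(target_rto_minutes: float, target_rpo_minutes: float):
    """Return the lowest-cost pattern whose assumed bounds meet both targets,
    or None when no pattern in the list is fast enough."""
    candidates = [
        p for p in PATTERNS
        if p.rto_minutes <= target_rto_minutes and p.rpo_minutes <= target_rpo_minutes
    ]
    return min(candidates, key=lambda p: p.cost_rank, default=None)


print(cheapest_pattern(60, 15).name)     # Pilot Light
print(cheapest_pattern(600, 1440).name)  # Backup & Restore
print(cheapest_pattern(0.1, 0))          # None
```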
Pattern 1: Backup & Restore
Back up data to S3, restore infrastructure from IaC when needed:
# Automated backups
- RDS: Automated snapshots, cross-region copy
- EBS: AWS Backup with cross-region vault
- S3: Cross-region replication (CRR)
- Application: AMI snapshots stored in DR region
# Recovery process:
1. Deploy infrastructure with Terraform/CloudFormation
2. Restore database from latest snapshot
3. Update DNS to point to DR region
# Total time: 4-8 hours
Cost: S3 storage + snapshot storage only. No running infrastructure in DR region.
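Step 2 of the recovery process depends on picking the right snapshot. As a rough sketch, the function below selects the newest usable snapshot from records shaped like the `DBSnapshots` entries that boto3's `describe_db_snapshots` returns. The sample records and identifiers are hypothetical, and the code makes no AWS calls.

```python
from datetime import datetime, timezone


def latest_available_snapshot(snapshots: list[dict]):
    """Return the most recent snapshot whose Status is 'available', or None."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    return max(usable, key=lambda s: s["SnapshotCreateTime"], default=None)


# Hypothetical records; in practice they would come from
# rds.describe_db_snapshots(...)["DBSnapshots"] in the DR region.
snapshots = [
    {"DBSnapshotIdentifier": "app-db-2024-05-01", "Status": "available",
     "SnapshotCreateTime": datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc)},
    {"DBSnapshotIdentifier": "app-db-2024-05-02", "Status": "available",
     "SnapshotCreateTime": datetime(2024, 5, 2, 3, 0, tzinfo=timezone.utc)},
    {"DBSnapshotIdentifier": "app-db-2024-05-03", "Status": "creating",
     "SnapshotCreateTime": datetime(2024, 5, 3, 3, 0, tzinfo=timezone.utc)},
]

chosen = latest_available_snapshot(snapshots)
print(chosen["DBSnapshotIdentifier"])  # app-db-2024-05-02
```

The chosen identifier would then be passed to a restore call such as `restore_db_instance_from_db_snapshot`. The age of that snapshot at the moment of failure is the data loss you actually incur, so it is worth logging alongside the RPO target.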
Pattern 2: Pilot Light
Keep only the core running in the DR region (database replication and networking) and scale up everything else during recovery:
# Always running in DR region:
- RDS Read Replica (cross-region, async replication)
- Core networking (VPC, subnets, security groups)
# Scaled to zero in DR region:
- Application servers (ASG min=0)
- Load balancers (can be pre-created)
# Recovery process:
1. Promote RDS read replica to primary
2. Scale up ASG to desired capacity
3. Update DNS to DR region ALB
# Total time: 15-30 minutes
Pattern 3: Warm Standby
Run a scaled-down but fully functional copy in the DR region:
# DR region (running at reduced capacity):
- Aurora Global Database (asynchronous storage-level replication, typically <1s lag, so RPO is seconds)
- Application servers (ASG min=1, smaller instances)
- ALB + target groups (fully configured)
- Route 53 health checks (ready for failover)
# Recovery process:
1. Aurora Global Database failover to the DR cluster (initiated by an operator or script, typically ~1 minute)
2. Scale up ASG to production capacity
3. Route 53 automatic DNS failover
# Total time: 2-5 minutes
Pattern 4: Multi-Site Active-Active
Both regions handle production traffic simultaneously:
# Both regions at full capacity:
- Aurora Global Database with write forwarding (secondary regions forward writes to the single primary writer)
- Full application tier in both regions
- Route 53 latency-based routing
# Failover:
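- Route 53 health checks automatically route all traffic to the healthy region
- No manual intervention required for traffic routing
- If the failed region hosted the Aurora writer, a secondary cluster must still be promoted before writes resume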
# Total time: ~30 seconds (DNS TTL) once health checks have detected the failure
DR Testing
A DR plan that hasn't been tested can't be trusted to work. Schedule quarterly DR drills:
- Tabletop exercise — Walk through the recovery plan verbally
- Component test — Test individual recovery procedures (database restore, AMI launch)
- Full simulation — Execute complete failover to DR region
- Chaos engineering — Inject failures in production and validate automatic recovery
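A drill is only useful if its result is compared against the target. The sketch below records how long a drill took and whether it met the RTO. The field names and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def evaluate_drill(started: datetime, recovered: datetime,
                   rto_target: timedelta) -> dict:
    """Compare the measured recovery time of a drill against the RTO target."""
    measured = recovered - started
    return {
        "measured": measured,
        "target": rto_target,
        "met_rto": measured <= rto_target,
        "margin": rto_target - measured,  # negative when the target was missed
    }


# Hypothetical drill: failover declared at 10:00, service healthy again at 10:22.
result = evaluate_drill(
    started=datetime(2024, 6, 1, 10, 0),
    recovered=datetime(2024, 6, 1, 10, 22),
    rto_target=timedelta(minutes=30),
)
print(result["met_rto"])  # True
print(result["margin"])   # 0:08:00
```

Tracking the margin across quarterly drills shows whether recovery is drifting toward the limit before it actually breaches it.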
Cost Optimization
- Use S3 Intelligent-Tiering for backup storage
- Right-size DR instances — Use smaller instances in warm standby
- Automate scale-up — Lambda functions to launch resources on failover
- Use Savings Plans — Commit to base capacity that covers both regions
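The "automate scale-up" item could look roughly like the Lambda-style handler below. It builds the keyword arguments for boto3's Auto Scaling `update_auto_scaling_group` call and only sends them if a client is supplied. The event shape (`asg_name`, `desired`, `max_size`) and the group name are hypothetical conventions, not part of any AWS API.

```python
def scale_up_request(asg_name: str, desired: int, max_size: int) -> dict:
    """Build keyword arguments for autoscaling.update_auto_scaling_group."""
    if desired < 1 or desired > max_size:
        raise ValueError("desired capacity must be between 1 and max_size")
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": desired,          # keep the group from scaling back in mid-recovery
        "DesiredCapacity": desired,
        "MaxSize": max_size,
    }


def handler(event: dict, context=None, autoscaling_client=None) -> dict:
    """Lambda-style entry point; the event keys are a hypothetical convention."""
    request = scale_up_request(event["asg_name"], event["desired"], event["max_size"])
    if autoscaling_client is not None:
        # e.g. boto3.client("autoscaling", region_name=<DR region>)
        autoscaling_client.update_auto_scaling_group(**request)
    return request


# Dry run without an AWS client: shows the request that would be sent.
print(handler({"asg_name": "app-dr-asg", "desired": 4, "max_size": 8}))
```

Keeping the request builder separate from the AWS call makes the scale-up logic testable in a drill without touching real capacity.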
Eazy SaaS Tip: Most SMBs achieve an excellent balance with the Pilot Light pattern — $50-100/month for DR capability with 15-30 minute recovery time. We set up automated failover with Route 53 health checks and test quarterly. For critical applications, we upgrade to Warm Standby with Aurora Global Database.
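For the Route 53 failover mentioned in the tip, a minimal sketch of the record configuration follows. It builds a `ChangeBatch` with a PRIMARY and a SECONDARY alias record, in the shape accepted by boto3's `change_resource_record_sets`. The domain, ALB DNS names, hosted zone IDs, and health check ID are all placeholders, and the code makes no AWS calls.

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one alias A record for a Route 53 failover routing policy."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-{alb_dns}",  # any unique label works
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # hosted zone ID of the ALB itself
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary, secondary):
    """Combine the primary and DR records into one ChangeBatch."""
    return {
        "Comment": "Failover routing between primary and DR region",
        "Changes": [
            failover_record(name, "PRIMARY", **primary),
            failover_record(name, "SECONDARY", **secondary),
        ],
    }


# Placeholder values; substitute your own ALB details and health check ID.
batch = failover_change_batch(
    "app.example.com.",
    primary={"alb_dns": "primary-alb.example.elb.amazonaws.com",
             "alb_zone_id": "ZPRIMARYALBZONE",
             "health_check_id": "hc-primary-placeholder"},
    secondary={"alb_dns": "dr-alb.example.elb.amazonaws.com",
               "alb_zone_id": "ZDRALBZONE"},
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
# ['PRIMARY', 'SECONDARY']
```

With a real client, the batch would be passed as `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. How quickly traffic moves then depends on the health check interval, the failure threshold, and the record TTL, so those settings belong in the quarterly test as well.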