AWS Disaster Recovery Planning

February 13, 2026 | AWS DR Reliability

Four DR patterns with cost comparison.

Disaster recovery is not optional: it's a question of when, not if, you'll need it. AWS describes four DR patterns that trade cost against recovery time. The right pattern depends on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

Key Definitions

  • RTO (Recovery Time Objective) — Maximum acceptable time from disaster to full recovery
  • RPO (Recovery Point Objective) — Maximum acceptable data loss measured in time

Four DR Patterns

Pattern                  | RTO           | RPO       | Cost           | Complexity
Backup & Restore         | Hours         | Hours     | $ (lowest)     | Low
Pilot Light              | Minutes-Hours | Minutes   | $$             | Medium
Warm Standby             | Minutes       | Seconds   | $$$            | Medium-High
Multi-Site Active-Active | Near-zero     | Near-zero | $$$$ (highest) | High
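
To make the trade-off concrete, here is a minimal Python sketch that picks the cheapest pattern meeting a given RTO and RPO target. The numeric limits are illustrative assumptions loosely based on the ranges quoted in this article (for example, a 24-hour RPO for Backup & Restore assumes daily snapshots). They are not AWS guarantees, and `Pattern` and `cheapest_pattern` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    rto_minutes: float  # assumed worst-case recovery time
    rpo_minutes: float  # assumed worst-case data loss window
    cost_rank: int      # 1 = cheapest, 4 = most expensive


# Illustrative numbers only (assumptions, not measured values).
PATTERNS = [
    Pattern("Backup & Restore", 8 * 60, 24 * 60, 1),
    Pattern("Pilot Light", 30, 15, 2),
    Pattern("Warm Standby", 5, 1, 3),
    Pattern("Multi-Site Active-Active", 0.5, 0.1, 4),
]


def cheapest_pattern(rto_target_minutes, rpo_target_minutes):
    """Return the lowest-cost pattern that meets both targets, or None."""
    candidates = [
        p for p in PATTERNS
        if p.rto_minutes <= rto_target_minutes
        and p.rpo_minutes <= rpo_target_minutes
    ]
    return min(candidates, key=lambda p: p.cost_rank, default=None)


if __name__ == "__main__":
    choice = cheapest_pattern(rto_target_minutes=60, rpo_target_minutes=30)
    print(choice.name if choice else "No pattern meets these targets")
```

Replace the assumed limits with figures measured in your own DR drills before relying on the result.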

Pattern 1: Backup & Restore

Back up data to S3 and snapshot storage, then rebuild infrastructure from infrastructure as code (IaC) when needed:

# Automated backups
- RDS: Automated snapshots, cross-region copy
- EBS: AWS Backup with cross-region vault
- S3: Cross-region replication (CRR)
- Application: AMIs copied to the DR region

# Recovery process:
1. Deploy infrastructure with Terraform/CloudFormation
2. Restore database from latest snapshot
3. Update DNS to point to DR region
# Total time: 4-8 hours

Cost: S3 storage + snapshot storage only. No running infrastructure in DR region.
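
Step 2 of the recovery process (restore from the latest snapshot) could be sketched as follows. This is a rough outline, not a complete runbook: it assumes `rds` is a boto3 RDS client for the DR region (for example `boto3.client("rds", region_name=...)`), that the copied snapshots keep the source instance identifier, and that the default restore settings are acceptable. The function names are hypothetical.

```python
from datetime import datetime, timezone


def latest_available_snapshot(snapshots):
    """Return the newest snapshot with Status 'available', or None."""
    usable = [
        s for s in snapshots
        if s.get("Status") == "available" and "SnapshotCreateTime" in s
    ]
    return max(usable, key=lambda s: s["SnapshotCreateTime"], default=None)


def snapshot_age(snapshot, now=None):
    """Age of the snapshot: the data loss window (RPO) a restore would incur."""
    now = now or datetime.now(timezone.utc)
    return now - snapshot["SnapshotCreateTime"]


def restore_latest(rds, source_instance_id, new_instance_id):
    """Restore a new DB instance from the newest available snapshot."""
    # describe_db_snapshots is paginated; this sketch reads only the first page.
    response = rds.describe_db_snapshots(DBInstanceIdentifier=source_instance_id)
    snapshot = latest_available_snapshot(response["DBSnapshots"])
    if snapshot is None:
        raise RuntimeError(f"No available snapshot for {source_instance_id}")
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=snapshot["DBSnapshotIdentifier"],
    )
    return snapshot["DBSnapshotIdentifier"]
```

Logging `snapshot_age` during a drill gives a direct reading of the RPO you would actually achieve.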

Pattern 2: Pilot Light

Keep the core running in the DR region (networking and database replication) and leave everything else scaled to zero until recovery:

# Always running in DR region:
- RDS Read Replica (cross-region, async replication)
- Core networking (VPC, subnets, security groups)

# Scaled to zero in DR region:
- Application servers (ASG min=0)
- Load balancers (cannot be scaled to zero: either pre-created, which is billed hourly, or created during recovery)

# Recovery process:
1. Promote RDS read replica to primary
2. Scale up ASG to desired capacity
3. Update DNS to DR region ALB
# Total time: 15-30 minutes
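
The three recovery steps above can be sketched with boto3. This is a minimal outline under stated assumptions: the three clients are created by the caller for the DR region, the ALB already exists, and every identifier passed in is a placeholder. A real runbook would also wait for the promoted database to become available and add error handling.

```python
def pilot_light_failover(rds, autoscaling, route53, *, replica_id, asg_name,
                         capacity, hosted_zone_id, record_name,
                         alb_dns_name, alb_zone_id):
    """Run the three pilot light recovery steps in order."""
    # 1. Promote the cross-region read replica to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Scale the application tier up from zero.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        MaxSize=capacity,
        DesiredCapacity=capacity,
    )

    # 3. Point the DNS record at the DR region ALB (alias record).
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Failover to DR region",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": alb_zone_id,  # the ALB's zone, not yours
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": True,
                    },
                },
            }],
        },
    )
```

Passing the clients in as arguments keeps the function easy to exercise with stubs during a tabletop or component test.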

Pattern 3: Warm Standby

Run a scaled-down but fully functional copy in the DR region:

# DR region (running at reduced capacity):
- Aurora Global Database (asynchronous storage-level replication, typically <1s lag and RPO)
- Application servers (ASG min=1, smaller instances)
- ALB + target groups (fully configured)
- Route 53 health checks (ready for failover)

# Recovery process:
1. Aurora Global Database failover (initiated by an operator or automation, typically ~1 minute)
2. Scale up ASG to production capacity
3. Route 53 automatic DNS failover
# Total time: 2-5 minutes
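
The "Route 53 automatic DNS failover" step relies on a pair of failover routing records created ahead of time. As a sketch, the change batch for such a pair could be built like this. The helper name and the shape of the `primary` and `secondary` dictionaries are assumptions made for illustration; the record fields themselves (`SetIdentifier`, `Failover`, `HealthCheckId`, `AliasTarget`) are standard Route 53 record set fields.

```python
def failover_change_batch(record_name, primary, secondary, primary_health_check_id):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY alias records.

    `primary` and `secondary` are dicts with the keys
    'region', 'alb_dns_name' and 'alb_zone_id'.
    """
    def record(role, target, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-{target['region']}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": target["alb_zone_id"],
                "DNSName": target["alb_dns_name"],
                "EvaluateTargetHealth": True,
            },
        }
        if health_check_id:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing between primary and DR regions",
        "Changes": [
            record("PRIMARY", primary, primary_health_check_id),
            record("SECONDARY", secondary),
        ],
    }
```

The result can be passed to a boto3 Route 53 client as `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. Once the records exist, Route 53 answers with the secondary record whenever the primary health check fails.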

Pattern 4: Multi-Site Active-Active

Both regions handle production traffic simultaneously:

# Both regions at full capacity:
- Aurora Global Database with write forwarding
- Full application tier in both regions
- Route 53 latency-based routing

# Failover:
- Route 53 automatically routes all traffic to healthy region
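- No manual intervention required to reroute traffic
- If the region hosting the Aurora writer is lost, a secondary cluster must still be promoted before writes resume
# Total time: ~30 seconds for traffic rerouting (health check detection plus DNS TTL)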
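
How quickly DNS failover completes depends on how fast the health check notices the outage and how long resolvers cache the old answer. A rough back-of-the-envelope estimate, assuming detection takes one check interval per required failure and clients honor the record TTL:

```python
def estimated_dns_failover_seconds(check_interval_s, failure_threshold, record_ttl_s):
    """Rough upper bound: health check detection time plus cached DNS TTL."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + record_ttl_s


if __name__ == "__main__":
    # Fast health checks (10s interval), 3 failures required, 60s TTL.
    print(estimated_dns_failover_seconds(10, 3, 60))
    # Standard health checks (30s interval), 3 failures required, 60s TTL.
    print(estimated_dns_failover_seconds(30, 3, 60))
```

Route 53 health checks offer 10-second and 30-second request intervals with a configurable failure threshold, so the detection part alone is about 30 seconds at best with a threshold of 3. Treat the result as an estimate: some clients and resolvers cache answers longer than the TTL.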

DR Testing

A DR plan that hasn't been tested can't be trusted to work. Schedule quarterly DR drills:

  1. Tabletop exercise — Walk through the recovery plan verbally
  2. Component test — Test individual recovery procedures (database restore, AMI launch)
  3. Full simulation — Execute complete failover to DR region
  4. Chaos engineering — Inject failures in production and validate automatic recovery
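
A drill is most useful when it produces a measured recovery time to compare against the RTO. One way to capture that, sketched here with hypothetical names, is to time each recovery step as it runs:

```python
import time


def run_drill(steps, rto_target_s, clock=time.monotonic):
    """Run each (name, action) step in order and time it.

    Returns per-step durations, the total, and whether the total met the RTO.
    """
    results = []
    start = clock()
    for name, action in steps:
        step_start = clock()
        action()  # e.g. promote the database, scale the ASG, update DNS
        results.append((name, clock() - step_start))
    total_s = clock() - start
    return {
        "steps": results,
        "total_s": total_s,
        "within_rto": total_s <= rto_target_s,
    }


if __name__ == "__main__":
    # Placeholder actions; a real drill would call the actual recovery steps.
    report = run_drill(
        [("promote database", lambda: time.sleep(0.1)),
         ("scale application", lambda: time.sleep(0.2))],
        rto_target_s=30 * 60,
    )
    print(report)
```

Keeping the per-step timings from each quarterly drill shows which step dominates recovery time and whether it is drifting.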

Cost Optimization

  • Use S3 Intelligent-Tiering for backup storage
  • Right-size DR instances — Use smaller instances in warm standby
  • Automate scale-up — Lambda functions to launch resources on failover
  • Use Savings Plans — Commit to base capacity that covers both regions
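
As an illustration of the "automate scale-up" item, here is a sketch of a Lambda handler that scales up the DR Auto Scaling group when a CloudWatch alarm notification arrives through SNS. The environment variable names (`DR_REGION`, `DR_ASG_NAME`, `DR_DESIRED_CAPACITY`) are assumptions chosen for this example, and the optional `autoscaling` argument exists only so the handler can be tested with a stub.

```python
import json
import os


def handler(event, context, autoscaling=None):
    """Scale up the DR Auto Scaling group when an alarm enters ALARM state."""
    if autoscaling is None:
        import boto3  # provided by the AWS Lambda Python runtime
        autoscaling = boto3.client(
            "autoscaling", region_name=os.environ["DR_REGION"]
        )

    # CloudWatch alarm notifications arrive as a JSON string inside the SNS record.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return {"scaled": False}

    capacity = int(os.environ.get("DR_DESIRED_CAPACITY", "2"))
    # Assumes the group's MaxSize is already at least `capacity`.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=os.environ["DR_ASG_NAME"],
        MinSize=capacity,
        DesiredCapacity=capacity,
    )
    return {"scaled": True, "capacity": capacity}
```

Deploy the function in the DR region rather than the primary one, so it can still run when the primary region is unavailable.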

Eazy SaaS Tip: Most SMBs achieve an excellent balance with the Pilot Light pattern — $50-100/month for DR capability with 15-30 minute recovery time. We set up automated failover with Route 53 health checks and test quarterly. For critical applications, we upgrade to Warm Standby with Aurora Global Database.