Backup Verification and Recovery Testing
February 13, 2026 | DR, Backup, Reliability
Automated verification and quarterly drills.
A backup that hasn't been tested is not a backup; it's a hope. Organizations that don't regularly verify their backups discover corruption, missing data, or broken restore processes at the worst possible moment: during an actual disaster. This guide covers automated verification and quarterly recovery drills.
The Backup Verification Problem
- Silent corruption — Backup files may be incomplete or corrupted without any indication
- Schema drift — Backup format may not be compatible with current software versions
- Missing data — Backup scope may not include all critical databases or file systems
- Untested procedures — The team may not know how to restore, or procedures may be outdated
Automated Backup Verification
Database Backup Verification
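The Lambda function in this section calls a `verify_data_integrity` helper that the post does not show. As a minimal sketch of what such a helper might delegate to, the function here runs row-count checks over any DB-API connection. The names `TableCheck` and `run_integrity_checks`, the table names, and the thresholds are all hypothetical, and an in-memory SQLite database stands in for the restored instance so the example runs anywhere.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class TableCheck:
    """One expectation about a restored table (hypothetical structure)."""
    table: str
    min_rows: int


def run_integrity_checks(conn, checks):
    """Return a list of failure messages; an empty list means all checks passed.

    `conn` is any DB-API 2.0 connection to the restored instance. Table names
    must come from trusted configuration, because they are interpolated into
    the SQL text. Double-quoted identifiers work in PostgreSQL and SQLite.
    """
    failures = []
    cur = conn.cursor()
    for check in checks:
        try:
            cur.execute(f'SELECT COUNT(*) FROM "{check.table}"')
            (count,) = cur.fetchone()
        except Exception as exc:  # missing table, permission error, etc.
            conn.rollback()  # some engines abort the transaction on error
            failures.append(f"{check.table}: query failed ({exc})")
            continue
        if count < check.min_rows:
            failures.append(
                f"{check.table}: {count} rows, expected at least {check.min_rows}"
            )
    return failures


# Demonstration: SQLite stands in for the restored database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
demo.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])
demo.commit()

integrity_failures = run_integrity_checks(
    demo,
    [TableCheck("orders", min_rows=2), TableCheck("customers", min_rows=1)],
)
print(integrity_failures)  # one failure: the customers table does not exist
```

Row counts are only a floor. A fuller check would probably also compare the newest timestamp in key tables against the snapshot time, so that a restore of stale data does not pass.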
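# Automated RDS snapshot verification (Lambda function)
import boto3

TEMP_INSTANCE = 'backup-verify-temp'

def verify_rds_backup(event, context):
    rds = boto3.client('rds')

    # Find the most recent completed automated snapshot.
    # Snapshots that are still being created are skipped.
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier='prod-db',
        SnapshotType='automated'
    )['DBSnapshots']
    available = [s for s in snapshots if s['Status'] == 'available']
    if not available:
        raise RuntimeError('No available automated snapshots for prod-db')
    latest = max(available, key=lambda s: s['SnapshotCreateTime'])

    # Restore it to a temporary instance
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=TEMP_INSTANCE,
        DBSnapshotIdentifier=latest['DBSnapshotIdentifier'],
        DBInstanceClass='db.t3.medium'
    )
    try:
        # Wait for the instance to be available. A restore can take longer
        # than Lambda's 15-minute limit, so for large databases this wait
        # may need to move to a separate step or a follow-up invocation.
        waiter = rds.get_waiter('db_instance_available')
        waiter.wait(DBInstanceIdentifier=TEMP_INSTANCE)

        # Run verification queries (verify_data_integrity is your own helper)
        verify_data_integrity(TEMP_INSTANCE)
    finally:
        # Clean up even if verification fails, so the temporary
        # instance does not keep running
        rds.delete_db_instance(
            DBInstanceIdentifier=TEMP_INSTANCE,
            SkipFinalSnapshot=True
        )

S3 Backup Integrity Check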
# Verify S3 backup files
from datetime import datetime

import boto3

def verify_s3_backups():
    s3 = boto3.client('s3')

    # Check that backup files exist for today
    today = datetime.now().strftime('%Y/%m/%d')
    objects = s3.list_objects_v2(
        Bucket='backups',
        Prefix=f'database/{today}'
    )
    if objects['KeyCount'] == 0:
        # alert() is your own notification helper
        alert("No backup files found for today!")
        return False

    ok = True
    for obj in objects['Contents']:
        # Verify file sizes (catch truncated backups)
        if obj['Size'] < 1000:  # Suspiciously small
            alert(f"Backup file {obj['Key']} is only {obj['Size']} bytes")
            ok = False

        # Verify that a SHA-256 checksum is stored with the object.
        # head_object only returns checksum fields when ChecksumMode is enabled.
        head = s3.head_object(
            Bucket='backups', Key=obj['Key'], ChecksumMode='ENABLED'
        )
        if 'ChecksumSHA256' not in head:
            alert(f"Missing checksum for {obj['Key']}")
            ok = False

    # Report every problem found, then fail if there was at least one
    return ok

Quarterly Recovery Drill
Schedule a full recovery drill every quarter. Write the plan down beforehand and time every step during execution.
Drill Plan Template
- Scope definition — Which systems are being tested?
- Recovery scenario — What failure are we simulating? (AZ failure, database corruption, ransomware)
- Success criteria — What must work for the drill to pass?
- Timeline — Expected recovery time for each step
- Participants — Who needs to be involved?
Drill Execution
| Step | Activity | Expected Time | Actual Time |
|---|---|---|---|
| 1 | Identify latest backup | 5 min | ___ |
| 2 | Provision recovery infrastructure | 15 min | ___ |
| 3 | Restore database from backup | 30 min | ___ |
| 4 | Deploy application to recovery infra | 15 min | ___ |
| 5 | Verify application functionality | 20 min | ___ |
| 6 | Verify data integrity | 15 min | ___ |
| 7 | DNS cutover (simulated) | 5 min | ___ |
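Once the "Actual Time" column is filled in, the comparison against the expected times can be automated. The sketch below assumes the steps run sequentially (so durations add up) and that the team has an overall recovery time objective to test against. `DrillStep`, `summarize_drill`, the 120-minute objective, and the actual times are invented for illustration; only the expected times come from the table above.

```python
from dataclasses import dataclass


@dataclass
class DrillStep:
    name: str
    expected_min: int
    actual_min: int


def summarize_drill(steps, rto_min):
    """Compare actual step durations with expectations and an overall RTO."""
    overruns = [
        (s.name, s.actual_min - s.expected_min)
        for s in steps
        if s.actual_min > s.expected_min
    ]
    total_actual = sum(s.actual_min for s in steps)
    return {
        "total_expected_min": sum(s.expected_min for s in steps),
        "total_actual_min": total_actual,
        "overruns": overruns,
        "within_rto": total_actual <= rto_min,
    }


# Expected times are from the drill table; actual times are made up.
steps = [
    DrillStep("Identify latest backup", 5, 5),
    DrillStep("Provision recovery infrastructure", 15, 20),
    DrillStep("Restore database from backup", 30, 55),
    DrillStep("Deploy application to recovery infra", 15, 15),
    DrillStep("Verify application functionality", 20, 20),
    DrillStep("Verify data integrity", 15, 15),
    DrillStep("DNS cutover (simulated)", 5, 5),
]

drill_summary = summarize_drill(steps, rto_min=120)  # 120 is an assumed RTO
print(drill_summary["overruns"])
print(drill_summary["total_actual_min"], drill_summary["within_rto"])
```

Keeping the per-step overruns, rather than only the total, shows which part of the runbook to fix before the next drill.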
AWS Backup for Centralized Management
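The plan in this section combines a cold-storage transition with a retention period. AWS Backup requires recovery points to stay in cold storage for at least 90 days, so `DeleteAfterDays` has to be at least 90 greater than `MoveToColdStorageAfterDays`. A small pre-flight check can catch a violation before the plan is submitted. The sketch assumes the plan JSON has been loaded into a Python dict; `lifecycle_problems` is a hypothetical helper, not part of any AWS SDK.

```python
def lifecycle_problems(plan, min_cold_days=90):
    """Return human-readable problems in a backup plan's lifecycle settings."""
    problems = []
    for rule in plan.get("Rules", []):
        name = rule.get("RuleName", "<unnamed rule>")
        # Check the rule's own lifecycle and the lifecycle of each copy action.
        lifecycles = [(name, rule.get("Lifecycle", {}))]
        for i, copy_action in enumerate(rule.get("CopyActions", [])):
            lifecycles.append(
                (f"{name} copy action {i}", copy_action.get("Lifecycle", {}))
            )
        for label, lifecycle in lifecycles:
            cold = lifecycle.get("MoveToColdStorageAfterDays")
            delete = lifecycle.get("DeleteAfterDays")
            if cold is None or delete is None:
                continue  # nothing to compare
            if delete < cold + min_cold_days:
                problems.append(
                    f"{label}: DeleteAfterDays={delete} "
                    f"must be at least {cold + min_cold_days}"
                )
    return problems


# Lifecycle values copied from the plan in this section.
plan = {
    "BackupPlanName": "ProductionBackup",
    "Rules": [{
        "RuleName": "DailyBackup",
        "Lifecycle": {"MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 365},
        "CopyActions": [{"Lifecycle": {"DeleteAfterDays": 90}}],
    }],
}

plan_problems = lifecycle_problems(plan)
print(plan_problems)  # an empty list means no lifecycle problems were found
```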
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "ProductionBackup",
  "Rules": [{
    "RuleName": "DailyBackup",
    "ScheduleExpression": "cron(0 5 ? * * *)",
    "TargetBackupVaultName": "production-vault",
    "Lifecycle": {
      "MoveToColdStorageAfterDays": 30,
      "DeleteAfterDays": 365
    },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:eu-west-1:xxx:backup-vault:dr-vault",
      "Lifecycle": {
        "DeleteAfterDays": 90
      }
    }]
  }]
}'

Verification Metrics Dashboard
- Backup completion rate — Target: 100%
- Last successful verification — Should be within 7 days
- Recovery Time (actual) — Track trend over drills
- Recovery Point (actual) — Verify backup freshness
- Drill results — Pass/fail history
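Three of these metrics reduce to simple arithmetic once job records and timestamps are available. The sketch below is one possible way to compute them. The record shape (a dict with a `state` key), the function names, and the sample dates are assumptions for illustration, not the output format of any particular backup service.

```python
from datetime import datetime, timedelta, timezone


def completion_rate(jobs):
    """Percentage of backup jobs whose state is COMPLETED."""
    if not jobs:
        return 0.0
    done = sum(1 for job in jobs if job["state"] == "COMPLETED")
    return 100.0 * done / len(jobs)


def verification_is_stale(last_verified, now, max_age=timedelta(days=7)):
    """True when the last successful verification is older than max_age."""
    return now - last_verified > max_age


def actual_recovery_point(last_backup, incident_time):
    """Data-loss window: time between the newest usable backup and the incident."""
    return incident_time - last_backup


# Sample data, invented for the example.
jobs = [
    {"id": "a", "state": "COMPLETED"},
    {"id": "b", "state": "COMPLETED"},
    {"id": "c", "state": "FAILED"},
    {"id": "d", "state": "COMPLETED"},
]
now = datetime(2026, 2, 13, 14, 30, tzinfo=timezone.utc)
last_verified = datetime(2026, 2, 1, 3, 0, tzinfo=timezone.utc)
last_backup = datetime(2026, 2, 13, 5, 0, tzinfo=timezone.utc)

rate = completion_rate(jobs)
stale = verification_is_stale(last_verified, now)
rpo_actual = actual_recovery_point(last_backup, now)
print(rate, stale, rpo_actual)
```

An empty job list is reported as 0% rather than 100%, on the reasoning that "no backups ran" should look like a failure on the dashboard.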
Common Drill Findings
- Credentials expired — Backup service account passwords rotated but not updated
- Missing database — New database added to production but not to backup policy
- Slow restore — Large database restore takes 4 hours, not the expected 1 hour
- Missing runbook steps — Team discovers gaps in recovery documentation
- Version mismatch — Backup was taken on v14, but restore target is v15
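Two of these findings, the missing database and the version mismatch, can be caught between drills with simple comparisons. This is a sketch under assumptions: the identifier lists would have to be gathered separately (for example from the RDS `describe_db_instances` call and AWS Backup's `list_protected_resources`), and the function names and sample identifiers are hypothetical.

```python
def unprotected_databases(production_ids, protected_ids):
    """Databases that exist in production but are not covered by any backup."""
    return sorted(set(production_ids) - set(protected_ids))


def major_version(version):
    """Extract the major version from strings such as 'v14', '14' or '14.9'."""
    return int(str(version).lstrip("vV").split(".")[0])


def version_mismatch(backup_version, target_version):
    """True when the backup and the restore target differ in major version."""
    return major_version(backup_version) != major_version(target_version)


# Sample identifiers, invented for the example.
production = ["orders-db", "users-db", "analytics-db"]
protected = ["orders-db", "users-db"]

missing = unprotected_databases(production, protected)
mismatch = version_mismatch("v14", "v15")
print(missing, mismatch)
```

Run on a schedule, a check like this turns a drill-day surprise into a routine alert.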
Eazy SaaS Tip: We automate backup verification for every client using Lambda functions that restore and validate backups nightly. Combined with quarterly recovery drills, our clients have verified, tested backups at all times — not just backup files that they hope work.