Is the Backup Strategy Well-Defined and Comprehensive?¶
Type: DeepDive
Category: Availability
Audience: SREs, data engineers, system owners responsible for recovery readiness
π What This Perspective Covers¶
Having a backup is not the same as having a reliable, restorable backup.
This perspective checks whether the backup plan is clearly defined, covering all critical system statesβnot just the database.
β οΈ Failure Patterns¶
- Only DB is backed up β app state, blobs, configs lost
- Snapshots are scheduled but never tested
- Encrypted backups without stored keys
- Retention policy too short for real incident investigations
- Backups stored in same region as production
β What Should Be Backed Up?¶
- Application DB: RDS, CloudSQL, etc.
- Blob storage: S3, GCS, user uploads
- Message queues and in-flight events
- Secrets and configuration state
- Infrastructure-as-code (Terraform, Helm, etc.)
- Audit logs and billing data
β Smarter Backup Design¶
- List backup scope by component: what, when, how, where
- Set retention periods by legal, business, and recovery needs
- Store backups cross-region and secure (access-controlled, encrypted)
- Define access recovery procedure: who can restore what
- Monitor backup job success and validate integrity regularly
π§ Principle¶
Backups are promises.
A broken promise at recovery time is a system failure.
β FAQ¶
-
Q: Isnβt snapshotting enough?
A: Snapshots are one layer. You need multiple layers, verified. -
Q: How do we know if our backup is working?
A: Restore it. Periodically. Under real failure conditions.