Is the Backup Strategy Well-Defined and Comprehensive?
Type: DeepDive
Category: Availability
Audience: SREs, data engineers, system owners responsible for recovery readiness
What This Perspective Covers
Having a backup is not the same as having a reliable, restorable backup.
This perspective checks whether the backup plan is clearly defined, covering all critical system states, not just the database.
Failure Patterns
- Only the DB is backed up, so app state, blobs, and configs are lost
- Snapshots are scheduled but never tested
- Backups are encrypted, but the decryption keys are not stored anywhere recoverable
- Retention policy too short for real incident investigations
- Backups stored in same region as production
What Should Be Backed Up?
- Application DB: RDS, CloudSQL, etc.
- Blob storage: S3, GCS, user uploads
- Message queues and in-flight events
- Secrets and configuration state
- Infrastructure-as-code (Terraform, Helm, etc.)
- Audit logs and billing data
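
One way to keep this list from drifting is to compare it against what the backup plan actually covers. The following is a minimal sketch; the category names and the `missing_categories` helper are hypothetical labels chosen to mirror the list above, not part of any tool.

```python
from typing import List, Set

# Hypothetical category names mirroring the list above.
REQUIRED_CATEGORIES = {
    "application_db",
    "blob_storage",
    "message_queues",
    "secrets_and_config",
    "infrastructure_as_code",
    "audit_and_billing",
}


def missing_categories(backed_up: Set[str]) -> List[str]:
    """Return the required categories that no backup job covers, sorted by name."""
    return sorted(REQUIRED_CATEGORIES - backed_up)


# Example: a plan that only covers the database and blob storage.
current_plan = {"application_db", "blob_storage"}
for gap in missing_categories(current_plan):
    print(f"not backed up: {gap}")
```

A check like this could run in CI against the real backup configuration, so a newly added component without a backup shows up as a gap rather than being discovered during an incident.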
Smarter Backup Design
- List backup scope by component: what, when, how, where
- Set retention periods by legal, business, and recovery needs
- Store backups cross-region and securely (access-controlled, encrypted)
- Define a restore access procedure: who can restore what
- Monitor backup job success and validate integrity regularly
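
The design points above can be captured as a per-component record and checked automatically. This is a sketch under stated assumptions: the `BackupSpec` fields, the `review` function, the region names, and the 90-day thresholds are all illustrative and should be replaced with values that match your own legal, business, and recovery requirements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupSpec:
    component: str                          # what
    schedule: str                           # when (for example, a cron expression)
    method: str                             # how (snapshot, dump, replication, ...)
    backup_region: str                      # where
    production_region: str
    retention_days: int
    encrypted: bool
    key_reference: Optional[str]            # where the decryption key is kept
    restore_owner: Optional[str]            # who is allowed to restore
    days_since_restore_test: Optional[int]  # None means never tested


def review(spec: BackupSpec,
           min_retention_days: int = 90,    # assumed threshold
           max_test_age_days: int = 90      # assumed threshold
           ) -> List[str]:
    """Return a list of findings for one component; an empty list means no issues found."""
    findings = []
    if spec.backup_region == spec.production_region:
        findings.append("backup stored in same region as production")
    if spec.retention_days < min_retention_days:
        findings.append("retention shorter than required minimum")
    if spec.encrypted and not spec.key_reference:
        findings.append("encrypted backup without a recorded key location")
    if not spec.restore_owner:
        findings.append("no restore owner defined")
    if spec.days_since_restore_test is None:
        findings.append("restore never tested")
    elif spec.days_since_restore_test > max_test_age_days:
        findings.append("restore test is stale")
    return findings


# Example with hypothetical values.
uploads = BackupSpec(
    component="user-uploads", schedule="0 2 * * *", method="object replication",
    backup_region="region-a", production_region="region-a", retention_days=30,
    encrypted=True, key_reference=None, restore_owner="storage-team",
    days_since_restore_test=None,
)
for finding in review(uploads):
    print(f"{uploads.component}: {finding}")
```

Each check corresponds to one of the failure patterns listed earlier, so the record doubles as documentation of the backup scope and as input to a recurring audit.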
Principle
Backups are promises.
A broken promise at recovery time is a system failure.
FAQ
- Q: Isn't snapshotting enough?
  A: Snapshots are one layer. You need multiple layers, verified.
- Q: How do we know if our backup is working?
  A: Restore it. Periodically. Under real failure conditions.
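
As a small illustration of "restore it", the sketch below restores a file-level backup into a scratch directory and compares every file against checksums recorded at backup time. It assumes a plain `tar.gz` archive produced by your own tooling (so the archive is trusted); the function names are hypothetical. A database or queue restore would need its own equivalent check, such as loading the dump and querying it.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def create_backup(source_dir: Path, archive: Path) -> Dict[str, str]:
    """Archive source_dir and return the manifest recorded at backup time."""
    manifest = build_manifest(source_dir)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    return manifest


def verify_restore(archive: Path, manifest: Dict[str, str]) -> List[str]:
    """Restore into a scratch directory; return paths that are missing or altered."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)  # archive is assumed to be trusted
        restored = build_manifest(Path(scratch))
    return sorted(p for p, digest in manifest.items() if restored.get(p) != digest)
```

An empty result from `verify_restore` means every file listed in the manifest came back intact. Running this on a schedule, and alerting when the result is non-empty or the job does not run at all, turns the restore test into a monitored signal instead of a one-off exercise.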