Skip to content

Is the Backup Strategy Well-Defined and Comprehensive?

Type: DeepDive
Category: Availability
Audience: SREs, data engineers, system owners responsible for recovery readiness


🔍 What This Perspective Covers

Having a backup is not the same as having a reliable, restorable backup.

This perspective checks whether the backup plan is clearly defined, covering all critical system states—not just the database.


⚠️ Failure Patterns

  • Only DB is backed up → app state, blobs, configs lost
  • Snapshots are scheduled but never tested
  • Encrypted backups without stored keys
  • Retention policy too short for real incident investigations
  • Backups stored in same region as production

✅ What Should Be Backed Up?

  • Application DB: RDS, CloudSQL, etc.
  • Blob storage: S3, GCS, user uploads
  • Message queues and in-flight events
  • Secrets and configuration state
  • Infrastructure-as-code (Terraform, Helm, etc.)
  • Audit logs and billing data

✅ Smarter Backup Design

  • List backup scope by component: what, when, how, where
  • Set retention periods by legal, business, and recovery needs
  • Store backups cross-region and secure (access-controlled, encrypted)
  • Define access recovery procedure: who can restore what
  • Monitor backup job success and validate integrity regularly

🧠 Principle

Backups are promises.
A broken promise at recovery time is a system failure.


❓ FAQ

  • Q: Isn’t snapshotting enough?
    A: Snapshots are one layer. You need multiple layers, verified.

  • Q: How do we know if our backup is working?
    A: Restore it. Periodically. Under real failure conditions.