Is Downtime Minimized Where Unavoidable?
Type: DeepDive
Category: Release
Audience: SREs, backend engineers, infra architects managing service reliability
What This Perspective Covers
Some releases will require downtime.
But not all downtime is equal.
This perspective examines how to minimize service disruption, both in duration and impact.
Example Scenarios
- DB migrations with exclusive locks
- Monolithic systems with no hot-reload mechanism
- Large-scale batch updates that cannot stream
- Deployments needing cross-node coordination resets
Failure Patterns
- All users experience full downtime for minutes or hours
- No estimation or communication of window timing
- Scheduled at peak traffic times
- Restoration needs manual intervention or coordination
Smarter Downtime Planning
- Keep the system partially available where possible (read-only mode, admin-only access)
- Split the change into smaller steps that can be released incrementally
- Schedule based on traffic analytics, not guesswork
- Show countdowns or banners in the UI to prepare users
- Automate recovery, and tie alerting to restart conditions
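Scheduling from traffic analytics can be as simple as finding the quietest contiguous window in an hourly request profile. The sketch below is one possible approach, not a prescribed method: the function name `quietest_window` and the sample request counts are hypothetical, and it assumes the profile is a list of per-hour request totals where index 0 covers 00:00 to 01:00.

```python
from typing import Sequence, Tuple


def quietest_window(hourly_requests: Sequence[int], window_hours: int) -> Tuple[int, int]:
    """Return (start_hour, total_requests) for the least busy contiguous window.

    The profile is treated as circular, so a window may wrap past midnight.
    """
    n = len(hourly_requests)
    if not 1 <= window_hours <= n:
        raise ValueError("window_hours must be between 1 and len(hourly_requests)")

    # Total for the window starting at hour 0.
    current = sum(hourly_requests[:window_hours])
    best_start, best_total = 0, current

    # Slide the window one hour at a time: drop the hour that leaves,
    # add the hour that enters (wrapping around with modulo).
    for start in range(1, n):
        current -= hourly_requests[start - 1]
        current += hourly_requests[(start + window_hours - 1) % n]
        if current < best_total:
            best_start, best_total = start, current

    return best_start, best_total


# Hypothetical hourly request counts, made up for illustration only.
SAMPLE_PROFILE = [120, 80, 40, 30, 35, 60, 150, 400, 900, 1200, 1300, 1250,
                  1400, 1350, 1200, 1100, 1000, 950, 800, 700, 500, 350, 250, 180]

if __name__ == "__main__":
    print(quietest_window(SAMPLE_PROFILE, 2))  # (3, 65): start at 03:00
```

In practice the profile would come from real metrics, ideally averaged over several weeks and split by weekday, since a single day's traffic can be misleading.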
Principle
Downtime is sometimes inevitable.
But user surprise and prolonged recovery are not.
FAQ
- Q: Can we guarantee zero downtime?
  A: Not always. But minimizing blast radius is always possible.
- Q: Is partial service worse than full downtime?
  A: It depends. Transparent partial availability is often better than full lockout.
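To make "transparent partial availability" concrete, the read-only and admin-only modes mentioned earlier can be modeled as a small request gate. This is a minimal sketch under stated assumptions: `Mode`, `Decision`, and `gate` are hypothetical names rather than part of any framework, and it assumes an HTTP service in which GET, HEAD, and OPTIONS never modify state. Rejected requests get status 503 with a retry hint, which maps onto the standard `Retry-After` response header.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "normal"
    READ_ONLY = "read_only"    # reads served, writes rejected
    ADMIN_ONLY = "admin_only"  # only administrators get through


# Methods assumed not to modify state, so they can stay up in read-only mode.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    status: int  # 200 = proceed, 503 = temporarily unavailable
    retry_after_seconds: Optional[int] = None


def gate(mode: Mode, method: str, is_admin: bool, retry_after_seconds: int) -> Decision:
    """Decide whether a request may proceed under the current maintenance mode."""
    if mode is Mode.NORMAL:
        return Decision(True, 200)
    if mode is Mode.READ_ONLY and method.upper() in SAFE_METHODS:
        return Decision(True, 200)
    if mode is Mode.ADMIN_ONLY and is_admin:
        return Decision(True, 200)
    # Everything else is refused with a hint about when to retry,
    # so clients and users are told what is happening instead of timing out.
    return Decision(False, 503, retry_after_seconds)
```

Note that in this sketch administrators are also blocked from writing in read-only mode, on the assumption that the mode exists because the data store cannot accept writes at all.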