Do You Have a Fallback Plan for Asynchronous Failures?
Type: DeepDive
Category: Data
Audience: Engineers designing event pipelines, retries, and error handling
What This Is Actually About
Async failures aren't rare.
They're just delayed.
So the question is:
- What happens when retries fail?
- What if downstream data is already inconsistent?
- How do you stop cascading retries from compounding failure?
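One common answer to the last question is to cap and spread out retries so that many failing consumers do not hammer a struggling dependency in lockstep. The sketch below shows capped exponential backoff with full jitter. The function name `backoff_delays` and its parameters are hypothetical, chosen for illustration, and the default values are assumptions rather than recommendations.

```python
import random


def backoff_delays(max_attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration (in seconds) per retry attempt.

    Capped exponential backoff with full jitter: the ceiling doubles on
    each attempt up to `cap`, and the actual delay is a random fraction
    of that ceiling. `rng` is injectable so the schedule can be tested.
    """
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


# With the randomness pinned to 1.0, the ceilings themselves are visible.
print(list(backoff_delays(max_attempts=4, base=1.0, cap=5.0, rng=lambda: 1.0)))
```

The generator stops after `max_attempts`, so the caller has to decide what happens next. That decision is the fallback plan.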
Failure Patterns
- Retry loops that cause double-inserts
- No dead-letter queue, just silent drops
- Inconsistent intermediate states during retries
- Failures that reprocess already-corrected data
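To make the first and last patterns concrete, here is a minimal sketch of a handler that refuses to apply the same event twice and refuses to overwrite newer state with an older retry. It assumes every event carries a unique `event_id` and a monotonically increasing `version` per entity. The `Store` class and its in-memory dictionaries are hypothetical stand-ins for a real database.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    rows: dict = field(default_factory=dict)         # entity_id -> (version, payload)
    seen_events: set = field(default_factory=set)    # event ids already applied

    def apply(self, event_id, entity_id, version, payload):
        """Apply a mutation at most once and return a short status string."""
        if event_id in self.seen_events:
            return "duplicate"   # a retry of an event that already landed
        current = self.rows.get(entity_id)
        if current is not None and current[0] >= version:
            return "stale"       # would overwrite newer (possibly corrected) data
        self.rows[entity_id] = (version, payload)
        self.seen_events.add(event_id)
        return "applied"


store = Store()
print(store.apply("e1", "order-1", 1, {"qty": 1}))
print(store.apply("e1", "order-1", 1, {"qty": 1}))   # same event retried
print(store.apply("e2", "order-1", 3, {"qty": 5}))   # a later correction
print(store.apply("e3", "order-1", 2, {"qty": 2}))   # delayed retry of older state
```

In a real system the duplicate check and the write would need to happen atomically, for example inside one transaction. This sketch ignores concurrency.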
Good Fallback Strategies
- Dead-letter queues with alerting and visibility
- State versioning or timestamps to detect reprocessing conflicts
- Explicit deduplication checks on mutation events
- Manual override or quarantine paths for human repair
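The first and last strategies can be sketched together: retry a bounded number of times, then park the event in a dead-letter list and tell someone. Everything here is an illustrative assumption: the `Pipeline` and `DeadLetter` names, the in-memory list standing in for a real queue, and the `alert` callback standing in for whatever paging or logging you use.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetter:
    event: dict
    error: str
    attempts: int


@dataclass
class Pipeline:
    handler: Callable[[dict], None]        # raises on failure
    alert: Callable[[DeadLetter], None]    # makes the failure visible
    max_attempts: int = 3
    dead_letters: list = field(default_factory=list)

    def process(self, event: dict) -> bool:
        """Return True on success, False once the event is dead-lettered."""
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
                return True
            except Exception as exc:       # broad on purpose in this sketch
                last_error = exc
        # Retries exhausted: quarantine the event instead of dropping it.
        letter = DeadLetter(event, repr(last_error), self.max_attempts)
        self.dead_letters.append(letter)
        self.alert(letter)
        return False


def always_fails(event):
    raise RuntimeError("downstream unavailable")


pipeline = Pipeline(handler=always_fails, alert=lambda dl: print("ALERT:", dl))
pipeline.process({"id": 42})
```

The dead-letter list doubles as the quarantine path: a human can inspect `error`, repair the event, and feed it back through `process`.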
Principle
Your system should fail visibly, not repeatedly.
And when it fails, it should do so in a way that helps you recover meaningfully.
FAQ
- Q: Can't we just retry until it works?
  A: What if it never does?
- Q: Should fallback always mean human involvement?
  A: No, but someone should know it happened.