Are Monitoring Targets Well-Defined and Alerts Properly Configured?
Type: Structure, DeepDive
Category: Non-functional
Audience: SREs, platform engineers, backend leads, reliability owners
🔍 What This Perspective Covers
If a system fails silently, it fails completely.
This perspective checks whether your monitoring targets are explicitly defined, aligned with business risk, and wired to clear, actionable alerting.
Monitoring Must-Haves
- Request rate, error rate, latency (the RED metrics; instrumented in the sketch after this list)
- Resource saturation: CPU, memory, DB pool, disk
- External API health and SLA tracking
- Queue depth and job retries
- User-visible behavior: blank screens, login failures, broken workflows
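The sketch below shows one way to emit a few of these signals from application code using the Python prometheus_client library. The metric names, labels, the port, and the /login route are illustrative assumptions, not prescriptions.

```python
# A minimal sketch of RED-metric and queue-depth instrumentation with
# prometheus_client. Names and values are assumptions for illustration.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["route", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["route"]
)
QUEUE_DEPTH = Gauge("job_queue_depth", "Number of jobs waiting in the queue")


def handle_request(route: str) -> None:
    """Record rate, errors, and duration for one request (the RED metrics)."""
    start = time.monotonic()
    status = "500"
    try:
        # ... real handler work would go here ...
        status = "200"
    finally:
        REQUESTS.labels(route=route, status=status).inc()
        LATENCY.labels(route=route).observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the scraper
    while True:
        QUEUE_DEPTH.set(random.randint(0, 50))  # stand-in for a real queue read
        handle_request("/login")
        time.sleep(1)
```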
⚠️ Failure Patterns
- Infra is monitored, but application issues go undetected
- No distinction between “warning” and “urgent” alerts
- Notification fatigue: too many noisy or flapping alerts
- Alerts lack context: responder unsure what triggered it (a sketch after this list shows one way to make context mandatory)
- Monitored metrics not tied to service-level indicators (SLIs)
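One way to avoid the severity and context problems above is to make them required fields on every alert definition. The sketch below is a minimal illustration; the class, field names, and the example PromQL-style expression are assumptions, not an existing schema.

```python
# A minimal sketch, assuming alert rules are defined in code. It enforces an
# explicit severity tier and enough context for the responder.
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"   # review during working hours
    URGENT = "urgent"     # page the on-call immediately


@dataclass
class AlertRule:
    name: str
    expression: str        # e.g. a PromQL-style condition (assumed format)
    severity: Severity
    summary: str           # what is broken
    impact: str            # who is impacted
    runbook_url: str       # where to look first

    def validate(self) -> None:
        """Reject rules that would produce context-free pages."""
        for field_name in ("summary", "impact", "runbook_url"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{self.name}: '{field_name}' must be filled in")


# Hypothetical example rule.
checkout_errors = AlertRule(
    name="CheckoutErrorRateHigh",
    expression='rate(http_requests_total{route="/checkout",status=~"5.."}[5m]) > 0.05',
    severity=Severity.URGENT,
    summary="Checkout 5xx rate above 5% for 5 minutes",
    impact="Customers cannot complete purchases",
    runbook_url="https://wiki.example.com/runbooks/checkout-errors",
)
checkout_errors.validate()
```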
✅ Smarter Observability Strategy
- Define what “bad” looks like in metrics and logs
- Align alerts to user pain, not just system status
- Include actionable info: what’s broken, who’s impacted, where to look
- Review alert history to eliminate dead weight (see the review sketch after this list)
- Correlate alerts with incident causes and recovery timelines
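Reviewing alert history does not need heavy tooling. A minimal sketch, assuming firings have been exported to a CSV with alert_name, acknowledged, and linked_incident columns (file name, column names, and thresholds are all assumptions):

```python
# Flag alerts that fire often but never correlate with an incident: these are
# candidates for retuning or deletion.
import csv
from collections import defaultdict

stats = defaultdict(lambda: {"fired": 0, "acked": 0, "incidents": 0})

with open("alert_history.csv", newline="") as f:
    for row in csv.DictReader(f):
        s = stats[row["alert_name"]]
        s["fired"] += 1
        s["acked"] += row["acknowledged"] == "true"
        s["incidents"] += row["linked_incident"] != ""

for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["fired"]):
    if s["fired"] >= 20 and s["incidents"] == 0:
        print(f"{name}: fired {s['fired']}x, acked {s['acked']}x, "
              f"0 incidents -> likely dead weight")
```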
🧠 Principle
If it’s not being watched,
it’s already broken—you just don’t know it yet.
❓ FAQ
- Q: How many alerts is too many?
  A: If humans stop reading them, you’ve gone too far. Less is more, if better targeted.
- Q: Who defines what to monitor?
  A: The team that owns the feature. Monitoring is a product concern.