Every application has errors. The question isn't whether errors happen — it's whether the errors you're seeing are normal or a sign of something seriously wrong.
Here's how to think about error rates, what thresholds make sense, and when you should actually panic.
The Problem with Error Rate Monitoring
Most teams approach error rate monitoring backwards:
- They set an arbitrary threshold (like "alert if >1% errors")
- They get alerted constantly for noise
- They either ignore alerts or turn them off entirely
- They miss the real issues when they happen
Understanding Error Types
HTTP Status Code Errors
| Range | Type | Typical Cause | Severity |
|---|---|---|---|
| 4xx | Client errors | Bad requests, auth failures, not found | Usually low (client problem) |
| 5xx | Server errors | App crashes, database failures, timeouts | High (your problem) |
Application-Level Errors
- Exceptions — Unhandled errors in your code
- Failed operations — Database queries, external API calls
- Validation failures — Business logic rejections
- Timeouts — Operations that took too long
What's a "Normal" Error Rate?
There's no universal answer, but here are some benchmarks:
| Error Rate | Assessment | Action |
|---|---|---|
| <0.01% | Excellent | Monitor for changes |
| 0.01% - 0.1% | Good | Normal background noise |
| 0.1% - 1% | Acceptable | Investigate if sustained |
| 1% - 5% | Concerning | Requires attention |
| >5% | Critical | Immediate investigation |
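As a sketch, the benchmark bands above can be encoded as a small lookup. The band edges and labels mirror the table; the function name `assess_error_rate` is illustrative, not from any library:

```python
def assess_error_rate(rate: float) -> str:
    """Map an error rate (as a fraction, e.g. 0.002 = 0.2%) to the
    benchmark bands from the table above."""
    if rate < 0.0001:
        return "Excellent"
    if rate < 0.001:
        return "Good"
    if rate < 0.01:
        return "Acceptable"
    if rate < 0.05:
        return "Concerning"
    return "Critical"
```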
Setting Meaningful Error Rate Alerts
1. Use Relative Change, Not Absolute Thresholds
Instead of "alert if >1% errors", use "alert if error rate increases by 50% from baseline":
# Bad: Static threshold
if error_rate > 0.01: alert()
# Good: Relative to baseline
baseline = get_baseline_error_rate() # e.g., last 7 days average
if error_rate > baseline * 1.5: alert()
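A minimal runnable version of the relative check, assuming you keep a history of daily error rates (the names `relative_alert` and `history` are illustrative):

```python
from statistics import mean

def relative_alert(current_rate: float, history: list[float],
                   factor: float = 1.5) -> bool:
    """Return True when the current error rate exceeds the baseline
    (the mean of `history`, e.g. the last 7 daily rates) by `factor`."""
    baseline = mean(history)
    return current_rate > baseline * factor
```

For example, if the last week averaged 0.2% errors, this check fires at anything above 0.3%.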
2. Require Sustained Duration
A single minute of elevated errors doesn't need a wake-up call. Require sustained elevation:
# Alert only if elevated for 3+ consecutive minutes
if all(rate > threshold for rate in last_3_minutes):
    alert()
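The sustained-duration idea can be sketched as a small stateful check, assuming per-minute error-rate samples arrive in order (the class name `SustainedAlert` is made up for illustration):

```python
from collections import deque

class SustainedAlert:
    """Fires only after `minutes` consecutive samples exceed `threshold`."""

    def __init__(self, threshold: float, minutes: int = 3):
        self.threshold = threshold
        self.window = deque(maxlen=minutes)

    def observe(self, rate: float) -> bool:
        """Record one per-minute sample; return True only when the
        window is full and every sample in it is elevated."""
        self.window.append(rate)
        return (len(self.window) == self.window.maxlen
                and all(r > self.threshold for r in self.window))
```

A single low sample breaks the `all()` condition, so one-minute blips never page anyone, which is exactly the behavior described above.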
3. Separate Signal from Noise
Not all errors are equal. Filter out known noise:
- Health checks — Often return errors during startup
- Bot traffic — Scanners hitting non-existent endpoints
- Legacy clients — Old app versions with known issues
- Expected failures — Rate limits, authentication failures
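The filtering step above might look like this sketch, assuming each error is a dict with `path`, `status`, and `user_agent` fields (the field names, paths, and bot substrings are illustrative assumptions, not a standard schema):

```python
NOISE_PATHS = {"/health", "/healthz", "/ready"}
BOT_AGENTS = ("bot", "crawler", "scanner")   # substring match, illustrative
EXPECTED_STATUSES = {401, 429}               # auth failures, rate limits

def is_noise(error: dict) -> bool:
    """Heuristically classify an error record as known noise."""
    if error["path"] in NOISE_PATHS:
        return True
    if any(b in error["user_agent"].lower() for b in BOT_AGENTS):
        return True
    if error["status"] in EXPECTED_STATUSES:
        return True
    return False

def signal_errors(errors: list[dict]) -> list[dict]:
    """Keep only the errors worth alerting on."""
    return [e for e in errors if not is_noise(e)]
```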
4. Alert by Error Category
| Error Category | Alert Threshold | Response Time |
|---|---|---|
| 500 (Internal Server Error) | Any sustained increase | Immediate |
| 502/503 (Gateway/Service Unavailable) | >0.1% sustained | Immediate |
| 504 (Timeout) | >1% sustained | Within 15 minutes |
| 429 (Rate Limited) | Sustained high volume | Within hours |
| 401/403 (Auth failures) | Spike detection | Investigate pattern |
| 404 (Not Found) | Usually don't alert | Review logs periodically |
When to Actually Panic
- Error rate suddenly jumps >10x baseline
- 5xx errors affecting >5% of requests
- Errors spreading across multiple endpoints
- Database connection errors or timeouts
- Errors after a deployment (rollback candidate)
Signs It's Probably Not an Emergency
- Errors are isolated to one endpoint
- Error rate is elevated but stable (not increasing)
- Errors correlate with traffic spike (capacity issue, not bug)
- Only 4xx errors (client-side issues)
- Errors are from known bad actors (bots, scanners)
Error Rate Monitoring Best Practices
1. Track Error Rate Over Time
Store error rates with enough granularity to spot trends. Hourly or 5-minute buckets work well for most applications.
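Bucketing can be as simple as truncating timestamps into fixed windows. A sketch, assuming events arrive as (Unix timestamp, is_error) pairs:

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute buckets

def bucket_rates(events: list[tuple[int, bool]]) -> dict[int, float]:
    """Compute the error rate per 5-minute bucket.
    Returns {bucket_start_timestamp: error_rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, is_error in events:
        bucket = ts - ts % BUCKET_SECONDS   # truncate to bucket start
        totals[bucket] += 1
        errors[bucket] += is_error          # bool counts as 0 or 1
    return {b: errors[b] / totals[b] for b in totals}
```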
2. Correlate with Deployments
Tag your metrics with deployment versions. When errors spike, you'll immediately know if it's related to a recent change.
3. Include Context in Alerts
Don't just send "error rate elevated". Include:
- Current rate vs baseline
- Affected endpoints
- Error types (status codes, exception types)
- Recent deployments or changes
- Sample error messages
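A context-rich alert payload might be assembled like this (the field names are illustrative, not tied to any particular alerting API):

```python
def build_alert(current: float, baseline: float, endpoints: list[str],
                status_counts: dict[int, int], last_deploy: str,
                sample: str) -> dict:
    """Bundle the context listed above into a single alert payload."""
    return {
        "summary": f"error rate {current:.2%} vs baseline {baseline:.2%}",
        "affected_endpoints": endpoints,
        "error_types": status_counts,
        "recent_deploy": last_deploy,
        "sample_error": sample,
    }
```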
4. Have Error Budgets
If you have an SLA, track error budget consumption:
Error Budget = 100% - SLA Target (the fraction of time you're allowed to be failing)
Example: 99.9% SLA
- Monthly budget: 0.1% of the month, about 43.8 minutes of allowed downtime
- If you've used 30 minutes this month, you have 13.8 minutes left
- Alert when budget drops below 30% remaining
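The budget arithmetic can be written out as a runnable sketch, using a 30.44-day average month, which is what yields the 43.8-minute figure for a 99.9% SLO:

```python
MINUTES_PER_MONTH = 30.44 * 24 * 60  # average month, ~43,834 minutes

def budget_minutes(slo: float) -> float:
    """Total monthly error budget in minutes for an availability SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def budget_remaining_pct(slo: float, used_minutes: float) -> float:
    """Fraction of the monthly budget still unspent (clamped at 0)."""
    total = budget_minutes(slo)
    return max(0.0, (total - used_minutes) / total)
```

With 30 minutes already spent against a 99.9% SLO, roughly 31.6% of the budget remains, just above the 30% alert line suggested above.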
5. Reduce, Don't Just Monitor
Error monitoring is useless if you don't act on it:
- Fix the top 3 error sources each week
- Address intermittent errors before they become outages
- Use errors to identify technical debt
Common Error Rate Monitoring Mistakes
Mistake 1: Alerting on All Errors
You'll drown in noise. Filter and categorize before alerting.
Mistake 2: Using Static Thresholds Only
A 0.5% error rate might be normal for one app but catastrophic for another. Use relative thresholds based on your own baseline.
Mistake 3: Ignoring 4xx Errors
While less urgent than 5xx, sustained 4xx errors can indicate API changes, broken clients, or security issues.
Mistake 4: Not Tracking Error Trends
A slowly increasing error rate over weeks is often more dangerous than a sudden spike — it indicates degrading system health.
Mistake 5: No Baseline
You can't know if errors are elevated if you don't know what's normal. Establish baselines during stable periods.
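Establishing a baseline can be a simple robust statistic over a known-stable window. A sketch: the median resists the occasional bad day better than the mean, and the 14-day window is an assumption, not a rule:

```python
from statistics import median

def establish_baseline(daily_rates: list[float], window: int = 14) -> float:
    """Baseline error rate: median over the last `window` days of
    per-day error rates collected during stable operation."""
    if not daily_rates:
        raise ValueError("need at least one day of data")
    return median(daily_rates[-window:])
```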
Error Rate Monitoring Checklist
- ☐ Track error rate by status code (4xx vs 5xx)
- ☐ Establish baseline error rates during stable periods
- ☐ Set alerts for relative changes, not just absolute thresholds
- ☐ Require sustained duration before alerting
- ☐ Filter known noise sources
- ☐ Correlate errors with deployments
- ☐ Include context in alert messages
- ☐ Track error budget consumption
- ☐ Review top error sources weekly
- ☐ Have a runbook for error rate incidents
Monitor Your Error Rates with OpsPulse
Track uptime and response codes alongside your application metrics. Get alerted when error patterns change, not on every individual error.
Start Free Monitoring →

Summary
Effective error rate monitoring comes down to:
- Know your baseline — What's normal for your application?
- Use relative thresholds — Alert on changes, not arbitrary numbers
- Require sustained duration — Don't alert on momentary spikes
- Categorize errors — 5xx needs faster response than 4xx
- Include context — Make alerts actionable
The goal isn't zero errors — it's catching the errors that actually matter.