Monitoring advice usually assumes enterprise resources: dedicated ops teams, complex toolchains, and 24/7 coverage. But most applications are built and maintained by small teams with limited time and budget.
Here's a practical approach to monitoring that works when you don't have a dedicated SRE team.
Core Principle: Start Simple
Start with:
- Is it up? — External uptime monitoring
- Is it broken? — Error tracking
- Is it slow? — Basic latency metrics
Add complexity only when you have a specific need, not because a blog post said you should.
The Monitoring Hierarchy
Level 1: External Uptime (Essential)
Check if your service is reachable from the internet.
- What: HTTP requests to key endpoints
- Frequency: Every 1-5 minutes
- Alert: When an endpoint is unreachable for 2-3 consecutive checks
- Tool: OpsPulse, UptimeRobot, Pingdom
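At its core, a Level 1 check is just a scheduled HTTP request. A minimal sketch in Python (the URL, schedule, and alert delivery are up to you; hosted tools add multi-region probes and notification plumbing on top of this):

```python
import urllib.request
import urllib.error

def check_endpoint(url: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers with a 2xx/3xx status.

    Run this from cron every 1-5 minutes, from a host *outside*
    your own infrastructure, so DNS and network failures count too.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout: all count as down.
        return False
```

The point is not to self-host this script; it is to understand that the hosted tools are doing exactly this, plus the parts that are hard to do well yourself (many regions, escalation, status history).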
Level 2: Health Checks (Essential)
Internal check that your application is healthy.
- What: /health endpoint checking critical dependencies
- Frequency: Every 10-30 seconds
- Consumer: Load balancer, orchestrator
- Tool: Built into your application
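A /health endpoint can be sketched with nothing but the standard library. The dependency checks below are hypothetical stubs you would replace with real pings of your own database and cache:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_checks() -> dict:
    """Each check returns True/False. These are placeholder stubs:
    replace with real probes of your critical dependencies,
    e.g. SELECT 1 against the database, PING against Redis."""
    return {
        "database": True,
        "cache": True,
    }

def health_payload() -> tuple:
    """Return (HTTP status, JSON body) for the health endpoint.
    503 tells the load balancer to pull this instance from rotation."""
    checks = run_checks()
    healthy = all(checks.values())
    body = json.dumps({"status": "ok" if healthy else "degraded",
                       "checks": checks}).encode()
    return (200 if healthy else 503, body)

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status, body = health_payload()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Keep the checks cheap; this endpoint gets hit every 10-30 seconds.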
Level 3: Error Tracking (Essential)
Know when your application throws errors.
- What: Exception tracking with context
- Alert: New errors, spike in error rate
- Tool: Sentry, Bugsnag, Honeybadger
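To make "with context" concrete, here is roughly what an error event contains. A hosted SDK like Sentry assembles and ships this for you automatically, so treat this as an illustration of the idea, not an integration:

```python
import traceback

def capture_exception(exc: BaseException, context: dict) -> dict:
    """Build an error event like the ones a tracker records:
    exception type, message, stack trace, plus your own tags
    (request id, user id, release). Illustrative only."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "stacktrace": traceback.format_exception(
            type(exc), exc, exc.__traceback__),
        "context": context,
    }

try:
    1 / 0
except ZeroDivisionError as e:
    # The context fields here are hypothetical examples.
    event = capture_exception(e, {"request_id": "abc123", "user_id": 42})
```

The context is what turns "a ZeroDivisionError happened" into "user 42's request abc123 failed", which is what you actually need at 2am.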
Level 4: Metrics (Important)
Track request counts, latency, and business metrics.
- What: Request rate, error rate, latency percentiles
- Alert: Anomalies, trends, thresholds
- Tool: StatsD + hosted metrics, or Prometheus
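The StatsD wire format is simple enough to emit by hand: one UDP datagram per metric, shaped `name:value|type`. A sketch (the metric names and local agent address are assumptions; in practice you would use a client library):

```python
import socket
import time

def statsd_send(metric: str, value: float, kind: str,
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Format and fire one StatsD datagram. kind is 'c' (counter),
    'ms' (timer), or 'g' (gauge). UDP is fire-and-forget, so
    instrumentation never blocks or crashes the request path."""
    payload = f"{metric}:{value}|{kind}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))
    sock.close()
    return payload

# Instrumenting a request handler (hypothetical metric names):
start = time.monotonic()
# ... handle the request ...
elapsed_ms = (time.monotonic() - start) * 1000
statsd_send("myapp.request.count", 1, "c")
statsd_send("myapp.request.latency", round(elapsed_ms, 1), "ms")
```

Counters and timers on your request path cover most of the rate/latency/error picture for a small service.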
Level 5: Logs (Important)
Centralize logs for debugging.
- What: Application logs, access logs
- Alert: Pattern matching (errors, security events)
- Tool: Cloud logging, Loki, ELK (if you have resources)
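Centralized logging pays off most when the application emits structured lines. A sketch using Python's stdlib logging to write one JSON object per line, which shippers like Loki or cloud logging can parse without custom rules:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user logged in")
```

Structured logs also make the "pattern matching" alerts above trivial: match on `level` and `message` fields instead of fragile regexes over free text.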
Level 6: Distributed Tracing (Advanced)
Trace requests across services.
- When: 10+ services with complex interactions
- Tool: Jaeger, Zipkin, Honeycomb
What to Monitor
The Golden Signals (Google SRE)
| Signal | What It Tells You |
|---|---|
| Latency | How long requests take |
| Traffic | How many requests you're handling |
| Errors | Rate of failed requests |
| Saturation | How full your resources are (CPU, memory, disk) |
For Small Teams, Simplify to:
- Can users reach my service? (External uptime)
- Are requests failing? (Error rate)
- Is it fast enough? (p95/p99 latency)
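Percentiles matter here because averages hide the slow tail: a few very slow requests barely move the mean. A nearest-rank p95/p99 sketch (the sample latencies are made up):

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that
    at least p% of all samples are <= it."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[max(k, 0)]

# Hypothetical request latencies in milliseconds:
latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 950, 14]
# mean is ~127 ms, but p50 is 15 ms: the average is dominated
# by two outliers that p50 ignores and p95/p99 surface.
```

In practice your metrics backend computes these for you; the point is to alert on p95/p99, not the mean.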
Alerting Best Practices
Alert on Symptoms, Not Causes
- Bad: Alert on CPU > 80%
- Good: Alert on latency > 2s (which might be caused by CPU)
Why? High CPU might be fine during peak hours. Slow responses are never fine.
Require Consecutive Failures
Don't alert on a single failed check. Require 2-3 consecutive failures to avoid noise from transient issues.
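The consecutive-failure rule is a few lines of state. A sketch, with the threshold as a tunable parameter:

```python
class FailureGate:
    """Fire an alert only after `threshold` consecutive failed checks;
    any success resets the streak. Fires exactly once per streak,
    so an ongoing outage does not re-page every check interval."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def observe(self, check_passed: bool) -> bool:
        """Feed in one check result; returns True when an alert
        should fire."""
        self.streak = 0 if check_passed else self.streak + 1
        return self.streak == self.threshold
```

With checks every minute and a threshold of 3, a transient blip never pages anyone, and a real outage pages within ~3 minutes.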
Include Context in Alerts
Bad: "High error rate"
Good: "API error rate 5.2% (normal <1%) for user-service
Started: 14:32 UTC
Affected endpoints: /login, /register
Dashboard: https://grafana.example.com/d/user-service"
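A template for the "good" alert above is worth building once and reusing; a sketch where every value is a placeholder your alerting pipeline would fill in:

```python
def format_alert(service: str, metric: str, value: str, normal: str,
                 started: str, endpoints: list, dashboard: str) -> str:
    """Render an actionable alert message: what is wrong, how wrong,
    since when, where, and a link to dig in."""
    return (
        f"{service}: {metric} {value} (normal {normal})\n"
        f"Started: {started}\n"
        f"Affected endpoints: {', '.join(endpoints)}\n"
        f"Dashboard: {dashboard}"
    )

msg = format_alert(
    service="user-service",
    metric="API error rate",
    value="5.2%",
    normal="<1%",
    started="14:32 UTC",
    endpoints=["/login", "/register"],
    dashboard="https://grafana.example.com/d/user-service",
)
```

Whoever gets paged can act from the message alone, without first reconstructing what "high error rate" even refers to.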
Route Alerts Appropriately
- P1 (wake up): Complete outage, data loss risk
- P2 (during work hours): Degraded performance, partial issues
- P3 (ticket): Minor issues, trends worth watching
Common Monitoring Mistakes
Mistake 1: Alert Fatigue
Problem: So many alerts that you ignore them all.
Fix: Audit alerts monthly. Remove or silence alerts that don't require action.
Mistake 2: Monitoring from Inside Only
Problem: All checks run from your infrastructure. DNS issues, network problems, and regional outages go undetected.
Fix: Add external monitoring from multiple regions.
Mistake 3: No Runbooks
Problem: Alert fires, but nobody knows what to do.
Fix: Every alert should link to a runbook with remediation steps.
Mistake 4: Dashboard Overload
Problem: 50 dashboards, none useful during incidents.
Fix: One "system overview" dashboard. Link to detailed views from there.
Mistake 5: Monitoring Everything
Problem: Tracking metrics nobody looks at.
Fix: Monitor what matters. If you wouldn't change behavior based on a metric, don't track it.
The Small Team Monitoring Stack
Minimum Viable Stack
- External uptime: OpsPulse (free for 3 monitors)
- Error tracking: Sentry (free tier available)
- Health checks: Built into your app
- Logs: Whatever your hosting provider includes
When You Grow
- Add metrics: Prometheus + Grafana or hosted solution
- Add centralized logs: Loki, Papertrail, or cloud logging
- Add on-call: PagerDuty or Opsgenie (when you have a team)
Monitoring Checklist
Essential (Do First)
- ☐ External uptime monitoring for key endpoints
- ☐ Health check endpoint for load balancer
- ☐ Error tracking with context
- ☐ Basic alerting (email or chat)
Important (Do Soon)
- ☐ Request metrics (count, latency, errors)
- ☐ Centralized logs
- ☐ Runbooks for common alerts
- ☐ Status page for users
Advanced (Do When Needed)
- ☐ Distributed tracing
- ☐ Business metrics tracking
- ☐ SLO/SLI framework
- ☐ On-call rotation and escalation
Start with External Uptime Monitoring
OpsPulse provides simple, no-noise uptime monitoring. Know when your service is down before users tell you.
Start Free Monitoring →

Summary
Effective monitoring for small teams:
- Start simple: Uptime, errors, latency
- Add complexity when needed: Not before
- Alert on symptoms: User-facing issues, not internal metrics
- Reduce noise: Better to miss an alert than ignore all alerts
- Include context: Alerts should be actionable
- Monitor externally: Know if users can reach you
The goal isn't perfect monitoring. The goal is enough monitoring that you can sleep at night and respond quickly when things break.