Microservices monitoring advice usually assumes enterprise scale: service meshes, distributed tracing, sophisticated observability platforms. But most teams running microservices aren't enterprises: they're a few engineers responsible for a handful of services.
Here's how to monitor microservices when you don't have a dedicated ops team.
## The Problem with Microservices Monitoring Advice
Typical advice for microservices monitoring includes:
- Implement distributed tracing (Jaeger, Zipkin)
- Deploy a service mesh (Istio, Linkerd)
- Use centralized logging (ELK, Loki)
- Set up comprehensive metrics (Prometheus + Grafana)
- Implement correlation IDs

All of this is valuable at enterprise scale, but each piece is another system to deploy, learn, and keep running. For a small team, that monitoring stack can easily demand more attention than the services it is supposed to watch.
## What You Actually Need
For small teams with a handful of services, focus on:
- Health checks — Is each service running?
- Request tracking — How many requests? How fast? Any errors?
- Error logging — When things fail, what went wrong?
- External monitoring — Is the whole system reachable?
## Level 1: Basic Health Checks
Every service should have a health endpoint:
```
GET /health

{
  "status": "healthy",
  "service": "user-api",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "dependencies": {
    "database": "ok",
    "cache": "ok"
  }
}
```
### What to Check
- Process is alive — Can respond to HTTP requests
- Critical dependencies — Database, cache, message queue
- Basic functionality — A cheap sanity check that the service can do real work, not just return 200
### Keep It Fast
Health checks should return in <1 second. Don't run expensive queries or deep checks.
## Level 2: Request Metrics
Track basic request metrics for each service:
| Metric | Why It Matters |
|---|---|
| Request count | Is traffic normal? |
| Error count | Is something broken? |
| Response time (p95, p99) | Is performance degrading? |
| Active requests | Is the service overloaded? |
### Simple Implementation
```javascript
// Middleware to track request count, status class, and duration.
// Assumes a StatsD/DogStatsD-style `metrics` client initialized elsewhere.
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    const status = res.statusCode;
    // Bucket statuses into 2xx / 4xx / 5xx to keep cardinality low
    metrics.increment('requests.total', {
      service: 'user-api',
      status: Math.floor(status / 100) + 'xx',
    });
    metrics.histogram('request.duration_ms', duration, { service: 'user-api' });
  });
  next();
});
```
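The p95/p99 latencies in the table above are normally computed by your metrics backend, but the idea is simple: a nearest-rank percentile over a window of recorded durations.

```javascript
// Nearest-rank percentile over a window of recorded durations (in ms).
// Metrics backends do this for you; this just shows the idea.
function percentile(samples, p) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const durations = [12, 15, 11, 230, 14, 13, 16, 12, 480, 15];
console.log(percentile(durations, 95)); // → 480: the slow tail dominates
```

This is also why the table asks for p95/p99 rather than averages: the mean of those samples is about 82 ms, even though most requests finish in around 14 ms.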
## Level 3: Error Tracking
When errors happen, you need to know what failed and why.
### What to Log
- Error message — What went wrong?
- Stack trace — Where in the code?
- Request context — What request caused it?
- Service name — Which service?
- Timestamp — When?
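Even if you adopt a hosted error tracker, it helps to emit errors as structured JSON lines covering those fields. A sketch, with illustrative field names:

```javascript
// Build one structured entry per error so a log aggregator can
// index the fields. Field names are illustrative; match them to
// whatever your logging pipeline expects.
function errorLogEntry(err, req) {
  return {
    level: 'error',
    service: 'user-api',
    timestamp: new Date().toISOString(),
    message: err.message,
    stack: err.stack,
    method: req.method,
    path: req.url,
    request_id: req.requestId, // propagated correlation ID, if any
  };
}

// Usage inside an Express-style error handler:
// app.use((err, req, res, next) => {
//   console.error(JSON.stringify(errorLogEntry(err, req)));
//   res.status(500).json({ error: 'internal error' });
// });
```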
### Tools for Small Teams
- Sentry — Error tracking with context
- Bugsnag — Error monitoring and alerting
- Honeybadger — Simple error tracking
## Level 4: Request Correlation
When a request spans multiple services, you need to connect the dots:
```javascript
// Generate a request ID, or propagate one passed in by an upstream caller
const crypto = require('crypto');

app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || crypto.randomUUID();
  req.requestId = requestId;
  res.setHeader('x-request-id', requestId);
  next();
});

// Pass it along to downstream services
const response = await fetch('http://orders-api/orders', {
  headers: { 'x-request-id': req.requestId },
});
```
Now you can search logs for a specific request ID and see its journey across services.
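For that search to work, every log line the service emits needs the request ID stamped on it. A minimal helper (field names are illustrative):

```javascript
// Build one JSON log line stamped with the request ID, so logs from
// different services can be joined on it with grep or a log aggregator.
function logLine(requestId, service, message, extra = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    service,
    request_id: requestId,
    message,
    ...extra,
  });
}

// In a handler:
// console.log(logLine(req.requestId, 'user-api', 'order created', { orderId: 17 }));
```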
## Level 5: External Monitoring
Internal metrics tell you if your services are running. External monitoring tells you if users can reach them.
### What to Monitor Externally
- API gateway / load balancer — Entry point to your services
- Key endpoints — Login, critical API paths
- Health endpoints — Each service's /health
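A hosted uptime service covers this with no code, but the core of an external probe is small. A sketch using Node's global `fetch`; the endpoint URLs are placeholders:

```javascript
// Bare-bones external probe: run it from OUTSIDE your own
// infrastructure (e.g. a cheap VM in another region).
const endpoints = [
  'https://api.example.com/health',
  'https://api.example.com/login',
];

// Fetch one URL with a hard timeout; never throw, always report.
async function probe(url, timeoutMs = 5000) {
  const started = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { url, ok: res.ok, status: res.status, ms: Date.now() - started };
  } catch (err) {
    return { url, ok: false, error: err.message, ms: Date.now() - started };
  }
}

async function checkAll() {
  const results = await Promise.all(endpoints.map((u) => probe(u)));
  for (const r of results.filter((x) => !x.ok)) {
    console.error(`DOWN: ${r.url} (${r.status ?? r.error})`);
    // wire up email/Slack/pager alerting here
  }
  return results;
}

// Run once a minute, e.g.: setInterval(checkAll, 60_000);
```

The key property is where it runs, not what it does: the same check from inside your network would miss every failure mode listed below.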
### Why External Monitoring Matters
- DNS issues (your servers are fine, but nobody can find them)
- Network problems (your region is isolated)
- Load balancer failures (services healthy but unreachable)
- SSL certificate expiration
## Monitoring Architecture for Small Teams
### What to Build
```
     External Monitoring
             ↓
     [ Load Balancer ]
      /      |      \
 [API-1] [API-2] [API-3]
      \      |      /
   [ Shared Database ]
             ↑
    [ Error Tracking ]
```
### Minimum Viable Stack
- Health checks: Built into each service
- Metrics: Statsd → hosted service OR simple Prometheus
- Error tracking: Sentry/Bugsnag
- External monitoring: OpsPulse (uptime checks)
- Logging: Structured logs to files → cloud logging service
## Common Mistakes
### Mistake 1: Over-Instrumenting
Problem: Tracking every possible metric, creating noise.
Fix: Start with request count, error count, response time. Add more when you have a specific need.
### Mistake 2: Ignoring External Monitoring
Problem: All monitoring is internal. You don't know when external users can't reach you.
Fix: Add external uptime checks for your public endpoints.
### Mistake 3: Complex Tooling Too Early
Problem: Deploying Istio, Jaeger, and full observability stack for 3 services.
Fix: Start simple. Add complexity when you have the team and the need.
### Mistake 4: No Request Correlation
Problem: Can't trace a request across services when debugging.
Fix: Add request IDs early. It's simple and pays off immediately.
## Microservices Monitoring Checklist
### Each Service
- ☐ Health endpoint (/health)
- ☐ Request metrics (count, latency, errors)
- ☐ Request ID propagation
- ☐ Structured logging with service name
- ☐ Error tracking integration
### System-Wide
- ☐ External monitoring for public endpoints
- ☐ Centralized log aggregation
- ☐ Metrics dashboard (even if simple)
- ☐ Alert routing (email, Slack, PagerDuty)
- ☐ Runbook for common issues
## Monitor Your Microservices Externally
OpsPulse provides external uptime monitoring for your API gateway and individual services. Know when users can't reach you, not just when your services are running.
Start Free Monitoring →

## When to Add More Complexity
Add distributed tracing when:
- You have 10+ services with complex interactions
- You frequently need to debug cross-service issues
- You have someone who can maintain the tracing infrastructure
Add a service mesh when:
- You need mutual TLS between services
- You want automatic retries and circuit breaking
- You're running Kubernetes and can manage the complexity
Add comprehensive metrics when:
- You need to optimize performance
- You're debugging capacity issues
- You have SLOs to meet
## Summary
For small teams with microservices:
- Health checks first — Every service should report its status
- Basic metrics — Request count, errors, latency
- Error tracking — Know when and why things fail
- Request IDs — Connect requests across services
- External monitoring — Verify users can reach you
You can always add complexity later. Start with what gives you visibility today.