Your database is the heart of your application. When it slows down, everything slows down. When it fails, everything fails. Yet most small teams don't monitor their databases until something breaks.
Here's what you actually need to track — and what thresholds to set — to catch database issues before they become outages.
Why Database Monitoring Matters
Database problems have a nasty habit of starting small and escalating quickly:
- Slow queries start affecting page load times
- Connection leaks gradually exhaust your pool
- Storage creep eventually hits limits
- Replication lag serves stale data to users
Essential Database Metrics to Monitor
You don't need to track everything. Focus on these core metrics:
1. Connection Metrics
| Metric | What It Means | Alert Threshold |
|---|---|---|
| Active connections | Current open connections | >80% of max_connections |
| Idle connections | Connections not in use | >50% of pool (possible leak) |
| Connection wait time | Time waiting for available connection | >100ms |
| Connection errors | Failed connection attempts | Any sustained errors |
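The connection thresholds above reduce to a few comparisons once you have the counts. Here is a minimal Python sketch, assuming you already collect these numbers somewhere (for example from `pg_stat_activity` and your pooler's stats). The `ConnectionStats` fields and the `check_connections` name are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConnectionStats:
    """Hypothetical snapshot of connection counts; field names are illustrative."""
    open_connections: int   # all connections currently open on the server
    idle_connections: int   # open connections not running a query
    max_connections: int    # server-side connection limit
    pool_size: int          # size of the application or proxy pool


def check_connections(stats: ConnectionStats) -> List[str]:
    """Return a warning for each connection threshold that is exceeded."""
    warnings = []
    if stats.open_connections > 0.8 * stats.max_connections:
        warnings.append("open connections above 80% of max_connections")
    if stats.idle_connections > 0.5 * stats.pool_size:
        warnings.append("idle connections above 50% of pool (possible leak)")
    return warnings
```

The function only reports which limits are breached. Whether a breach should notify anyone depends on how long it lasts, which is a separate decision.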
2. Query Performance
| Metric | What It Means | Alert Threshold |
|---|---|---|
| Query latency (p99) | 99th percentile query time | >1 second |
| Slow query count | Queries exceeding threshold | Sustained increase |
| Query throughput | Queries per second | Unusual drop or spike |
| Query errors | Failed queries | Any sustained errors |
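If your tooling does not report percentiles directly, p99 can be computed from raw latency samples. This is a small Python sketch using the nearest-rank method, which is one of several common percentile definitions. The function names and the 1000 ms default are assumptions that mirror the table above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_breached(latencies_ms, threshold_ms=1000.0):
    """True when the 99th percentile latency exceeds the threshold."""
    return percentile(latencies_ms, 99) > threshold_ms
```

With 100 samples, a single very slow query does not move p99 past the threshold, but two do. That is why p99 is a steadier alerting signal than the maximum.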
3. Storage Metrics
| Metric | What It Means | Alert Threshold |
|---|---|---|
| Disk usage % | Storage consumed | >80% (critical: >90%) |
| Table size growth | Largest tables over time | Unusual growth rate |
| Index bloat | Unused space in indexes | >30% bloat |
| Table bloat | Dead tuples / unused space | >20% dead tuples |
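The disk and dead-tuple thresholds can be checked with very little code. The Python sketch below reads disk usage for a path with the standard library and classifies it using the warning and critical levels from the table. The function names are hypothetical, and the tuple counts are assumed to come from something like the `n_live_tup` and `n_dead_tup` columns of PostgreSQL's `pg_stat_user_tables`.

```python
import shutil


def disk_usage_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_disk(percent_used):
    """Map a usage percentage to the warning and critical levels from the table."""
    if percent_used > 90:
        return "critical"
    if percent_used > 80:
        return "warning"
    return "ok"


def dead_tuple_percent(live_tuples, dead_tuples):
    """Share of a table's tuples that are dead; 0.0 for an empty table."""
    total = live_tuples + dead_tuples
    return 100.0 * dead_tuples / total if total else 0.0
```

Point `disk_usage_percent` at the database's data directory rather than `/` if the data lives on its own volume.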
4. Replication & Availability
| Metric | What It Means | Alert Threshold |
|---|---|---|
| Replication lag | Seconds behind primary | >5 seconds |
| Replication status | Is replica connected? | Disconnected |
| Primary/replica roles | Unexpected role changes | Any change |
5. Resource Utilization
| Metric | What It Means | Alert Threshold |
|---|---|---|
| CPU usage | Database process CPU | Sustained >80% |
| Memory usage | Buffer pool / cache hit rate | <95% cache hit rate |
| Disk I/O | Read/write latency | >20ms latency |
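Cache hit rate is a ratio of two counters, so it is easy to derive yourself. This Python sketch assumes you have a count of block requests served from memory and a count read from disk (in PostgreSQL, `blks_hit` and `blks_read` in `pg_stat_database` are one possible source). The function names are illustrative.

```python
def cache_hit_rate(blocks_hit, blocks_read):
    """Percentage of block requests served from memory rather than disk."""
    total = blocks_hit + blocks_read
    if total == 0:
        return None  # no traffic yet, so the ratio is undefined
    return 100.0 * blocks_hit / total


def cache_hit_rate_low(blocks_hit, blocks_read, threshold=95.0):
    """True when the hit rate is defined and below the threshold."""
    rate = cache_hit_rate(blocks_hit, blocks_read)
    return rate is not None and rate < threshold
```

These counters are usually cumulative since the last stats reset, so comparing the difference between two snapshots gives a more current picture than the lifetime totals.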
Setting Up Database Monitoring
Option 1: Built-in Database Tools
Most databases have built-in monitoring capabilities:
-- PostgreSQL: Count connections that are currently running a query
SELECT count(*) FROM pg_stat_activity
WHERE state = 'active';
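-- PostgreSQL: Find the slowest queries by mean execution time
-- (requires the pg_stat_statements extension; the column is mean_exec_time in PostgreSQL 13+, mean_time in older versions)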
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;
-- MySQL: Check process list
SHOW PROCESSLIST;
-- MySQL: InnoDB status
SHOW ENGINE INNODB STATUS;
Run these periodically and log the results for trend analysis.
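One lightweight way to do this is to append timestamped snapshots to a file. The Python sketch below is deliberately database-agnostic: `collect` is a hypothetical callable that you would write to run queries like the ones above and return the results as a dict. The JSON Lines format and the function names are assumptions, not a prescribed setup.

```python
import json
import time
from datetime import datetime, timezone


def log_snapshot(collect, path):
    """Append one timestamped metrics snapshot to a JSON Lines file."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat()}
    record.update(collect())  # collect() returns a dict of metric name -> value
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def run_periodically(collect, path, interval_seconds=60, iterations=None):
    """Log a snapshot every interval_seconds; iterations=None runs until interrupted."""
    count = 0
    while iterations is None or count < iterations:
        log_snapshot(collect, path)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```

A cron job or systemd timer that calls `log_snapshot` once per run works just as well as the loop and survives process restarts.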
Option 2: Database-Specific Exporters
For comprehensive monitoring, use Prometheus exporters that expose database metrics:
- PostgreSQL: postgres_exporter
- MySQL: mysqld_exporter
- Redis: redis_exporter
- MongoDB: mongodb_exporter
Option 3: APM/Database Monitoring Services
Managed services provide database monitoring out of the box:
- Managed database providers (RDS, Cloud SQL, etc.) include monitoring
- APM tools (Datadog, New Relic) have database integrations
- Specialized tools (SolarWinds Database Performance Analyzer, SolarWinds Database Performance Monitor, formerly VividCortex)
Common Database Monitoring Mistakes
Mistake 1: Only Monitoring Overall Health
"Database is up" isn't enough. A database can be "up" but struggling with connections, slow queries, or disk space. Monitor specific metrics, not just availability.
Mistake 2: Alerting Too Aggressively
A single slow query doesn't need a 2 AM wake-up call. Alert on sustained issues, not momentary spikes. Use "for" durations in your alerts (e.g., "connection pool >80% for 5 minutes").
Mistake 3: Not Tracking Trends
Yesterday's "normal" might be tomorrow's crisis. Track metrics over time to spot gradual changes (table growth, query slowdown, connection creep).
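A simple projection turns a trend into a deadline. This Python sketch fits a straight line to daily disk usage readings and estimates how many days remain before a limit is reached. It assumes roughly linear growth and one reading per day, and the function name is hypothetical.

```python
def days_until_full(daily_usage_percent, limit_percent=100.0):
    """Estimate days until usage reaches limit_percent, from one reading per day.

    Fits a least-squares line to get the growth rate, then projects forward
    from the most recent reading. Returns None when there are fewer than two
    readings or usage is flat or shrinking.
    """
    n = len(daily_usage_percent)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_percent) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_percent))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator  # percentage points per day
    if slope <= 0:
        return None
    return max(0.0, (limit_percent - daily_usage_percent[-1]) / slope)
```

For readings of 60, 62, 64, and 66 percent, the fitted growth is 2 points per day, so the estimate is 17 days until the disk is full.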
Mistake 4: Ignoring Connection Pooling
Connection pool exhaustion is one of the most common database issues. Monitor your pool (PgBouncer, ProxySQL, etc.) alongside the database itself.
Mistake 5: Not Correlating with App Metrics
Database issues often show up first in application metrics (response time, error rate). Correlate database metrics with app performance for faster debugging.
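One rough way to check whether a database metric moves together with an application metric is a correlation coefficient over the same time buckets. The Python sketch below computes Pearson correlation by hand. It assumes both series are sampled at the same timestamps, and a high value suggests a relationship without proving the database is the cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long series, from -1.0 to 1.0."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return covariance / (spread_x * spread_y)
```

For example, passing per-minute p99 query latency as `xs` and per-minute app response time as `ys` gives a quick read on whether a response-time spike tracks the database.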
Database Monitoring Checklist
Immediate (Set Up Today)
- ☐ Alert when disk usage >80%
- ☐ Alert when connection count >80% of max
- ☐ Enable slow query logging
- ☐ Track query latency (p99)
Short-Term (This Week)
- ☐ Set up replication lag monitoring
- ☐ Track cache hit rate
- ☐ Monitor connection pool utilization
- ☐ Create dashboard for key metrics
Ongoing
- ☐ Review slow query logs weekly
- ☐ Track table/index growth trends
- ☐ Monitor for connection leaks
- ☐ Periodically review alert thresholds
Alert Fatigue Prevention
Database monitoring can generate a lot of noise. Here's how to keep it manageable:
1. Use Smart Thresholds
Don't alert the moment a value crosses a threshold. Alert only when the breach is sustained:
# Bad: Alert immediately
connection_pool > 80%
# Good: Alert if sustained
connection_pool > 80% for 5 minutes
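Most alerting tools support this kind of duration natively. If you are rolling your own checks, the logic can be sketched in Python as below. The `SustainedThreshold` class is hypothetical, and it assumes the caller passes the sample time in seconds.

```python
class SustainedThreshold:
    """Fires only after a value has stayed above the threshold for a full window."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_started_at = None  # time of the first sample in the current breach

    def observe(self, value, now):
        """Record one sample taken at time `now` (seconds); return True if the alert should fire."""
        if value <= self.threshold:
            self.breach_started_at = None  # any recovery resets the clock
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now
        return now - self.breach_started_at >= self.for_seconds
```

With `SustainedThreshold(80, 300)`, a pool that spikes to 90% and recovers two minutes later never fires. Only five uninterrupted minutes above 80% do.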
2. Aggregate Related Alerts
If connection count is high, slow queries will likely follow. Group related alerts to reduce noise.
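A minimal form of grouping is to bundle alerts by the database they concern. This Python sketch assumes each alert is a `(database, message)` pair. The function name and the example database names in the note below are made up for illustration.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Combine alerts for the same database into one notification message.

    `alerts` is an iterable of (database, message) pairs. The result maps each
    database to a single string with its messages in arrival order.
    """
    grouped = defaultdict(list)
    for database, message in alerts:
        grouped[database].append(message)
    return {database: "; ".join(messages) for database, messages in grouped.items()}
```

Two alerts for a database named `orders-db` (high connection count, rising slow queries) would arrive as one notification instead of two.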
3. Use Severity Levels
- Critical: Database down, replication broken, disk >90%
- Warning: Approaching limits, performance degradation
- Info: Notable events (maintenance, backup completion)
4. Deduplicate Alerts
If the same alert fires every minute for an hour, you don't need 60 notifications. Deduplicate and send status updates instead.
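Deduplication can be as simple as remembering when each alert was last sent. The Python sketch below sends the first occurrence and then at most one reminder per interval. The `AlertDeduplicator` class and the one-hour default are assumptions for illustration.

```python
class AlertDeduplicator:
    """Sends the first occurrence of an alert, then at most one reminder per interval."""

    def __init__(self, reminder_seconds=3600):
        self.reminder_seconds = reminder_seconds
        self.last_sent = {}  # alert key -> time of the last notification

    def should_notify(self, key, now):
        """Return True if a notification for `key` should go out at time `now` (seconds)."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.reminder_seconds:
            return False  # same alert fired recently; suppress the repeat
        self.last_sent[key] = now
        return True

    def resolve(self, key):
        """Forget a cleared alert so its next occurrence notifies immediately."""
        self.last_sent.pop(key, None)
```

An alert that fires every minute for an hour produces one notification, followed by a reminder once the hour has passed.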
Summary
Effective database monitoring for small teams focuses on:
- Connections — Track pool utilization and wait times
- Query performance — Monitor latency and slow queries
- Storage — Alert before you run out of space
- Replication — Catch lag before users see stale data
- Resources — CPU, memory, and I/O matter
Start with the basics, add sophistication as you grow, and always optimize for signal over noise.