Service Level Objectives (SLOs) and Service Level Indicators (SLIs) sound like enterprise concepts. But they're actually just structured ways to answer a simple question: "How reliable is my service, and is that good enough?"
Here's how small teams can use SLOs without over-engineering.
The Basics: SLIs, SLOs, and SLAs
SLI (Service Level Indicator)
What you measure. A metric that indicates how well your service is performing.
- Availability (percentage of successful requests)
- Latency (percentage of requests under 200ms)
- Throughput (requests per second)
SLO (Service Level Objective)
Your target. The goal you set for your SLI.
- "99.9% of requests succeed" (availability SLO)
- "95% of requests complete in under 200ms" (latency SLO)
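As a sketch, both SLI types can be computed from raw request records. The sample data and the 200ms threshold below are illustrative, not prescriptive:

```python
# Sketch: computing availability and latency SLIs from raw request
# records, each a (http_status, latency_seconds) pair. Sample data only.
requests = [
    (200, 0.120), (200, 0.450), (500, 0.050),
    (301, 0.080), (200, 0.190), (200, 0.210),
]

total = len(requests)
# Availability: anything that isn't a 5xx counts as a success here.
successes = sum(1 for status, _ in requests if status < 500)
# Latency: share of requests completing under the 200ms threshold.
fast = sum(1 for _, latency in requests if latency < 0.200)

availability_sli = successes / total  # 5/6
latency_sli = fast / total            # 4/6
```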
SLA (Service Level Agreement)
Your promise. What you contractually commit to, usually with consequences.
- "If we drop below 99.5% availability, customers get 10% credit"
Why Bother with SLOs?
Without SLOs, reliability is subjective. "The site feels slow" or "we had some downtime" are vague.
With SLOs, you can answer:
- Are we reliable enough? — Compare actual vs target
- Should we focus on features or reliability? — Check your error budget
- Is this incident a big deal? — How much SLO did it burn?
- Are we getting better over time? — Track SLO trends
Choosing What to Measure (SLIs)
Start with One or Two SLIs
For most services, start with:
| SLI | Definition | Why It Matters |
|---|---|---|
| Availability | Successful requests / Total requests | Is the service working? |
| Latency | % requests under threshold (e.g., 200ms) | Is it fast enough? |
Define What Counts
- Successful request: HTTP 2xx or 3xx response
- Failed request: HTTP 5xx (server errors)
- Don't count: HTTP 4xx (client errors like 404)
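The counting rules above fit in a small helper. This is a sketch; the function name and return labels are ours:

```python
def classify(status: int) -> str:
    """Classify an HTTP status code for SLI counting:
    2xx/3xx succeed, 5xx fail, 4xx are excluded entirely."""
    if 200 <= status < 400:
        return "success"
    if status >= 500:
        return "failure"
    # 4xx: client errors (e.g. 404) don't count for or against the SLO.
    return "excluded"
```

Excluding 4xx matters: a crawler hammering dead URLs should not burn your error budget.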
Setting Realistic SLOs
Start with What You Have
Don't pick 99.99% because it sounds good. Look at your actual performance:
- Measure your current reliability for 2-4 weeks
- Set your initial SLO slightly below current performance
- Gradually tighten as you improve
SLO Benchmarks
| SLO | Downtime/Year | Appropriate For |
|---|---|---|
| 99% | 3.65 days | Internal tools, non-critical services |
| 99.5% | 1.83 days | Standard business applications |
| 99.9% | 8.77 hours | Customer-facing services |
| 99.95% | 4.38 hours | Important services |
| 99.99% | 52.6 minutes | Critical infrastructure |
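The downtime column follows from a one-line conversion (a sketch; it uses 365.25 days per year, matching the figures in the table):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

def downtime_per_year(slo_percent: float) -> float:
    """Minutes of downtime per year allowed by a given SLO percentage."""
    return MINUTES_PER_YEAR * (1 - slo_percent / 100)

downtime_per_year(99.99)  # ~52.6 minutes
downtime_per_year(99.9)   # ~526 minutes, i.e. ~8.77 hours
```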
Error Budgets: Making SLOs Useful
An error budget is how much "unreliability" you can afford while still meeting your SLO.
Example: 99.9% Availability SLO
Time window: 30 days
Total minutes: 43,200
Allowed downtime: 43,200 * 0.1% = 43.2 minutes
Current downtime this month: 20 minutes
Remaining error budget: 23.2 minutes
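The same arithmetic in a few lines of Python (the numbers mirror the example above):

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
window_minutes = 30 * 24 * 60           # 43,200 minutes
slo = 0.999
budget = window_minutes * (1 - slo)     # 43.2 minutes of allowed downtime

downtime_so_far = 20                    # minutes of downtime this month
remaining = budget - downtime_so_far    # 23.2 minutes left
```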
Using Error Budgets
- Budget healthy: Focus on features, take risks
- Budget low: Focus on reliability, slow down releases
- Budget exhausted: Freeze features, fix reliability issues
Implementing SLOs (Practical Steps)
Step 1: Choose Your SLI
Start with availability: percentage of successful requests.
Step 2: Set Your SLO
Based on current performance, set a target. Example: 99.5% availability over 30 days.
Step 3: Measure It
```python
# Calculate availability from metrics
successful_requests = requests_total - requests_5xx
availability = successful_requests / requests_total
```

```shell
# Or from logs: share of 2xx/3xx responses among all responses.
# Assumes a combined-format access log with the status code in field 9.
grep "HTTP/1.1" access.log | \
  awk '{print $9}' | \
  sort | uniq -c | \
  awk '{if ($2 ~ /^[23]/) good += $1; total += $1} END {print good/total}'
```
Step 4: Track Over Time
Display SLO performance on a dashboard. Show:
- Current SLO percentage (rolling 30 days)
- Remaining error budget (in minutes)
- Incidents that burned budget
Step 5: Alert on Budget Burn
Don't just alert when SLO is missed. Alert when budget is burning too fast:
- Alert if error rate > 2x normal for 10 minutes
- Alert if error budget will exhaust in 3 days at current rate
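The second rule can be sketched as below, assuming you track the remaining budget in minutes and the recent burn rate in minutes per day (the names and the 3-day threshold are ours):

```python
def days_until_exhaustion(budget_remaining_min: float,
                          burn_rate_min_per_day: float) -> float:
    """Days until the error budget runs out at the current burn rate."""
    if burn_rate_min_per_day <= 0:
        return float("inf")  # not burning: budget never exhausts
    return budget_remaining_min / burn_rate_min_per_day

def should_alert(budget_remaining_min: float,
                 burn_rate_min_per_day: float,
                 threshold_days: float = 3.0) -> bool:
    """Fire when the budget would exhaust within the threshold window."""
    return days_until_exhaustion(
        budget_remaining_min, burn_rate_min_per_day) <= threshold_days
```

With 23.2 minutes of budget left, burning 10 minutes a day trips the alert; burning 2 minutes a day does not.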
Common SLO Mistakes
Mistake 1: Too Many SLOs
Problem: Tracking 10 different SLOs. None are meaningful.
Fix: Start with 1-2 SLOs. Add more only when you have a specific need.
Mistake 2: Unrealistic Targets
Problem: Setting 99.99% when you're at 99%.
Fix: Set achievable targets. Tighten gradually.
Mistake 3: Counting Everything
Problem: Including health checks, monitoring probes, and bot traffic in SLO calculation.
Fix: Count real user traffic only. Filter out synthetic requests.
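One way to apply that fix is a filter at SLI-calculation time. A sketch; the paths and user-agent markers below are hypothetical and should match your own probes:

```python
# Hypothetical markers -- replace with your real health-check paths
# and the user-agent strings your monitoring probes actually send.
SYNTHETIC_PATHS = {"/healthz", "/ping"}
BOT_MARKERS = ("bot", "monitor", "probe")

def is_real_user(path: str, user_agent: str) -> bool:
    """Keep only real user traffic in the SLO calculation."""
    if path in SYNTHETIC_PATHS:
        return False
    ua = user_agent.lower()
    return not any(marker in ua for marker in BOT_MARKERS)
```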
Mistake 4: No Action When Budget Burns
Problem: Tracking SLOs but not changing behavior when budget is low.
Fix: Define what happens at different budget levels. Actually do it.
SLO Checklist for Small Teams
Getting Started
- ☐ Choose 1-2 SLIs (availability, latency)
- ☐ Define how you count (what's a successful request?)
- ☐ Measure current performance for 2-4 weeks
- ☐ Set initial SLO based on reality
- ☐ Calculate error budget
Tracking
- ☐ Dashboard showing SLO performance
- ☐ Error budget remaining
- ☐ Alert when burning budget too fast
Action
- ☐ Define what to do when budget is low
- ☐ Review SLOs quarterly
- ☐ Adjust targets as you improve
Measure Your SLOs with External Monitoring
OpsPulse provides external uptime monitoring to track availability SLOs. Know your actual uptime from the user's perspective.
Start Free Monitoring →
Summary
Setting SLOs for small teams:
- Start simple: 1-2 SLIs (availability, latency)
- Set realistic targets: Based on current performance
- Use error budgets: Guide feature vs reliability tradeoffs
- Alert on budget burn: Not just SLO misses
- Take action: SLOs are useless if you don't change behavior
The goal isn't perfect reliability. The goal is intentional reliability — knowing how reliable you are and deciding if that's good enough.