Being on-call used to mean carrying a pager and hoping it never went off. Today it means your phone buzzes at 3 AM for "critical" alerts about things that aren't actually critical. Here's how to set up on-call rotations that actually work for small teams.
The Problem with On-Call
On-call done wrong destroys teams:
- Burnout — Waking up repeatedly destroys sleep quality
- Resentment — When one person always seems to get the hard shifts
- Alert fatigue — Too many false alarms make people ignore real ones
- Turnover — Engineers leave jobs with bad on-call cultures
Principles for Healthy On-Call
1. Fairness Over Efficiency
A rotation that's slightly suboptimal but perceived as fair is better than an "optimal" one that breeds resentment.
2. Protect Sleep
Minimize overnight pages. Every unnecessary wake-up is a loan against tomorrow's performance.
3. Clear Escalation
Everyone should know when to escalate and to whom. No one should feel alone with a crisis.
4. Recovery Time
After an on-call shift with incidents, give people time to recover. Don't schedule them for major work the next day.
Rotation Structures for Small Teams
Team of 2-3
Simple weekly rotation (with only two people, just alternate weeks):
- Week 1: Person A is primary, Person B is backup
- Week 2: Person B is primary, Person C is backup
- Week 3: Person C is primary, Person A is backup
- Repeat
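The rotation above can be generated rather than maintained by hand. A minimal sketch (the function name and tuple shape are illustrative, not from any scheduling tool):

```python
from datetime import date, timedelta

def weekly_rotation(members, start, weeks):
    """Generate (week_start, primary, backup) tuples for a simple
    weekly rotation. Each week the roster advances by one slot, so
    this week's backup becomes next week's primary — the same
    Person A/B/C pattern as above."""
    n = len(members)
    schedule = []
    for w in range(weeks):
        primary = members[w % n]
        backup = members[(w + 1) % n]
        schedule.append((start + timedelta(weeks=w), primary, backup))
    return schedule
```

Feeding the full roster into one function also makes fairness auditable: over any `n` consecutive weeks, everyone is primary exactly once.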
Team of 4-6
Weekly primary with rotating backup:
- Primary: Handles all incidents, gets woken up
- Backup: Takes over if primary doesn't respond in 15 minutes
- Everyone else: Protected sleep
Team of 7+
You can afford more sophisticated rotations:
- Follow-the-sun: Different time zones handle their daylight hours
- Specialist rotations: Backend, frontend, infrastructure on separate schedules
- Shorter shifts: 2-3 day rotations instead of weekly
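A follow-the-sun handoff is just a lookup from the current UTC hour to a region. A sketch, assuming three hypothetical regions with evenly split eight-hour windows (real teams would tune these boundaries to actual working hours):

```python
from datetime import datetime

# Hypothetical regions and the UTC hours they cover
# (start inclusive, end exclusive). Together they span 0-24.
REGIONS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]

def on_call_region(now_utc):
    """Return the region whose daytime window contains the current UTC hour."""
    for name, start, end in REGIONS:
        if start <= now_utc.hour < end:
            return name
    return REGIONS[0][0]  # unreachable while REGIONS covers all 24 hours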
What Warrants a Wake-Up
Not every alert should wake someone up. Use severity levels:
| Severity | Examples | Response |
|---|---|---|
| P1 - Critical | Service down, data loss, security breach | Immediate page, wake up if needed |
| P2 - High | Significant degradation, partial outage | Page during work hours, escalate at night |
| P3 - Medium | Single component failing, elevated errors | Slack notification, next business day |
| P4 - Low | Minor issues, warnings | Email digest, ticket creation |
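The severity table translates directly into routing rules. A minimal sketch of that mapping (the channel names and dictionary layout are assumptions for illustration):

```python
# One routing rule per severity level from the table above.
ROUTING = {
    "P1": {"channel": "page", "wake_at_night": True},
    "P2": {"channel": "page", "wake_at_night": False},
    "P3": {"channel": "slack", "wake_at_night": False},
    "P4": {"channel": "email_digest", "wake_at_night": False},
}

def route_alert(severity, is_night):
    """Decide how to deliver an alert. Only P1 wakes someone up;
    P2 pages during work hours but is held until morning at night."""
    rule = ROUTING[severity]
    if is_night and rule["channel"] == "page" and not rule["wake_at_night"]:
        return "queue_until_morning"
    return rule["channel"]
```

Encoding the table as data rather than scattered `if` statements keeps the policy reviewable in one place, which matters when you later argue about what deserves a 3 AM page.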
Setting Up Escalation Policies
Basic Escalation (Small Teams)
Alert fires
→ Page primary
→ Wait 5 minutes
→ If no ack, page backup
→ Wait 10 minutes
→ If no ack, page everyone
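The basic chain above is easy to express as a loop over tiers. A sketch, with the paging and acknowledgment functions injected so the policy can be exercised without a real pager (`page` and `ack_received` are hypothetical callables, not a vendor API):

```python
def escalate(alert, page, ack_received,
             tiers=(("primary", 300), ("backup", 600))):
    """Walk the escalation chain: page each tier, wait up to that
    tier's timeout (in seconds) for an acknowledgment, then move on.
    If no tier acks, page everyone."""
    for target, timeout_s in tiers:
        page(target, alert)
        if ack_received(target, timeout_s):
            return target
    page("everyone", alert)
    return "everyone"
```

Because the timeouts live in the `tiers` tuple, switching from the basic 5/10-minute policy to a longer chain is a data change, not a code change.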
Advanced Escalation (Larger Teams)
Alert fires (P1)
→ Page on-call engineer
→ Wait 5 minutes
→ If no ack, page team lead
→ Wait 10 minutes
→ If no ack, page engineering manager
→ Wait 15 minutes
→ If no ack, page VP/Director
Escalation Rules
- Time-box acknowledgment — 5-10 minutes is reasonable
- Include context — Don't just say "alert fired"
- Auto-escalate — Don't make people manually escalate
- Document expectations — What does "acknowledgment" mean?
Compensation and Recognition
Financial Compensation
- On-call stipend — Flat payment for carrying the phone
- Per-incident bonus — Extra pay for each actual incident
- Time-and-a-half — For hours worked during incidents
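The three models above can also be combined. A sketch of the arithmetic (all rates are illustrative parameters, not recommended amounts):

```python
def on_call_pay(weekly_stipend, incidents, per_incident_bonus,
                incident_hours, hourly_rate):
    """Combine the three compensation models: a flat stipend for
    carrying the phone, a bonus per incident handled, and
    time-and-a-half for hours actually worked during incidents."""
    return (weekly_stipend
            + incidents * per_incident_bonus
            + incident_hours * hourly_rate * 1.5)
```

For example, a $200 stipend plus two incidents at $50 each plus four incident hours at a $60 base rate pays $200 + $100 + $360 = $660 for the week.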
Non-Financial Recognition
- Comp time — Day off after a rough on-call shift
- Public recognition — Acknowledge incident responses in team meetings
- Reduced workload — Lighter sprint load during on-call week
Common On-Call Anti-Patterns
Anti-Pattern 1: Everyone On-Call Always
"We're a small team, we all need to be available." This means no one ever truly rests. Rotate primary responsibility.
Anti-Pattern 2: Founder Always On-Call
Founders often volunteer to take all the pages. This is unsustainable and creates a single point of failure.
Anti-Pattern 3: No Backup
If the primary doesn't respond, the alert dies. Always have a backup (even if it's the founder).
Anti-Pattern 4: Pager by Popularity
"The senior engineer should handle this." Seniority doesn't mean always being on-call. Everyone should rotate.
Anti-Pattern 5: No Post-Incident Review
Waking up at 3 AM should result in improvements that prevent future wake-ups. Every incident is a learning opportunity.
On-Call Checklist
Before On-Call
- ☐ Review recent incidents and runbooks
- ☐ Test your alerting setup (phone on loud, etc.)
- ☐ Know your escalation path
- ☐ Clear your schedule for quick response
During On-Call
- ☐ Acknowledge alerts quickly
- ☐ Communicate status to the team
- ☐ Escalate when stuck (don't hero)
- ☐ Document everything for post-mortem
After On-Call
- ☐ Write up any incidents
- ☐ Update runbooks with learnings
- ☐ Request comp time if needed
- ☐ Hand off cleanly to next person
Tools for On-Call Management
- PagerDuty — Industry standard, expensive
- Opsgenie — Good for Atlassian shops
- VictorOps / Splunk On-Call — Solid alternative
- OpsPulse — Uptime monitoring with Telegram alerts (for simpler needs)
Start with Better Alerting
Before you set up complex on-call rotations, make sure your alerts are worth responding to. OpsPulse helps reduce false positives with smart thresholds.
Summary
Healthy on-call rotations for small teams:
- Rotate fairly — Everyone shares the load
- Protect sleep — Only wake up for true P1s
- Have backups — No one should be alone with a crisis
- Compensate — On-call is work, pay for it
- Learn from incidents — Every wake-up should prevent future ones
On-call is a responsibility, not a punishment. Treat it that way and your team will too.