When your monitoring alerts at 3 AM, you don't want to figure out what to do. You want a playbook. Here's a practical incident response process that works for small teams.
## The Incident Response Lifecycle
Every incident goes through these phases:
- Detection — Alert fires, someone gets paged
- Triage — Assess severity, assign roles
- Investigation — Find the root cause
- Mitigation — Stop the bleeding
- Resolution — Fix the underlying issue
- Post-mortem — Learn and improve
## Phase 1: Detection

### When an Alert Fires
- Verify the alert is real (not a false positive)
- Check if it's a known issue (maintenance, expected)
- Assess immediate impact (how many users affected?)
- Decide: incident or false alarm?
### Severity Levels
- SEV-1 (Critical): Complete outage, data loss risk, security breach
- SEV-2 (Major): Significant degradation, partial outage
- SEV-3 (Minor): Limited impact, workaround available
- SEV-4 (Low): Cosmetic issues, non-critical bugs
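Severity calls are easier at 3 AM when the thresholds are written down. A minimal sketch that maps two rough inputs (outage scope, affected-user count) to the levels above; the function name and thresholds are illustrative, not a standard, and data-loss or security incidents are always SEV-1 regardless of scope:

```shell
# Hypothetical severity helper. Inputs: outage scope ("full", "partial",
# "none") and a rough affected-user count. Thresholds are illustrative;
# data-loss risk or a security breach is SEV-1 no matter what this prints.
assign_severity() {
  local outage="$1" users="$2"
  if [ "$outage" = "full" ]; then
    echo "SEV-1"
  elif [ "$outage" = "partial" ] || [ "$users" -gt 1000 ]; then
    echo "SEV-2"
  elif [ "$users" -gt 0 ]; then
    echo "SEV-3"
  else
    echo "SEV-4"
  fi
}

assign_severity partial 50   # prints SEV-2
```

Writing it down once, even this crudely, removes the debate from the moment when nobody is thinking clearly.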
## Phase 2: Triage

### Assign Roles Immediately
- Incident Commander (IC): Coordinates response, makes decisions
- Communications Lead: Updates stakeholders, users
- Technical Lead: Investigates, implements fixes
### Triage Questions
- What's broken? (Be specific)
- When did it start?
- What changed recently? (Deployments, config, traffic)
- How many users are affected?
- Is there a workaround?
- Do we need more help?
## Phase 3: Investigation

### Start with the Obvious
- Recent changes: Check deploy logs, config changes
- Dependencies: Is a third-party service down?
- Infrastructure: Check server health, network status
- Patterns: Time-based? Traffic-related?
### Investigation Checklist

```shell
# Check application logs
kubectl logs -f deployment/app --tail=100

# Check error rates
curl -s "https://api.example.com/metrics" | grep errors

# Check recent deployments
git log --oneline -10

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check external dependencies
curl -I https://api.stripe.com/health
```
## Phase 4: Mitigation
Mitigation is about stopping the bleeding, not fixing the root cause. Common mitigation strategies:
### Quick Mitigation Options
- Rollback: Revert to previous version (fastest for deploy issues)
- Feature flag off: Disable problematic feature
- Scale up: Add more resources if overloaded
- Failover: Switch to backup system/region
- Rate limit: Reduce load on struggling service
- Circuit break: Stop calling failing dependency
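Under pressure it helps to have the exact commands pre-written. A dry-run sketch that prints the command for a given strategy instead of executing it; the deployment name `app` and the flag-service URL are placeholders for your own setup:

```shell
# Print (don't run) the mitigation command for a strategy.
# "deployment/app" and the flag-service URL are placeholders.
mitigation_cmd() {
  case "$1" in
    rollback) echo "kubectl rollout undo deployment/app" ;;
    scale-up) echo "kubectl scale deployment/app --replicas=10" ;;
    flag-off) echo "curl -X POST https://flags.internal/api/flags/<flag>/disable" ;;
    *)        echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

mitigation_cmd rollback   # prints kubectl rollout undo deployment/app
```

Printing first and pasting deliberately is slower by seconds but avoids the classic 3 AM mistake of running the right command against the wrong cluster.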
### Mitigation Decision Tree

```
Is it a recent deployment?
→ YES: Rollback
→ NO: Is it resource exhaustion?
    → YES: Scale up or shed load
    → NO: Is it a dependency?
        → YES: Failover or circuit break
        → NO: Investigate further or feature flag off
```
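The tree above is simple enough to encode directly, which makes the decision mechanical when it matters. A sketch that takes y/n answers to the three questions in order:

```shell
# The decision tree as a function. Arguments are "y" or "n" answers to,
# in order: recent deploy? resource exhaustion? failing dependency?
suggest_mitigation() {
  if [ "$1" = "y" ]; then echo "rollback"
  elif [ "$2" = "y" ]; then echo "scale up or shed load"
  elif [ "$3" = "y" ]; then echo "failover or circuit break"
  else echo "investigate further or feature flag off"
  fi
}

suggest_mitigation n y n   # prints "scale up or shed load"
```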
## Phase 5: Resolution
After mitigation, fix the underlying issue:
- Identify root cause — Why did this happen?
- Develop fix — Code change, config update, etc.
- Test fix — In staging first if possible
- Deploy fix — With monitoring
- Verify resolution — Confirm issue is resolved
- Clean up — Remove temporary mitigations
## Phase 6: Post-Mortem
Every SEV-1 and SEV-2 deserves a post-mortem. See our Blameless Post-Mortem Guide for details.
### Quick Post-Mortem Template

```markdown
# Incident Summary
- Date/Time:
- Duration:
- Severity:
- Impact:

# Timeline
- [Time] Alert fired
- [Time] Investigation started
- [Time] Root cause identified
- [Time] Mitigation applied
- [Time] Issue resolved

# Root Cause
What happened and why?

# Action Items
- [ ] Prevent recurrence
- [ ] Improve detection
- [ ] Reduce impact
- [ ] Speed up response
```
## Communication During Incidents

### Internal Communication
- Incident channel: Create dedicated Slack channel (#incident-YYYY-MM-DD)
- Regular updates: Every 15-30 minutes during active incident
- Status format: "Investigating" → "Identified" → "Mitigating" → "Resolved"
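The channel name above can come straight from `date`, so nobody fumbles the format mid-incident:

```shell
# Build the incident channel name in the #incident-YYYY-MM-DD format
# (date +%F is shorthand for +%Y-%m-%d).
channel="#incident-$(date +%F)"
echo "$channel"
```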
### External Communication
- Status page: Update immediately when incident confirmed
- Initial message: "We're investigating reports of [issue]"
- Updates: Every 30-60 minutes during active incident
- Resolution: Detailed explanation after fix
### Communication Template

```markdown
## Status: [Investigating/Identified/Monitoring/Resolved]

**Impact:** [Brief description of user impact]
**Started:** [Timestamp]
**Current Status:** [What we're doing]
**Next Update:** [When to expect next update]
```
## Common Incident Types

### Deployment Gone Wrong
- Response: Rollback immediately
- Investigation: What changed in the deploy?
- Prevention: Canary deployments, feature flags
### Database Issues
- Response: Check connections, slow queries, locks
- Mitigation: Kill long queries, restart if necessary
- Prevention: Query limits, connection pooling, monitoring
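For the "kill long queries" step, it helps to have the SQL ready. A sketch that builds the query against Postgres's `pg_stat_activity` view; the 5-minute threshold and the helper name are assumptions to adapt:

```shell
# Build SQL listing queries that have been active longer than the given
# interval. Threshold is passed in, e.g. '5 minutes'.
long_query_sql() {
  echo "SELECT pid, now() - query_start AS runtime, left(query, 80) AS query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '$1';"
}

# Against a live database:
#   psql -c "$(long_query_sql '5 minutes')"
#   psql -c "SELECT pg_cancel_backend(<pid>);"      # graceful cancel first
#   psql -c "SELECT pg_terminate_backend(<pid>);"   # last resort
```

Cancel before terminating: `pg_cancel_backend` stops the query but keeps the connection; `pg_terminate_backend` drops the whole session.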
### Third-Party Outage
- Response: Check provider status page
- Mitigation: Failover, circuit breaker, cached data
- Prevention: Redundancy, fallbacks, vendor diversification
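A quick probe with a hard timeout keeps a hung dependency from hanging your own check. A sketch; the URL is a placeholder for the provider endpoint you care about:

```shell
# Probe a dependency: -f fails on HTTP error codes, -m 5 caps the total
# wait at 5 seconds so a hung endpoint can't stall the check.
probe() {
  if curl -fsS -m 5 -o /dev/null "$1"; then echo "up"; else echo "down"; fi
}

probe "https://status.example.com/health"   # placeholder URL
```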
### Traffic Spike
- Response: Check if legitimate or attack
- Mitigation: Scale up, rate limit, shed non-critical load
- Prevention: Autoscaling, load testing, CDN
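To tell a legitimate spike from abuse, start with the client-IP distribution: organic traffic spreads across many IPs, while a naive attack concentrates on a few. A sketch that reads access-log lines on stdin and assumes the client IP is the first field (true for default nginx and Apache combined-format logs):

```shell
# Count requests per client IP (field 1 of each log line) and print the
# top N offenders (default 10), highest count first.
top_ips() {
  awk '{print $1}' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: tail -n 10000 /var/log/nginx/access.log | top_ips 5
```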
## Incident Response Checklist

### Immediate (First 5 Minutes)
- Acknowledge alert
- Verify it's a real incident
- Assign severity level
- Create incident channel (if needed)
- Start investigation
### Short-Term (5-30 Minutes)
- Identify scope of impact
- Update status page
- Find recent changes
- Apply mitigation if root cause found
- Notify stakeholders
### Medium-Term (30-120 Minutes)
- Continue investigation if not resolved
- Apply fix
- Verify resolution
- Update status page
- Notify stakeholders of resolution
### Post-Incident (Within 24-48 Hours)
- Write post-mortem
- Review with team
- Create action items
- Update runbooks
- Archive incident artifacts
## Get Alerts That Actually Matter
OpsPulse provides external uptime monitoring with smart thresholds and alert deduplication. Know when your service is down, not when your checks are flaky.
Start Free Monitoring →

## Summary
Effective incident response follows a clear pattern:
- Detect — Get alerted quickly
- Triage — Assess severity, assign roles
- Investigate — Find root cause systematically
- Mitigate — Stop the bleeding first
- Resolve — Fix the underlying issue
- Learn — Post-mortem and improve
The key is having a process you can follow under pressure, when your brain isn't working at full capacity.