Incident Response Playbook: From Alert to Resolution

A practical guide for small teams to handle incidents systematically

Published: March 20, 2026 • Reading time: 12 minutes

When your monitoring alerts at 3 AM, you don't want to figure out what to do. You want a playbook. Here's a practical incident response process that works for small teams.

The Incident Response Lifecycle

Every incident goes through these phases:

  1. Detection — Alert fires, someone gets paged
  2. Triage — Assess severity, assign roles
  3. Investigation — Find the root cause
  4. Mitigation — Stop the bleeding
  5. Resolution — Fix the underlying issue
  6. Post-mortem — Learn and improve

Phase 1: Detection

When an Alert Fires

First rule: If in doubt, declare an incident. It's better to stand down a false alarm than to ignore a real issue.

Severity Levels

Phase 2: Triage

Assign Roles Immediately

For solo/very small teams: You wear all hats. Focus on: (1) stop the bleeding, (2) communicate status, (3) fix root cause. In that order.

Triage Questions

Phase 3: Investigation

Start with the Obvious

Investigation Checklist

# Check application logs
kubectl logs -f deployment/app --tail=100

# Check error rates
curl -s "https://api.example.com/metrics" | grep errors

# Check recent deployments
git log --oneline -10

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check external dependencies
curl -I https://api.stripe.com/health

Avoid tunnel vision: If you've been investigating for 15 minutes without progress, step back. Get fresh eyes. Describe the problem out loud.

Phase 4: Mitigation

Mitigation is about stopping the bleeding, not fixing the root cause. Common mitigation strategies:

Quick Mitigation Options

  1. Rollback: revert the most recent deployment
  2. Scale up or shed load: add capacity, rate-limit, or drop non-critical traffic
  3. Failover or circuit break: route around a failing dependency
  4. Feature flag off: disable the affected code path

Document everything: Every action, every result. You'll need this for the post-mortem.

Mitigation Decision Tree

Is it a recent deployment?
  → YES: Rollback
  → NO: Is it resource exhaustion?
         → YES: Scale up or shed load
         → NO: Is it a dependency?
                → YES: Failover or circuit break
                → NO: Investigate further or feature flag off
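
The tree above can be expressed as a small shell helper you can keep in a runbook; the function name and yes/no argument convention are illustrative, not part of the original playbook:

```shell
#!/bin/sh
# Pick a mitigation from triage answers, mirroring the decision tree:
# recent deployment -> rollback; resource exhaustion -> scale or shed
# load; failing dependency -> failover or circuit break; otherwise
# turn the feature flag off while investigating further.
pick_mitigation() {
  recent_deploy=$1; resource_exhaustion=$2; dependency=$3
  if [ "$recent_deploy" = yes ]; then
    echo rollback
  elif [ "$resource_exhaustion" = yes ]; then
    echo scale-or-shed-load
  elif [ "$dependency" = yes ]; then
    echo failover-or-circuit-break
  else
    echo feature-flag-off
  fi
}

pick_mitigation yes no no   # rollback
pick_mitigation no no yes   # failover-or-circuit-break
```

The ordering matters: a bad deploy is both the most common cause and the cheapest to undo, so it is checked first.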

Phase 5: Resolution

After mitigation, fix the underlying issue:

  1. Identify root cause — Why did this happen?
  2. Develop fix — Code change, config update, etc.
  3. Test fix — In staging first if possible
  4. Deploy fix — With monitoring
  5. Verify resolution — Confirm issue is resolved
  6. Clean up — Remove temporary mitigations

Don't rush the fix: A bad fix can make things worse. Test thoroughly, deploy carefully.
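
One way to make step 5 ("verify resolution") concrete is to require several consecutive successful health checks before declaring the incident over. A sketch; the check command, endpoint, and count are illustrative:

```shell
#!/bin/sh
# Run a check command repeatedly; reset the streak on any failure and
# stop only after `need` consecutive successes.
consecutive_ok() {
  need=$1; shift
  ok=0
  while [ "$ok" -lt "$need" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      ok=0
    fi
    sleep 1
  done
  echo "stable after ${need} consecutive successful checks"
}

# Example with a hypothetical health endpoint:
# consecutive_ok 5 curl -fsS https://api.example.com/health
```

In practice you would also cap total attempts so a check that never recovers times out instead of looping forever.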

Phase 6: Post-Mortem

Every SEV-1 and SEV-2 deserves a post-mortem. See our Blameless Post-Mortem Guide for details.

Quick Post-Mortem Template

# Incident Summary
- Date/Time:
- Duration:
- Severity:
- Impact:

# Timeline
- [Time] Alert fired
- [Time] Investigation started
- [Time] Root cause identified
- [Time] Mitigation applied
- [Time] Issue resolved

# Root Cause
What happened and why?

# Action Items
- [ ] Prevent recurrence
- [ ] Improve detection
- [ ] Reduce impact
- [ ] Speed up response
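
Filling in "Duration" is easy to get wrong at 3 AM; a one-liner with GNU date does the arithmetic from the timeline's first and last entries (timestamps here are illustrative):

```shell
#!/bin/sh
# Incident duration in minutes from start/end timestamps (GNU date).
start="2026-03-20T03:02:00Z"
end="2026-03-20T03:47:00Z"
minutes=$(( ($(date -d "$end" +%s) - $(date -d "$start" +%s)) / 60 ))
echo "Duration: ${minutes} minutes"   # Duration: 45 minutes
```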

Communication During Incidents

Internal Communication

External Communication

Communication Template

## Status: [Investigating/Identified/Monitoring/Resolved]

**Impact:** [Brief description of user impact]
**Started:** [Timestamp]
**Current Status:** [What we're doing]

**Next Update:** [When to expect next update]
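
The template above can be stamped out from the command line so updates stay consistent under pressure; the function name and sample values are illustrative:

```shell
#!/bin/sh
# Render a status update from the template fields; pipe the result to
# your status page or chat webhook.
format_status() {
  status=$1; impact=$2; started=$3; doing=$4; next=$5
  printf '## Status: %s\n\n' "$status"
  printf '**Impact:** %s\n**Started:** %s\n**Current Status:** %s\n\n' \
    "$impact" "$started" "$doing"
  printf '**Next Update:** %s\n' "$next"
}

format_status Investigating "Elevated API error rates" "03:02 UTC" \
  "Rolling back the latest deploy" "03:30 UTC"
```

Promising a "Next Update" time, and keeping that promise even when there is nothing new, is what stops stakeholders from interrupting the responders.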

Common Incident Types

Deployment Gone Wrong

Database Issues

Third-Party Outage

Traffic Spike

Incident Response Checklist

Immediate (First 5 Minutes)

Short-Term (5-30 Minutes)

Medium-Term (30-120 Minutes)

Post-Incident (Within 24-48 Hours)

Get Alerts That Actually Matter

OpsPulse provides external uptime monitoring with smart thresholds and alert deduplication. Know when your service is down, not when your checks are flaky.


Summary

Effective incident response follows a clear pattern:

  1. Detect — Get alerted quickly
  2. Triage — Assess severity, assign roles
  3. Investigate — Find root cause systematically
  4. Mitigate — Stop the bleeding first
  5. Resolve — Fix the underlying issue
  6. Learn — Post-mortem and improve

The key is having a process you can follow under pressure, when your brain isn't working at full capacity.

Related Resources