When your monitoring alerts at 3 AM, you don't want to figure out what to do. You want a playbook. Here's a practical incident response process that works for small teams.
## The Incident Response Lifecycle
Every incident goes through these phases:
- Detection — Alert fires, someone gets paged
- Triage — Assess severity, assign roles
- Investigation — Find the root cause
- Mitigation — Stop the bleeding
- Resolution — Fix the underlying issue
- Post-mortem — Learn and improve
## Phase 1: Detection

### When an Alert Fires
- Verify the alert is real (not a false positive)
- Check if it's a known issue (maintenance, expected)
- Assess immediate impact (how many users affected?)
- Decide: incident or false alarm?
### Severity Levels
- SEV-1 (Critical): Complete outage, data loss risk, security breach
- SEV-2 (Major): Significant degradation, partial outage
- SEV-3 (Minor): Limited impact, workaround available
- SEV-4 (Low): Cosmetic issues, non-critical bugs
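Severity calls are easier at 3 AM when the thresholds are written down. A minimal sketch that maps two rough inputs (outage scope, affected-user count) to the levels above; the function name and thresholds are illustrative, not a standard, and data-loss or security incidents are always SEV-1 regardless of scope:

```shell
# Hypothetical severity helper. Inputs: outage scope ("full", "partial",
# "none") and a rough affected-user count. Thresholds are illustrative;
# data-loss risk or a security breach is SEV-1 no matter what this prints.
assign_severity() {
  local outage="$1" users="$2"
  if [ "$outage" = "full" ]; then
    echo "SEV-1"
  elif [ "$outage" = "partial" ] || [ "$users" -gt 1000 ]; then
    echo "SEV-2"
  elif [ "$users" -gt 0 ]; then
    echo "SEV-3"
  else
    echo "SEV-4"
  fi
}

assign_severity partial 50   # prints SEV-2
```

Writing it down once, even this crudely, removes the debate from the moment when nobody is thinking clearly.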
## Phase 2: Triage

### Assign Roles Immediately
- Incident Commander (IC): Coordinates response, makes decisions
- Communications Lead: Updates stakeholders, users
- Technical Lead: Investigates, implements fixes
### Triage Questions
- What's broken? (Be specific)
- When did it start?
- What changed recently? (Deployments, config, traffic)
- How many users are affected?
- Is there a workaround?
- Do we need more help?
## Phase 3: Investigation

### Start with the Obvious
- Recent changes: Check deploy logs, config changes
- Dependencies: Is a third-party service down?
- Infrastructure: Check server health, network status
- Patterns: Time-based? Traffic-related?
### Investigation Checklist

```shell
# Check application logs
kubectl logs -f deployment/app --tail=100

# Check error rates
curl -s "https://api.example.com/metrics" | grep errors

# Check recent deployments
git log --oneline -10

# Check database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check external dependencies
curl -I https://api.stripe.com/health
```
## Phase 4: Mitigation
Mitigation is about stopping the bleeding, not fixing the root cause. Common mitigation strategies:
### Quick Mitigation Options
- Rollback: Revert to previous version (fastest for deploy issues)
- Feature flag off: Disable problematic feature
- Scale up: Add more resources if overloaded
- Failover: Switch to backup system/region
- Rate limit: Reduce load on struggling service
- Circuit break: Stop calling failing dependency
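Under pressure it helps to have the exact commands pre-written. A dry-run sketch that prints the command for a given strategy instead of executing it; the deployment name `app` and the flag-service URL are placeholders for your own setup:

```shell
# Print (don't run) the mitigation command for a strategy.
# "deployment/app" and the flag-service URL are placeholders.
mitigation_cmd() {
  case "$1" in
    rollback) echo "kubectl rollout undo deployment/app" ;;
    scale-up) echo "kubectl scale deployment/app --replicas=10" ;;
    flag-off) echo "curl -X POST https://flags.internal/api/flags/<flag>/disable" ;;
    *)        echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

mitigation_cmd rollback   # prints kubectl rollout undo deployment/app
```

Printing first and pasting deliberately is slower by seconds but avoids the classic 3 AM mistake of running the right command against the wrong cluster.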
### Mitigation Decision Tree

```
Is it a recent deployment?
→ YES: Rollback
→ NO: Is it resource exhaustion?
    → YES: Scale up or shed load
    → NO: Is it a dependency?
        → YES: Failover or circuit break
        → NO: Investigate further or feature flag off
```
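The tree above is simple enough to encode directly, which makes the decision mechanical when it matters. A sketch that takes y/n answers to the three questions in order:

```shell
# The decision tree as a function. Arguments are "y" or "n" answers to,
# in order: recent deploy? resource exhaustion? failing dependency?
suggest_mitigation() {
  if [ "$1" = "y" ]; then echo "rollback"
  elif [ "$2" = "y" ]; then echo "scale up or shed load"
  elif [ "$3" = "y" ]; then echo "failover or circuit break"
  else echo "investigate further or feature flag off"
  fi
}

suggest_mitigation n y n   # prints "scale up or shed load"
```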
## Phase 5: Resolution
After mitigation, fix the underlying issue:
- Identify root cause — Why did this happen?
- Develop fix — Code change, config update, etc.
- Test fix — In staging first if possible
- Deploy fix — With monitoring
- Verify resolution — Confirm issue is resolved
- Clean up — Remove temporary mitigations
## Phase 6: Post-Mortem
Every SEV-1 and SEV-2 deserves a post-mortem. See our Blameless Post-Mortem Guide for details.
### Quick Post-Mortem Template

```markdown
# Incident Summary
- Date/Time:
- Duration:
- Severity:
- Impact:

# Timeline
- [Time] Alert fired
- [Time] Investigation started
- [Time] Root cause identified
- [Time] Mitigation applied
- [Time] Issue resolved

# Root Cause
What happened and why?

# Action Items
- [ ] Prevent recurrence
- [ ] Improve detection
- [ ] Reduce impact
- [ ] Speed up response
```
## Communication During Incidents

### Internal Communication
- Incident channel: Create dedicated Slack channel (#incident-YYYY-MM-DD)
- Regular updates: Every 15-30 minutes during active incident
- Status format: "Investigating" → "Identified" → "Mitigating" → "Resolved"
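The channel name above can come straight from `date`, so nobody fumbles the format mid-incident:

```shell
# Build the incident channel name in the #incident-YYYY-MM-DD format
# (date +%F is shorthand for +%Y-%m-%d).
channel="#incident-$(date +%F)"
echo "$channel"
```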
### External Communication
- Status page: Update immediately when incident confirmed
- Initial message: "We're investigating reports of [issue]"
- Updates: Every 30-60 minutes during active incident
- Resolution: Detailed explanation after fix
### Communication Template

```markdown
## Status: [Investigating/Identified/Monitoring/Resolved]

**Impact:** [Brief description of user impact]
**Started:** [Timestamp]
**Current Status:** [What we're doing]
**Next Update:** [When to expect next update]
```
## Common Incident Types

### Deployment Gone Wrong
- Response: Rollback immediately
- Investigation: What changed in the deploy?
- Prevention: Canary deployments, feature flags
### Database Issues
- Response: Check connections, slow queries, locks
- Mitigation: Kill long queries, restart if necessary
- Prevention: Query limits, connection pooling, monitoring
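For the "kill long queries" step, it helps to have the SQL ready. A sketch that builds the query against Postgres's `pg_stat_activity` view; the 5-minute threshold and the helper name are assumptions to adapt:

```shell
# Build SQL listing queries that have been active longer than the given
# interval. Threshold is passed in, e.g. '5 minutes'.
long_query_sql() {
  echo "SELECT pid, now() - query_start AS runtime, left(query, 80) AS query FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '$1';"
}

# Against a live database:
#   psql -c "$(long_query_sql '5 minutes')"
#   psql -c "SELECT pg_cancel_backend(<pid>);"      # graceful cancel first
#   psql -c "SELECT pg_terminate_backend(<pid>);"   # last resort
```

Cancel before terminating: `pg_cancel_backend` stops the query but keeps the connection; `pg_terminate_backend` drops the whole session.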
### Third-Party Outage
- Response: Check provider status page
- Mitigation: Failover, circuit breaker, cached data
- Prevention: Redundancy, fallbacks, vendor diversification
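A quick probe with a hard timeout keeps a hung dependency from hanging your own check. A sketch; the URL is a placeholder for the provider endpoint you care about:

```shell
# Probe a dependency: -f fails on HTTP error codes, -m 5 caps the total
# wait at 5 seconds so a hung endpoint can't stall the check.
probe() {
  if curl -fsS -m 5 -o /dev/null "$1"; then echo "up"; else echo "down"; fi
}

probe "https://status.example.com/health"   # placeholder URL
```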
### Traffic Spike
- Response: Check if legitimate or attack
- Mitigation: Scale up, rate limit, shed non-critical load
- Prevention: Autoscaling, load testing, CDN
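To tell a legitimate spike from abuse, start with the client-IP distribution: organic traffic spreads across many IPs, while a naive attack concentrates on a few. A sketch that reads access-log lines on stdin and assumes the client IP is the first field (true for default nginx and Apache combined-format logs):

```shell
# Count requests per client IP (field 1 of each log line) and print the
# top N offenders (default 10), highest count first.
top_ips() {
  awk '{print $1}' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage: tail -n 10000 /var/log/nginx/access.log | top_ips 5
```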
## Incident Response Checklist

### Immediate (First 5 Minutes)
- Acknowledge alert
- Verify it's a real incident
- Assign severity level
- Create incident channel (if needed)
- Start investigation
### Short-Term (5-30 Minutes)
- Identify scope of impact
- Update status page
- Find recent changes
- Apply mitigation if root cause found
- Notify stakeholders
### Medium-Term (30-120 Minutes)
- Continue investigation if not resolved
- Apply fix
- Verify resolution
- Update status page
- Notify stakeholders of resolution
### Post-Incident (Within 24-48 Hours)
- Write post-mortem
- Review with team
- Create action items
- Update runbooks
- Archive incident artifacts
## Get Alerts That Actually Matter
OpsPulse provides external uptime monitoring with smart thresholds and alert deduplication. Know when your service is down, not when your checks are flaky.
Start Free Monitoring →

## Summary
Effective incident response follows a clear pattern:
- Detect — Get alerted quickly
- Triage — Assess severity, assign roles
- Investigate — Find root cause systematically
- Mitigate — Stop the bleeding first
- Resolve — Fix the underlying issue
- Learn — Post-mortem and improve
The key is having a process you can follow under pressure, when your brain isn't working at full capacity.