Remote Work Tools

How to Set Up a Remote Team Incident Response Process in 2026

Production incidents don’t wait for business hours. Distributed teams need defined processes for alert routing, on-call escalation, runbook execution, and post-incident reviews. Here’s what works without chaos.

Why Distributed Teams Need Structure

Centralized office: someone notices the alert, walks over, and taps a shoulder. Informal coordination mostly works.

Distributed team without process: alerts land in a Slack channel nobody watches overnight; an incident can run for hours before anyone in the right timezone notices.

With structure: the right engineer is paged within minutes, follows a runbook, and escalates on a defined schedule regardless of who is awake where.

1. Alert Routing: PagerDuty vs OpsGenie

PagerDuty (Better for Large Teams)

Setup flow:

1. Your monitoring (Datadog, New Relic, Prometheus) fires an alert
2. Alert webhook hits PagerDuty API
3. PagerDuty routes to on-call engineer
4. If no response in 5 minutes, escalates to backup
5. If still no response, escalates to team lead
6. Engineer gets email + SMS + phone call
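Step 2 of the flow above can be sketched with PagerDuty's Events API v2. This is a minimal example, not production code; the routing key is a placeholder for your service's integration key, and the helper names are my own.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build a PagerDuty Events API v2 'trigger' payload."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,       # shows up in the SMS / phone call
            "source": source,         # e.g. the monitoring system that fired
            "severity": severity,     # critical | error | warning | info
        },
    }

def send_event(event):
    """POST the event to PagerDuty; a 202 response means it was queued."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

event = build_trigger_event(
    "YOUR_ROUTING_KEY",
    "P1: DB query timeout, 5000 failures/min",
    "prometheus",
)
# send_event(event)  # uncomment once a real routing key is in place
```

In practice your monitoring tool's built-in PagerDuty integration sends this payload for you; the sketch shows what crosses the wire.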

Configuration example:

Escalation Policy: Engineering On-Call

Level 1 (0 minutes):
  - On-call engineer primary
  - Trigger: High severity (P1, P2)
  - Notify: SMS (1 min), then phone call (2 min)

Level 2 (5 minutes if engineer doesn't acknowledge):
  - On-call backup engineer
  - Notify: Phone call immediately

Level 3 (10 minutes if backup doesn't acknowledge):
  - Engineering manager
  - Notify: Phone call + SMS

Rotation: Primary on-call for 1 week
          Backup on-call for 1 week
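The timing logic of that policy fits in a few lines. A minimal sketch (function name is mine, thresholds are the ones defined above):

```python
def escalation_level(minutes_since_alert, acknowledged=False):
    """Active escalation target under the policy above:
    0 min -> primary, 5 min -> backup, 10 min -> engineering manager."""
    if acknowledged:
        return "handled"
    if minutes_since_alert >= 10:
        return "engineering manager"
    if minutes_since_alert >= 5:
        return "backup engineer"
    return "primary engineer"
```

An acknowledgment at any point stops the clock, which is why the 45-second acknowledgment in the example below never woke the backup.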

Real example:

2:30 AM UTC — Database query timing out
  ↓ (Prometheus alert fires)
2:30:01 — PagerDuty SMS to Primary: "P1: DB query timeout, 5000 failures/min"
2:30:45 — Engineer wakes, reads SMS, acknowledges alert
2:31:00 — Engineer has runbook open, running investigation
2:35:00 — Root cause found (index missing), fix deployed
2:37:00 — Incident resolved, post-mortem scheduled

Total time: 7 minutes from alert to fix

The same incident without PagerDuty (alerts in Slack only, nobody paged):

2:30 AM — Alert fires on Slack
2:32 AM — Alert fires on Slack again (10 failures, 50% traffic affected)
6:00 AM — EU person notices alerts, wakes primary engineer
6:05 AM — Engineer starts investigation
6:45 AM — Fix deployed
7:00 AM — Incident resolved, 4.5 hours customer impact

PagerDuty Pricing

PagerDuty offers a free tier, a Professional tier ($9/user/month), and an Enterprise tier ($29/user/month).

For most teams: the Professional tier is sufficient.

OpsGenie (Better for Small/Cost-Sensitive Teams)

Setup is similar to PagerDuty, with a slightly different UI and lower pricing.

The difference in practice:

PagerDuty: the enterprise standard; better for large operations teams
OpsGenie: simpler and cheaper; better for smaller teams

Established operations tend to use PagerDuty; startups often pick OpsGenie.

This guide focuses on PagerDuty but concepts apply to OpsGenie equally.


2. On-Call Rotation Schedule

Simple Weekly Rotation

For a team of 7 engineers:

Mon-Sun Week 1: Alice (primary), Bob (backup)
Mon-Sun Week 2: Charlie (primary), Dave (backup)
Mon-Sun Week 3: Emma (primary), Frank (backup)
Mon-Sun Week 4: Grace (primary), Alice (backup)

Continue the pattern; with 7 engineers the schedule repeats every 7 weeks, cycling everyone fairly through both primary and backup.
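Stepping through the roster two names at a time reproduces the schedule above. A minimal sketch (function name and start date are illustrative):

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, primary, backup) tuples, stepping through the
    roster two at a time so everyone serves both roles over a full cycle."""
    n = len(engineers)
    for w in range(weeks):
        primary = engineers[(2 * w) % n]
        backup = engineers[(2 * w + 1) % n]
        yield start + timedelta(weeks=w), primary, backup

roster = ["Alice", "Bob", "Charlie", "Dave", "Emma", "Frank", "Grace"]
schedule = list(weekly_rotation(roster, date(2026, 1, 5), 4))
# Week 1: Alice/Bob, Week 2: Charlie/Dave, Week 3: Emma/Frank, Week 4: Grace/Alice
```

With an odd roster size the stride-2 walk visits every engineer in both roles before repeating.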


Timezone-Aware Rotation

For distributed team:

Primary on-call: Engineer whose working hours are currently active
  (Most incidents are detected during business hours)

Backup on-call: Engineer in opposite timezone
  (If primary doesn't respond, backup is in their morning/evening)

Example setup (US + EU team):

Week 1:
  Mon-Fri 07:00-19:00 UTC: Charlie (EU morning/afternoon)
  All other hours (evenings + weekends): Alice (US afternoon/evening)

Week 2:
  Mon-Fri 07:00-19:00 UTC: Emma (EU morning/afternoon)
  All other hours (evenings + weekends): Frank (US afternoon/evening)

Better for on-call experience (fewer middle-of-night wakeups).

Respecting Boundaries

PagerDuty sleep rule:

Quiet hours: 2 AM - 7 AM on-call engineer's local time
  - Alerts still trigger but don't notify (no SMS/call)
  - Escalate to backup immediately instead

Example: Alice on-call in US Pacific (UTC-7)
  Quiet hours: 2-7 AM PT (9 AM - 2 PM UTC)
  Incident at 4 AM PT → escalates to backup immediately

This prevents burnout (on-call is stressful; middle-of-night wakeups are worse).
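The quiet-hours routing decision is just a timezone conversion plus a range check. A minimal sketch (function name and offsets are illustrative, mirroring the Alice example above):

```python
from datetime import datetime, timedelta

QUIET_START, QUIET_END = 2, 7  # quiet hours in the engineer's local time

def route_alert(utc_now, tz_offset_hours):
    """Return 'primary' or 'backup': during the primary's local quiet
    hours (2-7 AM), skip notifying them and page the backup instead."""
    local = utc_now + timedelta(hours=tz_offset_hours)
    if QUIET_START <= local.hour < QUIET_END:
        return "backup"
    return "primary"

# Alice in US Pacific (UTC-7): an incident at 11:00 UTC is 4 AM PT
target = route_alert(datetime(2026, 3, 21, 11, 0), -7)  # -> "backup"
```

A real PagerDuty setup expresses this with notification rules rather than custom code; the sketch just shows the decision.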


3. Runbook Template

A runbook is “what to do when X breaks.” 1-page maximum.

Template Structure

# Incident Runbook: Database Connection Pool Exhaustion

## Symptoms
- API returns "Connection timeout" errors
- Database connection count maxed
- Latency spikes on all endpoints

## Diagnosis (< 2 minutes)
1. Log in to Datadog dashboard (link: https://...)
2. Check metric: "postgres_active_connections"
3. If > 90% of pool capacity, proceed to Quick Fix
4. Check metric: "query_duration_p99"
5. If > 5s, the database is slow (switch to the slow-query runbook)

## Quick Fix (5 minutes)
1. SSH into app-server-1: `ssh ubuntu@app-1.internal`
2. Check connection status: `curl localhost:8080/health`
3. Restart app container: `docker restart app`
4. Verify: Check API returns 200 OK, Datadog shows recovery

If not recovered in 2 minutes, escalate to database team.

## Root Cause Investigation (post-incident)
- Check logs: `grep "Connection pool" /var/log/app.log | tail -100`
- Look for: Query hangs, connection leaks, traffic spike
- Common causes: Slow query, missing index, upstream service failure

## Escalation
If database team on-call unreachable after 3 min, escalate to VP Eng

## Verification Metrics
- Connection count: < 50 (normal)
- Query latency p99: < 200ms
- Error rate: < 0.1%
- All checks green: Incident resolved

## Post-Incident
- Schedule follow-up meeting to investigate root cause
- Implement prevention (e.g., connection pool monitoring)
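The Verification Metrics section above boils down to "every metric back under its threshold". A minimal sketch of that check (names and thresholds taken from the template; the metrics dict would come from your monitoring API):

```python
# Thresholds from the runbook's Verification Metrics section
THRESHOLDS = {
    "postgres_active_connections": 50,   # normal: < 50
    "query_latency_p99_ms": 200,         # normal: < 200 ms
    "error_rate": 0.001,                 # normal: < 0.1%
}

def incident_resolved(metrics, thresholds=THRESHOLDS):
    """True only when every verification metric is under its threshold.
    A missing metric counts as failing (treat unknown as not recovered)."""
    return all(metrics.get(name, float("inf")) < limit
               for name, limit in thresholds.items())

healthy = incident_resolved({"postgres_active_connections": 38,
                             "query_latency_p99_ms": 120,
                             "error_rate": 0.0004})   # True
still_bad = incident_resolved({"postgres_active_connections": 492,
                               "query_latency_p99_ms": 120,
                               "error_rate": 0.0004})  # False
```

Encoding the thresholds once means the on-call engineer and any automation agree on what "resolved" means.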

Real Runbook Examples

Example: Disk Space Exhaustion

# Incident Runbook: Production Disk Space Critical

## Symptoms
- File writes failing (500 errors)
- Datadog alert: "Disk > 95%"
- Log streaming stopping

## Diagnosis (< 2 minutes)
SSH: `ssh ubuntu@prod-1`
Check disk: `df -h /data`
Identify large directories: `du -sh /data/* | sort -rh`

## Quick Fix
# Delete old logs (safe)
find /data/logs -type f -mtime +30 -delete

# Restart logging
systemctl restart rsyslog

# Verify
df -h /data                  # should drop to < 80%
curl localhost:8080/health   # should return 200

## If Still Critical
Reclaim Docker space: `docker system prune -a`
This is more aggressive (it also removes all unused images, which redeploys will re-pull); verify services still run afterward

## Escalation
If above steps don't free space, page infra team
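The `df -h` check in the diagnosis step can also be automated for monitoring. A minimal sketch using Python's standard library (function names are mine; the 95% threshold matches the alert above):

```python
import shutil

def disk_usage_percent(path):
    """Percent of the filesystem at `path` in use (what `df -h` reports)."""
    usage = shutil.disk_usage(path)          # named tuple: total, used, free
    return 100 * usage.used / usage.total

def disk_critical(path, threshold=95.0):
    """True when the runbook's 'Disk > 95%' alert condition is met."""
    return disk_usage_percent(path) >= threshold
```

Wiring `disk_critical("/data")` into a periodic check lets the alert fire before writes start failing, rather than after.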

Example: Payment Service Failure

# Incident Runbook: Payment Processing Down

## Symptoms
- Checkout fails with "Payment gateway error"
- Stripe webhook queue backing up
- Customer emails arriving

## Diagnosis (< 2 minutes)
Check Stripe API status: https://status.stripe.com/
Check internal status page: https://internal/status/stripe-integration
Check logs: `grep "stripe_error" app.log | tail -20`

## Quick Fix Option 1: Stripe is Down
Wait for Stripe recovery, display banner to customers
Enable "maintenance mode" to prevent orders during outage
https://internal/admin/maintenance-mode

## Quick Fix Option 2: Our Integration is Broken
Restart Stripe sync: `kubectl rollout restart deployment/stripe-sync`
Verify: `curl https://internal/api/stripe-health`
Check queue size: `redis-cli GET stripe:queue:length`

## If Queue Backing Up > 1 hour
Page payments team, consider manual order approval
Escalate to CTO

## Post-Incident
- Review Stripe API logs for error patterns
- Add more detailed error logging to catch next time
- Improve monitoring on queue depth

Runbook Best Practices

  1. One page maximum — longer and people skip it
  2. Time budgets in headings — “Diagnosis (< 2 min)” sets expectations
  3. Exact commands — copy/paste should work
  4. Links to tools — don’t make people search for dashboard URLs
  5. Escalation criteria — “if X hasn’t resolved in 5 min, escalate to Y”
  6. Post-incident section — so you improve next time

4. Incident Communication During Active Incident

Slack Channel Setup

Create: #incidents (or #incident-response for larger teams)

During incident:

  1. Create thread in #incidents with incident ID
  2. One person is “scribe” (writes updates in thread)
  3. Responders post findings/actions to thread
  4. Every 5 minutes scribe posts status update
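The scribe's thread can be modeled as a small data structure, which is handy if you later generate the post-mortem timeline from it automatically. A minimal sketch (class and method names are illustrative):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentThread:
    incident_id: str
    severity: str
    status: str = "Investigating"
    updates: list = field(default_factory=list)

    def log(self, author, message, ts=None):
        """Append a timestamped entry in the scribe's thread format."""
        ts = ts or datetime.now(timezone.utc)
        self.updates.append(f"[{ts:%H:%M}] {author}: {message}")

    def resolve(self, note, ts=None):
        self.status = "RESOLVED"
        self.log("Status", f"RESOLVED - {note}", ts)

thread = IncidentThread("INC-2026-3421", "P1")
thread.log("Alice", "Confirmed database connection exhaustion (492/500 active)",
           datetime(2026, 3, 21, 2, 31))
thread.resolve("all metrics normal, error rate < 0.1%",
               datetime(2026, 3, 21, 2, 37))
```

The same entries the scribe types into Slack become the raw timeline for the post-mortem, with no re-transcription.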

Example thread:

Thread started: 2026-03-21 02:30 UTC by Alice
Incident ID: INC-2026-3421
Severity: P1 (customers affected)
Status: Investigating

[02:31] Alice: Confirmed database connection exhaustion (492/500 active)
[02:32] Bob: Restarting connection pool service
[02:33] Bob: Pool restarted, connections dropping (now 280/500)
[02:35] Alice: API latency recovering, error rate dropping
[02:37] Status: RESOLVED - all metrics normal, error rate < 0.1%

Root cause: Query optimization missing on bulk user export
Impact: 7 min outage, 2% of transactions failed during window
Post-mortem: Thursday 2pm UTC

Key: Everyone knows status without jumping between channels.

Customer Communication

Public status page setup (tools: Atlassian Statuspage, formerly StatusPage.io, or a custom page):

During incident:

INVESTIGATING — Service partially unavailable
Some customers may experience slow payment processing. Our team is investigating.

[02:31] We've identified unusual database activity
[02:35] We've deployed a fix and are monitoring recovery
[02:37] Service is recovering, all systems normal

Post-incident:

RESOLVED — Full details available in blog post

Root cause: Missing index on bulk export query
Duration: 7 minutes (02:30-02:37 UTC)
Impact: 2% of transactions failed
Prevention: Added database monitoring, index optimization

Full technical post-mortem: https://...

5. Post-Mortem Template

Conducted within 48 hours, while details are fresh.

Format

# Post-Mortem: Database Connection Pool Exhaustion (INC-2026-3421)

## Timeline
02:30 UTC — Prometheus alert fires (DB connections 95%)
02:31 UTC — PagerDuty notifies Alice (on-call engineer)
02:32 UTC — Alice acknowledges, starts investigation
02:33 UTC — Root cause identified: missing index on user export query
02:35 UTC — Missing index added, optimized query redeployed
02:37 UTC — Connections drop, latency recovers
02:45 UTC — All systems stable, incident declared resolved

## Impact
- Duration: 7 minutes
- Affected: ~2% of payment transactions (450 failed)
- Customer-facing: Payment page returned errors
- Team effort: 1 engineer, ~15 min response + fix

## Root Cause
Bulk user export feature added Friday, no performance testing on production dataset.
Query performed full table scan (50M users) instead of indexed lookup.
Query took 45+ seconds per request, exhausted connection pool within minutes.

## Why Wasn't This Caught?
1. Feature had unit tests (passed)
2. Feature had integration tests on staging data (passed, only 10k test users)
3. No performance test against production-scale data
4. No index on table, even though query required it

## Lessons Learned
1. All new queries should have EXPLAIN ANALYZE review
2. Staging environment doesn't match production scale
3. Index recommendations should be automated in code review

## Action Items (Who / When)
1. [Alice] Add database.md runbook for connection pool exhaustion (by Friday)
2. [Bob] Create script to compare staging vs prod data volumes (by next week)
3. [Charlie] Set up automated EXPLAIN ANALYZE checks in CI (by sprint end)
4. [Dave] Review all bulk query code for index coverage (by next week)

## Follow-Up
- Review in 1 week (are action items complete?)
- Monitor bulk export performance daily for next 2 weeks
- Mention in team standup (everyone learns from this)
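The 1-week follow-up amounts to checking which action items are past due. A minimal sketch (the item dicts mirror the Who/When list above; the helper name is mine):

```python
from datetime import date

def overdue_items(action_items, today):
    """Action items past their due date and not yet done —
    exactly what the 1-week follow-up review should surface."""
    return [item for item in action_items
            if not item["done"] and item["due"] < today]

items = [
    {"owner": "Alice", "task": "connection-pool runbook",
     "due": date(2026, 3, 27), "done": True},
    {"owner": "Bob", "task": "staging vs prod data volume script",
     "due": date(2026, 3, 30), "done": False},
]
late = overdue_items(items, today=date(2026, 4, 1))  # -> Bob's item only
```

Tracking items this explicitly is what separates a post-mortem that changes behavior from one that just produces a document.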

Post-Mortem Best Practices

What NOT to do:

  - Name individuals as the root cause
  - Treat the meeting as a performance review
  - Stop at “human error” without asking why the system allowed it

What TO do:

  - Keep it blameless: focus on process and system gaps
  - Give every action item an owner and a due date
  - Share the write-up with the whole team


6. Complete Setup Checklist

Week 1: Foundation

  - Set up PagerDuty (or OpsGenie) and connect monitoring alerts
  - Define the escalation policy (primary → backup → manager)
  - Publish the on-call rotation schedule

Week 2: Runbooks

  - Write 1-page runbooks for your top incident types
  - Link dashboards and exact commands in each runbook

Week 3: Communication

  - Create the #incidents Slack channel and scribe process
  - Set up the public status page
  - Adopt the post-mortem template

Week 4: Validation

  - Fire a test alert end-to-end (alert → page → acknowledge)
  - Walk through one runbook as a drill
  - Measure baseline MTTA/MTTR


Real Metrics to Track

After 2 weeks of process:

Mean Time to Acknowledge (MTTA):
- Before: No process (alerts buried in Slack, ~30 min)
- After: 2 minutes (PagerDuty SMS/call)

Mean Time to Recovery (MTTR):
- Before: 45 minutes (waiting for morning, lack of runbook)
- After: 12 minutes (runbook + prepared engineer)

Time to Escalation:
- Before: No process, unclear
- After: 5 minutes to first backup, 10 to manager

Customer Impact Severity:
- Before: Major incidents often hit customers before team aware
- After: Usually resolved before notification goes out
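Per-incident, these metrics come straight from the alert/acknowledge/recover timestamps. A minimal sketch using the INC-2026-3421 timeline from the post-mortem above:

```python
from datetime import datetime

def minutes_between(start, end):
    """Elapsed minutes between two timestamps."""
    return (end - start).total_seconds() / 60

# Timestamps from the INC-2026-3421 timeline
fired        = datetime(2026, 3, 21, 2, 30)
acknowledged = datetime(2026, 3, 21, 2, 32)
recovered    = datetime(2026, 3, 21, 2, 37)

mtta = minutes_between(fired, acknowledged)  # 2.0 minutes
mttr = minutes_between(fired, recovered)     # 7.0 minutes
```

Averaging these per-incident numbers over a rolling window gives the team-level MTTA/MTTR figures quoted above.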

Common Mistakes

Mistake 1: Runbook too long (3+ pages)
  Fix: cut to one page; move background detail to a linked doc.

Mistake 2: Post-mortems become blame sessions
  Fix: keep them blameless; discuss process gaps, not people.

Mistake 3: On-call rotation unfair
  Fix: rotate primary and backup through everyone on a fixed cycle.

Mistake 4: No escalation policy
  Fix: define time-boxed escalation (5 min to backup, 10 min to manager).

Mistake 5: Runbooks never updated
  Fix: make updating the relevant runbook a standing post-mortem action item.


Conclusion

Incident response structure for distributed teams requires:

  1. Alert routing (PagerDuty/OpsGenie) — ensures right person notified
  2. On-call rotation (fair, timezone-aware)
  3. Runbooks (1 page, exact steps)
  4. Communication (Slack + status page)
  5. Post-mortems (blameless, actionable)

Setup takes 2-4 weeks. The payoff: acknowledgment drops from ~30 minutes to ~2, recovery from ~45 minutes to ~12, and most incidents are resolved before customers notice.
Start with top 5 incidents. Grow from there.