Incident Management Setup for a Remote DevOps Team of 5

When your five-person DevOps team is distributed across time zones, incident response becomes significantly harder. Without clear protocols, a production issue at 2 AM means scrambling to find who is on-call, digging through scattered documentation, and making critical decisions in a vacuum. A well-structured incident management setup transforms this chaos into a repeatable, calm response process.

This guide covers the essential components for setting up incident management that works for a small remote DevOps team.

Defining Incident Severity Levels

Establishing clear severity levels upfront prevents over-escalation and ensures appropriate response times. For a team of five, a four-tier system is enough:

  1. SEV1: Complete outage or data loss. Page the primary on-call immediately; pull in the whole team if needed.
  2. SEV2: Major functionality degraded for many users. Page the primary on-call.
  3. SEV3: Minor issue or degraded performance with a known workaround. Handle during working hours.
  4. SEV4: Cosmetic or low-impact defect. Track in the backlog.

Document these levels in your team wiki and ensure every team member can reference them quickly during an incident.
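Encoding the tiers as data lets alert routers and chat bots share one definition. A minimal Python sketch; the acknowledgement targets below are illustrative assumptions, not prescriptive values:

```python
# Severity tiers with illustrative acknowledgement targets (minutes).
# Adjust descriptions and targets to match your own definitions.
SEVERITY_LEVELS = {
    "SEV1": {"description": "Complete outage or data loss", "ack_minutes": 5},
    "SEV2": {"description": "Major functionality degraded", "ack_minutes": 15},
    "SEV3": {"description": "Minor issue with a workaround", "ack_minutes": 60},
    "SEV4": {"description": "Cosmetic or low-impact defect", "ack_minutes": 24 * 60},
}

def needs_page(severity: str) -> bool:
    # Only the top two tiers wake a human; the rest go to Slack or the backlog
    return severity in ("SEV1", "SEV2")
```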

Building the On-Call Rotation

With five team members, a simple weekly rotating on-call schedule works well. Each person takes one week, then rotates. Here is a basic schedule structure in YAML:

# oncall-schedule.yaml
# Weeks are ISO week numbers; the five-week cycle continues through week 52
rotation:
  - name: "Alice"
    timezone: "America/Los_Angeles"
    primary_weeks: [1, 6, 11, 16, 21, 26, 31]
  - name: "Bob"
    timezone: "America/New_York"
    primary_weeks: [2, 7, 12, 17, 22, 27, 32]
  - name: "Carol"
    timezone: "Europe/London"
    primary_weeks: [3, 8, 13, 18, 23, 28, 33]
  - name: "David"
    timezone: "Europe/Berlin"
    primary_weeks: [4, 9, 14, 19, 24, 29, 34]
  - name: "Eve"
    timezone: "Asia/Tokyo"
    primary_weeks: [5, 10, 15, 20, 25, 30, 35]

The primary on-call handles all initial alerts. Designate a secondary as backup for when the primary is unavailable or overwhelmed; a simple convention is that the previous week's primary serves as this week's secondary. Define clear handoff procedures: the outgoing on-call should summarize active issues and any pending changes to the incoming person.
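Because the rotation is strictly weekly, the current pair can be computed straight from the ISO week number. A minimal Python sketch; the team names and the "previous primary is secondary" convention are assumptions for illustration:

```python
from datetime import date

# Rotation order mirrors oncall-schedule.yaml (names are the example team)
ROTATION = ["Alice", "Bob", "Carol", "David", "Eve"]

def oncall_for(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the given date.

    Primary rotates weekly by ISO week number; the previous week's
    primary serves as secondary (an assumed convention).
    """
    week = day.isocalendar()[1]              # ISO week, 1..53
    idx = (week - 1) % len(ROTATION)
    primary = ROTATION[idx]
    secondary = ROTATION[(idx - 1) % len(ROTATION)]
    return primary, secondary
```

For example, the week of 2024-01-01 (ISO week 1) maps to Alice as primary with Eve as secondary.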

Creating Effective Runbooks

Runbooks are step-by-step guides for handling specific incidents. They reduce cognitive load during stressful situations and ensure consistent responses regardless of who handles the incident.

Structure each runbook with these sections:

  1. Trigger conditions: When should this runbook be used?
  2. Immediate actions: What to do in the first 60 seconds
  3. Diagnosis steps: How to identify the root cause
  4. Resolution steps: Concrete commands or actions to fix the issue
  5. Verification: How to confirm the issue is resolved
  6. Follow-up: Post-incident tasks and notifications

Here is an example runbook for high CPU usage:

# Runbook: High CPU Usage

## Trigger
- CPU usage exceeds 90% on any production server for more than 5 minutes

## Immediate Actions
1. Check if this is expected (batch job, heavy load)
2. Identify affected servers: `kubectl top nodes`

## Diagnosis
1. Identify processes: `top -c` (Linux) or `Get-Process` (Windows)
2. Check for recent deployments: `kubectl rollout history deployment/your-app`
3. Review logs: `kubectl logs -l app=your-app --tail=100`

## Resolution
1. If deployment issue: `kubectl rollout undo deployment/your-app`
2. If runaway process: `kill -15 <PID>` (graceful) or `kill -9 <PID>` (force)
3. Scale up temporarily: `kubectl scale deployment/your-app --replicas=6`

## Verification
- CPU drops below 70% on affected servers
- Response times return to normal
- No error spikes in logs

## Follow-up
- Document root cause in incident report
- Schedule post-mortem within 48 hours

Build runbooks incrementally. Start with the five most common incident types your team faces, then expand as you encounter new scenarios.

Setting Up Alert Routing

Alert routing ensures the right person receives the right notifications. Use a tiered approach: critical alerts page the primary on-call, warnings go to the secondary, and team-specific alerts route to that team's owner.

Example alert configuration using Prometheus Alertmanager:

# alertmanager.yaml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'oncall-primary'
  routes:
    - match:
        severity: critical
      receiver: 'oncall-primary'
      continue: true
    - match:
        severity: warning
      receiver: 'oncall-secondary'
      continue: true
    - match:
        team: platform
      receiver: 'platform-owner'

receivers:
  - name: 'oncall-primary'
    email_configs:
      - to: 'oncall-primary@example.com'
    slack_configs:
      - channel: '#incidents'
  - name: 'oncall-secondary'
    email_configs:
      - to: 'oncall-secondary@example.com'
  - name: 'platform-owner'
    email_configs:
      - to: 'platform-team@example.com'
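The `continue: true` flag is easy to misread: after a route matches, Alertmanager evaluates the remaining sibling routes only when `continue` is set, so a critical platform alert can reach two receivers. A simplified model of that first-level matching, not the real implementation:

```python
# Simplified model of Alertmanager's first-level route matching.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "oncall-primary", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "oncall-secondary", "continue": True},
    {"match": {"team": "platform"}, "receiver": "platform-owner", "continue": False},
]
DEFAULT_RECEIVER = "oncall-primary"  # top-level fallback receiver

def receivers_for(labels: dict) -> list[str]:
    """Return every receiver an alert with these labels is sent to."""
    hits = []
    for route in ROUTES:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            hits.append(route["receiver"])
            if not route["continue"]:
                break  # stop evaluating further sibling routes
    return hits or [DEFAULT_RECEIVER]
```

Running `receivers_for({"severity": "critical", "team": "platform"})` shows the alert fanning out to both `oncall-primary` and `platform-owner`. You can also validate the real config file with `amtool check-config alertmanager.yaml` before deploying it.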

Incident Communication Templates

During an incident, clear communication prevents confusion. Prepare templates for common scenarios:

Initial Incident Alert (Slack):

🚨 INCIDENT SEV{{severity}}: {{title}}
Affected: {{services}}
On-call: {{responder}}
Status: Investigating
Update thread: {{thread_link}}

Status Update:

📢 INCIDENT UPDATE #{{incident_id}}
Status: {{investigating|identified|monitoring|resolved}}
Current understanding: {{brief_description}}
Next action: {{next_steps}}
ETA for resolution: {{eta}}

Incident Resolution:

✅ INCIDENT RESOLVED #{{incident_id}}
Duration: {{duration}}
Root cause: {{brief_explanation}}
Follow-up: {{ticket_links}}
Post-mortem: {{date}}
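A small helper can fill these `{{placeholder}}` templates before posting to Slack. A minimal Python sketch; the field values are hypothetical examples:

```python
import re

# The initial-alert template from above, verbatim
ALERT_TEMPLATE = """🚨 INCIDENT SEV{{severity}}: {{title}}
Affected: {{services}}
On-call: {{responder}}
Status: Investigating
Update thread: {{thread_link}}"""

def render(template: str, values: dict) -> str:
    # Replace each {{key}} placeholder; unknown keys are left untouched
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(values.get(m.group(1), m.group(0))),
        template,
    )

message = render(ALERT_TEMPLATE, {
    "severity": 1,
    "title": "API latency spike",
    "services": "checkout-api",
    "responder": "Alice",
    "thread_link": "https://example.test/incidents/123",
})
```

Leaving unknown placeholders untouched makes missing fields obvious in the posted message rather than silently blank.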

Post-Incident Review Process

After resolving any SEV1 or SEV2 incident, conduct a blameless post-mortem within 48 hours. The goal is identifying systemic improvements, not assigning blame.

Use this template:

  1. Summary: What happened and impact
  2. Timeline: Minute-by-minute events
  3. Root cause: Technical trigger and contributing factors
  4. What went well: Successful responses to highlight
  5. What could improve: Action items with owners and deadlines
  6. Similar risks: Other areas that could have similar issues

Track action items in your project management tool and assign clear owners. Review open action items in each team meeting until resolved.

Putting It All Together

Start by defining your severity levels and documenting them. Build runbooks for your top five most common incidents. Configure alert routing to notify the right people. Practice your incident response in a tabletop exercise before you need it.

With five team members, you have enough scale to provide good coverage without the complexity of larger on-call rotations. The key is consistency: follow your defined processes, update your runbooks after each incident, and continuously improve.

The goal is not to eliminate incidents; they will happen. The goal is to respond calmly and efficiently, and to learn from each one.

Built by theluckystrike — More at zovo.one