Claude Code for Incident Escalation Workflow Tutorial

Incident management is a critical part of any production system. When something goes wrong, you need a clear path from detection to resolution, with the right people notified at the right time. In this tutorial, you’ll learn how to build an incident escalation workflow system using Claude Code skills.

Why Build Escalation Workflows with Claude Code?

Traditional incident management tools require extensive configuration, custom integrations, and often come with steep learning curves. Claude Code skills offer a different approach:

Natural language configuration: Define escalation rules in readable YAML or Markdown
Flexible integrations: Connect to any notification system through custom tools
Contextual awareness: Claude understands your incident context and can make smart routing decisions
Learning capability: The system improves over time based on resolution patterns

Prerequisites

Before building your escalation workflow, ensure you have:

Claude Code installed and configured
Basic understanding of YAML syntax
Access to notification channels (Slack, email, PagerDuty, etc.)
A Claude skill for reading configuration files

Building Your First Escalation Skill

Let’s create a skill that handles incident escalation from detection through resolution.

Step 1: Define the Skill Structure

Create a new skill file called incident-escalation.md:

---
name: incident-escalation
description: Handles incident escalation workflows with tiered response times
---

# Incident Escalation Handler

You help manage the full lifecycle of incidents from initial detection through resolution.

## Detecting and Classifying Incidents

When a new incident is reported, you must first classify its severity:

- **SEV1 (Critical)**: Complete service outage, data loss, security breach
- **SEV2 (High)**: Major feature broken, significant performance degradation
- **SEV3 (Medium)**: Minor feature issue, workaround available
- **SEV4 (Low)**: Cosmetic issues, minor inconveniences

## Escalation Timeline Rules

Follow these escalation timelines:

| Severity | Initial Response | Escalation After | Notify |
|----------|-------------------|------------------|--------|
| SEV1 | 15 minutes | 30 minutes | On-call + Manager |
| SEV2 | 1 hour | 2 hours | Team Lead |
| SEV3 | 4 hours | 8 hours | Team Channel |
| SEV4 | 24 hours | 48 hours | Ticket Queue |

Step 2: Create Escalation Configuration

Create a YAML configuration file that defines your escalation rules:

# incident-config.yaml
escalation:
  tiers:
    - name: on_call
      response_time_minutes: 15
      contacts:
        - type: slack
          channel: "#incidents"
        - type: pagerduty
          service: primary
      escalate_after: 30
      
    - name: team_lead
      response_time_minutes: 60
      contacts:
        - type: slack
          user: "{{team_lead_id}}"
      escalate_after: 120
      
    - name: manager
      response_time_minutes: 30
      contacts:
        - type: email
          address: "{{manager_email}}"
      escalate_after: 60

severity_rules:
  SEV1:
    auto_escalate: true
    create_war_room: true
    notify_stakeholders: true
    
  SEV2:
    auto_escalate: true
    create_war_room: false
    notify_stakeholders: false
    
  SEV3:
    auto_escalate: false
    create_war_room: false
    notify_stakeholders: false
    
  SEV4:
    auto_escalate: false
    create_war_room: false
    notify_stakeholders: false

Step 3: Implement the Escalation Logic

Now let’s build the core escalation handling. Create a skill that processes incidents:

#!/usr/bin/env python3
"""Incident escalation processor"""

import yaml
import time
from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Incident:
    title: str
    severity: str
    description: str
    created_at: datetime
    assigned_tier: Optional[str] = None
    status: str = "open"
    escalation_count: int = 0

class EscalationEngine:
    def __init__(self, config_path: str):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)
            
    def get_escalation_tier(self, severity: str) -> dict:
        """Determine which escalation tier applies"""
        severity_rules = self.config['escalation']['tiers']
        
        for tier in severity_rules:
            if severity == "SEV1":
                return tier  # Start at highest tier
            elif severity == "SEV2":
                if tier['name'] in ['on_call', 'team_lead']:
                    return tier
            # ... handle other severities
                    
    def should_escalate(self, incident: Incident) -> bool:
        """Check if incident should be escalated"""
        tier = self.get_escalation_tier(incident.severity)
        escalation_window = tier['escalate_after']
        
        time_elapsed = datetime.now() - incident.created_at
        return time_elapsed > timedelta(minutes=escalation_window)
        
    def escalate(self, incident: Incident) -> Incident:
        """Perform escalation action"""
        incident.escalation_count += 1
        # Add escalation logic here
        return incident

Step 4: Integration with Notification Systems

Here’s how to integrate with Slack for notifications:

import requests
from typing import Dict, Any

class SlackNotifier:
    def __init__(self, webhook_url: str, bot_token: str):
        self.webhook_url = webhook_url
        self.bot_token = bot_token
        
    def send_incident_alert(self, incident: Incident, tier: dict) -> bool:
        """Send incident alert to appropriate channel"""
        message = {
            "text": f"🚨 *INCIDENT ESCALATED*",
            "blocks": [
                {
                    "type": "header",
                    "text": {
                        "type": "plain_text",
                        "text": f"🚨 {incident.severity}: {incident.title}"
                    }
                },
                {
                    "type": "section",
                    "fields": [
                        {"type": "mrkdwn", "text": f"*Description:*\n{incident.description}"},
                        {"type": "mrkdwn", "text": f"*Status:*\n{incident.status}"},
                        {"type": "mrkdwn", "text": f"*Escalation Level:*\n{tier['name']}"},
                        {"type": "mrkdwn", "text": f"*Time Elapsed:*\n{self.get_time_elapsed(incident)}"}
                    ]
                }
            ]
        }
        
        response = requests.post(
            self.webhook_url,
            json=message,
            headers={"Content-Type": "application/json"}
        )
        
        return response.status_code == 200

Practical Example: End-to-End Workflow

Here’s how a complete incident workflow operates:

Incident Detection: A monitoring system or team member reports an issue
Classification: Claude determines severity based on description and system impact
Initial Response: Appropriate on-call personnel are notified immediately
Monitoring: Claude tracks response times and escalation thresholds
Auto-Escalation: If no acknowledgment within threshold, automatically escalate
Resolution: Once resolved, generate post-incident report

## Example Interaction

User: "We're seeing elevated error rates on the payment service"

Claude: I'll help you create an incident for this. Based on the description, this appears to be a SEV2 (High) incident - major feature broken with significant user impact.

Let me gather more details:
- What's the current error rate percentage?
- Is this affecting all payment methods or specific ones?
- How many users are impacted?

Once you provide these details, I'll:
1. Create the incident record
2. Notify the on-call team
3. Set up the escalation timer
4. Begin the incident timeline

Best Practices for Incident Escalation

Always Define Clear Severity Levels

Ambiguous severity classifications lead to under- or over-escalation. Create explicit criteria for each level and train your team to use them consistently.

Automate the Mundane

Use Claude Code to handle routine tasks:

Initial incident triage
Status updates to stakeholders
Escalation timing checks
Post-incident report generation

Maintain an Audit Trail

Every action should be logged:

When incidents are created
Who was notified and when
All status changes
Resolution details

Regular Review and Tuning

Periodically review your escalation patterns:

Are response times improving?
Are escalations appropriate?
Are the right people being notified?

Conclusion

Building an incident escalation workflow with Claude Code gives you flexibility, context-awareness, and continuous improvement capabilities. Start with simple rules, add complexity as your team matures, and always prioritize clear communication over clever automation.

The key is to balance automation with human judgment - let Claude handle the timing and routing, but ensure experienced team members make critical decisions about severity and response strategies.

Built by theluckystrike — More at zovo.one