Claude Code for Incident Management Workflow Tutorial
Incident management is one of the most valuable areas to automate with Claude Code skills. Whether you’re handling service outages, security breaches, or production issues, well-designed Claude skills can reduce response times, ensure consistent processes, and free your team from repetitive triage work. This tutorial walks you through building a complete incident management workflow using Claude skills.
Understanding Incident Management in Claude Code
Before diving into code, let’s establish what an incident management workflow needs to accomplish. Traditional incident response follows a structured lifecycle: detection, triage, mitigation, resolution, and post-incident review. Each stage generates specific artifacts: status updates, escalation notifications, runbook links, and root cause analysis (RCA) documents.
Claude skills excel at this because they can:
- Parse incoming alerts and categorize them by severity
- Execute diagnostic commands to gather context
- Generate and send notifications to appropriate channels
- Create and update incident documentation in real-time
- Guide responders through runbooks step-by-step
The key is designing skills that handle one responsibility well, then composing them together for complex workflows.
Building Your First Incident Triage Skill
Every incident workflow starts with triage—quickly understanding what happened and how serious it is. Let’s create a skill that accepts an alert and produces a structured incident assessment.
Create a file called incident-triage.md in your skills directory:
---
name: incident-triage
description: Triages incoming incidents by analyzing alert data, determining severity, and recommending initial actions
tools: [read_file, bash, write_file]
trigger: "incident triage"
---
# Incident Triage Skill
You are an experienced on-call engineer performing incident triage. Analyze the provided alert information and produce a structured assessment.
## Input Format
When invoked, you will receive:
- Alert summary from {{alert_summary}}
- Error messages from {{error_messages}}
- Relevant metrics from {{metrics}}
## Your Task
1. **Classify the incident type**: Is this a performance issue, availability failure, security concern, or data problem?
2. **Determine severity** using this matrix:
   - SEV1: Complete service outage, data loss, or security breach
   - SEV2: Significant degradation affecting major functionality
   - SEV3: Minor impact with workaround available
   - SEV4: Cosmetic or informational only
3. **Identify affected components** from the error patterns
4. **Recommend initial actions**:
   - Which runbook to consult
   - Whether immediate escalation is required
   - What diagnostic commands to run first
## Output Format
Produce your assessment in this structure:
- **Incident Type**: [classification]
- **Severity**: [SEV1-4]
- **Affected Systems**: [list]
- **Initial Actions**: [numbered list]
- **Escalation Recommendation**: [yes/no and reason]
This skill uses template placeholders ({{alert_summary}}, etc.) in its body to receive dynamic input. Note that they appear in the instructions, not in the front matter. When another skill or automation invokes this skill, the caller supplies a value for each placeholder.
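To make the placeholder mechanism concrete, here is a minimal Python sketch of how a caller might fill the `{{...}}` placeholders before the instructions are used. The `render_skill` helper and the sample alert values are hypothetical and exist only for illustration; how your own setup substitutes values may differ.

```python
import re

# Matches placeholders such as {{alert_summary}}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_skill(template: str, values: dict[str, str]) -> str:
    """Replace every {{name}} placeholder with its supplied value.

    Raises KeyError when the template references a name that was not
    supplied, so a half-filled prompt is never passed along.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing value for placeholder: {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


# Excerpt of the triage skill's input section
template = (
    "- Alert summary from {{alert_summary}}\n"
    "- Error messages from {{error_messages}}\n"
    "- Relevant metrics from {{metrics}}"
)

# Made-up sample values (hypothetical alert)
prompt = render_skill(template, {
    "alert_summary": "checkout-api 5xx rate above 20%",
    "error_messages": "upstream connect error: connection refused",
    "metrics": "p99 latency 8.2s, error rate 23%",
})
print(prompt)
```

Failing loudly on a missing value is a deliberate choice in this sketch: a triage prompt with an unfilled placeholder is more misleading than no prompt at all.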
Creating an On-Call Escalation Handler
Once triage identifies an incident, you need to notify the right people. The escalation skill handles this by determining who to contact based on severity, time of day, and incident type.
---
name: incident-escalation
description: Handles incident escalation based on severity, on-call schedules, and incident type
tools: [read_file, bash]
trigger: "escalate incident"
---
# Incident Escalation Handler
You manage incident escalation for the platform team. Your job is ensuring the right people are notified quickly.
## On-Call Configuration
You have access to on-call rotation data in `/etc/oncall/rotations.yaml`:
- Primary on-call engineer
- Secondary (backup) engineer
- Manager contact for SEV1 incidents
- Security team alias for security incidents
## Escalation Rules
### By Severity
- **SEV1**: Notify primary + secondary + manager immediately
- **SEV2**: Notify primary; escalate to secondary if no acknowledgment in 5 minutes
- **SEV3**: Notify primary only
- **SEV4**: Log for next business day
### By Type
- **Security incidents**: Also notify security-team@company.com
- **Database issues**: Include dba-team in the notification
- **API failures**: Include API team lead
### Time-Based Rules
- During business hours (9am-6pm local): Use standard escalation
- After hours: Always notify both primary and secondary for SEV2+
## Your Task
1. Read the current on-call rotation to identify who is primary/secondary
2. Determine the appropriate escalation path based on the incident details
3. Format the notification message with:
   - Incident summary
   - Severity level
   - Link to incident doc
   - Direct contact info for on-call
4. Execute the appropriate notification command:
   `./scripts/notify-oncall.sh --severity {{severity}} --type {{incident_type}} --message "{{notification_message}}"`
## Output
Confirm the escalation was sent and list all notified parties.
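Outside the skill file itself, the escalation rules above can be sketched as plain code, which is handy for checking that the rules are unambiguous before handing them to a skill. This is a minimal illustration, not the logic of `notify-oncall.sh`: the `Incident` type, the `recipients` function, and the `api-team-lead` alias are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Incident:
    severity: str        # "SEV1" .. "SEV4"
    incident_type: str   # e.g. "security", "database", "api", "other"
    local_time: time     # local time at which the incident was raised


def is_business_hours(t: time) -> bool:
    """9am-6pm local, matching the time-based rules."""
    return time(9, 0) <= t < time(18, 0)


def recipients(incident: Incident) -> list[str]:
    """Return who to notify immediately, in order, without duplicates."""
    by_severity = {
        "SEV1": ["primary", "secondary", "manager"],
        "SEV2": ["primary"],   # secondary follows after 5 unacknowledged minutes
        "SEV3": ["primary"],
        "SEV4": [],            # logged for the next business day
    }
    notify = list(by_severity[incident.severity])

    # After hours, SEV2 and above always page both primary and secondary.
    after_hours = not is_business_hours(incident.local_time)
    if incident.severity in ("SEV1", "SEV2") and after_hours:
        for who in ("primary", "secondary"):
            if who not in notify:
                notify.append(who)

    by_type = {
        "security": "security-team@company.com",
        "database": "dba-team",
        "api": "api-team-lead",  # hypothetical alias for the API team lead
    }
    extra = by_type.get(incident.incident_type)
    if extra and extra not in notify:
        notify.append(extra)
    return notify


print(recipients(Incident("SEV2", "database", time(2, 30))))
# ['primary', 'secondary', 'dba-team']
```

The 5-minute acknowledgment timer for SEV2 is left out on purpose, since it needs state that a pure function cannot hold.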
Building a Post-Incident Review Automator
After an incident is resolved, teams need to conduct post-incident reviews (PIRs) or root cause analyses (RCAs). This skill automates gathering the necessary data and generating a template.
---
name: post-incident-review
description: Generates post-incident review documentation by gathering metrics, logs, and timeline data
tools: [read_file, bash, write_file]
trigger: "generate incident review"
---
# Post-Incident Review Generator
You help teams conduct thorough post-incident reviews by automatically gathering relevant data and generating structured documentation.
## Input
- Incident ID: {{incident_id}}
- Incident start time: {{start_time}}
- Incident end time: {{end_time}}
- Affected services: {{affected_services}}
## Data Gathering Tasks
Execute these commands to collect incident data:
1. **Fetch relevant metrics**:
   ```bash
   ./scripts/export-metrics.sh --service {{affected_services}} --start {{start_time}} --end {{end_time}}
   ```
2. **Retrieve incident timeline** from your ticketing system:
   ```bash
   ./scripts/get-incident-timeline.py --id {{incident_id}}
   ```
3. **Collect relevant logs** from the incident window:
   ```bash
   ./scripts/aggregate-logs.py --services {{affected_services}} --window {{start_time}}-{{end_time}}
   ```
4. **Pull alert history** leading up to the incident:
   ```bash
   ./scripts/get-alert-history.sh --services {{affected_services}} --hours 2
   ```
## Documentation Template
Generate a PIR document with these sections:
### Summary
Brief overview of what happened, impact, and duration
### Timeline
- Time of first alert
- Time to acknowledge
- Time to mitigation
- Time to resolution
### Root Cause Analysis
Technical explanation of the failure
### What Went Well
Positive observations and successful mitigations
### Action Items
Specific, assignable improvements with owners
### Lessons Learned
Process and communication improvements
## Output
Write the complete PIR to `/incident-reviews/{{incident_id}}-pir.md` and confirm the file was created.
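The Timeline section of the PIR reduces to a few subtractions once the key timestamps are known. The sketch below shows one way to derive them in Python. The event key names and the sample timestamps are made up for illustration; the output format of `get-incident-timeline.py` is not specified in this tutorial, so adapt the parsing to whatever your ticketing system returns.

```python
from datetime import datetime, timedelta


def timeline_metrics(events: dict[str, str]) -> dict[str, timedelta]:
    """Compute PIR timeline durations relative to the first alert.

    `events` maps event names to ISO 8601 timestamps. The key names
    (first_alert, acknowledged, mitigated, resolved) are hypothetical.
    """
    parsed = {name: datetime.fromisoformat(ts) for name, ts in events.items()}
    start = parsed["first_alert"]
    return {
        "time_to_acknowledge": parsed["acknowledged"] - start,
        "time_to_mitigation": parsed["mitigated"] - start,
        "time_to_resolution": parsed["resolved"] - start,
    }


# Made-up sample incident, not real data
metrics = timeline_metrics({
    "first_alert": "2024-05-01T02:14:00+00:00",
    "acknowledged": "2024-05-01T02:19:00+00:00",
    "mitigated": "2024-05-01T02:47:00+00:00",
    "resolved": "2024-05-01T03:32:00+00:00",
})
for name, delta in metrics.items():
    print(f"{name}: {delta}")
```

Keeping the timestamps timezone-aware avoids off-by-hours errors when responders and systems sit in different regions.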
Composing Skills into a Complete Workflow
The real power of Claude skills comes from composing multiple skills together. You can create a master skill that orchestrates the entire incident lifecycle:
```markdown
---
name: incident-commander
description: Orchestrates the complete incident management lifecycle from detection through resolution
tools: [bash]
trigger: "handle incident"
---
# Incident Commander
You coordinate the response to production incidents, orchestrating specialized skills at each stage.
## Workflow
### Stage 1: Triage
Call the incident-triage skill with:
- alert_summary: {{alert_summary}}
- error_messages: {{error_messages}}
- metrics: {{metrics}}
### Stage 2: Escalation
Based on triage results, call incident-escalation:
- severity: [from triage output]
- incident_type: [from triage output]
- notification_message: [generated from triage]
### Stage 3: Resolution
Guide the responder through relevant runbooks. Execute diagnostic commands as needed.
### Stage 4: Post-Incident
Once resolved, call post-incident-review:
- incident_id: {{incident_id}}
- start_time: [from incident creation]
- end_time: [current timestamp]
- affected_services: [from triage]
## Response Time Targets
- **SEV1**: Full workflow complete within 30 minutes
- **SEV2**: Full workflow complete within 2 hours
- **SEV3**: Resolution within same business day
## Output
Provide a summary of actions taken at each stage and confirm all documentation is complete.
```
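The composition pattern in the commander skill, where each stage consumes fields produced by earlier stages, can be sketched in Python as a small pipeline over a shared context. Everything here is a stand-in under stated assumptions: `run_workflow`, the three stage functions, and the `INC-1234` identifier are hypothetical, and in practice each stage would invoke the corresponding skill rather than return canned values. Stage 3 (resolution) is omitted because it is guided by a human responder.

```python
from typing import Callable

# Each stage reads the shared incident context and returns new fields to merge in.
Stage = Callable[[dict], dict]


def run_workflow(context: dict, stages: list[tuple[str, Stage]]) -> dict:
    """Run stages in order, merging each stage's output into the context."""
    context = dict(context)      # avoid mutating the caller's dict
    context["completed"] = []
    for name, stage in stages:
        context.update(stage(context))
        context["completed"].append(name)
    return context


# Stand-in stages returning canned values (hypothetical)
def triage(ctx: dict) -> dict:
    return {
        "severity": "SEV2",
        "incident_type": "api",
        "affected_services": ["checkout-api"],
    }


def escalate(ctx: dict) -> dict:
    # Uses fields produced by triage, as in Stage 2 of the commander skill.
    message = f"{ctx['severity']} {ctx['incident_type']} incident"
    return {"notified": ["primary"], "notification_message": message}


def review(ctx: dict) -> dict:
    return {"pir_path": f"/incident-reviews/{ctx['incident_id']}-pir.md"}


result = run_workflow(
    {"incident_id": "INC-1234", "alert_summary": "checkout-api 5xx spike"},
    [("triage", triage), ("escalation", escalate), ("post-incident", review)],
)
print(result["completed"])
print(result["pir_path"])
```

Passing a single context between stages keeps each stage focused on one responsibility while still letting later stages see earlier results.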
Best Practices for Incident Management Skills
When building your own incident management skills, keep these principles in mind:
Start simple and iterate. Begin with a single skill that handles one scenario well. Add complexity only when you’ve validated the basic flow works.
Separate concerns. One skill should do one thing—triage, escalate, document, or diagnose. Composing skills is easier than debugging monolithic skills.
Always generate artifacts. Every incident should produce documentation. This creates an audit trail and enables future analysis.
Include human judgment points. Automated workflows should flag decisions that need human approval. Don’t let skills make business decisions autonomously.
Test with simulation. Before deploying, simulate incidents and verify your skills respond correctly. Run tabletop exercises where Claude handles the incident.
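As one way to approach simulation testing, the sketch below pairs simulated alerts with the severity a human reviewer expects and reports any mismatch. The `classify_severity` function is only a stand-in that mirrors the SEV1 to SEV4 matrix from the triage skill, and the alert fields and scenarios are hypothetical. In a real test you would replace it with a call that runs the triage skill and extracts the severity from its output.

```python
def classify_severity(alert: dict) -> str:
    """Stand-in classifier mirroring the SEV1-SEV4 matrix from the triage skill.

    The alert fields used here are hypothetical.
    """
    if alert.get("outage") or alert.get("data_loss") or alert.get("security_breach"):
        return "SEV1"
    if alert.get("major_degradation"):
        return "SEV2"
    if alert.get("workaround_available"):
        return "SEV3"
    return "SEV4"


# Simulated incidents paired with the severity a human reviewer expects.
SCENARIOS = [
    ({"name": "full outage", "outage": True}, "SEV1"),
    ({"name": "slow checkout", "major_degradation": True}, "SEV2"),
    ({"name": "broken export, manual path works", "workaround_available": True}, "SEV3"),
    ({"name": "typo on status page"}, "SEV4"),
]


def run_simulation(scenarios) -> list[str]:
    """Return the names of scenarios whose result differs from the expectation."""
    return [
        alert["name"]
        for alert, expected in scenarios
        if classify_severity(alert) != expected
    ]


failures = run_simulation(SCENARIOS)
print("all scenarios passed" if not failures else f"mismatches: {failures}")
```

A table of scenarios like this doubles as documentation of how your team interprets the severity matrix.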
Conclusion
Claude skills can help turn incident management from reactive firefighting into a structured, repeatable process. Skills for triage, escalation, resolution guidance, and post-incident review give you a workflow that stays consistent as your organization grows.
Start with the triage skill, add escalation handling, then layer in resolution guidance and post-incident reviews. Each step automates another stage of the incident lifecycle, which can shorten response times and make outcomes more consistent.
The key is treating skills as composable building blocks—each one focused, well-tested, and designed to work with others. Your incident management workflow will only be as strong as its weakest skill, so invest time in making each one robust.