Claude Skills Guide

Incident response is a critical aspect of DevOps and SRE practices. When production issues arise, having an efficient, repeatable workflow can mean the difference between quick resolution and extended downtime. Claude Code (the claude CLI) can help you automate, document, and work through incident response runbooks. This guide explores practical ways to integrate Claude Code into your incident response workflow.

Understanding Claude Code in Incident Response

Claude Code is a CLI tool that brings AI-assisted development to your terminal. Beyond writing code, it can serve as an intelligent companion during incidents—helping you diagnose issues, execute remediation steps, and document findings in real-time.

The key advantage is an assistant that can read your codebase, infrastructure configuration, and notes from previous incidents (as long as they are in the working directory or included in the prompt) while guiding you through structured runbook steps.

Setting Up Incident Response Runbooks

Before diving into automation, establish a clean runbook structure. Create a dedicated directory for your incident response documentation:

mkdir -p runbooks/{detection,mitigation,resolution,postmortem}

Each runbook should follow a consistent format:
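The format itself is not shown in this guide. As one possible layout, the sketch below prints a runbook skeleton that matches what the runner script later in this guide looks for: headings that start with "## Step", each followed by a bash code fence. The function name new_runbook and the section names (Overview, Escalation) are assumptions for illustration, not a prescribed standard.

```bash
#!/bin/bash
# new-runbook.sh: print a runbook skeleton (hypothetical helper).
# Assumed layout: "## Step N: ..." headings, each followed by a bash code fence.

new_runbook() {
    local title="$1"
    local fence
    # Build a three-backtick fence so this script contains no nested fences.
    fence=$(printf '%s' '`' '`' '`')

    cat <<EOF
# Runbook: ${title}

## Overview
Symptoms, affected services, and severity guidance.

## Step 1: Confirm the symptom
${fence}bash
# diagnostic commands go here
${fence}

## Step 2: Mitigate
${fence}bash
# mitigation commands go here
${fence}

## Escalation
Who to page if the steps above do not help.
EOF
}
```

For example, new_runbook "Service Degradation" > runbooks/mitigation/service-degradation.md creates a starting point that can then be filled in and committed.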

Integrating Claude Code into Your Workflow

1. Interactive Incident Investigation

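When an incident occurs, start an interactive Claude session with the symptoms as the opening prompt. Passing the prompt without --print keeps the session open for follow-up questions; --print answers once and exits, which suits the scripted examples later in this guide:

claude "We are experiencing high latency on the payment service.
The error rate has increased to 15%. The most recent deploy was 2 hours ago.
Help me diagnose the root cause and follow our runbook for service degradation incidents."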

This approach provides context from the start, allowing Claude to tailor its guidance to your specific situation.

2. Automated Runbook Execution

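Create shell scripts that walk responders through a runbook one step at a time. Here’s an example runbook runner that prints the commands for each step so they can be reviewed, or passed to Claude for explanation, before anyone runs them: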

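#!/bin/bash
# runbook-runner.sh: walk through a runbook one step at a time.
# It prints the commands for each step; it does not run them.

INCIDENT_TYPE="$1"
RUNBOOK_PATH="runbooks/${INCIDENT_TYPE}.md"

if [ -z "$INCIDENT_TYPE" ]; then
    echo "Usage: $0 <incident-type>" >&2
    exit 1
fi

if [ ! -f "$RUNBOOK_PATH" ]; then
    echo "Error: Runbook for '$INCIDENT_TYPE' not found" >&2
    exit 1
fi

echo "=== Starting $INCIDENT_TYPE Incident Runbook ==="
echo ""

# Read step headings on file descriptor 3 so stdin stays attached to the
# terminal for the "Press Enter" prompt.
while IFS= read -r step <&3; do
    echo "$step"
    read -r -p "Press Enter to show the commands for this step..."

    # Print the contents of the bash code fences that belong to this step.
    # The awk program is single-quoted so the shell leaves the backticks alone.
    awk -v step="$step" '
        /^## /            { in_step = ($0 == step); in_code = 0; next }
        in_step && /^```/ { in_code = ($0 == "```bash"); next }
        in_step && in_code
    ' "$RUNBOOK_PATH"
done 3< <(grep '^## Step' "$RUNBOOK_PATH")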

3. Real-Time Log Analysis

During incidents, you often need to analyze logs quickly. Use Claude to parse and summarize:

# Analyze recent errors from application logs
tail -n 500 /var/log/app/error.log | claude --print "Analyze these logs 
and identify: 1) Most frequent error patterns, 2) Timeline of failures, 
3) Potential root causes based on error messages"
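Large or noisy logs can be condensed before they reach Claude. The sketch below is one way to do that, assuming log lines that begin with an ISO-style timestamp; the function name summarize_errors and the normalization rules are illustrative assumptions, and real log formats will likely need different patterns.

```bash
# summarize_errors: hypothetical helper that groups similar log lines.
# It strips a leading timestamp, replaces UUIDs and numbers with placeholders,
# then counts identical lines and prints the most frequent ones first.
summarize_errors() {
    local top="${1:-10}"   # how many distinct patterns to keep
    sed -E \
        -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9]{2}:[0-9]{2}:[0-9]{2}[^ ]*//' \
        -e 's/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/<uuid>/g' \
        -e 's/[0-9]+/<n>/g' \
        -e 's/^ +//' |
        sort | uniq -c | sort -rn | head -n "$top"
}
```

A possible use is tail -n 5000 /var/log/app/error.log | summarize_errors 20 | claude --print "Which of these error patterns most likely explains the incident?". The timeline is lost in this summary, so keep the raw tail available for the follow-up questions about ordering.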

4. Database Incident Procedures

For database-related incidents, create specialized runbooks. Here’s a MySQL connection failure response:

# First check: MySQL service status
systemctl status mysql

# Second check: Connection attempts
mysql -u app_user -p -e "SELECT 1" 2>&1

# Third check: Recent connections
mysql -u root -e "SHOW PROCESSLIST;"

Let Claude help you interpret the output:

mysql -u root -e "SHOW PROCESSLIST;" | claude --print "Analyze this MySQL 
process list. Are there any long-running queries? Locked tables? 
Connection pool exhaustion?"
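On a busy server the process list can run to hundreds of rows, most of them idle. As a rough sketch, the filter below keeps only the header and the active rows that have been running longer than a threshold. It assumes the tab-separated output that the mysql client produces when its output is piped, with Command in the fifth column and Time in the sixth; the function name long_queries and the 30-second default are assumptions for illustration.

```bash
# long_queries: hypothetical helper that filters SHOW PROCESSLIST output.
# Reads tab-separated rows on stdin; keeps the header plus non-sleeping
# rows whose Time column (seconds) exceeds the threshold.
long_queries() {
    local threshold="${1:-30}"
    awk -F '\t' -v min="$threshold" '
        NR == 1 { print; next }                           # keep the header row
        $5 != "Sleep" && ($6 + 0) > (min + 0) { print }   # active and slow
    '
}
```

For example, mysql -u root -e "SHOW PROCESSLIST;" | long_queries 60 narrows the list to queries that have been running for more than a minute before it is passed to Claude.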

Building a Claude-Assisted Incident Command System

For larger incidents, establish a structured command system:

#!/bin/bash
# incident-command.sh

echo "=== INCIDENT COMMAND SYSTEM ==="
echo "1. Declare Incident"
echo "2. Execute Runbook"
echo "3. Update Status Page"
echo "4. Coordinate Response"
echo "5. Resolve and Document"

read -r -p "Select action: " action

case "$action" in
    1) claude --print "Help me draft an incident declaration 
        notification. Include: severity level, affected services, 
        current impact, and initial response team." ;;
    2) echo "Available runbooks:"; ls runbooks/ ;;
    3) claude --print "Generate a status page update template for 
        our current incident" ;;
    4) claude --print "Create a response coordination checklist 
        for a SEV1 incident" ;;
    5) claude --print "Help me create a post-incident review template 
        that covers: timeline, root cause, impact, action items" ;;
    *) echo "Unknown selection: $action" >&2; exit 1 ;;
esac

Best Practices for Claude-Assisted Incident Response

Context Preservation

Maintain a shared context file that Claude can reference:

# incident-context.md
## Current Incident: Payment Service Latency
- **Started**: 2026-03-15 14:32 UTC
- **Severity**: SEV2
- **On-Call**: @jane_devops
- **Affected**: payment-api, checkout-service
- **Current Status**: Investigating

Start each incident response session by loading this context:

claude "$(cat incident-context.md)

Now help us resolve this incident."
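Context is only useful if it stays current as the incident develops. A minimal sketch, assuming a notes file named incident-notes.md (the file name used for the postmortem later in this guide): the hypothetical log_event function appends a UTC-timestamped line, and the INCIDENT_NOTES variable is an assumed way to point it at a different file.

```bash
# log_event: hypothetical helper that appends a timestamped timeline entry.
# The target file defaults to incident-notes.md and can be overridden
# through the INCIDENT_NOTES environment variable.
log_event() {
    local notes_file="${INCIDENT_NOTES:-incident-notes.md}"
    # "--" stops option parsing because the format string starts with "-".
    printf -- '- %s UTC: %s\n' "$(date -u '+%Y-%m-%d %H:%M')" "$*" >> "$notes_file"
}
```

Calling log_event "Rolled back payment-api to the previous release" during the response builds a timeline that can be handed to Claude afterwards instead of being reconstructed from memory.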

Runbook Versioning

Track changes to your runbooks in Git:

git add runbooks/
git commit -m "Update incident response runbooks"
git tag "runbooks-$(date +%Y%m%d)"

This lets you roll back a problematic change and audit how each runbook has evolved.

Post-Incident Learning

After resolving an incident, use Claude to generate a thorough postmortem:

claude --print "Based on our incident notes: $(cat incident-notes.md), 
generate a post-incident review covering: executive summary, timeline, 
root cause analysis, impact assessment, and actionable prevention items"

Actionable Recommendations

  1. Start small: Pick one frequent incident type and create a Claude-assisted runbook. Measure improvement before expanding.

  2. Automate repetitive tasks: If you find yourself typing the same commands during every incident, script them and have Claude explain when to use each.

  3. Maintain runbook hygiene: Review and update runbooks after every incident. Claude can help identify gaps by comparing your actual response to the documented procedure.

  4. Train your team: Ensure all on-call engineers know how to invoke Claude quickly. Consider alias shortcuts:

# Add to ~/.bashrc
alias inc="claude --print"
# Aliases cannot take arguments, so the parameterized shortcut is a function
runbook() { claude --print "Help me execute our $1 runbook"; }

  5. Practice incident scenarios: Run tabletop exercises where your team uses Claude-assisted runbooks to respond to simulated incidents. This validates both the runbooks and the tooling.

Conclusion

Claude Code transforms incident response from purely manual procedures into an intelligent, assisted workflow. By providing immediate context, suggesting next steps, and helping analyze complex outputs, it reduces cognitive load during high-stress situations.

The key is starting with well-structured runbooks and progressively adding Claude integration where it provides the most value—typically in diagnosis, log analysis, and post-incident documentation. With this approach, you build a more resilient incident response capability that improves over time.

Remember: Claude enhances your team’s expertise but doesn’t replace good engineering judgment. Use it as a powerful tool within a mature incident management framework.