Claude Code for On-Call Runbook Workflow Tutorial

Every developer who has been on-call knows the feeling: it’s 3 AM, something is broken, and you’re scrambling through wiki pages, hunting for that one runbook that might help. Runbooks are meant to be our safety net, but often they become outdated, hard to navigate, or simply too slow to use when every second counts. This is where Claude Code transforms your on-call experience from chaotic scrambling into confident, efficient incident response.

What is Claude Code?

Claude Code is Anthropic’s CLI tool that brings Claude’s reasoning capabilities directly to your terminal. Unlike traditional chatbots, Claude Code operates as an autonomous agent that can execute commands, read and write files, and most importantly—understand your entire codebase. When applied to on-call runbooks, Claude Code becomes your intelligent assistant that knows your infrastructure, understands your services, and can execute remediation steps with your guidance.

The key advantage is that Claude Code doesn’t just follow scripts—it understands context. It knows which services depend on which databases, understands your deployment patterns, and can adapt runbook steps to your specific environment.

Setting Up Your First Claude-Managed Runbook

Before diving into complex workflows, let’s set up a basic runbook structure that Claude Code can work with. The goal is to create runbooks that are both human-readable and machine-executable.

Project Structure

Create a runbook directory in your project:

mkdir -p runbooks/oncall
cd runbooks/oncall

Each runbook should follow a consistent structure:

# Service: Payment API
# Severity: P1
# On-Call: Team Alpha

## Symptoms
- Elevated 5xx error rates
- Latency spikes > 2 seconds
- Customer complaints about failed transactions

## Diagnosis
1. Check service health: `curl -s service/payment/health`
2. Review recent deployments: `git log --oneline -10`
3. Examine error logs: `kubectl logs -l app=payment --tail=100`

## Remediation
1. If recent deployment: `kubectl rollout undo deployment/payment`
2. If database issues: `pg_isready -h db.payment.internal`
3. Scale up if needed: `kubectl scale deployment/payment --replicas=5`

Creating Claude-Enabled Runbook Scripts

The real power comes from creating executable runbooks that Claude Code can run autonomously. Let’s create a Python script that Claude can execute during incidents.

#!/usr/bin/env python3
"""Automated runbook executor for payment service incidents."""

import subprocess
import sys
import json
from datetime import datetime

class OnCallRunner:
    def __init__(self, service_name):
        self.service = service_name
        self.findings = []
        
    def run_command(self, cmd, description):
        """Execute a diagnostic command and record results."""
        print(f"🔍 {description}")
        print(f"   Running: {cmd}")
        try:
            result = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=30
            )
            output = result.stdout if result.stdout else result.stderr
            self.findings.append({
                "command": cmd,
                "description": description,
                "output": output,
                "success": result.returncode == 0
            })
            print(f"   Result: {'✅' if result.returncode == 0 else '❌'}")
            return result.returncode == 0
        except Exception as e:
            print(f"   Error: {e}")
            return False
    
    def diagnose_payment_issues(self):
        """Run through payment service diagnostics."""
        self.run_command(
            "curl -s https://api.example.com/payment/health",
            "Check payment service health endpoint"
        )
        self.run_command(
            "kubectl get pods -l app=payment -o json",
            "Get payment service pod status"
        )
        self.run_command(
            "kubectl top pods -l app=payment",
            "Check resource usage"
        )
        self.run_command(
            "pg_isready -h db.payment.internal -p 5432",
            "Verify database connectivity"
        )
        
    def generate_report(self):
        """Generate incident report for documentation."""
        report = {
            "service": self.service,
            "timestamp": datetime.utcnow().isoformat(),
            "findings": self.findings
        }
        with open(f"incident-report-{self.service}.json", "w") as f:
            json.dump(report, f, indent=2)
        return report

if __name__ == "__main__":
    runner = OnCallRunner("payment-api")
    runner.diagnose_payment_issues()
    report = runner.generate_report()
    print(f"\n📋 Report saved to incident-report-{runner.service}.json")

Integrating Claude Code with Your Runbooks

Now comes the magic part—letting Claude Code interact with your runbooks. Create a CLAUDE.md file in your project root to give Claude context about your runbooks:

# Claude Code - On-Call Assistant

## Project Context
- **Service**: E-commerce Platform
- **On-Call Rotation**: See PagerDuty schedule
- **Severity Levels**: P1 (Critical), P2 (High), P3 (Medium)

## Available Runbooks
- `runbooks/oncall/payment-service.md` - Payment API issues
- `runbooks/oncall/database.md` - Database connectivity problems
- `runbooks/oncall/cache.md` - Redis cache failures

## Common Issues & Quick Actions

### High Error Rates
1. Check `kubectl get pods` for restart loops
2. Review recent deployments with `git log`
3. Scale up if needed: `kubectl scale deployment <name> --replicas=5`

### Database Connection Issues
1. Verify with `pg_isready`
2. Check connection pool: `curl localhost:9187/metrics`
3. Review pgBouncer logs

## Emergency Contacts
- Platform Team: #platform-oncall
- Database Team: #db-oncall
- Incident Commander: PagerDuty escalation

Interactive Incident Response Workflow

When you’re on-call and something goes wrong, here’s how to work with Claude Code effectively:

Step 1: Initial Assessment

claude "Something is wrong with our payment service. Error rates are up 40%. 
Check the runbook at runbooks/oncall/payment-service.md and help me diagnose 
what's happening."

Claude will read the runbook, understand your service topology, and start running diagnostic commands.

Step 2: Collaborative Diagnosis

Claude doesn’t just run commands—it explains what’s happening:

Based on the runbook and current metrics, I see:
1. Pods are restarting every 2-3 minutes
2. Memory usage spiked after the 14:00 deployment
3. Database connections are timing out

Let me check the recent deployment to confirm the hypothesis...

Step 3: Guided Remediation

Once Claude identifies the issue, it walks you through fixes:

I recommend we roll back the deployment. This will:
- Restore the previous stable version
- Take approximately 30 seconds
- Preserve the current database state

To proceed, run:
kubectl rollout undo deployment/payment-api

Would you like me to execute this, or would you prefer to do it manually?

Best Practices for Claude-Managed Runbooks

Keep Runbooks Living Documents

The biggest failure of traditional runbooks is stagnation. With Claude Code, you can keep runbooks current by:

Version control everything: Store runbooks in git alongside your code
Auto-update from deployments: Add post-deploy hooks that document changes
Review after incidents: Update runbooks based on what actually worked

Use Descriptive Command Patterns

Claude performs better when commands are well-documented:

## Check Service Health
- Purpose: Verify all pods are running and ready
- Command: `kubectl get pods -l app=YOUR_SERVICE -o wide`
- Expected: All pods in Running status with Ready 1/1
- If fails: Check events with `kubectl describe pods -l app=YOUR_SERVICE`

Implement Graduated Severity Responses

Create separate workflows for different severity levels:

def handle_incident(severity, symptoms):
    if severity == "P1":
        # Immediate automated response
        run_automated_diagnostics()
        page_on_call()
        create_incident_ticket()
    elif severity == "P2":
        # Semi-automated response
        run_automated_diagnostics()
        notify_slack_channel()
    else:
        # Manual investigation
        provide_diagnostic_commands()

Conclusion

Claude Code transforms on-call from a stressful, error-prone experience into a structured, confident response process. By combining well-documented runbooks with Claude’s autonomous execution capabilities, you get the best of both worlds: human judgment for critical decisions and automated execution for repetitive tasks.

Start small—pick one service, create a basic runbook, and let Claude help you respond to the next incident. You’ll be surprised how quickly this approach becomes indispensable.

Remember: The goal isn’t to replace human judgment—it’s to augment it with intelligent automation that helps you respond faster and more accurately when it matters most.

Built by theluckystrike — More at zovo.one