Claude Code for On-Call Runbook Workflow Tutorial
Every developer who has been on-call knows the feeling: it’s 3 AM, something is broken, and you’re scrambling through wiki pages, hunting for that one runbook that might help. Runbooks are meant to be our safety net, but often they become outdated, hard to navigate, or simply too slow to use when every second counts. This is where Claude Code transforms your on-call experience from chaotic scrambling into confident, efficient incident response.
What is Claude Code?
Claude Code is Anthropic’s CLI tool that brings Claude’s reasoning capabilities directly to your terminal. Unlike traditional chatbots, Claude Code operates as an autonomous agent that can execute commands, read and write files, and most importantly—understand your entire codebase. When applied to on-call runbooks, Claude Code becomes your intelligent assistant that knows your infrastructure, understands your services, and can execute remediation steps with your guidance.
The key advantage is that Claude Code doesn’t just follow scripts—it understands context. It knows which services depend on which databases, understands your deployment patterns, and can adapt runbook steps to your specific environment.
Setting Up Your First Claude-Managed Runbook
Before diving into complex workflows, let’s set up a basic runbook structure that Claude Code can work with. The goal is to create runbooks that are both human-readable and machine-executable.
Project Structure
Create a runbook directory in your project:
mkdir -p runbooks/oncall
cd runbooks/oncall
Each runbook should follow a consistent structure:
# Service: Payment API
# Severity: P1
# On-Call: Team Alpha
## Symptoms
- Elevated 5xx error rates
- Latency spikes > 2 seconds
- Customer complaints about failed transactions
## Diagnosis
1. Check service health: `curl -s service/payment/health`
2. Review recent deployments: `git log --oneline -10`
3. Examine error logs: `kubectl logs -l app=payment --tail=100`
## Remediation
1. If recent deployment: `kubectl rollout undo deployment/payment`
2. If database issues: `pg_isready -h db.payment.internal`
3. Scale up if needed: `kubectl scale deployment/payment --replicas=5`
Creating Claude-Enabled Runbook Scripts
The real power comes from creating executable runbooks that Claude Code can run autonomously. Let’s create a Python script that Claude can execute during incidents.
#!/usr/bin/env python3
"""Automated runbook executor for payment service incidents."""
import subprocess
import sys
import json
from datetime import datetime
class OnCallRunner:
def __init__(self, service_name):
self.service = service_name
self.findings = []
def run_command(self, cmd, description):
"""Execute a diagnostic command and record results."""
print(f"🔍 {description}")
print(f" Running: {cmd}")
try:
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True, timeout=30
)
output = result.stdout if result.stdout else result.stderr
self.findings.append({
"command": cmd,
"description": description,
"output": output,
"success": result.returncode == 0
})
print(f" Result: {'✅' if result.returncode == 0 else '❌'}")
return result.returncode == 0
except Exception as e:
print(f" Error: {e}")
return False
def diagnose_payment_issues(self):
"""Run through payment service diagnostics."""
self.run_command(
"curl -s https://api.example.com/payment/health",
"Check payment service health endpoint"
)
self.run_command(
"kubectl get pods -l app=payment -o json",
"Get payment service pod status"
)
self.run_command(
"kubectl top pods -l app=payment",
"Check resource usage"
)
self.run_command(
"pg_isready -h db.payment.internal -p 5432",
"Verify database connectivity"
)
def generate_report(self):
"""Generate incident report for documentation."""
report = {
"service": self.service,
"timestamp": datetime.utcnow().isoformat(),
"findings": self.findings
}
with open(f"incident-report-{self.service}.json", "w") as f:
json.dump(report, f, indent=2)
return report
if __name__ == "__main__":
runner = OnCallRunner("payment-api")
runner.diagnose_payment_issues()
report = runner.generate_report()
print(f"\n📋 Report saved to incident-report-{runner.service}.json")
Integrating Claude Code with Your Runbooks
Now comes the magic part—letting Claude Code interact with your runbooks. Create a CLAUDE.md file in your project root to give Claude context about your runbooks:
# Claude Code - On-Call Assistant
## Project Context
- **Service**: E-commerce Platform
- **On-Call Rotation**: See PagerDuty schedule
- **Severity Levels**: P1 (Critical), P2 (High), P3 (Medium)
## Available Runbooks
- `runbooks/oncall/payment-service.md` - Payment API issues
- `runbooks/oncall/database.md` - Database connectivity problems
- `runbooks/oncall/cache.md` - Redis cache failures
## Common Issues & Quick Actions
### High Error Rates
1. Check `kubectl get pods` for restart loops
2. Review recent deployments with `git log`
3. Scale up if needed: `kubectl scale deployment <name> --replicas=5`
### Database Connection Issues
1. Verify with `pg_isready`
2. Check connection pool: `curl localhost:9187/metrics`
3. Review pgBouncer logs
## Emergency Contacts
- Platform Team: #platform-oncall
- Database Team: #db-oncall
- Incident Commander: PagerDuty escalation
Interactive Incident Response Workflow
When you’re on-call and something goes wrong, here’s how to work with Claude Code effectively:
Step 1: Initial Assessment
claude "Something is wrong with our payment service. Error rates are up 40%.
Check the runbook at runbooks/oncall/payment-service.md and help me diagnose
what's happening."
Claude will read the runbook, understand your service topology, and start running diagnostic commands.
Step 2: Collaborative Diagnosis
Claude doesn’t just run commands—it explains what’s happening:
Based on the runbook and current metrics, I see:
1. Pods are restarting every 2-3 minutes
2. Memory usage spiked after the 14:00 deployment
3. Database connections are timing out
Let me check the recent deployment to confirm the hypothesis...
Step 3: Guided Remediation
Once Claude identifies the issue, it walks you through fixes:
I recommend we roll back the deployment. This will:
- Restore the previous stable version
- Take approximately 30 seconds
- Preserve the current database state
To proceed, run:
kubectl rollout undo deployment/payment-api
Would you like me to execute this, or would you prefer to do it manually?
Best Practices for Claude-Managed Runbooks
Keep Runbooks Living Documents
The biggest failure of traditional runbooks is stagnation. With Claude Code, you can keep runbooks current by:
- Version control everything: Store runbooks in git alongside your code
- Auto-update from deployments: Add post-deploy hooks that document changes
- Review after incidents: Update runbooks based on what actually worked
Use Descriptive Command Patterns
Claude performs better when commands are well-documented:
## Check Service Health
- Purpose: Verify all pods are running and ready
- Command: `kubectl get pods -l app=YOUR_SERVICE -o wide`
- Expected: All pods in Running status with Ready 1/1
- If fails: Check events with `kubectl describe pods -l app=YOUR_SERVICE`
Implement Graduated Severity Responses
Create separate workflows for different severity levels:
def handle_incident(severity, symptoms):
if severity == "P1":
# Immediate automated response
run_automated_diagnostics()
page_on_call()
create_incident_ticket()
elif severity == "P2":
# Semi-automated response
run_automated_diagnostics()
notify_slack_channel()
else:
# Manual investigation
provide_diagnostic_commands()
Conclusion
Claude Code transforms on-call from a stressful, error-prone experience into a structured, confident response process. By combining well-documented runbooks with Claude’s autonomous execution capabilities, you get the best of both worlds: human judgment for critical decisions and automated execution for repetitive tasks.
Start small—pick one service, create a basic runbook, and let Claude help you respond to the next incident. You’ll be surprised how quickly this approach becomes indispensable.
Remember: The goal isn’t to replace human judgment—it’s to augment it with intelligent automation that helps you respond faster and more accurately when it matters most.
Related Reading
- Claude Code for Beginners: Complete Getting Started Guide
- Best Claude Skills for Developers in 2026
- Claude Skills Guides Hub
Built by theluckystrike — More at zovo.one