
Claude Code Multi-Agent Error Recovery Strategies

When building complex workflows with Claude Code, multi-agent architectures offer significant power but introduce new failure modes. A single agent failing can cascade through dependent tasks, and without proper error recovery, your entire workflow stalls. This guide covers practical strategies for building resilient multi-agent systems using Claude Code skills and patterns.

Understanding Multi-Agent Failure Modes

Multi-agent setups in Claude Code typically involve orchestration where one agent delegates subtasks to specialized agents or skills. Failure can occur at several points: a skill invocation can error out (missing files, bad parameters, permission problems), an operation can hang or run far longer than expected, an agent can hand off malformed or low-quality intermediate output, or a single failure can cascade and invalidate downstream tasks.

Each failure mode requires a different recovery approach, and the most reliable systems handle multiple failure types simultaneously.

Pattern 1: Explicit Error Handling with Try-Catch Blocks

The foundation of error recovery is wrapping risky operations in explicit error handlers. When using skills that perform file operations or external API calls, structure your prompts to include error handling instructions:

/pdf extract tables from report.pdf and save results to extracted-data.json
—if extraction fails, output the error message and continue with the next section

This explicit instruction helps Claude recover gracefully rather than abandoning the task. The frontend-design skill, for example, often encounters CSS validation errors when processing complex layouts; adding fallback instructions covers that case:

/frontend-design generate responsive navigation component
—if flexbox layout validation fails, fall back to grid-based alternative
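
The same idea applies when driving skills programmatically. A minimal sketch, assuming a hypothetical invoke_skill function that raises an exception on failure:

```python
def extract_with_fallback(invoke_skill, source):
    """Try the primary skill; on failure, report the error and continue."""
    try:
        return invoke_skill("pdf", f"extract tables from {source}")
    except Exception as e:
        # Mirror the prompt-level instruction: log the error, don't abort
        print(f"Extraction failed for {source}: {e}")
        return {"source": source, "status": "failed", "error": str(e)}
```

The caller always gets a result it can act on, whether the skill succeeded or not.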

Pattern 2: Incremental Checkpointing

Long-running multi-agent workflows benefit from checkpointing—saving progress at key stages so recovery can resume from a known good point rather than restarting entirely. Here’s a practical implementation:

# checkpoint.py - Simple checkpoint manager for Claude workflows
import json
import os
from datetime import datetime

class WorkflowCheckpoint:
    def __init__(self, workflow_id):
        self.workflow_id = workflow_id
        self.checkpoint_dir = f".checkpoints/{workflow_id}"
        os.makedirs(self.checkpoint_dir, exist_ok=True)
    
    def save(self, stage, data):
        checkpoint = {
            "stage": stage,
            "timestamp": datetime.now().isoformat(),
            "data": data
        }
        path = f"{self.checkpoint_dir}/{stage}.json"
        with open(path, 'w') as f:
            json.dump(checkpoint, f, indent=2)
    
    def load(self, stage):
        path = f"{self.checkpoint_dir}/{stage}.json"
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        return None
    
    def get_latest(self):
        files = [f for f in os.listdir(self.checkpoint_dir) if f.endswith('.json')]
        if not files:
            return None
        # Alphabetical order does not reflect recency; sort by modification time
        latest = max(files, key=lambda f: os.path.getmtime(os.path.join(self.checkpoint_dir, f)))
        return self.load(latest[:-len('.json')])
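
A resume loop built on checkpoints simply skips any stage whose checkpoint file already exists. A minimal self-contained sketch (the workflow id, stage names, and run_stage are illustrative placeholders):

```python
import json
import os

CHECKPOINT_DIR = ".checkpoints/quarterly-report"  # illustrative workflow id
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def run_stage(name):
    # Placeholder for real work (a skill call, a data transform, ...)
    return {"stage": name, "status": "ok"}

stages = ["extract", "clean", "summarize"]  # illustrative stage names
for stage in stages:
    path = os.path.join(CHECKPOINT_DIR, f"{stage}.json")
    if os.path.exists(path):
        print(f"Skipping {stage}: checkpoint found")
        continue  # Resume: this stage completed in a previous run
    result = run_stage(stage)
    with open(path, "w") as f:
        json.dump(result, f)  # Save progress so a crash here is recoverable
```

On a fresh run all three stages execute; after an interruption, only the stages without checkpoint files run again.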

For the tdd skill, checkpointing becomes essential when generating test suites across multiple modules. After each module’s tests complete, save the results:

/tdd generate unit tests for auth.py
—after completing each test file, save checkpoint with test status
—if interrupted, resume from the last successful module

Pattern 3: Skill Chaining with Fallbacks

Resilient multi-agent systems chain skills together with explicit fallbacks. If one skill fails or produces unsatisfactory results, the system automatically tries an alternative approach:

Use the xlsx skill to analyze sales-data.xlsx and generate summary statistics
—if xlsx skill fails, use bash with python pandas to accomplish the same task
—if that also fails, output a plain text summary of what data was found

This pattern ensures the workflow always produces some output rather than failing entirely. The supermemory skill can track which fallback strategies succeeded in previous runs:

/supermemory remember: when xlsx fails on large files (>100MB), 
bash/python fallback succeeded in previous sessions
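
The same chain-of-fallbacks idea can be sketched in code as an ordered list of strategies; the strategy names here are illustrative:

```python
def run_with_fallbacks(steps):
    """Try each (name, func) pair in order; return the first success."""
    errors = []
    for name, func in steps:
        try:
            return name, func()
        except Exception as e:
            errors.append(f"{name}: {e}")
    # Surface every failure so the caller can log or escalate
    raise RuntimeError("All strategies failed: " + "; ".join(errors))
```

Each strategy gets a chance in priority order, and the caller learns which one actually produced the result, which is exactly the signal worth recording for future runs.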

Pattern 4: Timeout and Retry Logic

Agent operations can hang or take unexpectedly long. Implementing timeout logic prevents workflows from stalling indefinitely:

// retry-handler.js - Timeout and retry utilities
async function withTimeout(prompt, skill, ms = 60000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('Timeout')), ms);
  });

  try {
    return await Promise.race([skill.invoke(prompt), timeout]);
  } catch (error) {
    if (error.message === 'Timeout') {
      console.log(`Skill ${skill.name} timed out, attempting retry...`);
      return skill.invoke(prompt); // Single retry (untimed)
    }
    throw error;
  } finally {
    clearTimeout(timer); // Prevent the pending timer from leaking
  }
}

async function withRetry(prompt, skill, maxRetries = 3, delay = 1000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await skill.invoke(prompt);
    } catch (error) {
      if (attempt === maxRetries) throw error;
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      await new Promise(r => setTimeout(r, delay * attempt));
    }
  }
}

Pattern 4b: Exponential Backoff with Jitter

Simple fixed-delay retries can cause thundering-herd problems when many agents retry simultaneously. Exponential backoff with random jitter spreads retry pressure across time:

import asyncio
import random

async def retry_with_backoff(func, max_retries=3, base_delay=1):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return await func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s...")
            await asyncio.sleep(delay)

This is particularly valuable when multiple agents call the same external API concurrently — the jitter prevents synchronized retry storms.

Pattern 4c: Circuit Breaker

For more robust protection, a circuit breaker stops sending requests to a failing service entirely, preventing cascading failures across your agent pipeline:

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = "closed"

    async def call(self, func):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.timeout:
                self.state = "half-open"
            else:
                raise Exception("Circuit breaker is open")

        try:
            result = await func()
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise

    def _on_success(self):
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.failures >= self.failure_threshold:
            self.state = "open"

Use a circuit breaker around any external dependency — an AI skill calling a third-party API, a database query, or a remote file store — that could fail repeatedly and hold up your entire pipeline.

Pattern 5: Error Classification and Routing

Not all errors warrant the same recovery approach. Transient failures (timeouts, rate limits) are worth retrying; permanent failures (missing or corrupted inputs) call for skipping or fallbacks; ambiguous failures should escalate. Classifying errors enables these targeted responses.

When the docx skill encounters a corrupted file, the error classification matters:

/docx parse contract-template.docx
—if parse error occurs, classify: 
   - "file not found" → skip and log
   - "corrupted" → attempt recovery with backup
   - "permission denied" → escalate with error details
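
Programmatically, classification can be a simple mapping from error message to recovery route. A heuristic sketch mirroring the categories in the prompt above (the route names are illustrative):

```python
def classify_error(exc):
    """Map an exception to a recovery route based on its message (heuristic)."""
    msg = str(exc).lower()
    if "not found" in msg or "no such file" in msg:
        return "skip_and_log"
    if "corrupt" in msg:
        return "recover_from_backup"
    if "permission" in msg:
        return "escalate"
    return "retry"  # default: treat unknown errors as transient
```

A dispatcher can then route each classified error to the matching handler instead of applying one blanket recovery strategy.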

Pattern 6: Human-in-the-Loop Escalation

Some errors cannot be automatically resolved. Building escalation points allows human intervention without losing context:

/pdf extract text from contract.pdf
—if extraction confidence < 80%, pause and ask:
   "Manual review needed for sections with low confidence. 
   Should I proceed with partial extraction or wait for review?"

This pattern works particularly well with complex document processing via the pdf skill where automated extraction might miss context-dependent information.
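
In code, the escalation point is a gate that accepts confident results and flags the rest for a human. A minimal sketch, assuming the caller supplies a confidence score:

```python
def review_gate(result, confidence, threshold=0.8):
    """Accept the result if confident enough, otherwise flag for human review."""
    if confidence >= threshold:
        return {"status": "accepted", "result": result}
    return {
        "status": "needs_review",
        "result": result,  # Preserve partial output so context is not lost
        "question": "Proceed with partial extraction or wait for review?",
    }
```

Because the partial result travels with the review request, a human can resume the workflow without re-running the extraction.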

Implementing Recovery in Practice

Combining these patterns creates reliable multi-agent systems. A typical workflow might include:

  1. Initial attempt with the primary skill
  2. Fallback to alternative if primary fails
  3. Checkpoint save after each successful stage
  4. Retry with backoff for transient failures
  5. Escalation to human for unrecoverable errors

The xlsx skill combined with bash scripting demonstrates this well:

Process quarterly data as follows:
1. Use xlsx to validate and clean input data
—if validation fails, use bash/python for cleaning
2. After each cleaning step, save checkpoint
3. If processing exceeds 60 seconds, timeout and use fallback
4. If any step fails, log error and continue with remaining data
5. If failure rate exceeds 20%, pause and request review
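
As a code-level sketch, the numbered steps above might combine like this (the checkpoint location and stage functions are illustrative placeholders; backoff and jitter are omitted for brevity):

```python
import json
import os

CHECKPOINT_DIR = ".checkpoints/quarterly"  # illustrative location

def save_checkpoint(stage, data):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    with open(os.path.join(CHECKPOINT_DIR, f"{stage}.json"), "w") as f:
        json.dump(data, f)

def run_stage(stage, primary, fallback, max_retries=3):
    """Primary then fallback (steps 1-2), checkpoint on success (step 3),
    retries for transient failures (step 4), None signals escalation (step 5)."""
    for func in (primary, fallback):
        for _ in range(max_retries):
            try:
                result = func()
                save_checkpoint(stage, result)  # save after each successful stage
                return result
            except Exception:
                continue  # retry; add backoff/jitter in real use
    return None  # unrecoverable: caller escalates to a human
```

The caller treats None as the escalation signal and pauses for review rather than silently dropping the stage.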

Production Best Practices

Pre-Flight Validation

Before executing potentially destructive operations, run pre-flight checks. Validate file existence, permissions, and input parameters before committing to an action:

async def safe_delete(file_path):
    # Pre-flight checks
    if not await file_exists(file_path):
        return {"success": False, "error": "File does not exist"}

    if not await can_delete(file_path):
        return {"success": False, "error": "Permission denied"}

    # Execute with error handling
    try:
        await delete_file(file_path)
        return {"success": True}
    except Exception as e:
        return {"success": False, "error": str(e)}

This pattern is especially important in multi-agent workflows where one agent’s destructive action may affect files another agent depends on.

Logging Errors and State Changes

Use Claude Code’s record_note tool to create an audit trail of errors and recovery actions across the agent session:

async def log_error(context, error, recovery_action=None):
    await record_note({
        "category": "error",
        "content": f"Error in {context}: {error}. Recovery: {recovery_action or 'manual intervention required'}"
    })

This complements external monitoring — see Monitoring and Logging Claude Code Multi-Agent Systems — and gives agents the ability to reason about their own error history within a session.

Test Your Error Paths

Do not only test the happy path. Build test cases that simulate network failures, API timeouts, corrupted files, and invalid inputs to confirm your error handling works correctly before deploying a multi-agent workflow to production.
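
A sketch of exercising a failure path with Python's unittest.mock; fetch_report is a hypothetical helper that degrades gracefully on timeout:

```python
from unittest import mock

def fetch_report(client):
    """Hypothetical helper: return the report, or a degraded result on timeout."""
    try:
        return client.get("/report")
    except TimeoutError:
        return {"error": "timeout", "partial": True}

# Simulate the failure path instead of only the happy path
client = mock.Mock()
client.get.side_effect = TimeoutError("simulated timeout")
assert fetch_report(client) == {"error": "timeout", "partial": True}
```

Injecting failures through mocks lets you verify every branch of your recovery logic without waiting for a real outage.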

Conclusion

Error recovery in Claude Code multi-agent workflows requires intentional design. By implementing checkpointing, fallback chains, retry logic, error classification, and escalation points, you build systems that handle failures gracefully rather than crashing entirely. These patterns work across all Claude skills—from pdf document processing to frontend-design component generation—making your AI-assisted development workflow production-ready.

Start with simple retry logic, add checkpointing for longer workflows, and progressively add fallback chains and escalation points as your systems grow more complex.


Built by theluckystrike — More at zovo.one