Claude Skills Guide

Monitoring and Logging in Claude Code Multi-Agent Systems

Building multi-agent systems with Claude Code requires visibility into agent behavior, message flows, and error conditions. Without proper monitoring, debugging distributed agent workflows becomes nearly impossible. This guide covers practical patterns for observability in Claude Code-based multi-agent architectures. For coordinating the agents you will monitor, see Claude Code agent swarm coordination strategies.

Why Multi-Agent Monitoring Matters

When you orchestrate multiple Claude agents to handle different aspects of a task—such as one agent for code review, another for testing, and a third for deployment—each agent generates logs, state changes, and potential errors. A production-grade system needs centralized logging to trace requests across agents, measure latency, and detect failures early.

The challenge: Claude Code doesn’t provide built-in observability for multi-agent orchestration. You need to implement it yourself using available tools like bash commands, file operations, and external logging services.

Structured Logging Pattern

The foundation of monitoring is structured logging. Instead of scattered print statements, emit JSON-formatted log entries that external tools can parse and aggregate.

import json
import datetime
import os

def log_agent_event(agent_id: str, event_type: str, message: str, metadata: dict = None):
    """Emit a structured log entry for agent events."""
    entry = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "agent_id": agent_id,
        "event_type": event_type,
        "message": message,
        "metadata": metadata or {}
    }
    
    log_file = os.environ.get("AGENT_LOG_FILE", "/var/log/claude-agents.log")
    with open(log_file, "a") as f:
        f.write(json.dumps(entry) + "\n")

Call this function from your agent orchestration layer:

# When an agent starts processing
log_agent_event(
    agent_id="code-reviewer-01",
    event_type="task_started",
    message="Beginning code review for PR #247",
    metadata={"pr_number": 247, "files_count": 12}
)

# When the agent completes
log_agent_event(
    agent_id="code-reviewer-01",
    event_type="task_completed",
    message="Code review finished",
    metadata={"issues_found": 3, "duration_seconds": 45}
)

Centralized Log Aggregation

For multi-agent systems, aggregate logs from all agents into a single location. A simple approach uses a shared log file or directory:

# Create a centralized log directory
mkdir -p /var/log/claude-agents
chmod 755 /var/log/claude-agents

# Each agent writes to its own file
export AGENT_LOG_FILE="/var/log/claude-agents/${AGENT_NAME}.log"

For more sophisticated setups, integrate with log aggregation services:

The skill super-memory can help you recall patterns from previous debugging sessions, making it easier to identify recurring issues across agent runs.

Health Checks and Metrics

Implement health checks to verify each agent’s operational status:

import subprocess
import time

def check_agent_health(agent_id: str) -> dict:
    """Perform a health check on a specific agent."""
    health_file = f"/tmp/claude-agent-{agent_id}.health"
    
    # Check if the agent's health file exists and is recent
    try:
        mtime = os.path.getmtime(health_file)
        is_healthy = (time.time() - mtime) < 300  # 5 minute threshold
        return {"agent_id": agent_id, "healthy": is_healthy, "last_seen": mtime}
    except FileNotFoundError:
        return {"agent_id": agent_id, "healthy": False, "last_seen": None}

Each agent should periodically update its health file:

# In your agent's main loop
while true; do
    date > /tmp/claude-agent-${AGENT_NAME}.health
    sleep 60
done

Distributed Tracing

When agents communicate through message queues or HTTP APIs, implement distributed tracing to follow requests end-to-end:

import uuid

def create_trace_context() -> str:
    """Generate a unique trace ID for request correlation."""
    return str(uuid.uuid4())

def trace_agent_call(trace_id: str, from_agent: str, to_agent: str, payload: dict):
    """Log an inter-agent communication event."""
    log_agent_event(
        agent_id=from_agent,
        event_type="agent_call",
        message=f"Calling {to_agent}",
        metadata={
            "trace_id": trace_id,
            "target_agent": to_agent,
            "payload_size": len(str(payload))
        }
    )

This pattern enables you to reconstruct the full flow when something goes wrong. The tdd skill complements this by letting you write tests that verify agent communication contracts.

Error Tracking and Alerting

Capture errors with enough context for debugging:

def log_error(agent_id: str, error: Exception, context: dict):
    """Log an error with full context for debugging."""
    import traceback
    
    log_agent_event(
        agent_id=agent_id,
        event_type="error",
        message=str(error),
        metadata={
            "error_type": type(error).__name__,
            "traceback": traceback.format_exc(),
            "context": context
        }
    )
    
    # Optionally trigger an alert
    if os.environ.get("ALERT_ON_ERROR") == "true":
        subprocess.run([
            "curl", "-X", "POST",
            os.environ["ALERT_WEBHOOK_URL"],
            "-d", '{"text": "Agent ' + agent_id + ' failed: ' + str(error) + '"}'
        ])

Monitoring Dashboard

Build a simple monitoring dashboard using available skills:

A minimal dashboard might display:

Best Practices Summary

  1. Emit structured JSON logs from every agent for parseable output
  2. Use trace IDs to correlate events across agent boundaries
  3. Implement health checks with timely heartbeat updates
  4. Log context-rich errors including stack traces and relevant state
  5. Aggregate logs centrally for unified searching and analysis
  6. Build observability into agent prompts — include logging instructions in skill definitions

The frontend-design skill can help you build monitoring interfaces if you need a visual component. The pdf skill enables generating automated status reports. For alerting, you’ll primarily work with webhook integrations and custom shell scripts.

Monitoring multi-agent Claude Code systems requires deliberate architecture. Start with structured logging, add health checks, and progressively build toward comprehensive observability as your system grows.

Built by theluckystrike — More at zovo.one