Monitoring and Logging in Claude Code Multi-Agent Systems
Building multi-agent systems with Claude Code requires visibility into agent behavior, message flows, and error conditions. Without proper monitoring, debugging distributed agent workflows becomes nearly impossible. This guide covers practical patterns for observability in Claude Code-based multi-agent architectures. For coordinating the agents you will monitor, see Claude Code agent swarm coordination strategies.
Why Multi-Agent Monitoring Matters
When you orchestrate multiple Claude agents to handle different aspects of a task—such as one agent for code review, another for testing, and a third for deployment—each agent generates logs, state changes, and potential errors. A production-grade system needs centralized logging to trace requests across agents, measure latency, and detect failures early.
The challenge: Claude Code doesn’t provide built-in observability for multi-agent orchestration. You need to implement it yourself using available tools like bash commands, file operations, and external logging services.
Structured Logging Pattern
The foundation of monitoring is structured logging. Instead of scattered print statements, emit JSON-formatted log entries that external tools can parse and aggregate.
import json
import datetime
import os
def log_agent_event(agent_id: str, event_type: str, message: str, metadata: dict = None):
"""Emit a structured log entry for agent events."""
entry = {
"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
"agent_id": agent_id,
"event_type": event_type,
"message": message,
"metadata": metadata or {}
}
log_file = os.environ.get("AGENT_LOG_FILE", "/var/log/claude-agents.log")
with open(log_file, "a") as f:
f.write(json.dumps(entry) + "\n")
Call this function from your agent orchestration layer:
# When an agent starts processing
log_agent_event(
agent_id="code-reviewer-01",
event_type="task_started",
message="Beginning code review for PR #247",
metadata={"pr_number": 247, "files_count": 12}
)
# When the agent completes
log_agent_event(
agent_id="code-reviewer-01",
event_type="task_completed",
message="Code review finished",
metadata={"issues_found": 3, "duration_seconds": 45}
)
Centralized Log Aggregation
For multi-agent systems, aggregate logs from all agents into a single location. A simple approach uses a shared log file or directory:
# Create a centralized log directory
mkdir -p /var/log/claude-agents
chmod 755 /var/log/claude-agents
# Each agent writes to its own file
export AGENT_LOG_FILE="/var/log/claude-agents/${AGENT_NAME}.log"
For more sophisticated setups, integrate with log aggregation services:
- Loki: Grafana Labs’ log aggregation system works well with Prometheus metrics
- ELK Stack: Elasticsearch, Logstash, and Kibana provide powerful search and visualization
- CloudWatch: If running on AWS, CloudWatch Logs offers native integration
The skill super-memory can help you recall patterns from previous debugging sessions, making it easier to identify recurring issues across agent runs.
Health Checks and Metrics
Implement health checks to verify each agent’s operational status:
import subprocess
import time
def check_agent_health(agent_id: str) -> dict:
"""Perform a health check on a specific agent."""
health_file = f"/tmp/claude-agent-{agent_id}.health"
# Check if the agent's health file exists and is recent
try:
mtime = os.path.getmtime(health_file)
is_healthy = (time.time() - mtime) < 300 # 5 minute threshold
return {"agent_id": agent_id, "healthy": is_healthy, "last_seen": mtime}
except FileNotFoundError:
return {"agent_id": agent_id, "healthy": False, "last_seen": None}
Each agent should periodically update its health file:
# In your agent's main loop
while true; do
date > /tmp/claude-agent-${AGENT_NAME}.health
sleep 60
done
Distributed Tracing
When agents communicate through message queues or HTTP APIs, implement distributed tracing to follow requests end-to-end:
import uuid
def create_trace_context() -> str:
"""Generate a unique trace ID for request correlation."""
return str(uuid.uuid4())
def trace_agent_call(trace_id: str, from_agent: str, to_agent: str, payload: dict):
"""Log an inter-agent communication event."""
log_agent_event(
agent_id=from_agent,
event_type="agent_call",
message=f"Calling {to_agent}",
metadata={
"trace_id": trace_id,
"target_agent": to_agent,
"payload_size": len(str(payload))
}
)
This pattern enables you to reconstruct the full flow when something goes wrong. The tdd skill complements this by letting you write tests that verify agent communication contracts.
Error Tracking and Alerting
Capture errors with enough context for debugging:
def log_error(agent_id: str, error: Exception, context: dict):
"""Log an error with full context for debugging."""
import traceback
log_agent_event(
agent_id=agent_id,
event_type="error",
message=str(error),
metadata={
"error_type": type(error).__name__,
"traceback": traceback.format_exc(),
"context": context
}
)
# Optionally trigger an alert
if os.environ.get("ALERT_ON_ERROR") == "true":
subprocess.run([
"curl", "-X", "POST",
os.environ["ALERT_WEBHOOK_URL"],
"-d", '{"text": "Agent ' + agent_id + ' failed: ' + str(error) + '"}'
])
Monitoring Dashboard
Build a simple monitoring dashboard using available skills:
- Use
canvas-designto create visual representations of agent status - Use
pdfto generate daily or weekly status reports - Use
xlsxto maintain a spreadsheet of agent metrics over time
A minimal dashboard might display:
- Active agents and their current tasks
- Recent errors across all agents
- Average task completion time per agent type
- Success/failure rates
Best Practices Summary
- Emit structured JSON logs from every agent for parseable output
- Use trace IDs to correlate events across agent boundaries
- Implement health checks with timely heartbeat updates
- Log context-rich errors including stack traces and relevant state
- Aggregate logs centrally for unified searching and analysis
- Build observability into agent prompts — include logging instructions in skill definitions
The frontend-design skill can help you build monitoring interfaces if you need a visual component. The pdf skill enables generating automated status reports. For alerting, you’ll primarily work with webhook integrations and custom shell scripts.
Monitoring multi-agent Claude Code systems requires deliberate architecture. Start with structured logging, add health checks, and progressively build toward comprehensive observability as your system grows.
Related Reading
- Multi-Agent Orchestration with Claude Subagents Guide — Build the multi-agent architecture you will monitor and instrument for observability.
- Claude Code Multi-Agent Subagent Communication Guide — Instrument inter-agent communication channels for comprehensive observability.
- Claude Code Agent Swarm Coordination Strategies — Apply monitoring patterns to distributed agent swarm coordination workflows.
- Claude Skills Advanced Hub — Explore advanced observability and coordination patterns for Claude Code agents.
Built by theluckystrike — More at zovo.one