Claude Code Weights and Biases Experiment Tracking
Machine learning experimentation requires careful tracking of hyperparameters, metrics, and artifacts. Weights & Biases (W&B) is a widely used platform for experiment tracking, and combining it with Claude Code’s ability to run shell commands and scripts lets you automate much of that bookkeeping from the terminal. This guide shows how to integrate Claude Code with W&B to track experiments.
Setting Up W&B with Claude Code
Before integrating with Claude Code, ensure you have W&B installed and authenticated:
pip install wandb
wandb login
Your API key will be stored locally after authentication. Claude Code can then interact with W&B through shell commands or by reading output from W&B CLI operations.
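Before launching a long training job, it can help to confirm that credentials are actually visible. The following is a minimal sketch, assuming the key is available either through the `WANDB_API_KEY` environment variable or as an `api.wandb.ai` entry in `~/.netrc`, which is where `wandb login` usually writes it. The function name `has_wandb_credentials` is hypothetical, and self-hosted servers or other platforms may store the key elsewhere.

```python
import netrc
import os
from pathlib import Path

WANDB_HOST = "api.wandb.ai"  # assumption: the hosted W&B service, not a private server


def has_wandb_credentials(netrc_path=None, env=None):
    """Return True if an API key is visible via the environment or a netrc file."""
    env = os.environ if env is None else env
    if env.get("WANDB_API_KEY"):
        return True
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    if not path.exists():
        return False
    try:
        auth = netrc.netrc(str(path)).authenticators(WANDB_HOST)
    except (netrc.NetrcParseError, OSError):
        return False
    # authenticators() returns (login, account, password) or None
    return auth is not None and bool(auth[2])


if __name__ == "__main__":
    print("W&B credentials found:", has_wandb_credentials())
```

Claude Code can run a check like this first and prompt you to run `wandb login` if it reports that nothing was found.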
Creating a Claude Skill for Experiment Tracking
Build a dedicated skill for managing W&B experiments. Claude Code loads project skills from .claude/skills/<skill-name>/SKILL.md, so create .claude/skills/wandb-exp/SKILL.md:
---
name: wandb-exp
description: "Track ML experiments with Weights & Biases"
---
# Weights & Biases Experiment Tracking
This skill helps initialize runs, log metrics, and track experiments in W&B.
## Initialize a New Experiment Run
To start a new experiment:
1. Run `wandb init` to configure the project
2. Use `wandb.init()` in your training script
3. Log parameters, metrics, and artifacts as needed
Logging Metrics from Training Scripts
When Claude Code runs your training scripts, it can capture W&B output and help you analyze results. Here’s a practical example using a simple training script:
import wandb
import torch
import torch.nn as nn

# Initialize W&B run with parameters
wandb.init(
    project="image-classification",
    config={
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 10,
        "optimizer": "adam",
    },
)

# Simple training loop.
# data_loader, val_data, evaluate, and compute_accuracy are assumed to be defined elsewhere;
# each batch is assumed to expose inputs as batch.x and labels as batch.y.
model = nn.Linear(784, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=wandb.config.learning_rate)

for epoch in range(wandb.config.epochs):
    for batch in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch.x), batch.y)  # compare predictions with labels
        loss.backward()
        optimizer.step()
        # Log training metrics
        wandb.log({
            "train_loss": loss.item(),
            "epoch": epoch,
        })
    # Log validation metrics at end of epoch
    val_loss = evaluate(model, val_data)
    wandb.log({
        "val_loss": val_loss,
        "epoch": epoch,
        "accuracy": compute_accuracy(model, val_data),
    })

wandb.finish()
Claude Code can execute this script and monitor the W&B output in real-time, giving you visibility into training progress without leaving your terminal.
Using Claude Code to Compare Experiments
One of W&B’s strongest features is comparing runs. A hyperparameter sweep is a convenient way to produce a set of comparable runs, and Claude Code can launch one for you:
# Create a sweep from a config file, then start an agent to run its trials
wandb sweep --project image-classification config.yaml
wandb agent image-classification/<sweep-id>
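A sweep definition like the config.yaml passed to the CLI can also be written as a Python dictionary and launched with `wandb.sweep` and `wandb.agent`. The sketch below is illustrative only: the search ranges, the `random` method, and the helper names `build_sweep_config` and `launch_sweep` are assumptions, not values taken from this guide.

```python
def build_sweep_config(metric="val_loss"):
    """Return a sweep definition in the dictionary form W&B accepts."""
    return {
        "method": "random",  # other options include "grid" and "bayes"
        "metric": {"name": metric, "goal": "minimize"},
        "parameters": {
            # Illustrative ranges; tune them for your own model
            "learning_rate": {
                "distribution": "log_uniform_values",
                "min": 1e-4,
                "max": 1e-2,
            },
            "batch_size": {"values": [16, 32, 64]},
            "optimizer": {"values": ["adam", "sgd"]},
        },
    }


def launch_sweep(train_fn, project="image-classification", count=5):
    """Register the sweep and run `count` trials in this process."""
    import wandb  # imported lazily so the config can be built without wandb installed

    sweep_id = wandb.sweep(build_sweep_config(), project=project)
    wandb.agent(sweep_id, function=train_fn, project=project, count=count)
    return sweep_id
```

Here `train_fn` would be a function that calls `wandb.init()` and reads its hyperparameters from `wandb.config`, so each trial picks up the values the sweep assigns.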
After experiments complete, use Claude Code to fetch and compare results:
import wandb
api = wandb.Api()
# Fetch all runs from a project
runs = api.runs("theluckystrike/image-classification")
# Find best performing run
best_run = min(runs, key=lambda r: r.summary.get("val_loss", float("inf")))
print(f"Best run: {best_run.name}")
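print(f"Validation loss: {best_run.summary.get('val_loss')}")
# "accuracy" is logged on the validation set by the training script, not on a test set
print(f"Validation accuracy: {best_run.summary.get('accuracy')}")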
This approach lets Claude Code analyze your experiment history and help you identify optimal hyperparameters.
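When more than the single best run matters, the ranking step can be separated from the API call so it is easy to inspect and test. This is a rough sketch: `rank_runs` and `to_records` are hypothetical helpers, and the record layout (a dictionary with `name`, `config`, and `summary` keys) is an assumption made for illustration.

```python
def to_records(api_runs, metric="val_loss"):
    """Convert run objects from wandb.Api().runs(...) into plain dictionaries."""
    return [
        {
            "name": r.name,
            "config": dict(r.config),
            "summary": {metric: r.summary.get(metric)},
        }
        for r in api_runs
    ]


def rank_runs(records, metric="val_loss", top_k=3):
    """Return up to top_k records with the lowest metric, skipping runs that lack it."""
    scored = [
        r for r in records
        if isinstance(r["summary"].get(metric), (int, float))
    ]
    scored.sort(key=lambda r: r["summary"][metric])
    return scored[:top_k]
```

Calling `rank_runs(to_records(runs))` on the runs fetched above would give the three lowest-loss runs together with their configs, which is often enough to spot which hyperparameters the good runs share.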
Automating Experiment Tracking with Claude Code
You can create custom Claude commands that automatically log information to W&B. For instance, a skill that tracks dataset information:
---
name: track-data
description: "Log dataset statistics to W&B"
---
# Dataset Tracking
Track dataset versions and statistics:
- Run dataset profiling scripts
- Log dataset hash and statistics to W&B
- Associate dataset version with experiment runs
Use this skill to ensure reproducibility by automatically attaching dataset metadata to every experiment.
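A dataset profiling script for this skill could be as small as the sketch below. It assumes the dataset is a single file on disk and uses SHA-256 read in chunks so large files do not need to fit in memory. The function name `profile_dataset` and the key names in the returned dictionary are hypothetical.

```python
import hashlib
import os


def profile_dataset(path, chunk_size=1 << 20):
    """Return a SHA-256 digest and byte size for a dataset file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "dataset_path": os.fspath(path),
        "dataset_sha256": digest.hexdigest(),
        "dataset_bytes": os.path.getsize(path),
    }
```

Inside an active run, the result can be attached with `wandb.config.update(profile_dataset("train_data.pt"))`, so the hash shows up next to the hyperparameters for that run.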
Integrating with Existing W&B Workflows
Claude Code complements your existing W&B setup:
- Sweep Automation: Let Claude Code manage hyperparameter sweeps by generating sweep configurations and launching agents
- Artifact Management: Use Claude Code to version and track model artifacts in W&B
- Report Generation: Pull W&B metrics and generate summary reports using Claude Code
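import wandb

# Start a run so there is something to attach the artifact to
run = wandb.init(project="image-classification", job_type="train")

# Log model artifacts
artifact = wandb.Artifact(
    name="trained-model",
    type="model",
    metadata={"accuracy": 0.95, "framework": "pytorch"},
)
artifact.add_file("model.pt")  # assumes model.pt was saved by the training script
run.log_artifact(artifact)
run.finish()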
Best Practices for Claude Code + W&B Integration
- Consistent Naming: Use clear, descriptive names for runs and experiments
- Parameter Tracking: Always log all hyperparameters to W&B config
- Metric Logging: Log both training and validation metrics at appropriate intervals
- Artifact Versioning: Use W&B artifacts to version models, datasets, and preprocessing code
- Early Stopping: Monitor validation metrics and implement early stopping to prevent overfitting
- Grouping Experiments: Use W&B groups to organize related experiments (e.g., different seeds, augmentation strategies)
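The early stopping practice above can be sketched as a small helper that watches the validation loss. This is one common patience-based variant, not a W&B feature; the class name `EarlyStopping` and its default values are assumptions chosen for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop that computes `val_loss` once per epoch, a line such as `if stopper.step(val_loss): break` after the validation logging would end training once the loss plateaus.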
Advanced: Creating a Complete Training Workflow
Here’s how you might structure a complete training workflow with Claude Code orchestrating the process:
- Pre-training: Claude Code checks dataset availability, validates data integrity, and logs dataset version to W&B
- Training: Execute training script with W&B logging enabled, monitoring progress in real-time
- Post-training: Analyze results, compare with previous runs, and log model artifacts
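# Complete workflow example
import hashlib
import os
import wandb

def log_dataset_info(data_path):
    """Record dataset information in the run config for reproducibility"""
    with open(data_path, "rb") as f:
        dataset_hash = hashlib.md5(f.read()).hexdigest()
    # Static metadata fits the run config better than the metric history
    wandb.config.update({
        "dataset_hash": dataset_hash,
        "dataset_path": data_path,
        "dataset_size": os.path.getsize(data_path),
    })

# Start the run first: config updates and logging need an active run
wandb.init(project="my-project", name="experiment-001")

# Pre-training
log_dataset_info("train_data.pt")

# Training (simplified)
# ... training code ...

# Post-training
# find_best_model() is a placeholder assumed to return the path of the best checkpoint
best_model = find_best_model()
wandb.log_artifact(best_model, name="best-model", type="model")
wandb.finish()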
Debugging Failed Experiments
When experiments fail, Claude Code can help you investigate:
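import wandb
# Fetch failed or crashed runs and download their console logs
api = wandb.Api()
failed_runs = [r for r in api.runs("project") if r.state in ("failed", "crashed")]
for run in failed_runs:
    print(f"Run: {run.name} (state: {run.state})")
    log_dir = f"logs/{run.id}"
    try:
        # output.log normally holds the captured console output (stdout and stderr)
        run.file("output.log").download(root=log_dir, replace=True).close()
        print(f"Console log saved to: {log_dir}/output.log")
    except Exception as exc:  # the file may be missing if nothing was synced
        print(f"Could not download console log: {exc}")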
This debugging capability helps you quickly identify and fix issues in your training pipeline.
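Once a run's console log is available locally, the most useful line is often the final exception message. The helper below is a rough sketch that assumes the failure produced a standard Python traceback; `last_error_line` is a hypothetical name, and logs from other kinds of crashes (for example an out-of-memory kill by the operating system) will not match.

```python
def last_error_line(log_text):
    """Return the exception line of the last Python traceback in a log, or None."""
    lines = log_text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.startswith("Traceback (most recent call last):"):
            start = i  # keep the most recent traceback header
    if start is None:
        return None
    # Stack frames are indented; the exception summary is the first line that is not
    for line in lines[start + 1:]:
        if line and not line[0].isspace():
            return line.strip()
    return None
```

Collecting this line across several failed runs makes it easy to see whether they share one root cause or failed for different reasons.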
Conclusion
Combining Claude Code with Weights & Biases keeps experiment tracking close to the terminal where you already work. Claude Code can execute training scripts, analyze results, and help you manage your ML workflow, while W&B stores and compares the metrics, configs, and artifacts. Start by creating dedicated skills for your experiment tracking needs, and add more automation as your workflow matures. Used together, they support reproducible research, easier debugging, and faster iteration on your machine learning projects.
Built by theluckystrike — More at zovo.one