Claude Code for Helicone LLM Gateway Workflow Tutorial
As AI applications scale, managing LLM API calls becomes increasingly complex. Helicone provides a powerful LLM gateway that adds observability, caching, request transformation, and rate limiting to any LLM API. This tutorial shows you how to integrate Claude Code with Helicone to build robust, efficient AI workflows.
What is Helicone?
Helicone is an open-source LLM gateway that sits between your application and LLM providers like OpenAI, Anthropic, and others. It provides:
- Request Logging: Every LLM call is logged with full context
- Smart Caching: Reduce costs and latency with semantic caching
- Rate Limiting: Protect your API quotas
- Request Transformation: Modify prompts or swap models dynamically
- Analytics Dashboard: Understand usage patterns
By routing Claude Code’s LLM requests through Helicone, you gain these benefits while maintaining full compatibility.
Setting Up Helicone
Before integrating with Claude Code, you need a Helicone instance. You have two options:
Option 1: Helicone Cloud
Sign up at helicone.ai and get your API key. This is the fastest way to start.
Option 2: Self-Hosted
For full control, deploy Helicone using Docker:
docker run -d -p \
8989:8989 \
-e API_KEY=your_api_key \
ghcr.io/helicone/helicone
Configuring Claude Code for Helicone
Claude Code can use Helicone as its API endpoint with a simple configuration. The key is setting the appropriate environment variables and API base URL.
Environment Setup
Create a .env file in your project:
# For OpenAI models through Helicone
export OPENAI_API_BASE="https://gateway.helicone.ai/v1"
export OPENAI_API_KEY="your_helicone_api_key"
# For Anthropic models through Helicone
export ANTHROPIC_API_BASE="https://gateway.helicone.ai"
export ANTHROPIC_API_KEY="your_helicone_api_key"
Using Helicone with Claude Code Skills
When creating Claude Code skills that interact with LLMs, you can configure them to route through Helicone by setting the API base in the skill’s environment context.
Here’s a skill that makes LLM calls through Helicone:
---
name: llm-gateway-example
description: "Example skill demonstrating LLM calls through Helicone"
tools: [Bash, Read, Write]
env:
ANTHROPIC_API_BASE: "https://gateway.helicone.ai"
ANTHROPIC_API_KEY: "your_helicone_key"
---
You are an assistant that makes LLM calls through Helicone gateway.
Building Helicone-Aware Workflows
Let’s create practical workflows that leverage Helicone’s features.
Workflow 1: Cached Summarization
This workflow uses Helicone’s caching to reduce costs for repeated summarization tasks:
import anthropic
import os
client = anthropic.Anthropic(
api_key=os.getenv("ANTHROPIC_API_KEY"),
base_url=os.getenv("ANTHROPIC_API_BASE", "https://gateway.helicone.ai")
)
def summarize_with_cache(text: str) -> str:
"""Summarize text with Helicone caching enabled"""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[
{"role": "user", "content": f"Summarize this: {text}"}
]
)
return response.content[0].text
Helicone automatically caches requests with the same semantic meaning. Subsequent calls with similar text return cached responses instantly.
Workflow 2: Request Transformation
Use Helicone’s request transformation to modify prompts dynamically:
from helicone.attrs import HeliconeAttributes
# Add custom properties for tracking
helicone_attrs = HeliconeAttributes(
properties={
"user_tier": "premium",
"feature": "summarization",
"environment": "production"
}
)
response = client.messages.create(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": "Explain quantum computing"}],
extra_headers={"Helicone-Properties": helicone_attrs.to_json()}
)
This lets you track usage by custom dimensions in the Helicone dashboard.
Workflow 3: Fallback with Rate Limiting
Build resilient workflows that handle rate limits gracefully:
import time
from anthropic import RateLimitError
def call_with_retry(prompt: str, max_retries: int = 3) -> str:
"""Call LLM with automatic retry on rate limits"""
for attempt in range(max_retries):
try:
response = client.messages.create(
model="claude-sonnet-4-20250514",
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
except RateLimitError as e:
if attempt < max_retries - 1:
wait_time = 2 ** attempt
print(f"Rate limited, waiting {wait_time}s...")
time.sleep(wait_time)
else:
raise e
return None
Helicone’s rate limiting headers help your code respond appropriately to quota constraints.
Advanced Patterns
Prompt Caching with System Prompts
Helicone supports prompt caching to reduce costs on long system prompts:
# Use cached system prompts for cost savings
response = client.messages.create(
model="claude-sonnet-4-20250514",
system=[
{
"type": "text",
"text": "You are a coding assistant with access to files and terminals.",
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Write a hello world program"}]
)
Request Routing by User
Route different users to different models based on tier:
def get_client_for_user(user_tier: str):
"""Get appropriate LLM client based on user tier"""
base_url = "https://gateway.helicone.ai"
api_key = os.getenv("ANTHROPIC_API_KEY")
if user_tier == "premium":
model = "claude-opus-4-20250514"
else:
model = "claude-haiku-3-20240307"
return client, model
Monitoring with Helicone Dashboard
Once your Claude Code workflows are running through Helicone, access the dashboard to:
- View Request Logs: See every LLM call with full request/response data
- Analyze Caching Efficiency: Track cache hit rates and savings
- Monitor Rate Limits: See when limits are hit and adjust
- Set Alerts: Get notified of anomalies or high usage
Best Practices
- Enable Caching Early: Set up caching from the start to maximize savings
- Use Custom Properties: Add metadata to track usage by feature or user
- Implement Retry Logic: Handle rate limits gracefully in production
- Monitor Cache Rates: Aim for 30%+ cache hit rates for significant savings
- Set Up Alerts: Get notified of errors or unusual patterns
Conclusion
Integrating Claude Code with Helicone’s LLM gateway transforms your AI development workflow. You gain observability into every LLM call, reduce costs through intelligent caching, and build more resilient applications with rate limiting and retry logic.
Start with basic integration, then add caching, monitoring, and advanced routing as your needs grow. The combination of Claude Code’s agentic capabilities and Helicone’s gateway features gives you a powerful foundation for production AI applications.