Claude Code for Helicone LLM Gateway Workflow Tutorial

As AI applications scale, managing LLM API calls becomes increasingly complex. Helicone provides a powerful LLM gateway that adds observability, caching, request transformation, and rate limiting to any LLM API. This tutorial shows you how to integrate Claude Code with Helicone to build robust, efficient AI workflows.

What is Helicone?

Helicone is an open-source LLM gateway that sits between your application and LLM providers like OpenAI, Anthropic, and others. It provides:

Request Logging: Every LLM call is logged with full context
Caching: Reduce costs and latency by serving repeated requests from cache
Rate Limiting: Protect your API quotas
Request Transformation: Modify prompts or swap models dynamically
Analytics Dashboard: Understand usage patterns

By routing Claude Code’s LLM requests through Helicone, you gain these benefits while maintaining full compatibility.

Setting Up Helicone

Before integrating with Claude Code, you need a Helicone instance. You have two options:

Option 1: Helicone Cloud

Option 2: Self-Hosted

For full control, deploy Helicone using Docker:

docker run -d \
  -p 8989:8989 \
  -e API_KEY=your_api_key \
  ghcr.io/helicone/helicone

Configuring Claude Code for Helicone

Claude Code can use Helicone as its API endpoint with a simple configuration. The key is setting the appropriate environment variables and API base URL.

Environment Setup

Create a .env file in your project and load it into your shell with source .env:

# For OpenAI models through Helicone
export OPENAI_API_BASE="https://gateway.helicone.ai/v1"
export OPENAI_API_KEY="your_helicone_api_key"

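# For Anthropic models through Helicone
export ANTHROPIC_API_BASE="https://gateway.helicone.ai"
export ANTHROPIC_API_KEY="your_helicone_api_key"
# Claude Code and the Anthropic SDKs read ANTHROPIC_BASE_URL, so set it too
export ANTHROPIC_BASE_URL="$ANTHROPIC_API_BASE"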
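Before running any workflow it can help to confirm which endpoint your code will actually talk to. The following is a minimal sketch using only the standard library; the function name resolve_gateway_config and the returned dictionary shape are illustrative, and the lookup order (ANTHROPIC_BASE_URL first, since that is the variable the Anthropic SDK reads by default, then this tutorial's ANTHROPIC_API_BASE) is an assumption about how you want precedence to work.

```python
import os
from typing import Mapping, Optional

# Gateway URL used throughout this tutorial
DEFAULT_GATEWAY = "https://gateway.helicone.ai"


def resolve_gateway_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Report where Anthropic requests will be sent and whether a key is set."""
    env = os.environ if env is None else env
    # Prefer the SDK's variable, then the tutorial's, then the default gateway.
    base_url = (
        env.get("ANTHROPIC_BASE_URL")
        or env.get("ANTHROPIC_API_BASE")
        or DEFAULT_GATEWAY
    )
    return {
        "base_url": base_url.rstrip("/"),
        "has_api_key": bool(env.get("ANTHROPIC_API_KEY")),
    }


if __name__ == "__main__":
    print(resolve_gateway_config())
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise in tests without touching your real environment.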

Using Helicone with Claude Code Skills

When creating Claude Code skills that interact with LLMs, you can configure them to route through Helicone by setting the API base in the skill’s environment context.

Here’s a skill that makes LLM calls through Helicone:

---
name: llm-gateway-example
description: "Example skill demonstrating LLM calls through Helicone"
tools: [Bash, Read, Write]
env:
  ANTHROPIC_API_BASE: "https://gateway.helicone.ai"
  ANTHROPIC_API_KEY: "your_helicone_key"
---

You are an assistant that makes LLM calls through Helicone gateway.

Building Helicone-Aware Workflows

Let’s create practical workflows that leverage Helicone’s features.

Workflow 1: Cached Summarization

This workflow uses Helicone’s caching to reduce costs for repeated summarization tasks:

import anthropic
import os

client = anthropic.Anthropic(
    api_key=os.getenv("ANTHROPIC_API_KEY"),
    base_url=os.getenv("ANTHROPIC_API_BASE", "https://gateway.helicone.ai")
)

def summarize_with_cache(text: str) -> str:
    """Summarize text with Helicone caching enabled"""

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": f"Summarize this: {text}"}
        ],
        # Opt this request in to Helicone's response cache
        extra_headers={"Helicone-Cache-Enabled": "true"},
    )

    return response.content[0].text

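Helicone caches a response only when the request opts in with the Helicone-Cache-Enabled: true header. Later requests that match a cached one are answered from the cache without reaching the provider, so repeated summaries of the same text come back quickly and at no extra token cost.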
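A small helper can keep cache settings in one place. The sketch below assumes Helicone's Helicone-Cache-Enabled header and a standard Cache-Control max-age value for the time-to-live, as described in Helicone's caching documentation; verify both against the version you run. The helper name cache_headers and the one-hour default are illustrative.

```python
from typing import Dict, Optional


def cache_headers(ttl_seconds: Optional[int] = 3600, enabled: bool = True) -> Dict[str, str]:
    """Build per-request headers that opt a call in to Helicone caching."""
    if not enabled:
        return {}
    headers = {"Helicone-Cache-Enabled": "true"}
    if ttl_seconds is not None:
        if ttl_seconds <= 0:
            raise ValueError("ttl_seconds must be positive")
        # Assumed TTL mechanism: a Cache-Control max-age value in seconds
        headers["Cache-Control"] = f"max-age={ttl_seconds}"
    return headers
```

The result can be passed straight to the SDK, for example client.messages.create(..., extra_headers=cache_headers(600)), which keeps the TTL policy out of individual call sites.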

Workflow 2: Tagging Requests with Custom Properties

Attach custom properties to each request so you can segment usage later:

# Add custom properties for tracking. Helicone reads each property from
# its own Helicone-Property-<Name> header, so no extra package is needed.
helicone_properties = {
    "Helicone-Property-User-Tier": "premium",
    "Helicone-Property-Feature": "summarization",
    "Helicone-Property-Environment": "production",
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    extra_headers=helicone_properties,
)

This lets you track usage by custom dimensions in the Helicone dashboard.
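When properties arrive as an ordinary dictionary, for instance from request context, a converter avoids hand-writing header names. This is a sketch that assumes Helicone's Helicone-Property-<Name> header convention; the function name property_headers and the normalization rule (splitting on non-alphanumeric characters and capitalizing each word) are illustrative choices, not part of Helicone.

```python
import re
from typing import Dict, Mapping


def property_headers(properties: Mapping[str, object]) -> Dict[str, str]:
    """Turn {"user_tier": "premium"} into Helicone-Property-* headers."""
    headers: Dict[str, str] = {}
    for name, value in properties.items():
        # Normalize "user_tier" to "User-Tier" so names are dash-separated.
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
        if not parts:
            raise ValueError(f"invalid property name: {name!r}")
        header = "Helicone-Property-" + "-".join(p.capitalize() for p in parts)
        # Header values must be strings.
        headers[header] = str(value)
    return headers
```

The dictionary it returns can be passed as extra_headers on any client.messages.create call.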

Workflow 3: Fallback with Rate Limiting

Build resilient workflows that handle rate limits gracefully:

import time
from anthropic import RateLimitError

def call_with_retry(prompt: str, max_retries: int = 3) -> str:
    """Call LLM with automatic retry on rate limits"""

    # Reuses the `client` created in Workflow 1
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1024,  # required by the Messages API
                messages=[{"role": "user", "content": prompt}]
            )
            return response.content[0].text

        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise  # out of attempts: re-raise with the original traceback

    # Only reached when max_retries is 0 or negative
    raise ValueError("max_retries must be at least 1")

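When Helicone or the upstream provider rejects a request with HTTP 429, the SDK raises RateLimitError. With the default of three attempts, the function waits 1 second and then 2 seconds before giving up, which gives the quota time to recover.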
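Fixed exponential waits can cause many clients to retry at the same moment. A common refinement is to add random jitter and cap the delay. The sketch below is a generic, standard-library version of that idea; the names backoff_delay and retry_with_jitter, the 30-second cap, and the injectable sleep parameter are all illustrative assumptions rather than anything Helicone or the Anthropic SDK provides.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_jitter(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on `retry_on` with full-jitter exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the original error
            # Full jitter: wait a random time up to the exponential bound.
            sleep(random.uniform(0, backoff_delay(attempt)))
    raise AssertionError("unreachable")
```

With the Anthropic client this could be used as retry_with_jitter(lambda: client.messages.create(...), (RateLimitError,)). The Anthropic Python SDK also accepts a max_retries option on the client, which may be enough on its own for simple cases.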

Advanced Patterns

Prompt Caching with System Prompts

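Anthropic's prompt caching also works for requests routed through Helicone. Mark a long system prompt with cache_control, and repeated calls reuse the cached prefix at a lower input-token price: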

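# Use cached system prompts for cost savings.
# Anthropic only caches prefixes above a minimum length (1024 tokens for
# Sonnet models), so a real system prompt must be far longer than this one.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a coding assistant with access to files and terminals.",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Write a hello world program"}]
)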
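To check whether a prompt was actually served from the cache, you can inspect the usage block of the response. The sketch below assumes the usage field names from Anthropic's Messages API (input_tokens, cache_creation_input_tokens, cache_read_input_tokens) and works on a plain dictionary; the function name cached_input_fraction is illustrative.

```python
from typing import Mapping, Optional


def cached_input_fraction(usage: Mapping[str, Optional[int]]) -> float:
    """Fraction of input tokens that were read from the prompt cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    # Total input is the sum of cached reads, cache writes, and uncached tokens.
    total = read + written + uncached
    return read / total if total else 0.0
```

Assuming the SDK's response objects expose model_dump(), a call such as cached_input_fraction(response.usage.model_dump()) should return 0.0 on the first request, when the cache is being written, and a higher value on later requests that hit it.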

Request Routing by User

Route different users to different models based on tier:

import os
import anthropic

def get_client_for_user(user_tier: str):
    """Get a Helicone-routed client and the model for the user's tier"""

    client = anthropic.Anthropic(
        api_key=os.getenv("ANTHROPIC_API_KEY"),
        base_url="https://gateway.helicone.ai",
    )

    if user_tier == "premium":
        model = "claude-opus-4-20250514"
    else:
        model = "claude-3-haiku-20240307"

    return client, model

Monitoring with Helicone Dashboard

Once your Claude Code workflows are running through Helicone, access the dashboard to:

View Request Logs: See every LLM call with full request/response data
Analyze Caching Efficiency: Track cache hit rates and savings
Monitor Rate Limits: See when limits are hit and adjust
Set Alerts: Get notified of anomalies or high usage

Best Practices

Enable Caching Early: Set up caching from the start to maximize savings
Use Custom Properties: Add metadata to track usage by feature or user
Implement Retry Logic: Handle rate limits gracefully in production
Monitor Cache Rates: Aim for 30%+ cache hit rates for significant savings
Set Up Alerts: Get notified of errors or unusual patterns
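The 30% cache hit target is easy to check once you have request records in hand. The sketch below assumes a hypothetical record shape, a dictionary with a boolean cache_hit field; this is not Helicone's export schema, so adapt the field name to whatever your logs or exports actually contain.

```python
from typing import Iterable, Mapping


def cache_hit_rate(requests: Iterable[Mapping[str, object]]) -> float:
    """Share of requests served from cache, given records with a `cache_hit` flag."""
    total = 0
    hits = 0
    for record in requests:
        total += 1
        hits += bool(record.get("cache_hit"))
    return hits / total if total else 0.0


# Hypothetical sample records for illustration only
sample = [
    {"id": "a", "cache_hit": True},
    {"id": "b", "cache_hit": False},
    {"id": "c", "cache_hit": True},
    {"id": "d", "cache_hit": False},
]
rate = cache_hit_rate(sample)
print(f"hit rate: {rate:.0%}, meets 30% target: {rate >= 0.30}")
```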

Conclusion

Integrating Claude Code with Helicone’s LLM gateway transforms your AI development workflow. You gain observability into every LLM call, reduce costs through intelligent caching, and build more resilient applications with rate limiting and retry logic.

Start with basic integration, then add caching, monitoring, and advanced routing as your needs grow. The combination of Claude Code’s agentic capabilities and Helicone’s gateway features gives you a powerful foundation for production AI applications.