The Claude API bills extended thinking tokens at the standard output token rate. When extended thinking is enabled, the model generates internal reasoning tokens before it writes the visible answer; those tokens count toward your output token total even though they are not part of the final answer text. Your output token cost therefore scales with the complexity of the reasoning task, not just the length of the visible response. Below is a practical breakdown of the billing mechanism, cost implications, and strategies to optimize your spending.
What Is Extended Thinking in Claude API
Extended thinking is a feature that allows Claude models to engage in deeper reasoning before producing their final response. When enabled, the model breaks down complex problems, explores multiple approaches, and reasons through its solution before delivering the actual output. This results in more thoughtful, accurate responses for tasks that require complex reasoning, coding, or analysis.
The feature works by having the model generate internal reasoning tokens before the final answer. These tokens are returned as separate thinking blocks (summarized rather than verbatim on newer models) and are billed in full as output tokens. They represent the model’s “thought process” as it works through your request.
How Output Tokens Are Billed
When using the Claude API, you’re charged based on the number of tokens processed—both input tokens (what you send) and output tokens (what the model generates). Extended thinking specifically affects output token billing because the reasoning process generates additional tokens beyond the visible response.
Here’s the basic cost structure:
- Input tokens: Charged at the model’s input rate per million tokens
- Output tokens: Charged at the model’s output rate per million tokens
Extended thinking increases output tokens because the model generates reasoning tokens before producing its final answer. These reasoning tokens count toward your output token limit and are billed at the standard output token rate.
Current Pricing Structure
Claude API pricing varies by model. Here’s a general breakdown for the main models:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Opus | $15.00 | $75.00 |
When you enable extended thinking, the output tokens include both the visible response and the internal reasoning tokens. The exact number of reasoning tokens depends on the complexity of your request.
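To make the billing concrete, here is a small sketch that estimates request cost from the table above. The rates are the per-million-token prices listed; the token counts are illustrative, and the model keys are shorthand labels rather than official API model IDs:

```python
# Per-million-token rates from the table above (USD)
RATES = {
    "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-opus": {"input": 15.00, "output": 75.00},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Estimate request cost in USD; output_tokens must already
    include any extended thinking tokens."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Illustrative request: 500 input tokens, 400 visible output tokens,
# plus 3,000 thinking tokens billed as output.
without_thinking = estimate_cost("claude-3-5-sonnet", 500, 400)
with_thinking = estimate_cost("claude-3-5-sonnet", 500, 400 + 3_000)
print(f"Without thinking: ${without_thinking:.4f}")  # → Without thinking: $0.0075
print(f"With thinking:    ${with_thinking:.4f}")     # → With thinking:    $0.0525
```

The same visible answer costs seven times more here because the reasoning tokens dominate the output count; that multiplier is what you are managing when you tune the thinking budget.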
Practical Code Examples
Here’s how to enable extended thinking in your API calls. The `thinking` parameter turns the feature on; `budget_tokens` caps the reasoning tokens and must be at least 1,024 and less than `max_tokens`:

```python
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[
        {"role": "user", "content": "Explain how quicksort works and implement it in Python"}
    ],
)

print(message.content)
print(f"Usage: {message.usage}")
```
In this example, message.usage will show both input and output tokens, including any reasoning tokens generated by extended thinking.
For Node.js applications, the equivalent code looks like this:
```javascript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const message = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 4096,
  thinking: { type: 'enabled', budget_tokens: 2048 },
  messages: [
    {
      role: 'user',
      content: 'Explain how quicksort works and implement it in Python',
    },
  ],
});

console.log(message.content);
console.log(message.usage);
```
Understanding Token Usage in Responses
To understand exactly how extended thinking affects your billing, examine the usage object in the API response:
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
)

# Usage breakdown
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
# Extended thinking tokens are included in output_tokens
```
The output_tokens field includes both the visible response tokens and the reasoning tokens. You can estimate the reasoning token count by comparing output counts between requests with and without extended thinking for similar queries.
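That differencing approach can be sketched as a one-line helper. This is a rough approximation, not an API feature: the visible answers of the two requests will differ too, so treat the result as an order-of-magnitude estimate. The token counts below are illustrative stand-ins for real `usage` objects:

```python
from types import SimpleNamespace

def estimate_thinking_tokens(with_output_tokens, without_output_tokens):
    """Rough estimate of thinking tokens: the difference in output token
    counts between a request with extended thinking and a similar one
    without it. Clamped at zero since the answers themselves vary."""
    return max(with_output_tokens - without_output_tokens, 0)

# Illustrative usage objects from two similar requests
usage_with = SimpleNamespace(output_tokens=3400)
usage_without = SimpleNamespace(output_tokens=450)
print(estimate_thinking_tokens(usage_with.output_tokens,
                               usage_without.output_tokens))  # → 2950
```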
Cost Optimization Strategies
Managing costs with extended thinking requires careful consideration of when to use the feature:
Use extended thinking for:
- Complex coding problems requiring multi-step reasoning
- Mathematical and analytical tasks
- Architecture and system design questions
- Debugging and code review tasks
Skip extended thinking for:
- Simple queries and factual lookups
- Straightforward text generation
- Basic summarization tasks
- Quick translations
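The guidance above can be folded into a simple routing helper. This is a hand-rolled heuristic, not an Anthropic API feature: the keyword list, word-count threshold, and function name are all assumptions you would tune against your own traffic.

```python
# Illustrative keywords that tend to signal reasoning-heavy requests
REASONING_KEYWORDS = {"debug", "prove", "design", "optimize", "refactor",
                      "architecture", "algorithm", "complexity"}

def should_enable_thinking(prompt: str, min_words: int = 40) -> bool:
    """Heuristic: enable extended thinking for long prompts or prompts
    containing reasoning-heavy keywords; skip it for short, simple asks."""
    words = prompt.lower().split()
    if len(words) >= min_words:
        return True
    return any(w.strip(".,?!") in REASONING_KEYWORDS for w in words)

print(should_enable_thinking("Translate 'hello' to French"))           # → False
print(should_enable_thinking("Debug this race condition in my code"))  # → True
```

A keyword heuristic is crude, but even a rough router like this keeps simple translations and lookups off the expensive path while routing debugging and design work to extended thinking.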
You can also control costs by capping the thinking budget and the total output. Note that `budget_tokens` counts toward `max_tokens`, so `max_tokens` must be larger than the thinking budget:

```python
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,  # Caps total output: visible answer plus thinking
    thinking={"type": "enabled", "budget_tokens": 1024},  # Minimum allowed budget
    messages=[{"role": "user", "content": "Your prompt here"}],
)
```
Monitoring Your Spending
Implement logging to track your extended thinking usage over time:
```python
import time
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Claude Sonnet rates: $3 per 1M input tokens, $15 per 1M output tokens
INPUT_RATE = 3.0 / 1_000_000
OUTPUT_RATE = 15.0 / 1_000_000

def track_request(prompt, enable_extended_thinking=True):
    kwargs = {}
    if enable_extended_thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 1024}

    start_time = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    duration = time.time() - start_time

    cost = (response.usage.input_tokens * INPUT_RATE
            + response.usage.output_tokens * OUTPUT_RATE)

    print(f"Tokens: {response.usage.input_tokens} in / {response.usage.output_tokens} out")
    print(f"Cost: ${cost:.4f}")
    print(f"Duration: {duration:.2f}s")
    return response, cost

# Test with extended thinking
response, cost = track_request("Explain the time complexity of merge sort")
```
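Per-request logging becomes more useful when the costs feed a running total. The sketch below is an illustrative wrapper around the `cost` values that `track_request` returns; the `BudgetTracker` class and its dollar limit are assumptions, not part of any SDK:

```python
class BudgetTracker:
    """Accumulates per-request costs and flags when a spending
    threshold is crossed. The default limit is illustrative."""

    def __init__(self, limit_usd: float = 5.00):
        self.limit_usd = limit_usd
        self.total_usd = 0.0
        self.requests = 0

    def record(self, cost_usd: float) -> bool:
        """Add one request's cost; return True while still under budget."""
        self.total_usd += cost_usd
        self.requests += 1
        return self.total_usd <= self.limit_usd

tracker = BudgetTracker(limit_usd=0.10)
for cost in (0.03, 0.04, 0.05):  # e.g. costs returned by track_request()
    if not tracker.record(cost):
        print(f"Budget exceeded after {tracker.requests} requests: "
              f"${tracker.total_usd:.2f}")
        # → Budget exceeded after 3 requests: $0.12
```

In production you would wire `record()` into the request loop and stop, downgrade the model, or disable extended thinking once it returns `False`.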
Conclusion
Extended thinking in Claude API provides deeper reasoning capabilities at the cost of additional output tokens. The feature is billed at standard output token rates, with reasoning tokens included in your total output count. For complex tasks requiring thorough reasoning, the additional cost often delivers significantly better results. For simpler tasks, you can disable the feature and save on token costs.
Monitor your usage patterns, set appropriate limits, and enable extended thinking selectively to balance cost and performance for your specific use case.