Marketing claims for AI coding tools are unreliable. “10x productivity” and “generates correct code” mean nothing without a reproducible evaluation framework. This guide shows how to run your own accuracy tests on the tools you’re considering, using your actual codebase and task types.
Why Vendor Benchmarks Don’t Transfer
HumanEval measures whether a model can implement simple algorithmic functions from docstrings. SWE-bench measures GitHub issue resolution. Neither measures what you actually care about:
- Does it generate code that integrates correctly with your existing patterns?
- How often does it produce code that passes your test suite on the first attempt?
- Does it understand your team’s domain terminology?
Run your own evaluations. They take 2-4 hours to set up and give you directly actionable data.
Step 1: Build Your Evaluation Task Set
Create 20-30 tasks representative of your actual work. Categorize them:
```python
# evaluation/tasks.py
TASKS = [
    # Category: New function implementation
    {
        "id": "T001",
        "category": "implementation",
        "difficulty": "easy",
        "prompt": """Add a function `calculate_discount(price: float, tier: str) -> float`
to pricing.py. Tier can be 'bronze' (5%), 'silver' (10%), 'gold' (15%), 'platinum' (25%).
Raise ValueError for unknown tiers. Match the style of existing functions in that file.""",
        "test_file": "tests/test_pricing.py",
        "test_function": "test_calculate_discount",
        "reference_files": ["pricing.py"],  # Files to include as context
    },
    # Category: Bug fix
    {
        "id": "T002",
        "category": "bug_fix",
        "difficulty": "medium",
        "prompt": """Fix the bug in `UserRepository.get_by_email()`.
The function returns None for valid emails when the email contains uppercase letters.
The test `test_email_case_insensitive_lookup` is failing.""",
        "test_file": "tests/test_user_repository.py",
        "test_function": "test_email_case_insensitive_lookup",
        "reference_files": ["repositories/user_repository.py", "models/user.py"],
    },
    # Category: Refactoring
    {
        "id": "T003",
        "category": "refactoring",
        "difficulty": "hard",
        "prompt": """Refactor `OrderProcessor.process()` to use the Strategy pattern.
The current if-elif chain for payment types (credit_card, paypal, crypto) should
become separate strategy classes. Maintain the same public interface and ensure
all existing tests still pass.""",
        "test_file": "tests/test_order_processor.py",
        "test_function": None,  # All tests must pass
        "reference_files": ["processors/order_processor.py", "tests/test_order_processor.py"],
    },
]
```
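Before spending API calls, it's worth sanity-checking the task set: missing test files and duplicate IDs are the usual failure modes. A minimal validation sketch (`validate_tasks` is my own helper, assuming repo-relative paths like those in `TASKS` above):

```python
from pathlib import Path

def validate_tasks(tasks):
    """Fail fast on malformed task definitions before any model is called."""
    seen_ids = set()
    errors = []
    for task in tasks:
        if task["id"] in seen_ids:
            errors.append(f"duplicate id {task['id']}")
        seen_ids.add(task["id"])
        # Required fields must be present and non-empty
        for key in ("category", "difficulty", "prompt", "test_file"):
            if not task.get(key):
                errors.append(f"{task['id']}: missing {key}")
        # Every referenced file must actually exist in the repo
        for f in task.get("reference_files", []) + [task.get("test_file", "")]:
            if f and not Path(f).exists():
                errors.append(f"{task['id']}: missing file {f}")
    return errors
```

Run it once at startup and abort on a non-empty result; a typo in a test path otherwise shows up as a confusing pytest failure mid-run.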
Step 2: Automated Evaluation Runner
```python
# evaluation/runner.py
import subprocess
import time
from pathlib import Path

import anthropic
import openai


def run_evaluation(tasks, model_config):
    """Run all tasks against a specific model and return results."""
    results = []
    for task in tasks:
        # Load reference files as context
        context = "\n\n".join(
            f"=== {f} ===\n{Path(f).read_text()}"
            for f in task.get("reference_files", [])
        )
        prompt = f"{context}\n\n{task['prompt']}" if context else task["prompt"]

        # Get model response
        start = time.time()
        generated_code, tokens_used = get_model_response(prompt, model_config)
        latency = time.time() - start

        # Apply the generated code and run the task's tests
        test_result = run_tests(task, generated_code)
        results.append({
            "task_id": task["id"],
            "category": task["category"],
            "difficulty": task["difficulty"],
            "passed": test_result["passed"],
            "test_output": test_result["output"],
            "latency_seconds": latency,
            "tokens_used": tokens_used,
        })
    return results


def get_model_response(prompt, config):
    """Return (generated_text, total_tokens) for the configured provider."""
    if config["provider"] == "anthropic":
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=config["model"],
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        tokens = response.usage.input_tokens + response.usage.output_tokens
        return response.content[0].text, tokens
    elif config["provider"] == "openai":
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=config["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
        )
        return response.choices[0].message.content, response.usage.total_tokens
    raise ValueError(f"unknown provider: {config['provider']}")


def run_tests(task, generated_code):
    """Apply generated code and run the relevant test."""
    # Parse code from markdown blocks if present
    if "```python" in generated_code:
        code = generated_code.split("```python")[1].split("```")[0].strip()
    else:
        code = generated_code

    # Naive: overwrites the first reference file with the generated code.
    # For real runs, apply the change on a throwaway branch or git worktree
    # so partial edits (e.g. "add a function") don't clobber the file.
    target_file = task["reference_files"][0]
    original = Path(target_file).read_text()
    try:
        Path(target_file).write_text(code)
        test_cmd = ["python", "-m", "pytest", task["test_file"], "-v", "--tb=short"]
        if task.get("test_function"):
            test_cmd.extend(["-k", task["test_function"]])
        result = subprocess.run(test_cmd, capture_output=True, text=True, timeout=30)
        return {
            "passed": result.returncode == 0,
            "output": result.stdout + result.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": "pytest timed out after 30s"}
    finally:
        # Restore the original file
        Path(target_file).write_text(original)
```
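A thin driver ties the task set, the model configs, and result storage together. A sketch, with the `name` field and the output directory as my own conventions; pass `run_evaluation` from Step 2 as the `runner` argument (injecting it keeps the driver trivially testable with a stub):

```python
import json
from pathlib import Path

def run_all(tasks, configs, runner, out_dir="evaluation/results"):
    """Run every model config over the task set and persist raw results as JSON.

    `runner` is a callable (tasks, config) -> list[dict], e.g. run_evaluation.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    all_results = {}
    for config in configs:
        results = runner(tasks, config)
        all_results[config["name"]] = results
        # Persist raw results so metrics can be recomputed without re-running models
        out_file = Path(out_dir) / f"{config['name']}.json"
        out_file.write_text(json.dumps(results, indent=2))
    return all_results
```

Keeping the raw per-task JSON around matters more than it looks: you will want to slice the same runs by new dimensions later without paying for the API calls again.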
Step 3: Metrics That Matter
Don’t just track pass/fail. Track:
```python
def calculate_metrics(results):
    """Aggregate pass rates overall, by category, and by difficulty."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])

    # Break down by category
    by_category = {}
    for cat in set(r["category"] for r in results):
        cat_results = [r for r in results if r["category"] == cat]
        by_category[cat] = {
            "pass_rate": sum(1 for r in cat_results if r["passed"]) / len(cat_results),
            "count": len(cat_results),
        }

    # Break down by difficulty
    by_difficulty = {}
    for diff in ["easy", "medium", "hard"]:
        diff_results = [r for r in results if r["difficulty"] == diff]
        if diff_results:
            by_difficulty[diff] = {
                "pass_rate": sum(1 for r in diff_results if r["passed"]) / len(diff_results),
                "count": len(diff_results),
            }

    return {
        "overall_pass_rate": passed / total,
        "by_category": by_category,
        "by_difficulty": by_difficulty,
        "avg_latency": sum(r["latency_seconds"] for r in results) / total,
        "total_tasks": total,
    }
```
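For comparing several models side by side, a small plain-text report over those metrics dicts saves squinting at raw JSON. A sketch assuming the output shape of `calculate_metrics` above:

```python
def format_report(metrics_by_model):
    """Render a side-by-side pass-rate summary as plain text."""
    lines = []
    for name, m in sorted(metrics_by_model.items()):
        lines.append(
            f"{name}: {m['overall_pass_rate']:.0%} overall "
            f"({m['total_tasks']} tasks, {m['avg_latency']:.1f}s avg)"
        )
        # One indented line per difficulty bucket
        for diff, stats in m["by_difficulty"].items():
            lines.append(f"  {diff}: {stats['pass_rate']:.0%} ({stats['count']} tasks)")
    return "\n".join(lines)
```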
Key metrics to compare:
| Metric | What it measures |
|---|---|
| First-pass test success rate | Code correctness without iteration |
| Category breakdown | Where each model is weakest |
| Latency | Time-to-result for interactive use |
| Context utilization | Did it use the reference files? |
| Code style conformance | Does it match your patterns? |
Step 4: Blind Comparison
For style and readability evaluation, use blind comparison. Present generated code samples to reviewers without identifying the model:
```python
import random


def create_blind_comparison(task, responses_by_model):
    """Shuffle responses and remove model identifiers."""
    shuffled = list(responses_by_model.items())
    random.shuffle(shuffled)
    return {
        "task": task["prompt"],
        "options": {
            chr(65 + i): code  # A, B, C...
            for i, (model, code) in enumerate(shuffled)
        },
        "_mapping": {chr(65 + i): model for i, (model, _) in enumerate(shuffled)},
    }
```
Ask reviewers to score each option on: correctness, code style match, completeness, and whether they’d merge it without modification.
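Only unblind after all reviews are collected. A sketch for mapping scores back to models, assuming the `_mapping` key produced by `create_blind_comparison` and a simple 1-5 rubric per option (`tally_blind_scores` is my own helper):

```python
def tally_blind_scores(comparison, reviewer_scores):
    """Map blinded option scores back to model names and average them.

    reviewer_scores: one dict per reviewer, e.g. {"A": 4, "B": 2},
    scoring each option on your chosen rubric.
    """
    totals = {model: [] for model in comparison["_mapping"].values()}
    for scores in reviewer_scores:
        for option, score in scores.items():
            totals[comparison["_mapping"][option]].append(score)
    # Average per model; skip models nobody scored
    return {model: sum(s) / len(s) for model, s in totals.items() if s}
```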
Realistic Pass Rates
Based on running this framework on a Python/Django codebase:
| Model | Easy tasks | Medium tasks | Hard tasks |
|---|---|---|---|
| Claude Sonnet 4.5 | 88% | 71% | 45% |
| GPT-4o | 85% | 68% | 38% |
| Claude Haiku 4.5 | 72% | 51% | 22% |
| Gemini 1.5 Pro | 80% | 63% | 35% |
“Hard” tasks were refactoring requests with multiple interacting files. No model approaches 90% on those — iterative use is still necessary.
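One caveat when reading tables like this: with 20-30 tasks per bucket, pass rates carry wide error bars. A Wilson score interval (standard formula, no dependencies) makes the uncertainty concrete:

```python
import math

def wilson_interval(passed, total, z=1.96):
    """95% Wilson score confidence interval for a pass rate."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin, center + margin)
```

Seven passes out of ten gives a 95% interval of roughly 40-89%, so single-digit percentage gaps between models on a small task set are noise; either add tasks or treat those models as tied.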
Related Reading
- AI Coding Productivity Measurement: How to Track If Tools Save Time
- Devin vs SWE-Agent for Autonomous Coding
- AI Autocomplete Accuracy Comparison: Copilot vs Codeium vs Tabnine
Built by theluckystrike — More at zovo.one