Marketing claims for AI coding tools are unreliable. “10x productivity” and “generates correct code” mean nothing without a reproducible evaluation framework. This guide shows how to run your own accuracy tests on the tools you’re considering, using your actual codebase and task types.
Why Vendor Benchmarks Don’t Transfer
HumanEval measures whether a model can implement simple algorithmic functions from docstrings. SWE-bench measures GitHub issue resolution. Neither measures what you actually care about:
- Does it generate code that integrates correctly with your existing patterns?
- How often does it produce code that passes your test suite on the first attempt?
- Does it understand your team’s domain terminology?
Run your own evaluations. They take 2-4 hours to set up and give you directly actionable data.
Step 1: Build Your Evaluation Task Set
Create 20-30 tasks representative of your actual work. Categorize them:
```python
# evaluation/tasks.py
TASKS = [
    # Category: New function implementation
    {
        "id": "T001",
        "category": "implementation",
        "difficulty": "easy",
        "prompt": """Add a function `calculate_discount(price: float, tier: str) -> float`
to pricing.py. Tier can be 'bronze' (5%), 'silver' (10%), 'gold' (15%), 'platinum' (25%).
Raise ValueError for unknown tiers. Match the style of existing functions in that file.""",
        "test_file": "tests/test_pricing.py",
        "test_function": "test_calculate_discount",
        "reference_files": ["pricing.py"],  # Files to include as context
    },
    # Category: Bug fix
    {
        "id": "T002",
        "category": "bug_fix",
        "difficulty": "medium",
        "prompt": """Fix the bug in `UserRepository.get_by_email()`.
The function returns None for valid emails when the email contains uppercase letters.
The test `test_email_case_insensitive_lookup` is failing.""",
        "test_file": "tests/test_user_repository.py",
        "test_function": "test_email_case_insensitive_lookup",
        "reference_files": ["repositories/user_repository.py", "models/user.py"],
    },
    # Category: Refactoring
    {
        "id": "T003",
        "category": "refactoring",
        "difficulty": "hard",
        "prompt": """Refactor `OrderProcessor.process()` to use the Strategy pattern.
The current if-elif chain for payment types (credit_card, paypal, crypto) should
become separate strategy classes. Maintain the same public interface and ensure
all existing tests still pass.""",
        "test_file": "tests/test_order_processor.py",
        "test_function": None,  # All tests must pass
        "reference_files": ["processors/order_processor.py", "tests/test_order_processor.py"],
    },
]
```
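Before spending API calls, it's worth sanity-checking the task set: missing test files and duplicate IDs are the usual failure modes. A minimal validation sketch (`validate_tasks` is my own helper, assuming repo-relative paths like those in `TASKS` above):

```python
from pathlib import Path

def validate_tasks(tasks):
    """Fail fast on malformed task definitions before any model is called."""
    seen_ids = set()
    errors = []
    for task in tasks:
        if task["id"] in seen_ids:
            errors.append(f"duplicate id {task['id']}")
        seen_ids.add(task["id"])
        # Required fields must be present and non-empty
        for key in ("category", "difficulty", "prompt", "test_file"):
            if not task.get(key):
                errors.append(f"{task['id']}: missing {key}")
        # Every referenced file must actually exist in the repo
        for f in task.get("reference_files", []) + [task.get("test_file", "")]:
            if f and not Path(f).exists():
                errors.append(f"{task['id']}: missing file {f}")
    return errors
```

Run it once at startup and abort on a non-empty result; a typo in a test path otherwise shows up as a confusing pytest failure mid-run.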
Step 2: Automated Evaluation Runner
```python
# evaluation/runner.py
import subprocess
import time
from pathlib import Path

import anthropic
import openai


def run_evaluation(tasks, model_config):
    """Run all tasks against a specific model and return results."""
    results = []
    for task in tasks:
        # Load reference files as context
        context = "\n\n".join(
            f"=== {f} ===\n{Path(f).read_text()}"
            for f in task.get("reference_files", [])
        )
        prompt = f"{context}\n\n{task['prompt']}" if context else task["prompt"]

        # Get model response
        start = time.time()
        generated_code, tokens_used = get_model_response(prompt, model_config)
        latency = time.time() - start

        # Apply the generated code and run the task's tests
        test_result = run_tests(task, generated_code)
        results.append({
            "task_id": task["id"],
            "category": task["category"],
            "difficulty": task["difficulty"],
            "passed": test_result["passed"],
            "test_output": test_result["output"],
            "latency_seconds": latency,
            "tokens_used": tokens_used,
        })
    return results


def get_model_response(prompt, config):
    """Return (generated_text, total_tokens) for the configured provider."""
    if config["provider"] == "anthropic":
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=config["model"],
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}],
        )
        tokens = response.usage.input_tokens + response.usage.output_tokens
        return response.content[0].text, tokens
    elif config["provider"] == "openai":
        client = openai.OpenAI()
        response = client.chat.completions.create(
            model=config["model"],
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
        )
        return response.choices[0].message.content, response.usage.total_tokens
    raise ValueError(f"unknown provider: {config['provider']}")


def run_tests(task, generated_code):
    """Apply generated code and run the relevant test."""
    # Parse code from markdown blocks if present
    if "```python" in generated_code:
        code = generated_code.split("```python")[1].split("```")[0].strip()
    else:
        code = generated_code

    # Naive: overwrites the first reference file with the generated code.
    # For real runs, apply the change on a throwaway branch or git worktree
    # so partial edits (e.g. "add a function") don't clobber the file.
    target_file = task["reference_files"][0]
    original = Path(target_file).read_text()
    try:
        Path(target_file).write_text(code)
        test_cmd = ["python", "-m", "pytest", task["test_file"], "-v", "--tb=short"]
        if task.get("test_function"):
            test_cmd.extend(["-k", task["test_function"]])
        result = subprocess.run(test_cmd, capture_output=True, text=True, timeout=30)
        return {
            "passed": result.returncode == 0,
            "output": result.stdout + result.stderr,
        }
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": "pytest timed out after 30s"}
    finally:
        # Restore the original file
        Path(target_file).write_text(original)
```
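A thin driver ties the task set, the model configs, and result storage together. A sketch, with the `name` field and the output directory as my own conventions; pass `run_evaluation` from Step 2 as the `runner` argument (injecting it keeps the driver trivially testable with a stub):

```python
import json
from pathlib import Path

def run_all(tasks, configs, runner, out_dir="evaluation/results"):
    """Run every model config over the task set and persist raw results as JSON.

    `runner` is a callable (tasks, config) -> list[dict], e.g. run_evaluation.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    all_results = {}
    for config in configs:
        results = runner(tasks, config)
        all_results[config["name"]] = results
        # Persist raw results so metrics can be recomputed without re-running models
        out_file = Path(out_dir) / f"{config['name']}.json"
        out_file.write_text(json.dumps(results, indent=2))
    return all_results
```

Keeping the raw per-task JSON around matters more than it looks: you will want to slice the same runs by new dimensions later without paying for the API calls again.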
Step 3: Metrics That Matter
Don’t just track pass/fail. Track:
```python
def calculate_metrics(results):
    """Aggregate pass rates overall, by category, and by difficulty."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])

    # Break down by category
    by_category = {}
    for cat in set(r["category"] for r in results):
        cat_results = [r for r in results if r["category"] == cat]
        by_category[cat] = {
            "pass_rate": sum(1 for r in cat_results if r["passed"]) / len(cat_results),
            "count": len(cat_results),
        }

    # Break down by difficulty
    by_difficulty = {}
    for diff in ["easy", "medium", "hard"]:
        diff_results = [r for r in results if r["difficulty"] == diff]
        if diff_results:
            by_difficulty[diff] = {
                "pass_rate": sum(1 for r in diff_results if r["passed"]) / len(diff_results),
                "count": len(diff_results),
            }

    return {
        "overall_pass_rate": passed / total,
        "by_category": by_category,
        "by_difficulty": by_difficulty,
        "avg_latency": sum(r["latency_seconds"] for r in results) / total,
        "total_tasks": total,
    }
```
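For comparing several models side by side, a small plain-text report over those metrics dicts saves squinting at raw JSON. A sketch assuming the output shape of `calculate_metrics` above:

```python
def format_report(metrics_by_model):
    """Render a side-by-side pass-rate summary as plain text."""
    lines = []
    for name, m in sorted(metrics_by_model.items()):
        lines.append(
            f"{name}: {m['overall_pass_rate']:.0%} overall "
            f"({m['total_tasks']} tasks, {m['avg_latency']:.1f}s avg)"
        )
        # One indented line per difficulty bucket
        for diff, stats in m["by_difficulty"].items():
            lines.append(f"  {diff}: {stats['pass_rate']:.0%} ({stats['count']} tasks)")
    return "\n".join(lines)
```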
Key metrics to compare:
| Metric | What it measures |
|---|---|
| First-pass test success rate | Code correctness without iteration |
| Category breakdown | Where each model is weakest |
| Latency | Time-to-result for interactive use |
| Context utilization | Did it use the reference files? |
| Code style conformance | Does it match your patterns? |
Step 4: Blind Comparison
For style and readability evaluation, use blind comparison. Present generated code samples to reviewers without identifying the model:
```python
import random


def create_blind_comparison(task, responses_by_model):
    """Shuffle responses and remove model identifiers."""
    shuffled = list(responses_by_model.items())
    random.shuffle(shuffled)
    return {
        "task": task["prompt"],
        "options": {
            chr(65 + i): code  # A, B, C...
            for i, (model, code) in enumerate(shuffled)
        },
        "_mapping": {chr(65 + i): model for i, (model, _) in enumerate(shuffled)},
    }
```
Ask reviewers to score each option on: correctness, code style match, completeness, and whether they’d merge it without modification.
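Only unblind after all reviews are collected. A sketch for mapping scores back to models, assuming the `_mapping` key produced by `create_blind_comparison` and a simple 1-5 rubric per option (`tally_blind_scores` is my own helper):

```python
def tally_blind_scores(comparison, reviewer_scores):
    """Map blinded option scores back to model names and average them.

    reviewer_scores: one dict per reviewer, e.g. {"A": 4, "B": 2},
    scoring each option on your chosen rubric.
    """
    totals = {model: [] for model in comparison["_mapping"].values()}
    for scores in reviewer_scores:
        for option, score in scores.items():
            totals[comparison["_mapping"][option]].append(score)
    # Average per model; skip models nobody scored
    return {model: sum(s) / len(s) for model, s in totals.items() if s}
```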
Realistic Pass Rates
Based on running this framework on a Python/Django codebase:
| Model | Easy tasks | Medium tasks | Hard tasks |
|---|---|---|---|
| Claude Sonnet 4.5 | 88% | 71% | 45% |
| GPT-4o | 85% | 68% | 38% |
| Claude Haiku 4.5 | 72% | 51% | 22% |
| Gemini 1.5 Pro | 80% | 63% | 35% |
“Hard” tasks were refactoring requests with multiple interacting files. No model approaches 90% on those — iterative use is still necessary.
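One caveat when reading tables like this: with 20-30 tasks per bucket, pass rates carry wide error bars. A Wilson score interval (standard formula, no dependencies) makes the uncertainty concrete:

```python
import math

def wilson_interval(passed, total, z=1.96):
    """95% Wilson score confidence interval for a pass rate."""
    if total == 0:
        return (0.0, 1.0)
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin, center + margin)
```

Seven passes out of ten gives a 95% interval of roughly 40-89%, so single-digit percentage gaps between models on a small task set are noise; either add tasks or treat those models as tied.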
Related Reading
- AI Coding Productivity Measurement: How to Track If Tools Save Time
- Devin vs SWE-Agent for Autonomous Coding
- AI Autocomplete Accuracy Comparison: Copilot vs Codeium vs Tabnine
Built by theluckystrike — More at zovo.one