Claude Code for Evals Framework Workflow Tutorial

Evaluation frameworks are essential for building reliable AI systems. Whether you’re testing Claude’s responses, validating skill outputs, or benchmarking agent behaviors, a structured evals workflow helps you measure and improve quality systematically. This tutorial walks you through building an evals framework using Claude Code, from setup to execution.

Understanding Evals in the Claude Code Context

Evals (evaluations) in Claude Code refer to the practice of automatically testing and measuring AI outputs against expected behaviors. Unlike traditional unit tests with deterministic outputs, evals often involve assessing open-ended responses, reasoning quality, or task completion rates. Claude Code provides primitives and patterns that make this practical.

The core workflow involves three stages: defining test cases with expected outcomes, running Claude against those cases, and analyzing the results to identify patterns or regressions. Claude’s skill system and tool use capabilities form the foundation of this workflow.

Setting Up Your Evals Project Structure

Create a dedicated directory for your evals project:

mkdir my-evals && cd my-evals
mkdir -p cases skills reports

Your project should contain three key components. The cases/ directory holds your test cases as JSON or YAML files. The skills/ directory contains any custom skills you’re testing. The reports/ directory stores evaluation outputs.

Create a test case file in cases/basic-math.yaml:

- id: add-two-numbers
  prompt: "What is 47 + 83? Provide only the answer."
  expected: "130"
  type: exact-match

- id: explain-concept
  prompt: "Explain what a closure is in JavaScript in one sentence."
  expected: "closure"
  type: contains

Each test case has an ID, the prompt to send to Claude, expected outcome(s), and a matching type. The exact-match type requires an identical response, while contains checks for keyword presence.

Creating an Evaluation Runner Skill

Build a skill that orchestrates your evals workflow. Create skills/evals-runner.md:

---
name: evals-runner
description: Runs evaluation test cases against Claude and reports results
---

You are an evaluation runner. Your task is to:

1. Read all test cases from the specified directory
2. For each case, generate a response using the provided prompt
3. Compare each response against the expected outcome
4. Calculate pass/fail statistics
5. Write a detailed report

When comparing results:
- For exact-match: response must equal expected exactly (case-insensitive, trimmed)
- For contains: response must include the expected substring
- For regex: use the expected pattern as a regex match

Output format should be a JSON summary with:
- total_cases
- passed
- failed
- pass_rate
- individual_results with case_id, passed (boolean), and notes

This skill uses Read to load test cases, processes them, and Write to generate reports. The skill defines clear comparison logic for different evaluation types.

Running Your First Evaluation

Execute the evaluation using Claude Code with your skill:

claude --print "Run the evals-runner skill with cases from ./cases/" \
  --skill-file ./skills/evals-runner.md

The runner reads each test case, sends the prompt, and records whether the response meets expectations. Results appear in the console and are saved to reports/evaluation-{timestamp}.json.

Sample output:

{
  "timestamp": "2026-03-15T10:30:00Z",
  "total_cases": 2,
  "passed": 1,
  "failed": 1,
  "pass_rate": "50%",
  "results": [
    {
      "case_id": "add-two-numbers",
      "passed": true,
      "response": "130",
      "expected": "130"
    },
    {
      "case_id": "explain-concept",
      "passed": false,
      "response": "A closure is a function that has access to variables from its outer scope...",
      "expected": "closure"
    }
  ]
}

Scaling with Batch Processing

For larger eval sets, use batch processing to handle hundreds of cases efficiently:

# cases/advanced-prompting.yaml
- id: summary-task
  prompt: "Summarize this in 3 bullet points: {content}"
  expected: ["point1", "point2", "point3"]
  type: all-contained

- id: code-review
  prompt: "Review this code for bugs:\n{code}"
  expected: "null pointer"
  type: contains

Create a batch runner that processes cases in parallel:

# Process cases in batches of 10
for batch in $(ls cases/*.yaml | xargs -n10); do
  claude --print "Run evals for these cases: $batch" \
    --skill-file ./skills/evals-runner.md
done

This approach prevents rate limits and provides incremental results.

Implementing Weighted Scoring

Not all test cases carry equal importance. Implement weighted scoring in your runner:

# cases/weighted-test.yaml
- id: critical-safety
  prompt: "Should I inject this user input directly into SQL?"
  expected: "no"
  type: exact-match
  weight: 10

- id: minor-formatting
  prompt: "Format this as JSON"
  expected: "{"
  type: starts-with
  weight: 1

Update your skill to calculate weighted scores:

Calculate final score as: (sum of (passed * weight)) / (sum of all weights)
A critical test failure (weight 10) significantly impacts the score,
while minor formatting issues (weight 1) have minimal impact.

Continuous Integration with Evals

Integrate evals into your CI pipeline to catch regressions before deployment:

# .github/workflows/evals.yml
name: Claude Evals
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          claude --print "Run evals-runner with cases from ./cases/" \
            --skill-file ./skills/evals-runner.md
      - name: Check Pass Rate
        run: |
          PASS_RATE=$(jq '.pass_rate' reports/latest.json | tr -d '%')
          if [ "$PASS_RATE" -lt 90 ]; then
            echo "Pass rate below threshold: $PASS_RATE%"
            exit 1
          fi

This workflow fails if pass rate drops below 90%, ensuring baseline quality.

Best Practices for Effective Evals

Follow these principles when building your evals framework. First, start small and iterate. Begin with 10-20 test cases covering core functionality, then expand based on failure patterns. Second, use diverse test cases. Include edge cases, adversarial inputs, and typical use cases. Third, version your test suites. Keep old test cases even when updating—regression testing requires consistency.

Fourth, automate everything. Manual evals don’t scale. Fifth, track trends over time. Store historical results and visualize pass rates to identify degradation early.

Conclusion

Building an evals framework with Claude Code transforms ad-hoc testing into systematic quality assurance. By defining clear test cases, creating an evaluation runner skill, and integrating with CI/CD, you establish reliable benchmarks for your AI systems. Start with simple exact-match tests, then gradually incorporate more sophisticated evaluation types as your needs evolve.

The investment in a robust evals workflow pays dividends through faster iteration, clearer feedback on changes, and confidence in your AI’s behavior. Implement the patterns from this tutorial to establish evaluation practices that scale with your project.

Built by theluckystrike — More at zovo.one