AI Tools Compared

How to Use AI to Create Data Quality Scorecards from dbt Test Results

Data quality is the backbone of reliable analytics. When you run dbt tests, you generate valuable information about your data’s health—but transforming those raw test results into meaningful scorecards requires additional effort. AI can help you automate this process, turning complex test outputs into clear, actionable metrics that your team can monitor and improve over time.

This guide shows you how to use AI to create data quality scorecards from dbt test results. You’ll learn practical approaches for parsing test outputs, generating insights, and building automated reporting systems that go well beyond what dbt’s native reporting offers.

Understanding dbt Test Results

dbt provides several built-in tests: unique, not_null, accepted_values, relationships, and foreign_key. When you run dbt test, these tests produce JSON or CSV output that includes test status, timing information, and failure details.

Here’s a typical dbt test result in JSON format:

{
  "results": [
    {
      "unique_id": "test.unique.orders.order_id",
      "status": "pass",
      " timings": [
        {
          "name": "execute",
          "elapsed_time": 0.152
        }
      ],
      "message": null
    },
    {
      "unique_id": "test.not_null.customers.email",
      "status": "fail",
      " timings": [
        {
          "name": "execute",
          "elapsed_time": 0.089
        }
      ],
      "message": "1 out of 15420 rows failed"
    }
  ]
}

The challenge is aggregating these individual test results into meaningful quality scores that stakeholders can understand at a glance.

Building Your AI-Powered Scorecard System

Step 1: Collect and Normalize Test Results

First, run your dbt tests and capture the output. Use the --output json flag to get machine-readable results:

dbt test --output json > results.json

Load and normalize these results in Python:

import json
from datetime import datetime

def load_dbt_results(filepath):
    with open(filepath) as f:
        data = json.load(f)

    normalized = []
    for result in data.get('results', []):
        normalized.append({
            'test_name': result['unique_id'],
            'model': result['unique_id'].split('.')[2],
            'test_type': result['unique_id'].split('.')[1],
            'status': result['status'],
            'elapsed_time': result['timings'][0]['elapsed_time'],
            'message': result.get('message', ''),
            'timestamp': datetime.now().isoformat()
        })

    return normalized

Step 2: Calculate Quality Metrics

With normalized data, you can now calculate aggregate metrics. This is where AI adds value by identifying patterns and generating insights:

def calculate_quality_score(results):
    total = len(results)
    passed = sum(1 for r in results if r['status'] == 'pass')
    failed = sum(1 for r in results if r['status'] == 'fail')

    score = (passed / total * 100) if total > 0 else 0

    return {
        'total_tests': total,
        'passed': passed,
        'failed': failed,
        'quality_score': round(score, 2),
        'pass_rate': f"{score:.1f}%"
    }

Step 3: Generate AI Insights

Now use an AI model to analyze the test results and generate practical recommendations:

def generate_ai_insights(results, metrics):
    # Group failures by model for analysis
    failures = [r for r in results if r['status'] == 'fail']
    failure_by_model = {}

    for failure in failures:
        model = failure['model']
        if model not in failure_by_model:
            failure_by_model[model] = []
        failure_by_model[model].append(failure)

    # Build prompt for AI analysis
    prompt = f"""Analyze these dbt test failures and provide actionable insights:

    Quality Score: {metrics['quality_score']}%
    Failed Tests: {metrics['failed']}

    Failures by Model:
    {json.dumps(failure_by_model, indent=2)}

    Provide:
    1. Root cause patterns
    2. Priority recommendations
    3. Suggested next steps"""

    # Call your preferred AI model here
    # response = ai_model.generate(prompt)
    return prompt  # In production, return AI response

Step 4: Create the Scorecard Output

Combine everything into an usable scorecard format:

def create_scorecard(results_path):
    results = load_dbt_results(results_path)
    metrics = calculate_quality_score(results)
    insights = generate_ai_insights(results, metrics)

    scorecard = {
        'generated_at': datetime.now().isoformat(),
        'metrics': metrics,
        'insights': insights,
        'models': group_by_model(results)
    }

    return scorecard

def group_by_model(results):
    models = {}
    for r in results:
        model = r['model']
        if model not in models:
            models[model] = {'tests': [], 'score': 0}
        models[model]['tests'].append(r)

    # Calculate per-model scores
    for model, data in models.items():
        total = len(data['tests'])
        passed = sum(1 for t in data['tests'] if t['status'] == 'pass')
        data['score'] = round((passed / total * 100), 2) if total > 0 else 0

    return models

AI Tooling Comparison for dbt Scorecard Generation

Different AI tools handle the scorecard generation task with varying strengths. Here’s how the leading options compare when used to analyze dbt test output and generate natural-language summaries:

Tool Strengths Context Limit Best For Cost
Claude (claude-opus-4) Nuanced failure analysis, clear narrative 200K tokens Large dbt projects with complex lineage Pay-per-use
GPT-4o Strong JSON reasoning, fast responses 128K tokens Real-time scorecard pipelines Pay-per-use
Gemini 1.5 Pro Very large context for big result files 1M tokens Mega-warehouses with thousands of tests Pay-per-use
Ollama + DeepSeek-Coder Fully offline, no data egress 32K tokens Air-gapped data warehouses Free (local)
Groq + Llama 3.1 70B Ultra-fast inference 128K tokens High-frequency CI/CD scoring Free tier + paid

For most analytics engineering teams, Claude or GPT-4o offer the best balance of reasoning quality and context handling. If you run dbt in a regulated environment where test results contain sensitive schema metadata, running a local model via Ollama eliminates data egress risk entirely.

Real-World Workflow: CI/CD Integration

The most effective scorecard systems run automatically after every dbt test execution. Here’s a complete GitHub Actions workflow:

name: dbt Quality Scorecard

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 8 * * *'  # Daily at 8am UTC

jobs:
  dbt-test-scorecard:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup dbt
        run: pip install dbt-bigquery dbt-artifacts-parser

      - name: Run dbt tests
        run: |
          dbt deps
          dbt test --output json 2>&1 | tee dbt_results.json

      - name: Generate AI scorecard
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python scripts/generate_scorecard.py --input dbt_results.json --output scorecard.json

      - name: Post scorecard to Slack
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: python scripts/notify_slack.py --scorecard scorecard.json

To post a formatted scorecard to Slack, use the Block Kit format with color-coded severity:

def post_to_slack(scorecard, webhook_url):
    import requests

    score = scorecard['metrics']['quality_score']
    color = "#36a64f" if score >= 95 else "#ff9900" if score >= 85 else "#ff0000"

    payload = {
        "attachments": [{
            "color": color,
            "title": f"Data Quality Report — {scorecard['generated_at'][:10]}",
            "fields": [
                {"title": "Quality Score", "value": f"{score}%", "short": True},
                {"title": "Failed Tests", "value": str(scorecard['metrics']['failed']), "short": True},
                {"title": "AI Insight", "value": scorecard['insights'][:300]}
            ]
        }]
    }

    requests.post(webhook_url, json=payload)

Performance Benchmarks: Manual vs AI-Assisted Scoring

Teams that have implemented AI-assisted dbt scorecard pipelines report measurable improvements over manual review workflows:

The biggest gains come from the AI’s ability to correlate failures. When not_null tests fail on three downstream models simultaneously, AI can recognize that all three depend on the same upstream source model and surface that root cause immediately.

Automating Your Pipeline

Integrate this scorecard into your dbt workflow by adding a post-hook in your dbt_project.yml:

models:
  my_project:
    +post-hook:
      - "python scripts/generate_scorecard.py"

You can also schedule daily or weekly runs and send results to Slack or email:

# Daily scorecard generation
0 9 * * * cd /path/to/project && dbt test --output json && python scripts/generate_scorecard.py

Interpreting Your Scorecard

When reviewing your scorecard, focus on these key indicators:

Use the AI-generated insights to move beyond simple pass/fail reporting. Instead of just knowing that tests failed, you’ll understand why they failed and what actions to take.

FAQ

Q: Can I use this approach with dbt Cloud instead of dbt Core? Yes. dbt Cloud exposes test results via its Admin API at /api/v2/accounts/{account_id}/runs/{run_id}/artifacts/. Fetch the run_results.json artifact and pass it to the same normalization functions.

Q: What dbt version does the JSON output format require? The structured JSON output format (--output json) is stable from dbt Core 1.0 onwards. The run_results.json schema is documented in dbt’s artifact specification and has been consistent across 1.x releases.

Q: How do I handle incremental test results across multiple dbt runs? Store each scorecard in a time-series database like TimescaleDB or append results to a BigQuery table partitioned by date. Query 30-day rolling averages to detect gradual degradation that individual run comparisons miss.

Q: My dbt project has 2,000+ tests. Will the AI prompt exceed context limits? Aggregate failure data before sending it to the AI. Instead of passing all 2,000 test results, send a grouped summary (failures by model, test type distribution, top 10 failing models). This keeps prompts under 5,000 tokens while preserving the signal the AI needs.

Built by theluckystrike — More at zovo.one