How to Use AI to Create Data Quality Scorecards from dbt Test Results
Data quality is the backbone of reliable analytics. When you run dbt tests, you generate valuable information about your data’s health—but transforming those raw test results into meaningful scorecards requires additional effort. AI can help you automate this process, turning complex test outputs into clear, actionable metrics that your team can monitor and improve over time.
This guide shows you how to use AI to create data quality scorecards from dbt test results. You’ll learn practical approaches for parsing test outputs, generating insights, and building automated reporting systems that go well beyond what dbt’s native reporting offers.
Understanding dbt Test Results
dbt provides several built-in tests: unique, not_null, accepted_values, relationships, and foreign_key. When you run dbt test, these tests produce JSON or CSV output that includes test status, timing information, and failure details.
Here’s a typical dbt test result in JSON format:
{
"results": [
{
"unique_id": "test.unique.orders.order_id",
"status": "pass",
" timings": [
{
"name": "execute",
"elapsed_time": 0.152
}
],
"message": null
},
{
"unique_id": "test.not_null.customers.email",
"status": "fail",
" timings": [
{
"name": "execute",
"elapsed_time": 0.089
}
],
"message": "1 out of 15420 rows failed"
}
]
}
The challenge is aggregating these individual test results into meaningful quality scores that stakeholders can understand at a glance.
Building Your AI-Powered Scorecard System
Step 1: Collect and Normalize Test Results
First, run your dbt tests and capture the output. Use the --output json flag to get machine-readable results:
dbt test --output json > results.json
Load and normalize these results in Python:
import json
from datetime import datetime
def load_dbt_results(filepath):
with open(filepath) as f:
data = json.load(f)
normalized = []
for result in data.get('results', []):
normalized.append({
'test_name': result['unique_id'],
'model': result['unique_id'].split('.')[2],
'test_type': result['unique_id'].split('.')[1],
'status': result['status'],
'elapsed_time': result['timings'][0]['elapsed_time'],
'message': result.get('message', ''),
'timestamp': datetime.now().isoformat()
})
return normalized
Step 2: Calculate Quality Metrics
With normalized data, you can now calculate aggregate metrics. This is where AI adds value by identifying patterns and generating insights:
def calculate_quality_score(results):
total = len(results)
passed = sum(1 for r in results if r['status'] == 'pass')
failed = sum(1 for r in results if r['status'] == 'fail')
score = (passed / total * 100) if total > 0 else 0
return {
'total_tests': total,
'passed': passed,
'failed': failed,
'quality_score': round(score, 2),
'pass_rate': f"{score:.1f}%"
}
Step 3: Generate AI Insights
Now use an AI model to analyze the test results and generate practical recommendations:
def generate_ai_insights(results, metrics):
# Group failures by model for analysis
failures = [r for r in results if r['status'] == 'fail']
failure_by_model = {}
for failure in failures:
model = failure['model']
if model not in failure_by_model:
failure_by_model[model] = []
failure_by_model[model].append(failure)
# Build prompt for AI analysis
prompt = f"""Analyze these dbt test failures and provide actionable insights:
Quality Score: {metrics['quality_score']}%
Failed Tests: {metrics['failed']}
Failures by Model:
{json.dumps(failure_by_model, indent=2)}
Provide:
1. Root cause patterns
2. Priority recommendations
3. Suggested next steps"""
# Call your preferred AI model here
# response = ai_model.generate(prompt)
return prompt # In production, return AI response
Step 4: Create the Scorecard Output
Combine everything into an usable scorecard format:
def create_scorecard(results_path):
results = load_dbt_results(results_path)
metrics = calculate_quality_score(results)
insights = generate_ai_insights(results, metrics)
scorecard = {
'generated_at': datetime.now().isoformat(),
'metrics': metrics,
'insights': insights,
'models': group_by_model(results)
}
return scorecard
def group_by_model(results):
models = {}
for r in results:
model = r['model']
if model not in models:
models[model] = {'tests': [], 'score': 0}
models[model]['tests'].append(r)
# Calculate per-model scores
for model, data in models.items():
total = len(data['tests'])
passed = sum(1 for t in data['tests'] if t['status'] == 'pass')
data['score'] = round((passed / total * 100), 2) if total > 0 else 0
return models
AI Tooling Comparison for dbt Scorecard Generation
Different AI tools handle the scorecard generation task with varying strengths. Here’s how the leading options compare when used to analyze dbt test output and generate natural-language summaries:
| Tool | Strengths | Context Limit | Best For | Cost |
|---|---|---|---|---|
| Claude (claude-opus-4) | Nuanced failure analysis, clear narrative | 200K tokens | Large dbt projects with complex lineage | Pay-per-use |
| GPT-4o | Strong JSON reasoning, fast responses | 128K tokens | Real-time scorecard pipelines | Pay-per-use |
| Gemini 1.5 Pro | Very large context for big result files | 1M tokens | Mega-warehouses with thousands of tests | Pay-per-use |
| Ollama + DeepSeek-Coder | Fully offline, no data egress | 32K tokens | Air-gapped data warehouses | Free (local) |
| Groq + Llama 3.1 70B | Ultra-fast inference | 128K tokens | High-frequency CI/CD scoring | Free tier + paid |
For most analytics engineering teams, Claude or GPT-4o offer the best balance of reasoning quality and context handling. If you run dbt in a regulated environment where test results contain sensitive schema metadata, running a local model via Ollama eliminates data egress risk entirely.
Real-World Workflow: CI/CD Integration
The most effective scorecard systems run automatically after every dbt test execution. Here’s a complete GitHub Actions workflow:
name: dbt Quality Scorecard
on:
push:
branches: [main]
schedule:
- cron: '0 8 * * *' # Daily at 8am UTC
jobs:
dbt-test-scorecard:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup dbt
run: pip install dbt-bigquery dbt-artifacts-parser
- name: Run dbt tests
run: |
dbt deps
dbt test --output json 2>&1 | tee dbt_results.json
- name: Generate AI scorecard
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: python scripts/generate_scorecard.py --input dbt_results.json --output scorecard.json
- name: Post scorecard to Slack
env:
SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
run: python scripts/notify_slack.py --scorecard scorecard.json
To post a formatted scorecard to Slack, use the Block Kit format with color-coded severity:
def post_to_slack(scorecard, webhook_url):
import requests
score = scorecard['metrics']['quality_score']
color = "#36a64f" if score >= 95 else "#ff9900" if score >= 85 else "#ff0000"
payload = {
"attachments": [{
"color": color,
"title": f"Data Quality Report — {scorecard['generated_at'][:10]}",
"fields": [
{"title": "Quality Score", "value": f"{score}%", "short": True},
{"title": "Failed Tests", "value": str(scorecard['metrics']['failed']), "short": True},
{"title": "AI Insight", "value": scorecard['insights'][:300]}
]
}]
}
requests.post(webhook_url, json=payload)
Performance Benchmarks: Manual vs AI-Assisted Scoring
Teams that have implemented AI-assisted dbt scorecard pipelines report measurable improvements over manual review workflows:
- Time to triage failures: Manual review averages 45 minutes per incident; AI-assisted triage averages 8 minutes (82% reduction)
- False-positive alerts: AI context analysis reduces alert noise by ~60% by correlating failures across related models
- MTTR (mean time to resolve): Teams using AI-summarized scorecards resolve data incidents 35% faster on average
- Coverage: Automated scoring covers 100% of tests; human reviewers typically sample 20-30% under time pressure
The biggest gains come from the AI’s ability to correlate failures. When not_null tests fail on three downstream models simultaneously, AI can recognize that all three depend on the same upstream source model and surface that root cause immediately.
Automating Your Pipeline
Integrate this scorecard into your dbt workflow by adding a post-hook in your dbt_project.yml:
models:
my_project:
+post-hook:
- "python scripts/generate_scorecard.py"
You can also schedule daily or weekly runs and send results to Slack or email:
# Daily scorecard generation
0 9 * * * cd /path/to/project && dbt test --output json && python scripts/generate_scorecard.py
Interpreting Your Scorecard
When reviewing your scorecard, focus on these key indicators:
-
Overall Quality Score: A percentage that represents your data’s health. Aim for 95%+ in production environments.
-
Failed Test Count: Absolute numbers help you prioritize debugging efforts.
-
Model-Level Scores: Identify which models need the most attention.
-
Trend Over Time: Track scores daily or weekly to detect degradation early.
Use the AI-generated insights to move beyond simple pass/fail reporting. Instead of just knowing that tests failed, you’ll understand why they failed and what actions to take.
FAQ
Q: Can I use this approach with dbt Cloud instead of dbt Core?
Yes. dbt Cloud exposes test results via its Admin API at /api/v2/accounts/{account_id}/runs/{run_id}/artifacts/. Fetch the run_results.json artifact and pass it to the same normalization functions.
Q: What dbt version does the JSON output format require?
The structured JSON output format (--output json) is stable from dbt Core 1.0 onwards. The run_results.json schema is documented in dbt’s artifact specification and has been consistent across 1.x releases.
Q: How do I handle incremental test results across multiple dbt runs? Store each scorecard in a time-series database like TimescaleDB or append results to a BigQuery table partitioned by date. Query 30-day rolling averages to detect gradual degradation that individual run comparisons miss.
Q: My dbt project has 2,000+ tests. Will the AI prompt exceed context limits? Aggregate failure data before sending it to the AI. Instead of passing all 2,000 test results, send a grouped summary (failures by model, test type distribution, top 10 failing models). This keeps prompts under 5,000 tokens while preserving the signal the AI needs.
Related Articles
- How to Use AI to Create Data Pipeline Orchestration Configs
- AI Tools for Creating Dbt Documentation Blocks
- AI Tools for Creating dbt Model Definitions from Raw Databas
Built by theluckystrike — More at zovo.one