AI tools can automatically generate dbt documentation blocks by analyzing your project’s column-level lineage and transformation logic. These tools parse dbt manifests, trace column dependencies across models, and generate YAML descriptions that keep documentation synchronized with your code. Combining pattern-based generation with lineage-aware context produces accurate column descriptions at scale without manual maintenance.
Understanding dbt Documentation Blocks
dbt provides a built-in documentation system that relies on YAML-based description files. These files define descriptions for models, columns, and their relationships. A typical documentation block structure looks like this:
version: 2
models:
- name: dim_customers
description: "Customer dimension table aggregating data from multiple source systems"
columns:
- name: customer_id
description: "Primary key identifying unique customers"
tests:
- unique
- not_null
- name: created_at
description: "Timestamp when the customer record was first created"
The challenge emerges when you need to trace how each column flows through your transformation pipeline. Column-level lineage shows you the complete journey—from source table through intermediate transforms to final downstream models. In large projects with dozens of staging, intermediate, and mart layers, tracking this manually becomes error-prone and unsustainable.
How AI Tools Extract Column Lineage
Several AI-powered approaches can analyze your dbt project and extract column-level lineage. These tools typically work by parsing your SQL model files and identifying source references.
Parsing Source References
Modern AI tools can parse dbt’s ref() and source() functions to build lineage graphs:
# Example: Extracting column lineage from dbt manifest
import json
def extract_column_lineage(manifest_path):
with open(manifest_path) as f:
manifest = json.load(f)
lineage = {}
for node_name, node in manifest['nodes'].items():
if node['resource_type'] == 'model':
# Extract source columns referenced in this model
sources = node.get('depends_on', {}).get('nodes', [])
lineage[node_name] = {
'sources': sources,
'columns': extract_column_references(node['raw_sql'])
}
return lineage
This extracted lineage then serves as the foundation for AI-generated documentation.
SQL Parsing for Column-Level Granularity
Node-level lineage (which model depends on which model) is only the starting point. True column-level lineage requires parsing the SQL within each model to understand exactly how output columns are derived:
import sqlglot
def extract_column_references(sql: str, dialect: str = "bigquery") -> dict:
"""Parse SQL to extract column-level derivation information."""
parsed = sqlglot.parse_one(sql, dialect=dialect)
column_map = {}
for col in parsed.find_all(sqlglot.expressions.Alias):
output_col = col.alias
source_expr = col.this
column_map[output_col] = {
"expression": source_expr.sql(),
"direct_reference": isinstance(source_expr, sqlglot.expressions.Column),
"source_column": source_expr.name if hasattr(source_expr, "name") else None,
}
return column_map
Using sqlglot gives you dialect-aware parsing that handles BigQuery, Snowflake, and Redshift syntax differences without requiring a live database connection.
AI-Powered Documentation Generation Approaches
Pattern-Based Generation
One effective approach uses AI to analyze column names, data types, and surrounding context to generate meaningful descriptions:
def generate_column_description(column_name, data_type, sample_values):
"""AI generates descriptions based on column patterns"""
prompt = f"""
Generate a concise dbt column description for:
- Column name: {column_name}
- Data type: {data_type}
- Sample values: {sample_values[:3]}
Return only the description text, no formatting.
"""
# Call your preferred AI API here
response = ai_client.complete(prompt)
return response.text
This pattern recognition becomes particularly valuable when you have consistent naming conventions across your organization. A column named arr_usd_monthly in a SaaS context reliably maps to “Monthly recurring revenue in USD,” and a well-prompted AI will infer this from name alone.
Lineage-Aware Documentation
AI tools with full lineage awareness can generate more accurate descriptions by understanding upstream transformations:
# AI-generated documentation with lineage context
models:
- name: fact_orders
description: "Order fact table with derived metrics from source and staging layers"
columns:
- name: order_id
description: "Primary key sourced from raw orders table, no transformation applied"
meta:
lineage:
source: staging.stg_orders.order_id
transform: "Direct passthrough"
- name: gross_revenue_usd
description: "Order subtotal converted to USD using daily exchange rates from staging.stg_fx_rates"
meta:
lineage:
source: staging.stg_orders.subtotal_local
transform: "Multiplied by stg_fx_rates.usd_rate for the order date"
The lineage context helps documentation readers understand not just what a column represents, but where it originates and how it was transformed.
Practical Implementation Strategies
Integrating with Your dbt Workflow
You can integrate AI documentation generation into your existing CI/CD pipeline:
# Add to your Makefile or dbt run script
generate-docs:
dbt compile --profiles-dir . > /dev/null
dbt ls --resource-type model --output json > models.json
python scripts/ai_doc_generator.py \
--manifest target/manifest.json \
--catalog target/catalog.json \
--output models/
dbt docs generate
This automation ensures documentation stays current whenever you run make generate-docs before a release.
Using dbt Artifacts
The manifest.json and catalog.json artifacts generated by dbt contain valuable metadata:
# Extract column metadata from dbt artifacts
import json
def get_column_metadata(project_path: str) -> dict:
with open(f"{project_path}/target/manifest.json") as f:
manifest = json.load(f)
with open(f"{project_path}/target/catalog.json") as f:
catalog = json.load(f)
columns = {}
for unique_id, node in manifest['nodes'].items():
if node['resource_type'] != 'model':
continue
catalog_node = catalog['nodes'].get(unique_id, {})
columns[node['name']] = {
'description': node.get('description', ''),
'columns': {
col_name: {
'type': catalog_node.get('columns', {}).get(col_name, {}).get('type', 'unknown'),
'described': bool(col_data.get('description', '').strip()),
}
for col_name, col_data in node.get('columns', {}).items()
}
}
return columns
This metadata becomes the input for AI documentation generation. The described flag lets you skip columns that already have human-written documentation, ensuring AI only fills genuine gaps.
Batch Generation with Rate Limiting
For large projects with hundreds of models, you need to handle API rate limits gracefully:
import time
import anthropic
client = anthropic.Anthropic()
def generate_docs_for_model(model_name: str, columns: list, lineage_context: dict) -> str:
context_str = "\n".join([
f" - {col}: derived from {info.get('source', 'unknown')}"
for col, info in lineage_context.items()
])
prompt = f"""Generate dbt YAML documentation for the model '{model_name}'.
Column lineage context:
{context_str}
Output only valid YAML in dbt schema.yml format. Include a model-level description
and a column description for each column. Keep descriptions concise (one sentence each).
"""
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}]
)
return response.content[0].text
def batch_generate(models: list, delay_seconds: float = 0.5) -> dict:
results = {}
for i, model in enumerate(models):
results[model['name']] = generate_docs_for_model(
model['name'],
model['columns'],
model.get('lineage', {})
)
if i < len(models) - 1:
time.sleep(delay_seconds)
return results
Best Practices for AI-Generated Documentation
Review and validate before committing. AI-generated descriptions provide an excellent starting point, but domain expertise catches inaccuracies. A column named net_arr might mean “net annual recurring revenue” in your business context, but an AI trained on general data might describe it more generically. A five-minute review pass before every merge is sufficient for most projects.
Establish and share documentation standards. Provide AI tools with a style guide in the prompt. “Descriptions should be one sentence, present tense, and avoid jargon unless the column name already implies it” produces much more consistent output than an open-ended request.
Keep generated docs in version control. Store your schema.yml files alongside your dbt models. This creates an audit trail, enables collaborative review through pull requests, and means documentation diffs are visible in code review.
Supplement with business context where needed. AI excels at technical descriptions derived from column names and types. It cannot know that opportunity_stage_id = 6 means “Closed Won” in your CRM, or that arr_expansion excludes seat upgrades per a finance team decision. Add these business rules manually.
Measuring Documentation Quality
Track documentation coverage across your project to know where gaps exist:
-- Calculate documentation completeness from dbt metadata tables
-- Run after: dbt docs generate
SELECT
COUNT(DISTINCT m.name) AS total_models,
SUM(CASE WHEN m.description IS NOT NULL AND m.description != '' THEN 1 ELSE 0 END) AS documented_models,
COUNT(DISTINCT c.column_name) AS total_columns,
SUM(CASE WHEN c.description IS NOT NULL AND c.description != '' THEN 1 ELSE 0 END) AS documented_columns,
ROUND(
100.0 * SUM(CASE WHEN c.description IS NOT NULL AND c.description != '' THEN 1 ELSE 0 END)
/ NULLIF(COUNT(DISTINCT c.column_name), 0),
1
) AS column_coverage_pct
FROM information_schema.columns c
JOIN dbt_metadata.models m ON c.table_name = m.name
Aim for 100% model-level coverage first, then push column coverage above 80% before expanding AI generation to new domains.
Tools Worth Exploring
dbt-osmosis is the most mature open-source tool for propagating and generating dbt column documentation. It understands column inheritance across models and can fill in missing descriptions using upstream sources.
Elementary provides observability tooling that includes documentation coverage tracking as part of its data quality monitoring suite.
Custom LLM scripts using the Anthropic or OpenAI APIs give you full control over prompt design and output formatting, which is often worth the additional setup time when your project has unusual conventions.
The best choice depends on your team’s existing tooling, the size of your dbt project, and how much control you need over the generated output style.
Moving Forward
Automating dbt documentation through AI and column-level lineage analysis reduces manual effort while improving consistency. Start small—pick a single domain or mart layer—and iterate on your prompts until the output quality meets your standards. Once patterns mature, expand coverage across your project incrementally.
The initial investment in setting up automated documentation pays dividends through improved data discoverability, faster onboarding for new analysts, and reduced time spent answering “what does this column mean?” questions in Slack.
Related Reading
- AI Tools for Creating dbt Model Definitions from Raw Databas
- Cheapest AI Tool With GPT-4 Level Code Generation 2026
- How Accurate Are AI Tools for Rust Unsafe Code Blocks
- How to Use AI to Generate Terraform Import Blocks for
- AI Tools for Generating dbt Project Structure from Existing
Built by theluckystrike — More at zovo.one