AI Tools for Creating dbt Documentation Blocks

AI tools can generate dbt documentation automatically by analyzing your project’s column-level lineage and transformation logic. These tools parse the dbt manifest, trace column dependencies across models, and generate YAML descriptions that stay close to your code. Combining pattern-based generation with lineage-aware context produces useful first-draft column descriptions at scale and removes most of the manual maintenance, though the output still needs human review.

Understanding dbt Documentation Blocks

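dbt provides a built-in documentation system based on YAML properties files. These files hold descriptions for models and their columns, alongside tests and other metadata. (dbt also supports reusable {% docs %} blocks in Markdown files, referenced from YAML with the doc() function; this article focuses on the YAML descriptions.) A typical structure looks like this: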

version: 2

models:
  - name: dim_customers
    description: "Customer dimension table aggregating data from multiple source systems"
    columns:
      - name: customer_id
        description: "Primary key identifying unique customers"
        tests:
          - unique
          - not_null
      - name: created_at
        description: "Timestamp when the customer record was first created"

The challenge emerges when you need to trace how each column flows through your transformation pipeline. Column-level lineage shows the complete journey of a column: from the source table, through intermediate transforms, to the final downstream models. In large projects with dozens of models spread across staging, intermediate, and mart layers, tracking this manually becomes error-prone and unsustainable.

How AI Tools Extract Column Lineage

Several AI-powered approaches can analyze your dbt project and extract column-level lineage. These tools typically work by parsing your SQL model files and identifying source references.

Parsing Source References

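dbt resolves every ref() and source() call when it parses the project and records the result in manifest.json under each node’s depends_on key. Tools can read that graph directly instead of re-parsing the Jinja: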

# Example: Extracting column lineage from dbt manifest
import json

def extract_column_lineage(manifest_path):
    with open(manifest_path) as f:
        manifest = json.load(f)

    lineage = {}
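    for node_name, node in manifest['nodes'].items():
        if node['resource_type'] == 'model':
            # Model-level parents that dbt resolved from ref() and source() calls
            sources = node.get('depends_on', {}).get('nodes', [])
            # Prefer compiled SQL (Jinja already rendered). The keys are
            # 'compiled_code'/'raw_code' in dbt 1.3+ and 'compiled_sql'/'raw_sql' before that.
            sql = (node.get('compiled_code') or node.get('compiled_sql')
                   or node.get('raw_code') or node.get('raw_sql') or '')
            lineage[node_name] = {
                'sources': sources,
                # extract_column_references is defined in the next section
                'columns': extract_column_references(sql) if sql else {}
            }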

    return lineage

This extracted lineage then serves as the foundation for AI-generated documentation.
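To illustrate what the node-level part of that lineage can be used for, here is a minimal sketch that walks the depends_on entries to collect every upstream node of a model. The upstream_nodes helper and the sample manifest fragment are illustrative assumptions, not part of dbt; only the nodes / depends_on / nodes layout comes from manifest.json.

```python
def upstream_nodes(manifest: dict, unique_id: str) -> set:
    """Return every node that `unique_id` depends on, directly or transitively."""
    nodes = manifest.get("nodes", {})
    seen = set()
    stack = [unique_id]
    while stack:
        current = stack.pop()
        # Sources are not stored under "nodes", so they simply have no parents here
        parents = nodes.get(current, {}).get("depends_on", {}).get("nodes", [])
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hand-written fragment shaped like manifest.json (illustrative only)
sample_manifest = {
    "nodes": {
        "model.shop.fact_orders": {
            "depends_on": {"nodes": ["model.shop.stg_orders", "model.shop.stg_fx_rates"]}
        },
        "model.shop.stg_orders": {"depends_on": {"nodes": ["source.shop.raw.orders"]}},
        "model.shop.stg_fx_rates": {"depends_on": {"nodes": ["source.shop.raw.fx_rates"]}},
    }
}

ancestors = upstream_nodes(sample_manifest, "model.shop.fact_orders")
```

The resulting set can be passed to the documentation prompt so the model knows which upstream tables a description may refer to.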

SQL Parsing for Column-Level Granularity

Node-level lineage (which model depends on which model) is only the starting point. True column-level lineage requires parsing the SQL within each model to understand exactly how output columns are derived:

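import sqlglot
from sqlglot import exp

def extract_column_references(sql: str, dialect: str = "bigquery") -> dict:
    """Parse compiled SQL and map each output column to the expression that produces it."""
    # The SQL must already be compiled: sqlglot cannot parse Jinja such as {{ ref(...) }}
    parsed = sqlglot.parse_one(sql, dialect=dialect)
    column_map = {}

    # Inspect only the outermost SELECT list. Aliases inside CTEs or
    # subqueries are not output columns of the model.
    for projection in parsed.selects:
        output_col = projection.alias_or_name
        source_expr = projection.unalias()
        is_direct = isinstance(source_expr, exp.Column)

        column_map[output_col] = {
            "expression": source_expr.sql(dialect=dialect),
            "direct_reference": is_direct,
            # Only a bare column reference has a single, unambiguous source column
            "source_column": source_expr.name if is_direct else None,
        }

    return column_map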

Using sqlglot gives you dialect-aware parsing that handles BigQuery, Snowflake, and Redshift syntax differences without requiring a live database connection.
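Once every model has a column map, tracing a column back to its origin comes down to following references from model to model. The sketch below shows one way to do that. It assumes a hypothetical column_maps structure in which each entry also records a source_model key; the parser above does not resolve that on its own, so treat the key as something you would add by matching table aliases to upstream models.

```python
def trace_column(column_maps: dict, model: str, column: str) -> list:
    """Follow a column upstream and return the path as (model, column) pairs."""
    path = [(model, column)]
    visited = {(model, column)}
    while True:
        info = column_maps.get(model, {}).get(column)
        if not info or not info.get("source_model"):
            break  # reached a source table or an unmapped column
        model, column = info["source_model"], info["source_column"]
        if (model, column) in visited:
            break  # guard against accidental cycles in hand-built maps
        visited.add((model, column))
        path.append((model, column))
    return path


# Hypothetical per-model column maps (illustrative only)
column_maps = {
    "fact_orders": {
        "order_id": {"source_model": "stg_orders", "source_column": "order_id"},
    },
    "stg_orders": {
        "order_id": {"source_model": "raw_orders", "source_column": "id"},
    },
}

path = trace_column(column_maps, "fact_orders", "order_id")
```

The returned path gives the documentation generator an ordered list of hops it can summarize in a description.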

AI-Powered Documentation Generation Approaches

Pattern-Based Generation

One effective approach uses AI to analyze column names, data types, and surrounding context to generate meaningful descriptions:

def generate_column_description(ai_client, column_name, data_type, sample_values):
    """Ask an LLM for a description based on the column's name, type, and sample values."""
    prompt = f"""
    Generate a concise dbt column description for:
    - Column name: {column_name}
    - Data type: {data_type}
    - Sample values: {sample_values[:3]}

    Return only the description text, no formatting.
    """
    # ai_client is a placeholder for your preferred AI SDK; swap in its real call here
    response = ai_client.complete(prompt)
    return response.text.strip()

This pattern recognition becomes particularly valuable when you have consistent naming conventions across your organization. A column named mrr_usd in a SaaS context almost always means “Monthly recurring revenue in USD,” and a well-prompted AI will usually infer this from the name alone.
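One way to make naming conventions explicit is to expand known name tokens into hints before building the prompt. The following is a minimal sketch of that idea; the TOKEN_HINTS glossary and the naming_hints function are hypothetical and should be replaced with your organization’s own conventions.

```python
# Hypothetical token glossary; replace with your organization's conventions
TOKEN_HINTS = {
    "id": "identifier",
    "at": "timestamp",
    "usd": "amount in US dollars",
    "mrr": "monthly recurring revenue",
    "arr": "annual recurring revenue",
    "pct": "percentage",
}


def naming_hints(column_name: str) -> list:
    """Return glossary hints for each recognized token in a snake_case column name."""
    tokens = column_name.lower().split("_")
    return [f"{token} = {TOKEN_HINTS[token]}" for token in tokens if token in TOKEN_HINTS]


hints = naming_hints("mrr_usd")
```

Appending these hints to the prompt gives the model the same glossary a human reviewer would use, which tends to reduce guesses on ambiguous abbreviations.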

Lineage-Aware Documentation

AI tools with full lineage awareness can generate more accurate descriptions by understanding upstream transformations:

# AI-generated documentation with lineage context
models:
  - name: fact_orders
    description: "Order fact table with derived metrics from source and staging layers"
    columns:
      - name: order_id
        description: "Primary key passed through unchanged from stg_orders"
        meta:
          lineage:
            source: staging.stg_orders.order_id
            transform: "Direct passthrough"
      - name: gross_revenue_usd
        description: "Order subtotal converted to USD using daily exchange rates from staging.stg_fx_rates"
        meta:
          lineage:
            source: staging.stg_orders.subtotal_local
            transform: "Multiplied by stg_fx_rates.usd_rate for the order date"

The lineage context helps documentation readers understand not just what a column represents, but where it originates and how it was transformed.
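Entries like these can be assembled programmatically before being written out. The sketch below builds the same nested structure as plain Python dictionaries. The helper names are assumptions made for illustration, and the meta.lineage layout simply mirrors the YAML above rather than any dbt standard.

```python
def build_column_entry(name: str, description: str, source: str, transform: str) -> dict:
    """Build one schema.yml column entry with lineage stored under `meta`."""
    return {
        "name": name,
        "description": description,
        "meta": {"lineage": {"source": source, "transform": transform}},
    }


def build_model_entry(model_name: str, description: str, columns: list) -> dict:
    """Wrap column entries in the structure dbt expects under `models:`."""
    return {
        "version": 2,
        "models": [{"name": model_name, "description": description, "columns": columns}],
    }


entry = build_model_entry(
    "fact_orders",
    "Order fact table with derived metrics from source and staging layers",
    [
        build_column_entry(
            "order_id",
            "Primary key passed through unchanged from stg_orders",
            "staging.stg_orders.order_id",
            "Direct passthrough",
        )
    ],
)
```

The dictionary can then be serialized with a YAML library, for example PyYAML’s yaml.safe_dump(entry, sort_keys=False), assuming that package is installed.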

Practical Implementation Strategies

Integrating with Your dbt Workflow

You can integrate AI documentation generation into your existing CI/CD pipeline:

# Add to your Makefile (recipe lines must start with a tab) or dbt run script
generate-docs:
    # Build manifest.json and catalog.json first so the generator can read both
    dbt docs generate --profiles-dir .
    python scripts/ai_doc_generator.py \
        --manifest target/manifest.json \
        --catalog target/catalog.json \
        --output models/
    # Regenerate so the docs site picks up the new descriptions
    dbt docs generate --profiles-dir .

This keeps documentation current as long as make generate-docs runs before each release or as a step in CI.
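The target above calls scripts/ai_doc_generator.py without showing its contents. As a hedged sketch of what its command-line entry point might look like, the following uses argparse with the same three flags. The function names, help text, and default output directory are assumptions, and the generation step itself is left as a placeholder.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Define the flags used by the generate-docs target."""
    parser = argparse.ArgumentParser(description="Generate dbt YAML descriptions from artifacts")
    parser.add_argument("--manifest", type=Path, required=True, help="Path to target/manifest.json")
    parser.add_argument("--catalog", type=Path, required=True, help="Path to target/catalog.json")
    parser.add_argument("--output", type=Path, default=Path("models"), help="Directory for generated YAML")
    return parser


def main(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Placeholder: load the artifacts, call the model, and write YAML under args.output
    return args


# Example invocation with explicit arguments instead of sys.argv
args = main(["--manifest", "target/manifest.json", "--catalog", "target/catalog.json"])
```

Passing argv explicitly keeps the entry point easy to test without touching the real command line.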

Using dbt Artifacts

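The manifest.json and catalog.json artifacts generated by dbt contain complementary metadata. The manifest holds the descriptions and dependencies declared in your project, while the catalog (written by dbt docs generate) holds the column names and types found in the warehouse: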

# Extract column metadata from dbt artifacts
import json

def get_column_metadata(project_path: str) -> dict:
    with open(f"{project_path}/target/manifest.json") as f:
        manifest = json.load(f)
    with open(f"{project_path}/target/catalog.json") as f:
        catalog = json.load(f)

    columns = {}
    for unique_id, node in manifest['nodes'].items():
        if node['resource_type'] != 'model':
            continue

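        # Lower-case the keys: catalog names can differ in case from YAML names
        catalog_cols = {
            name.lower(): data
            for name, data in catalog['nodes'].get(unique_id, {}).get('columns', {}).items()
        }
        manifest_cols = {name.lower(): data for name, data in node.get('columns', {}).items()}

        columns[node['name']] = {
            'description': node.get('description', ''),
            'columns': {
                col_name: {
                    'type': catalog_cols.get(col_name, {}).get('type', 'unknown'),
                    'described': bool((manifest_cols.get(col_name, {}).get('description') or '').strip()),
                }
                # Union, so warehouse columns missing from the YAML are reported too
                for col_name in sorted(set(catalog_cols) | set(manifest_cols))
            }
        }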

    return columns

This metadata becomes the input for AI documentation generation. The described flag lets you skip columns that already have human-written documentation, ensuring AI only fills genuine gaps.
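As a small illustration of how the described flag can drive generation, the sketch below filters the metadata down to the columns that still need descriptions. The columns_needing_docs helper and the sample data are assumptions; the input shape matches what get_column_metadata returns.

```python
def columns_needing_docs(metadata: dict) -> dict:
    """Map each model name to the columns whose `described` flag is False."""
    gaps = {}
    for model_name, model in metadata.items():
        missing = [
            name for name, col in model.get("columns", {}).items()
            if not col.get("described")
        ]
        if missing:
            gaps[model_name] = missing
    return gaps


# Illustrative metadata in the shape returned by get_column_metadata
sample_metadata = {
    "dim_customers": {
        "description": "Customer dimension",
        "columns": {
            "customer_id": {"type": "INT64", "described": True},
            "created_at": {"type": "TIMESTAMP", "described": False},
        },
    },
    "fact_orders": {
        "description": "",
        "columns": {"order_id": {"type": "INT64", "described": True}},
    },
}

gaps = columns_needing_docs(sample_metadata)
```

Only the models and columns in gaps need to be sent to the AI, which keeps human-written descriptions untouched and reduces API usage.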

Batch Generation with Rate Limiting

For large projects with hundreds of models, you need to handle API rate limits gracefully:

import time
import anthropic

client = anthropic.Anthropic()

def generate_docs_for_model(model_name: str, columns: list, lineage_context: dict) -> str:
    # One line per column name, with its upstream source when lineage is known
    context_str = "\n".join([
        f"  - {col}: derived from {lineage_context.get(col, {}).get('source', 'unknown')}"
        for col in columns
    ])

    prompt = f"""Generate dbt YAML documentation for the model '{model_name}'.

Column lineage context:
{context_str}

Output only valid YAML in dbt schema.yml format. Include a model-level description
and a column description for each column. Keep descriptions concise (one sentence each).
"""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text


def batch_generate(models: list, delay_seconds: float = 0.5) -> dict:
    results = {}
    for i, model in enumerate(models):
        results[model['name']] = generate_docs_for_model(
            model['name'],
            model['columns'],
            model.get('lineage', {})
        )
        if i < len(models) - 1:
            time.sleep(delay_seconds)
    return results
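A fixed delay spaces requests out but does not recover from a request that is actually rejected. A common complement is retrying with exponential backoff. The following is a generic sketch; the call_with_backoff name, its defaults, and the injectable sleep parameter are assumptions made for illustration rather than part of any SDK.

```python
import random
import time


def call_with_backoff(func, *args, retries: int = 4, base_delay: float = 1.0,
                      retry_on: tuple = (Exception,), sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff plus jitter on the given exceptions."""
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt: base, 2x base, 4x base, ... plus small jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

With the Anthropic SDK you would likely pass retry_on=(anthropic.RateLimitError,) and wrap the call to generate_docs_for_model inside batch_generate. Check your SDK version first, since the client may already retry some failures on its own.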

Best Practices for AI-Generated Documentation

Review and validate before committing. AI-generated descriptions provide a good starting point, but domain expertise catches inaccuracies. A column named net_arr might mean “annual recurring revenue net of churn and contraction” in your business, while a general-purpose model may describe it only as “net annual recurring revenue” and miss what is being netted out. A short review pass before each merge catches most of these cases.

Establish and share documentation standards. Provide AI tools with a style guide in the prompt. “Descriptions should be one sentence, present tense, and avoid jargon unless the column name already implies it” produces much more consistent output than an open-ended request.
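Style rules that go into the prompt can also be checked mechanically on the way out. The sketch below flags descriptions that are empty, too long, or longer than one sentence. The style_violations function and the MAX_LENGTH limit are hypothetical, and the sentence check is a simple heuristic that will misfire on abbreviations such as “e.g.” followed by a space.

```python
import re

MAX_LENGTH = 160  # hypothetical limit; tune to your style guide


def style_violations(description: str) -> list:
    """Return a list of style-guide problems found in one column description."""
    text = description.strip()
    if not text:
        return ["empty description"]
    problems = []
    if len(text) > MAX_LENGTH:
        problems.append("too long")
    # Sentence-ending punctuation followed by more text suggests a second sentence
    if re.search(r"[.!?]\s+\S", text):
        problems.append("more than one sentence")
    return problems
```

Descriptions that return a non-empty list can be sent back to the model with the violations appended to the prompt, or routed to a human reviewer.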

Keep generated docs in version control. Store your schema.yml files alongside your dbt models. This creates an audit trail, enables collaborative review through pull requests, and means documentation diffs are visible in code review.

Supplement with business context where needed. AI excels at technical descriptions derived from column names and types. It cannot know that opportunity_stage_id = 6 means “Closed Won” in your CRM, or that arr_expansion excludes seat upgrades per a finance team decision. Add these business rules manually.
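One simple way to keep hand-written business rules from being overwritten is to merge them over the generated text so the manual entry always wins. The following is a minimal sketch; the merge_descriptions name and the sample descriptions are assumptions made for illustration.

```python
def merge_descriptions(generated: dict, manual: dict) -> dict:
    """Combine AI-generated and hand-written descriptions; hand-written text wins."""
    merged = dict(generated)
    for column, text in manual.items():
        # Ignore blank manual entries so they do not erase a generated description
        if text and text.strip():
            merged[column] = text.strip()
    return merged


generated = {
    "opportunity_stage_id": "Identifier of the opportunity stage",
    "arr_expansion": "Annual recurring revenue from expansions",
}
manual = {
    "opportunity_stage_id": "CRM stage identifier; 6 means Closed Won",
    "arr_expansion": "",
}

final = merge_descriptions(generated, manual)
```

Keeping the manual overrides in their own file under version control makes it clear which descriptions carry business decisions and should not be regenerated.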

Measuring Documentation Quality

Track documentation coverage across your project to know where gaps exist:

-- Calculate documentation completeness from dbt metadata tables.
-- Assumption: manifest metadata has been loaded into two hypothetical tables,
--   dbt_metadata.models (name, description) and
--   dbt_metadata.columns (model_name, column_name, description).
-- dbt does not create these tables itself; adjust the names to your setup.
WITH model_stats AS (
    SELECT
        COUNT(*) AS total_models,
        SUM(CASE WHEN description IS NOT NULL AND description != '' THEN 1 ELSE 0 END) AS documented_models
    FROM dbt_metadata.models
),
column_stats AS (
    SELECT
        COUNT(*) AS total_columns,
        SUM(CASE WHEN description IS NOT NULL AND description != '' THEN 1 ELSE 0 END) AS documented_columns
    FROM dbt_metadata.columns
)
SELECT
    m.total_models,
    m.documented_models,
    c.total_columns,
    c.documented_columns,
    ROUND(100.0 * c.documented_columns / NULLIF(c.total_columns, 0), 1) AS column_coverage_pct
FROM model_stats m
CROSS JOIN column_stats c

Aim for 100% model-level coverage first, then push column coverage above 80% before expanding AI generation to new domains.
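If you would rather not load metadata into the warehouse, the same numbers can be computed directly from manifest.json. The sketch below is one way to do it; the documentation_coverage function and the sample manifest fragment are assumptions. It only counts columns declared in YAML, because the manifest does not list warehouse columns that were never declared.

```python
def documentation_coverage(manifest: dict) -> dict:
    """Compute model and column description coverage from a parsed manifest.json."""
    models = [
        node for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    ]
    documented_models = sum(1 for m in models if (m.get("description") or "").strip())
    columns = [col for m in models for col in m.get("columns", {}).values()]
    documented_columns = sum(1 for c in columns if (c.get("description") or "").strip())

    def pct(part: int, whole: int) -> float:
        # Avoid division by zero for empty projects
        return round(100.0 * part / whole, 1) if whole else 0.0

    return {
        "total_models": len(models),
        "documented_models": documented_models,
        "model_coverage_pct": pct(documented_models, len(models)),
        "total_columns": len(columns),
        "documented_columns": documented_columns,
        "column_coverage_pct": pct(documented_columns, len(columns)),
    }


# Hand-written fragment shaped like manifest.json (illustrative only)
sample_manifest = {
    "nodes": {
        "model.shop.dim_customers": {
            "resource_type": "model",
            "description": "Customer dimension",
            "columns": {
                "customer_id": {"description": "Primary key"},
                "created_at": {"description": ""},
            },
        },
        "model.shop.fact_orders": {
            "resource_type": "model",
            "description": "",
            "columns": {"order_id": {"description": "Primary key"}},
        },
        "test.shop.not_null_order_id": {"resource_type": "test", "description": ""},
    }
}

coverage = documentation_coverage(sample_manifest)
```

Running this in CI and failing the build when column_coverage_pct drops below your threshold keeps the targets above enforceable.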

Tools Worth Exploring

dbt-osmosis is a widely used open-source tool for propagating and generating dbt column documentation. It understands column inheritance across models and can fill in missing descriptions by inheriting them from upstream models.

Elementary provides observability tooling that includes documentation coverage tracking as part of its data quality monitoring suite.

Custom LLM scripts using the Anthropic or OpenAI APIs give you full control over prompt design and output formatting, which is often worth the additional setup time when your project has unusual conventions.

The best choice depends on your team’s existing tooling, the size of your dbt project, and how much control you need over the generated output style.

Moving Forward

Automating dbt documentation through AI and column-level lineage analysis reduces manual effort while improving consistency. Start small—pick a single domain or mart layer—and iterate on your prompts until the output quality meets your standards. Once patterns mature, expand coverage across your project incrementally.

The initial investment in setting up automated documentation pays dividends through improved data discoverability, faster onboarding for new analysts, and reduced time spent answering “what does this column mean?” questions in Slack.

Built by theluckystrike — More at zovo.one