Claude Skills Guide

Claude Code for Medallion Architecture Workflow

Medallion architecture, with its Bronze, Silver, and Gold layers, has become a fundamental pattern in modern data engineering. This three-tier approach organizes data pipelines into distinct quality levels, enabling teams to progressively refine raw data into business-ready datasets. Implementing and maintaining medallion architectures requires careful orchestration of ETL processes, robust data quality checks, and clear separation between layers. Claude Code turns this complexity into a manageable workflow by providing intelligent assistance throughout the development lifecycle.

Understanding the Medallion Architecture

The medallion architecture divides data processing into three distinct stages, each serving a specific purpose in the data journey. Understanding these layers is essential before implementing any automation.

The Bronze layer serves as the landing zone where raw, unprocessed data arrives directly from source systems—log files, API responses, database exports, or streaming events. This layer preserves the original data structure exactly as received, enabling complete reprocessing if downstream transformations prove incorrect. Think of bronze as your system of record for immutable raw data.

The Silver layer acts as the curated intermediate layer where data undergoes cleaning, validation, deduplication, and basic enrichment. Relationships get resolved, data types become consistent, and business rules begin shaping the information. The silver layer typically serves as the primary source for analytical queries and downstream applications.

The Gold layer is the final, most refined stage, containing business-level aggregates, metrics, and analytics-ready datasets. These tables directly fuel dashboards, reports, and ML models. Optimized for query performance, gold tables often employ star schemas or dimensional models.

Setting Up Your Claude Code Environment

Before implementing medallion workflows, ensure your Claude Code environment is properly configured. The key skills to load include those for file operations, bash execution, and any database-specific tools relevant to your stack.

# Verify Claude Code is available and check version
claude --version

# Initialize your project with a structure that matches the code below
mkdir -p bronze silver gold dags
mkdir -p data/bronze data/silver data/gold
mkdir -p tests/bronze tests/silver tests/gold
mkdir -p dbt/models

This establishes the foundational directory structure that will house your medallion pipeline components. Maintaining clear separation from the outset prevents confusion as transformations grow more complex.

Implementing the Bronze Layer

The bronze layer functions as your system of record for raw data. Claude Code helps generate ingestion scripts that capture data exactly as it arrives, without applying any transformations.

# bronze/ingest_raw.py
import pandas as pd
from datetime import datetime

def ingest_source_data(source_file, batch_id):
    """Ingest raw data directly to bronze layer"""
    ingested_at = datetime.now()
    
    # Read data without transformation
    if source_file.endswith('.json'):
        df = pd.read_json(source_file)
    elif source_file.endswith('.csv'):
        df = pd.read_csv(source_file)
    else:
        raise ValueError(f"Unsupported format: {source_file}")
    
    # Add metadata columns for tracking
    df['_ingested_at'] = ingested_at
    df['_batch_id'] = batch_id
    df['_source_file'] = source_file
    df['_ingest_date'] = ingested_at.strftime('%Y-%m-%d')
    
    # Write to the bronze root, partitioned by ingest date; partition_cols
    # requires a directory path, and partitioning on the raw timestamp
    # would create one directory per row
    output_path = "data/bronze"
    df.to_parquet(output_path, partition_cols=['_ingest_date'])
    
    return output_path

The ingestion script maintains raw data fidelity and captures essential metadata for auditability and reprocessing. Storing the data as Parquet partitioned by ingest date enables efficient time-travel reads and simplifies lifecycle management.
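If a downstream transformation later proves wrong, the partition layout makes replaying history cheap: you can read back just the slice of raw data you need. A minimal sketch using pandas with the pyarrow engine, assuming the bronze root and metadata columns from the script above (the date value is illustrative):

# bronze/reprocess.py (hypothetical helper)
import pandas as pd

# pyarrow prunes _ingest_date directories outside the filter, so only
# the requested day's files are scanned
batch = pd.read_parquet(
    "data/bronze",
    filters=[("_ingest_date", "=", "2025-01-01")],
)
print(batch["_batch_id"].unique())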

Building the Silver Layer Transformations

Silver layer processing applies business logic, cleanses data, and establishes relationships. Claude Code excels at generating transformation logic that handles common data quality issues.

# silver/transform_to_silver.py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_silver").getOrCreate()

def clean_and_enrich_bronze(bronze_path, silver_path):
    """Transform bronze data to silver with cleaning and enrichment"""
    df = spark.read.parquet(bronze_path)
    
    # Data type standardization
    df = df.withColumn('email', F.lower(F.trim('email'))) \
           .withColumn('phone', F.regexp_replace('phone', r'\D', '')) \
           .withColumn('created_at', F.to_timestamp('created_at'))
    
    # Deduplication based on business key
    df = df.dropDuplicates(['customer_id', 'transaction_id'])
    
    # Null handling with appropriate defaults
    df = df.fillna({
        'status': 'unknown',
        'region': 'undefined',
        'amount': 0.0
    })
    
    # Add derived columns for analytics
    df = df.withColumn('year_month', F.date_format('created_at', 'yyyy-MM')) \
           .withColumn('is_high_value', F.col('amount') > 1000)
    
    # Write to silver layer
    df.write.mode('overwrite') \
            .partitionBy('year_month') \
            .parquet(silver_path)

This transformation pipeline demonstrates essential silver layer practices: standardizing formats, removing duplicates, handling missing values, and creating derived attributes that support downstream analytics.
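A quick sanity check after a run is to compare row counts across the layer boundary, since deduplication is the only step above that drops rows. A minimal driver sketch, assuming the local paths used earlier and the function above:

# silver/run_silver.py (hypothetical driver)
from pyspark.sql import SparkSession
from silver.transform_to_silver import clean_and_enrich_bronze

spark = SparkSession.builder.appName("silver_driver").getOrCreate()

before = spark.read.parquet("data/bronze").count()
clean_and_enrich_bronze("data/bronze", "data/silver/transactions")
after = spark.read.parquet("data/silver/transactions").count()
print(f"Dropped {before - after} duplicate rows")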

Creating Gold Layer Aggregations

The gold layer produces business-ready aggregates optimized for specific use cases. These transformations typically involve complex joins, window functions, and business-specific calculations.

# gold/create_customer_metrics.py
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_gold").getOrCreate()

def build_customer_metrics(silver_path, gold_path):
    """Create business-level customer metrics for analytics"""
    df = spark.read.parquet(silver_path)
    
    metrics = df.groupBy('customer_id', 'region').agg(
        F.count('transaction_id').alias('transaction_count'),
        F.sum('amount').alias('total_revenue'),
        F.avg('amount').alias('avg_transaction_value'),
        F.min('created_at').alias('first_purchase_date'),
        F.max('created_at').alias('last_purchase_date'),
        F.sum(F.when(F.col('is_high_value'), 1).otherwise(0)).alias('high_value_transactions')
    )
    
    # Add calculated fields
    metrics = metrics.withColumn(
        'customer_lifetime_months',
        F.months_between(F.col('last_purchase_date'), F.col('first_purchase_date')) + 1
    ).withColumn(
        'monthly_value',
        F.col('total_revenue') / F.col('customer_lifetime_months')
    )
    
    # Write gold table partitioned by region to match common query filters
    metrics.write.mode('overwrite') \
           .partitionBy('region') \
           .parquet(gold_path)

Gold layer tables should align directly with consumption patterns—dashboard queries, ML feature engineering, or API responses. The partition strategy should reflect the most common access patterns to minimize query latency.
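For example, a dashboard query scoped to one region touches only that region's partition directory. A sketch, assuming the region-partitioned gold path written above (the region value is illustrative):

# gold/query_example.py (hypothetical consumer)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("gold_query").getOrCreate()

metrics = spark.read.parquet("data/gold/customer_metrics")

# The region filter is satisfied by partition pruning, so only the
# region=EMEA directory is scanned
top = (metrics
       .filter(F.col("region") == "EMEA")
       .orderBy(F.col("monthly_value").desc())
       .limit(10))
top.show()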

Implementing Data Quality Checks

Maintaining data quality across medallion layers requires automated validation at each stage. Integrate checks that catch issues before they propagate downstream.

# dbt/models/silver/schema.yml
version: 2

models:
  - name: silver_transactions
    description: "Cleaned transaction data"
    tests:
      - dbt_utils.recency:
          datepart: hour
          interval: 4
          field: _ingested_at
      - dbt_utils.expression_is_true:
          expression: "amount >= 0"
    columns:
      - name: customer_id
        tests:
          - not_null
      - name: transaction_id
        tests:
          - not_null
          - unique

These tests catch data anomalies immediately upon pipeline execution rather than allowing defects to reach end users. Implementing quality gates between each medallion layer prevents bad data from contaminating downstream tables.
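Beyond dbt, a lightweight gate can run inline before a layer is promoted, failing the pipeline run instead of publishing bad data. A minimal sketch (the checks mirror the silver schema above; the gate itself is illustrative):

# quality/gates.py (hypothetical module)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quality_gate").getOrCreate()

def gate_silver(silver_path):
    """Raise before gold builds on silver data that fails basic checks."""
    df = spark.read.parquet(silver_path).cache()
    failures = []
    if df.filter(F.col("transaction_id").isNull()).count() > 0:
        failures.append("null transaction_id")
    if df.count() != df.select("transaction_id").distinct().count():
        failures.append("duplicate transaction_id")
    if df.filter(F.col("amount") < 0).count() > 0:
        failures.append("negative amount")
    if failures:
        raise ValueError(f"Silver quality gate failed: {failures}")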

Orchestrating the Pipeline

Finally, orchestrate your medallion pipeline using tools like Apache Airflow or Prefect. Define dependencies clearly to ensure proper execution order and enable parallel processing where appropriate.

# dags/medallion_pipeline.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

from bronze.ingest_raw import ingest_source_data
from silver.transform_to_silver import clean_and_enrich_bronze
from gold.create_customer_metrics import build_customer_metrics

with DAG('medallion_architecture',
         start_date=datetime(2026, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    
    # Paths and the templated batch id below are illustrative
    ingest_bronze = PythonOperator(
        task_id='ingest_to_bronze',
        python_callable=ingest_source_data,
        op_kwargs={'source_file': 'data/landing/sales_api.json',
                   'batch_id': '{{ ds_nodash }}'}
    )
    
    transform_silver = PythonOperator(
        task_id='transform_to_silver',
        python_callable=clean_and_enrich_bronze,
        op_kwargs={'bronze_path': 'data/bronze',
                   'silver_path': 'data/silver/transactions'}
    )
    
    aggregate_gold = PythonOperator(
        task_id='create_gold_metrics',
        python_callable=build_customer_metrics,
        op_kwargs={'silver_path': 'data/silver/transactions',
                   'gold_path': 'data/gold/customer_metrics'}
    )
    
    ingest_bronze >> transform_silver >> aggregate_gold

Best Practices for Medallion with Claude Code

When implementing medallion architecture with Claude Code, several practices will significantly improve your workflow reliability and maintainability.

Use timestamp-based partitioning in your Bronze and Silver layers to enable efficient time-travel queries. This allows you to reprocess historical data without scanning entire tables.

Implement idempotent transformations that produce consistent results regardless of how many times they’re executed. This is crucial for debugging and recovery scenarios.
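In Spark, one way to get idempotent re-runs (an illustration, not the only option) is dynamic partition overwrite, so replaying a batch replaces exactly the partitions it produced the first time instead of duplicating rows:

# silver/idempotent_write.py (hypothetical example)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent_write").getOrCreate()

# In dynamic mode, overwrite replaces only the partitions present in
# this run's data; untouched year_month partitions keep their files
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.read.parquet("data/silver/transactions")  # illustrative input
df.write.mode("overwrite") \
  .partitionBy("year_month") \
  .parquet("data/silver/transactions_dedup")         # hypothetical target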

Capture comprehensive metadata at each layer—ingestion timestamps, source systems, batch IDs, and processing durations. This metadata proves invaluable for debugging lineage issues.
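One lightweight way to capture this (an illustrative pattern, not prescribed by the pipeline above) is to append an audit record per run:

# metadata/audit.py (hypothetical module)
import json
import os
import time
from datetime import datetime

def record_run(layer, batch_id, row_count, started):
    """Append a per-run audit record for lineage and debugging."""
    entry = {
        "layer": layer,
        "batch_id": batch_id,
        "row_count": row_count,
        "started_at": started.isoformat(),
        "duration_s": round(time.time() - started.timestamp(), 2),
    }
    os.makedirs("data/_audit", exist_ok=True)  # hypothetical location
    with open("data/_audit/runs.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# Example usage after a silver run (values are illustrative)
record_run("silver", "batch42", row_count=10_000, started=datetime.now())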

Validate data quality at layer transitions using tools like Great Expectations or dbt tests. Catching issues early prevents expensive cleanup operations later.

Leverage Claude Code’s strengths by asking it to generate boilerplate code, explain complex transformations, or suggest optimization opportunities. For complex medallion implementations, consider using the dbt skill to define transformations declaratively and maintain clear documentation.

Claude Code accelerates medallion architecture implementation by handling repetitive boilerplate while you focus on business logic. The key is establishing clear separation between layers, implementing robust quality checks, and maintaining comprehensive metadata for observability.
