Claude Skills Guide

Introduction

Data quality is the foundation of reliable software systems and data-driven decisions. Poor data quality leads to flawed analytics, incorrect business decisions, and system failures. This guide explores how Claude Code CLI can automate and streamline your data quality validation workflows, making it easier to catch issues before they impact production systems.

Understanding Data Quality Validation

Data quality validation encompasses several key dimensions:

Traditional validation approaches often involve writing custom scripts for each dataset, leading to duplicated effort and inconsistent validation logic across projects.

Setting Up Claude Code for Data Validation

First, ensure Claude Code is installed and configured. Then create a dedicated validation project:

mkdir data-validation && cd data-validation
mkdir -p .claude

Create a validation script that Claude can use:

#!/usr/bin/env python3
"""Data quality validation framework"""

import json
import sys
from typing import Any, Dict, List

class ValidationError(Exception):
    def __init__(self, field: str, message: str):
        self.field = field
        super().__init__(f"{field}: {message}")

def validate_required_fields(data: Dict, required_fields: List[str]) -> None:
    """Check that all required fields are present"""
    for field in required_fields:
        if field not in data or data[field] is None:
            raise ValidationError(field, "Required field is missing or null")

def validate_data_types(data: Dict, schema: Dict) -> None:
    """Validate data types match expected schema"""
    for field, expected_type in schema.items():
        if field in data:
            actual_type = type(data[field]).__name__
            if actual_type != expected_type:
                raise ValidationError(field, f"Expected {expected_type}, got {actual_type}")

def validate_range(data: Dict, field: str, min_val: Any = None, max_val: Any = None) -> None:
    """Validate numeric fields fall within acceptable ranges"""
    if field in data:
        value = data[field]
        if min_val is not None and value < min_val:
            raise ValidationError(field, f"Value {value} is below minimum {min_val}")
        if max_val is not None and value > max_val:
            raise ValidationError(field, f"Value {value} exceeds maximum {max_val}")

def run_validation(data: List[Dict], schema: Dict) -> Dict:
    """Run full validation pipeline and return results"""
    results = {"valid": [], "invalid": [], "errors": []}
    
    for idx, record in enumerate(data):
        try:
            validate_required_fields(record, schema.get("required", []))
            validate_data_types(record, schema.get("types", {}))
            for field, range_spec in schema.get("ranges", {}).items():
                validate_range(record, field, range_spec.get("min"), range_spec.get("max"))
            results["valid"].append(idx)
        except ValidationError as e:
            results["invalid"].append(idx)
            results["errors"].append({"index": idx, "error": str(e)})
    
    return results

if __name__ == "__main__":
    # Example usage
    sample_data = [
        {"id": 1, "name": "Alice", "age": 30, "email": "alice@example.com"},
        {"id": 2, "name": "Bob", "age": 25},
        {"id": 3, "name": "Charlie", "age": 150, "email": "charlie@example.com"}
    ]
    
    schema = {
        "required": ["id", "name", "email"],
        "types": {"id": "int", "name": "str", "age": "int", "email": "str"},
        "ranges": {"age": {"min": 0, "max": 120}}
    }
    
    results = run_validation(sample_data, schema)
    print(json.dumps(results, indent=2))

Creating a Validation Prompt for Claude

The real power comes from using Claude to generate and run validation tasks. Create a CLAUDE.md file in your project:

# Data Validation Assistant

You help validate datasets against defined schemas and business rules.

## Capabilities

1. **Schema Validation**: Check data conforms to JSON Schema or custom schemas
2. **Business Rule Validation**: Apply custom validation logic
3. **Data Profiling**: Analyze data quality metrics
4. **Error Reporting**: Generate detailed validation reports

## Workflow

When asked to validate data:
1. Load the data file and schema
2. Run all validation checks
3. Report findings in structured format
4. Suggest fixes for failed validations

## Output Format

Always output validation results as:
- Summary: total records, passed, failed
- Errors: list of specific validation failures
- Recommendations: suggested fixes

Practical Example: Validating Customer Data

Here’s a practical workflow for validating customer records:

# customer_schema.py
CUSTOMER_SCHEMA = {
    "required": ["customer_id", "email", "signup_date", "status"],
    "types": {
        "customer_id": "str",
        "email": "str", 
        "signup_date": "str",
        "status": "str",
        "age": "int",
        "lifetime_value": "float"
    },
    "ranges": {
        "age": {"min": 13, "max": 120},
        "lifetime_value": {"min": 0, "max": 1000000}
    },
    "allowed_values": {
        "status": ["active", "inactive", "suspended", "pending"]
    },
    "email_pattern": r"^[\w\.-]+@[\w\.-]+\.\w+$"
}

def validate_customer_record(record: Dict) -> List[str]:
    """Validate a single customer record"""
    errors = []
    
    # Check required fields
    for field in CUSTOMER_SCHEMA["required"]:
        if not record.get(field):
            errors.append(f"Missing required field: {field}")
    
    # Validate email format
    email = record.get("email", "")
    import re
    if not re.match(CUSTOMER_SCHEMA["email_pattern"], email):
        errors.append(f"Invalid email format: {email}")
    
    # Validate status values
    status = record.get("status")
    if status and status not in CUSTOMER_SCHEMA["allowed_values"]["status"]:
        errors.append(f"Invalid status: {status}. Must be one of {CUSTOMER_SCHEMA['allowed_values']['status']}")
    
    return errors

Automating Validation Workflows

Combine Claude Code with scheduled jobs for continuous data quality monitoring:

# Run validation as part of your data pipeline
claude --print "Validate the customer_data.json file against customer_schema.py and report any issues" > validation_report.txt

# Check exit code for pipeline integration
if [ $? -eq 0 ]; then
    echo "Validation passed"
else
    echo "Validation failed - review report"
    exit 1
fi

Best Practices for Data Validation

1. Define Clear Schemas

Always start with explicit schema definitions. Use JSON Schema or Python type hints:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Customer:
    customer_id: str
    email: str
    signup_date: str
    status: str
    age: Optional[int] = None
    lifetime_value: Optional[float] = None

2. Fail Fast, Fail Loud

Configure validation to fail immediately on critical errors rather than continuing with bad data:

# Critical validation - stop on failure
def validate_critical(record: Dict) -> None:
    validate_required_fields(record, ["customer_id", "email"])
    validate_email_format(record["email"])
    # Don't continue if critical fields are invalid

3. Log Everything

Maintain audit trails of all validation runs:

import logging
from datetime import datetime

logging.basicConfig(
    filename=f"validation_{datetime.now().strftime('%Y%m%d')}.log",
    level=logging.INFO
)

def log_validation(results: Dict) -> None:
    logging.info(f"Validation run: {datetime.now()}")
    logging.info(f"Total: {len(results['valid']) + len(results['invalid'])}")
    logging.info(f"Passed: {len(results['valid'])}")
    logging.info(f"Failed: {len(results['invalid'])}")

4. Implement Incremental Validation

For large datasets, validate in batches to catch issues early:

BATCH_SIZE = 1000

def validate_in_batches(data: List[Dict], batch_size: int = BATCH_SIZE):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        results = run_validation(batch, schema)
        yield {"batch": i // batch_size, "results": results}

Conclusion

Claude Code transforms data quality validation from a manual, error-prone process into an automated, reliable workflow. By defining clear schemas, implementing comprehensive validation logic, and using Claude’s capabilities for generating validation code and analyzing results, you can ensure your data meets the highest quality standards.

Start small—validate one dataset with a simple schema—and progressively add complexity as your validation framework matures. The investment in robust data quality validation pays dividends in system reliability and data-driven decision making.

Remember: Clean data leads to clean insights. Let Claude Code help you achieve both.

Built by theluckystrike — More at zovo.one