Schema drift silently breaks APIs, corrupts databases, and causes midnight pages. Manual review catches maybe 60% of issues. AI-assisted schema validation can catch breaking changes, suggest fixes, and enforce conventions before code merges.
This guide covers three approaches: AI-generated validation rules, LLM-assisted schema review in CI, and schema evolution analysis.
The Problem With Manual Schema Review
Consider this Protobuf change:
// Before
message UserEvent {
string user_id = 1;
string event_type = 2;
int64 timestamp = 3;
}
// After (proposed change)
message UserEvent {
string user_id = 1;
string event_type = 2;
int64 timestamp = 3;
string metadata = 4; // NEW: optional, safe
// removed: timestamp field -- BREAKING: field 3 reused below
string correlation_id = 3; // BREAKING: field number collision
}
A human reviewer focused on the new metadata field might miss the field number collision. An AI tool reviewing the schema diff will catch it every time.
Approach 1: AI Schema Review in CI
This GitHub Action sends schema diffs to Claude for review before merge:
# .github/workflows/schema-review.yml
name: AI Schema Review
on:
pull_request:
paths:
- '**/*.proto'
- '**/*.json'
- '**/schema.sql'
- '**/migrations/**'
jobs:
schema-review:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Extract schema diffs
id: diffs
run: |
git diff origin/${{ github.base_ref }}...HEAD \
-- '*.proto' '*.json' 'schema.sql' 'migrations/' \
> /tmp/schema.diff
echo "has_changes=$(test -s /tmp/schema.diff && echo true || echo false)" >> $GITHUB_OUTPUT
- name: AI schema review
if: steps.diffs.outputs.has_changes == 'true'
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
python3 << 'EOF'
import os, json, urllib.request
with open('/tmp/schema.diff') as f:
diff = f.read()[:6000]
payload = {
"model": "claude-opus-4-6",
"max_tokens": 1500,
"messages": [{
"role": "user",
"content": f"""Review this schema diff for:
1. Breaking changes (field removal, type change, number reuse in Protobuf)
2. Backward compatibility issues
3. Missing required fields that could cause null pointer errors
4. Index changes that could degrade query performance
5. Convention violations (naming, nullable constraints)
Format your response as:
SEVERITY: [BLOCKING | WARNING | INFO]
ISSUES: [bullet list or "None found"]
SUGGESTIONS: [bullet list or "None"]
Diff:
{diff}"""
}]
}
req = urllib.request.Request(
"https://api.anthropic.com/v1/messages",
data=json.dumps(payload).encode(),
headers={
"x-api-key": os.environ["ANTHROPIC_API_KEY"],
"anthropic-version": "2023-06-01",
"content-type": "application/json"
}
)
with urllib.request.urlopen(req) as resp:
result = json.loads(resp.read())
review = result["content"][0]["text"]
with open('/tmp/schema_review.txt', 'w') as f:
f.write(review)
print(review)
if "SEVERITY: BLOCKING" in review:
exit(1)
EOF
- name: Post review as PR comment
if: always() && steps.diffs.outputs.has_changes == 'true'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
if (!fs.existsSync('/tmp/schema_review.txt')) return;
const review = fs.readFileSync('/tmp/schema_review.txt', 'utf8');
await github.rest.issues.createComment({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: `## AI Schema Review\n\n\`\`\`\n${review}\n\`\`\``
});
Approach 2: JSON Schema Generation and Validation
AI can generate JSON Schema from example payloads, then validate new examples against it:
# schema_validator.py
import json
import jsonschema
from anthropic import Anthropic
client = Anthropic()
def generate_schema_from_examples(examples: list[dict]) -> dict:
"""Use Claude to infer a strict JSON Schema from example payloads."""
examples_json = json.dumps(examples[:5], indent=2)
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=2000,
messages=[{
"role": "user",
"content": f"""Generate a strict JSON Schema (draft-07) from these examples.
Rules:
- Mark fields present in all examples as required
- Use specific types (no 'any')
- Add format hints (email, date-time, uri) where obvious
- Use additionalProperties: false for all objects
- Add minimum/maximum for numeric fields if pattern is clear
Return only valid JSON, no explanation.
Examples:
{examples_json}"""
}]
)
schema_text = response.content[0].text.strip()
# Strip markdown code blocks if present
if schema_text.startswith("```"):
schema_text = "\n".join(schema_text.split("\n")[1:-1])
return json.loads(schema_text)
def validate_payload(payload: dict, schema: dict) -> dict:
"""Validate a payload and return human-readable error summary."""
validator = jsonschema.Draft7Validator(schema)
errors = list(validator.iter_errors(payload))
if not errors:
return {"valid": True, "errors": []}
error_list = []
for error in errors:
path = " -> ".join(str(p) for p in error.absolute_path) or "root"
error_list.append({
"path": path,
"message": error.message,
"schema_path": " -> ".join(str(p) for p in error.absolute_schema_path)
})
return {"valid": False, "errors": error_list}
# Usage example
if __name__ == "__main__":
training_examples = [
{
"user_id": "usr_abc123",
"email": "alice@example.com",
"created_at": "2026-01-01T00:00:00Z",
"plan": "pro",
"seats": 5
},
{
"user_id": "usr_def456",
"email": "bob@example.com",
"created_at": "2026-02-15T12:30:00Z",
"plan": "free",
"seats": 1
}
]
schema = generate_schema_from_examples(training_examples)
print("Generated schema:")
print(json.dumps(schema, indent=2))
# Test with a bad payload
bad_payload = {
"user_id": "usr_xyz",
"email": "not-an-email",
"created_at": "2026-03-22", # wrong format
"plan": "enterprise",
"seats": -1 # negative seats
}
result = validate_payload(bad_payload, schema)
if not result["valid"]:
print("\nValidation errors:")
for err in result["errors"]:
print(f" [{err['path']}] {err['message']}")
Sample generated schema output:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["user_id", "email", "created_at", "plan", "seats"],
"additionalProperties": false,
"properties": {
"user_id": {
"type": "string",
"pattern": "^usr_[a-z0-9]+$"
},
"email": {
"type": "string",
"format": "email"
},
"created_at": {
"type": "string",
"format": "date-time"
},
"plan": {
"type": "string",
"enum": ["free", "pro", "enterprise"]
},
"seats": {
"type": "integer",
"minimum": 1,
"maximum": 1000
}
}
}
Claude correctly inferred the enum from two examples, the usr_ ID prefix pattern, and the date-time format.
Approach 3: SQL Schema Migration Analysis
Before running migrations, use AI to predict impact:
# migration_analyzer.py
from anthropic import Anthropic
import re
client = Anthropic()
MIGRATION_PROMPT = """Analyze this SQL migration for production safety.
Check for:
1. Table locks: ALTER TABLE on large tables, adding NOT NULL without DEFAULT
2. Index operations: CREATE INDEX without CONCURRENTLY on PostgreSQL
3. Data loss risk: DROP COLUMN, DROP TABLE, TRUNCATE
4. Breaking changes: column rename (apps may still reference old name)
5. Performance: missing index on new foreign keys
For each issue found, output:
RISK: [HIGH | MEDIUM | LOW]
LINE: [line number or N/A]
ISSUE: [description]
FIX: [recommended fix]
Migration:
"""
def analyze_migration(migration_sql: str) -> list[dict]:
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=1500,
messages=[{
"role": "user",
"content": MIGRATION_PROMPT + migration_sql
}]
)
text = response.content[0].text
issues = []
# Parse structured output
blocks = re.split(r'\n(?=RISK:)', text.strip())
for block in blocks:
if "RISK:" not in block:
continue
issue = {}
for line in block.split("\n"):
for key in ["RISK", "LINE", "ISSUE", "FIX"]:
if line.startswith(f"{key}:"):
issue[key.lower()] = line[len(key)+1:].strip()
if issue:
issues.append(issue)
return issues
# Test migration
test_migration = """
ALTER TABLE users ADD COLUMN last_login_ip VARCHAR(45) NOT NULL;
CREATE INDEX idx_users_email ON users (email);
ALTER TABLE orders DROP COLUMN legacy_status;
ALTER TABLE products ADD COLUMN category_id INT REFERENCES categories(id);
"""
issues = analyze_migration(test_migration)
for issue in issues:
print(f"[{issue.get('risk', '?')}] {issue.get('issue', '')}")
print(f" Fix: {issue.get('fix', '')}\n")
Output for the test migration:
[HIGH] Adding NOT NULL column without DEFAULT will lock table and fail if existing rows exist
Fix: Use NOT NULL DEFAULT '' or add column as nullable first, backfill, then add constraint
[MEDIUM] CREATE INDEX without CONCURRENTLY will lock table for reads+writes during index build
Fix: Use CREATE INDEX CONCURRENTLY idx_users_email ON users (email)
[HIGH] DROP COLUMN legacy_status is irreversible — data will be lost
Fix: Rename to _deprecated_legacy_status first, verify no app references, drop in next release
[MEDIUM] Foreign key on category_id has no index — JOIN and DELETE operations will be slow
Fix: Add CREATE INDEX idx_products_category_id ON products (category_id)
Tool Comparison
| Tool | Schema Types | CI Integration | Custom Rules | Cost |
|---|---|---|---|---|
| Claude (API) | Any (Protobuf, JSON, SQL) | Via script | Prompt engineering | ~$0.01/review |
| GPT-4 (API) | Any | Via script | Prompt engineering | ~$0.02/review |
| Buf (Protobuf) | Protobuf only | Native | Config file | Free |
| Spectral (OpenAPI) | OpenAPI/AsyncAPI | Native | Ruleset files | Free |
| schemachange (SQL) | SQL only | Native | Limited | Free |
For polyglot schemas (Protobuf + JSON + SQL in one repo), an AI-based approach is the only tool that handles all three in a unified pipeline.