SQL query generation is one of the few AI tasks with objective evaluation criteria — the query either returns correct results or it doesn’t. This makes it unusually easy to benchmark. This guide covers dedicated text-to-SQL tools, general LLMs, and IDE-integrated options, with accuracy benchmarks on real-world query patterns.
Tools Compared
- SQLAI.ai — Dedicated text-to-SQL with schema awareness
- Outerbase — Database client with AI query generation built in
- Claude (Anthropic) — General LLM with strong SQL generation
- ChatGPT (GPT-4o) — General LLM, strong SQL understanding
- Cursor / Copilot — IDE-native query assistance
Why SQL Generation Is Hard
Simple SELECTs are trivial for any modern LLM. The failure modes appear at complexity:
- Multi-level CTEs with self-joins
- Window functions with frame specifications
- Dialect-specific syntax (QUALIFY in BigQuery, LATERAL JOIN in Postgres)
- Aggregations with HAVING vs WHERE distinction
- NULL handling across different databases
Benchmark Query 1: Year-over-Year Revenue with Window Functions
Prompt: “Show monthly revenue for 2024 and 2025 side by side with percent change, for each product category”
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
category,
SUM(revenue) AS total_revenue
FROM orders
JOIN products USING (product_id)
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
)
SELECT
category,
month,
total_revenue AS current_revenue,
LAG(total_revenue, 12) OVER (
PARTITION BY category ORDER BY month
) AS prior_year_revenue,
ROUND(
100.0 * (total_revenue - LAG(total_revenue, 12) OVER (
PARTITION BY category ORDER BY month
)) / NULLIF(LAG(total_revenue, 12) OVER (
PARTITION BY category ORDER BY month
), 0),
2
) AS yoy_pct_change
FROM monthly_revenue
ORDER BY category, month;
| Tool | Generated Correct Query | Notes |
|---|---|---|
| Claude (Sonnet 4.5) | Yes | Used LAG with correct NULLIF for division safety |
| GPT-4o | Yes | Slightly different CTE structure, same result |
| SQLAI.ai | Partial | Missed the NULLIF, division by zero on new categories |
| Copilot Chat | Partial | Generated window function but wrong PARTITION BY |
Benchmark Query 2: Recursive CTE for Hierarchy
Prompt: “Find all managers in the org chart above employee ID 1042, including their level”
WITH RECURSIVE org_hierarchy AS (
SELECT employee_id, manager_id, name, 0 AS depth
FROM employees
WHERE employee_id = 1042
UNION ALL
SELECT e.employee_id, e.manager_id, e.name, oh.depth + 1
FROM employees e
INNER JOIN org_hierarchy oh ON e.employee_id = oh.manager_id
)
SELECT employee_id, name, depth AS levels_above_1042
FROM org_hierarchy
WHERE depth > 0
ORDER BY depth;
| Tool | Generated Correct Query | Notes |
|---|---|---|
| Claude (Sonnet 4.5) | Yes | Correct recursive structure |
| GPT-4o | Yes | Added cycle detection guard (good extra) |
| SQLAI.ai | No | Generated iterative approach, not recursive CTE |
| Copilot Chat | Partial | Recursive structure correct, wrong termination |
Benchmark Query 3: BigQuery QUALIFY Clause
Prompt: “In BigQuery, filter rows where the most recent event per user is a ‘purchase’, using QUALIFY”
SELECT
user_id,
event_type,
event_timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) AS rn
FROM events
QUALIFY rn = 1 AND event_type = 'purchase';
| Tool | Used QUALIFY | Notes |
|---|---|---|
| Claude (Sonnet 4.5) | Yes | Correctly used when dialect specified |
| GPT-4o | Yes | Also correct |
| SQLAI.ai | No | Used subquery with WHERE, valid but verbose |
| Copilot Chat | No | Used subquery approach |
Schema-Aware Tools
For production use, schema awareness matters more than raw generation quality.
Using General LLMs Effectively for SQL
Always include your schema when using Claude or ChatGPT:
I'm querying a PostgreSQL database. Here is the relevant schema:
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
status TEXT CHECK (status IN ('pending', 'paid', 'shipped', 'cancelled')),
total_cents INTEGER
);
CREATE TABLE order_items (
item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER,
unit_price_cents INTEGER
);
Query: Find the top 10 products by revenue in the last 30 days,
showing product name, units sold, and total revenue in dollars.
Include only products with at least 5 orders.
With schema context, Claude and GPT-4o produce correct column names and join conditions on the first attempt. Without it, they hallucinate reasonable-sounding but wrong column names about 30% of the time.
Accuracy Summary
Tested on 40 queries covering: basic aggregations, window functions, CTEs, recursive queries, dialect-specific syntax, and NULL edge cases.
| Tool | Accuracy | Schema Aware | Dialect Support | Cost |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 87% | Prompt-based | All major | API pricing |
| GPT-4o | 85% | Prompt-based | All major | API pricing |
| SQLAI.ai | 78% | Native connection | Postgres, MySQL, BQ, Snowflake | $29/mo |
| Outerbase AI | 72% | Native connection | Most major | Free + paid |
| Copilot Chat | 68% | File context only | All major | $10-19/mo |
Practical Workflow
For analysts without IDE access, SQLAI.ai or Outerbase gives the best experience because of native schema connections. For engineers in an IDE, providing schema in the prompt to Claude or GPT-4o achieves higher accuracy at lower cost.
For production query generation, the schema-in-prompt approach with Claude is most reliable — especially for complex analytical queries where the accuracy gap between tools is widest.
Benchmark Query 4: Percentile Calculation Without Window Functions
Prompt: “Find the 95th percentile of order value in the last 30 days, for each product category”
-- Claude output (correct for PostgreSQL)
WITH ordered_values AS (
SELECT category, revenue
FROM orders
JOIN products USING (product_id)
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT DISTINCT
category,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue)
OVER (PARTITION BY category) AS p95_revenue
FROM ordered_values;
| Tool | Correct | Notes |
|---|---|---|
| Claude | Yes | Used correct PERCENTILE_CONT aggregation |
| GPT-4o | Yes | Also correct, slightly different syntax |
| SQLAI.ai | No | Used NTILE instead, off by one tier |
Integrating Generated Queries into Applications
Python with SQLAlchemy
from sqlalchemy import text
def get_revenue_by_category(session, days_back: int = 30):
query = text("""
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
category,
SUM(revenue) AS total_revenue
FROM orders
JOIN products USING (product_id)
WHERE order_date >= NOW() - INTERVAL ':days days'
GROUP BY 1, 2
)
SELECT
category,
month,
total_revenue,
LAG(total_revenue, 12) OVER (
PARTITION BY category ORDER BY month
) AS prior_year_revenue
FROM monthly_revenue
ORDER BY category, month
""")
return session.execute(query, {"days": days_back}).fetchall()
When using Claude to generate text() queries, always include :param_name placeholders for parameterization. Claude respects this pattern when you show an example.
JavaScript with Knex.js
function getProductsByRevenue(knex, minRevenue) {
return knex
.with("ranked_products", (qb) => {
qb.select("product_id", "category", "revenue")
.from("orders")
.join("products", "orders.product_id", "=", "products.product_id")
.select(knex.raw(
"ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank"
))
.where("revenue", ">", minRevenue);
})
.select("*")
.from("ranked_products")
.where("rank", "<=", 10);
}
For Knex, provide Claude with the builder syntax examples from your codebase. Claude learns the pattern and generates compatible queries.
Direct Database Connection (Node.js)
async function analyzeUserTrends(pool, userId) {
const query = `
WITH user_orders AS (
SELECT
order_id,
DATE(order_date) AS order_day,
COUNT(*) OVER (
ORDER BY DATE(order_date)
ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
) AS orders_in_window
FROM orders
WHERE user_id = $1
)
SELECT
order_day,
orders_in_window,
ROUND(100.0 * orders_in_window / MAX(orders_in_window) OVER (), 1) AS pct_of_max
FROM user_orders
ORDER BY order_day DESC;
`;
return pool.query(query, [userId]);
}
Testing Generated Queries
Always validate generated queries before deploying:
def test_generated_query_correctness():
# Test data
session.add_all([
Order(product_id=1, revenue=100, order_date=datetime(2024, 1, 1)),
Order(product_id=1, revenue=150, order_date=datetime(2025, 1, 1)),
])
session.commit()
results = get_revenue_by_category(session)
# Assert: should have row for month=2025-01 with revenue=150 and prior_year=100
assert results[0].yoy_pct_change == 50.0
def test_edge_case_no_prior_data():
# Test: if only current year data exists, prior_year should be NULL
session.add(Order(product_id=1, revenue=100, order_date=datetime(2025, 1, 1)))
session.commit()
results = get_revenue_by_category(session)
assert results[0].prior_year_revenue is None
Common Query Generation Mistakes to Watch For
- Window function frame misspecification — ORDER BY without frame clause causes incorrect rows
- Aggregation with GROUP BY missing columns — Some databases require all non-aggregated columns in GROUP BY
- NULL handling in comparisons —
column = NULLshould beIS NULL - Subquery correlation errors — Outer query columns not visible in subqueries without explicit correlation names
- Date arithmetic across databases — MySQL uses DATE_ADD, Postgres uses INTERVAL syntax
Ask Claude explicitly: “Include NULL checks. Ensure all non-aggregated columns in GROUP BY. Use {database} specific syntax.”
When to Skip AI Query Generation
Some queries are too domain-specific for reliable generation:
- Recursive hierarchies with cycle detection (CTEs with UNION clauses need manual validation)
- Fuzzy matching / trigram similarity (database-specific, high error rate)
- Complex graph traversal (Postgres WITH RECURSIVE, Neo4j Cypher)
- Machine learning result scoring (requires domain context on weighting)
For these, AI-generated boilerplate + manual refinement is faster than pure generation.
Related Articles
- Best AI Assistant for SQL Query Optimization
- Best AI Tools for SQL Query Optimization 2026: EverSQL.
- Best AI Tools for SQL Query Optimization and Database
- AI-Powered Database Query Optimization Tools 2026
- AI Tools for Database Performance Optimization Query
Built by theluckystrike — More at zovo.one