Claude Skills Guide

Claude Code for Apache Drill Workflow Tutorial

Apache Drill is a schema-free SQL query engine that enables analysts and developers to explore data across multiple data sources—including HDFS, MongoDB, Amazon S3, and cloud storage—using familiar SQL syntax. When combined with Claude Code’s automation capabilities, you can create powerful, reproducible workflows for data exploration, schema discovery, and complex query generation. This tutorial walks you through building Drill-powered workflows using Claude Code.

Prerequisites and Setup

Before creating Drill workflows with Claude Code, ensure you have the following components installed and configured:

  1. Apache Drill - Download from the official Apache Drill site and start the embedded mode or Drillbit service
  2. Claude Code - Installed and configured with appropriate tool access (Bash, read_file, write_file)
  3. Python with JayDeBeApi - For JDBC connectivity to Drill (optional but recommended)

Verify Drill is running by executing:

curl -X GET http://localhost:8047/health

A successful response indicates Drill’s REST API is accessible. With Claude Code configured, you can now automate Drill interactions through the command-line interface or build custom skills for repeated workflows.

Connecting Claude Code to Apache Drill

The most straightforward approach involves using Drill’s REST API through curl commands executed via Claude Code’s Bash tool. This method requires no additional Python dependencies and works across all platforms.

Create a skill that encapsulates Drill connection logic:

---
name: drill
description: Execute Apache Drill queries via REST API
---

# Apache Drill Query Runner

This skill executes SQL queries against Apache Drill and returns formatted results.

## Connection Configuration

Drill defaults to localhost:8047. Set environment variables for custom configurations:

export DRILL_HOST=your-drill-server
export DRILL_PORT=8047

With this foundation, Claude Code can execute Drill queries by constructing appropriate HTTP requests. The basic query execution pattern:

curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SELECT * FROM dfs.tmp.`sample_data.json` LIMIT 10"}' \
  http://localhost:8047/query.json

Building Your First Drill Workflow

A practical Drill workflow typically involves three phases: schema discovery, query construction, and result export. Let’s build a complete example that explores a nested JSON dataset.

Step 1: Discover Available Data Sources

Start by listing storage plugins and their configurations:

curl -s -X GET http://localhost:8047/storage.json | jq '.'

This returns all configured Drill storage plugins (dfs, mongo, hive, etc.). For each plugin, you can query its schema:

curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "SHOW TABLES FROM dfs.`/data/sales`"}' \
  http://localhost:8047/query.json

Claude Code can automate this exploration, systematically discovering tables and their structures across all connected data sources.

Step 2: Analyze Schema with DESCRIBE

Drill’s DESCRIBE command reveals complex nested structures—crucial for JSON, Parquet, and MongoDB sources:

curl -X POST -H "Content-Type: application/json" \
  -d '{"queryType": "SQL", "query": "DESCRIBE dfs.`/data/sales`"}' \
  http://localhost:8047/query.json

For nested data, Drill flattens structures using dot notation. Understanding the actual schema prevents query errors when accessing deeply nested arrays or maps.

Step 3: Construct and Execute Queries

Now build your analysis query. A well-structured Drill query handles complex types explicitly:

SELECT 
  customer_id,
  SUM(total_amount) AS total_spent,
  COUNT(*) AS order_count,
  FLATTEN(products) AS product
FROM dfs.`/data/sales`
WHERE order_date >= '2025-01-01'
GROUP BY customer_id, FLATTEN(products)
ORDER BY total_spent DESC
LIMIT 100

Execute this through Claude Code:

QUERY='SELECT customer_id, SUM(total_amount) AS total_spent FROM dfs.`/data/sales` WHERE order_date >= '\''2025-01-01'\'' GROUP BY customer_id ORDER BY total_spent DESC LIMIT 100'

curl -X POST -H "Content-Type: application/json" \
  -d "{\"queryType\": \"SQL\", \"query\": \"$QUERY\"}" \
  http://localhost:8047/query.json > results.json

Automating Repeated Workflows

For recurring analysis tasks, create Claude Code skills that encapsulate entire workflows. A typical pattern involves:

  1. Parameter-driven queries: Pass date ranges, filters, or table names as variables
  2. Result caching: Store query results for subsequent processing
  3. Output formatting: Transform Drill output into CSV, JSON, or analysis-ready formats

Here’s a skill for weekly sales reporting:

---
name: drill-sales-report
description: Generate weekly sales summary from Drill
---

# Weekly Sales Report Generator

## Usage

Provide the following parameters:
- start_date: Report start date (YYYY-MM-DD)
- end_date: Report end date (YYYY-MM-DD)
- output_format: csv or json

## Query Template

SELECT 
  DATE_TRUNC('week', order_date) AS week,
  category,
  COUNT(DISTINCT customer_id) AS unique_customers,
  SUM(total_amount) AS revenue,
  AVG(total_amount) AS avg_order_value
FROM dfs.`/data/sales`
WHERE order_date BETWEEN '${start_date}' AND '${end_date}'
GROUP BY DATE_TRUNC('week', order_date), category
ORDER BY week, revenue DESC

This approach makes Drill accessible to team members who prefer working through natural language prompts rather than writing SQL directly.

Best Practices for Drill Workflows

When building production-grade Drill workflows with Claude Code, consider these recommendations:

Optimize Query Performance: Use LIMIT clauses during exploration to avoid scanning entire datasets. Drill’s distributed execution model means poorly optimized queries can stress cluster resources significantly.

Leverage Metadata Caching: Drill caches metadata aggressively. For frequently queried sources, maintain a separate metadata refresh workflow to ensure schema changes are recognized.

Handle Nested Data Carefully: Complex types (arrays, maps, structs) require explicit handling. Always use DESCRIBE before querying unfamiliar data structures, and test FLATTEN operations on small samples before scaling.

Implement Error Handling: Drill queries can fail due to malformed SQL, missing files, or permission issues. Wrap API calls in error-checking logic and provide meaningful feedback when queries fail.

Conclusion

Claude Code transforms Apache Drill from an interactive query tool into an automatable workflow engine. By encapsulating connection logic, query templates, and result processing into skills, you enable reproducible data exploration across your organization. Start with simple REST API interactions, then graduate to parameterized workflows that handle real-world analytical requirements. The combination of Drill’s schema-free flexibility and Claude Code’s automation capabilities creates a powerful foundation for data-driven workflows.

Built by theluckystrike — More at zovo.one