Test data generation is more complex than it looks. Random strings and integers fail to catch bugs that only appear with realistic data — email formats, valid zip codes, coherent address/city/state combinations, or financial amounts that respect business rules.
## The Problem with Simple Fake Data
```python
# This catches syntax bugs but misses most real bugs
user = {
    "email": "aabbcc",      # Not a valid email
    "zip_code": "123",      # Too short for a US ZIP code
    "state": "ZZ",          # Not a real state
    "city": "Testcity",     # Doesn't match the state
}
```
Tests that use structurally invalid data miss bugs in format validation, state/city matching logic, and timezone bugs related to specific locations.
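To make the failure mode concrete, here is a minimal sketch of the kind of format validation such a fixture never exercises. The regexes and the state set are illustrative stand-ins, not production-grade validators:

```python
import re

US_STATES = {"TX", "WA", "NY", "CA"}  # Illustrative subset, not the full list

def validate_user(user: dict) -> list[str]:
    """Return a list of format problems; an empty list means the record is plausible."""
    problems = []
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", user["email"]):
        problems.append("bad email")
    if not re.fullmatch(r"\d{5}(-\d{4})?", user["zip_code"]):
        problems.append("bad zip")
    if user["state"] not in US_STATES:
        problems.append("unknown state")
    return problems

user = {"email": "aabbcc", "zip_code": "123", "state": "ZZ", "city": "Testcity"}
print(validate_user(user))  # → ['bad email', 'bad zip', 'unknown state']
```

Every assertion in a test built on this fixture fails at the format-check layer, so the business logic behind it is never reached.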
## Tools Compared
- Faker.js / Faker (Python) — Standard fake data libraries with locale support
- Mimesis — High-performance structured fake data for Python
- Mockaroo — Web/API tool for schema-defined realistic data
- Claude / LLMs — AI-generated contextually coherent datasets
## Faker and Mimesis: The Baseline
```python
from faker import Faker

# Standard providers (address, internet, bank, ...) are loaded automatically
fake = Faker("en_US")

user = {
    "name": fake.name(),
    "email": fake.email(),
    "address": fake.address(),
    "phone": fake.phone_number(),
    "credit_card": fake.credit_card_number(card_type="visa"),
}

# Seeded generation for reproducible tests
Faker.seed(12345)
fake_seeded = Faker()
print(fake_seeded.name())  # Same name on every run with this seed
```
Mimesis for high-volume generation (2-5x faster than Faker):
```python
import csv

from mimesis import Address, Payment, Person
from mimesis.enums import Gender

person = Person("en")
address_gen = Address("en")
payment = Payment()  # Card numbers live in the Payment provider, not Finance

with open("test_users.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "city", "state", "credit_card"])
    writer.writeheader()
    for _ in range(10_000):
        writer.writerow({
            "name": person.full_name(gender=Gender.FEMALE),
            "email": person.email(domains=["company.com"]),
            "city": address_gen.city(),
            "state": address_gen.state(abbr=True),
            "credit_card": payment.credit_card_number(),
        })
```
Limitation of both: They don’t guarantee relational coherence. A city/state pair from Faker may not be a real combination.
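One way to quantify that gap is to check generated pairs against a known-good mapping. A minimal sketch, where `CITY_TO_STATE` is a tiny hand-picked sample; a real check would load a full reference dataset such as a Census gazetteer:

```python
# A tiny sample of real city/state pairs, for illustration only
CITY_TO_STATE = {
    "Austin": "TX",
    "Seattle": "WA",
    "New York": "NY",
    "Boston": "MA",
}

def is_coherent(city: str, state: str) -> bool:
    """True only when the pair appears in the reference mapping."""
    return CITY_TO_STATE.get(city) == state

print(is_coherent("Austin", "TX"))  # → True
print(is_coherent("Austin", "WA"))  # → False
```

Running a check like this over a Faker-generated batch makes the incoherence rate visible before it surfaces as flaky location-dependent tests.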
## AI-Generated Contextually Coherent Data
For test data that must be internally consistent, LLM generation produces better results:
```python
import json

import anthropic

client = anthropic.Anthropic()

def generate_coherent_user_batch(count: int, constraints: dict) -> list[dict]:
    prompt = f"""Generate {count} realistic but fake user profiles as a JSON array.

Each user must have:
- full_name: realistic for US locale
- email: derived from name (firstname.lastname@domain.com format)
- city: a real US city
- state: matching the city (correct state for that city)
- zip_code: a real valid zip code for that city/state
- phone: valid US format (not 555 numbers)
- birth_date: makes person 25-55 years old as of 2026
- job_title: realistic for a tech company

Constraints:
{json.dumps(constraints, indent=2)}

Return ONLY a valid JSON array, no explanation."""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        # Strip a markdown code fence if the model added one anyway
        raw = "\n".join(raw.split("\n")[1:-1])
    return json.loads(raw)

users = generate_coherent_user_batch(20, {
    "geographic_distribution": "Mix of East Coast and West Coast cities",
    "seniority_distribution": "60% junior, 40% senior",
    "no_duplicate_emails": True,
})
```
Example output:

```text
Jennifer Martinez | Austin, TX | Senior Software Engineer
David Chen | Seattle, WA | Product Manager
Aisha Williams | New York, NY | Software Engineer
```
The city/state pairs are real, emails follow the name format, and job titles are contextually appropriate.
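Model output is not guaranteed, so it is worth validating a batch before committing it as a fixture. A stdlib-only sketch, where the field names follow the prompt above and the email-derivation check is a deliberately loose heuristic:

```python
def validate_batch(users: list[dict]) -> list[str]:
    """Return human-readable problems found in a generated user batch."""
    problems = []
    required = {"full_name", "email", "city", "state", "zip_code",
                "phone", "birth_date", "job_title"}
    seen_emails = set()
    for i, u in enumerate(users):
        missing = required - u.keys()
        if missing:
            problems.append(f"user {i}: missing {sorted(missing)}")
            continue
        if u["email"] in seen_emails:
            problems.append(f"user {i}: duplicate email {u['email']}")
        seen_emails.add(u["email"])
        # Loose heuristic: the local part should contain the last name
        last = u["full_name"].split()[-1].lower()
        if last not in u["email"].split("@")[0].lower():
            problems.append(f"user {i}: email not derived from name")
    return problems

batch = [
    {"full_name": "Jennifer Martinez", "email": "jennifer.martinez@example.com",
     "city": "Austin", "state": "TX", "zip_code": "78701", "phone": "512-201-4477",
     "birth_date": "1990-04-12", "job_title": "Senior Software Engineer"},
]
print(validate_batch(batch))  # → []
```

A non-empty problem list is a cheap signal to regenerate rather than debug mysterious test failures later.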
## Building Schema-Driven Generators
For database testing, generate data from your actual schema:
```python
import json

import anthropic
from sqlalchemy import inspect

def generate_from_schema(engine, table_name: str, count: int) -> list[dict]:
    inspector = inspect(engine)
    columns = inspector.get_columns(table_name)
    foreign_keys = inspector.get_foreign_keys(table_name)

    schema_description = f"Table: {table_name}\nColumns:\n"
    for col in columns:
        nullable = "nullable" if col["nullable"] else "required"
        schema_description += f"  - {col['name']}: {col['type']} ({nullable})\n"
    for fk in foreign_keys:
        cols = ", ".join(fk["constrained_columns"])
        schema_description += f"  - FK: {cols} -> {fk['referred_table']}\n"

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-haiku-3-5",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"""Generate {count} rows of realistic test data for this database table.

{schema_description}

Rules:
- All required fields must be populated
- Foreign key IDs should be integers 1-100
- Dates should be in ISO format
- Strings should be realistic, not 'test_value_1'

Return as a JSON array of objects with exact column names.""",
        }],
    )
    return json.loads(response.content[0].text)
```
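Once generated, the rows can be loaded straight into a throwaway database for the test run. A sketch using stdlib `sqlite3`; the `users` table and its columns are illustrative:

```python
import sqlite3

def load_fixture_rows(conn: sqlite3.Connection, table: str, rows: list[dict]) -> int:
    """Bulk-insert generated rows; assumes every dict has the same keys."""
    if not rows:
        return 0
    cols = list(rows[0])
    placeholders = ", ".join(["?"] * len(cols))
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    conn.executemany(sql, [tuple(r[c] for c in cols) for r in rows])
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
rows = [
    {"id": 1, "name": "Jennifer Martinez", "email": "jennifer.martinez@example.com"},
    {"id": 2, "name": "David Chen", "email": "david.chen@example.com"},
]
print(load_fixture_rows(conn, "users", rows))  # → 2
```

An in-memory database per test keeps fixtures isolated; the same helper works against a file-backed database for debugging.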
## Mockaroo for Team Use
```bash
curl "https://api.mockaroo.com/api/generate.json?count=100&key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "fields": [
      {"name": "id", "type": "Row Number"},
      {"name": "full_name", "type": "Full Name"},
      {"name": "email", "type": "Email Address"},
      {"name": "status", "type": "Custom List", "values": ["active", "inactive", "pending"]},
      {"name": "created_at", "type": "Datetime", "min": "01/01/2024", "max": "03/21/2026"}
    ]
  }'
```
Mockaroo’s strength is accessibility: QA engineers who don’t write code can define schemas in the web UI and download realistic data. Custom List types ensure status fields use your exact enum values.
## PII-Safe Test Data
```python
from faker import Faker

fake = Faker()  # Module-level: avoid re-instantiating Faker for every record

def anonymize_for_testing(real_record: dict) -> dict:
    return {
        **real_record,
        "user_id": real_record["user_id"],  # Keep ID for relational integrity
        "email": fake.email(),
        "full_name": fake.name(),
        "phone": fake.phone_number(),
        "address": fake.address(),
        "ssn": fake.ssn(),
        # Keep behavioral fields unchanged
        "subscription_plan": real_record["subscription_plan"],
        "created_at": real_record["created_at"],
        "total_orders": real_record["total_orders"],
    }
```
## Generating Business-Rule-Compliant Data
Standard fake data libraries have no concept of your application’s constraints. If your system requires that orders can’t exceed a user’s credit limit, or that subscription end dates must be after start dates, Faker will silently violate those rules.
Use Claude to encode business rules directly in the generation prompt:
```python
import json

import anthropic

client = anthropic.Anthropic()

def generate_order_data(user_credit_limits: dict, count: int) -> list[dict]:
    rules = """
Business rules for order data:
- order_amount must not exceed the user's credit_limit
- order_date must be within the last 12 months
- status must be 'pending' if order_date is within 3 days, else 'completed' or 'cancelled'
- cancelled orders must have a cancellation_reason
- discount_pct can only be 0, 10, 15, or 25
- total = order_amount * (1 - discount_pct/100), rounded to 2 decimal places
"""
    prompt = f"""Generate {count} order records following these business rules:
{rules}
User credit limits: {json.dumps(user_credit_limits)}

Return as a JSON array."""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=8192,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)
```
This approach works for any business constraint that can be expressed in plain language. The tradeoff is generation speed and API cost — use it for complex edge-case data, not for high-volume load test fixtures.
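Since the model can still slip, the cheap defense is to re-check the rules mechanically after generation. A sketch verifying a few of the rules above; the field names follow the prompt:

```python
def check_order(order: dict, credit_limit: float) -> list[str]:
    """Verify a generated order against the mechanically checkable rules."""
    problems = []
    if order["discount_pct"] not in (0, 10, 15, 25):
        problems.append("invalid discount_pct")
    expected_total = round(order["order_amount"] * (1 - order["discount_pct"] / 100), 2)
    if order["total"] != expected_total:
        problems.append("total does not match formula")
    if order["order_amount"] > credit_limit:
        problems.append("order exceeds credit limit")
    return problems

order = {"order_amount": 200.0, "discount_pct": 15, "total": 170.0}
print(check_order(order, credit_limit=500.0))  # → []
```

Rules expressible as arithmetic or set membership belong in a validator like this; only the fuzzier constraints need to stay prompt-only.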
## Comparison Table
| Scenario | Recommended Tool |
|---|---|
| Simple unit test fixtures | Faker or Mimesis |
| High-volume load test data | Mimesis |
| Coherent relational data (city/state/zip) | Claude API |
| Schema-defined data for non-engineers | Mockaroo |
| Production data anonymization | Faker + custom mapping |
| Complex business constraint data | Claude with constraint prompt |
| Edge case generation for bugs | Claude with boundary conditions in prompt |
## FAQ
**Can AI-generated test data be used in CI/CD pipelines?** Yes. The pattern is to generate data once, commit the JSON fixtures to the repo, and load them in tests. Only regenerate when the schema changes. Real-time LLM calls in CI are too slow and costly for routine test runs.
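The "regenerate only when the schema changes" rule can be enforced with a schema fingerprint stored next to the fixture. A stdlib sketch; the marker file name and schema dict are illustrative:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def fixture_is_stale(schema: dict, hash_file: Path) -> bool:
    """True when the schema changed since the fixture was last generated."""
    current = hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()
    previous = hash_file.read_text().strip() if hash_file.exists() else ""
    if current != previous:
        hash_file.write_text(current)  # Record the new fingerprint
        return True
    return False

marker = Path(tempfile.gettempdir()) / "users_schema.sha256"
marker.unlink(missing_ok=True)
schema = {"users": ["id", "email", "city", "state"]}
print(fixture_is_stale(schema, marker))  # → True (first run)
print(fixture_is_stale(schema, marker))  # → False (unchanged)
```

Wiring this into the fixture-generation script means the expensive LLM call only fires when a migration actually lands.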
**How do you handle foreign key constraints?** Generate parent records first and extract their IDs, then pass those IDs to the child record generation prompt as constraints. For example: “user_ids must be one of [1, 2, 3, 4, 5]” ensures referential integrity without DB lookups.
**Is Faker or Mimesis better?** Mimesis is 2-5x faster and better for generating millions of rows for load tests. Faker has more locale coverage and more data types (SSNs, license plates, etc.). For most projects, Faker is the default choice unless performance is a constraint.
## Related Articles
- AI for Automated Regression Test Generation from Bug Reports
- AI Tools for Automated Changelog Generation 2026
- AI Tools for Automated Load Testing Script Generation and An
- AI Tools for Creating Test Data Generators That Respect Busi
- AI Tools for Creating Test Data Snapshots for Database
Built by theluckystrike — More at zovo.one