AI Tools Compared

Generating useful unit tests with AI is harder than it looks. The easy version — generating tests that pass — is trivially achievable. The hard version — generating tests that catch bugs, cover edge cases, and stay maintainable — requires tools that understand what your code should do, not just what it currently does.

Tools Compared

What Separates Good Test Generation from Bad

A test that only covers the happy path is nearly useless. The tests worth having cover:

  1. Happy path (expected inputs, expected outputs)
  2. Boundary values (empty string, zero, max int, empty list)
  3. Invalid inputs (null, wrong type, out-of-range)
  4. State variations (what if a dependency is unavailable)
  5. Error propagation (does the right exception reach the caller)

Test Subject: Payment Processor Function

def process_payment(
    amount: float,
    currency: str,
    payment_method: PaymentMethod,
    idempotency_key: str,
) -> PaymentResult:
    """
    Process a payment through the configured payment gateway.

    Raises:
        InvalidAmountError: amount <= 0 or > 1,000,000
        InvalidCurrencyError: currency code invalid
        DuplicateTransactionError: idempotency_key already processed
        PaymentGatewayError: gateway returned error
    """
    if amount <= 0 or amount > 1_000_000:
        raise InvalidAmountError(f"Amount {amount} out of valid range")

    if not re.match(r"^[A-Z]{3}$", currency):
        raise InvalidCurrencyError(f"Invalid currency: {currency}")

    if idempotency_store.exists(idempotency_key):
        return idempotency_store.get(idempotency_key)

    result = gateway.charge(amount, currency, payment_method)
    idempotency_store.set(idempotency_key, result)
    return result

CodiumAI / Qodo

CodiumAI analyzes your code’s behavior semantics and generates tests for each identified behavior. Integrates into VS Code and JetBrains.

CodiumAI-generated tests for process_payment:

class TestProcessPayment:
    def test_valid_payment_returns_result(self, mock_gateway, mock_idempotency_store):
        result = process_payment(100.0, "USD", mock_payment_method, "key-001")
        assert result.status == "success"

    def test_minimum_valid_amount(self, mock_gateway, mock_idempotency_store):
        result = process_payment(0.01, "USD", mock_payment_method, "key-002")
        assert result.status == "success"

    def test_maximum_valid_amount(self, mock_gateway, mock_idempotency_store):
        result = process_payment(1_000_000, "USD", mock_payment_method, "key-003")
        assert result.status == "success"

    def test_zero_amount_raises_error(self, mock_gateway, mock_idempotency_store):
        with pytest.raises(InvalidAmountError):
            process_payment(0, "USD", mock_payment_method, "key-004")

    def test_negative_amount_raises_error(self, mock_gateway, mock_idempotency_store):
        with pytest.raises(InvalidAmountError):
            process_payment(-1.0, "USD", mock_payment_method, "key-005")

    def test_over_max_amount_raises_error(self, mock_gateway, mock_idempotency_store):
        with pytest.raises(InvalidAmountError):
            process_payment(1_000_000.01, "USD", mock_payment_method, "key-006")

    def test_lowercase_currency_raises_error(self, mock_gateway, mock_idempotency_store):
        with pytest.raises(InvalidCurrencyError):
            process_payment(100.0, "usd", mock_payment_method, "key-007")

    def test_duplicate_key_returns_cached_result(self, mock_gateway, mock_idempotency_store):
        mock_idempotency_store.exists.return_value = True
        mock_idempotency_store.get.return_value = existing_result
        result = process_payment(100.0, "USD", mock_payment_method, "existing-key")
        mock_gateway.charge.assert_not_called()
        assert result == existing_result

    def test_gateway_error_propagates(self, mock_gateway, mock_idempotency_store):
        mock_gateway.charge.side_effect = PaymentGatewayError("declined")
        with pytest.raises(PaymentGatewayError):
            process_payment(100.0, "USD", mock_payment_method, "key-009")

CodiumAI generated 9 tests covering boundaries, invalid inputs, idempotency, and error propagation in one pass.

GitHub Copilot with /tests

# Copilot generated:
def test_process_payment_success():
    result = process_payment(100.0, "USD", payment_method, "key")
    assert result is not None

def test_process_payment_invalid_amount():
    with pytest.raises(InvalidAmountError):
        process_payment(-10, "USD", payment_method, "key")

def test_process_payment_invalid_currency():
    with pytest.raises(InvalidCurrencyError):
        process_payment(100.0, "invalid", payment_method, "key")

Copilot generated 3 tests. Missed: boundary conditions (0, 1_000_000, 1_000_000.01), idempotency test, and gateway error propagation.

Claude with a Strong Prompt

Claude generates high-quality tests when prompted with the testing strategy explicitly:

Write pytest unit tests for this function. Requirements:
- Test happy path
- Test ALL documented error cases
- Test boundary values for amount (0, 0.01, 1_000_000, 1_000_000.01)
- Test idempotency (same key used twice should return cached result)
- Use pytest fixtures and unittest.mock
- Each test should have a clear descriptive name

[paste function code]

With this prompt, Claude generates test quality comparable to CodiumAI. The difference is that CodiumAI identifies the test strategy automatically; Claude needs you to specify it.

Coverage Comparison

Tool Tests Generated Branch Coverage Edge Cases Found Setup Required
CodiumAI 9 tests 95%+ Yes (all identified) Minimal
Claude (detailed prompt) 8-10 tests 90%+ Yes Prompt engineering
GitHub Copilot 3-5 tests 60% Partial None
Diffblue (Java) Full suite 90%+ Yes CI integration

Workflow Recommendation

For new code as you write it: use Copilot or Claude inline for quick test generation.

For coverage improvement on existing code: CodiumAI is the most efficient.

For legacy Java codebases with no tests: Diffblue is the specialized tool.

The most cost-efficient approach for most teams: use Claude with a structured prompt. It matches CodiumAI quality when prompted correctly.

# Template prompt for comprehensive test generation:
Generate {framework} tests for the function below.
Include: happy path, boundary conditions for all numeric parameters,
all documented exceptions, state variations (mocked dependencies in error states),
and at minimum one test per documented behavior.

[paste function with docstring]

Test Generation for Async Code

Async testing requires extra care with mocking and timing. AI tools vary in quality:

# Function to test
async def fetch_and_cache(user_id: str, ttl_seconds: int = 3600) -> User:
    cached = await cache.get(f"user:{user_id}")
    if cached:
        return cached

    user = await api.fetch_user(user_id)
    await cache.set(f"user:{user_id}", user, ttl=ttl_seconds)
    return user

CodiumAI generates:

async def test_returns_cached_user(self):
    # Mocks cache hit
    mock_user = User(id="1", name="Alice")
    cache_mock.get.return_value = mock_user

    result = await fetch_and_cache("1")

    assert result == mock_user
    api.fetch_user.assert_not_called()

async def test_caches_fresh_user(self):
    # Mocks cache miss
    mock_user = User(id="2", name="Bob")
    cache_mock.get.return_value = None
    api.fetch_user.return_value = mock_user

    result = await fetch_and_cache("2", ttl_seconds=7200)

    assert result == mock_user
    cache_mock.set.assert_called_once_with("user:2", mock_user, ttl=7200)

Correct async handling with proper mock setup for both cache hit and miss paths.

Claude with weak prompt:

# Less specific — may not include all cases
def test_fetch_and_cache():
    user = fetch_and_cache("1")
    assert user is not None

Claude with strong prompt:

# With specific instructions, quality matches CodiumAI
async def test_fetch_and_cache_returns_cached_when_hit():
    mock_user = User(id="1", name="Alice")
    cache_mock.get = AsyncMock(return_value=mock_user)

    result = await fetch_and_cache("1")

    assert result == mock_user
    cache_mock.get.assert_called_once_with("user:1")
    api.fetch_user.assert_not_called()

async def test_fetch_and_cache_fetches_and_caches_on_miss():
    mock_user = User(id="2", name="Bob")
    cache_mock.get = AsyncMock(return_value=None)
    api.fetch_user = AsyncMock(return_value=mock_user)

    result = await fetch_and_cache("2", ttl_seconds=7200)

    assert result == mock_user
    cache_mock.set.assert_called_once()
    args, kwargs = cache_mock.set.call_args
    assert args == ("user:2", mock_user)
    assert kwargs["ttl"] == 7200

Integration Tests vs Unit Tests

Good test generation tools distinguish between unit tests (isolated function) and integration tests (testing database interaction, external APIs).

For unit tests: mock everything. For integration tests: use test database or fixtures.

CodiumAI: Generates both unit and integration test suggestions, clearly labeled.

Claude: Generates whatever you ask for. Be explicit: “Generate unit tests with mocked dependencies, not integration tests.”

Parameterized Tests for Multiple Inputs

Testing the same function with many input combinations:

import pytest

@pytest.mark.parametrize(
    "amount,currency,expected_error",
    [
        (0, "USD", InvalidAmountError),
        (-100, "USD", InvalidAmountError),
        (1_000_001, "USD", InvalidAmountError),
        (100, "invalid", InvalidCurrencyError),
        (100, "usd", InvalidCurrencyError),  # lowercase
        (100, "US", InvalidCurrencyError),  # too short
    ],
)
async def test_process_payment_validation(amount, currency, expected_error):
    with pytest.raises(expected_error):
        await process_payment(amount, currency, mock_payment_method, "key")

Tool quality on parameterized tests:

Test Maintenance and Coverage Monitoring

After generation, tests need maintenance as code changes.

# Check current coverage
pytest --cov=services/order_service tests/

# Generate coverage report
pytest --cov=services/order_service --cov-report=html tests/
# Opens htmlcov/index.html

AI-generated tests often achieve 80-95% line coverage but may miss edge cases (5-10% of real bugs live in edges). Developers need to add ~10 manual tests per module to catch domain-specific edge cases.

Test Generation for Different Frameworks

Tools vary by language/framework:

Language Best Tool Notes
Python/pytest Claude or CodiumAI Both excellent
Java/JUnit Diffblue > CodiumAI Diffblue specializes in Java
TypeScript/Jest CodiumAI or Claude Both good
Go/testing Claude No specialized tool yet
Rust/cargo test Claude No specialized tool yet
C++/googletest CodiumAI or Claude Limited specialized tools

For less common languages, Claude is reliable because it’s general-purpose. For Python and Java, specialized tools have higher coverage depth.

Built by theluckystrike — More at zovo.one