AI Tools Compared

Circuit breakers prevent cascading failures in distributed systems. Configuring them correctly is harder than it looks: wrong failure thresholds cause nuisance trips, wrong wait durations cause extended outages, and missing fallbacks turn circuit trips into user-visible errors. AI tools help by recommending thresholds based on your SLOs and generating the configuration code.

Circuit Breaker Configuration Parameters

Before generating any config, you need to understand the parameters:

Parameter What It Controls Common Mistake
slidingWindowSize N calls to evaluate failure rate Too small = hypersensitive
failureRateThreshold % failures before opening Too low = nuisance trips
waitDurationInOpenState How long to stay open before retrying Too short = thundering herd
permittedCallsInHalfOpenState Calls to test in half-open Too many = slow recovery
slowCallDurationThreshold What counts as “slow” Must be set per-service SLO
slowCallRateThreshold % slow calls before opening Often forgotten

Using AI to Recommend Thresholds

Prompt:

I'm configuring a circuit breaker for calls from my OrderService to a PaymentService.
Context:
- PaymentService p99 latency: 800ms
- PaymentService p50 latency: 120ms
- Acceptable failure rate: 5% (per SLO)
- During a payment outage, I want to stop retrying after 20% failure rate
- Retry attempts should resume after 30 seconds
- Traffic: ~200 payment requests/minute

Recommend Resilience4j circuit breaker configuration with explanations.

Claude’s configuration with reasoning:

# application.yml — Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        # Use count-based window of 20 calls (covers ~6 seconds at 200 req/min)
        # Large enough to be statistically meaningful, small enough to react quickly
        sliding-window-type: COUNT_BASED
        sliding-window-size: 20

        # Open circuit when 20%+ of calls fail (4/20 calls)
        # Your SLO allows 5%, so 20% means something is definitely wrong
        failure-rate-threshold: 20

        # Open circuit when 40%+ of calls take >1.5s (2x p99 latency)
        # This catches partial degradation before full outages
        slow-call-duration-threshold: 1500ms
        slow-call-rate-threshold: 40

        # Stay open for 30 seconds — let PaymentService recover
        # Shorter risks thundering herd on recovery
        wait-duration-in-open-state: 30s

        # Test with 3 calls in half-open (conservative — don't hammer recovering service)
        permitted-number-of-calls-in-half-open-state: 3

        # Close circuit when all 3 half-open test calls succeed
        # (or failure rate drops below threshold)

        # Minimum calls before evaluating (avoids tripping on cold start)
        minimum-number-of-calls: 10

        # Health indicator for monitoring
        register-health-indicator: true

        # Automatic state transition (half-open → open if still failing)
        automatic-transition-from-open-to-half-open-enabled: true

Claude explains each setting and its rationale. GPT-4 generates similar config but often omits slowCallDurationThreshold and slowCallRateThreshold — missing the slow-call circuit breaker is a common misconfiguration.

Resilience4j Java Implementation

// PaymentServiceClient.java — full circuit breaker + retry + fallback
@Service
@Slf4j
@RequiredArgsConstructor
public class PaymentServiceClient {

    private final WebClient webClient;
    private final CircuitBreakerRegistry circuitBreakerRegistry;
    private final RetryRegistry retryRegistry;

    /**
     * Process payment with circuit breaker, retry, and fallback.
     * Decorators applied in order: Retry → CircuitBreaker
     */
    public Mono<PaymentResult> processPayment(PaymentRequest request) {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("payment-service");
        Retry retry = retryRegistry.retry("payment-service");

        return Mono.fromCallable(() -> request)
            .flatMap(this::callPaymentApi)
            .transformDeferred(CircuitBreakerOperator.of(cb))
            .transformDeferred(RetryOperator.of(retry))
            .onErrorResume(CallNotPermittedException.class, ex -> {
                log.warn("Payment circuit breaker OPEN — using fallback for request {}",
                    request.getOrderId());
                return processFallbackPayment(request);
            })
            .onErrorResume(ex -> {
                log.error("Payment service failed after retries for order {}: {}",
                    request.getOrderId(), ex.getMessage());
                return Mono.error(new PaymentServiceUnavailableException(
                    "Payment service temporarily unavailable", ex));
            });
    }

    private Mono<PaymentResult> callPaymentApi(PaymentRequest request) {
        return webClient.post()
            .uri("/payments")
            .bodyValue(request)
            .retrieve()
            .onStatus(status -> status.is5xxServerError(),
                resp -> resp.bodyToMono(String.class)
                    .flatMap(body -> Mono.error(
                        new PaymentServerException("Payment service error: " + body))))
            .bodyToMono(PaymentResult.class)
            .timeout(Duration.ofMillis(2000));  // Hard timeout
    }

    /**
     * Fallback: queue the payment for async processing when circuit is open.
     */
    private Mono<PaymentResult> processFallbackPayment(PaymentRequest request) {
        return paymentQueueService.enqueue(request)
            .map(queueId -> PaymentResult.pending(request.getOrderId(), queueId));
    }

    @EventListener
    public void onCircuitBreakerStateChange(CircuitBreakerOnStateTransitionEvent event) {
        log.warn("Payment circuit breaker state: {} → {}",
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState());

        // Alert on-call if circuit opens
        if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
            alertingService.sendAlert(Alert.builder()
                .severity(Severity.HIGH)
                .message("Payment service circuit breaker opened")
                .build());
        }
    }
}

.NET Polly Configuration

Claude handles multi-language circuit breaker configuration:

Prompt: “Configure a Polly v8 circuit breaker for a .NET 8 service calling an external inventory API.”

// InventoryServiceExtensions.cs — generated by Claude
using Polly;
using Polly.CircuitBreaker;

public static class InventoryServiceExtensions
{
    public static IServiceCollection AddInventoryService(
        this IServiceCollection services,
        IConfiguration configuration)
    {
        services.AddHttpClient<IInventoryService, InventoryService>(client =>
        {
            client.BaseAddress = new Uri(configuration["Services:Inventory:BaseUrl"]!);
            client.Timeout = TimeSpan.FromSeconds(5);
        })
        .AddResilienceHandler("inventory-pipeline", builder =>
        {
            // Circuit Breaker
            builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
            {
                // Open after 30% failure rate over 20 calls
                FailureRatio = 0.3,
                SamplingDuration = TimeSpan.FromSeconds(30),
                MinimumThroughput = 20,

                // Stay open for 45 seconds
                BreakDuration = TimeSpan.FromSeconds(45),

                // Count server errors and timeouts as failures
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .Handle<TimeoutRejectedException>()
                    .HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),

                OnOpened = args =>
                {
                    logger.LogWarning(
                        "Inventory circuit breaker opened. Duration: {Duration}",
                        args.BreakDuration);
                    return ValueTask.CompletedTask;
                },

                OnClosed = args =>
                {
                    logger.LogInformation("Inventory circuit breaker closed.");
                    return ValueTask.CompletedTask;
                },
            });

            // Retry (applied before circuit breaker)
            builder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
            {
                MaxRetryAttempts = 2,
                Delay = TimeSpan.FromMilliseconds(200),
                BackoffType = DelayBackoffType.Exponential,
                ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                    .Handle<HttpRequestException>()
                    .HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),
            });

            // Timeout
            builder.AddTimeout(TimeSpan.FromSeconds(3));
        });

        return services;
    }
}

Threshold Tuning Guide (AI-Generated)

Use this prompt to get tuning advice for existing deployments:

My circuit breaker trips 3-4 times per day even when PaymentService is healthy.
Current config: failureRateThreshold=10, slidingWindowSize=10, waitDuration=15s

Datadog metrics show:
- PaymentService p99 latency: 650ms
- Payment error rate: 1.2% (within SLO)
- Circuit opens when a single slow burst occurs (e.g., DB query at PaymentService)

How should I tune the circuit breaker to reduce nuisance trips while still
protecting against real outages?

Claude’s tuning recommendations:

  1. Increase slidingWindowSize from 10 to 30 — 10 calls is too small; one slow response out of 10 = 10% instantly
  2. Increase failureRateThreshold from 10% to 25% — your SLO allows 5% errors; 25% is a clear signal
  3. Increase minimumNumberOfCalls from default 5 to 15 — avoid trips on startup or low-traffic periods
  4. Add slowCallDurationThreshold=1500ms + slowCallRateThreshold=50 — catch degradation separately from errors

Built by theluckystrike — More at zovo.one