Circuit breakers prevent cascading failures in distributed systems. Configuring them correctly is harder than it looks: wrong failure thresholds cause nuisance trips, wrong wait durations cause extended outages, and missing fallbacks turn circuit trips into user-visible errors. AI tools help by recommending thresholds based on your SLOs and generating the configuration code.
Circuit Breaker Configuration Parameters
Before generating any config, you need to understand the parameters:
| Parameter | What It Controls | Common Mistake |
|---|---|---|
slidingWindowSize |
N calls to evaluate failure rate | Too small = hypersensitive |
failureRateThreshold |
% failures before opening | Too low = nuisance trips |
waitDurationInOpenState |
How long to stay open before retrying | Too short = thundering herd |
permittedCallsInHalfOpenState |
Calls to test in half-open | Too many = slow recovery |
slowCallDurationThreshold |
What counts as “slow” | Must be set per-service SLO |
slowCallRateThreshold |
% slow calls before opening | Often forgotten |
Using AI to Recommend Thresholds
Prompt:
I'm configuring a circuit breaker for calls from my OrderService to a PaymentService.
Context:
- PaymentService p99 latency: 800ms
- PaymentService p50 latency: 120ms
- Acceptable failure rate: 5% (per SLO)
- During a payment outage, I want to stop retrying after 20% failure rate
- Retry attempts should resume after 30 seconds
- Traffic: ~200 payment requests/minute
Recommend Resilience4j circuit breaker configuration with explanations.
Claude’s configuration with reasoning:
# application.yml — Resilience4j configuration
resilience4j:
circuitbreaker:
instances:
payment-service:
# Use count-based window of 20 calls (covers ~6 seconds at 200 req/min)
# Large enough to be statistically meaningful, small enough to react quickly
sliding-window-type: COUNT_BASED
sliding-window-size: 20
# Open circuit when 20%+ of calls fail (4/20 calls)
# Your SLO allows 5%, so 20% means something is definitely wrong
failure-rate-threshold: 20
# Open circuit when 40%+ of calls take >1.5s (2x p99 latency)
# This catches partial degradation before full outages
slow-call-duration-threshold: 1500ms
slow-call-rate-threshold: 40
# Stay open for 30 seconds — let PaymentService recover
# Shorter risks thundering herd on recovery
wait-duration-in-open-state: 30s
# Test with 3 calls in half-open (conservative — don't hammer recovering service)
permitted-number-of-calls-in-half-open-state: 3
# Close circuit when all 3 half-open test calls succeed
# (or failure rate drops below threshold)
# Minimum calls before evaluating (avoids tripping on cold start)
minimum-number-of-calls: 10
# Health indicator for monitoring
register-health-indicator: true
# Automatic state transition (half-open → open if still failing)
automatic-transition-from-open-to-half-open-enabled: true
Claude explains each setting and its rationale. GPT-4 generates similar config but often omits slowCallDurationThreshold and slowCallRateThreshold — missing the slow-call circuit breaker is a common misconfiguration.
Resilience4j Java Implementation
// PaymentServiceClient.java — full circuit breaker + retry + fallback
@Service
@Slf4j
@RequiredArgsConstructor
public class PaymentServiceClient {
private final WebClient webClient;
private final CircuitBreakerRegistry circuitBreakerRegistry;
private final RetryRegistry retryRegistry;
/**
* Process payment with circuit breaker, retry, and fallback.
* Decorators applied in order: Retry → CircuitBreaker
*/
public Mono<PaymentResult> processPayment(PaymentRequest request) {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("payment-service");
Retry retry = retryRegistry.retry("payment-service");
return Mono.fromCallable(() -> request)
.flatMap(this::callPaymentApi)
.transformDeferred(CircuitBreakerOperator.of(cb))
.transformDeferred(RetryOperator.of(retry))
.onErrorResume(CallNotPermittedException.class, ex -> {
log.warn("Payment circuit breaker OPEN — using fallback for request {}",
request.getOrderId());
return processFallbackPayment(request);
})
.onErrorResume(ex -> {
log.error("Payment service failed after retries for order {}: {}",
request.getOrderId(), ex.getMessage());
return Mono.error(new PaymentServiceUnavailableException(
"Payment service temporarily unavailable", ex));
});
}
private Mono<PaymentResult> callPaymentApi(PaymentRequest request) {
return webClient.post()
.uri("/payments")
.bodyValue(request)
.retrieve()
.onStatus(status -> status.is5xxServerError(),
resp -> resp.bodyToMono(String.class)
.flatMap(body -> Mono.error(
new PaymentServerException("Payment service error: " + body))))
.bodyToMono(PaymentResult.class)
.timeout(Duration.ofMillis(2000)); // Hard timeout
}
/**
* Fallback: queue the payment for async processing when circuit is open.
*/
private Mono<PaymentResult> processFallbackPayment(PaymentRequest request) {
return paymentQueueService.enqueue(request)
.map(queueId -> PaymentResult.pending(request.getOrderId(), queueId));
}
@EventListener
public void onCircuitBreakerStateChange(CircuitBreakerOnStateTransitionEvent event) {
log.warn("Payment circuit breaker state: {} → {}",
event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
// Alert on-call if circuit opens
if (event.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
alertingService.sendAlert(Alert.builder()
.severity(Severity.HIGH)
.message("Payment service circuit breaker opened")
.build());
}
}
}
.NET Polly Configuration
Claude handles multi-language circuit breaker configuration:
Prompt: “Configure a Polly v8 circuit breaker for a .NET 8 service calling an external inventory API.”
// InventoryServiceExtensions.cs — generated by Claude
using Polly;
using Polly.CircuitBreaker;
public static class InventoryServiceExtensions
{
public static IServiceCollection AddInventoryService(
this IServiceCollection services,
IConfiguration configuration)
{
services.AddHttpClient<IInventoryService, InventoryService>(client =>
{
client.BaseAddress = new Uri(configuration["Services:Inventory:BaseUrl"]!);
client.Timeout = TimeSpan.FromSeconds(5);
})
.AddResilienceHandler("inventory-pipeline", builder =>
{
// Circuit Breaker
builder.AddCircuitBreaker(new CircuitBreakerStrategyOptions
{
// Open after 30% failure rate over 20 calls
FailureRatio = 0.3,
SamplingDuration = TimeSpan.FromSeconds(30),
MinimumThroughput = 20,
// Stay open for 45 seconds
BreakDuration = TimeSpan.FromSeconds(45),
// Count server errors and timeouts as failures
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.Handle<TimeoutRejectedException>()
.HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),
OnOpened = args =>
{
logger.LogWarning(
"Inventory circuit breaker opened. Duration: {Duration}",
args.BreakDuration);
return ValueTask.CompletedTask;
},
OnClosed = args =>
{
logger.LogInformation("Inventory circuit breaker closed.");
return ValueTask.CompletedTask;
},
});
// Retry (applied before circuit breaker)
builder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
{
MaxRetryAttempts = 2,
Delay = TimeSpan.FromMilliseconds(200),
BackoffType = DelayBackoffType.Exponential,
ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
.Handle<HttpRequestException>()
.HandleResult(r => r.StatusCode >= HttpStatusCode.InternalServerError),
});
// Timeout
builder.AddTimeout(TimeSpan.FromSeconds(3));
});
return services;
}
}
Threshold Tuning Guide (AI-Generated)
Use this prompt to get tuning advice for existing deployments:
My circuit breaker trips 3-4 times per day even when PaymentService is healthy.
Current config: failureRateThreshold=10, slidingWindowSize=10, waitDuration=15s
Datadog metrics show:
- PaymentService p99 latency: 650ms
- Payment error rate: 1.2% (within SLO)
- Circuit opens when a single slow burst occurs (e.g., DB query at PaymentService)
How should I tune the circuit breaker to reduce nuisance trips while still
protecting against real outages?
Claude’s tuning recommendations:
- Increase
slidingWindowSizefrom 10 to 30 — 10 calls is too small; one slow response out of 10 = 10% instantly - Increase
failureRateThresholdfrom 10% to 25% — your SLO allows 5% errors; 25% is a clear signal - Increase
minimumNumberOfCallsfrom default 5 to 15 — avoid trips on startup or low-traffic periods - Add
slowCallDurationThreshold=1500ms+slowCallRateThreshold=50— catch degradation separately from errors
Related Reading
- Claude vs Copilot for Java Spring Boot Microservices
- AI Tools for Automated Rate Limiter Configuration
- AI CI/CD Pipeline Optimization
Built by theluckystrike — More at zovo.one