AI-Powered Observability Configuration Tools 2026
Observability configuration is verbose and error-prone. OpenTelemetry collector pipelines, Prometheus alerting rules, and distributed tracing setup all involve YAML that’s hard to write from memory. AI tools are effective here when you provide clear requirements.
OpenTelemetry Collector Configuration
Prompt to Claude:
Write an OpenTelemetry Collector config for a microservices environment.
Receivers: OTLP (gRPC 4317, HTTP 4318), Prometheus scrape of localhost:8080/metrics.
Processors: batch (10s, max 512 spans), memory limiter (80% limit, 25% spike),
resource detection (add cloud provider, region).
Exporters: Jaeger (http://jaeger:14268), Prometheus remote_write to http://prometheus:9090.
Add a service pipeline for traces and metrics separately.
Claude output:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
cors:
allowed_origins: ["*"]
prometheus:
config:
scrape_configs:
- job_name: "local-service"
scrape_interval: 15s
static_configs:
- targets: ["localhost:8080"]
processors:
batch:
timeout: 10s
send_batch_size: 512
memory_limiter:
limit_percentage: 80
spike_limit_percentage: 25
check_interval: 1s
resourcedetection:
detectors: [env, system, gcp, aws]
timeout: 5s
override: false
resource:
attributes:
- key: service.environment
value: production
action: insert
exporters:
jaeger:
endpoint: http://jaeger:14268/api/traces
tls:
insecure: true
prometheusremotewrite:
endpoint: "http://prometheus:9090/api/v1/write"
tls:
insecure: true
resource_to_telemetry_conversion:
enabled: true
logging:
verbosity: normal
extensions:
health_check:
endpoint: 0.0.0.0:13133
pprof:
endpoint: 0.0.0.0:1777
service:
extensions: [health_check, pprof]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, resourcedetection, resource, batch]
exporters: [jaeger, logging]
metrics:
receivers: [otlp, prometheus]
processors: [memory_limiter, resourcedetection, resource, batch]
exporters: [prometheusremotewrite, logging]
Claude puts memory_limiter first in the processor chain — this is the correct order (prevents OOM before batching). It adds health_check extension for liveness probes and uses resource_to_telemetry_conversion on the Prometheus exporter so resource attributes become labels.
Prometheus Alerting Rules
Prompt:
Write Prometheus alerting rules for a web service. Alerts needed:
1. High error rate: >1% 5xx errors over 5 minutes (severity: warning), >5% (critical)
2. Slow response time: p99 latency >500ms for 10 minutes (warning), >2s (critical)
3. Pod restarts: >3 restarts in 1 hour
4. High memory: >85% of limit for 15 minutes
Include meaningful labels and runbook URLs.
Claude output:
groups:
- name: web-service-slos
interval: 30s
rules:
- alert: HighErrorRateWarning
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.01
for: 5m
labels:
severity: warning
team: platform
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: HighErrorRateCritical
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.05
for: 5m
labels:
severity: critical
team: platform
annotations:
summary: "Critical error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }}"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
- alert: SlowResponseTimeWarning
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "Slow p99 latency for {{ $labels.service }}"
description: "p99 latency is {{ $value | humanizeDuration }}"
runbook: "https://wiki.example.com/runbooks/slow-latency"
- alert: SlowResponseTimeCritical
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 2.0
for: 10m
labels:
severity: critical
annotations:
summary: "Critical p99 latency for {{ $labels.service }}"
description: "p99 latency is {{ $value | humanizeDuration }}"
runbook: "https://wiki.example.com/runbooks/slow-latency"
- name: kubernetes-service
rules:
- alert: PodRestartingFrequently
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 3
for: 0m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.pod }} restarting frequently"
description: "{{ $value }} restarts in the last hour"
runbook: "https://wiki.example.com/runbooks/pod-restarts"
- alert: HighMemoryUsage
expr: |
(
container_memory_working_set_bytes{container!=""}
/
container_spec_memory_limit_bytes{container!=""}
) > 0.85
for: 15m
labels:
severity: warning
annotations:
summary: "High memory usage in {{ $labels.container }}"
description: "Memory at {{ $value | humanizePercentage }} of limit"
runbook: "https://wiki.example.com/runbooks/high-memory"
Claude uses rate() inside histogram_quantile (not irate — correct for alerting), uses container!="" to exclude pause containers, and uses the humanizePercentage/humanizeDuration template functions.
Structured Logging Configuration
Prompt:
Write a Python logging configuration using structlog for a FastAPI service.
Requirements: JSON output in production, colored console in dev, include request_id,
service name, and version in every log entry. Add a middleware to set request_id.
# logging_config.py
import os
import uuid
import structlog
from contextvars import ContextVar
from fastapi import Request
request_id_var: ContextVar[str] = ContextVar("request_id", default="")
def configure_logging():
shared_processors = [
structlog.contextvars.merge_contextvars,
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.stdlib.add_logger_name,
]
is_production = os.getenv("ENVIRONMENT", "development") == "production"
if is_production:
processors = shared_processors + [
structlog.processors.dict_tracebacks,
structlog.processors.JSONRenderer(),
]
else:
processors = shared_processors + [
structlog.dev.ConsoleRenderer(colors=True),
]
structlog.configure(
processors=processors,
wrapper_class=structlog.make_filtering_bound_logger(20), # INFO level
context_class=dict,
logger_factory=structlog.PrintLoggerFactory(),
cache_logger_on_first_use=True,
)
async def request_id_middleware(request: Request, call_next):
req_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
structlog.contextvars.bind_contextvars(
request_id=req_id,
service="my-service",
version=os.getenv("APP_VERSION", "unknown"),
)
response = await call_next(request)
response.headers["X-Request-ID"] = req_id
structlog.contextvars.clear_contextvars()
return response
Distributed Tracing Instrumentation
Auto-instrumentation covers the basics, but explicit span creation gives you visibility into business-level operations that HTTP frameworks don’t capture. AI handles the boilerplate effectively.
Prompt:
Write Python OpenTelemetry instrumentation for a FastAPI service that:
1. Creates spans for database queries with the query as an attribute
2. Propagates trace context to downstream HTTP calls using requests
3. Records span status on exceptions
4. Exports to an OTLP endpoint at http://otel-collector:4317
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.propagate import inject
from opentelemetry.trace import Status, StatusCode
import requests as http_client
def configure_tracing(service_name: str, version: str):
resource = Resource.create({
"service.name": service_name,
"service.version": version,
"deployment.environment": os.getenv("ENVIRONMENT", "development"),
})
exporter = OTLPSpanExporter(
endpoint="http://otel-collector:4317",
insecure=True,
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Auto-instrument FastAPI and requests library
FastAPIInstrumentor.instrument()
RequestsInstrumentor.instrument()
tracer = trace.get_tracer(__name__)
# Manual span for database queries
async def fetch_order(order_id: str, db):
query = "SELECT * FROM orders WHERE id = $1"
with tracer.start_as_current_span("db.query.fetch_order") as span:
span.set_attribute("db.system", "postgresql")
span.set_attribute("db.statement", query)
span.set_attribute("db.params.order_id", order_id)
try:
result = await db.fetchrow(query, order_id)
span.set_attribute("db.rows_returned", 1 if result else 0)
return result
except Exception as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
raise
# Downstream HTTP call with trace propagation
def call_inventory_service(product_id: str) -> dict:
headers = {}
inject(headers) # Injects traceparent / tracestate headers
with tracer.start_as_current_span("http.client.inventory") as span:
span.set_attribute("http.url", f"http://inventory-service/products/{product_id}")
response = http_client.get(
f"http://inventory-service/products/{product_id}",
headers=headers,
timeout=5,
)
span.set_attribute("http.status_code", response.status_code)
if response.status_code >= 400:
span.set_status(Status(StatusCode.ERROR, f"HTTP {response.status_code}"))
return response.json()
RequestsInstrumentor.instrument() handles trace context propagation automatically for most HTTP calls, but explicit inject(headers) is needed when you construct headers manually or use non-standard HTTP clients.
Grafana Dashboard as Code
AI can generate Grafana dashboard JSON that you check into version control alongside your application code.
Prompt:
Write a Grafana dashboard JSON for a web service with 4 panels:
1. Request rate (requests/s by status code)
2. p50/p95/p99 latency over time
3. Error rate percentage
4. Active pods count
Use templating variable for the service name. Prometheus datasource.
Claude generates a complete dashboard definition. A representative panel for the request rate:
{
"title": "Request Rate",
"type": "timeseries",
"datasource": {"type": "prometheus", "uid": "${datasource}"},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=\"$service\"}[5m])) by (status_code)",
"legendFormat": "{{status_code}}",
"refId": "A"
}
],
"fieldConfig": {
"defaults": {
"unit": "reqps",
"custom": {"lineWidth": 2}
},
"overrides": [
{"matcher": {"id": "byRegexp", "options": "5.*"}, "properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}]},
{"matcher": {"id": "byRegexp", "options": "2.*"}, "properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "green"}}]}
]
}
}
Claude correctly uses byRegexp matchers for color-coding status codes rather than hardcoding specific codes. It also adds the $datasource template variable automatically so dashboards can be moved between Grafana instances without editing JSON.
Alertmanager Routing
Prompt:
Write an Alertmanager routing config that:
- Sends critical alerts to PagerDuty
- Sends warning alerts to Slack #alerts channel
- Groups alerts by alertname and service, 30s group wait
- Silences alerts between 11pm-7am UTC on weekends for non-critical
global:
resolve_timeout: 5m
pagerduty_url: "https://events.pagerduty.com/v2/enqueue"
route:
group_by: ["alertname", "service"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: slack-warnings
routes:
- match:
severity: critical
receiver: pagerduty-critical
continue: false
- match:
severity: warning
receiver: slack-warnings
mute_time_intervals:
- weekend-nights
receivers:
- name: pagerduty-critical
pagerduty_configs:
- routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
severity: critical
description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"
- name: slack-warnings
slack_configs:
- api_url: "${SLACK_WEBHOOK_URL}"
channel: "#alerts"
title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
send_resolved: true
mute_time_intervals:
- name: weekend-nights
time_intervals:
- weekdays: ["saturday", "sunday"]
times:
- start_time: "23:00"
end_time: "07:00"
Claude correctly uses mute_time_intervals (the modern Alertmanager approach) rather than the deprecated time_intervals with inhibition rules. Environment variable references for secrets avoid hardcoding credentials in the config file.
One gap to watch: Claude’s weekend-nights interval covers Saturday and Sunday 23:00-07:00, but this crosses midnight. Alertmanager time intervals within a single entry are evaluated as ranges within a calendar day — to mute from 23:00 Saturday to 07:00 Sunday, you need two separate entries: one for Saturday 23:00-24:00 and one for Sunday 00:00-07:00. Always test mute intervals with amtool before relying on them in production:
# Test if an alert would be muted at a specific time
amtool --alertmanager.url=http://alertmanager:9093 \
silence query alertname="HighErrorRateWarning"
# Validate config before applying
amtool config check /etc/alertmanager/alertmanager.yml
Choosing the Right AI Tool for Observability Config
For YAML-heavy configuration (OTel collector, Prometheus rules, Alertmanager), Claude and GPT-4 both perform well. The differentiators show up in edge cases:
- Processor ordering — Claude consistently puts
memory_limiterbeforebatchin OTel pipelines. GPT-4 occasionally reverses this, causing OOM under load. - PromQL correctness — Both tools handle simple rate expressions. For complex multi-label aggregations or recording rules, Claude tends to produce more syntactically correct PromQL on the first attempt.
- Config validation awareness — Claude proactively mentions tools like
promtool check rulesandotelcol validatefor verifying generated configs. This matters because invalid YAML silently fails in some environments.
For instrumentation code (OpenTelemetry SDK setup, span creation), Claude produces more idiomatic output that follows current OpenTelemetry specification conventions — particularly around resource attributes and semantic conventions for database and HTTP spans.
Related Reading
- AI-Powered CI/CD Pipeline Optimization
- AI-Powered Feature Flag Management Tools
-
AI-Powered API Gateway Configuration Tools 2026
Related Articles
- AI-Powered API Gateway Configuration Tools 2026
- AI-Powered Service Mesh Configuration 2026
- AI Tools for Generating Prometheus Alerting Rules (2026)
- AI Tools for Generating Renovate Bot Configuration for
- AI Tools for Generating Nginx Configuration Files 2026
Built by theluckystrike — More at zovo.one