AI Tools Compared

AI-Powered Observability Configuration Tools 2026

Observability configuration is verbose and error-prone. OpenTelemetry collector pipelines, Prometheus alerting rules, and distributed tracing setup all involve YAML that’s hard to write from memory. AI tools are effective here when you provide clear requirements.

OpenTelemetry Collector Configuration

Prompt to Claude:

Write an OpenTelemetry Collector config for a microservices environment.
Receivers: OTLP (gRPC 4317, HTTP 4318), Prometheus scrape of localhost:8080/metrics.
Processors: batch (10s, max 512 spans), memory limiter (80% limit, 25% spike),
resource detection (add cloud provider, region).
Exporters: Jaeger (http://jaeger:14268), Prometheus remote_write to http://prometheus:9090.
Add a service pipeline for traces and metrics separately.

Claude output:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]

  prometheus:
    config:
      scrape_configs:
        - job_name: "local-service"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8080"]

processors:
  batch:
    timeout: 10s
    send_batch_size: 512

  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 25
    check_interval: 1s

  resourcedetection:
    detectors: [env, system, gcp, aws]
    timeout: 5s
    override: false

  resource:
    attributes:
      - key: service.environment
        value: production
        action: insert

exporters:
  jaeger:
    endpoint: http://jaeger:14268/api/traces
    tls:
      insecure: true

  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    tls:
      insecure: true
    resource_to_telemetry_conversion:
      enabled: true

  logging:
    verbosity: normal

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, resource, batch]
      exporters: [jaeger, logging]

    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resourcedetection, resource, batch]
      exporters: [prometheusremotewrite, logging]

Claude puts memory_limiter first in the processor chain, which is the correct order: it sheds load before batching can amplify memory pressure. It also adds a health_check extension for liveness probes and enables resource_to_telemetry_conversion on the Prometheus exporter so resource attributes become labels. One caveat to verify against your collector version: the dedicated jaeger exporter was removed from recent collector releases (Jaeger ingests OTLP natively, so an otlp exporter pointed at Jaeger's OTLP port is the current approach), and the logging exporter has been deprecated in favor of the debug exporter.
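The ordering rule generalizes: memory_limiter first, batch last. A stdlib-only sketch of a CI lint for that rule (the processor lists mirror the service pipelines above; parsing the YAML itself would need a third-party library):

```python
def lint_pipeline(name: str, processors: list[str]) -> list[str]:
    """Return warnings for processor orderings that violate collector guidance."""
    warnings = []
    if "memory_limiter" in processors and processors[0] != "memory_limiter":
        warnings.append(f"{name}: memory_limiter should run first so load is shed before other work")
    if "batch" in processors and processors[-1] != "batch":
        warnings.append(f"{name}: batch should run last so exporters receive full batches")
    return warnings

# Mirrors the traces pipeline from the config above
print(lint_pipeline("traces", ["memory_limiter", "resourcedetection", "resource", "batch"]))  # []
print(lint_pipeline("bad", ["batch", "memory_limiter"]))  # two warnings
```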

Prometheus Alerting Rules

Prompt:

Write Prometheus alerting rules for a web service. Alerts needed:
1. High error rate: >1% 5xx errors over 5 minutes (severity: warning), >5% (critical)
2. Slow response time: p99 latency >500ms for 10 minutes (warning), >2s (critical)
3. Pod restarts: >3 restarts in 1 hour
4. High memory: >85% of limit for 15 minutes
Include meaningful labels and runbook URLs.

Claude output:

groups:
  - name: web-service-slos
    interval: 30s
    rules:
      - alert: HighErrorRateWarning
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: HighErrorRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Critical error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: SlowResponseTimeWarning
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow p99 latency for {{ $labels.service }}"
          description: "p99 latency is {{ $value | humanizeDuration }}"
          runbook: "https://wiki.example.com/runbooks/slow-latency"

      - alert: SlowResponseTimeCritical
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2.0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical p99 latency for {{ $labels.service }}"
          description: "p99 latency is {{ $value | humanizeDuration }}"
          runbook: "https://wiki.example.com/runbooks/slow-latency"

  - name: kubernetes-service
    rules:
      - alert: PodRestartingFrequently
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} restarting frequently"
          description: "{{ $value }} restarts in the last hour"
          runbook: "https://wiki.example.com/runbooks/pod-restarts"

      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_working_set_bytes{container!=""}
            /
            container_spec_memory_limit_bytes{container!=""}
          ) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.container }}"
          description: "Memory at {{ $value | humanizePercentage }} of limit"
          runbook: "https://wiki.example.com/runbooks/high-memory"

Claude uses rate() inside histogram_quantile (not irate — correct for alerting), uses container!="" to exclude pause containers, and uses the humanizePercentage/humanizeDuration template functions.
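The error-rate expression is just the 5xx counter delta divided by the total counter delta over the window: rate() divides each delta by the window length, so the window cancels out of the ratio. A stdlib sketch of that arithmetic, useful for unit-testing alert thresholds (the sample counter values are hypothetical):

```python
def error_ratio(errors_start: float, errors_end: float,
                total_start: float, total_end: float) -> float:
    """Mirror of sum(rate(5xx)) / sum(rate(total)) for one scrape window."""
    delta_total = total_end - total_start
    if delta_total == 0:
        return 0.0
    return (errors_end - errors_start) / delta_total

# 120 new 5xx out of 10,000 new requests over the window: 1.2%, above the 1% warning threshold
ratio = error_ratio(500, 620, 40_000, 50_000)
print(f"{ratio:.1%}")  # 1.2%
assert ratio > 0.01
```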

Structured Logging Configuration

Prompt:

Write a Python logging configuration using structlog for a FastAPI service.
Requirements: JSON output in production, colored console in dev, include request_id,
service name, and version in every log entry. Add a middleware to set request_id.

Claude output:

# logging_config.py
import os
import uuid
import structlog
from contextvars import ContextVar
from fastapi import Request

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

def configure_logging():
    shared_processors = [
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.stdlib.add_logger_name,
    ]

    is_production = os.getenv("ENVIRONMENT", "development") == "production"

    if is_production:
        processors = shared_processors + [
            structlog.processors.dict_tracebacks,
            structlog.processors.JSONRenderer(),
        ]
    else:
        processors = shared_processors + [
            structlog.dev.ConsoleRenderer(colors=True),
        ]

    structlog.configure(
        processors=processors,
        wrapper_class=structlog.make_filtering_bound_logger(20),  # INFO level
        context_class=dict,
        logger_factory=structlog.PrintLoggerFactory(),
        cache_logger_on_first_use=True,
    )

async def request_id_middleware(request: Request, call_next):
    req_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.bind_contextvars(
        request_id=req_id,
        service="my-service",
        version=os.getenv("APP_VERSION", "unknown"),
    )
    try:
        response = await call_next(request)
        response.headers["X-Request-ID"] = req_id
        return response
    finally:
        # Clear even if call_next raises, so context can't leak into later requests
        structlog.contextvars.clear_contextvars()
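The middleware works because structlog's contextvars processors read from the stdlib contextvars mechanism. A structlog-free sketch of the same pattern (the JSON log shape here is illustrative, not structlog's exact output):

```python
import json
import uuid
from contextvars import ContextVar

request_id_var: ContextVar[str] = ContextVar("request_id", default="")

def log_event(event: str, **fields) -> str:
    """Render a JSON log line, merging the ambient request_id like merge_contextvars does."""
    entry = {"event": event, "request_id": request_id_var.get(), **fields}
    return json.dumps(entry)

# Simulate one request: bind, log, then reset (the middleware's clear step)
token = request_id_var.set(str(uuid.uuid4()))
during = json.loads(log_event("order.created", order_id="o-123"))
request_id_var.reset(token)

assert during["request_id"]  # populated while the "request" was active
assert json.loads(log_event("background.task"))["request_id"] == ""  # empty outside it
```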

Distributed Tracing Instrumentation

Auto-instrumentation covers the basics, but explicit span creation gives you visibility into business-level operations that HTTP frameworks don’t capture. AI handles the boilerplate effectively.

Prompt:

Write Python OpenTelemetry instrumentation for a FastAPI service that:
1. Creates spans for database queries with the query as an attribute
2. Propagates trace context to downstream HTTP calls using requests
3. Records span status on exceptions
4. Exports to an OTLP endpoint at http://otel-collector:4317

Claude output:

import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.propagate import inject
from opentelemetry.trace import Status, StatusCode
import requests as http_client

def configure_tracing(service_name: str, version: str):
    resource = Resource.create({
        "service.name": service_name,
        "service.version": version,
        "deployment.environment": os.getenv("ENVIRONMENT", "development"),
    })

    exporter = OTLPSpanExporter(
        endpoint="http://otel-collector:4317",
        insecure=True,
    )

    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI and the requests library
    # (instrument() is an instance method, so the instrumentors must be instantiated)
    FastAPIInstrumentor().instrument()
    RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

# Manual span for database queries
async def fetch_order(order_id: str, db):
    query = "SELECT * FROM orders WHERE id = $1"
    with tracer.start_as_current_span("db.query.fetch_order") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", query)
        span.set_attribute("db.params.order_id", order_id)
        try:
            result = await db.fetchrow(query, order_id)
            span.set_attribute("db.rows_returned", 1 if result else 0)
            return result
        except Exception as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

# Downstream HTTP call with trace propagation
def call_inventory_service(product_id: str) -> dict:
    with tracer.start_as_current_span("http.client.inventory") as span:
        headers = {}
        inject(headers)  # Inside the span, so the injected traceparent points at the client span
        span.set_attribute("http.url", f"http://inventory-service/products/{product_id}")
        response = http_client.get(
            f"http://inventory-service/products/{product_id}",
            headers=headers,
            timeout=5,
        )
        span.set_attribute("http.status_code", response.status_code)
        if response.status_code >= 400:
            span.set_status(Status(StatusCode.ERROR, f"HTTP {response.status_code}"))
        return response.json()

RequestsInstrumentor().instrument() handles trace context propagation automatically for calls made through the requests library, but explicit inject(headers) is needed when you construct headers manually or use an uninstrumented HTTP client.
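What inject() actually writes is the W3C traceparent header: a 2-hex version, 32-hex trace id, 16-hex span id, and 2-hex flags. A stdlib sketch of building and parsing one, for understanding the wire format (real code should use opentelemetry.propagate rather than hand-rolling this):

```python
import re
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid per the Trace Context spec
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent()
assert parse_traceparent(header)["sampled"] is True
assert parse_traceparent("not-a-traceparent") is None
```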

Grafana Dashboard as Code

AI can generate Grafana dashboard JSON that you check into version control alongside your application code.

Prompt:

Write a Grafana dashboard JSON for a web service with 4 panels:
1. Request rate (requests/s by status code)
2. p50/p95/p99 latency over time
3. Error rate percentage
4. Active pods count
Use templating variable for the service name. Prometheus datasource.

Claude generates a complete dashboard definition. A representative panel for the request rate:

{
  "title": "Request Rate",
  "type": "timeseries",
  "datasource": {"type": "prometheus", "uid": "${datasource}"},
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{service=\"$service\"}[5m])) by (status_code)",
      "legendFormat": "{{status_code}}",
      "refId": "A"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "custom": {"lineWidth": 2}
    },
    "overrides": [
      {"matcher": {"id": "byRegexp", "options": "5.*"}, "properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "red"}}]},
      {"matcher": {"id": "byRegexp", "options": "2.*"}, "properties": [{"id": "color", "value": {"mode": "fixed", "fixedColor": "green"}}]}
    ]
  }
}

Claude correctly uses byRegexp matchers for color-coding status codes rather than hardcoding specific codes. It also adds the $datasource template variable automatically so dashboards can be moved between Grafana instances without editing JSON.
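Since the dashboard lives in version control, it is easy to lint in CI. A sketch that checks every Prometheus target is scoped to the $service templating variable (the panel shape is abbreviated from the example above; the second panel is a hypothetical mistake):

```python
import json

def unscoped_targets(dashboard_json: str, var: str = "$service"):
    """Return (panel title, expr) pairs whose PromQL doesn't reference the templating variable."""
    dashboard = json.loads(dashboard_json)
    misses = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if var not in expr:
                misses.append((panel.get("title"), expr))
    return misses

dashboard = json.dumps({"panels": [
    {"title": "Request Rate",
     "targets": [{"expr": 'sum(rate(http_requests_total{service="$service"}[5m])) by (status_code)'}]},
    {"title": "Active Pods",
     "targets": [{"expr": "count(kube_pod_info)"}]},  # forgot the service scoping
]})
print(unscoped_targets(dashboard))  # [('Active Pods', 'count(kube_pod_info)')]
```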

Alertmanager Routing

Prompt:

Write an Alertmanager routing config that:
- Sends critical alerts to PagerDuty
- Sends warning alerts to Slack #alerts channel
- Groups alerts by alertname and service, 30s group wait
- Silences alerts between 11pm-7am UTC on weekends for non-critical

Claude output:

global:
  resolve_timeout: 5m
  pagerduty_url: "https://events.pagerduty.com/v2/enqueue"

route:
  group_by: ["alertname", "service"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings

  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      continue: false

    - match:
        severity: warning
      receiver: slack-warnings
      mute_time_intervals:
        - weekend-nights

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_INTEGRATION_KEY}"
        severity: critical
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"

  - name: slack-warnings
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_URL}"
        channel: "#alerts"
        title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true

mute_time_intervals:
  - name: weekend-nights
    time_intervals:
      - weekdays: ["saturday", "sunday"]
        times:
          - start_time: "23:00"
            end_time: "07:00"

Route-level mute_time_intervals is the right mechanism here, but the top-level key has been renamed: since Alertmanager v0.24 the named intervals should be defined under a top-level time_intervals: block (the top-level mute_time_intervals: key is deprecated). Also note that Alertmanager does not expand ${VAR} references in its config file; either template the file at deploy time (e.g. with envsubst) or use the file-based secret fields such as routing_key_file and api_url_file.

One gap to watch: Claude’s weekend-nights interval covers Saturday and Sunday 23:00-07:00, but that range crosses midnight. Alertmanager time ranges must fall within a single calendar day (start_time before end_time), so to mute from 23:00 Saturday to 07:00 Sunday you need two entries: one for Saturday 23:00-24:00 and one for Sunday 00:00-07:00. Validate the config with amtool before relying on mute intervals in production:

# List currently active silences for an alert (silences are separate from mute intervals)
amtool --alertmanager.url=http://alertmanager:9093 \
  silence query alertname="HighErrorRateWarning"

# Validate config before applying
amtool config check /etc/alertmanager/alertmanager.yml
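The midnight split described above looks like this, keeping the same interval name so the route reference is unchanged (shown under the newer top-level time_intervals: key; older Alertmanager versions used mute_time_intervals: at the top level):

```yaml
time_intervals:
  - name: weekend-nights
    time_intervals:
      - weekdays: ["saturday"]
        times:
          - start_time: "23:00"
            end_time: "24:00"
      - weekdays: ["sunday"]
        times:
          - start_time: "00:00"
            end_time: "07:00"
```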

Choosing the Right AI Tool for Observability Config

For YAML-heavy configuration (OTel collector, Prometheus rules, Alertmanager), Claude and GPT-4 both perform well; the differentiators show up in edge cases like the midnight-crossing mute interval above.

For instrumentation code (OpenTelemetry SDK setup, span creation), Claude produces more idiomatic output that follows current OpenTelemetry specification conventions — particularly around resource attributes and semantic conventions for database and HTTP spans.

Built by theluckystrike — More at zovo.one