AI Tools Compared

Prometheus alerting rules require deep knowledge of PromQL syntax, time series semantics, and operational thresholds. Writing rules manually means wrestling with query syntax, threshold tuning, and escalation logic. AI assistants can accelerate this process significantly—but only tools trained on Prometheus monitoring patterns generate rules that survive production use.

Why AI for Prometheus Rules

Manual rule-writing involves:

  • learning PromQL syntax and time series semantics
  • calibrating thresholds against real baselines
  • encoding escalation logic across severity levels

Poor rules create alert fatigue (false positives) or miss real incidents (false negatives); both are costly for incident response.

AI Tools Comparison

Claude (Opus 4.6, Haiku 4.5)

Price: $20/month (Claude.ai Pro) or $3 per 1M input tokens (API)
Best for: Complex multi-condition rules, recording rules, escalation logic

Claude excels at PromQL-heavy rules. It understands:

  • histogram_quantile() over _bucket metrics
  • label matchers for including or excluding services
  • sensible rate() windows and for: durations
  • recording rules for expensive aggregations

Example: You ask Claude to write a rule that alerts when p99 latency exceeds 500ms for 5 minutes, excluding internal services.

Claude produces:

alert: HighP99Latency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service!~"internal.*"}[5m])) > 0.5
for: 5m
annotations:
  summary: "High P99 latency for {{ $labels.service }}"
  description: "P99 latency is {{ $value }}s, threshold 0.5s"

The rule is production-ready: it uses histogram_quantile correctly, excludes internal services with a label matcher, and sets an appropriate duration (5m).
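In an actual Prometheus rule file, that alert sits inside a group; a minimal sketch (the file and group names here are arbitrary):

```yaml
# latency-rules.yaml — hypothetical file name
groups:
  - name: latency
    rules:
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service!~"internal.*"}[5m])) > 0.5
        for: 5m
        annotations:
          summary: "High P99 latency for {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s, threshold 0.5s"
```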

Strengths:

  • Deep PromQL knowledge, including histograms and multi-condition logic
  • Generates recording rules and Alertmanager configs on request
  • Explains its threshold choices when asked

Weaknesses:

  • Highest per-rule API cost of the tools compared here
  • No awareness of your repo or dashboards; you must supply context

Cost per rule: $0.15 (API) for complex multi-condition rules, $0.03 for simple counters.

OpenAI GPT-4o

Price: $20/month (ChatGPT Plus) or $0.03/$0.15 per 1K input/output tokens (API)
Best for: Basic counter and gauge rules

GPT-4o handles straightforward rules. Effective for:

  • Counter thresholds (rate() over a window compared to a constant)
  • Gauge comparisons (memory, disk, queue depth)
  • up == 0 availability checks

Example prompt: “Write a Prometheus alert rule for when CPU usage exceeds 80%.”

GPT-4o produces:

alert: HighCPU
expr: node_cpu_seconds_total > 0.8
for: 5m

This is incomplete: node_cpu_seconds_total is a counter, not a CPU percentage. The expression should be 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80.
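Patched into a complete rule, the corrected expression looks like this (threshold and duration kept from the prompt; labels and annotations are a sketch):

```yaml
alert: HighCPU
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
  severity: warning
annotations:
  summary: "High CPU on {{ $labels.instance }}"
```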

Weaknesses:

  • Misses counter vs. gauge semantics, as shown above
  • Weak histogram support; rarely reaches for histogram_quantile()
  • Little cardinality awareness

Cost per rule: $0.004 per simple rule, $0.015 for complex ones.

GitHub Copilot

Price: $10/month or $100/year
Best for: Inline alert additions, repo-aware context

Copilot shines if your repo has existing alert rules.

Strengths:

  • Learns conventions from existing rule files in the repo
  • Fast inline completions while editing YAML
  • No marginal cost beyond the subscription

Weaknesses:

  • Shallow PromQL depth; no cardinality awareness
  • Needs existing rules to pattern-match against; weak in an empty repo

Best workflow: let Copilot draft the routine 80% of rules, then use Claude to validate the tricky 20%.

Cost per rule: $0 (already paid for).

Grafana AI (Grafana Cloud)

Price: $0 (included with Grafana Cloud, which starts at $1/day)
Best for: Rules based on existing Grafana dashboards

Grafana’s AI integration can generate rules from dashboard panels.

Workflow:

  1. Build a dashboard panel visualizing the metric
  2. Click “Create Alert from Dashboard”
  3. Grafana AI suggests rule based on panel query

Example: Panel shows rate(http_errors_total[5m]). Grafana suggests:

alert: HighErrorRate
expr: rate(http_errors_total[5m]) > 10
for: 5m
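Note that the suggested threshold is an absolute errors-per-second count, which depends on traffic volume. A ratio against total request rate is often more portable (a sketch; assumes an http_requests_total counter exists alongside http_errors_total):

```yaml
alert: HighErrorRatio
expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
```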

Strengths:

  • Rules are grounded in queries you already run on dashboards
  • Strong cardinality awareness and Alertmanager integration
  • No extra cost on Grafana Cloud

Weaknesses:

  • Only fair PromQL depth for rules beyond the panel query
  • Limited to metrics you have already visualized
  • Suggested thresholds are starting points, not tuned values

Cost per rule: $0 (included with Grafana Cloud).

Google Gemini (Advanced)

Price: $20/month
Best for: Learning PromQL, prototype rules

Gemini is comparable to GPT-4o for basics, weaker on Prometheus-specific patterns.

Weakness: Limited Prometheus training data compared to Claude. Generates syntactically correct but semantically questionable rules.

Cost per rule: roughly $0.004 for a simple rule.

Comparison Table

Tool         PromQL Depth   Histogram Support   Cardinality Awareness   Alertmanager Config   Cost/Rule   Best For
Claude       Excellent      Excellent           Good                    Good                  $0.15       Complex rules, p99 latencies
GPT-4o       Fair           Poor                Fair                    Fair                  $0.015      Simple thresholds
Copilot      Fair           Fair                Poor                    Poor                  $0          Inline completions
Grafana AI   Fair           Good                Excellent               Excellent             $0          Dashboard-based rules
Gemini       Fair           Fair                Fair                    Fair                  $0.004      Learning & prototypes

Practical Workflow

For production alert rules:

  1. Start with Claude Opus.
  2. Provide context:
    • Your metric name (e.g., http_request_duration_seconds)
    • What constitutes “bad” (e.g., p99 > 500ms)
    • Expected duration before alerting (e.g., 5m, 15m)
    • Severity level (critical, warning, info)
  3. Ask Claude:
    • “Is this cardinality-safe?” (won’t explode label combinations)
    • “How do I reduce false positives?”
    • “What’s the recording rule version?” (pre-compute expensive aggregations)
  4. Validate with promtool:

promtool check rules /path/to/rules.yaml

  5. Test the threshold with a one-off PromQL query in the Prometheus UI before enabling the alert.
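Beyond syntax checking, promtool can unit-test alerting rules against synthetic series (promtool test rules). A sketch of a test file for the Grafana-suggested HighErrorRate rule above (file name and labels are assumptions):

```yaml
# alerts_test.yaml — run with: promtool test rules alerts_test.yaml
rule_files:
  - rules.yaml            # assumed to contain the HighErrorRate rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # counter rising 1200/min ≈ 20/s, above the threshold of 10
      - series: 'http_errors_total{service="api"}'
        values: '0+1200x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
```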

For batch rule generation:

Use Claude’s prompt caching: put shared context (metric naming conventions, label schema, severity policy) in a cached prefix and vary only the per-rule request, so repeated context is billed at the reduced cache-read rate rather than full price on every rule.

Real-World Rule Examples

P99 Latency Alert

Prompt to Claude: “Write a rule that alerts when p99 latency exceeds 500ms for any service, but exclude cache services (they have higher latency by design).”

Claude generates:

alert: HighP99Latency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service!~"cache.*"}[5m])) > 0.5
for: 5m
labels:
  severity: warning
annotations:
  summary: "High p99 latency for {{ $labels.service }}"
  description: "p99 is {{ $value }}s, threshold 0.5s"
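Because histogram_quantile over raw buckets is expensive, the same quantile can be pre-computed with a recording rule and the alert pointed at the recorded series (a sketch; the record name follows the conventional level:metric:operations pattern):

```yaml
groups:
  - name: latency-recording
    rules:
      - record: service:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      - alert: HighP99Latency
        expr: service:http_request_duration_seconds:p99_5m{service!~"cache.*"} > 0.5
        for: 5m
        labels:
          severity: warning
```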

Error Rate Alert with Escalation

Prompt: “Alert when error rate (5xx responses / total requests) exceeds 5% for 2 minutes. If it stays above 5% for 10 minutes, escalate to critical.”

Claude produces two rules:

alert: HighErrorRateWarning
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
  severity: warning

alert: HighErrorRateCritical
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 10m
labels:
  severity: critical
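The warning/critical split only pays off if Alertmanager routes the two severities differently. A minimal routing sketch (receiver names are assumptions; the receiver bodies are left as placeholders):

```yaml
route:
  receiver: slack-warnings        # default for everything, including warnings
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall  # criticals page the on-call
receivers:
  - name: slack-warnings
    # slack_configs go here
  - name: pagerduty-oncall
    # pagerduty_configs go here
```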

Memory Pressure Alert with Headroom

Prompt: “Alert when available memory drops below 10% of system memory for 5 minutes. But only if the system isn’t in a natural shrink period (drop slower than 100MB/min).”

Claude generates:

alert: MemoryPressure
expr: |
  (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
  and on(instance)
  rate(node_memory_MemAvailable_bytes[5m]) * 60 > -100000000
for: 5m

The rule handles both conditions: available memory below 10% AND not shrinking faster than 100MB/min (rate() is per-second, hence the * 60).

Red Flags to Avoid

When reviewing AI-generated rules:

  1. Raw metric names instead of normalized counters: Alert uses node_cpu_seconds_total directly (a counter) instead of rate(node_cpu_seconds_total[5m]) (normalized to rate). Counters aren’t comparable without normalization.

  2. Missing aggregations: An expression uses http_request_duration_seconds_bucket directly, without histogram_quantile() or rate(). Raw histogram buckets are not usable for alerting.

  3. Cardinality explosion: Rule rate(http_requests_total[5m]) without grouping by service/endpoint. If your system has 10K unique endpoints, this creates 10K time series and degrades Prometheus performance.

  4. Hardcoded thresholds without context: Alert for “CPU > 50%” without knowing your baseline. Your app might run at 60% CPU normally.

  5. No exclusion for maintenance: Rule fires during deployments/maintenance windows. Should exclude based on job="maintenance" or similar labels.

  6. Unrealistic durations: Alert for: 0m (fires immediately) or for: 30m (too slow to respond). Balance between noise and detection speed.
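Red flag 3 is usually fixed with an explicit aggregation; compare (a sketch, threshold arbitrary):

```yaml
# Cardinality risk: one alert series per unique label combination
expr: rate(http_requests_total[5m]) > 100

# Safer: aggregate down to one series per service
expr: sum by (service) (rate(http_requests_total[5m])) > 100
```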

Decision Framework

  • Complex multi-condition or histogram-based rules → Claude
  • Simple counter/gauge thresholds → GPT-4o
  • Inline additions to an existing rules repo → Copilot
  • Rules derived from existing dashboards → Grafana AI
  • Learning PromQL and throwaway prototypes → Gemini

FAQ

Q: What’s the difference between alert rules and recording rules? A: Alert rules trigger notifications. Recording rules pre-compute expensive aggregations (e.g., histogram_quantile()) and store results, reducing query load. Use recording rules for queries used by multiple alerts.

Q: How do I avoid false positives? A: Three strategies: (1) Increase the for: duration (wait longer before alerting), (2) Use multi-condition rules (AND multiple metrics), (3) Exclude known noisy periods (deployments, batch jobs).

Q: Can AI tools generate Alertmanager routing configs? A: Claude can. It understands matchers, grouping, and escalation logic. Ask: “Generate an Alertmanager config that routes P99 latency alerts to #on-call and database alerts to #dba.”

Q: How do I test alert thresholds without firing them? A: Use Prometheus UI to query the rule expression, then adjust the threshold until it would/wouldn’t fire. Use for: 0m (temporarily) to test without waiting.

Q: What PromQL functions does Claude understand best? A: rate(), increase(), sum(), avg(), histogram_quantile(), absent(), topk(). For obscure functions (deriv, predict_linear), provide examples.


Built by theluckystrike — More at zovo.one