Prometheus alerting rules require deep knowledge of PromQL syntax, time series semantics, and operational thresholds. Writing rules manually means wrestling with query syntax, threshold tuning, and escalation logic. AI assistants can accelerate this process significantly—but only tools trained on Prometheus monitoring patterns generate rules that survive production use.
Why AI for Prometheus Rules
Manual rule-writing involves:
- Learning PromQL syntax and aggregation functions (rate, increase, histogram_quantile)
- Understanding cardinality explosion and label selector optimization
- Tuning alert thresholds without generating false positives
- Designing escalation paths through Alertmanager routing
- Managing rule dependencies and cross-service correlations
Poor rules create alert fatigue (false positives) or miss real incidents (false negatives), both costly for incident response.
AI Tools Comparison
Claude (Opus 4.6, Haiku 4.5)
Price: $3/month (Claude.ai Pro) or $20 per 1M input tokens (API) Best for: Complex multi-condition rules, recording rules, escalation logic
Claude excels at PromQL-heavy rules. It understands:
- histogram_quantile() for latency percentiles
- by() and without() label aggregations
- rate() and increase() for counter derivatives
- Boolean algebra for multi-condition alerts
Example: You ask Claude to write a rule that alerts when p99 latency exceeds 500ms for 5 minutes, excluding internal services.
Claude produces:
alert: HighP99Latency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service!~"internal.*"}[5m])) > 0.5
for: 5m
annotations:
summary: "High P99 latency for {{ $labels.service }}"
description: "P99 latency is {{ $value }}s, threshold 0.5s"
The rule is production-ready: uses histogram_quantile correctly, excludes internal services with label matcher, has appropriate duration (5m).
Strengths:
- Understands label matching and cardinality implications
- Generates recording rules that pre-compute expensive aggregations
- Handles multi-step alert chains (alert A fires, wait 5m, then check B)
Weaknesses:
- Requires explicit mention of metric names (doesn’t auto-discover from Prometheus)
- Needs context on what constitutes “bad” for your service
Cost per rule: $0.15 (API) for complex multi-condition rules, $0.03 for simple counters.
OpenAI GPT-4o
Price: $20/month (ChatGPT Plus) or $0.03/$0.15 per 1K input/output tokens (API) Best for: Basic counter and gauge rules
GPT-4o handles straightforward rules. Effective for:
- Simple threshold-based alerts (CPU > 80%)
- Memory usage rules
- Disk space warnings
Example prompt: “Write a Prometheus alert rule for when CPU usage exceeds 80%.”
GPT-4o produces:
alert: HighCPU
expr: node_cpu_seconds_total > 0.8
for: 5m
This is incomplete: node_cpu_seconds_total is a counter, not CPU percentage. Should be 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100).
Weaknesses:
- Doesn’t understand counter vs gauge semantics
- Generates rules that compile but fire incorrectly
- Limited understanding of label aggregations
Cost per rule: $0.004 per simple rule, $0.015 for complex ones.
GitHub Copilot
Price: $10/month or $100/year Best for: Inline alert additions, repo-aware context
Copilot shines if your repo has existing alert rules.
Strengths:
- Learns your organization’s naming conventions (e.g.,
myservice_requests_total) - Suggests completions based on similar rules in your repo
- Integrates into your YAML editor
Weaknesses:
- Doesn’t validate PromQL syntax
- No multi-turn conversation to refine thresholds
- Can’t explain why a rule triggers
Best workflow: Use Copilot for 80% boilerplate, Claude for 20% validation.
Cost per rule: $0 (already paid for).
Grafana AI (Grafana Cloud)
Price: $0 (included with Grafana Cloud, starts at $1/day) Best for: Rules based on existing Grafana dashboards
Grafana’s AI integration can generate rules from dashboard panels.
Workflow:
- Build a dashboard panel visualizing the metric
- Click “Create Alert from Dashboard”
- Grafana AI suggests rule based on panel query
Example: Panel shows rate(http_errors_total[5m]). Grafana suggests:
alert: HighErrorRate
expr: rate(http_errors_total[5m]) > 10
for: 5m
Strengths:
- Extracts PromQL directly from your dashboards
- Understands your metric naming (auto-discovery)
- Real-time threshold suggestions
Weaknesses:
- Limited to simple threshold rules
- Can’t generate complex multi-metric correlations
- Locked into Grafana ecosystem
Cost per rule: $0 (included with Grafana Cloud).
Google Gemini (Advanced)
Price: $20/month Best for: Learning PromQL, prototype rules
Gemini is comparable to GPT-4o for basics, weaker on Prometheus-specific patterns.
Weakness: Limited Prometheus training data compared to Claude. Generates syntactically correct but semantically questionable rules.
Cost per rule: $0.004 per 1K tokens.
Comparison Table
| Tool | PromQL Depth | Histogram Support | Cardinality Awareness | Alertmanager Config | Cost/Rule | Best For |
|---|---|---|---|---|---|---|
| Claude | Excellent | Excellent | Good | Good | $0.15 | Complex rules, p99 latencies |
| GPT-4o | Fair | Poor | Fair | Fair | $0.015 | Simple thresholds |
| Copilot | Fair | Fair | Poor | Poor | $0 | Inline completions |
| Grafana AI | Fair | Good | Excellent | Excellent | $0 | Dashboard-based rules |
| Gemini | Fair | Fair | Fair | Fair | $0.004 | Learning & prototypes |
Practical Workflow
For production alert rules:
- Start with Claude Opus.
- Provide context:
- Your metric name (e.g.,
http_request_duration_seconds) - What constitutes “bad” (e.g., p99 > 500ms)
- Expected duration before alerting (e.g., 5m, 15m)
- Severity level (critical, warning, info)
- Your metric name (e.g.,
- Ask Claude:
- “Is this cardinality-safe?” (won’t explode label combinations)
- “How do I reduce false positives?”
- “What’s the recording rule version?” (pre-compute expensive aggregations)
- Validate with
promtool:
promtool check rules /path/to/rules.yaml
- Test threshold with one-off PromQL query in Prometheus UI before enabling alert.
For batch rule generation:
Use Claude’s Prompt Cache:
- Load your SLA targets in the system prompt
- Load your metric naming conventions
- Generate 20+ rules in one session
- Cost drops 50% after caching hits
Real-World Rule Examples
P99 Latency Alert
Prompt to Claude: “Write a rule that alerts when p99 latency exceeds 500ms for any service, but exclude cache services (they have higher latency by design).”
Claude generates:
alert: HighP99Latency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service!~"cache.*"}[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High p99 latency for {{ $labels.service }}"
description: "p99 is {{ $value }}s, threshold 0.5s"
Error Rate Alert with Escalation
Prompt: “Alert when error rate (5xx responses / total requests) exceeds 5% for 2 minutes. If it stays above 5% for 10 minutes, escalate to critical.”
Claude produces two rules:
alert: HighErrorRateWarning
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 2m
labels:
severity: warning
alert: HighErrorRateCritical
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 10m
labels:
severity: critical
Memory Pressure Alert with Headroom
Prompt: “Alert when available memory drops below 10% of system memory for 5 minutes. But only if the system isn’t in a natural shrink period (drop slower than 100MB/min).”
Claude generates:
alert: MemoryPressure
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
and on(instance)
rate(node_memory_MemAvailable_bytes[5m]) > -100_000_000
for: 5m
The rule handles both conditions: memory < 10% AND not shrinking too fast.
Red Flags to Avoid
When reviewing AI-generated rules:
-
Raw metric names instead of normalized counters: Alert uses
node_cpu_seconds_totaldirectly (a counter) instead ofrate(node_cpu_seconds_total[5m])(normalized to rate). Counters aren’t comparable without normalization. -
Missing label aggregations: Rule
http_request_duration_secondswithouthistogram_quantile()orrate(). These are raw histograms, not usable for alerting. -
Cardinality explosion: Rule
rate(http_requests_total[5m])without grouping by service/endpoint. If your system has 10K unique endpoints, this creates 10K time series and degrades Prometheus performance. -
Hardcoded thresholds without context: Alert for “CPU > 50%” without knowing your baseline. Your app might run at 60% CPU normally.
-
No exclusion for maintenance: Rule fires during deployments/maintenance windows. Should exclude based on
job="maintenance"or similar labels. -
Unrealistic durations: Alert
for: 0m(fires immediately) orfor: 30m(too slow to respond). Balance between noise and detection speed.
Decision Framework
- Complex multi-condition rules, p99 latencies: Claude Opus. Cost-justified.
- Simple threshold rules: GPT-4o or Grafana AI.
- Team collaboration, inline generation: Copilot + Claude.
- Dashboard-driven rules: Grafana AI.
FAQ
Q: What’s the difference between alert rules and recording rules?
A: Alert rules trigger notifications. Recording rules pre-compute expensive aggregations (e.g., histogram_quantile()) and store results, reducing query load. Use recording rules for queries used by multiple alerts.
Q: How do I avoid false positives?
A: Three strategies: (1) Increase the for: duration (wait longer before alerting), (2) Use multi-condition rules (AND multiple metrics), (3) Exclude known noisy periods (deployments, batch jobs).
Q: Can AI tools generate Alertmanager routing configs? A: Claude can. It understands matchers, grouping, and escalation logic. Ask: “Generate an Alertmanager config that routes P99 latency alerts to #on-call and database alerts to #dba.”
Q: How do I test alert thresholds without firing them?
A: Use Prometheus UI to query the rule expression, then adjust the threshold until it would/wouldn’t fire. Use for: 0m (temporarily) to test without waiting.
Q: What PromQL functions does Claude understand best? A: rate(), increase(), sum(), avg(), histogram_quantile(), absent(), topk(). For obscure functions (deriv, predict_linear), provide examples.
Related Articles
- PromQL Query Optimization Techniques
- Building Effective Alertmanager Routing
- Cardinality Management in Prometheus
- AI Tools for Infrastructure-as-Code
- Monitoring Distributed Systems
Built by theluckystrike — More at zovo.one