AI Tools Compared

AI tools can generate complex PromQL queries and Grafana panel configurations for SLO dashboards with burn rate calculations, reducing hours of manual query writing to minutes. By providing SLO targets, metric names, and time windows, AI generates queries that calculate short-term and long-term burn rates, error budgets, and remaining budget percentages. These tools handle multi-layer SLOs with proper aggregation and support templated queries for multi-service dashboards.

Understanding SLO Burn Rate Calculations

Before diving into AI-assisted creation, let’s establish what burn rate panels need. Burn rate represents how quickly your error budget is being consumed. A burn rate of 100% means you’re using your error budget at the expected rate. Values above 100% indicate faster consumption (warning sign), while values below 100% mean you’re under-consuming your budget (potentially too conservative with SLO targets).

The basic burn rate formula compares error rates over two windows:

sum(rate(http_requests_total{status=~"5.."}[{{window}}]))
/
sum(rate(http_requests_total[{{window}}]))

For a 7-day SLO with 30-day rolling window, you typically calculate both short-term (1-hour) and long-term (7-day) burn rates to detect both immediate issues and sustained problems.

AI Tools for Generating SLO Queries

Modern AI coding assistants excel at generating PromQL queries when given clear specifications. To get accurate SLO burn rate queries, provide the AI tool with:

A well-structured prompt might look like this:

“Generate PromQL queries for a Grafana SLO dashboard tracking API availability. The SLO is 99.9% over 7 days. Error metric is api_errors_total with label error_type. Total requests are api_requests_total with labels method and endpoint. Create queries for: current error rate, burn rate (1-hour window), burn rate (7-day window), and error budget remaining percentage.”

Example AI-Generated Burn Rate Query

-- 1-hour burn rate (short window)
sum(rate(api_errors_total[1h])) / sum(rate(api_requests_total[1h])) * 100

-- 7-day burn rate (long window)
sum(rate(api_errors_total[7d])) / sum(rate(api_requests_total[7d])) * 100

-- Error budget remaining (for 99.9% SLO)
(1 - (sum(rate(api_errors_total[7d])) / sum(rate(api_requests_total[7d]))))
* 1000
- (1 - 0.999) * 1000

The error budget remaining calculation requires careful attention. For a 99.9% SLO (0.1% error budget), you calculate how much of that budget has been consumed and what’s left.

Creating Multi-Panel SLO Dashboards

Beyond single queries, AI tools can generate complete dashboard JSON with multiple coordinated panels. A SLO dashboard typically includes:

  1. Current Status Panel - Real-time error rate with threshold visualization

  2. Short-term Burn Rate - 1-hour rolling burn rate to detect immediate issues

  3. Long-term Burn Rate - 7-day rolling burn rate for sustained problems

  4. Error Budget Timeline - Remaining budget over time

  5. Error Budget Consumption - Percentage consumed with projections

AI tools can generate the complete panel configurations when you describe the layout and data requirements:

{
  "panels": [
    {
      "title": "API Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(api_errors_total[5m])) / sum(rate(api_requests_total[5m])) * 100",
          "legendFormat": "Error Rate %"
        }
      ],
      "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
    },
    {
      "title": "Burn Rate (1h window)",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(rate(api_errors_total[1h])) / sum(rate(api_requests_total[1h])) / 0.001",
          "legendFormat": "Burn Rate"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 100, "color": "yellow"},
              {"value": 150, "color": "red"}
            ]
          }
        }
      }
    }
  ]
}

Handling Multi-Layer SLOs

Many organizations track SLOs at multiple granularity levels—per service, per endpoint, or per customer tier. AI tools can generate queries that aggregate appropriately across these dimensions.

For example, when creating an SLO dashboard that tracks burn rates across multiple services:

-- Burn rate by service with group concatenation
sum by (service) (rate(api_errors_total[1h]))
/
sum by (service) (rate(api_requests_total[1h]))

AI assistance proves particularly valuable when you need to create templated dashboards that work across multiple services. Provide the AI with your label names and it can generate reusable queries with variable substitutions:

-- Using Grafana template variables
sum(rate(${error_metric}[$__rate_interval]))
/
sum(rate(${total_requests}[$__rate_interval]))
* 100

Practical Workflow for SLO Dashboard Creation

Start with a specification document describing your SLO structure. Include the error metrics, total request metrics, SLO percentages, and time windows. Then use AI to generate initial queries and panel layouts.

After receiving AI-generated output, verify each query against your actual metric names and label keys. Metrics rarely match documentation exactly, so expect some adjustment. The AI provides a strong starting point but validation against your Prometheus data is essential.

Test burn rate calculations during both normal and incident conditions. A good burn rate panel should clearly distinguish between healthy (below 100%), warning (100-150%), and critical (above 150%) states. Adjust thresholds based on your organization’s risk tolerance and error budget policies.

Best AI Tools for SLO Query Generation

Claude 3.5 Sonnet and ChatGPT-4 both excel at generating PromQL queries. Claude tends to produce more nuanced explanations of the mathematical reasoning behind burn rate calculations. ChatGPT-4 provides faster, more direct query generation with fewer clarifications needed. For SLO-specific workflows, provide Claude with your SLO percentile targets and it will calculate error budget implications automatically.

When prompting any AI tool for SLO queries, include:

Example prompt: “Generate PromQL for a 99.9% SLO measured over 7-day rolling windows. I track errors with http_requests_total{status=~"5.."} and total requests with http_requests_total. Create queries for 1-hour and 7-day burn rates, and calculate percentage of error budget consumed.”

Advanced Burn Rate Query Patterns

For organizations tracking SLOs across multiple services, use grouping with conditional aggregation:

-- Burn rate by service, showing only services exceeding threshold
sum by (service) (
  rate(http_errors_total[1h]) / rate(http_requests_total[1h])
) as burn_rate
| burn_rate > 0.001  -- Alerts when exceeding 1 per 1000 requests

For composite SLOs combining latency and error metrics:

-- Combined SLO: 99% of requests complete within 500ms AND error rate below 0.1%
(
  histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) < 0.5
  and
  rate(http_errors_total[5m]) / rate(http_requests_total[5m]) < 0.001
)

Dashboard JSON Generation Tips

AI tools can generate complete Grafana JSON, but structure matters. Ask the AI to generate dashboard JSON in sections: panels first, then variables, then alerting rules. This modular approach makes it easier to verify each section.

{
  "dashboard": {
    "title": "SLO Burn Rate Dashboard",
    "panels": [
      {
        "title": "Error Budget Remaining (7-day)",
        "type": "stat",
        "targets": [{
          "expr": "(1 - sum(rate(http_errors_total[7d])) / sum(rate(http_requests_total[7d]))) * 100"
        }],
        "thresholds": {"steps": [{"value": 0, "color": "red"}, {"value": 10, "color": "yellow"}, {"value": 20, "color": "green"}]}
      }
    ],
    "templating": {
      "list": [
        {"name": "service", "type": "query", "datasource": "Prometheus"},
        {"name": "environment", "type": "query", "datasource": "Prometheus"}
      ]
    }
  }
}

Validating Queries Against Live Prometheus Data

After AI generates queries, test them immediately against your Prometheus instance:

# Query Prometheus directly via HTTP to validate syntax
curl 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) * 100' \
  -s | jq '.data.result'

# Test range queries for burn rate calculations
curl 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_errors_total[1h]))' \
  --data-urlencode 'start=1678800000' \
  --data-urlencode 'end=1678886400' \
  --data-urlencode 'step=300' \
  -s | jq '.data.result'

Common Mistakes AI Tools Sometimes Make

  1. Forgetting label cardinality: AI might suggest queries that explode cardinality. Ask it to use sum() aggregation to reduce dimensionality.

  2. Incorrect error budget math: Verify the SLO percentage conversion. For 99.9%, error budget is 0.1% or 0.001. AI occasionally flips this.

  3. Window mismatches: AI may generate queries with different time windows for burn rate calculation. Explicitly confirm that short-window and long-window burn rates use appropriate intervals (1h and 7d).

  4. Missing _total suffixes: When mentioning metric names, always specify whether they include _total suffix, as AI sometimes omits or duplicates it.

Integration with Alerting Rules

AI tools can also generate alert rules to notify teams when burn rate exceeds safe thresholds. Request alert rules in Prometheus format:

groups:
  - name: slo_alerts
    rules:
      - alert: SLOBurnRateExceeded
        expr: |
          sum(rate(http_errors_total[1h]))
          /
          sum(rate(http_requests_total[1h]))
          > 0.01  # 1% burn rate
        for: 5m
        annotations:
          summary: "SLO burn rate exceeding 1% for {{ $labels.service }}"

When working with AI to generate alert rules, specify:

Built by theluckystrike — More at zovo.one