Remote Work Tools

Grafana dashboards in co-located teams are glanced at on a monitor on the wall. Remote teams need dashboards designed for async consumption: clear annotations, shareable panels, and automated summaries that land in Slack without anyone having to remember to look. This guide covers the setup that makes Grafana useful for distributed teams.

The difference between a useful remote dashboard and a useless one is not the metrics; it is the context. A panel showing “error rate: 0.3%” tells a co-located engineer something because they have been watching it all day. It tells a remote engineer nothing at 9am without a baseline, a threshold, and an indication of whether this is better or worse than yesterday.
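One lightweight way to enforce that convention is to generate panel titles and descriptions from a helper, so the threshold and baseline always travel with the number. A sketch; the helper names are illustrative:

```python
def panel_title(metric: str, window: str, alert_threshold: str) -> str:
    """Title that carries its own context: aggregation window + alert threshold."""
    return f"{metric} - {window} avg (alert at {alert_threshold})"


def panel_description(meaning: str, baseline: str, prev_week_p95: str) -> str:
    """Description with a baseline and a week-over-week reference point."""
    return f"{meaning}. Baseline: {baseline}. Previous week P95: {prev_week_p95}"
```

Used when generating dashboard JSON, this guarantees no panel ships with a bare metric name.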

Installation with Docker Compose

# docker-compose.grafana.yml
services:
  grafana:
    image: grafana/grafana-oss:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: ${SMTP_HOST}
      GF_SMTP_USER: ${SMTP_USER}
      GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: grafana@yourcompany.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus

volumes:
  grafana_data:
  prometheus_data:
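The compose file mounts ./prometheus/prometheus.yml without showing it. A minimal sketch that scrapes the API service the later queries assume; the target address and port are assumptions for your environment:

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: api-service          # matches the up{job="api-service"} queries used later
    static_configs:
      - targets: ["api:8000"]      # assumption: your API container exposes /metrics here
```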

For remote teams that do not want to self-host, Grafana Cloud offers a free tier (10,000 metrics, 50GB logs, 50GB traces, 14-day retention). The free tier covers most small distributed engineering teams without any infra overhead. For larger teams, the Pro tier at $8/user/month adds longer retention and SSO.

Dashboard Provisioning (Dashboard-as-Code)

Store dashboards in git. This prevents dashboard drift — where production dashboards diverge from what’s documented.

# grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: Default
    folder: Team Dashboards
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"

The allowUiUpdates: true setting lets engineers iterate on dashboards through the UI, but changes should be exported and committed back to git. Add a comment to the provisioning folder’s README: “If you change a dashboard in the UI, export the JSON and commit it — otherwise your changes will be overwritten on next deploy.”
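The export-and-commit step is easy to automate against Grafana's dashboard API. A stdlib-only sketch; the output path is illustrative, and stripping the instance-specific id and version fields keeps git diffs clean:

```python
# scripts/export_dashboard.py - pull a UI-edited dashboard back into git.
import json
import os
import urllib.request


def prepare_for_git(dashboard: dict) -> dict:
    """Strip instance-specific fields so the JSON diffs cleanly in git."""
    cleaned = dict(dashboard)
    cleaned["id"] = None           # internal id differs per Grafana instance
    cleaned.pop("version", None)   # version churns on every save
    return cleaned


def export_dashboard(uid: str, out_dir: str = "grafana/dashboards") -> str:
    base = os.environ.get("GRAFANA_URL", "http://localhost:3000")
    req = urllib.request.Request(
        f"{base}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_API_KEY']}"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    cleaned = prepare_for_git(payload["dashboard"])
    path = os.path.join(out_dir, f"{uid}.json")
    with open(path, "w") as f:
        json.dump(cleaned, f, indent=2, sort_keys=True)
    return path
```

Run it with the dashboard UID after making UI changes, then commit the resulting JSON.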

Team Dashboard Structure

For remote teams, organize dashboards by audience, not by metric type:

Folders:
├── Executive (SLA, uptime, error rates — simple, text-heavy)
├── Engineering (detailed metrics, per-service breakdown)
│   ├── Platform Overview (cross-service health at a glance)
│   ├── API Service
│   ├── Background Jobs
│   └── Databases
├── On-Call (optimized for incident response — large panels, clear thresholds)
└── Deploy (before/after comparison for deploys)

The Executive folder should have no more than five panels per dashboard, all stat panels with large text. Engineering dashboards can be dense; on-call dashboards must be scannable in five seconds when someone is woken at 3am.
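For the Executive folder, a stat panel sketch with large-text options. The options block follows the stat panel schema as I understand it; treat the exact field names as assumptions to verify against your Grafana version:

```json
{
  "title": "Uptime (30d) - SLA target 99.9%",
  "type": "stat",
  "description": "Rolling 30-day uptime across all public endpoints.",
  "options": {
    "textMode": "value",
    "colorMode": "background",
    "text": {"valueSize": 72}
  },
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "steps": [
          {"color": "red", "value": 0},
          {"color": "yellow", "value": 99},
          {"color": "green", "value": 99.9}
        ]
      }
    }
  }
}
```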

The Async-Friendly Dashboard Panel

Every panel in a remote team dashboard should answer the question “what is this telling me without context?” at first glance.

Good panel structure:

{
  "title": "API Error Rate — 5m avg (alert at >1%)",
  "description": "HTTP 5xx errors as % of total requests. Baseline: ~0.1%. Previous week P95: 0.3%",
  "type": "timeseries",
  "options": {
    "tooltip": {
      "mode": "multi"
    }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "percentunit",
      "thresholds": {
        "steps": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 0.005},
          {"color": "red", "value": 0.01}
        ]
      }
    }
  }
}

Key elements:

- The title carries the aggregation window and the alert threshold, so the number is interpretable without opening the alert rules.
- The description states what the metric means, its baseline, and a previous-week reference point: exactly the context a remote reader lacks at 9am.
- Color thresholds (green, yellow, red) match the alert levels, so a screenshot pasted into Slack still reads correctly.

What to avoid in remote dashboards:

- Panels titled with raw metric names and no unit, window, or threshold.
- Graphs with no baseline or threshold line, which force the reader to guess what normal looks like.
- Dense panels that only make sense to someone who has been watching them all day.

Deploy Annotations

Annotations mark deploys on every panel, making it obvious when a metric change correlates with a deploy:

# Post annotation via Grafana API after every deploy
post_deploy_annotation() {
  local VERSION=$1
  local DEPLOYER=$2

  curl -X POST \
    "http://grafana:3000/api/annotations" \
    -H "Authorization: Bearer $GRAFANA_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{
      \"time\": $(date +%s%3N),
      \"tags\": [\"deploy\", \"production\"],
      \"text\": \"Deploy: $VERSION by $DEPLOYER\"
    }"
}

# Add to your deploy script
post_deploy_annotation "$VERSION" "$GITHUB_ACTOR"

In your GitHub Actions deploy workflow:

- name: Post Grafana annotation
  run: |
    curl -X POST "https://grafana.internal/api/annotations" \
      -H "Authorization: Bearer ${{ secrets.GRAFANA_API_KEY }}" \
      -H "Content-Type: application/json" \
      -d "{
        \"time\": $(date +%s%3N),
        \"tags\": [\"deploy\", \"production\"],
        \"text\": \"Deploy: ${{ github.ref_name }} by ${{ github.actor }}\"
      }"

Beyond deploys, annotate other meaningful events: database migrations, config changes, traffic spikes from marketing campaigns. Each annotation gives future async readers a reference point when reviewing historical metrics.

Alerting Configuration for Remote Teams

Grafana Alerting (unified alerting, enabled by default since Grafana 9) requires a contact point and notification policy.

Contact point — Slack:

In Grafana UI: Alerting → Contact points → Add contact point → Slack

Configure with your webhook URL and a message template that includes a link to the relevant dashboard:

{{ define "slack.message" }}
*[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}*
{{ range .Alerts }}
• {{ .Annotations.summary }}
• <{{ .GeneratorURL }}|View in Grafana>
{{ end }}
{{ end }}

Notification policy:

Default policy:
  Contact point: #eng-alerts
  Group by: [alertname, cluster]
  Group wait: 30s
  Group interval: 5m
  Repeat interval: 4h

Nested policy (P1 severity):
  Matcher: severity = critical
  Contact point: #eng-incidents + PagerDuty
  Repeat interval: 30m

The repeat interval for critical alerts should be short enough that an on-call engineer is not waiting 4 hours for a reminder, but long enough that the channel does not flood during a sustained incident.
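These policies can also be provisioned from a file alongside the dashboards, so they survive a rebuild. A sketch using Grafana's alerting provisioning format; the receiver names are assumptions and must match your contact points:

```yaml
# grafana/provisioning/alerting/policies.yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: eng-alerts              # default contact point
    group_by: [alertname, cluster]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: eng-incidents       # critical alerts escalate
        object_matchers:
          - ["severity", "=", "critical"]
        repeat_interval: 30m
```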

Slack Digest: Daily Health Report

Instead of requiring engineers to check Grafana, send a daily digest to Slack:

# scripts/grafana-digest.py
import httpx
import os
from datetime import datetime

GRAFANA_URL = os.environ["GRAFANA_URL"]
GRAFANA_API_KEY = os.environ["GRAFANA_API_KEY"]
SLACK_WEBHOOK = os.environ["SLACK_DAILY_DIGEST_WEBHOOK"]

def query_prometheus(query: str) -> float:
    # Query Prometheus through Grafana's datasource proxy. The "1" is the
    # numeric datasource id; newer Grafana versions also support addressing
    # the datasource by uid (/api/datasources/proxy/uid/<uid>/...).
    resp = httpx.get(
        f"{GRAFANA_URL}/api/datasources/proxy/1/api/v1/query",
        headers={"Authorization": f"Bearer {GRAFANA_API_KEY}"},
        params={"query": query}
    )
    resp.raise_for_status()
    result = resp.json().get("data", {}).get("result", [])
    if result:
        return float(result[0]["value"][1])
    return 0.0

def send_daily_digest():
    # Fetch key metrics
    error_rate = query_prometheus(
        'sum(rate(http_requests_total{status=~"5.."}[24h])) / sum(rate(http_requests_total[24h]))'
    )
    p95_latency = query_prometheus(
        'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[24h])) by (le))'
    )
    uptime = query_prometheus(
        'avg_over_time(up{job="api-service"}[24h]) * 100'
    )

    # Determine emoji for each metric
    error_emoji = "green_circle" if error_rate < 0.005 else "yellow_circle" if error_rate < 0.01 else "red_circle"
    latency_emoji = "green_circle" if p95_latency < 0.3 else "yellow_circle" if p95_latency < 0.5 else "red_circle"
    uptime_emoji = "green_circle" if uptime > 99.9 else "yellow_circle" if uptime > 99 else "red_circle"

    message = {
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"Daily Health Report — {datetime.utcnow().strftime('%Y-%m-%d')}"}
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f":{error_emoji}: *Error Rate (24h):*\n{error_rate:.3%}"},
                    {"type": "mrkdwn", "text": f":{latency_emoji}: *p95 Latency (24h):*\n{p95_latency*1000:.0f}ms"},
                    {"type": "mrkdwn", "text": f":{uptime_emoji}: *Uptime (24h):*\n{uptime:.2f}%"},
                ]
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Open Dashboard"},
                        "url": f"{GRAFANA_URL}/d/platform-overview"
                    }
                ]
            }
        ]
    }

    httpx.post(SLACK_WEBHOOK, json=message)

if __name__ == "__main__":
    send_daily_digest()

Schedule it with cron (weekdays at 9am UTC):

0 9 * * 1-5 python /opt/scripts/grafana-digest.py

When discussing an anomaly async in Slack, link directly to the relevant time range:

# Generate a panel link for the last 4 hours
GRAFANA_URL="https://grafana.yourcompany.com"
DASHBOARD_UID="platform-overview"
PANEL_ID=5
FROM=$(date -d '4 hours ago' +%s%3N)  # 4 hours ago in ms
TO=$(date +%s%3N)  # now in ms

echo "${GRAFANA_URL}/d/${DASHBOARD_UID}?orgId=1&viewPanel=${PANEL_ID}&from=${FROM}&to=${TO}"

Add this to your incident response bot: when an alert fires, automatically include a pre-linked panel URL showing the 30 minutes around the alert time. This removes a meaningful friction point for remote engineers — instead of navigating to Grafana and manually adjusting the time range, they click a link and immediately see the relevant window.
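The bot's link construction is a few lines. A sketch; the dashboard UID and panel id are illustrative:

```python
# Build a panel URL centered on an alert time, +/- 15 minutes by default.
from urllib.parse import urlencode


def panel_link(base_url: str, dashboard_uid: str, panel_id: int,
               alert_time_ms: int, window_ms: int = 15 * 60 * 1000) -> str:
    """Link to a single panel with the time range pre-set around the alert."""
    params = urlencode({
        "orgId": 1,
        "viewPanel": panel_id,
        "from": alert_time_ms - window_ms,
        "to": alert_time_ms + window_ms,
    })
    return f"{base_url}/d/{dashboard_uid}?{params}"


# e.g. panel_link("https://grafana.yourcompany.com", "platform-overview", 5, alert_ms)
```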

Dashboard-as-Code with Grafonnet

For teams managing many dashboards, generate them with code:

// dashboards/api-overview.libsonnet
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.row;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'API Service Overview',
  tags=['api', 'production'],
  time_from='now-3h',
  refresh='30s',
)
.addRow(
  row.new(title='Error Rates')
  .addPanel(
    graphPanel.new(
      'HTTP Error Rate',
      datasource='Prometheus',
      format='percentunit',
    )
    .addTarget(prometheus.target(
      'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
      legendFormat='Error Rate'
    ))
  )
)

# Generate JSON from jsonnet (with grafonnet vendored under ./vendor)
jsonnet -J vendor dashboards/api-overview.libsonnet > grafana/dashboards/api-overview.json

# Add to CI to validate dashboards on every push
jsonnet-lint -J vendor dashboards/*.libsonnet

An alternative to Grafonnet is grizzly, a CLI tool that manages Grafana dashboards declaratively from YAML or JSON files and can apply them via CI. It is simpler than Grafonnet for teams that do not want to learn Jsonnet, and works well with Grafana Cloud.

Grafana vs. Alternatives for Remote Teams

| Tool | Strength | Weakness | Best for |
|------|----------|----------|----------|
| Grafana | Flexible, open source, massive plugin ecosystem | Complex to configure well | Teams with Prometheus/Loki already running |
| Datadog | Excellent APM, easy setup | $15-23/host/month | Teams willing to pay for convenience |
| New Relic | Strong distributed tracing | Per-user pricing adds up | APM-focused teams |
| Honeycomb | Best-in-class for distributed tracing and high-cardinality queries | Expensive at scale | Microservices-heavy teams |
| CloudWatch | Native for AWS workloads | Poor UX, vendor lock-in | AWS-only shops |

For most remote engineering teams self-hosting on Hetzner, DigitalOcean, or similar, Grafana with Prometheus and Loki is the right default. The tooling is mature, free, and integrates with everything.