Setting Up Jaeger for Distributed Tracing

Distributed tracing shows you where time goes across service boundaries. Jaeger collects OpenTelemetry spans and lets your team trace a request from API gateway through microservices to database. This guide deploys Jaeger all-in-one for development and a production-ready setup with Elasticsearch for persistence.

Development: Jaeger All-in-One
Production: Docker Compose with Elasticsearch
Instrumenting a Python Service
Instrumenting a Node.js Service
Manual Span Creation
Jaeger Query API for Automation
Grafana Integration
Trace Sampling Configuration
Trace Retention and Index Lifecycle Management
Adding Context Propagation Across Queues
Kubernetes Deployment with Jaeger Operator
Alerting on Trace Anomalies
Correlating Traces with Logs

Development: Jaeger All-in-One

# Quick start for development — all components in one container
docker run -d \
  --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \   # Jaeger UI
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 4317:4317 \     # OTLP gRPC
  -p 4318:4318 \     # OTLP HTTP
  --restart unless-stopped \
  jaegertracing/all-in-one:1.55

# Open Jaeger UI
open http://localhost:16686

Production: Docker Compose with Elasticsearch

# docker-compose.yml
version: "3.8"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.1
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: unless-stopped
    networks:
      - tracing

  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.55
    container_name: jaeger-collector
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - COLLECTOR_OTLP_ENABLED=true
      - LOG_LEVEL=info
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "14268:14268" # HTTP collector
      - "14250:14250" # gRPC collector
    depends_on:
      - elasticsearch
    restart: unless-stopped
    networks:
      - tracing

  jaeger-query:
    image: jaegertracing/jaeger-query:1.55
    container_name: jaeger-query
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - LOG_LEVEL=info
    ports:
      - "16686:16686"  # Jaeger UI
      - "16687:16687"  # Admin port
    depends_on:
      - elasticsearch
    restart: unless-stopped
    networks:
      - tracing

networks:
  tracing:
    driver: bridge

volumes:
  es_data:

docker compose up -d
# Wait for Elasticsearch to start (~30 seconds)
docker compose logs -f jaeger-collector

Instrumenting a Python Service

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  opentelemetry-instrumentation-httpx \
  opentelemetry-instrumentation-sqlalchemy

# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def configure_tracing(service_name: str, otlp_endpoint: str = "http://jaeger-collector:4317"):
    resource = Resource(attributes={
        SERVICE_NAME: service_name,
        "service.version": "1.0.0",
        "deployment.environment": "production",
    })

    provider = TracerProvider(resource=resource)

    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

    trace.set_tracer_provider(provider)

    # Auto-instrument frameworks
    FastAPIInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()

    return trace.get_tracer(service_name)

# main.py
from fastapi import FastAPI
from tracing import configure_tracing
import httpx

app = FastAPI()
tracer = configure_tracing("order-service")

@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    # Manual span for custom operations
    with tracer.start_as_current_span("fetch-order-details") as span:
        span.set_attribute("order.id", order_id)

        # This HTTP call will be auto-instrumented
        async with httpx.AsyncClient() as client:
            user = await client.get(f"http://user-service/users/{order_id}")

        span.set_attribute("user.id", user.json()["id"])
        return {"order_id": order_id, "user": user.json()}

Instrumenting a Node.js Service

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc

// tracing.js (must be required before other modules)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payment-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '2.1.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT || 'http://jaeger-collector:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

// package.json start script
{
  "scripts": {
    "start": "node -r ./tracing.js server.js"
  }
}

Manual Span Creation

# Python: add custom spans for business logic
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_payment(payment_id: str, amount: float):
    with tracer.start_as_current_span("process-payment") as span:
        span.set_attribute("payment.id", payment_id)
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "USD")

        try:
            result = charge_card(payment_id, amount)
            span.set_attribute("payment.status", "success")
            span.set_attribute("payment.transaction_id", result.transaction_id)
            return result
        except PaymentDeclinedException as e:
            span.set_attribute("payment.status", "declined")
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise

Jaeger Query API for Automation

# Find traces with errors
curl "http://localhost:16686/api/traces?service=order-service&tags=%7B%22error%22%3A%22true%22%7D&limit=20"

# Get trace by ID
curl "http://localhost:16686/api/traces/abcdef1234567890"

# List all services
curl "http://localhost:16686/api/services"

# Search slow traces (>1 second)
curl "http://localhost:16686/api/traces?service=order-service&minDuration=1000ms&limit=50"

Grafana Integration

Add Jaeger as a Grafana data source:

# grafana/provisioning/datasources/jaeger.yml
apiVersion: 1

datasources:
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query:16686
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        mapTagNamesEnabled: true
        mappedTags:
          - key: service.name
            value: service
      nodeGraph:
        enabled: true

In Grafana dashboards, add a trace panel:

Panel Type: Traces
Data Source: Jaeger
Query: { service="order-service" }

Trace Sampling Configuration

For high-traffic production services, sample selectively:

from opentelemetry.sdk.trace.sampling import (
    TraceIdRatioBased,
    ParentBased,
    ALWAYS_ON,
    ALWAYS_OFF,
)

# Sample 10% of traces in production
sampler = ParentBased(
    root=TraceIdRatioBased(0.1),
    remote_parent_sampled=ALWAYS_ON,    # Always sample if parent was sampled
    remote_parent_not_sampled=ALWAYS_OFF,
)

provider = TracerProvider(resource=resource, sampler=sampler)

Trace Retention and Index Lifecycle Management

Jaeger with Elasticsearch accumulates data quickly. A busy service generating 1,000 traces per minute fills tens of gigabytes per day. Configure an ILM policy in Elasticsearch to roll over and delete old trace indices automatically:

# Create ILM policy for Jaeger span indices
curl -X PUT "http://localhost:9200/_ilm/policy/jaeger-span-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {
              "max_age": "1d",
              "max_size": "10gb"
            }
          }
        },
        "warm": {
          "min_age": "3d",
          "actions": {
            "shrink": { "number_of_shards": 1 },
            "forcemerge": { "max_num_segments": 1 }
          }
        },
        "delete": {
          "min_age": "14d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }'

This keeps 14 days of traces and aggressively merges warm shards after 3 days to reduce heap pressure. Adjust min_age under delete based on your incident response SLA — most teams find 7-30 days sufficient.

Jaeger also ships a jaeger-es-index-cleaner utility for simpler retention without ILM:

# Delete indices older than 14 days
docker run --rm \
  -e ROLLOVER=true \
  jaegertracing/jaeger-es-index-cleaner:1.55 \
  14 http://elasticsearch:9200

Schedule this as a cron job or a daily Docker Compose service to keep storage bounded without configuring full ILM.

Adding Context Propagation Across Queues

Auto-instrumentation handles HTTP calls automatically, but message queues require explicit context propagation. Here is a pattern for RabbitMQ using the W3C TraceContext format:

# Producer: inject trace context into message headers
from opentelemetry import trace, propagate
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

def publish_order_event(channel, order_id: str, payload: dict):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("publish-order-event") as span:
        span.set_attribute("messaging.system", "rabbitmq")
        span.set_attribute("messaging.destination", "orders")
        span.set_attribute("order.id", order_id)

        headers = {}
        propagate.inject(headers)  # Injects traceparent + tracestate

        channel.basic_publish(
            exchange="orders",
            routing_key="order.created",
            body=json.dumps(payload),
            properties=pika.BasicProperties(headers=headers)
        )

# Consumer: extract trace context from message headers
def process_message(channel, method, properties, body):
    tracer = trace.get_tracer(__name__)
    ctx = propagate.extract(properties.headers or {})

    with tracer.start_as_current_span(
        "process-order-event",
        context=ctx,
        kind=trace.SpanKind.CONSUMER
    ) as span:
        span.set_attribute("messaging.system", "rabbitmq")
        order = json.loads(body)
        handle_order(order)

This ensures that a trace from the HTTP request that triggered the publish appears connected to the consumer span in Jaeger — giving you end-to-end visibility across the queue boundary.

Kubernetes Deployment with Jaeger Operator

For Kubernetes environments, the Jaeger Operator simplifies lifecycle management:

# Install cert-manager (prerequisite)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Install Jaeger Operator
kubectl create namespace observability
kubectl apply -n observability -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.55.0/jaeger-operator.yaml

# jaeger-production.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
  collector:
    replicas: 2
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  query:
    replicas: 1
    resources:
      limits:
        cpu: 250m
        memory: 256Mi
  ingress:
    enabled: true
    hosts:
      - tracing.example.com

The operator manages rolling updates and handles schema migration for Elasticsearch indices automatically, which is a significant operational advantage over managing the collector and query components separately.

Alerting on Trace Anomalies

Jaeger itself does not ship alerting, but you can build lightweight trace-based alerts by querying the Jaeger HTTP API from a scheduled script and routing results to PagerDuty or Slack. A practical pattern for remote teams:

#!/usr/bin/env python3
# scripts/jaeger-alert.py — run every 5 minutes via cron
import requests, json, os, sys
from datetime import datetime, timedelta

JAEGER_URL = os.getenv("JAEGER_URL", "http://localhost:16686")
SLACK_WEBHOOK = os.getenv("SLACK_WEBHOOK")
ERROR_THRESHOLD = int(os.getenv("ERROR_THRESHOLD", "10"))
SERVICES = ["order-service", "payment-service", "user-service"]

now = datetime.utcnow()
five_min_ago = now - timedelta(minutes=5)
start_us = int(five_min_ago.timestamp() * 1_000_000)
end_us   = int(now.timestamp() * 1_000_000)

alerts = []
for service in SERVICES:
    resp = requests.get(
        f"{JAEGER_URL}/api/traces",
        params={
            "service": service,
            "tags": '{"error":"true"}',
            "start": start_us,
            "end": end_us,
            "limit": 100,
        },
        timeout=10,
    )
    traces = resp.json().get("data", [])
    if len(traces) >= ERROR_THRESHOLD:
        alerts.append(f"*{service}*: {len(traces)} error traces in the last 5 minutes")

if alerts and SLACK_WEBHOOK:
    payload = {"text": "Jaeger trace alert:\n" + "\n".join(alerts)}
    requests.post(SLACK_WEBHOOK, json=payload, timeout=5)
    sys.exit(1)

print("No anomalies detected.")

Add this to a cron job on your monitoring host or run it as a Kubernetes CronJob. It keeps alerting logic simple and avoids the complexity of a full APM platform for teams that only need error rate signals from traces.

Correlating Traces with Logs

The highest-value Jaeger integration for most teams is log correlation: clicking a span in Jaeger and jumping directly to the logs that span generated, without copying trace IDs manually. This requires two things: your services must include the trace ID in log output, and Grafana must link the Jaeger trace ID to Loki.

Inject the trace ID into structured logs automatically using the OpenTelemetry logging bridge:

# logging_config.py
import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

def configure_logging(service_name: str):
    logger_provider = LoggerProvider()
    otlp_exporter = OTLPLogExporter(endpoint="http://jaeger-collector:4317", insecure=True)
    logger_provider.add_log_record_processor(BatchLogRecordProcessor(otlp_exporter))
    set_logger_provider(logger_provider)

    handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)

    # Add trace context fields to every log record
    logging.basicConfig(handlers=[handler], level=logging.INFO)
    return logging.getLogger(service_name)

With this in place, every logger.info(...) call automatically includes trace_id and span_id fields in the OTLP payload. Loki receives these via a Promtail pipeline, and Grafana’s trace-to-logs linking uses the trace_id field to jump between the two data sources with a single click.

Table of Contents