Distributed tracing with OpenTelemetry, Jaeger and Prometheus

Why metrics alone aren't enough

Prometheus tells you that your p99 latency spiked at 14:32. It doesn't tell you why. In a microservices architecture with 20+ services, a slow database query in service D can manifest as elevated latency in service A — and your dashboards will happily show you the symptom while hiding the cause.

Distributed tracing links every request across service boundaries into a single trace. When latency spikes, you can find the culprit span in seconds instead of spending hours correlating logs.

The OpenTelemetry data model

OpenTelemetry (OTel) standardises three signal types:

Traces — a tree of spans representing a single request's journey
Metrics — aggregated numerical measurements over time
Logs — structured event records, correlatable by trace ID

Each span has: a trace ID (same across all spans in a request), a span ID (unique per span), a parent span ID, timestamps, status, and arbitrary key-value attributes. The W3C traceparent header propagates this context between services over HTTP.
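
To make the header format concrete, here is a minimal sketch that parses a version-00 traceparent value (version-traceid-parentid-flags) using only the Go standard library. The names TraceParent and ParseTraceParent are illustrative assumptions, not SDK types; in a real service the OTel propagator does this work for you.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a W3C traceparent header.
// Hypothetical type for illustration; not part of the OTel SDK.
type TraceParent struct {
	TraceID  string // 32 lowercase hex characters, shared by every span in the trace
	ParentID string // 16 lowercase hex characters, the caller's span ID
	Sampled  bool   // bit 0 of the trace-flags byte
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		if (c < '0' || c > '9') && (c < 'a' || c > 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent validates and splits a version-00 traceparent value.
func ParseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: want version-traceid-parentid-flags")
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if version != "00" {
		return TraceParent{}, errors.New("traceparent: this sketch only handles version 00")
	}
	// All-zero IDs are explicitly invalid.
	if !isLowerHex(traceID, 32) || strings.Trim(traceID, "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid trace ID")
	}
	if !isLowerHex(parentID, 16) || strings.Trim(parentID, "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid parent ID")
	}
	if !isLowerHex(flags, 2) {
		return TraceParent{}, errors.New("traceparent: invalid flags")
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{TraceID: traceID, ParentID: parentID, Sampled: f&1 == 1}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Printf("trace=%s parent=%s sampled=%t\n", tp.TraceID, tp.ParentID, tp.Sampled)
}
```

The receiving service reuses the trace ID, records the parent ID as its new span's parent, and honours the sampled flag, which is how spans from different processes end up in one tree.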

Deploying the OTel Collector

# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      # Add k8s metadata to every span
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.pod.name
            - k8s.node.name
            - k8s.deployment.name

    exporters:
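      # Recent Collector releases no longer ship the dedicated jaeger exporter;
      # Jaeger ingests OTLP natively, so send OTLP over gRPC instead.
      otlp/jaeger:
        endpoint: jaeger-collector.observability:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
        resource_to_telemetry_conversion:
          enabled: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/jaeger]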
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]

Instrumenting a Go service

// tracing/setup.go
package tracing

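import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracer configures the global tracer provider and propagator.
// The caller should defer tp.Shutdown(ctx) so buffered spans are flushed on exit.
func InitTracer(serviceName, version string) (*sdktrace.TracerProvider, error) {
    // The Operator exposes the Collector as a Service named "<name>-collector";
    // adjust the address if you deploy the Collector differently.
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector-collector.observability:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res := resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceName(serviceName),
        semconv.ServiceVersion(version),
    )

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        sdktrace.WithSampler(sdktrace.ParentBased(
            sdktrace.TraceIDRatioBased(0.1), // sample 10% of new traces
        )),
    )
    otel.SetTracerProvider(tp)
    // The default global propagator is a no-op; without this line the
    // traceparent header is neither read nor written.
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}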
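
The sampler configured above combines two rules: a span with a parent inherits the parent's decision, and a new root trace is kept with a fixed probability derived from its trace ID. The following is a simplified standard-library sketch of that logic. The function shouldSample is hypothetical, and the exact bit arithmetic is an assumption modeled on how the Go SDK's ratio sampler is understood to behave, so treat it as an illustration rather than the SDK's implementation.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample is a simplified model of ParentBased(TraceIDRatioBased(ratio)).
// Hypothetical helper for illustration only.
func shouldSample(traceID [16]byte, hasParent, parentSampled bool, ratio float64) bool {
	// Parent-based: a child span always follows the decision already made upstream.
	if hasParent {
		return parentSampled
	}
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Ratio-based: treat the low 8 bytes of the trace ID as a number in
	// [0, 2^63) and keep the trace if it falls below ratio * 2^63.
	// The decision depends only on the trace ID, so it is deterministic.
	upperBound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	raw, err := hex.DecodeString("4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil {
		panic(err)
	}
	var id [16]byte
	copy(id[:], raw)

	fmt.Println("new root trace kept:", shouldSample(id, false, false, 0.1))
	fmt.Println("child of sampled parent kept:", shouldSample(id, true, true, 0.1))
}
```

Because children follow their parent, a trace is either recorded end to end or not at all, which avoids traces with missing spans in the middle.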

Adding custom spans and attributes

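import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func (s *OrderService) ProcessOrder(ctx context.Context, orderID string) error {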
    tracer := otel.Tracer("order-service")
    ctx, span := tracer.Start(ctx, "ProcessOrder")
    defer span.End()

    // Add business-relevant attributes
    span.SetAttributes(
        attribute.String("order.id", orderID),
        attribute.String("order.region", s.region),
    )

    items, err := s.fetchItems(ctx, orderID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetAttributes(attribute.Int("order.item_count", len(items)))

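    // Record failures from the downstream call on this span as well.
    if err := s.submitToWarehouse(ctx, items); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }
    return nil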
}

Linking traces to Prometheus metrics with exemplars

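Exemplars are the bridge between metrics and traces. A Prometheus counter or histogram can carry an exemplar: a sample data point that includes a trace ID. Exemplars are only exposed in the OpenMetrics format, and Prometheus stores them only when started with --enable-feature=exemplar-storage. With both in place, clicking an exemplar on a latency spike in Grafana takes you directly to the Jaeger trace for that specific request.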

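// Record a histogram observation with an exemplar.
// HistogramVec.With returns a prometheus.Observer, which has no
// ObserveWithExemplar method, so assert to prometheus.ExemplarObserver first.
observer := histogram.With(prometheus.Labels{"route": "/api/orders"})
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
    latency.Seconds(),
    prometheus.Labels{
        "traceID": span.SpanContext().TraceID().String(),
    },
)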
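
It can help to see what an exemplar looks like on the wire. In the OpenMetrics text format it is appended to a sample line after " # ", as a small label set followed by the observed value. The sketch below renders such a line with the standard library only. The function exemplarLine and the metric name are hypothetical; the 128-character cap on exemplar labels comes from the OpenMetrics specification.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"unicode/utf8"
)

// maxExemplarRunes is the OpenMetrics limit on the combined length of
// exemplar label names and values.
const maxExemplarRunes = 128

// exemplarLine renders one sample line with a trace-ID exemplar attached.
// Hypothetical helper; a real client library does this during scraping.
func exemplarLine(sample string, count uint64, traceID string, value float64) (string, error) {
	const labelName = "traceID"
	if utf8.RuneCountInString(labelName)+utf8.RuneCountInString(traceID) > maxExemplarRunes {
		return "", errors.New("exemplar labels exceed 128 characters")
	}
	v := strconv.FormatFloat(value, 'g', -1, 64)
	return fmt.Sprintf("%s %d # {%s=%q} %s", sample, count, labelName, traceID, v), nil
}

func main() {
	line, err := exemplarLine(
		`http_request_duration_seconds_bucket{route="/api/orders",le="0.5"}`, // assumed metric name
		42,
		"4bf92f3577b34da6a3ce929d0e0e4736",
		0.43,
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```

A 32-character trace ID fits comfortably inside the label limit, which is one reason exemplars carry an ID rather than richer request metadata.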

Sampling strategy

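Avoid sampling 100% of traces in production: at 10k req/s that is 864 million traces per day, and each trace usually contains many spans. Use head-based sampling of 1-10% for normal traffic, and tail-based sampling to capture 100% of errored or slow traces. Tail-based sampling decides after a trace has completed, so it runs in the Collector and can only keep traces whose spans the SDK actually exported.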
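
The 864 million figure above is simply 10,000 requests per second multiplied by the 86,400 seconds in a day, so it counts requests; the number of spans depends on how many each request produces. A small sketch of the arithmetic follows. The figure of 10 spans per trace is an assumption for illustration, not a measurement.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60 // 86,400

// tracesPerDay returns how many requests (root traces) arrive per day.
func tracesPerDay(reqPerSec int64) int64 {
	return reqPerSec * secondsPerDay
}

// storedSpansPerDay estimates how many spans are kept per day at a given
// head-sampling percentage, assuming a fixed number of spans per trace.
func storedSpansPerDay(reqPerSec, spansPerTrace, samplePercent int64) int64 {
	return tracesPerDay(reqPerSec) * spansPerTrace * samplePercent / 100
}

func main() {
	const reqPerSec = 10_000
	const spansPerTrace = 10 // assumed average, varies widely by system

	fmt.Println("traces per day:", tracesPerDay(reqPerSec))
	fmt.Println("spans per day at 100%:", storedSpansPerDay(reqPerSec, spansPerTrace, 100))
	fmt.Println("spans per day at 10%:", storedSpansPerDay(reqPerSec, spansPerTrace, 10))
	fmt.Println("spans per day at 1%:", storedSpansPerDay(reqPerSec, spansPerTrace, 1))
}
```

Under these assumptions, 10% head sampling still stores as many spans per day as there are requests, which is why the sampling rate should be chosen against span volume rather than request volume.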
