Why metrics alone aren't enough
Prometheus tells you that your p99 latency spiked at 14:32. It doesn't tell you why. In a microservices architecture with 20+ services, a slow database query in service D can manifest as elevated latency in service A — and your dashboards will happily show you the symptom while hiding the cause.
Distributed tracing links every request across service boundaries into a single trace. When latency spikes, you can find the culprit span in seconds instead of spending hours correlating logs.
The OpenTelemetry data model
OpenTelemetry (OTel) standardises three signal types:
- Traces — a tree of spans representing a single request's journey
- Metrics — aggregated numerical measurements over time
- Logs — structured event records, correlatable by trace ID
Each span has: a trace ID (same across all spans in a request), a span ID (unique per span), a parent span ID, timestamps, status, and arbitrary key-value attributes. The W3C traceparent header propagates this context between services over HTTP.
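To make the propagated context concrete, here is a minimal sketch of parsing a version-00 traceparent value, which has the form version-traceid-parentid-flags. The type TraceParent and the function ParseTraceParent are hypothetical names for illustration only. In a real service the OpenTelemetry propagator does this for you.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a version-00 W3C traceparent header.
// Hypothetical type for illustration; not part of the OpenTelemetry API.
type TraceParent struct {
	Version  string
	TraceID  string // 32 lowercase hex chars, shared by every span in the trace
	ParentID string // 16 lowercase hex chars, the calling span's ID
	Sampled  bool   // lowest bit of the trace-flags byte
}

// isLowerHex reports whether s is exactly n characters of 0-9 or a-f.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent validates and splits a version-00 traceparent value.
func ParseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: want 4 dash-separated fields")
	}
	if parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: this sketch only handles version 00")
	}
	// All-zero IDs are invalid according to the W3C Trace Context spec.
	if !isLowerHex(parts[1], 32) || parts[1] == strings.Repeat("0", 32) {
		return TraceParent{}, errors.New("traceparent: invalid trace ID")
	}
	if !isLowerHex(parts[2], 16) || parts[2] == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: invalid parent ID")
	}
	if !isLowerHex(parts[3], 2) {
		return TraceParent{}, errors.New("traceparent: invalid flags")
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&0x01 == 0x01,
	}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Printf("trace=%s parent=%s sampled=%t\n", tp.TraceID, tp.ParentID, tp.Sampled)
}
```

The receiving service starts its own span with the same trace ID and uses the incoming parent ID as that span's parent, which is how the tree of spans is stitched together.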
Deploying the OTel Collector
# otel-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
      # Add k8s metadata to every span
      k8sattributes:
        auth_type: serviceAccount
        passthrough: false
        extract:
          metadata:
            - k8s.namespace.name
            - k8s.pod.name
            - k8s.node.name
            - k8s.deployment.name
    exporters:
      # The dedicated jaeger exporter has been removed from recent Collector
      # releases. Jaeger accepts OTLP directly; port 4317 is assumed here.
      otlp/jaeger:
        endpoint: jaeger-collector.observability:4317
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
        resource_to_telemetry_conversion:
          enabled: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
Instrumenting a Go service
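// tracing/setup.go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// InitTracer configures the global tracer provider and W3C propagators.
// Call Shutdown on the returned provider at exit so buffered spans are flushed.
func InitTracer(serviceName, version string) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		// Adjust to the Collector Service name and namespace in your cluster.
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(), // plaintext gRPC inside the cluster
	)
	if err != nil {
		return nil, err
	}

	res := resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceName(serviceName),
		semconv.ServiceVersion(version),
	)

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(
			sdktrace.TraceIDRatioBased(0.1), // sample 10% of new traces
		)),
	)
	otel.SetTracerProvider(tp)

	// The default global propagator is a no-op, so register the W3C
	// propagators explicitly or traceparent headers are never read or written.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp, nil
}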
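The sampler combination above makes two decisions: a span with a parent inherits the parent's sampling decision, and a root span is sampled based on its trace ID alone. The following is a simplified model of that logic using only the standard library. The names ratioSample and parentBasedSample are hypothetical, and the SDK's real implementation may differ in detail.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ratioSample makes a deterministic decision from the trace ID alone, so the
// same trace ID always yields the same answer at a given ratio.
func ratioSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Treat the low 8 bytes as a 63-bit number and compare it to ratio * 2^63.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	bound := uint64(ratio * float64(uint64(1)<<63))
	return x < bound
}

// parentBasedSample follows the parent's decision when a parent exists and
// falls back to ratio sampling for root spans.
func parentBasedSample(hasParent, parentSampled bool, traceID [16]byte, ratio float64) bool {
	if hasParent {
		return parentSampled
	}
	return ratioSample(traceID, ratio)
}

func main() {
	var lowID, highID [16]byte // lowID stays all zero
	for i := 8; i < 16; i++ {
		highID[i] = 0xff
	}

	fmt.Println(parentBasedSample(false, false, lowID, 0.1))  // true: root span, ID below the bound
	fmt.Println(parentBasedSample(false, false, highID, 0.1)) // false: root span, ID above the bound
	fmt.Println(parentBasedSample(true, true, highID, 0.1))   // true: inherits the parent's decision
}
```

Because child spans follow their parent, a trace is either recorded end to end or not at all, which avoids traces with missing spans in the middle.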
Adding custom spans and attributes
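import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

func (s *OrderService) ProcessOrder(ctx context.Context, orderID string) error {
	tracer := otel.Tracer("order-service")
	ctx, span := tracer.Start(ctx, "ProcessOrder")
	defer span.End()

	// Add business-relevant attributes
	span.SetAttributes(
		attribute.String("order.id", orderID),
		attribute.String("order.region", s.region),
	)

	// Pass ctx down so child spans attach to this one
	items, err := s.fetchItems(ctx, orderID)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	span.SetAttributes(attribute.Int("order.item_count", len(items)))

	// Record a failure from the warehouse call on the span as well
	if err := s.submitToWarehouse(ctx, items); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}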
Linking traces to Prometheus metrics with exemplars
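Exemplars are the bridge between metrics and traces. A Prometheus counter or histogram can carry an exemplar: a sample data point that includes a trace ID. When you click an exemplar on a latency spike in Grafana, you jump directly to the Jaeger trace for that specific request. Exemplars are only exposed in the OpenMetrics format, and Prometheus must be started with --enable-feature=exemplar-storage to store them.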
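// Record a histogram observation with an exemplar.
// With() returns a prometheus.Observer, so assert to ExemplarObserver first.
observer := histogram.With(prometheus.Labels{"route": "/api/orders"})
observer.(prometheus.ExemplarObserver).ObserveWithExemplar(
	latency.Seconds(),
	prometheus.Labels{
		"traceID": span.SpanContext().TraceID().String(),
	},
)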
Never sample 100% in production: at 10k req/s, that's 864 million traces per day, and each trace typically contains multiple spans. Use head-based sampling of 1-10% for normal traffic, and tail-based sampling to capture 100% of errored or slow traces.