OpenTelemetry Distributed Tracing
Instrumentation, Collector, and backends.
In a microservices architecture, a single user request can span dozens of services. When something goes wrong, finding the root cause without distributed tracing is like debugging in the dark. OpenTelemetry provides vendor-neutral instrumentation for traces, metrics, and logs.
OpenTelemetry Architecture
- SDK — Instrumentation libraries for your application language
- Collector — Agent that receives, processes, and exports telemetry data
- Backend — Storage and visualization (Jaeger, Tempo, Zipkin, Datadog)
Auto-Instrumentation
OpenTelemetry provides auto-instrumentation for common frameworks:
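# Node.js
npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
node --require @opentelemetry/auto-instrumentations-node/register app.js
# Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentation libraries for the packages already in your environment
opentelemetry-bootstrap -a install
opentelemetry-instrument python app.py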
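# .NET
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
# Required by AddHttpClientInstrumentation() and AddSqlClientInstrumentation() below;
# depending on the released version, the SqlClient package may need --prerelease
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
.NET Configuration Example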
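// Program.cs: register tracing and metrics and export both over OTLP
using OpenTelemetry.Metrics;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // inbound HTTP requests
        .AddHttpClientInstrumentation()   // outbound HTTP calls
        .AddSqlClientInstrumentation()    // SQL queries
        .AddOtlpExporter(opts =>
        {
            // OTLP over gRPC to the Collector
            opts.Endpoint = new Uri("http://otel-collector:4317");
        }))
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        // No endpoint set here: the exporter falls back to OTEL_EXPORTER_OTLP_ENDPOINT,
        // or to http://localhost:4317 if that variable is unset
        .AddOtlpExporter());
Collector Configuration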
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1000
      memory_limiter:
        check_interval: 1s  # required by memory_limiter
        limit_mib: 512
    exporters:
      # The dedicated jaeger exporter has been removed from the Collector.
      # Jaeger accepts OTLP natively, so send traces with the otlp exporter.
      otlp/jaeger:
        endpoint: jaeger:4317
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
Trace Propagation
Ensure trace context is propagated across service boundaries:
- W3C TraceContext — Standard propagation format (traceparent header)
- B3 — Zipkin-compatible format
- Auto-instrumentation propagates context over HTTP for you
- For message queues, carry the context in message metadata: message attributes for SQS, record headers for Kafka
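To make the traceparent format concrete, here is a minimal Python sketch that builds, validates, and forwards the header using only the standard library. It is an illustration of the header layout, not the OpenTelemetry SDK: the function names (new_traceparent, parse_traceparent, child_traceparent, inject_into_message) are hypothetical, and it only handles version 00 headers.

```python
import re
import secrets

# traceparent (version 00): version-traceid-parentid-flags
#   2 hex  -  32 hex  -  16 hex  -  2 hex, all lowercase
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a new trace with a random trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # lowest bit is the "sampled" flag
    return trace_id, parent_id, sampled


def child_traceparent(incoming):
    """Continue the caller's trace: same trace ID and sampled flag, new span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # no usable context, so start a fresh trace
    trace_id, _parent_id, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"


def inject_into_message(attributes, traceparent):
    """Copy the context into queue message metadata (hypothetical helper)."""
    out = dict(attributes)
    out["traceparent"] = traceparent
    return out


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
message_attributes = inject_into_message({"order_id": "42"}, outgoing)
```

The consumer reads traceparent back out of the message metadata and calls the equivalent of child_traceparent before starting its own span. In a real service the SDK's propagators do this work; the sketch only shows what travels on the wire.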
Sampling Strategies
In high-traffic systems, tracing every request is expensive. Use sampling:
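# Both processors ship in the Collector contrib distribution. Treat them as alternatives:
# running probabilistic_sampler ahead of tail_sampling would drop most error traces first.
processors:
  # Option 1: keep a fixed share of all traces
  probabilistic_sampler:
    sampling_percentage: 10 # Sample 10% of traces
  # Option 2: decide once the whole trace has been seen
  tail_sampling:
    policies:
      - name: error-policy
        type: status_code
        status_code: {status_codes: [ERROR]} # Always sample errors
      - name: slow-policy
        type: latency
        latency: {threshold_ms: 2000} # Always sample slow requests
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
What to Trace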
- HTTP requests — Inbound and outbound (auto-instrumented)
- Database queries — SQL queries with timing (auto-instrumented)
- Cache operations — Redis/Memcached hits and misses
- Message queue operations — Publish and consume with context propagation
- External API calls — Third-party service latency and errors
Eazy SaaS Tip: We implement OpenTelemetry with tail sampling as standard for our microservices clients. This captures 100% of errors and slow requests while sampling 5% of normal traffic. When errors and slow requests are rare, that keeps full debugging capability at roughly 1/20th of the storage cost.
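The decision logic behind those three tail-sampling policies can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Collector's implementation: TraceSummary, keep_trace, and expected_kept_fraction are hypothetical names, a trace is kept if any policy matches, and error and slow traces are assumed not to overlap.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Minimal view of a completed trace (hypothetical type for this sketch)."""
    has_error: bool
    duration_ms: float


def keep_trace(trace, latency_threshold_ms=2000, baseline_percentage=5.0, rng=random.random):
    """Mirror the three policies: errors, slow requests, then a probabilistic baseline."""
    if trace.has_error:
        return True  # error-policy: always keep
    if trace.duration_ms > latency_threshold_ms:
        return True  # slow-policy: always keep
    # default policy: keep roughly baseline_percentage of everything else
    return rng() * 100 < baseline_percentage


def expected_kept_fraction(error_rate, slow_rate, baseline_percentage=5.0):
    """Expected share of traces stored, assuming error and slow traces do not overlap."""
    always_kept = error_rate + slow_rate
    return always_kept + (1 - always_kept) * baseline_percentage / 100


# With no errors or slow requests, only the 5% baseline is stored.
healthy = expected_kept_fraction(error_rate=0.0, slow_rate=0.0)
# With 1% errors and 1% slow requests, the stored share rises above the baseline.
degraded = expected_kept_fraction(error_rate=0.01, slow_rate=0.01)
```

Under these assumptions a fully healthy system stores 5% of traces, which is the 1/20th figure. With 1% errors and 1% slow requests the stored share is about 6.9%, so the saving shrinks as the proportion of interesting traces grows.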