DevOps / SRE Engineer Hub

Observability and Monitoring

In short

Three pillars (metrics, logs, traces) refactored through Charity Majors' observability-2.0 lens: high-cardinality structured events as the substrate, with Prometheus + Grafana, OpenTelemetry, and the USE/RED methods as the working toolkit for production systems.

Key takeaways

The three pillars and the observability-2.0 critique

Peter Bourgon's 2017 essay codified observability as three pillars: metrics, logs, and traces. The framing is useful as a vocabulary, but it is also the source of most observability dysfunction. Charity Majors at Honeycomb has spent years arguing that the pillars are an artifact of tooling history, not a description of the problem. Metrics are pre-aggregated numeric time series. Logs are unstructured strings. Traces are causal graphs of spans. Each pillar lives in a separate system, with its own query language, its own retention budget, and its own pricing model. When an incident hits, an engineer ping-pongs across three UIs trying to correlate by timestamp and request ID, and the answer is usually 'we cannot tell from here.'

Observability-2.0 reframes the substrate as wide, structured events: one event per unit of work (an HTTP request, a job, a batch step), with every dimension you might ever want to slice by attached as a key-value field. Metrics, logs, and traces become projections of that single event store. A timer becomes 'the distribution of duration_ms grouped by service.' A log line becomes 'the message field of an event.' A trace becomes 'all events sharing a trace_id, ordered by parent_span_id.' The thesis: if your events carry high-cardinality fields (user_id, build_sha, feature_flag, region, customer_tier, k8s pod, request payload size), you can ask questions you did not predict in advance. Pre-aggregated metrics force you to commit to your hypotheses at write time.

Monitoring asks 'is the known-bad thing happening?' Observability asks 'what is happening that I have not yet thought to ask about?' The two are complementary. Run Prometheus for known-unknowns (SLOs, capacity, alert routing) and a high-cardinality event store (Honeycomb, ClickHouse, or a self-hosted columnar database) for unknown-unknowns. The pillars are not wrong; they are a starting taxonomy. Treat them as a budget conversation, not an architecture diagram.

Cardinality is the dividing line between the two camps. Prometheus is explicitly hostile to high cardinality: every distinct label combination is a new in-memory series, and a few million active series is a practical ceiling for a single server. Event stores invert the trade: every event is independently indexed by every field, so adding a new dimension is a query-time decision rather than a write-time commitment. This is why log tools that grew up structured (Honeycomb, Lightstep, Datadog logs with attributes) feel different from log tools that grew up as grep-over-strings. Loki, Grafana's logs product, sits in the middle: it stores log content unindexed and indexes only a small set of labels, which keeps it cheap but means you cannot trivially slice by user_id without ingesting it as a label and exploding cardinality. Pick the tool that matches the cardinality of the question you actually need to answer in an outage.
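To make 'one wide event per unit of work' concrete, here is a minimal sketch of a canonical log line for an HTTP request, emitted as structured JSON. Every field name, and the `request`/`user`/`response` objects, are illustrative assumptions, not part of any particular framework; in real code the trace_id would come from the active trace context.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("events")

def emit_wide_event(request, user, response, started: float) -> None:
    # One wide, structured event per request: every dimension you might
    # ever want to slice by goes on as a field. Metrics, logs, and traces
    # are all projections of records like this one.
    event = {
        "timestamp": time.time(),
        "trace_id": uuid.uuid4().hex,  # illustrative; use the trace context in real code
        "service": "resume-api",
        "route": request.path,
        "status": response.status_code,
        "duration_ms": (time.time() - started) * 1000,
        # High-cardinality fields that pre-aggregated metrics cannot carry:
        "user_id": user.id,
        "customer_tier": user.tier,
        "build_sha": "abc123",
        "feature_flags": user.flags,
        "region": "us-east-1",
        "payload_bytes": len(request.body),
    }
    log.info(json.dumps(event))
```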

Prometheus and Grafana stack: PromQL, recording rules, federation

Prometheus is a pull-based time-series database with a built-in scrape scheduler, a multi-dimensional data model (metric name plus labels), and a query language (PromQL) designed for range vectors. The standard deployment: applications expose `/metrics` in the text exposition format, Prometheus scrapes every 15-30 seconds, and Grafana queries Prometheus over HTTP for dashboards. Alertmanager handles deduplication, grouping, silencing, and routing of fired alert rules.

PromQL has four expression types: instant vectors (one sample per series at a moment), range vectors (a window of samples), scalars, and strings. Most useful queries combine `rate()` over a range vector with an aggregation operator. Counters only go up and reset on restart, so you almost never query a counter directly; you query its rate. Histograms expose `_bucket`, `_sum`, and `_count` series, and you query percentiles via `histogram_quantile()`. Gauges (memory, queue depth) you query directly.

```promql
# Request rate per service over the last 5 minutes
sum by (service) (rate(http_requests_total[5m]))

# 95th percentile latency per route, last 5m
histogram_quantile(
  0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

# Error ratio per service (RED 'errors')
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# Availability burn over a 30d SLO window
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  /
  sum(increase(http_requests_total[30d]))
)
```

Recording rules pre-compute expensive expressions on a schedule and write the result back as a new time series. Anything that fans out across many series (a `histogram_quantile` across thousands of pods, a 30d burn rate) belongs in a recording rule; otherwise the dashboard blows up the server. Naming convention: `level:metric:operation`, e.g. `service:http_requests:rate5m`.

Federation is the answer when one Prometheus cannot hold everything. A global Prometheus scrapes only the recording-rule output of datacenter-local Prometheus servers via `/federate`. For long-term storage and true horizontal scale, use Thanos, Cortex, or Mimir; they implement the Prometheus query API on top of object storage and add downsampling.

Cardinality is the killer. Every unique combination of labels is a new series, and series count, not sample volume, is what blows up Prometheus. Never label by user_id, request_id, or anything unbounded; those belong in the event-store layer, not in metrics.

Grafana is the standard query and dashboarding layer over Prometheus, Loki, Tempo, and any other data source that speaks its plugin protocol. The opinionated workflow: dashboards are code (JSON in git, rendered through Grafonnet or Terraform), variables are first-class (every dashboard takes `service`, `env`, `cluster` as templated selectors), and panels link to one another so a user can click from a service-level overview down into a specific endpoint, then over to logs filtered to that trace. Alertmanager rules belong in the same repo as the recording rules; co-locating thresholds with the expressions that feed them is what keeps alerts honest. The classic mistake is to write alerts on raw metrics rather than on SLO burn rates; the former pages on every blip, the latter pages only when the error budget is genuinely at risk.
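On the producing side, here is a minimal sketch using the official `prometheus_client` Python library; the service name, routes, handler, and bucket boundaries are illustrative assumptions. Note the label discipline: `service`, `route`, and `status` are all bounded sets, so the series count stays fixed no matter how much traffic flows.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Labels are bounded sets only: service, route, status. Never user_id.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests served.",
    ["service", "route", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency.",
    ["service", "route"],
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle(route: str) -> str:
    # Time the work and count the request; .labels() selects the series.
    with LATENCY.labels("resume-api", route).time():
        body = do_work(route)  # do_work is a hypothetical handler
    REQUESTS.labels("resume-api", route, "200").inc()
    return body

start_http_server(8000)  # serves /metrics in the text exposition format
```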

Distributed tracing: OpenTelemetry, sampling, service maps

A distributed trace is a tree of spans bound together by a shared trace_id. Each span has a name, start time, duration, parent span, and a bag of attributes. Spans propagate across process boundaries through trace context headers (W3C `traceparent` and `tracestate` are the standard). Without propagation you have local profiling, not distributed tracing.

OpenTelemetry (OTel) is the CNCF successor to OpenTracing and OpenCensus. It is the right abstraction in 2026: a vendor-neutral API and SDK plus a Collector that receives, processes, and exports to Jaeger, Tempo, Honeycomb, Datadog, or any OTLP-compatible backend. Pick OTel and pick your backend separately; you can swap the backend without re-instrumenting.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Wire the SDK: resource identity, batching exporter, OTLP to the Collector.
resource = Resource.create({"service.name": "resume-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(provider)

# Auto-instrumentation first; `app` is the existing FastAPI instance.
FastAPIInstrumentor.instrument_app(app)
RequestsInstrumentor().instrument()

# Manual span at a business boundary, with high-signal attributes.
# `user`, `template_id`, and `render` come from the surrounding handler.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("render_resume") as span:
    span.set_attribute("user.tier", user.tier)
    span.set_attribute("template.id", template_id)
    result = render(template_id)
    span.set_attribute("result.bytes", len(result))
```

Sampling is the brutal trade-off. At full traffic you cannot store every span, but uniform random sampling discards exactly the rare interesting traces you wanted to find. Three strategies: head sampling decides at ingress (cheap, dumb); tail sampling decides at the Collector after the trace completes (you can keep all errors and slow traces); dynamic/probabilistic sampling weights by attribute (1.0 for errors, 0.01 for healthchecks). Tail sampling is the right default for production.

Backends: Jaeger is the reference open-source UI; Grafana Tempo is object-storage-backed and integrates with Loki and Prometheus through exemplars; Honeycomb is the canonical observability-2.0 product and treats traces as queries over a wide-event store. Service maps are auto-derived from span parent/child links; do not hand-maintain them. Use exemplars to jump from a Prometheus latency spike directly to a trace exhibiting that spike; it is the single highest-leverage integration in the stack.

Instrumentation strategy: get OTel auto-instrumentation in place first (FastAPI, requests, SQLAlchemy, redis, kafka), then add manual spans only at meaningful business boundaries (`render_resume`, `charge_card`, `enqueue_email`). Resist the urge to wrap every function; spans cost CPU on the hot path and noise in the UI. Set high-signal attributes on those manual spans: customer tier, feature flag values, build SHA, the size of the work unit. Those attributes are what let you ask 'why is p99 bad on Tuesday for paid customers on build abc123' without writing a custom dashboard. The OTel Collector is also where you do PII scrubbing, attribute renaming, and protocol translation; do not bake any of that into the application SDK.
Structured logs (Loki, Elastic) join the trace through trace_id and span_id fields injected by the OTel logging instrumentation, so a single click in Grafana takes you from metric exemplar to trace span to the log line emitted inside that span.
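Head sampling is configured where the TracerProvider is built; here is a minimal sketch using the OTel Python SDK's built-in samplers (the 1% ratio is an illustrative choice). Tail sampling, by contrast, lives in the Collector pipeline (the contrib `tail_sampling` processor), which is what lets it keep every trace containing an error span while sampling the healthy rest.

```python
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: the root span keeps 1% of traces by trace_id hash;
# ParentBased makes every child span follow the root's decision, so a
# trace is always either fully kept or fully dropped.
provider = TracerProvider(
    resource=Resource.create({"service.name": "resume-api"}),
    sampler=ParentBased(root=TraceIdRatioBased(0.01)),
)
```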

USE method (Gregg) vs RED method (Wilkie): when each wins

Brendan Gregg's USE method (2012) is for resources: for every resource, track Utilization, Saturation, and Errors. Resources are CPU, memory, disk, network interfaces, file descriptors, kernel locks, GPU, anything finite that work contends for. Utilization is the percentage of time the resource was busy. Saturation is the depth of the queue of work waiting on it (run-queue length, swap pressure, nf_conntrack table fullness). Errors are device errors, dropped packets, ECC events. USE is a checklist for capacity and infrastructure: walk every resource, produce a number for each of the three, find the one that is red. It wins for hardware, kernel, and platform-layer questions. 'Why is the node slow?' is a USE question.

Tom Wilkie's RED method (2015, while at Weaveworks) is for services: for every service, track Rate, Errors, and Duration. Rate is requests per second. Errors is the count or fraction of failed requests. Duration is the latency distribution (always percentiles, never means). RED is the right vocabulary for HTTP services, gRPC handlers, queue consumers, anything that processes a stream of work units. It maps cleanly onto the four golden signals from the Google SRE book (latency, traffic, errors, saturation), with saturation moved to the USE side. 'Why are users seeing 500s?' is a RED question.

The two are complementary, not competing. A complete dashboard has a RED panel at the top (the user-visible truth: rate, errors, latency p50/p95/p99) and USE panels below it (the infrastructure substrate: CPU, memory, disk IO, network). When RED goes red, you check USE to find which resource is the bottleneck. When USE goes red but RED is green, you have headroom and time to fix it before users notice. When RED is red and USE is green, the bug is in your code or a downstream dependency, not in capacity.

Where each fails: USE on serverless or fully managed services is mostly invisible; you do not own the resources, so you cannot measure their utilization. RED on fan-out batch jobs is awkward; 'rate' is ambiguous when one user request triggers ten thousand sub-tasks. For those cases, fall back to wide-event observability and ask the question directly: group by job_id, sum durations, find the long tail. USE and RED are guidance, not gospel; the underlying discipline is still 'measure what the user experiences, then measure the substrate that produces it.'

A common operational pattern: start every new service with a RED dashboard generated automatically from a Grafana template that takes `service_name` as a variable, enforced by infra so no service ships without one. Add USE panels per node pool, per database, per Redis cluster, owned by the platform team and shared across all consumers. Layer SLOs (availability and latency) over the RED metrics with multi-window multi-burn-rate alerts as described in the Google SRE workbook: a fast-burn alert fires when the one-hour burn rate is high enough to exhaust the 30-day error budget in about two days (14.4x the sustainable rate), a slow-burn alert when the six-hour rate would exhaust it in about five days (6x). This combination keeps pager load low while still catching real degradation, and it gives incident commanders a budget-grounded answer to 'should we roll back?' instead of a vibes-based one. USE and RED give you the metric vocabulary; SLOs give you the decision rule; observability-2.0 wide events give you the debugger when neither vocabulary matches the shape of the actual incident.
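To make the burn-rate arithmetic concrete, here is a small Python sketch of the multi-window decision rule using the workbook's suggested thresholds. The function names, and the assumption that you already have error ratios per window in hand, are illustrative; in production this logic lives in Prometheus alert rules, not application code.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent relative to plan.
    A burn rate of 1.0 spends exactly the 30-day budget in 30 days."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

def should_page(err_5m: float, err_30m: float, err_1h: float,
                err_6h: float, slo_target: float = 0.999) -> bool:
    # Fast burn: 14.4x over both the 1h and 5m windows exhausts the
    # 30-day budget in ~2 days; the short window confirms the problem
    # is still happening, so old spikes do not keep paging.
    fast = (burn_rate(err_1h, slo_target) > 14.4
            and burn_rate(err_5m, slo_target) > 14.4)
    # Slow burn: 6x over 6h and 30m exhausts the budget in ~5 days.
    slow = (burn_rate(err_6h, slo_target) > 6.0
            and burn_rate(err_30m, slo_target) > 6.0)
    return fast or slow
```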


Sources

  1. Prometheus documentation
  2. OpenTelemetry documentation
  3. The USE Method
  4. Observability 101: Terminology and Concepts
  5. Observability — A 3-Year Retrospective
  6. Grafana documentation

About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.