Performance and Observability for Backend Engineers (2026)
In short
Backend performance in 2026 is observability-first. Senior engineers reach for the USE method (Utilization, Saturation, Errors) for resources and the RED method (Rate, Errors, Duration) for services, then drop into PromQL histograms for p99 latency and OpenTelemetry traces to find the slow span. The bar: you can read a flamegraph, write a histogram_quantile query, instrument a FastAPI service with OTel, reason about tail-latency amplification across fan-out RPCs, and follow a tracing waterfall down to the EXPLAIN plan that pinpoints the slow query.
Key takeaways
- Brendan Gregg's USE method (Utilization, Saturation, Errors) is the canonical checklist for resources — CPUs, memory, disks, NICs. For every resource, check all three. The reference is brendangregg.com/usemethod.html.
- Tom Wilkie's RED method (Rate, Errors, Duration) is the canonical checklist for services — every request-driven endpoint should have a Rate (req/s), an Error rate, and a Duration histogram. USE is for what you run on; RED is for what you serve.
- p99 latency is the senior bar, not p50. Tail-latency amplification means a service that calls 10 dependencies in parallel waits for the slowest of them, so its p99 is governed by each dependency's p99.9, not its p99. Google's SRE Book chapter on Monitoring Distributed Systems is the canonical reference.
- PromQL is the lingua franca of metrics in 2026. The two queries every backend engineer should know cold: rate(...[5m]) for per-second event rates, and histogram_quantile(0.99, sum by (le) (rate(...[5m]))) for p99 latency from a histogram.
- OpenTelemetry is the standard instrumentation API — vendor-neutral SDKs for traces, metrics, and logs. The OTel auto-instrumentation libraries cover Flask, FastAPI, Django, requests, SQLAlchemy, psycopg, and most popular HTTP and DB clients with one line each.
- Distributed tracing is the only way to debug fan-out latency. A trace shows the full waterfall: which service called which, how long each span took, and where the tail latency lives. Charity Majors's writing (charity.wtf) is the canonical practitioner reference.
- Flamegraphs collapse a sampled call stack into a single picture: width is total CPU time spent in a function (and its descendants); height is stack depth. Read top-down for hot leaves; read bottom-up for hot ancestors. brendangregg.com/flamegraphs.html is canonical.
Why observability is the new SRE skill
Observability is the property of a system that lets you ask new questions of its behavior in production without shipping new code. The 2026 backend engineer doesn't get a free pass on this — the senior bar now assumes you can instrument a service, query its metrics, and read its traces without escalating to an SRE.
Three pillars cover the practical bar:
- Metrics — numeric time-series with low cardinality. Cheap to store, cheap to aggregate, expensive to slice by high-cardinality dimensions like user ID. Prometheus and PromQL are the dominant open standard.
- Traces — request-scoped causal records. A trace shows the full waterfall of spans (work units) and their parent-child relationships across services. OpenTelemetry is the dominant open instrumentation API; backends include Tempo, Jaeger, Honeycomb, Datadog APM, and others.
- Logs — structured event records. Cheap per-event but unbounded in volume; the senior pattern is structured JSON logs with a trace_id field so a single trace can be reconstructed across log lines.
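To make that last point concrete, here is a minimal sketch of trace-correlated JSON logging with the OpenTelemetry Python API. It assumes a tracer is already configured (as in the FastAPI example later in this piece); the TraceJsonFormatter class and logger name are illustrative, not a standard API.

import json
import logging

from opentelemetry import trace

class TraceJsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, carrying the active trace and span IDs."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            # Hex-encode the IDs the way trace backends display them.
            payload["trace_id"] = format(ctx.trace_id, "032x")
            payload["span_id"] = format(ctx.span_id, "016x")
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(TraceJsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("checkout").info("cart loaded")
# -> {"level": "INFO", "logger": "checkout", "message": "cart loaded", "trace_id": "...", "span_id": "..."}

OpenTelemetry also ships a logging instrumentation that injects these IDs automatically; the manual formatter above just makes the mechanism visible.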
Charity Majors's foundational essay Observability — a 3-year retrospective is the canonical practitioner reference for why observability is not just "more logs". The summary: logging is what you knew to ask; observability is what you can ask after the fact. If you have to ship a new log line every time production breaks, your system isn't observable yet.
The Google SRE Book chapter on Monitoring Distributed Systems introduces the four golden signals — latency, traffic, errors, saturation — the parent abstraction over both USE and RED. Senior engineers know all three frameworks (golden signals, USE, RED) and pick the right one for the question being asked.
USE method vs RED method — when to use each
The two checklists answer different questions:
- USE method (Brendan Gregg) is for resources. For every resource (CPU, memory, disk, network interface, file descriptor pool), check Utilization (percent busy), Saturation (queue depth or wait time), and Errors (count of failed operations). USE is the fastest way to find a saturated resource on a single host or container. The canonical reference is brendangregg.com/usemethod.html.
- RED method (Tom Wilkie at Weaveworks) is for services. For every request-driven endpoint, track Rate (requests per second), Errors (failed requests per second or as a ratio), and Duration (histogram of request latency). RED is the fastest way to find a degraded endpoint across a fleet of microservices.
The practical rule: USE on the host / pod / container; RED on the service / endpoint. Together they cover the resource layer and the service layer.
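Before the queries, a quick look at where those RED series come from. A minimal sketch using the official prometheus_client library; the metric names match the PromQL below, while the observe_request helper, the label set, and the bucket edges are illustrative choices, not a standard.

import time

from prometheus_client import Counter, Histogram, start_http_server

# RED: Rate and Errors come from the counter, Duration from the histogram.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["route", "method", "status"],
)
DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["route"],
    # Explicit buckets: the p99 you compute later is only as good as these edges.
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def observe_request(route: str, method: str, status: int, duration_s: float) -> None:
    """Call from your framework's middleware once per request."""
    REQUESTS.labels(route=route, method=method, status=str(status)).inc()
    DURATION.labels(route=route).observe(duration_s)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

# In your middleware, once per request:
t0 = time.perf_counter()
# ... handle the request ...
observe_request("/checkout", "POST", 200, time.perf_counter() - t0)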
The most-used PromQL pattern in 2026 is computing p99 request latency from a histogram metric. A canonical query against a Prometheus histogram named http_request_duration_seconds_bucket labeled by route and status:
# p99 request duration per route over the last 5 minutes
histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(http_request_duration_seconds_bucket{job="api"}[5m])
  )
)

# Request rate (RED — Rate) per route
sum by (route) (rate(http_requests_total{job="api"}[5m]))

# Error ratio (RED — Errors) per route
  sum by (route) (rate(http_requests_total{job="api",status=~"5.."}[5m]))
/
  sum by (route) (rate(http_requests_total{job="api"}[5m]))
What this query gets right: it uses rate() over a 5-minute window (Prometheus needs at least four scrape intervals inside the range for a stable rate), it sums the bucket counts before applying histogram_quantile (you must aggregate the per-bucket rates, not the percentiles), and it groups by le (the bucket boundary) plus route so you get one p99 series per route. The Prometheus query basics page is the canonical reference.
Tail-latency amplification is the senior reasoning step. If a service fans out to 10 dependencies in parallel and each dependency has a p99 of 100ms, the chance that at least one call lands in the tail is 1 - 0.99^10, roughly 10% of requests. Your service is therefore already waiting 100ms or more at around its p90, and its p99 is governed by the dependency's p99.9, not its p99. The Google SRE Book chapter on Practical Alerting covers the alerting implications.
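A quick simulation makes the arithmetic concrete. The latency distribution here is an assumption for illustration (a lognormal with roughly a 100 ms p99); the point is how taking the max over a fan-out shifts the percentiles.

import random

random.seed(42)

def dependency_latency_ms() -> float:
    # Assumed dependency latency: ~20 ms median with a tail near 100 ms at p99.
    return random.lognormvariate(3.0, 0.7)

def percentile(values, p):
    values = sorted(values)
    return values[int(p * (len(values) - 1))]

FAN_OUT = 10
TRIALS = 100_000

single = [dependency_latency_ms() for _ in range(TRIALS)]
fanned = [max(dependency_latency_ms() for _ in range(FAN_OUT)) for _ in range(TRIALS)]

# 1 - 0.99**10 ~= 9.6%: about one request in ten waits on at least one tail call,
# so the fan-out p99 lands near the dependency's p99.9, not its p99.
print(f"P(any of 10 calls in the tail) = {1 - 0.99 ** FAN_OUT:.1%}")
print(f"dependency p99   = {percentile(single, 0.99):.0f} ms")
print(f"dependency p99.9 = {percentile(single, 0.999):.0f} ms")
print(f"fan-out p99      = {percentile(fanned, 0.99):.0f} ms")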
Instrumentation that pays off in production
OpenTelemetry is the vendor-neutral instrumentation API for traces, metrics, and logs. The OpenTelemetry observability primer is the canonical entry point. The senior pattern for a Python FastAPI service in 2026: auto-instrumentation for HTTP and DB calls, manual span creation around domain operations, and a custom histogram metric for any latency that matters to the business.
import time

from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# service.name is how the trace backend groups spans into services.
resource = Resource.create({"service.name": "checkout-api", "service.version": "2.4.1"})

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))

metrics.set_meter_provider(
    MeterProvider(
        resource=resource,
        metric_readers=[PeriodicExportingMetricReader(OTLPMetricExporter())],
    )
)

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
checkout_latency = meter.create_histogram(
    name="checkout.duration_seconds",
    unit="s",
    description="End-to-end checkout latency",
)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)                 # auto-instrument inbound HTTP
SQLAlchemyInstrumentor().instrument(engine=db_engine)   # db_engine: your SQLAlchemy engine

@app.post("/checkout")
async def checkout(payload: CheckoutRequest):           # CheckoutRequest: your request model
    with tracer.start_as_current_span("checkout.process") as span:
        span.set_attribute("cart.item_count", len(payload.items))
        start = time.perf_counter()
        try:
            order = await process_checkout(payload)     # your domain logic
            span.set_attribute("order.id", order.id)
            return order
        finally:
            checkout_latency.record(
                time.perf_counter() - start,
                attributes={"tenant": payload.tenant_id},
            )
What this code gets right: a service.name Resource attribute (every backend in your fleet must set this — it's how the trace backend groups spans into services); auto-instrumentation for FastAPI and SQLAlchemy (one line each, no manual span code for HTTP or DB); a manual span around the domain operation with a few high-value attributes (cart.item_count, order.id); and a histogram metric so you can write the p99 PromQL query above against it.
The senior taste call is what to not instrument: don't add a span for every function call (tracing has overhead and visual noise); don't put PII in span attributes (traces leave your service — assume they're as widely visible as logs); don't record a high-cardinality attribute like a user ID on a metric (cardinality explosion will blow up your metrics backend). Use the trace for high-cardinality investigation; use metrics for low-cardinality aggregates.
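As a rule of thumb in code, a small sketch of that split. It reuses the tracer and checkout_latency histogram from the example above; the attribute names are illustrative, not OTel semantic conventions.

with tracer.start_as_current_span("checkout.process") as span:
    # Fine on a span: per-request, high-cardinality context for one trace.
    span.set_attribute("user.id", "u_48213")
    span.set_attribute("cart.id", "c_991203")

    # Fine on a metric: a handful of stable values you will group by in PromQL.
    checkout_latency.record(0.231, attributes={"payment_method": "card"})

    # Cardinality bomb: one time series per user. Keep this on the span instead.
    # checkout_latency.record(0.231, attributes={"user_id": "u_48213"})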
Reading a flamegraph + finding the slow query
Flamegraphs are the canonical visualization for sampled CPU profiles. Brendan Gregg's brendangregg.com/flamegraphs.html is the source-of-truth reference. The rules:
- X-axis is total time spent in a function and its descendants — wider means more CPU. The X-axis is not time-ordered; it's alphabetical or aggregated.
- Y-axis is stack depth — the function at the top of a tower called the function below it.
- Hot leaves are the wide blocks at the top — those are where CPU is actually being burned. Read top-down to find them.
- Hot ancestors are the wide blocks at the bottom — those are the call paths that triggered the work. Read bottom-up to find them.
A simplified flamegraph excerpt for a checkout request showing a slow ORM query (drawn here root-at-top, so the deepest frames sit at the bottom of the excerpt):
|------------------- checkout_handler [86%] -------------------|---|
|------ process_checkout [78%] ------|-- charge_card [10%] -|...|
|---- compute_totals [9%] ----|----- load_cart_items [62%] -|...|
                              |-- SQLAlchemy.execute [60%] -|
                              |---- psycopg.fetch [58%] ----|
                              |--- (kernel: epoll_wait) ----|
The reading: 60% of the request is spent in a single SQLAlchemy execute, 58% of which is the network/IO wait for Postgres to return rows. The flamegraph tells you the call path; it does not tell you why the query is slow. For that, drop into EXPLAIN (ANALYZE, BUFFERS):
EXPLAIN (ANALYZE, BUFFERS)
SELECT ci.id, ci.product_id, ci.quantity, p.name, p.price_cents
FROM cart_items ci
JOIN products p ON p.id = ci.product_id
WHERE ci.cart_id = '\x9f3c...';
-- Output (abridged):
-- Hash Join  (cost=12.34..1842.55 rows=240) (actual time=58.412..612.901 rows=240 loops=1)
--   Hash Cond: (ci.product_id = p.id)
--   ->  Seq Scan on cart_items ci  (cost=0.00..1820.00 rows=240 width=24)
--         Filter: (cart_id = '\x9f3c...')
--         Rows Removed by Filter: 184612
--   ->  Hash  (cost=10.10..10.10 rows=180 width=40)
--         ->  Seq Scan on products p
-- Planning Time: 0.412 ms
-- Execution Time: 614.221 ms
The diagnosis is now obvious: a Seq Scan on cart_items filtering on cart_id with 184,612 rows removed by filter — a missing index on cart_items(cart_id). The fix is one CREATE INDEX; the next deploy drops checkout p99 by 500ms. This is the senior loop: trace shows the slow service, flamegraph shows the slow function, EXPLAIN shows the slow query, index fixes it. Each tool answers a different question; you need all of them.
Capacity planning and the four golden signals
Capacity planning is the discipline of running production at a known headroom from saturation. The senior pattern in 2026: use the four golden signals (latency, traffic, errors, saturation) as the input to a capacity model, and re-run the model whenever traffic shape changes.
- Latency. Measure p50, p95, and p99 — not just the average. Averages hide the tail; the tail is what users feel and what cascades into upstream timeouts.
- Traffic. Requests per second per endpoint, broken out by method. Capacity is dictated by the shape of traffic, not just the volume — a 100 RPS read-heavy workload behaves nothing like a 100 RPS write-heavy workload.
- Errors. Both 5xx (server-fault) and 4xx (client-fault) rates. Watch for 4xx spikes — they often indicate a deploy regression or a client misuse, not just bad input.
- Saturation. The hardest of the four to measure. Useful proxies: CPU run-queue length, connection-pool wait time, work-queue depth, JVM/CLR GC pause time. Saturation leads latency — the queue grows before the request times out.
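For the proxies in that last bullet, a small sketch of exporting saturation signals with prometheus_client. The pool and queue objects are assumptions about your stack; the SQLAlchemy QueuePool calls (checkedout(), size()) and queue.Queue.qsize() are the real APIs being leaned on.

from prometheus_client import Gauge

POOL_IN_USE = Gauge("db_pool_connections_in_use", "Checked-out DB connections")
POOL_SIZE = Gauge("db_pool_connections_max", "Configured DB pool size")
QUEUE_DEPTH = Gauge("worker_queue_depth", "Jobs waiting in the in-process work queue")

def sample_saturation(pool, work_queue) -> None:
    """Call on a timer, roughly once per scrape interval."""
    POOL_IN_USE.set(pool.checkedout())   # SQLAlchemy QueuePool
    POOL_SIZE.set(pool.size())
    QUEUE_DEPTH.set(work_queue.qsize())  # queue.Queue

These gauges are for diagnosis, not paging: alert on the symptom (latency, errors) and read the saturation series to explain it.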
The Google SRE Book chapter on Monitoring Distributed Systems is the canonical reference. The practical capacity-planning loop: load-test until you find the knee of the latency curve (p99 starts climbing faster than traffic), measure the saturation signal at the knee (e.g., CPU at 70%, pool wait at 5ms), set the production headroom alert at 60% of the knee, and re-run the test on every major release.
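A rough sketch of the knee-finding step. The (RPS, p99) pairs are made-up load-test numbers, and the heuristic (first point where p99 grows proportionally faster than traffic) is one simple way to formalize the knee, not a standard.

# (offered RPS, measured p99 in ms) from a load test — illustrative numbers.
measurements = [
    (100, 42), (200, 44), (300, 47), (400, 52),
    (500, 61), (600, 83), (700, 140), (800, 310),
]

knee = None
for (rps_a, p99_a), (rps_b, p99_b) in zip(measurements, measurements[1:]):
    traffic_growth = rps_b / rps_a
    latency_growth = p99_b / p99_a
    if latency_growth > traffic_growth:
        knee = (rps_b, p99_b)
        break

if knee:
    knee_rps, knee_p99 = knee
    print(f"knee at ~{knee_rps} RPS (p99 {knee_p99} ms)")
    print(f"set the headroom alert at ~{int(knee_rps * 0.6)} RPS")  # 60% of the knee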
The senior taste call: don't alert on raw resource metrics (CPU > 80% is not actionable on its own — a CPU-bound batch job is supposed to run hot). Alert on user-visible symptoms (p99 latency, error rate) and use resource metrics for diagnosis, not for paging. The Practical Alerting chapter covers the symptom-vs-cause distinction in depth.
Frequently asked questions
- When should I use the USE method vs the RED method?
- USE on resources, RED on services. USE (Utilization, Saturation, Errors) finds a saturated CPU, disk, or network interface on a host or container. RED (Rate, Errors, Duration) finds a degraded endpoint across a fleet of microservices. Together they cover both the infrastructure layer and the request-handling layer. Brendan Gregg's USE method page (brendangregg.com/usemethod.html) and Google's four golden signals are the canonical references.
- Why is histogram_quantile applied after the rate, not before?
- Because percentiles are not linearly aggregatable. A p99 across two pods is not the average of each pod's p99. The PromQL pattern is to take rate() of each bucket count, sum by (le) to combine pod-level histograms into a fleet-level histogram, then apply histogram_quantile to interpolate the percentile from the merged buckets. The Prometheus query basics page covers this in depth.
- How do I avoid cardinality explosion in metrics?
- Never put unbounded high-cardinality fields in metric labels. User ID, request ID, full URL paths, IP addresses are all cardinality bombs. The rule of thumb: a label is fine if its set of distinct values is small and stable (HTTP method, status code, route template). For high-cardinality investigation, use traces and structured logs — metrics are for low-cardinality aggregates.
- What's the difference between a span attribute and a metric label?
- Span attributes are per-trace, high-cardinality metadata for one specific request — user ID, order ID, cart size are all fine on a span. Metric labels are aggregated dimensions for time-series — they must be low cardinality. The same value (e.g., user ID) belongs on a span but does not belong on a metric label. The OpenTelemetry primer covers the distinction.
- Should I instrument every function with a span?
- No. Spans have non-zero overhead in CPU, memory, and visual noise in the trace. The senior pattern: auto-instrument the boundaries (HTTP server, DB client, HTTP client, message queue), then add manual spans only around domain operations that you want to see in the waterfall. A trace with 500 spans is worse than a trace with 20 well-chosen spans.
- How do I read a flamegraph if I've never seen one?
- Width is total time in a function and its callees; height is stack depth. Look for the widest blocks at the top — those are the hot leaves where CPU is actually burning. To find why, read the tower below them top-to-bottom: that's the call path. The X-axis is alphabetical (not time-ordered), so two adjacent blocks are not necessarily related. Brendan Gregg's flamegraphs page is the canonical primer.
- What is tail-latency amplification?
- When a service fans out to N dependencies in parallel, its p99 is dominated by the probability that any one of the N calls hits the tail. A service calling 10 dependencies, each with a p99 of 100ms, has an aggregate response that hits the dependency's p99.9 territory — not its p99. The fix is hedged requests, tighter dependency p99 SLOs, or reducing fan-out. Google's SRE Book covers the math.
- How do I correlate traces with logs?
- Inject the trace_id into every log line emitted during request handling. OpenTelemetry's logging integrations do this automatically for Python's logging module, structlog, and most Java/Go loggers. With trace_id in both places, you can pivot from a slow trace to its log lines, or from an error log line to its full trace. This is the practical foundation of observability — each pillar reinforces the others.
- What's the right alert: high CPU or high p99 latency?
- High p99 latency. Alerts should fire on user-visible symptoms (latency, error rate, availability), not on resource metrics. A CPU-bound batch job is supposed to run hot; alerting on CPU > 80% will wake you up at 3am for a healthy system. Use resource metrics for diagnosis after a symptom alert fires. The SRE Book chapter on Practical Alerting covers the symptom-vs-cause split.
Sources
- Brendan Gregg — The USE Method. Canonical resource-checklist (Utilization, Saturation, Errors).
- Brendan Gregg — Flame Graphs. Canonical primer for reading sampled CPU profiles.
- Google SRE Book — Monitoring Distributed Systems. Source for the four golden signals.
- Google SRE Book — Practical Alerting. Canonical reference for symptom-based alerting.
- OpenTelemetry — Observability Primer. Canonical entry point for the OTel data model.
- Charity Majors — Observability, tracing, and logs. Practitioner reference for why observability is not just more logs.
- Prometheus Docs — Querying basics. Canonical reference for PromQL rate() and histogram_quantile().
About the author. Blake Crosley founded ResumeGeni and writes about backend engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.