Observability for Software Engineers (2026)

In short

Observability separates engineers who can ship from engineers who can keep things shipped. The 2026 bar is concrete: you can instrument a service with OpenTelemetry, write PromQL or LogQL queries that surface issues before users do, and calculate an error budget from an SLI/SLO without consulting a wiki. This page walks through a real instrumentation example, three production-grade query patterns, the SLI/SLO/error-budget math from Google's SRE Workbook (sre.google/workbook/), and the postmortem pattern that turns incidents into lessons.

Key takeaways

  • Logs, metrics, and traces are not interchangeable — logs answer 'what happened in this one request', metrics answer 'how is the system behaving overall', traces answer 'where in the call graph did time go'. Charity Majors's 'Observability Engineering' (honeycomb.io/observability-engineering) is the canonical 2026 framing.
  • OpenTelemetry (opentelemetry.io) is the open standard for instrumentation; production-readiness as of 2026 is mature for traces and logs, near-mature for metrics. Use the OTel SDK in your service, ship to whatever backend (Honeycomb, Datadog, Grafana Cloud).
  • An SLO is not aspirational — it's a contract. Google SRE workbook chapter 2 (sre.google/workbook/implementing-slos/) defines the error budget = 1 - SLO over a window; if you exceed the budget, feature work stops until reliability is restored.
  • p99 latency is a more honest metric than mean — Gil Tene's 'How Not to Measure Latency' talk (youtube.com/watch?v=lJ8ydIuPFeU) is required viewing on coordinated omission and percentile math.
  • Distributed tracing alone does not solve incidents — the senior practice is high-cardinality structured logging plus traces, with the ability to filter on any field (the Honeycomb pattern documented in 'Observability Engineering' ch. 3-4).
  • Postmortems are valuable when blameless and when they generate concrete action items — Etsy's 'Debriefing Facilitation Guide' (extfiles.etsy.com/DebriefingFacilitationGuide.pdf) is the public reference.

Three pillars: when to use logs, metrics, or traces

The three pillars of observability are often presented as interchangeable; they are not. Each answers a different question, and using the wrong tool means slower incidents.

  • Logs — Question answered: what happened in this specific request? Cardinality: high (any field can be unique). Cost model: storage + indexing. Tools: Splunk, ELK, Loki, Datadog Logs. Use when: debugging a specific reported issue.
  • Metrics — Question answered: how is the system behaving in aggregate? Cardinality: low (pre-aggregated). Cost model: cheap (numbers, not text). Tools: Prometheus, Datadog Metrics, M3. Use when: alerting, dashboards, capacity planning.
  • Traces — Question answered: where in the call graph did time/errors go? Cardinality: high per trace. Cost model: storage; sampled in production. Tools: Jaeger, Zipkin, Honeycomb, Lightstep. Use when: latency hunt, dependency analysis.

The senior+ pattern: structured logs with high cardinality replace many traditional metrics. Charity Majors's argument in 'Observability Engineering' (chapter 3) is that pre-aggregated metrics force you to predict what you'll need to query; high-cardinality structured logs let you ask new questions after the fact. The Honeycomb model is essentially 'wide events' with optional aggregation, which subsumes both metrics and logs at the cost of storage.
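
To make the wide-event idea concrete, here is a minimal sketch of emitting one wide, structured event per request instead of a handful of pre-aggregated counters. charge_card and the field names are illustrative, not a prescribed schema:

import json
import logging
import time

logger = logging.getLogger("checkout")

def handle_checkout(order, region: str):
    start = time.monotonic()
    charge = charge_card(order)  # hypothetical payment helper
    # One wide event per request; every field is queryable after the fact.
    logger.info(json.dumps({
        "event": "checkout.completed",
        "order_id": order.id,
        "customer_id": order.customer_id,
        "amount_cents": order.total_cents,
        "charge_id": charge.id,
        "region": region,
        "duration_ms": round((time.monotonic() - start) * 1000, 1),
    }))
    return charge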

Concrete example of when each wins:

  • Logs win when a customer reports 'my order #12345 failed at 3:47 PM' — you grep that order ID and trace ID across the request lifecycle.
  • Metrics win when you want to alert: 'p99 latency on /checkout exceeded 500ms for 5 minutes'.
  • Traces win when /checkout p99 latency degraded but you don't know which downstream service is slow — the trace shows the slow span.

Real OpenTelemetry instrumentation in Python

OpenTelemetry (opentelemetry.io) is the dominant open standard. Vendor-agnostic SDK; ship to whatever backend you have. Below is a working FastAPI instrumentation that captures HTTP requests, database calls, and a custom span for business logic.

# requirements.txt
# opentelemetry-api==1.27.0
# opentelemetry-sdk==1.27.0
# opentelemetry-instrumentation-fastapi==0.48b0
# opentelemetry-instrumentation-sqlalchemy==0.48b0
# opentelemetry-exporter-otlp==1.27.0

import os

from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# 1. Configure provider with service identity
resource = Resource.create({
    "service.name": "checkout-api",
    "service.version": "2.4.1",
    "deployment.environment": "production",
})
provider = TracerProvider(resource=resource)

# 2. Configure exporter (OTLP HTTP to Honeycomb / Datadog / etc.)
HONEYCOMB_API_KEY = os.environ["HONEYCOMB_API_KEY"]
exporter = OTLPSpanExporter(
    endpoint="https://api.honeycomb.io/v1/traces",
    headers={"x-honeycomb-team": HONEYCOMB_API_KEY},
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# 3. Auto-instrument FastAPI and SQLAlchemy (db_engine is your existing SQLAlchemy engine)
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
SQLAlchemyInstrumentor().instrument(engine=db_engine)

# 4. Add custom spans for business logic
#    (fetch_order and stripe_client are your application's own helpers)
tracer = trace.get_tracer(__name__)

@app.post("/checkout/{order_id}")
async def checkout(order_id: str):
    with tracer.start_as_current_span("checkout.process") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("checkout.validate"):
            order = await fetch_order(order_id)
            if not order:
                span.set_attribute("error", True)
                span.set_attribute("error.reason", "order_not_found")
                raise HTTPException(404)

        with tracer.start_as_current_span("checkout.charge") as charge_span:
            charge_span.set_attribute("amount.cents", order.total_cents)
            charge_span.set_attribute("customer.id", order.customer_id)
            result = await stripe_client.charge(order)
            charge_span.set_attribute("stripe.charge_id", result.id)

        return {"status": "ok", "charge_id": result.id}

What this gives you: for every /checkout call, a trace with three nested spans (process → validate → charge), each tagged with order ID, amount, customer ID, and Stripe charge ID. When p99 latency spikes, you query 'show me traces with checkout.charge span > 500ms grouped by stripe.charge_id' and see exactly which Stripe calls are slow.

The senior+ instrumentation discipline:

  • Tag with high-cardinality attributes (user ID, customer ID, order ID, request ID). The query value pays back the storage cost.
  • Use semantic conventions (opentelemetry.io/docs/specs/semconv/) — http.status_code, db.system, messaging.system. Standardized attribute names make cross-service queries possible.
  • Sample intelligently in production. Tail-based sampling (sample errors and slow requests at 100%, sample fast successful requests at 1%) keeps cost reasonable while preserving signal.
  • Propagate context across service boundaries. OpenTelemetry uses the W3C Trace Context standard (w3.org/TR/trace-context/); verify your HTTP client and message bus pass it (a minimal sketch follows this list).
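
A minimal sketch of propagating trace context on an outgoing HTTP call, assuming the requests library. In practice opentelemetry-instrumentation-requests injects these headers for you; the manual form just makes the mechanics visible, and the inventory URL and span name are illustrative:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def reserve_inventory(order_id: str) -> dict:
    with tracer.start_as_current_span("inventory.reserve") as span:
        span.set_attribute("order.id", order_id)
        headers = {}
        inject(headers)  # writes W3C traceparent/tracestate from the current span context
        resp = requests.post(
            "https://inventory.internal/reserve",  # illustrative downstream service
            json={"order_id": order_id},
            headers=headers,
            timeout=2,
        )
        span.set_attribute("http.status_code", resp.status_code)
        return resp.json()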

PromQL: three queries every senior engineer should write fluently

PromQL is Prometheus's query language. Most senior+ on-calls require it. Three patterns that cover ~80% of dashboard and alerting queries.

Query 1: error rate as a percentage of all requests over the last 5 minutes.

sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

Read it as: numerator is 5xx requests per second over the last 5 minutes; denominator is all requests per second over the last 5 minutes; multiply by 100 for a percentage. Use as the basis for an alert: 'page if error rate > 1% for 5 minutes'.

Query 2: p99 latency by endpoint, grouped.

histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

This requires a histogram metric (http_request_duration_seconds_bucket follows the common Prometheus client naming; OpenTelemetry's HTTP instrumentation exports an equivalent duration histogram). The query computes the p99 latency per route. Changing 0.99 to 0.50 gives you the median; 0.999 gives the p99.9.

Why p99 and not mean: Gil Tene's 'How Not to Measure Latency' (youtube.com/watch?v=lJ8ydIuPFeU) walks the math. The mean hides tail behavior; the p99 reveals it. A service can have 50ms mean and 5000ms p99 — the slow 1% is where users churn. Senior+ engineers always quote percentiles.
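
To see the gap numerically, a tiny synthetic illustration (not a benchmark; the latency mix is invented):

import statistics

# 98% of requests at 50 ms, 2% stuck behind a slow dependency at 5000 ms.
latencies_ms = [50.0] * 98_000 + [5000.0] * 2_000

mean = statistics.fmean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p99 = cuts[49], cuts[98]

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p99={p99:.0f}ms")
# mean=149ms  p50=50ms  p99=5000ms — the mean looks tolerable while
# 1 in 50 requests takes five full seconds.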

Query 3: SLO error budget burn rate (the SRE workbook formula).

# 99.9% SLO over 30 days = 0.1% error budget = 43.2 minutes
# Burn rate = (errors_in_window / total_in_window) / (1 - SLO)
# Burn rate 14.4 = consuming 2% of the 30-day budget within 1 hour (the workbook's page threshold)

# Multi-window multi-burn-rate alert (per SRE Workbook ch. 5):
# Page if 5m burn > 14.4 AND 1h burn > 14.4
( sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) ) / 0.001 > 14.4
and
( sum(rate(http_errors_total[1h])) / sum(rate(http_requests_total[1h])) ) / 0.001 > 14.4

The two-window pattern (the canonical SRE alert) avoids both flaky single-window alerts and slow long-window alerts. Reference: Google SRE Workbook chapter 5 'Alerting on SLOs' (sre.google/workbook/alerting-on-slos/).
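
Where 14.4 comes from: the workbook's page-level condition is "burning 2% of a 30-day budget within 1 hour". A small sketch of that arithmetic (the three thresholds mirror the workbook's recommended table):

def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that consumes budget_fraction of the error budget
    within alert_window_hours, for an SLO measured over slo_window_hours."""
    return budget_fraction * slo_window_hours / alert_window_hours

print(burn_rate_threshold(0.02, 1))    # 14.4 -> page (fast burn)
print(burn_rate_threshold(0.05, 6))    # 6.0  -> page (slower burn)
print(burn_rate_threshold(0.10, 72))   # 1.0  -> ticket (slow, steady burn)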

SLI/SLO/SLA and error-budget math

SLI/SLO is the most-confused topic in senior+ interviews and one of the most-misunderstood concepts in production engineering. The crisp definitions:

  • SLI (Service Level Indicator): the metric you measure. Examples: the fraction of requests that return successfully; the p99 latency of /checkout.
  • SLO (Service Level Objective): the target you commit to internally for that SLI, e.g. 99.9% of requests succeed over 30 days, or p99 latency under 500ms. Keeping the SLI within the SLO is the goal.
  • SLA (Service Level Agreement): the contractual commitment to customers, usually with financial penalties. SLAs are typically a notch weaker than internal SLOs (e.g., internal SLO 99.95%, customer SLA 99.9%).

The error budget formula (canonical, from the Google SRE Book's 'Embracing Risk' chapter, sre.google/sre-book/embracing-risk/):

Error budget = 1 - SLO

99.9% SLO over 30 days:
  Error budget = 0.1% = 0.001
  30 days = 43,200 minutes
  Allowed downtime = 43,200 * 0.001 = 43.2 minutes/month

99.99% SLO over 30 days:
  Error budget = 0.01%
  Allowed downtime = 4.32 minutes/month

99.999% SLO over 30 days ("five nines"):
  Error budget = 0.001%
  Allowed downtime = 25.9 seconds/month  (basically impossible without huge effort)
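
The same arithmetic as a throwaway helper, handy for sanity-checking an SLO proposal:

def allowed_downtime(slo: float, window_days: int = 30) -> str:
    """Downtime permitted by an availability SLO over the window."""
    budget_minutes = window_days * 24 * 60 * (1 - slo)
    return f"{slo:.3%} SLO -> {budget_minutes:.1f} min ({budget_minutes * 60:.0f} s) per {window_days} days"

for slo in (0.999, 0.9999, 0.99999):
    print(allowed_downtime(slo))
# 99.900% SLO -> 43.2 min (2592 s) per 30 days
# 99.990% SLO -> 4.3 min (259 s) per 30 days
# 99.999% SLO -> 0.4 min (26 s) per 30 days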

The SRE Workbook policy (the load-bearing concept): when the error budget is exhausted, feature work stops. The team focuses on reliability until the budget recovers. This sounds harsh; it is also the only mechanism that forces the tradeoff between speed and reliability to be made explicitly.

The senior+ trick: pick SLOs that match user happiness, not engineering pride. A 99.99% SLO on an internal admin tool is engineering vanity. A 99.9% SLO on a checkout endpoint is pragmatic. Google's SRE Workbook chapter 2 has the framework: 'identify the customer journey, identify the metric for that journey, set the SLO at the level where customers stop being happy.'

Common mistake: alerting on raw SLI thresholds instead of SLO burn. 'Page if error rate > 1% for 5 minutes' is brittle (false positives during deploy spikes). 'Page if the budget is burning at 14.4x over both 5 minutes and 1 hour' is what the SRE Workbook recommends — it correlates with actually running out of budget.

Production-incident walkthrough: a real pattern

A worked example: a realistic scenario, with the queries and decision points.

Scenario: Friday 2pm. /checkout p99 latency starts climbing. PagerDuty fires at 14:03 ('p99 > 800ms for 5 min'). You're on call.

Step 1 (14:03-14:05): triage. First questions: is it user-impacting? Is it isolated or widespread?

# Quick PromQL on the dashboard:
sum(rate(http_request_duration_seconds_count{route="/checkout"}[1m])) by (status)
# Output: 95% are 200, 5% are 500. 5xx rate is up 3x from baseline.

# Error budget burn (99.9% SLO):
# burn rate = error_rate / (1 - SLO) = 0.05 / 0.001 = 50
# At 50x, the 30-day budget exhausts in roughly 14 hours. Page-worthy.

Step 2 (14:05-14:10): scope. Where in the call graph?

# Honeycomb / Datadog APM: query traces with /checkout span where duration > 1s
# Group by service.name: what downstream is slow?
# Result: 80% of slow traces have stripe.charge span > 800ms.
# Stripe API status page: 'elevated latency in payments processing' — confirmed.

Step 3 (14:10-14:12): mitigation. Stripe is slow; we own the response. Two options:

  • Wait it out (Stripe will recover).
  • Activate degradation mode: queue the charge async, return 'order received' to user, charge happens in 5 min. Trades immediate-charge for availability.

You activate degradation. /checkout p99 returns to normal in 60 seconds.
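
For reference, a sketch of what that degradation path can look like. settings.degrade_payments and enqueue_charge are hypothetical stand-ins for a feature flag and a queue producer, not the incident's actual code:

@app.post("/checkout/{order_id}")
async def checkout(order_id: str):
    order = await fetch_order(order_id)
    if not order:
        raise HTTPException(404)

    if settings.degrade_payments:
        # Degradation mode: accept the order now, charge asynchronously once Stripe recovers.
        await enqueue_charge(order)
        return {"status": "accepted", "charge": "pending"}

    result = await stripe_client.charge(order)  # normal synchronous path
    return {"status": "ok", "charge_id": result.id}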

Step 4 (14:12-15:00): verify and communicate. Post to the status page: 'Brief checkout slowness; resolved.' Slack the team. Watch the next hour to confirm Stripe recovered.

Step 5 (next business day): postmortem. Blameless. Five questions (Etsy debriefing format):

  1. What happened? Stripe API latency spike at 14:00; our p99 climbed because we synchronously charged.
  2. Why didn't we catch it sooner? We had no SLO on the synchronous Stripe call; only on aggregate /checkout latency.
  3. What did we do well? Degradation mode was already built and tested; activation took 90 seconds.
  4. What was hard? Identifying the root cause (Stripe vs our DB vs network) took 5 minutes; with better tracing dashboards, this is 30 seconds.
  5. Action items: (a) Add SLO on stripe.charge span p99; (b) Add a one-click 'who's slow' dashboard that auto-segments by downstream service; (c) Document the degradation runbook in the on-call wiki.

The senior+ practice: action items are concrete, owned, and dated. 'Improve monitoring' is not an action item. 'Sarah adds Stripe SLO by Friday' is.

What senior+ interviews probe on observability

From job postings (Stripe, Anthropic, Databricks senior SRE/SWE roles) and Hello Interview's senior rubric, the interview signals on observability:

  • Can you instrument a service from scratch? 'Walk me through adding tracing to this Python service' — expects OpenTelemetry SDK, semantic conventions, sampling discussion.
  • Can you debug from data? Given a dashboard with anomalous behavior, can you propose hypotheses and queries to test them? This is a standard exercise in SRE-track interview loops.
  • Can you talk about an incident? Bring a real incident story to senior+ interviews — what happened, what you did, what you learned, what you'd change. Specific numbers help: 'p99 climbed from 80ms to 1.4s; identified Stripe in 8 minutes; degraded gracefully; 0 customer complaints'.
  • Do you know the math? Error budget calculation, percentile semantics (p99 vs mean), why coordinated omission breaks naive latency measurement (Gil Tene's talk).
  • Do you have an opinion on metrics vs logs vs traces? The Charity Majors high-cardinality-events argument vs the traditional three-pillar argument is a real debate at senior+ levels. Have a position; defend it.

The book if you read one: Google's SRE Workbook (sre.google/workbook/, free online). Chapters 2 (implementing SLOs) and 5 (alerting on SLOs) are the highest-leverage chapters for SWE+ candidates. Follow it with Charity Majors et al., 'Observability Engineering' (honeycomb.io/observability-engineering, free PDF) for the modern wide-events framing.

Frequently asked questions

Should I instrument with OpenTelemetry or with a vendor SDK?
OpenTelemetry, almost always. Vendor lock-in on observability is expensive: switching costs include re-instrumenting every service. OTel SDK to OTel-format export, plus a vendor-specific exporter, gives you portability. The exception: if you're using a vendor's auto-instrumentation that's specifically richer than OTel (Datadog APM has historically been ahead on Python integrations), the trade-off can favor the vendor. As of 2026, OTel has caught up for most languages.
How do I sample traces in production without losing the signal?
Tail-based sampling: keep 100% of error traces, 100% of slow traces (above some threshold), 1-10% of normal-success traces. This is supported by Honeycomb's Refinery (github.com/honeycombio/refinery), Grafana Tempo's tail sampler, and Datadog's APM. Head-based sampling (sample at trace start) loses the rare-event signal — don't use it past prototype scale.
What's the difference between a span attribute and a span event?
Attributes are tags on the span (key-value, valid for the span duration): order_id, user_id, status_code. Events are timestamped points within a span: 'cache_miss', 'retry_attempted'. Use attributes for filterable dimensions; use events for sub-span timing without creating a child span. OpenTelemetry semantic conventions (opentelemetry.io/docs/specs/semconv/) document standardized names for both.
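A minimal contrast using the OpenTelemetry Python API (cache, load_from_db, and key are hypothetical stand-ins):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("cache.lookup") as span:
    span.set_attribute("cache.key", key)   # attribute: a filterable dimension on the span
    value = cache.get(key)
    if value is None:
        span.add_event("cache_miss")       # event: a timestamped point inside the span
        value = load_from_db(key)
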
How much should I log at INFO level vs DEBUG?
INFO level should give you a coarse audit trail: requests in, responses out, key business events. DEBUG should let you reconstruct the internal flow when something goes wrong. The senior practice: INFO at production volume is cheap; DEBUG at production volume is not. Use feature flags to enable DEBUG selectively for specific users or request patterns when you need it. Reference: the Google SRE Book's 'Effective Troubleshooting' chapter covers the debugging side of this in depth.
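A sketch of that selective-DEBUG pattern with the standard logging module; the allowlist set is a stand-in for whatever feature-flag store you actually use:

import logging

debug_enabled_users = {"user_4242"}  # hypothetical allowlist, fed by a feature flag

class SelectiveDebug(logging.Filter):
    """Pass DEBUG records only for flagged users; INFO and above always pass."""
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.INFO:
            return True
        return getattr(record, "user_id", None) in debug_enabled_users

logger = logging.getLogger("checkout")
logger.setLevel(logging.DEBUG)        # capture everything...
handler = logging.StreamHandler()
handler.addFilter(SelectiveDebug())   # ...but only emit DEBUG for flagged users
logger.addHandler(handler)

logger.debug("cart recalculated", extra={"user_id": "user_4242"})
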
Why does p99 latency matter more than p50 (median) or mean?
Two reasons. (1) The mean is dragged by outliers but doesn't expose them — a service with a 10ms median and a 5000ms p99 can still average out to roughly 60ms, which looks fine on a dashboard. (2) Users experience the tail. If 1% of requests land in the tail and you have 100k users/day, that's on the order of 1,000 users/day having a slow experience. Most of them remember the slow experience, not the median. Gil Tene's 'How Not to Measure Latency' (youtube.com/watch?v=lJ8ydIuPFeU) is canonical on this; required viewing for senior+ candidates.
How do I write a postmortem that doesn't blame people?
Use the Etsy debriefing format (extfiles.etsy.com/DebriefingFacilitationGuide.pdf). The principle: 'human error' is never the root cause; it's a symptom of a system that allows human error to cause an outage. The right framing: 'Sarah's deploy at 14:00 triggered the cascade because the deploy system has no canary stage' — Sarah is named neutrally; the actionable lesson is about the deploy system. Blameless postmortems generate fixes; blaming postmortems generate cover.
What's the right ratio of alerts to engineers?
Roughly: an on-call engineer should get fewer than 2 paging alerts per shift, and false-positive rate should be under 5%. More than that and the team has alert fatigue, which kills response quality. The fix: tune SLO-based alerts to match real customer impact, not engineering anxiety. The Google SRE workbook chapter 5 covers tuning in detail. The pattern that breaks teams: alerting on every dashboard widget; engineers stop reading alerts; a real incident gets ignored.
Should application engineers own their service's observability or is it a separate team?
Application engineers own it. SRE / Platform teams provide the tooling (OTel libraries, dashboards, alerting framework); application engineers instrument their services and own their SLOs. This shared-responsibility split is documented throughout Google's 'Site Reliability Workbook'. The wrong pattern (common in the 2010s): a separate ops team writes the dashboards and the application team doesn't read them. The right pattern: ops/SRE defines the abstraction; engineering uses it.

Sources

  1. Google — Site Reliability Workbook (free online), especially ch. 2 (implementing SLOs) and ch. 5 (alerting on SLOs).
  2. Google — SRE Book, 'Embracing Risk' chapter (canonical error-budget framing).
  3. OpenTelemetry — open instrumentation standard.
  4. OpenTelemetry — semantic conventions for span attributes.
  5. Charity Majors et al — 'Observability Engineering' (free PDF, the high-cardinality-events framing).
  6. Gil Tene — 'How Not to Measure Latency' (Strange Loop 2015).
  7. Etsy — Debriefing Facilitation Guide (canonical blameless-postmortem reference).
  8. Honeycomb Refinery — production reference implementation of tail-based sampling.

About the author. Blake Crosley founded ResumeGeni and writes about product design, hiring technology, and ATS optimization. More writing at blakecrosley.com.