Incident Management and On-Call — SRE Deep-Skill Guide for Tech Companies (2026)
In short
Incident command framework, severity tiers, blameless postmortems, and sustainable on-call practices. The operational backbone of SRE: how teams move from page to resolution to learning without burning out the humans who keep production running.
Key takeaways
- Incident command (IC, SME, Comms, Scribe) is not optional structure; it is what separates resolution from chaos.
- Severity must be declared early and revised explicitly when impact changes.
- Blameless postmortems investigate systems, not people; action items must have owners, priorities, and ticket links.
- Sustainable on-call requires <=25% load per engineer, <=2 pages per shift, and ruthless alert tuning.
- Follow-the-sun rotations work only when runbooks are legible to engineers who did not write the code.
- Status-page comms: post within 15 minutes, update on cadence, resolve when SLIs recover, summarize after.
- Tools accelerate a working process; they do not create one.
Incident command framework + severity tiers
Incident response without structure devolves into a chat-room scrum where ten engineers ask the same questions and nobody owns the decision. The Google SRE book formalizes the cure: borrow the Incident Command System (ICS) from wildfire response. One person commands. Everyone else has a role.

The four core ICS roles are Incident Commander (IC), Operations Lead / Subject Matter Expert (Ops/SME), Communications Lead, and Scribe. The IC is the single decision-maker. They do not type at the keyboard. They ask 'what do we know, what are we doing, who owns it' and break ties. The SME drives technical investigation and proposes mitigations. The Communications Lead owns customer and internal updates so the SME is not interrupted every five minutes by Slack DMs from VPs. The Scribe captures the timeline in a running document so the postmortem writes itself.

Crucially, the IC role is not seniority-based. A new SRE can command an incident with a staff engineer as SME. Google's doctrine is explicit: rank does not transfer. Whoever holds the IC hat owns the decisions for the duration. Handoffs are formal ('I am handing IC to Sarah, Sarah do you accept?') and announced in the incident channel.

Severity classification gates how the framework activates. A common four-tier scheme:

- Sev0 (Critical): Customer-facing outage with significant revenue, safety, or trust impact. Full ICS activation. Page executive on-call. Status-page incident posted within 15 minutes. Examples: complete login outage, payment processing down, data loss in progress.
- Sev1 (Major): Significant degradation affecting many users or a critical workflow. Full ICS. Status-page updates every 30-60 minutes. Examples: search broken, checkout 50% error rate, regional outage.
- Sev2 (Moderate): Limited impact or a workaround exists. Lightweight IC structure. No status-page post unless duration exceeds a threshold. Examples: one feature degraded, elevated latency without errors, single-tenant problem.
- Sev3 (Minor): Low-impact issue tracked but not actively fought. Often resolved during business hours. Examples: minor UI bug, non-critical alert tuning, internal tooling slow.

Severity must be declared early and revised explicitly. A Sev2 that turns out to be a data-loss event becomes Sev0 the moment you know, and the IC announces the upgrade. Severity drift, where an incident silently grows without re-classification, is one of the most common postmortem findings and a leading cause of customer-trust damage.
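Teams that automate any of this usually keep the severity matrix in version control so the incident bot, the status-page automation, and the humans all read the same definitions. A minimal sketch of what that can look like; the schema and field names here are illustrative, not any vendor's format:

```yaml
# severity.yaml - illustrative severity matrix; tiers, fields, and thresholds
# are assumptions to adapt, not a standard schema.
severities:
  sev0:
    description: customer-facing outage with revenue, safety, or trust impact
    ics_roles: [ic, sme, comms, scribe]   # full activation
    page: [service-oncall, exec-oncall]
    status_page:
      initial_post_within: 15m
      update_cadence: 30m
  sev1:
    description: major degradation of a critical workflow
    ics_roles: [ic, sme, comms, scribe]
    page: [service-oncall]
    status_page:
      initial_post_within: 15m
      update_cadence: 60m
  sev2:
    description: limited impact or a workaround exists
    ics_roles: [ic, sme]                  # lightweight structure
    page: [service-oncall]
    status_page:
      post_only_if_longer_than: 4h
  sev3:
    description: low impact, tracked, handled in business hours
    ics_roles: []
    page: []
    status_page: none
```

The specific thresholds matter less than the guarantee that 'declare Sev1' resolves to the same activation and comms obligations at 3am as it does in a runbook review.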
Blameless postmortems: the Etsy/Google heritage
The blameless postmortem is one of the most important cultural exports from modern operations engineering. Etsy popularized the practice in their 2012 Code as Craft post 'Blameless PostMortems and a Just Culture,' building on John Allspaw's work and earlier safety-science research from aviation and healthcare. Google then codified it in the SRE book chapter on Postmortem Culture and made it mandatory for every significant incident.

The premise: humans do not cause incidents. Systems that allow humans to cause incidents cause incidents. If a deploy script let an engineer drop the production database, the script is the defect. Punishing the engineer trains the team to hide mistakes, which is the opposite of what a learning organization needs. Allspaw's framing: assume people made the best decision they could with the information they had at the time, then ask what made the wrong decision look right.

Blameless does not mean accountability-free. The postmortem asks 'what in the system allowed this,' not 'who is to blame.' Action items still have owners. Recurring patterns still get escalated. But the conversation in the room is investigative, not prosecutorial.

A working postmortem template captures the structure most high-functioning teams converge on:

```markdown
# Postmortem: [Incident Title]

Status: Draft | Final | Owner: @ic-handle | Date: YYYY-MM-DD

## Summary
One-paragraph plain-English description. What broke, who was affected, how long, how it was resolved.

## Impact
- Duration: HH:MM to HH:MM UTC (NN minutes)
- Users affected: count or percentage
- Revenue / SLO budget burned: $ and % of monthly budget
- Severity: Sev0 / Sev1 / Sev2

## Timeline (UTC)
- HH:MM Trigger event (deploy, traffic spike, dependency)
- HH:MM First alert fires / first user report
- HH:MM IC declared, war room opened
- HH:MM Mitigation applied (rollback / failover / flag)
- HH:MM Recovery confirmed by SLI dashboards
- HH:MM All-clear, status page resolved

## Root cause(s)
Use the Five Whys or a fishbone. Stop when you hit a system boundary you can change, not when you hit a person.

## What went well

## What went poorly

## Where we got lucky

## Action items
| Owner | Action | Priority | Due | Ticket |
|-------|--------|----------|-----|--------|
| @x    | ...    | P0       | ... | LINK   |
```

Two structural details matter. First, 'where we got lucky' is not optional; it is the section that surfaces near-misses, and it is often where the highest-value action items hide. Second, every action item has an owner, a priority, and a ticket link. Postmortems with unticketed action items have a near-zero completion rate, which Honeycomb and incident.io have both documented in their engineering blogs.

Action-item tracking is its own discipline. A common SRE anti-pattern is producing thirty postmortem action items per quarter and completing six. Mature teams cap action items at three to five P0/P1 items per incident, accept that P2 items are aspirational, and run a monthly review that closes or deletes stale tickets so the backlog stays honest.
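Some teams go a step further and write the action-item policy down as configuration that a monthly review script or incident bot can enforce. A sketch under that assumption; the schema is invented for illustration:

```yaml
# postmortem-policy.yaml - illustrative action-item hygiene policy, not a vendor format
action_items:
  max_p0_p1_per_incident: 5       # beyond this, scope is too broad to finish
  require_fields: [owner, priority, due_date, ticket_link]
  p0_due_within: 14d
  p1_due_within: 45d
review:
  cadence: monthly
  stale_after: 90d                # stale items are closed or explicitly re-justified
  report_to: sre-leads
```

Whether this lives in a YAML file or a team charter matters less than naming someone to run the monthly review; unowned reviews decay the same way unticketed action items do.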
Sustainable on-call: rotations, follow-the-sun, fatigue
On-call is the part of SRE that breaks people. Done badly, it is a stochastic tax on sleep that drives senior engineers out of operations roles within eighteen months. Done well, it is a tractable rotation that engineers tolerate and sometimes prefer to feature work because it surfaces real system behavior.

PagerDuty's published research on on-call best practices, echoed in Google SRE doctrine, converges on a few hard constraints. A primary on-call rotation should never exceed 25 percent of an engineer's working time when measured to include page response, follow-up, and recovery. Two pages per shift is the upper bound for sustainability; three or more is the threshold at which Google SRE policy requires the team to stop feature work and address operational load. Pages outside business hours count double for fatigue accounting.

An on-call rotation policy excerpt that codifies these principles:

```yaml
rotation: sre-platform-primary
shift_length: 7d                  # Mon 09:00 PT to Mon 09:00 PT
primary_count: 1
secondary_count: 1                # backup for missed page or escalation
min_engineers: 6                  # ensures <=25% on-call load per person
compensation:
  weekday_business_hours: included
  weekday_off_hours: stipend per shift
  weekend_or_holiday: 1.5x stipend, plus comp day if paged
page_budget_per_shift: 2          # >2 triggers operational review
handoff:
  meeting: required, 30 min, video on
  artifact: handoff doc with open incidents + watch-items
post_incident_recovery:
  rules:
    - any page between 22:00 and 06:00 grants a half-day off
    - three pages in one shift triggers immediate swap
```

Follow-the-sun is the architectural answer to off-hours pages. Two or three regional teams (e.g., Dublin, San Francisco, Sydney) each cover their daylight hours and hand off at shift boundaries. The technical investment is significant (runbooks must be legible to engineers who did not write the code, and handoff tooling must capture state without ambiguity), but the result is that nobody is paged at 3am on a Tuesday. Companies that have published on this pattern (Stripe, Shopify, GitLab) all emphasize that follow-the-sun fails without ruthless runbook hygiene; without it, the next region just escalates back to the engineer who built the system, which defeats the purpose.

Pager-fatigue countermeasures fall into three buckets:

- Tuning: every page that did not require human action within the first five minutes is a defect; file a ticket to delete or auto-remediate it.
- SLO-based alerting: replace static threshold alerts with multi-window burn-rate alerts on error budget (see the rule sketch after this list) so noise drops by an order of magnitude.
- Cultural: track 'pages per engineer per week' as a first-class metric on the SRE dashboard, surface it in staff reviews, and treat sustained elevation as a planning emergency, not a heroism opportunity.
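The SLO-based alerting bucket usually means the multi-window, multi-burn-rate pattern from the Google SRE Workbook: page only when both a long and a short window show the same elevated burn rate. A minimal Prometheus rule sketch for a 99.9% availability SLO, assuming a request metric named http_requests_total with a job="checkout" label (metric names and thresholds are illustrative):

```yaml
# burn-rate-rules.yaml - Prometheus alerting rule sketch; metric names are assumptions
groups:
  - name: checkout-slo-burn
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Page only when BOTH the 1h and 5m windows show a burn rate above 14.4x
        # on a 99.9% SLO, i.e. roughly 2% of a 30-day error budget gone in one hour.
        expr: |
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
              / sum(rate(http_requests_total{job="checkout"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="checkout"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: "Checkout is burning error budget at >14x the sustainable rate"
```

A companion slow-burn rule (for example, 6x burn over 6h and 30m windows, routed to a ticket queue instead of a page) catches gradual degradation; together the pair replaces most static threshold alerts.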
Tooling: PagerDuty / incident.io / status-page mechanics
The tooling layer for incident management has matured into three reasonably interoperable categories: paging (PagerDuty, Opsgenie, VictorOps), incident response platforms (incident.io, FireHydrant, Rootly), and status pages (Statuspage.io, Status.io, self-hosted Cachet). A working stack uses one from each, wired together so a single page creates an incident channel, an incident.io record, and an automatic status-page draft.

PagerDuty and Opsgenie are mature and feature-equivalent for most teams. The choice usually comes down to existing ecosystem integrations (Atlassian shops lean Opsgenie; broader SaaS environments lean PagerDuty). What matters more than the vendor is escalation policy hygiene. Every service should have a primary on-call, a secondary that gets paged after fifteen minutes of no acknowledgment, and a manager escalation after thirty. Schedules should be generated from version-controlled rotation policies, not edited by hand in the UI, because hand-edited schedules drift and produce midnight pages to people on vacation (a declarative sketch of such a policy closes this section).

incident.io and similar response platforms do work that used to live in a wiki and a spreadsheet: spin up a Slack channel with the right people, post a structured incident summary, track severity changes, capture the timeline automatically, and generate a postmortem template at resolution. The engineering blogs at incident.io document the patterns most teams converge on, including tiered severity declaration flows, automated stakeholder summaries, and integration between incident state and status-page updates. The value is less in any single feature than in removing the cognitive tax of remembering 'what do I do next' during a Sev0.

Status-page communications are a craft of their own. Atlassian's Statuspage best-practice guidance and the incident.io blog converge on a handful of rules:

- Post within fifteen minutes of detection for any Sev0 or Sev1, even if all you have is 'we are investigating elevated error rates on checkout.'
- Update at a documented cadence (every 30 minutes for Sev0, every 60 for Sev1) even when the update is 'still investigating', because silence reads as incompetence.
- Use plain language: no internal jargon, no blame, no speculation about cause until you are sure.
- Resolve only when the SLI has recovered, not when the rollback completes, because users care about their experience, not your deploy state.
- Always post a short retrospective summary 24-72 hours later, ideally with a link to the public-facing portion of the postmortem when the culture supports it (Cloudflare and GitLab are the gold standard here).

The tooling is necessary but not sufficient. Every team that buys incident.io and PagerDuty and still has chaotic incidents has the same diagnosis: the runbooks are stale, the severity definitions live only in someone's head, and the postmortem action items never close. Tools accelerate a working process. They do not create one.
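One concrete way to keep the process in version control rather than in someone's head or the vendor UI is to treat the escalation policy itself as a reviewed artifact that CI pushes to the paging vendor's API. A vendor-neutral sketch of the primary/secondary/manager policy described above; the schema is illustrative, and PagerDuty and Opsgenie each have their own APIs and Terraform providers for the real thing:

```yaml
# escalation-policy.yaml - illustrative source of truth, rendered to the vendor API by CI
service: checkout
escalation_policy:
  - level: 1
    notify: schedule/sre-platform-primary
    escalate_after: 15m           # unacknowledged page moves to the secondary
  - level: 2
    notify: schedule/sre-platform-secondary
    escalate_after: 15m           # 30 minutes total before management is pulled in
  - level: 3
    notify: user/engineering-manager-oncall
```

The payoff is that a pull-request review catches the vacation gap or stale phone number that a midnight UI edit never would.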
About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.