DevOps / SRE Engineer Hub

SRE at Google: Levels, Comp, Hiring Committee, and the Borg + Spanner Stack (2026)

In short

Google is the birthplace of Site Reliability Engineering. Ben Treynor Sloss founded the discipline in 2003 and coined the term; the Google SRE Book and SRE Workbook (sre.google) are the canonical industry references. Google SREs run production for Search, Ads, YouTube, Gmail, Cloud, and Maps on Borg (the cluster manager that became Kubernetes) using Spanner, Stubby (Google's predecessor to gRPC), Mondrian for code review, and an internal observability stack older than most public alternatives. Levels run L3 (junior) through L9+ (Fellow); senior L5 SRE total comp lands $360,000-$510,000 and staff L6 $510,000-$760,000 per levels.fyi 2026. The interview is the FAANG-standard algorithmic loop plus a non-abstract production-troubleshooting round, finalized by Google's distinctive hiring committee.

Key takeaways

  • Google is the birthplace of SRE. Ben Treynor Sloss founded the discipline in 2003; the term 'Site Reliability Engineering' was coined inside Google before it existed anywhere else. The Google SRE Book (sre.google/sre-book/table-of-contents) and SRE Workbook (sre.google/workbook/table-of-contents) are the canonical industry references; both are free.
  • The error budget framework - the contract that converts an unmet SLO into a feature freeze - originated at Google and is documented in Chapter 3 of the SRE Book. Every modern SRE org's reliability vocabulary (SLI, SLO, error budget, toil) traces directly to this Google text.
  • Levels at Google for SRE: L3 (junior) -> L4 (mid) -> L5 (senior) -> L6 (staff) -> L7 (principal / Senior Staff) -> L8 (Distinguished) -> L9+ (Fellow). Total comp at L5 lands $360K-$510K, L6 lands $510K-$760K, L7 lands $700K-$1.3M+ per levels.fyi 2026 (levels.fyi/companies/google/salaries/site-reliability-engineer).
  • The production stack is Google-internal: Borg (the cluster manager), Spanner (globally consistent SQL), Stubby (the internal RPC framework that became open-source gRPC), Bigtable, Colossus (the GFS successor), Monarch (planet-scale monitoring), Mondrian (code review), Dapper (distributed tracing). Most of these have public papers on research.google.
  • Kubernetes is Google's open-source descendant of Borg. The Borg paper (research.google/pubs/large-scale-cluster-management-at-google-with-borg) and the published Borg-to-Omega-to-Kubernetes lineage document the descent; SREs who joined post-2014 increasingly work in Kubernetes for Google Cloud surfaces, while internal product SREs still run on Borg directly.
  • The interview process ends in hiring committee - a calibration body of senior Google engineers (not the interviewers themselves) who decide hire/no-hire from the candidate packet. This adds 1-3 weeks to the timeline vs Meta but produces more consistent leveling.
  • Honest scope note: SRE roles at Google are not monolithic. Product SRE (embedded with Search, Ads, YouTube, Maps, Gmail teams) is the largest population. SWE-SRE (production-engineering software work, building the platform tools) and Cloud SRE (Google Cloud customer-facing reliability) are distinct sub-tracks with their own hiring norms.

SRE at Google in 2026: birthplace + canonical reference

Google is where Site Reliability Engineering was born. Ben Treynor Sloss joined Google in 2003 as VP of Engineering and was asked to run a small "production team." He inverted the standard sysadmin model: instead of hiring operators to keep services up, he hired software engineers and gave them an explicit mandate to spend at least 50% of their time writing code, with operational work capped at the remaining half. He coined the term "Site Reliability Engineering" to describe the resulting discipline. Twenty-plus years later, every reliability organization at every other major tech company traces its vocabulary, its rituals, and its promotion ladders directly back to this decision.

The two canonical Google publications are Site Reliability Engineering (2016) and The Site Reliability Workbook (2018). Both are free to read in full at sre.google. Together they define the modern SRE lexicon: Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, toil, blameless post-mortems, the four golden signals (latency, traffic, errors, saturation), the production readiness review, the on-call rotation structure, and the distinction between aspirational and agreed reliability targets. Most reliability conversations outside Google in 2026 are downstream paraphrases of these chapters; an SRE candidate who has not read them is at a meaningful interview disadvantage because every behavioral and judgment question is implicitly scored against this vocabulary.

The error-budget framework is the most influential single chapter. The premise: a 99.9% SLO is also permission to be unavailable for 0.1% of the time. That 0.1% is a budget. When the budget is healthy, the team ships features aggressively. When the budget burns out, the team freezes feature work and invests purely in reliability until the budget recovers. This converts the eternal product-versus-reliability argument into a number with a feedback loop. Every modern reliability org has imported some version of this contract, though few have replicated the structural authority Google grants its SREs to enforce it.
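
The arithmetic behind this contract is small enough to sketch. Below is a minimal Python illustration of the budget math described in Chapter 3 of the SRE Book; the 99.9% target, 28-day window, and request counts are illustrative numbers, not figures for any specific Google service.

```python
# Error-budget arithmetic for a request-based availability SLI, following the
# SRE Book's framing. The SLO target, window, and counts below are illustrative.

SLO = 0.999                      # availability target
WINDOW_DAYS = 28                 # rolling evaluation window

budget_fraction = 1 - SLO        # 0.1% of requests (or minutes) may be bad
allowed_bad_minutes = WINDOW_DAYS * 24 * 60 * budget_fraction
print(f"Allowed bad minutes per {WINDOW_DAYS}-day window: {allowed_bad_minutes:.1f}")  # ~40.3

def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = total_requests * budget_fraction
    return (budget - failed_requests) / budget

# When this goes negative, the error-budget policy says feature launches freeze
# and engineering effort shifts to reliability until the budget recovers.
print(budget_remaining(total_requests=50_000_000, failed_requests=30_000))  # 0.4
```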

Inside Google, SRE is organized as a peer engineering function to SWE, with its own promotion ladder, its own performance calibration, and its own VP-level leadership. SREs are embedded with product teams (Search SRE, Ads SRE, YouTube SRE, Gmail SRE, Maps SRE, Google Cloud SRE) but report up an SRE chain. The structural separation is load-bearing: it is what allows the on-call team to push back on a launch that would consume the error budget without the politics of being on the launching team itself. SREs can also formally hand pages back to the developing team if a service exceeds its toil threshold - a contractual exit valve that reinforces the 50% engineering-time floor and is unique among major reliability organizations.

Interview process + hiring committee

The Google SRE interview process, per public Hello Interview reports, levels.fyi candidate writeups, and Google's careers page (google.com/about/careers/applications), runs as follows:

  1. Recruiter screen. 30 minutes. Logistics, motivation, level calibration, location.
  2. Technical phone screen. 45-60 minutes. Algorithmic coding in any language. The Google coding bar is among the highest at FAANG, even for SRE roles - LeetCode-medium-to-hard problems with multiple-solution exploration. Frontend or systems specialty does not exempt you from this round.
  3. Onsite - 2 coding rounds. 45 minutes each. More algorithmic coding. The bar is solving the problem, articulating complexity trade-offs, and discussing alternative solutions out loud.
  4. Onsite - troubleshooting / non-abstract design. 60 minutes. Distinctive to Google SRE. The interviewer presents a production scenario - a memory leak, a latency regression after a config push, a cross-region outage with conflicting metrics. You troubleshoot live, narrating tools and reasoning. The Workbook chapter on non-abstract large system design is the closest public reference.
  5. Onsite - system design. 45-60 minutes for mid+ candidates. Distributed-systems-leaning: design a global rate limiter, design the storage tier for a write-heavy service, design the rollout strategy for a config change across 200 datacenters. (A minimal rate-limiter sketch follows this list.)
  6. Onsite - behavioral / Googleyness. 45 minutes. Past incidents you ran, conflict with senior engineers, how you communicate during an outage.
  7. Hiring committee. Distinctive to Google. After onsites, the candidate packet (feedback, code samples, behavioral notes) is reviewed by a hiring committee of Google senior engineers who did not interview the candidate. The committee decides hire/no-hire from the packet. This adds 1-3 weeks to the timeline relative to Meta but produces more consistent leveling.
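
For the global rate-limiter prompt in round 5, a useful starting point in the room is a single-node token bucket. The Python sketch below is minimal and illustrative (all names and parameters are assumptions, not Google's implementation); the "global" part of the question is about how you distribute and reconcile this state across regions, not about the bucket itself.

```python
import time

class TokenBucket:
    """Single-node token bucket: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Illustrative use: 100 QPS steady-state with bursts of up to 200 requests.
# A global design layers distribution on top: per-region buckets carved from a
# global quota, reconciled asynchronously, with a defined failure mode when the
# quota service is unreachable.
limiter = TokenBucket(rate=100, capacity=200)
if not limiter.allow():
    pass  # shed load / return a "resource exhausted" style error
```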

What is NOT typically tested: framework-specific knowledge, certifications, AWS/Azure depth (Google uses internal infrastructure plus GCP), or recall of specific runbooks. The bar is algorithmic depth + production troubleshooting judgment + Google-style code-review fluency. Candidates who try to substitute reliability-engineering vocabulary for actual coding fluency at L3-L5 routinely under-perform; the troubleshooting and design rounds are how you demonstrate SRE depth, not the coding rounds.

The end-to-end timeline is typically 6-10 weeks. Recruiter screens often happen the same week you apply at well-resourced teams; the gap that stretches the timeline is the wait for the hiring committee, which meets weekly to bi-weekly per region. Plan for two to three weeks between final onsite and verbal offer, then one to two weeks for a written offer with full numbers. Competing offers can compress the back half of this timeline materially - Google recruiters will accelerate a packet through committee when there is documented competition.

Compensation by level

Total comp for SRE at Google by level (US, per levels.fyi 2026 - levels.fyi/companies/google/salaries/site-reliability-engineer):

Level | Base | Total comp
L3 (junior) | $140K-$190K | $200K-$280K
L4 (mid) | $170K-$220K | $280K-$390K
L5 (senior) | $200K-$260K | $360K-$510K
L6 (staff) | $240K-$300K | $510K-$760K
L7 (principal / senior staff) | $280K-$360K | $700K-$1.3M+
L8 (Distinguished) | $320K-$440K | $1.0M-$2.0M+

Two notes. First, SRE compensation at Google is essentially identical to SWE compensation at the same level. Treynor's 2003 mandate that SREs be hired at the SWE bar with SWE pay is structurally enforced in 2026 - there is no SRE pay penalty. Second, on-call carries an additional flat per-shift compensation on top of base + RSU + bonus, documented internally and confirmed by multiple public levels.fyi self-reports; the amount varies by team and rotation intensity but typically adds $5K-$25K annually for SREs on primary rotations.

Promotion timelines at Google run longer than at Meta - L4-to-L5 averages 2-3 years versus Meta's E4-to-E5 at 1.5-2. The hiring-committee dynamic that calibrates initial leveling also calibrates promotion, with similar timeline costs. RSU refresh grants at L5+ are substantial and meaningfully shift years 2-4 total comp above the year-one offer number; ask the recruiter for the projected four-year curve.

Tech stack: Borg + Kubernetes + Spanner + Stubby + Mondrian production tooling

Google's production stack is almost entirely internal. Public papers on research.google document most major systems; SREs work with these directly.

  • Borg. The cluster manager that schedules every Google production workload. The Borg paper (research.google/pubs/large-scale-cluster-management-at-google-with-borg) is the canonical reference. Borg is the direct intellectual ancestor of Kubernetes - Joe Beda, Brendan Burns, and Craig McLuckie open-sourced Kubernetes in 2014 explicitly as "Borg lessons learned." Internal product SREs still run on Borg in 2026; Google Cloud SREs work in Kubernetes on GKE.
  • Kubernetes / GKE. Google's public-facing Borg descendant. SREs on the Cloud SRE track operate GKE control planes and customer clusters. The Kubernetes project is stewarded by Google plus the broader CNCF community.
  • Spanner. Globally distributed, externally consistent SQL database. Powers Ads, parts of Search, and many Google Cloud services. The Spanner paper (research.google) documents the TrueTime API and the Paxos-based consensus that makes external consistency possible across continents.
  • Bigtable + Colossus. Bigtable is the wide-column NoSQL store; Colossus is the GFS successor and the planet-scale filesystem. Both have public research papers.
  • Stubby / gRPC. Stubby is Google's internal RPC framework, the predecessor and design template for the open-source gRPC project. SREs read Stubby traces, set Stubby deadlines, and tune Stubby load balancing daily (a gRPC deadline sketch follows this list).
  • Monarch + Borgmon. Monarch is Google's planet-scale monitoring system; Borgmon was its predecessor and the design template for open-source Prometheus. The Monarch paper documents the architecture; the SRE Book's monitoring chapters describe the operating model.
  • Dapper. Distributed tracing system. The Dapper paper inspired Zipkin, Jaeger, and OpenTelemetry tracing.
  • Mondrian + Critique. Internal code-review tooling; Critique is Mondrian's successor. Google's code-review practice (google.github.io/eng-practices/review) is widely cited as a high-standard model; SREs review and are reviewed in this system constantly.
  • Blaze / Bazel. The build system. Bazel is the open-source descendant; Blaze is the internal version. SREs ship binaries built by Blaze and ship config changes through the same review pipeline.
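
Stubby itself is internal, but the deadline discipline it enforces can be illustrated with gRPC, its open-source counterpart. The Python sketch below is hedged: the grpc.* calls are real library APIs, while the Payments service, its generated modules (payments_pb2, payments_pb2_grpc), and the target address are hypothetical placeholders, not anything from Google's codebase.

```python
import grpc

# Hypothetical generated code -- assumes a Payments service defined in a .proto
# file not shown here. Only the grpc.* calls are real library APIs.
from payments_pb2 import ChargeRequest        # hypothetical message type
from payments_pb2_grpc import PaymentsStub    # hypothetical client stub

channel = grpc.insecure_channel("payments.internal.example:50051")
stub = PaymentsStub(channel)

try:
    # A 300 ms deadline bounds the whole call; gRPC transmits it with the
    # request so the server can see how much budget remains -- the same
    # discipline Stubby deadlines enforce against runaway tail latency.
    reply = stub.Charge(ChargeRequest(amount_cents=500), timeout=0.3)
except grpc.RpcError as err:
    if err.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # Fail fast and let the caller's retry policy and error budget decide.
        raise
```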

What is NOT in the stack: Terraform (Google has its own internal infrastructure-as-code), Ansible/Chef/Puppet (Borg replaces them), AWS or Azure-specific tooling, or the public Prometheus/Grafana stack for internal workloads (those do appear on GCP customer-facing surfaces). SRE candidates often ask whether external infrastructure-as-code experience translates; the honest answer is that the conceptual mental models translate fully, while the specific tool muscle-memory does not.

Frequently asked questions

Do I need to read the Google SRE Book before interviewing?
Yes - or at least the first six chapters. The hiring loop will not quiz you on chapter contents, but the vocabulary (SLI, SLO, error budget, toil, blameless post-mortem, four golden signals) is assumed. The book is free at sre.google/sre-book/table-of-contents. The Workbook is the practical companion; chapter 4 on SLO engineering and chapter 5 on alerting are particularly useful pre-interview reading.
Is SRE at Google compensated the same as SWE?
Yes. Ben Treynor's founding mandate was that SREs be hired at the SWE bar with SWE compensation. Levels.fyi self-reports in 2026 confirm parity at every level. On-call rotations carry additional flat compensation on top of base + RSU + bonus, typically adding $5K-$25K annually for SREs on primary rotations.
What is the difference between product SRE, SWE-SRE, and Cloud SRE at Google?
Product SRE is embedded with a product team (Search, Ads, YouTube, Maps, Gmail) and runs that product's reliability. SWE-SRE is a software-heavy track building the production-engineering platform itself - Borg subsystems, monitoring infrastructure, deploy pipelines. Cloud SRE is customer-facing, operating Google Cloud Platform reliability for external paying customers. All three have the same compensation ladder; the day-to-day work and on-call texture differ.
Does Google use Kubernetes internally?
Mostly no for internal product workloads, mostly yes for Google Cloud customer surfaces. Internal Google products (Search, Ads, Gmail, YouTube) run on Borg directly. Google Cloud's GKE service is Kubernetes; Cloud SREs work in Kubernetes daily. Kubernetes was open-sourced in 2014 as 'Borg lessons learned' - the conceptual lineage is direct, but the internal Borg has continued to evolve in parallel.
How long does the Google interview process take end-to-end?
Typically 6-10 weeks from recruiter screen to offer. The hiring-committee step adds 1-3 weeks beyond what Meta or Amazon's pipelines run. Offer windows after committee approval are usually 1-2 weeks; competing offers can extend them.
Is the algorithmic-coding bar really the same for SRE as for SWE?
Yes. Coding rounds at Google are language-agnostic and difficulty-equivalent across SRE and SWE loops. Frontend or systems specialty is recognized at L5+ but does not exempt you from the algorithmic bar at L3-L5. LeetCode-medium-to-hard preparation is the universal need.
Can I work remotely as a Google SRE?
Limited. Google has implemented a return-to-office policy of 3 days/week minimum at hub locations for most engineering roles. Some SRE roles are remote-eligible per the careers page (google.com/about/careers/applications), but the dominant pattern is hub-based work at Mountain View, Sunnyvale, Seattle, NYC, Boulder, London, Zurich, or Bangalore. On-call rotations require timezone coverage and influence team-location placement.
What is the hiring-committee dynamic and how should I prepare for it?
After onsite, your packet (interviewer feedback, code samples, behavioral notes) is reviewed by a committee of Google senior engineers who did not interview you. The committee decides hire/no-hire from the packet alone. You cannot prepare for the committee directly, but you can prepare for the packet: write code clearly during interviews so the artifact reads well after the fact, articulate trade-offs explicitly so the interviewer's notes capture them, and answer behavioral questions in structured form (situation, action, result) so the writeup is dense.
What public Google research should I read before applying for an SRE role?
The Borg paper, the Spanner paper, the Dapper paper, the Monarch paper, and chapters 1-6 of the Google SRE Book. All are free at sre.google or research.google. Reading the Borg paper before a system-design round pays off in particular, because Google's mental model for cluster scheduling is the Borg model.

Sources

  1. Site Reliability Engineering (Google SRE Book) - canonical free reference for the discipline.
  2. The Site Reliability Workbook - SLO engineering, alerting, on-call, and non-abstract large system design.
  3. Google Research - public papers on Borg, Spanner, Dapper, Monarch, Bigtable, Colossus.
  4. Google Careers - official application portal and per-role hiring documentation.
  5. Google Engineering Practices - public code-review and developer-guide documentation.
  6. levels.fyi - Google SRE compensation by level (public-company RSU self-reports).
  7. levels.fyi - Software Engineer comp benchmark; confirms SRE/SWE compensation parity at Google.

About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.