SRE Engineer ATS Keywords That Pass Tech Screens (2026)
Site Reliability Engineering (SRE) hiring is a different keyword target than backend software engineering, platform engineering, or DevOps, and most resume advice conflates the four. Recruiters at companies that hire SREs at scale — Google, Stripe, Netflix, Cloudflare, Datadog, Honeycomb, Shopify, Anthropic — configure ATS searches around eight signal classes that don't show up on generic backend resumes: orchestration (Kubernetes / Helm / containers), infrastructure-as-code (Terraform / Ansible / Pulumi), cloud platforms (AWS / GCP / Azure with services named), observability (Prometheus / Grafana / Datadog / OpenTelemetry), CI/CD pipelines (GitHub Actions / GitLab CI / ArgoCD), systems programming (Go / Python / Rust / Bash), reliability practices (SLO / SLI / error budget / postmortem / chaos), and networking (BGP / DNS / TLS / mTLS / service mesh). A resume that reads like a backend engineer who happens to deploy code to production gets filtered out for SRE roles because the keyword density across those eight classes is too low [1][2]. This page lists the SRE keywords that pass screens in 2026, grouped by signal class, with worked rewrites and a counter-list of keywords that backfire when an SRE candidate leans on them.
Key Takeaways
- SRE resumes get scanned for eight signal classes — orchestration, IaC, cloud, observability, CI/CD, systems programming, reliability practices, and networking — and missing density across three or more of them is the most common reason senior backend engineers get filtered out of SRE searches [1][3].
- Named services beat generic platform names: "AWS EKS, ALB, CloudFront, RDS, S3, IAM" outperforms a single "AWS" mention because Greenhouse, Lever, and Ashby all weight the specific service names higher than the parent platform [2][4].
- The Google SRE Book (Beyer et al., O'Reilly 2016) and the Site Reliability Workbook (Beyer et al., O'Reilly 2018) are the canonical reliability-vocabulary source — SLO, SLI, error budget, toil, postmortem, blameless culture, and on-call rotation are Tier-1 SRE keywords precisely because they originate there and recruiters scan for that vocabulary [5][6].
- "Production on-call" is a Tier-1 SRE keyword that distinguishes SRE resumes from backend-engineer resumes; recruiters at top-tier infrastructure companies filter explicitly for evidence of pager-carrying experience [3][7].
- Kubernetes specificity matters: "Kubernetes" alone is too generic for senior SRE searches; Tier-1 specificity is naming the exact components ("operators, CRDs, admission webhooks, HPA, VPA, network policies, Helm 3, Kustomize, ArgoCD") because senior screens filter on depth, not just presence [8].
- BLS does not have a dedicated Site Reliability Engineer occupation code; the closest proxies are SOC 15-1244 Network and Computer Systems Administrators (median annual wage $96,800 in May 2024) and SOC 15-1252 Software Developers ($133,080 in May 2024) [9][10]. Both undercount SRE comp at top-tier tech companies — levels.fyi tracks SRE / Production Engineer / Reliability Engineer comp separately and consistently above both BLS proxies [11].
- "Just AWS Console" or "manual deploy" are anti-keywords; surfacing them on an SRE resume signals lack of automation maturity and gets the resume deprioritized regardless of other content [3][7].
How SRE ATS Screens Work
SRE hiring runs through the same ATS engines as software engineering hiring — Greenhouse, Lever, Workday, Ashby, SmartRecruiters, iCIMS — but the keyword matrix is wider and more layered. A backend engineer might hit 60% of expected keywords with a strong Java + Kafka + PostgreSQL surface; an SRE has to hit signal across orchestration, IaC, cloud, observability, CI/CD, programming, reliability, and networking. Density matters more than depth in any single area, and named tooling matters more than category names [2][4].
Engine-specific behavior for SRE hiring:
Greenhouse (used at Stripe, Notion, Shopify, Robinhood, and most Series-B-and-up startups) supports semantic matching, so "managed production Kubernetes clusters" registers as related to "operated EKS in production" or "ran k8s in prod" [2]. Greenhouse weights experience-bullet keywords more heavily than skills-section keywords for SRE roles — bullets carry the load. The recruiter UI supports filters like "Kubernetes within last 2 years," which return only candidates whose recent roles used k8s — the most common SRE pre-screen [2].
Lever (used at Eventbrite, parts of Lyft, Cruise) emphasizes recency for infrastructure roles. Lever recruiters routinely filter on "production on-call within last 2 years" — a candidate who carried a pager 3 years ago and has been on a non-on-call team since then needs to surface the recent reliability work explicitly, even if it isn't formally on-call [2].
Workday (used at Disney, Salesforce, Adobe, large-enterprise SRE hires) is the strictest exact-match parser. Workday filters often require the literal phrase "Site Reliability Engineer," "Production Engineer," "DevOps Engineer," or "Platform Engineer" in the title block — a candidate titled "Software Engineer" who has been operating as an SRE for two years gets filtered out unless the resume explicitly clarifies the de facto SRE work [12]. Fix: write the company entry as "Software Engineer (SRE / Production Reliability Pod — primary on-call)" or similar.
Ashby (used at Notion, Linear, Ramp, Anthropic, and most modern AI-era startups) is the friendliest ATS for nuanced SRE resumes because its LLM-based scoring reads bullets and infers level from context. A bullet that describes "owned the SLO definition for the platform-data API, drove the error-budget policy with the product team, and reduced p99 latency from 850ms to 220ms across 3 quarters by introducing a connection-pool replacement" registers as senior SRE signal even if the title is ambiguous [13]. Ashby is where SWE-to-SRE transitions get the fairest read.
SmartRecruiters (Visa, Atlassian) and iCIMS (Capital One, Disney non-engineering) lean stricter and more exact-match. Both score the title block heavily for SRE searches, and both penalize creative titles ("Reliability Engineer," "Production Software Engineer," "Infrastructure Engineer") for not matching the canonical SRE strings. Taleo (legacy enterprise, Oracle) is the oldest and the strictest; for SRE Taleo searches, write defensively with explicit phrases like "site reliability," "production on-call," "incident response," "service-level objective" [12].
Tier 1 — Orchestration and Containers
These are the non-negotiables for nearly every modern SRE posting in 2026, across LinkedIn, Built In, and direct careers pages at Stripe, Cloudflare, Datadog, Notion, Linear, and Anthropic [3][7][8].
Kubernetes — Always include the literal word "Kubernetes" plus the abbreviation "k8s" once each, because both forms appear in JDs. Senior screens want named components: "operators," "CRDs (custom resource definitions)," "admission webhooks," "HPA (horizontal pod autoscaler)," "VPA," "network policies," "PodDisruptionBudgets," "StatefulSets," "DaemonSets," "Service / Ingress / Gateway API." Patterns: "operated 14-cluster Kubernetes fleet across 3 regions," "wrote 4 custom operators using kubebuilder for the platform-data team's reconciliation loops" [8].
Helm / Kustomize — Manifest tooling. Patterns: "shipped 40+ services via Helm 3 charts with strict-mode validation," "migrated platform manifests from Helm to Kustomize overlays for environment parity." Naming the version (Helm 3, not just Helm) is a Tier-1 specificity move [8].
Docker / containerd / OCI — Container runtime keywords. Patterns: "built minimal distroless images cutting attack surface and runtime size by 60%," "moved CI builds from Docker-in-Docker to BuildKit for 4x faster pipelines," "operated containerd directly after Kubernetes' dockershim removal." Senior SRE resumes name the runtime, not just "Docker" [8].
Service mesh — Istio / Linkerd / Cilium — Layer-7 networking keywords. Patterns: "operated Istio across 14-cluster fleet with mTLS enforcement and per-route retry policies," "migrated from Istio to Linkerd for operational simplicity at the platform-data team's scale," "deployed Cilium with eBPF-based network policies replacing iptables-based kube-proxy" [8].
ArgoCD / Flux — GitOps deploy controllers. Patterns: "introduced ArgoCD for the platform-services team, replacing 14 hand-rolled deploy scripts with declarative GitOps across 3 environments," "ran Flux v2 with image-update automation for 22 microservices."
Tier 1 — Infrastructure as Code
IaC fluency is a non-negotiable Tier-1 SRE signal. The expectation in 2026 is that infrastructure changes go through code review, plan/apply, and pull-request workflow — not console clicks [3][7].
Terraform — Default IaC tool for SRE roles. Patterns: "owned the Terraform monorepo (180+ modules) for the platform-infrastructure team," "introduced Terragrunt to reduce environment-stamp duplication across 4 AWS accounts," "managed Terraform Cloud workspaces with policy-as-code via Sentinel." HashiCorp registry references and named providers (aws, google, kubernetes, helm) read as senior signal.
Pulumi — IaC alternative gaining traction at modern AI-era infra teams. Patterns: "migrated platform infrastructure from CloudFormation to Pulumi (TypeScript) for cross-cloud portability," "wrote Pulumi components for the standard service-deployment pattern adopted across 6 product teams."
CloudFormation / CDK — AWS-native IaC. CDK (TypeScript / Python) is the modern preference. Pattern: "shipped 22 services via AWS CDK in TypeScript with synthesized CloudFormation reviewed in PRs."
Ansible / Chef / Puppet — Configuration management. Ansible is current; Chef and Puppet are legacy signal but still scanned at older enterprises. Patterns: "operated Ansible playbooks for 220-host bare-metal inventory," "managed legacy Chef cookbooks while migrating workloads onto Kubernetes."
Packer — AMI / image building. Patterns: "owned Packer pipelines producing weekly hardened base AMIs across the AWS estate."
policy-as-code — OPA / Conftest / Sentinel — Senior IaC signal. Patterns: "shipped OPA Gatekeeper policies enforcing image-registry allowlists across 14 clusters," "introduced Conftest in CI to gate Terraform plans before apply."
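Real Conftest and Gatekeeper policies are written in Rego; as a language-neutral sketch of the same gate logic — parse the JSON that `terraform show -json` emits for a plan and fail CI on violations — here is a hypothetical tag-enforcement check (the policy, resource names, and plan snippet are illustrative, not from any real pipeline):

```python
# Hypothetical CI gate in the spirit of a Conftest/OPA policy: reject any
# resource a Terraform plan would create without tags. Expects the JSON
# shape emitted by `terraform show -json plan.out` ("resource_changes").

def untagged_creations(plan: dict) -> list[str]:
    violations = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if "create" not in change.get("actions", []):
            continue  # only gate newly created resources
        after = change.get("after") or {}
        if not after.get("tags"):
            violations.append(rc.get("address", "<unknown>"))
    return violations

if __name__ == "__main__":
    sample_plan = {
        "resource_changes": [
            {"address": "aws_s3_bucket.logs",
             "change": {"actions": ["create"], "after": {"tags": None}}},
            {"address": "aws_s3_bucket.data",
             "change": {"actions": ["create"],
                        "after": {"tags": {"team": "platform"}}}},
        ]
    }
    print(untagged_creations(sample_plan))  # → ['aws_s3_bucket.logs']
```

In a real pipeline this sits between `terraform plan` and `terraform apply`, with a non-zero exit on any violation — the same place a Conftest or Sentinel gate would run.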
Tier 1 — Cloud Platforms (with Service Specificity)
Naming the cloud isn't enough; senior SRE screens want service specificity. The pattern is platform + 4–6 named services per platform.
AWS — Services worth naming explicitly: EKS, EC2, ECS / Fargate, ALB / NLB / API Gateway, CloudFront, Route 53, S3, RDS / Aurora, DynamoDB, ElastiCache, Lambda, SQS / SNS / EventBridge, IAM (and IAM Identity Center), VPC / Transit Gateway / PrivateLink, CloudWatch, KMS, Secrets Manager, Systems Manager. Pattern: "operated multi-region AWS estate spanning EKS, RDS Aurora, S3, ALB, Route 53, CloudFront, IAM Identity Center, and Transit Gateway across 4 accounts under AWS Organizations and Control Tower."
GCP — Services: GKE, Compute Engine, Cloud Load Balancing, Cloud DNS, Cloud Storage, Cloud SQL, Spanner, BigQuery, Pub/Sub, IAM, VPC, Cloud Armor, Cloud CDN, Cloud KMS, Secret Manager. Pattern: "ran GKE Autopilot across 3 GCP projects with Cloud Armor edge protection, Cloud Load Balancing, Spanner for transactional state, and Pub/Sub for event fan-out."
Azure — Services: AKS, App Service, Front Door, DNS, Blob Storage, Cosmos DB, Azure SQL, Service Bus, Azure AD / Entra ID, Virtual Network, Azure Firewall, Application Gateway, Key Vault, Azure Monitor. Pattern: "operated AKS across 6 subscriptions with Azure Front Door, Application Gateway WAF, Cosmos DB, and Service Bus for event-driven workloads."
Cloudflare — Edge platform. Patterns: "deployed 14 Cloudflare Workers for edge auth and rate-limit logic," "operated Cloudflare Zero Trust (Access + Tunnel + Gateway) for the platform-team SSH and internal-services posture."
Multi-cloud — Senior signal when real. Pattern: "operated active-active multi-cloud across AWS and GCP for the platform-data team using Terraform-driven cross-cloud DNS failover and per-cloud regional deploys." Don't claim multi-cloud experience for what was actually a single-cloud-with-disaster-recovery workload.
Tier 1 — Observability and Telemetry
Observability fluency is Tier-1 for SRE roles and weighted heavily by recruiters at telemetry-aware companies (Datadog, Honeycomb, Grafana, New Relic) and by every infra team using SLO-driven on-call [5][14][15].
Prometheus + Grafana — Open-source observability stack. Pattern: "owned the Prometheus federation across 14 clusters with Thanos for long-term storage and global query, plus 60+ Grafana dashboards used by the platform team for daily operations." Named components (Thanos, Cortex, Mimir, Alertmanager) are senior signal.
Datadog — Commercial APM and observability. Pattern: "ran Datadog APM, logs, infrastructure, and synthetics across the platform-services org with SLO-driven monitor definitions and on-call routing via Datadog incident management."
Honeycomb — High-cardinality observability. Pattern: "introduced Honeycomb for the platform-data team to debug high-cardinality production-only issues, replacing 40+ legacy log dashboards with BubbleUp-driven exploration." Charity Majors's writing on observability is the canonical reference; naming "Honeycomb" plus "high-cardinality" reads as senior signal [15].
OpenTelemetry — Standard instrumentation. Pattern: "migrated 22 services from vendor-specific tracing SDKs to OpenTelemetry Collector with OTLP export to Honeycomb and Datadog," "owned the OTel collector pipeline with tail-sampling and cost-aware processors." OpenTelemetry fluency is a Tier-1 signal at modern infra teams [14].
Distributed tracing — Jaeger / Tempo / Zipkin — Tracing backends. Patterns: "ran self-hosted Tempo in the Grafana Stack for trace storage at 14B spans/day," "deployed Jaeger with Cassandra backend prior to the OTel migration."
Logging — Loki / Elasticsearch / OpenSearch / Splunk — Log aggregation. Patterns: "operated Loki with multi-tenant indexing across 22 teams," "migrated platform logs from Splunk to OpenSearch saving the org $1.4M annually" (only if the number is verifiable).
SLI / SLO / SLA — Reliability vocabulary [5][6]. Patterns: "defined and operated 22 SLOs for the platform-data API and user-services tier," "drove the SLO-driven on-call escalation policy reducing pager-driven interrupts by 40% (measured) across 2 quarters," "instrumented latency, availability, freshness, and correctness SLIs across the platform-services portfolio."
Error budgets — Reliability-policy keyword [5][6]. Patterns: "introduced error-budget policy with the product team for the platform-data API; halted feature deploys for 9 days in Q3 after budget exhaustion and shipped a connection-pool fix that brought availability back to SLO."
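Candidates who name error budgets should also be able to reproduce the arithmetic behind them in an interview. A minimal sketch of the standard math — a 99.9% availability SLO over a 30-day window leaves 0.1% of the window as budget — with illustrative numbers:

```python
# Error-budget arithmetic behind a bullet like "halted deploys after budget
# exhaustion": the budget is simply (1 - SLO target) of the window, counted
# either in minutes of downtime or in failed requests.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime minutes for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent."""
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad if allowed_bad else 0.0

print(round(error_budget_minutes(0.999), 1))  # → 43.2 (minutes per month)
print(round(budget_remaining(0.999, good=9_995_000, total=10_000_000), 2))  # → 0.5
```

The 43.2-minutes-per-month figure for "three nines" is the kind of number screeners expect an SRE to know cold.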
Tier 1 — CI/CD Pipelines
Pipeline fluency is Tier-1 SRE signal. Recruiters scan specifically for evidence the candidate has built pipelines, not just consumed them [3][7].
GitHub Actions — Default CI for modern teams. Patterns: "owned GitHub Actions runners (self-hosted on EKS) for 320 repos across the platform org," "wrote reusable workflows for the standard build-test-scan-deploy pattern adopted across 6 product teams."
GitLab CI — Common at companies on GitLab. Patterns: "operated GitLab Runner fleet with Kubernetes executors across 4 clusters."
Jenkins / Buildkite / CircleCI / Drone — Other CI engines. For legacy Jenkins shops, a bullet like "migrated 80 Jenkins pipelines to GitHub Actions" is a strong modernization signal.
ArgoCD / Flux — GitOps CD controllers (also listed in orchestration). Pattern: "drove the cut-over from imperative deploy scripts to ArgoCD GitOps for the platform org's 60+ services."
Spinnaker — Multi-cloud deploy controller, more common at Netflix-influenced shops. Pattern: "operated Spinnaker for 80+ services with manual judgment gates and automated canary analysis."
Container registry — ECR / Artifact Registry / Harbor / GitHub Packages — Patterns: "migrated platform-team images from Docker Hub to Harbor with image-signing via Cosign and SBOM attestation."
Supply chain — Cosign / SLSA / SBOM / Sigstore — Senior security-aware SRE signal. Patterns: "introduced Cosign signing in the build pipeline with verification at admission via Kyverno," "drove the platform's SLSA Level 3 compliance roadmap across 22 services."
Tier 1 — Systems Programming
SRE roles in 2026 expect production code, not just shell glue. The Tier-1 languages are Go, Python, Rust, and Bash, in roughly that order of demand [3][7].
Go — Default SRE / infrastructure language. Patterns: "wrote 4 Kubernetes operators in Go using controller-runtime," "owned the Go-based reconciliation loop for the platform's per-tenant config rollout (~40K lines)," "shipped Go HTTP services with chi router, sqlx, and otel-instrumented handlers."
Python — Tooling and operational scripts. Patterns: "wrote async Python tooling using asyncio and aiohttp for the platform's automated runbook system," "migrated legacy Python 2 operational scripts to Python 3.12 with type hints and pytest coverage." Python alone is too generic for senior SRE roles; pair with frameworks (FastAPI, Flask, asyncio) and a context that goes beyond shell scripting.
Rust — Growing as performance-critical SRE language. Patterns: "ported the platform's load-shedding sidecar from Go to Rust for predictable tail-latency under contention," "wrote a Rust-based BPF tool for the platform-team's syscall-tracing investigation."
Bash / shell — Always present. Specificity matters: "Bash with strict error handling (set -euo pipefail)" reads as senior; "Bash scripting" alone reads as junior. Modern SRE shell work uses jq, yq, ripgrep, fd, and just or Make as a task runner.
SQL — Database fluency. Patterns: "wrote and reviewed PostgreSQL schema migrations using Sqitch for the platform-data team," "tuned production query plans cutting p99 read latency from 280ms to 35ms across the user-services tier."
Tier 1 — Reliability Practices and Vocabulary
The SRE-specific vocabulary that separates SRE candidates from generic backend candidates. Sourced from the Google SRE Book and the SRE Workbook [5][6].
On-call / pager / rotation — Tier-1 SRE keyword. Patterns: "carried primary on-call for the platform-data team (1-week rotation, 6-engineer pool) for 24 months," "owned the on-call runbook portfolio and the quarterly runbook-review cadence," "introduced sustainable-on-call policy capping interrupts at <2 per shift via SLO-driven alerting."
Incident response / incident command — Process keyword [5][6]. Patterns: "drove 14 Sev-1 incidents as incident commander across 18 months," "owned the platform's incident-response framework adopted org-wide," "introduced ICS-style incident command (IC, scribe, comms) for the platform team."
Postmortem / blameless postmortem / RCA — Tier-1 SRE keyword [5]. Patterns: "authored 22 blameless postmortems with action-item ownership tracked through completion," "drove the org-wide postmortem template refresh adopted across 6 engineering teams."
Toil / toil reduction — Reliability-economics keyword [5]. Patterns: "automated 60% of the platform-team toil portfolio (measured by self-reported hours) over 4 quarters," "owned the quarterly toil-audit process surfacing 40+ automation candidates."
Chaos engineering / chaos testing — Disciplined-failure keyword. Patterns: "introduced quarterly GameDays with simulated regional failure across 3 product surfaces," "operated chaos experiments via Chaos Mesh in the staging fleet with automated rollback gates," "drove the org's first DiRT-style disaster-recovery test catching 3 latent dependencies on a single AZ" [16].
Capacity planning / load testing — Performance-reliability keyword. Patterns: "owned platform capacity-planning model spanning compute, network, and storage forecasts across 4 quarters," "built the platform's k6-driven load-testing harness with SLO-aware thresholds and CI integration."
DR / disaster recovery / BCP — Resilience keyword. Patterns: "owned RTO / RPO definitions for 14 platform services," "drove the multi-region failover drill cadence (semi-annual) with measured RTO improvements from 22 minutes to 4 minutes across 18 months."
Tier 1 — Networking and Security
Networking fluency is the weakest surface on most SRE resumes from candidates with application-heavy backgrounds. It's also a strong differentiator at companies that hire SREs for serious infrastructure work [7][17].
BGP / anycast — Edge-routing keywords. Patterns: "operated multi-region BGP anycast for the platform's edge load balancer," "managed BGP peering with 3 Tier-1 transit providers and 12 IXP sessions."
DNS — Resolution-layer keyword. Specificity matters: "Route 53 with health-check-driven failover and weighted records," "Cloud DNS with policy-based routing across 3 regions," "operated authoritative DNS with PowerDNS at 14B queries/month."
TLS / mTLS — Transport-security keywords. Patterns: "rolled out service-mesh-enforced mTLS across 220 services with automated certificate rotation via cert-manager," "operated public TLS posture (Let's Encrypt + ACM) with strict cipher allowlists and HSTS preloading."
HTTP/2 / HTTP/3 / QUIC — Modern-protocol signal. Patterns: "enabled HTTP/3 (QUIC) at the platform's edge fleet measuring 18% p95 latency reduction for mobile clients."
Load balancing — L4 / L7 / Envoy / NGINX / HAProxy — Patterns: "operated Envoy-based L7 ingress across 14 clusters with per-route retry, timeout, and outlier-detection policies," "ran NGINX Plus for the platform's edge tier with custom Lua-based auth modules."
VPC / subnet / firewall / security group — Network-security keywords. Patterns: "owned VPC topology across 4 AWS accounts with Transit Gateway hub-and-spoke and AWS Network Firewall east-west inspection."
WAF / DDoS — Edge-security keywords. Patterns: "operated Cloudflare WAF with custom rule sets for the platform's API surface," "drove DDoS-mitigation runbook adoption with AWS Shield Advanced and Cloudflare combined posture."
Zero Trust / SSO / SAML / OIDC — Modern-auth keywords. Patterns: "deployed Cloudflare Zero Trust (Access + Tunnel) replacing the org's legacy VPN for 320 employees," "operated Okta SAML federation for 60+ internal services."
Tier 2 — Databases and Data Plane
SRE roles touch databases differently than DBA roles do — the SRE owns operability, not schema design — but database vocabulary still matters [3][7].
PostgreSQL — Default OLTP database. Patterns: "operated 22-instance PostgreSQL fleet on RDS Aurora with logical-replication-driven cross-region read replicas," "drove the migration from PostgreSQL 13 to 16 across 14 services," "introduced pgBouncer connection pooling cutting per-instance connection overhead by 70%."
MySQL / Aurora MySQL / Vitess — Patterns: "operated Vitess (sharded MySQL) for the platform's high-write user-events table at 1.4B writes/day."
Redis / Memcached / ElastiCache — Caching layer. Patterns: "owned ElastiCache Redis cluster fleet with cluster-mode-enabled topology and TTL-driven eviction policies."
Kafka / Kinesis / Pub/Sub / NATS — Streaming. Patterns: "operated MSK (Managed Kafka) at 14B messages/day with consumer-lag SLOs and per-topic retention policies."
Spanner / Cassandra / DynamoDB / CockroachDB — Distributed databases. Patterns: "owned DynamoDB on-demand-vs-provisioned posture across 22 tables with per-table SLOs."
S3 / GCS / Azure Blob — Object storage. Specificity: lifecycle policies, replication, intelligent tiering, KMS encryption.
Counter-List — Keywords That Backfire on SRE Resumes
This is the part most SRE resume advice misses. SRE resumes can be sunk by anti-keywords that signal lack of automation maturity, lack of production exposure, or career framing problems [3][7].
"Manual deploy" / "deployed via console" — Anti-keyword. Even if true historically, surface this only as the "before" of a migration story ("Replaced manual console-driven deploys with Terraform-managed infrastructure across 22 services in 4 months"). Standalone, it screams pre-2018 SRE practice [3].
"No on-call experience" — Anti-keyword phrasing. If you don't have on-call experience, don't volunteer the absence on the resume. Surface adjacent reliability work — incident participation, runbook authoring, postmortem contributions, monitoring ownership — and let the on-call discussion happen in interview. Naming the absence on the resume gets the resume filtered out [3][7].
"Just AWS Console" / "ClickOps" — Anti-keyword. Senior SRE screens explicitly filter against candidates whose only cloud experience is console-driven [3]. Even when accurate, surface a learning trajectory: "Began AWS work in the console; transitioned the team to Terraform-managed infra over 6 months across 14 resources."
"No observability" / "Logged to stdout" — Anti-keyword phrasing. Don't surface absence of observability. If your past role didn't have observability tooling, surface the work you did to introduce it: "Stood up the team's first observability stack (Prometheus + Grafana + Loki) replacing stdout-based debugging across 6 services."
"Helped the team" / "Assisted with" — Anti-ownership verbs. SRE resumes need an explicit-ownership voice. Replace "Helped with the Kubernetes migration" with "Co-led the Kubernetes migration from Nomad across 22 services with two other SREs" — naming co-owners and specifics instead of diluting the verb.
"Familiar with" / "Exposure to" / "Worked alongside" — Distance-creating phrasings. SRE recruiters scan for ownership; "familiar with Kubernetes" reads as junior even from a 5-year backend engineer. Either name the work specifically or omit the surface.
"Full-stack engineer" on an SRE resume — Anti-keyword for SRE specifically. Senior SRE screens read "full-stack" as IC-product-engineer signal, not infrastructure signal. If your background is genuinely full-stack-with-infrastructure-focus, frame it as "backend engineer with production-reliability ownership" or "platform engineer building developer-facing infrastructure" — both are stronger SRE framings.
Buzzword stacking — "passionate about reliability" — Generic-passion phrasing. Reliability-passion claims read as filler on SRE resumes; specificity beats sentiment. Replace with named reliability work and metrics.
Long tools list (20+ items) — Backend-resume convention. SRE resumes that include a 30-item Tools section (every database, every monitoring tool, every cloud service in flat list) trigger spam-detection on Greenhouse and Ashby and read as resume-stuffing [2][13]. Group by category instead and put depth in experience bullets.
Worked Examples — SRE Keywords in Experience Bullets
Example 1 — Production on-call and incident response
Before (C-grade): Responded to alerts and helped during incidents.
After (A-grade): Carried primary on-call for the platform-data team (1-week rotation, 6-engineer pool) for 24 months — drove 14 Sev-1 incidents as incident commander across 18 months, authored 22 blameless postmortems, and introduced sustainable-on-call policy capping interrupts at <2 per shift via SLO-driven alerting.
Keywords hit: Primary on-call, rotation, incident commander, Sev-1, blameless postmortems, sustainable on-call, SLO, alerting.
Example 2 — Kubernetes operability
Before: Worked with Kubernetes for deploying services.
After: Operated 14-cluster Kubernetes fleet across 3 regions with Helm 3 charts, ArgoCD GitOps, and Cilium-based network policies — wrote 4 custom operators using kubebuilder for the platform-data team's reconciliation loops and reduced cluster-upgrade time from 6 hours to 45 minutes via blue-green node-pool rotation.
Keywords hit: Kubernetes, cluster fleet, regions, Helm 3, ArgoCD, GitOps, Cilium, network policies, operators, kubebuilder, node-pool rotation.
Example 3 — Observability and SLO ownership
Before: Set up monitoring for the team.
After: Defined and operated 22 SLOs across the platform-data API and user-services tier on Datadog and Honeycomb — drove the error-budget policy with the product team, halted feature deploys for 9 days in Q3 after budget exhaustion, and shipped a connection-pool replacement that brought availability back to SLO inside 4 weeks.
Keywords hit: SLO, platform-data, user-services, Datadog, Honeycomb, error-budget policy, feature deploys, availability.
Example 4 — IaC and automation
Before: Used Terraform to manage cloud resources.
After: Owned the Terraform monorepo (180+ modules, 4 AWS accounts) for the platform-infrastructure team — introduced Terragrunt for environment-stamp deduplication, shipped Sentinel policy-as-code gates for the production workspace, and drove the migration from CloudFormation to Terraform across 22 legacy stacks.
Keywords hit: Terraform monorepo, modules, AWS accounts, Terragrunt, Sentinel, policy-as-code, CloudFormation migration.
Example 5 — Performance and capacity
Before: Improved system performance.
After: Reduced p99 latency on the user-services tier from 850ms to 220ms across 3 quarters — introduced async connection pooling, replaced N+1 PostgreSQL access patterns with cached batch loaders, and drove the Honeycomb-based investigation that surfaced a DNS-resolver contention bug under high concurrent load.
Keywords hit: p99 latency, user-services, async, connection pooling, PostgreSQL, batch loaders, Honeycomb, DNS resolver, concurrency.
Example 6 — Toil reduction
Before: Automated some manual processes.
After: Automated 60% of the platform-team toil portfolio (measured by self-reported hours) over 4 quarters — built a Go-based runbook-automation framework with audit-trail logging, replaced 14 hand-rolled deploy scripts with ArgoCD GitOps, and introduced Terraform automation for the previously-manual onboarding of new product teams.
Keywords hit: Toil portfolio, Go, runbook automation, ArgoCD, GitOps, Terraform, onboarding.
Density and Placement Rules for SRE
- Professional Summary: Pack 8–10 Tier-1 SRE keywords across the eight signal classes. Example: "Senior SRE with 7 years operating production infrastructure — owned multi-region Kubernetes fleet, Terraform monorepo, and SLO-driven on-call for the platform-data team. Strengths: Go and Python, AWS (EKS, RDS, ALB), observability (Prometheus, Grafana, Datadog, Honeycomb), incident command, and chaos engineering."
- Skills section: Group by category, never flat. Recommended 6 categories, 24–36 items total: Orchestration (Kubernetes, Helm, Kustomize, ArgoCD, Istio), IaC (Terraform, Pulumi, Ansible, OPA, Packer), Cloud (AWS EKS/RDS/ALB/S3/IAM, GCP GKE/Cloud SQL, Cloudflare), Observability (Prometheus, Grafana, Datadog, Honeycomb, OpenTelemetry, Loki), CI/CD (GitHub Actions, ArgoCD, Cosign, SLSA), Programming (Go, Python, Rust, Bash, SQL).
- Experience bullets: Each recent bullet should pair an action verb with a quantified outcome. Aim for 1–2 Tier-1 SRE keywords per bullet, embedded naturally. Don't repeat the same keyword across more than 2–3 bullets.
- Pick depth over breadth on cloud: Strong AWS surface plus a credible GCP or Azure mention beats shallow surface across all three. Recruiters at AWS-shop companies prefer deep AWS over wide-cloud-with-no-depth.
- Surface on-call status explicitly: "Primary on-call (24-month tenure)," "Secondary on-call across 3 services," "Carried pager during major incidents" — name your relationship to production explicitly.
Density rule of thumb for SRE: Tier-1 orchestration / IaC / cloud / observability / CI-CD / programming / reliability / networking keywords should each appear 3–5 times across the resume. Total Tier-1 keyword surface: roughly 40–60 distinct terms across an SRE resume, embedded in bullets, not flattened in a Skills dump.
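As a rough self-check before submitting, the eight-class density rule can be approximated with a short script. The keyword lists below are small illustrative samples of each class, not the full Tier-1 surface, and plain substring matching will miss the semantic variants an ATS like Greenhouse or Ashby would catch:

```python
# Rough self-check for the eight-signal-class rule. Matching is naive
# substring search over lowercased text; lists are illustrative samples.

SIGNAL_CLASSES = {
    "orchestration": ["kubernetes", "k8s", "helm", "argocd", "istio"],
    "iac": ["terraform", "pulumi", "ansible", "cloudformation"],
    "cloud": ["eks", "gke", "aks", "cloudfront", "route 53"],
    "observability": ["prometheus", "grafana", "datadog", "opentelemetry"],
    "cicd": ["github actions", "gitlab ci", "jenkins", "spinnaker"],
    "programming": [" go ", "python", "rust", "bash"],
    "reliability": ["slo", "error budget", "postmortem", "on-call"],
    "networking": ["bgp", "dns", "mtls", "envoy", "service mesh"],
}

def class_coverage(resume_text: str) -> dict[str, int]:
    text = f" {resume_text.lower()} "  # pad so ' go ' can match at the edges
    return {cls: sum(kw in text for kw in kws)
            for cls, kws in SIGNAL_CLASSES.items()}

resume = ("Operated a Kubernetes fleet with Helm and ArgoCD, owned the "
          "Terraform monorepo on EKS, defined SLOs with an error budget "
          "policy, carried primary on-call, wrote Go and Python tooling, "
          "ran Prometheus and Grafana, built GitHub Actions pipelines, "
          "and rolled out mTLS via the Envoy service mesh.")

coverage = class_coverage(resume)
missing = [cls for cls, hits in coverage.items() if hits == 0]
print(coverage)
print("missing classes:", missing or "none")
```

Any class at zero hits is a gap worth closing before the resume goes into an SRE search; three or more at zero is the filter-out pattern described above.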
Anti-Patterns That Fail SRE Screens
- The "backend engineer with deploys" resume: 80% application-feature work, 20% infrastructure mention. Reads as backend engineer who deploys to production, not as SRE. Senior+ SRE screens filter against this aggressively [3].
- No on-call evidence: "Worked on platform team for 3 years" without naming pager, rotation, or incident response. Reads as did-not-actually-carry-pager. Recruiters cross-check at interview, and the gap shows fast.
- Cloud-without-services: "AWS, GCP, Azure" in a bullet without naming any specific service. Reads as resume-stuffing. Specificity is the senior signal [2].
- Tools dump: 30-item flat Skills section. Triggers Greenhouse and Ashby spam-detection [2][13]. Group by category and put depth in experience bullets.
- "Familiar with" / "exposure to": Distance-creating phrasings. SRE work is owned, not glanced at.
- No observability mention: A Senior+ SRE resume that doesn't name Prometheus, Datadog, Honeycomb, OpenTelemetry, or equivalent reads as senior-without-observability-fluency, which is rare and suspect.
- SLO / error-budget vocabulary missing: An SRE resume without "SLO," "error budget," or "postmortem" reads as not-fluent-in-SRE-canon. The Google SRE Book vocabulary is Tier-1 keyword surface [5][6].
- Title inflation: Calling a generalist DevOps role "Site Reliability Engineer" when the actual scope was config-management-and-deploy-scripts. Senior interviewers cross-check on incident-response specifics, on-call rotation structure, and SLO definitions — and the inflation surfaces fast.
FAQ
I'm a backend engineer applying to my first SRE role — how do I write this resume?
Surface every reliability-adjacent thing you've done in backend work, framed in SRE vocabulary: on-call participation (even secondary), incident response (even as a contributor), monitoring ownership, runbook authoring, performance investigations, deploy-pipeline contributions, infrastructure code reviews, and any production-impact work. The Google SRE Book and the SRE Workbook are the canonical references for the vocabulary you should mirror — SLO, SLI, error budget, toil, blameless postmortem, on-call sustainability [5][6]. Then run the resume through Jobscan or Resume Worded against an SRE JD, and aim for a 70%+ match by reframing bullets to mirror the JD's reliability phrasing.
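The 70%+ match target can be approximated locally with a crude keyword-overlap score. This sketch shows the general idea only; it is not Jobscan's or Resume Worded's actual algorithm, and real scorers also extract multi-word phrases and weight title or section placement:

```python
import re

# Small illustrative stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for",
             "with", "on", "you", "we", "our", "will", "as", "is",
             "are", "be"}

def keywords(text: str) -> set[str]:
    """Naive keyword extraction: lowercase word tokens minus stopwords.

    Hyphenated terms like 'on-call' survive as single tokens; multi-word
    phrases like 'error budget' would need phrase extraction on top."""
    tokens = re.findall(r"[a-z][a-z0-9+/-]*", text.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) > 1}

def match_score(resume: str, jd: str) -> float:
    """Fraction of the job description's keywords present in the resume."""
    jd_kw = keywords(jd)
    if not jd_kw:
        return 0.0
    overlap = jd_kw & keywords(resume)
    return len(overlap) / len(jd_kw)
```

A score below roughly 0.7 against a target JD suggests reframing bullets toward the JD's own reliability phrasing before submitting.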
Should I list every cloud service I've touched on an SRE resume?
No. List the services you've operated in production with non-trivial ownership. The Greenhouse / Ashby keyword scan rewards specificity, but the recruiter screen rewards credibility — claiming 22 AWS services on one resume reads as resume-stuffing and gets probed at interview. The honest pattern is 6–10 named services per cloud where you have meaningful production experience, plus a brief "additional exposure" line if you've touched but not operated more.
How do I handle a "DevOps Engineer" title when applying to SRE roles?
Most modern SRE recruiters read DevOps and SRE as overlapping but distinct: DevOps emphasizes the dev/ops bridge and pipeline work; SRE emphasizes production reliability, SLO/error-budget discipline, and pager-carrying. If your DevOps work has been pipeline-and-config-management with no production on-call, frame the resume around the pipeline/IaC/cloud surface honestly and surface any reliability work you did. If your DevOps work was effectively SRE (production on-call, SLOs, incident response), use the resume bullets to make that explicit and consider a one-line subtitle in the role: "DevOps Engineer (production on-call, SLO ownership, primary incident commander)." Strict-match Workday and Taleo screens will weight the title; Ashby and Greenhouse will read the bullets [12][13].
How do I show on-call experience without overstating it?
Name the rotation structure, the team-pool size, the duration, and your role (primary, secondary, tertiary). Pattern: "Secondary on-call (4-engineer pool, 2-week rotation) for 14 months on the platform-services team — primary for the data-pipeline subset of services." If your on-call has been incident-participation rather than rostered pager, frame it that way: "Participated in 18 incident responses as the data-pipeline subject-matter expert across 12 months, including 4 as incident commander." Either is valid; faking primary-on-call status when you were not primary fails interview.
Do I need to use Google SRE Book vocabulary even if my company didn't?
Yes, mostly. The vocabulary in the Google SRE Book and the SRE Workbook (SLO, SLI, error budget, toil, blameless postmortem, on-call sustainability, alerting on symptoms not causes) is the canonical SRE keyword surface that recruiters scan for [5][6]. Even if your company called these things by different names internally — "service objectives," "reliability budget," "operational debt" — translate the internal vocabulary to the canonical SRE vocabulary on your resume. The recruiter scan will miss the internal jargon. Senior interviewers will accept the translation when you can name both the canonical concept and the internal practice.
How many years of experience do I need to claim "Senior SRE" titles?
The honest range is 5+ years of sustained infrastructure / production work, with at least 18 months of pager-carrying production on-call, ownership of an SLO portfolio for at least one major service, and a lead role in at least one incident-response or postmortem cycle. Below that, "Senior SRE" reads as inflated even if a small company gave you the title. Above 7 years, "Senior SRE" is the floor; staff, principal, or SRE-lead titles become the next step. Hello Interview's leveling rubrics map these tenure expectations across top-tier infrastructure-heavy companies [18].
What about salary data for SRE roles since BLS doesn't track them?
BLS has no Site Reliability Engineer occupation. The closest BLS proxies are SOC 15-1244 Network and Computer Systems Administrators (median annual wage $96,800 in May 2024) and SOC 15-1252 Software Developers ($133,080 in May 2024) [9][10]. Both undercount SRE comp at top-tier infrastructure-heavy tech companies because BLS aggregates broadly and the SRE specialty commands premium pricing in the labor market. levels.fyi tracks SRE / Production Engineer / Reliability Engineer comp separately at named companies (Google, Stripe, Netflix, Cloudflare, Datadog, Honeycomb) and consistently reports total compensation above both BLS proxies, especially at senior+ levels [11]. For honest salary expectations, anchor on levels.fyi by company and level rather than on the BLS proxy figures.
How do I show Kubernetes depth versus just mentioning it?
Name specific components, not just "Kubernetes." Strong depth signal: "operators, CRDs, admission webhooks, HPA, VPA, network policies, PodDisruptionBudgets, StatefulSets, DaemonSets, Service / Ingress / Gateway API, Helm 3, Kustomize, ArgoCD, Cilium." Pair with cluster-level operational claims: cluster count, region count, fleet management, upgrade cadence, custom-operator authorship, multi-tenant patterns, namespace governance, cluster-autoscaler tuning. The Kubernetes documentation and KubeCon talks are the canonical depth references — claims that name components beyond the Pod / Deployment / Service basics read as senior signal [8].
References
[1] Greenhouse Software. "Sourcing and Filtering Best Practices — Greenhouse Help Center." https://support.greenhouse.io/hc/en-us/articles/360051506331-Sourcing-best-practices
[2] Ashby HQ. "How Ashby's AI-Powered Sourcing Works." https://www.ashbyhq.com/resources/guides/ai-powered-sourcing
[3] Google SRE. "Site Reliability Engineering — How Google Runs Production Systems." https://sre.google/
[4] Lever. "Recruiter Search and Filtering Documentation." https://help.lever.co/
[5] Beyer, Jones, Petoff, Murphy (eds). Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016). https://sre.google/sre-book/table-of-contents/
[6] Beyer, Murphy, Rensin, Kawahara, Thorne (eds). The Site Reliability Workbook: Practical Ways to Implement SRE (O'Reilly, 2018). https://sre.google/workbook/table-of-contents/
[7] Brendan Gregg. "Systems Performance — Methodology and Tools." https://www.brendangregg.com/
[8] Kubernetes Project. "Kubernetes Documentation." https://kubernetes.io/docs/home/
[9] U.S. Bureau of Labor Statistics. "Network and Computer Systems Administrators (SOC 15-1244) — Occupational Employment and Wage Statistics, May 2024." https://www.bls.gov/oes/current/oes151244.htm
[10] U.S. Bureau of Labor Statistics. "Software Developers (SOC 15-1252) — Occupational Employment and Wage Statistics, May 2024." https://www.bls.gov/oes/current/oes151252.htm
[11] levels.fyi. "Site Reliability Engineer / Production Engineer Salary Data by Company and Level." https://www.levels.fyi/t/software-engineer/focus/devops
[12] Workday. "Workday Recruiting — Candidate Search Documentation." https://doc.workday.com/admin-guide/en-us/staffing/recruiting/candidate-experience.html
[13] Ashby HQ. "Recruiting Workflow and Candidate Scoring." https://www.ashbyhq.com/
[14] OpenTelemetry Project. "OpenTelemetry Documentation." https://opentelemetry.io/docs/
[15] Charity Majors. "Observability and Engineering Leadership Writing." https://charity.wtf/
[16] Casey Rosenthal and Nora Jones. Chaos Engineering: System Resiliency in Practice (O'Reilly, 2020). https://www.oreilly.com/library/view/chaos-engineering/9781492043850/
[17] Cloudflare. "Cloudflare Learning Center — Networking and Edge Security." https://www.cloudflare.com/learning/
[18] Hello Interview. "SRE / Infrastructure Engineering Leveling and Interview Rubrics." https://www.hellointerview.com/