
Kubernetes and Platform Engineering: The Tier-1-Sourced SRE Field Guide (2026)

In short

Kubernetes and platform engineering is the load-bearing SRE deep-skill for 2026. The Kubernetes control plane (kube-apiserver, etcd, kube-controller-manager, kube-scheduler) is the cluster's brain; the data plane (kubelet, container runtime, kube-proxy, CNI plugin) is its nervous system. GitOps with ArgoCD or Flux makes the cluster's desired state a Git artifact and the reconciliation loop a continuous-delivery primitive. Service mesh (Istio for feature breadth, Linkerd for operational simplicity, or none if the org's traffic doesn't need mTLS / canarying / circuit-breaking) is an explicit add, not a default. Platform engineering — Spotify's Backstage as the canonical internal developer platform — is how a centralized SRE team sells reliability primitives as a paved road to product engineers without becoming a ticket-driven bottleneck.

Key takeaways

  • The Kubernetes control plane is four named processes — kube-apiserver, etcd, kube-controller-manager, kube-scheduler — and one optional cloud-controller-manager. The data plane is kubelet, container runtime (containerd / CRI-O), kube-proxy, and a CNI plugin. Memorize the responsibilities; every K8s incident traces back to one of them. (kubernetes.io/docs/concepts/architecture/)
  • etcd is the single source of truth and the single largest operational risk. A degraded etcd quorum degrades the entire control plane. Backups, snapshot cadence, and disaster-recovery runbooks for etcd are non-negotiable senior-SRE table stakes.
  • GitOps with ArgoCD or Flux replaces imperative kubectl apply with a declarative reconciliation loop driven by Git. ArgoCD ships a UI and an Application CRD; Flux is GitOps-native and CLI-driven. Both are CNCF graduated projects (cncf.io/projects). Pick one; do not run both.
  • Service mesh is an explicit architecture choice, not a default. Istio (Google / IBM origin, ambient + sidecar modes, deepest feature set) is the industry standard for breadth; Linkerd (Buoyant, Rust-based linkerd2-proxy, operationally simpler) is the standard for clusters that want mTLS + observability without Istio's complexity. Many production clusters run no mesh at all and use ingress + NetworkPolicy + per-service mTLS where needed.
  • Helm and Kustomize are the two canonical config-templating tools. Helm uses Go templates and a chart packaging model; Kustomize uses overlays without templating. Modern teams often combine: Helm for vendored upstream charts, Kustomize for in-house overlays.
  • Platform engineering, per the CNCF / Gartner framing, is the discipline of building internal developer platforms (IDPs) that abstract Kubernetes complexity behind a paved-road developer experience. Spotify's Backstage (CNCF graduated, backstage.io) is the dominant open-source IDP framework; software templates and the service catalog are the load-bearing primitives.
  • The platform team's product is the developer experience. The failure mode is platform engineers who ship YAML to product engineers and call it a platform. The success mode is platform engineers who ship a Backstage template that scaffolds a service, a pipeline, an SLO, and a runbook — and the product engineer never writes Kubernetes YAML directly.

Kubernetes architecture: control plane + data plane

Kubernetes is a declarative orchestration system that runs containerized workloads across a fleet of machines. The architecture, per kubernetes.io/docs/concepts/architecture/, splits cleanly into a control plane (the cluster's brain) and a data plane (the worker nodes that run the workloads). Senior SREs memorize the components and their responsibilities — every Kubernetes incident traces back to one of them.

Control plane components:

  • kube-apiserver. The front door. Every kubectl command, every controller, every kubelet talks to the apiserver over HTTPS. It validates requests, writes accepted state to etcd, and serves watch streams for controllers. Horizontally scalable; typical production clusters run 3+ replicas behind a load balancer.
  • etcd. The cluster's single source of truth. A distributed key-value store using the Raft consensus protocol. Holds every Pod spec, every Secret, every ConfigMap. Loss of etcd quorum = loss of the cluster's brain. Snapshot backups (etcdctl snapshot save) on a tight cadence are non-negotiable.
  • kube-controller-manager. Runs the reconciliation loops for Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs, Nodes, ServiceAccounts, and more. Each controller watches the apiserver for desired-state changes and drives actual state toward it.
  • kube-scheduler. Watches for unscheduled Pods and binds them to Nodes based on resource requests, affinity rules, taints / tolerations, and topology constraints. Pluggable scheduling framework; most clusters use the default.
  • cloud-controller-manager. Optional. Bridges Kubernetes to the underlying cloud (AWS / GCP / Azure) — LoadBalancer Services, node lifecycle, and cloud network routes. Persistent volume provisioning is handled by CSI drivers, not the cloud-controller-manager.

Data plane components on every worker node:

  • kubelet. The node agent. Talks to the apiserver, receives Pod specs, instructs the container runtime to start containers, reports node and Pod status back. Health-checks the Pods via liveness / readiness / startup probes.
  • Container runtime. containerd or CRI-O via the Container Runtime Interface (CRI). The dockershim was removed in Kubernetes 1.24 (May 2022); production clusters in 2026 run containerd or CRI-O, not Docker.
  • kube-proxy. Implements Service abstraction by programming iptables (default) or IPVS rules on each node. IPVS scales better at large Service counts; iptables is the default for small / mid clusters.
  • CNI plugin. Container Network Interface. Provides the Pod network: every Pod gets an IP, Pods can route to each other across nodes. Calico, Cilium (eBPF-based, fastest growth in 2024-2026), Flannel, and the AWS VPC CNI are the production-common choices.

Storage and ingress. Persistent storage is provisioned via a Container Storage Interface (CSI) driver — AWS EBS, GCP PD, Azure Disk, Ceph, Longhorn. PersistentVolumeClaims request storage; PersistentVolumes back it. Ingress is provided by an ingress controller (NGINX, Traefik, HAProxy, or the cloud-native AWS Load Balancer Controller) reading Ingress resources to route external HTTP traffic to Services.
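
A minimal PersistentVolumeClaim sketch (the StorageClass name is an assumption — substitute whatever the cluster's CSI driver provisions):

apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: api-data }
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3              # assumption: an EBS-backed class; use the cluster default if unsure
  resources: { requests: { storage: 20Gi } }

A Pod mounts the claim by name via spec.volumes[].persistentVolumeClaim.claimName, and the CSI driver dynamically provisions the backing PersistentVolume.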

The minimal production manifest for a stateless web service — a Deployment (which wraps the Pod template), a Service, and an Ingress:

apiVersion: apps/v1
kind: Deployment
metadata: { name: api, labels: { app: api } }
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
      - name: api
        image: registry.example.com/api:v1.4.2
        ports: [{ containerPort: 8080 }]
        readinessProbe: { httpGet: { path: /healthz, port: 8080 } }
        resources:
          requests: { cpu: 100m, memory: 128Mi }
          limits:   { cpu: 500m, memory: 512Mi }
---
apiVersion: v1
kind: Service
metadata: { name: api }
spec:
  selector: { app: api }
  ports: [{ port: 80, targetPort: 8080 }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata: { name: api }
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http: { paths: [{ path: /, pathType: Prefix, backend: { service: { name: api, port: { number: 80 } } } }] }

StatefulSets replace Deployments when ordered identity matters (databases, Kafka, Elasticsearch). Headless Services provide stable DNS names; PVC templates provision per-Pod persistent storage. The DaemonSet ensures one Pod per Node — the canonical pattern for log shippers, node exporters, and CNI components.
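
A minimal StatefulSet sketch with its headless Service (the image and storage size are illustrative placeholders):

apiVersion: v1
kind: Service
metadata: { name: db, labels: { app: db } }
spec:
  clusterIP: None                    # headless: each Pod gets a stable DNS name (db-0.db, db-1.db, ...)
  selector: { app: db }
  ports: [{ port: 5432 }]
---
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: db }
spec:
  serviceName: db
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      containers:
      - name: db
        image: postgres:16           # illustrative; any stateful workload follows the same shape
        ports: [{ containerPort: 5432 }]
        volumeMounts: [{ name: data, mountPath: /var/lib/postgresql/data }]
  volumeClaimTemplates:
  - metadata: { name: data }
    spec:
      accessModes: [ReadWriteOnce]
      resources: { requests: { storage: 20Gi } }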

GitOps with ArgoCD or Flux

GitOps is the operational pattern of treating Git as the source of truth for cluster state and a controller as the reconciliation engine that drives the cluster toward the Git-declared desired state. The pattern was named by Weaveworks in 2017 and is now the dominant deployment model for Kubernetes. Two canonical implementations: ArgoCD (Intuit origin, 2018, CNCF graduated 2022) and Flux (Weaveworks origin, CNCF graduated 2022). Both are listed at cncf.io/projects.

The four GitOps principles per the OpenGitOps working group (opengitops.dev): (1) declarative — the system's desired state is described declaratively; (2) versioned and immutable — desired state is stored in Git with full history; (3) pulled automatically — software agents automatically pull the desired state from Git; (4) continuously reconciled — software agents continuously observe actual state and converge it to desired.

ArgoCD (argo-cd.readthedocs.io) is the more feature-rich, UI-forward option. It runs in-cluster as a set of controllers and serves a web UI plus an API. The core abstraction is the Application CRD — a declarative pointer to a Git repo path that should be reconciled into a destination cluster + namespace. ArgoCD detects drift, supports manual or automatic sync, and renders a real-time topology view of the deployed resources.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/k8s-manifests
    targetRevision: main
    path: apps/api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: api-prod
  syncPolicy:
    automated: { prune: true, selfHeal: true }

Flux (fluxcd.io) is the GitOps-native, CLI-forward option. No default UI — Flux composes with Weave GitOps or Kustomize / Helm tooling. The core abstractions are GitRepository (source) and Kustomization or HelmRelease (reconciler). Flux is more modular than ArgoCD; teams that prefer composability and fewer moving parts in the UI tier choose Flux.
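
A minimal Flux sketch of the same reconciliation, expressed in Flux's two core abstractions — a GitRepository source plus a Kustomization reconciler (the repo URL and path mirror the ArgoCD example and are illustrative):

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-manifests
  namespace: flux-system
spec:
  url: https://github.com/acme/k8s-manifests
  ref: { branch: main }
  interval: 1m
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: api-prod
  namespace: flux-system
spec:
  sourceRef: { kind: GitRepository, name: k8s-manifests }
  path: ./apps/api/overlays/prod
  targetNamespace: api-prod
  prune: true
  interval: 5m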

Helm and Kustomize sit upstream of either GitOps tool. Helm (helm.sh, CNCF graduated) uses Go templates and a chart packaging model; the ecosystem ships thousands of vendored charts (Bitnami, ingress-nginx, cert-manager). Kustomize (built into kubectl since 1.14) uses overlay-based config without templating — a base manifest plus per-environment overlays. Modern teams combine both: Helm for vendored upstream charts, Kustomize for in-house overlays. ArgoCD and Flux both support either.
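
A minimal Kustomize overlay sketch — the prod overlay pins the namespace, the image tag, and a replica-count patch on top of a shared base (paths and names are illustrative):

# apps/api/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: api-prod
resources:
- ../../base
images:
- name: registry.example.com/api
  newTag: v1.4.2
patches:
- target: { kind: Deployment, name: api }
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 5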

The senior-SRE GitOps decisions: (1) one tool, not both — running ArgoCD and Flux side-by-side multiplies operational surface; (2) one source of truth — every cluster mutation goes through Git; emergency kubectl edits are post-mortem-worthy; (3) a separate config repo from the application repo at scale; (4) automated sync with self-heal in non-prod, manual sync gates in prod; (5) image-tag automation via ArgoCD Image Updater or Flux image-automation controllers to close the CI-to-CD loop.

Service mesh: Istio vs Linkerd vs none

Service mesh is an architecture pattern that injects a sidecar proxy alongside every service Pod (or runs a node-level proxy in ambient mode) to handle traffic management, mTLS, observability, and policy enforcement transparently to the application. The two production-common open-source meshes are Istio and Linkerd; many production clusters run no mesh at all.

Istio (istio.io/latest/docs, Google / IBM origin, CNCF graduated 2023) is the broadest-feature mesh. The control plane is istiod; the data plane is Envoy proxies in sidecar mode (one Envoy per Pod) or in the newer ambient mode (a node-level ztunnel plus an optional waypoint proxy per service). Istio's feature set: mTLS by default, fine-grained traffic shifting / canarying / fault injection, AuthorizationPolicy CRDs for L7 RBAC, deep telemetry via OpenTelemetry / Prometheus, egress and gateway control, multi-cluster mesh federation. The operational cost is real: the control plane is non-trivial, sidecar latency adds up, and the feature surface is large enough that teams routinely underestimate the effort to run it well.
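
A minimal traffic-shifting sketch — a weighted canary expressed as an Istio VirtualService, assuming a DestinationRule elsewhere defines the stable and canary subsets (host and subset names are illustrative):

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata: { name: api }
spec:
  hosts: [api]
  http:
  - route:
    - destination: { host: api, subset: stable }
      weight: 90
    - destination: { host: api, subset: canary }
      weight: 10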

Linkerd (linkerd.io/2.16/overview, Buoyant origin, CNCF graduated 2021) is the operationally simpler mesh. The data plane is linkerd2-proxy, a Rust-built micro-proxy purpose-designed for Linkerd — smaller memory footprint and lower latency than Envoy. The feature set is narrower than Istio's: mTLS, retries / timeouts, traffic splitting (Gateway API HTTPRoute in current releases, SMI TrafficSplit historically), golden-metric observability, and a clean CLI. L7 authorization policy (the Server and AuthorizationPolicy CRDs) arrived only in recent releases and remains narrower than Istio's. The trade-off is explicit: Linkerd trades feature breadth for operational simplicity, and many production teams prefer that trade.
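
Meshing a workload in Linkerd is, at minimum, a namespace annotation — the proxy injector then adds the linkerd2-proxy sidecar to every Pod created in that namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: api-prod
  annotations:
    linkerd.io/inject: enabled       # Linkerd's proxy-injection annotation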

No mesh is a legitimate production choice. Not every cluster needs sidecars. The minimum-viable security and observability story without a mesh: (1) NetworkPolicy resources for L3 / L4 segmentation; (2) per-service mTLS via cert-manager + a service-to-service auth library where required; (3) Ingress controller for north-south traffic; (4) OpenTelemetry SDKs in application code for distributed tracing; (5) Prometheus + Grafana for metrics. Many platforms operate at large scale with this stack and no mesh.
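
A minimal NetworkPolicy sketch for that baseline — default-deny ingress for the namespace, then an explicit allow from the ingress controller's namespace (the ingress-nginx namespace name is an assumption):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny-ingress, namespace: api-prod }
spec:
  podSelector: {}                    # empty selector = every Pod in the namespace
  policyTypes: [Ingress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-from-ingress-nginx, namespace: api-prod }
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [Ingress]
  ingress:
  - from:
    - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: ingress-nginx } }
    ports:
    - { port: 8080, protocol: TCP }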

The senior-SRE service mesh decision: reach for Istio if you need multi-cluster federation, fine-grained L7 authorization, advanced traffic shifting (header-based canarying), or a deeply mature feature surface. Reach for Linkerd if you want mTLS + golden-metric observability + retries with the lowest operational overhead. Reach for no mesh if the feature requirements (mTLS, observability, traffic management) can be met with NetworkPolicy + cert-manager + OpenTelemetry SDKs and the team would rather not run a mesh control plane. Adopting a mesh because it is fashionable is the most common service-mesh failure mode named in CNCF case studies.

Platform engineering: internal developer platforms (Spotify Backstage)

Platform engineering is the discipline of building internal developer platforms (IDPs) that abstract Kubernetes and infrastructure complexity behind a paved-road developer experience. The framing was popularized by Team Topologies (Skelton / Pais, 2019), Gartner's 2022 platform-engineering coverage, and the CNCF Platforms Working Group white paper. The core argument: as Kubernetes adoption matures, the centralized SRE team becomes a bottleneck if it operates by ticket; the only scalable alternative is to ship a platform that lets product engineers self-serve the infrastructure primitives the SRE team has hardened.

Spotify's Backstage (backstage.io/docs, CNCF graduated 2024) is the dominant open-source IDP framework. It is a TypeScript / React application, extensible via plugins, that the platform team operates as the developer-facing surface for the entire engineering organization. Three load-bearing primitives:

  1. Software catalog. A graph of Components, Resources, Systems, and APIs — every service in the org, with its owner, its tier, its dependencies, and its docs. The catalog is YAML-driven (catalog-info.yaml) and lives alongside service code. The catalog is the company's authoritative service registry; a minimal catalog-info.yaml sketch follows this list.
  2. Software templates (scaffolder). Self-service service creation. An engineer fills out a Backstage form ('I want a new Python API service called X in team Y'), and the scaffolder generates a Git repo, a CI pipeline, a Kubernetes manifest, an SLO, on-call wiring, and registers the service in the catalog — all from a parameterized template the platform team maintains. The minimal template excerpt:
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-api-service
  title: Python API Service
spec:
  parameters:
    - { title: Service name, properties: { name: { type: string } } }
  steps:
    - id: fetch
      action: fetch:template
      input: { url: ./skeleton, values: { name: "${{ parameters.name }}" } }
    - id: publish
      action: publish:github
      input: { repoUrl: "github.com?owner=acme&repo=${{ parameters.name }}" }
    - id: register
      action: catalog:register
      input: { repoContentsUrl: "${{ steps.publish.output.repoContentsUrl }}" }

  3. TechDocs. Docs-as-code rendered inside Backstage from Markdown in each service's repo. Every service has docs in the same place, discoverable from the catalog entry. The TechDocs plugin is the most-installed Backstage plugin per backstage.io statistics.
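
A minimal catalog-info.yaml sketch for the software catalog entry referenced in item 1 (the name, owner, system, and annotation values are illustrative):

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: api
  description: Customer-facing API service
  annotations:
    github.com/project-slug: acme/api
spec:
  type: service
  lifecycle: production
  owner: team-y
  system: checkout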

The platform team's product is the developer experience. The failure mode named in CNCF case studies and Team Topologies is platform engineers who ship YAML to product engineers and call it a platform. The success mode is platform engineers who ship a Backstage template that scaffolds a service, a pipeline, an SLO, a runbook, and on-call wiring — and the product engineer never writes Kubernetes YAML directly. The platform's success metric is paved-road adoption rate: what percentage of new services in the last 90 days were created via the Backstage template versus hand-rolled?

The senior-SRE platform-engineering posture: (1) treat the platform as a product with named users (product engineers) and named success metrics (paved-road adoption, time-to-first-deploy, mean-time-to-onboard); (2) run the platform team as a Team Topologies platform team — not a ticket-driven ops team; (3) write the templates so the SRE-hardened defaults (resource limits, probes, NetworkPolicy, SLO, on-call, log shipping) are the only path; (4) measure developer satisfaction with the platform as a leading indicator of reliability — engineers who fight the platform route around it, and routed-around platforms produce reliability incidents.

Frequently asked questions

What is the smallest production-credible Kubernetes architecture?
Three control-plane nodes (for etcd Raft quorum), two worker nodes (so a node failure doesn't take production down), one ingress controller (NGINX or Traefik), one CNI plugin (Calico or Cilium), one CSI driver (cloud-native or Longhorn), one observability stack (Prometheus + Grafana + a log shipper). Anything smaller is a development cluster — anything missing the three control-plane Raft majority is a single-point-of-failure cluster.
When should I run Istio versus Linkerd versus no service mesh?
Istio (istio.io) for multi-cluster mesh federation, fine-grained L7 AuthorizationPolicy, advanced traffic shifting / fault injection, and the deepest feature surface; accept the operational cost. Linkerd (linkerd.io) for mTLS + golden-metric observability + retries with the lowest operational overhead; the Rust linkerd2-proxy is materially smaller than Envoy. No mesh if NetworkPolicy + cert-manager + OpenTelemetry SDKs cover the requirements — a legitimate production choice at scale. Adopting a mesh because it is fashionable is the most common service-mesh failure mode named in CNCF case studies.
ArgoCD or Flux — which is the right GitOps tool?
Both are CNCF graduated (cncf.io/projects). ArgoCD (argo-cd.readthedocs.io) ships a UI, a richer Application CRD, and an opinionated developer experience — preferred when the platform team wants a visual surface for all developers. Flux (fluxcd.io) is GitOps-native, CLI-driven, and more modular — preferred when the team prefers composability and fewer UI moving parts. Pick one. Running both multiplies operational surface and is a senior-SRE anti-pattern.
Helm or Kustomize?
Both. Helm (helm.sh, CNCF graduated) is the right tool for vendored upstream charts — ingress-nginx, cert-manager, Bitnami, prometheus-community. The chart ecosystem is the largest in Kubernetes. Kustomize (built into kubectl since 1.14) is the right tool for in-house overlay-based config without templating. Modern teams combine: Helm for vendored, Kustomize for overlays. ArgoCD and Flux both natively support either or both.
How do I think about etcd backup and disaster recovery?
etcd is the single source of truth — its loss is cluster loss. Senior-SRE etcd hygiene: (1) snapshot backups on a tight cadence (etcdctl snapshot save) shipped off-cluster; (2) tested restore runbook — an untested backup is not a backup; (3) Raft quorum monitoring with paging on member loss; (4) defragmentation cadence as the database grows; (5) separation of etcd nodes from worker workloads. The kubernetes.io documentation on etcd disaster recovery is the canonical reference.
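
A minimal snapshot CronJob sketch, assuming a kubeadm-style layout (etcd serving on localhost:2379 on control-plane nodes, client certs under /etc/kubernetes/pki/etcd); the image tag, cert paths, and hostPath destination are assumptions to adapt, and a real job would timestamp the filename and ship the snapshot off-cluster:

apiVersion: batch/v1
kind: CronJob
metadata: { name: etcd-snapshot, namespace: kube-system }
spec:
  schedule: "0 */6 * * *"            # every six hours; tighten to match the RPO
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeSelector: { node-role.kubernetes.io/control-plane: "" }
          tolerations:
          - { key: node-role.kubernetes.io/control-plane, operator: Exists, effect: NoSchedule }
          restartPolicy: OnFailure
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.15-0    # assumption: pin to the cluster's etcd version
            command:
            - etcdctl
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/kubernetes/pki/etcd/ca.crt
            - --cert=/etc/kubernetes/pki/etcd/server.crt
            - --key=/etc/kubernetes/pki/etcd/server.key
            - snapshot
            - save
            - /backup/etcd-snapshot.db
            volumeMounts:
            - { name: etcd-certs, mountPath: /etc/kubernetes/pki/etcd, readOnly: true }
            - { name: backup, mountPath: /backup }
          volumes:
          - name: etcd-certs
            hostPath: { path: /etc/kubernetes/pki/etcd }
          - name: backup
            hostPath: { path: /var/backups/etcd }
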
What does Backstage actually replace?
Backstage (backstage.io) replaces the patchwork of internal wikis, service registries, Confluence runbooks, ticket templates, and ad-hoc service-creation scripts that platform teams accumulate. The software catalog replaces a service registry; software templates replace cookie-cutter scaffold scripts; TechDocs replaces docs-in-Confluence. The CNCF graduation in 2024 reflects the maturity. Backstage is large to operate — the platform team should expect to invest 1-2 dedicated engineers maintaining and extending it.
How do I measure platform engineering success?
Per the CNCF Platforms Working Group white paper and Team Topologies (Skelton / Pais): (1) paved-road adoption rate — % of new services created via the platform's templates versus hand-rolled; (2) time-to-first-deploy for a new service from idea to production; (3) developer satisfaction with the platform (NPS / quarterly survey); (4) reliability of services on the paved road versus off it; (5) the inverse — how often product engineers route around the platform, which is a leading indicator of platform-team failure.
What is the senior-SRE Kubernetes reading list?
kubernetes.io/docs/concepts (architecture, workloads, networking, storage) is the canonical primary source. Then: argo-cd.readthedocs.io for GitOps, istio.io/latest/docs and linkerd.io/2.16/overview for service mesh, backstage.io/docs for platform engineering, and cncf.io/projects to track the graduated and incubating projects in the broader ecosystem. The CNCF Platforms Working Group white paper is the canonical platform-engineering reference. Total reading time: 30-50 hours; total transformative value across a senior SRE career: very high.

Sources

  1. Kubernetes Documentation — Concepts (architecture, workloads, networking, storage). Tier-1 primary source.
  2. ArgoCD Documentation. CNCF graduated GitOps tool with Application CRD, declarative reconciliation, and a topology UI.
  3. Istio Documentation — istiod control plane, Envoy data plane (sidecar + ambient modes), AuthorizationPolicy, multi-cluster mesh.
  4. Linkerd 2.16 Overview — Buoyant origin, Rust linkerd2-proxy, mTLS + golden metrics + retries with low operational overhead.
  5. Spotify Backstage Documentation — software catalog, software templates (scaffolder), TechDocs. CNCF graduated 2024.
  6. CNCF Projects — graduated / incubating / sandbox project catalog. Reference for ecosystem maturity.

About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.