DevOps / SRE Engineer Hub

AI Tools in the SRE Workflow (2026)

In short

AI tools sit beside the senior SRE in 2026, not in front of them. Cursor and Claude Code accelerate Terraform, Helm, and Kubernetes manifest authoring; Claude Code drafts runbooks and postmortems from incident timelines; Honeycomb AI, Datadog Bits, and Grafana k6 surface anomalies and generate PromQL from natural language; PagerDuty AI groups noisy alerts and auto-routes pages. The discipline that keeps production safe: AI drafts, humans approve. AI never executes production-mutation operations, never makes on-call decisions, and never ships security-sensitive infra changes (IAM, network policy, secrets) without a senior reviewer reading every diff line.

Key takeaways

  • Cursor's repo-indexed agent mode is the dominant authoring surface for Terraform modules, Helm charts, and Kubernetes manifests in 2026. The killer feature is editing across module + variables + outputs + values.yaml in a single coherent pass.
  • Claude Code is Anthropic's CLI agent for SRE workflows that span many files with verification loops — drafting runbooks from code paths, authoring postmortems from incident timelines, generating bash + Python scripts for one-off remediations and SLO calculations.
  • Stack Overflow's 2024 Developer Survey found 76% of professional developers using or planning to use AI tools; SREs who refuse AI tooling screen poorly in 2026 interviews and ship more slowly than peers who use it well.
  • Honeycomb AI (BubbleUp + AI query assistant) and Datadog Bits AI translate natural-language questions into traces, logs, and PromQL/MetricsQL queries — the right tool for the 3am 'what is different about the failing requests' question.
  • PagerDuty AI Operations groups related alerts into a single incident, suppresses transient noise, and suggests responders based on past resolutions. Treat it as a triage assistant, never as the on-call decision-maker.
  • Grafana k6 + AI-generated load scripts let SREs scaffold realistic load tests in minutes; review for correctness of think-time, ramp curves, and assertion thresholds before running against production-shaped environments.
  • Hard guardrail: AI is allowed to write and plan, never to apply. `terraform apply`, `kubectl apply`, `helm upgrade`, IAM policy edits, security-group changes, and secret rotations are human-in-the-loop with senior approval. Auto-merge on infra PRs is a bug, not a feature.

How senior SREs use AI tools in 2026

The senior site reliability engineer in 2026 has a stable AI workflow across four surfaces: the IDE (Cursor), the agent CLI (Claude Code), the observability stack (Honeycomb AI, Datadog Bits, Grafana k6), and the paging layer (PagerDuty AI Operations). Each tool earns its place on a specific job; none is trusted to make a production-mutation decision unsupervised.

  • Cursor. The dominant AI-first IDE for infrastructure-as-code in 2026. Repo-indexed multi-file context lets the engineer ask for a change that spans Terraform module, variables, outputs, Helm chart values, and the consuming Kubernetes manifest in a single coherent edit. The agent mode plans the change, edits across files, and runs `terraform validate` or `helm lint` before reporting back.
  • Claude Code. Anthropic's terminal agent for SRE tasks that benefit from verification loops — drafting a runbook from `services/payments/*.go`, writing a postmortem from a Slack timeline export, generating a Python script that walks every prod RDS instance and reports parameter-group drift. Claude Code reads files, runs commands, reads output, and iterates.
  • Observability AI (Honeycomb AI, Datadog Bits, Grafana). Natural-language to PromQL/MetricsQL/trace queries; anomaly explanation (Honeycomb's BubbleUp surfaces 'what is different about the failing requests'); root-cause hypothesis generation. The right tool for 3am pages where the SRE knows the symptom but not the dimension.
  • PagerDuty AI Operations. Alert grouping, noise suppression, and responder suggestion. Reduces page volume by 40-60% on noisy services without missing real incidents — when configured by a human who understands the service map.

The Stack Overflow 2024 Developer Survey reports 76% of professional developers using or planning to use AI tools. SREs are not exempt. The pattern that holds across mature infra teams in 2026: AI handles the mechanical 70-80% of authoring (boilerplate Terraform, Helm value scaffolding, runbook structure, first-pass PromQL); the SRE handles the judgment 20-30% (blast radius, change-management, security review). Output reviewed line-by-line; never trusted by default; never auto-applied to production.

The senior bar in 2026: fluent daily use plus a calibrated opinion on the guardrails (see the guardrails section below) — where AI helps freely, where it needs aggressive review, and where it produces dangerous changes that look correct.

AI-assisted runbook drafting + postmortem authoring with Claude Code

Two SRE workflows where Claude Code earns its keep on a weekly basis: runbook drafting and postmortem authoring. Both are mechanical-structure work where the AI handles the scaffolding and the engineer adds the institutional context.

Pattern 1: Runbook drafting. When a service ships, on-call needs runbooks covering symptom, diagnosis, and remediation for every external dependency and failure mode. The senior SRE prompts Claude Code:

Read services/payments/*.go and infra/payments/*.tf.

For each external dependency (Stripe, Postgres, Redis, internal
fraud-service), draft a runbook section covering:
- Symptom: what an on-call sees in dashboards / logs
- Diagnostic: 1-2 commands or queries to confirm the failure mode
- Remediation: rollback, failover, or escalation step
- Blast radius: what else is affected

Match the structure of docs/runbooks/orders.md.
Do NOT include placeholder credentials or example secrets.

Claude Code reads the codebase, finds the dependency calls, generates a structured draft. The engineer adds tribal knowledge — the Datadog dashboard URL, the on-call Slack channel, the historical incident links — and ships. Time: 30 minutes vs 4 hours hand-authoring.

Pattern 2: Postmortem from incident timeline. The most valuable Claude Code workflow in 2026 SRE practice. After resolution, the SRE pastes the Slack incident-channel export and the alert timeline into Claude Code with this prompt:

Draft a blameless postmortem from this incident timeline.

Inputs: slack_export.txt + pagerduty_timeline.json + relevant
Datadog dashboard screenshots (paths attached).

Follow the team template at docs/postmortems/_template.md.

Requirements:
- Summary (3-5 sentences, plain English)
- Impact: duration, affected users, $ revenue, SLO budget burned
- Timeline (UTC, 5-minute resolution)
- Root cause: 5 Whys, stop at a system boundary, NOT a person
- What went well / poorly / where we got lucky
- Action items table (owner, priority, due date placeholder)

Do NOT name individuals as causes. Do NOT speculate beyond evidence.
Mark anything you cannot verify as [VERIFY].

The output is a draft postmortem that captures 80% of the work: the timeline reconstructed from timestamps, the impact section computed from the duration and SLO burn, the structural sections in the team's idiom. The engineer rewrites the root-cause section (the AI cannot reason about distributed-systems edge cases as well as the engineer who fought the incident), fills in the [VERIFY] markers, and assigns action items. The Etsy and Google SRE doctrine on blameless postmortems still governs the content; Claude Code accelerates the drafting.

The discipline: every Claude Code-drafted postmortem is rewritten, not just reviewed. Reading and editing reveals where the AI hand-waved over a real causal step. Senior SRE practice in 2026 treats the draft as a first iteration, not a final.
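The 'SLO budget burned' figure in the impact section is simple arithmetic the reviewing engineer should verify by hand rather than trust from the draft. A stdlib-only Python sketch (the SLO target, window, and outage duration here are hypothetical, not from any specific incident):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_burned(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by a single outage."""
    return outage_minutes / error_budget_minutes(slo, window_days)

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))      # 43.2
# A 25-minute outage burns roughly 58% of that budget.
print(round(budget_burned(25, 0.999) * 100, 1))   # 57.9
```

If the AI-drafted impact section claims a burn percentage that this arithmetic does not reproduce, that is a [VERIFY] marker in disguise.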

AI in observability: Honeycomb AI, Datadog Bits, Grafana k6

Observability is where AI tooling has matured fastest in 2026. The three platforms that matter for senior SRE practice:

Honeycomb AI + BubbleUp. Honeycomb's high-cardinality wide-event model is the right substrate for AI assistance: 'what is different about the failing requests' is a tractable question when every event carries every dimension. BubbleUp answers it visually; the AI query assistant answers it in natural language. The pattern that works at 3am:

honeycomb-cli ai query \
  --dataset checkout-prod \
  --time-range 'last 30m' \
  --question "Why are p99 latencies elevated for POST /charge?
    Group by anything that distinguishes slow from fast requests.
    Only include groups with at least 100 events."

# Honeycomb's AI generates an equivalent query:
#   VISUALIZE: HEATMAP(duration_ms), P99(duration_ms)
#   WHERE: name = "POST /charge" AND duration_ms > 0
#   GROUP BY: app.region, db.connection_pool.host, customer.tier
#   HAVING: COUNT > 100
#   TIME: last 30m
#
# BubbleUp surfaces: 95% of slow requests share customer.tier=enterprise
# AND db.connection_pool.host=primary-db-3.

Datadog Bits AI. Datadog's natural-language assistant for metrics, logs, and traces, plus AI-generated incident summaries. Bits is strongest for cross-product correlation: 'show me errors on the checkout service correlated with deploys in the last 24h' becomes a Datadog query joining APM, logs, and the deployment timeline. The Datadog engineering blog documents the 2025-2026 evolution toward AI-suggested investigation paths. Treat suggestions as hypotheses, not conclusions; the AI does not know the team's deploy semantics or the recent incident history.

Grafana k6 + AI load script generation. The 2026 pattern for load testing: describe the user journey in natural language, let Cursor or Claude Code generate the k6 script, then review for realism. A senior SRE prompts: 'Generate a k6 script that simulates 200 concurrent checkout flows: GET /cart, POST /checkout, poll /orders/{id} until status=completed, with 1-3 second think time and a 5-minute ramp.' The AI emits a runnable script. The SRE adjusts assertion thresholds (default-generated thresholds are usually too lenient) and ramp curves before running. AI also helps interpret k6 output — 'why did p95 climb from 200ms to 2s at minute 4?' is a question the model can hypothesize about given the script and the result summary.

AI-generated PromQL. The single biggest day-to-day quality-of-life win for SRE work in 2026. PromQL is dense, easy to get wrong, and AI is excellent at translating intent into a working query. 'Give me the 5-minute rate of 5xx responses per service, top 10 by rate' becomes `topk(10, sum by (service) (rate(http_requests_total{status=~"5.."}[5m])))` reliably. Always validate the query mentally — `sum by` vs `sum without` errors are the most common AI mistake — and never paste an AI-generated alerting rule into production without graphing it for at least an hour first.

AI-assisted dependency analysis. Service-graph AI features in Datadog APM, Honeycomb's service map, and Grafana Tempo all surface 'what depends on this service' and 'what is this service's blast radius' in 2026. Useful for change-management ('if I take db-3 down for maintenance, what breaks'), but the answer is only as good as the trace coverage. Untraced background jobs and cron tasks are invisible to the AI; the senior SRE checks the deploy manifests directly.

Where AI helps and where it dangerously fails — guardrails for production

The senior SRE in 2026 carries a calibrated model of AI's competence surface. The guardrails table below codifies what AI can do safely and what kills production:

| Operation | Guardrail | Why |
| --- | --- | --- |
| **Use freely** | | |
| Terraform module scaffolding (variables, outputs, naming) | Use freely | Convention-driven; HCL syntax is well-known to AI; `terraform validate` catches errors. |
| Helm chart values.yaml + templates boilerplate | Use freely | Strong AI fluency; lint and template-render before commit. |
| Kubernetes manifest authoring (Deployment, Service, ConfigMap) | Use freely | Schema-validated; `kubectl --dry-run=server` catches drift. |
| Runbook drafts from code paths | Use freely + polish | Structure is mechanical; institutional context is human. |
| Postmortem drafts from incident timelines | Use freely + rewrite | AI scaffolds; engineer rewrites root-cause section. |
| PromQL / MetricsQL / LogQL from natural language | Use freely + graph it | Strong intent-to-query; verify cardinality and `by` clauses. |
| k6 / Locust / JMeter load script scaffolding | Use freely + tune thresholds | Mechanical script generation; review ramp + assertions. |
| Bash + Python one-off remediation scripts (read-only) | Use freely | Exec only after eyeballing the diff and the destructive flags. |
| Dockerfile + docker-compose authoring | Use freely | Conventional; security-scan with Trivy or Grype after. |
| **Review carefully** | | |
| Rolling restart commands (`kubectl rollout restart`) | Review carefully | Blast radius depends on PDB, replica count, dependent services. |
| Terraform plan output interpretation | Review carefully | AI explanations smooth over destructive replace operations. |
| Alert thresholds + SLO burn-rate definitions | Review carefully | AI defaults to noisy thresholds; require historical-data calibration. |
| Database migration scripts (online schema changes) | Review carefully | Lock contention, replication lag, rollback paths are subtle. |
| Backup + restore scripts | Review carefully | Test restore in staging; AI under-specifies edge cases. |
| PagerDuty escalation policy edits | Review carefully | Mis-routed pages = missed incidents; verify with on-call simulation. |
| **Do not use unsupervised** | | |
| `terraform apply` / `helm upgrade` / `kubectl apply` (production) | Do NOT auto-apply | Mutations are irreversible; senior approval required on every diff. |
| IAM policy changes / role bindings / service-account creation | Do NOT use | Privilege-escalation paths are subtle; AI optimizes for working-not-secure. |
| Security-group / network-policy / firewall rule changes | Do NOT use | Wrong CIDR or wrong direction = data exfil or outage. |
| Secret rotation / KMS key changes / TLS cert renewal | Do NOT use | Cryptographic operations require a vetted runbook, not AI improvisation. |
| On-call decision-making during a Sev0/Sev1 | Do NOT delegate | The IC owns the call; AI suggests, the human commands. |
| Production data deletion / table drop / index rebuild on hot tables | Do NOT use | Recovery cost is unbounded; require explicit human plus DBA sign-off. |
| Auto-merge of infrastructure pull requests | Do NOT enable | Two human reviewers minimum on Terraform; one must be senior SRE. |

Three failure modes the senior SRE watches for:

  • The plausible-but-wrong Terraform diff. AI generates a resource block that imports cleanly but uses the wrong region, the wrong KMS key, or the wrong tag schema; the plan looks reasonable; production drifts. Mitigation: pre-merge tag-policy and region-policy enforcement via OPA or tfsec; never trust plan output without reading every changed resource.
  • The hallucinated kubectl command. AI suggests `kubectl drain --force --delete-emptydir-data` to clear a pending pod; the SRE runs it; an emptydir-backed cache is destroyed; recovery takes 40 minutes. Mitigation: never run AI-generated `kubectl` commands with `--force` or `--delete-*` flags without re-reading the man page.
  • The auto-routed page. PagerDuty AI groups two unrelated alerts into one incident because they fired in the same minute; the on-call investigates the wrong service; a real outage extends 25 minutes. Mitigation: review AI grouping rules monthly; track 'incidents where AI grouped wrong' as an explicit metric.

For SRE work specifically — production-mutation operations, on-call decision-making, security-sensitive infrastructure changes — the rule is unambiguous: AI drafts, plans, and explains; humans review and execute. Engineers who use AI well author Terraform faster, write postmortems sharper, and pattern-match through observability data more efficiently. Engineers who use AI poorly let an agent auto-apply a Terraform change over coffee and are debugging a production outage by lunch. The difference is the guardrails.

Frequently asked questions

Which AI tools should an SRE learn first in 2026?
Cursor and Claude Code are the foundational pair. Cursor handles Terraform, Helm, and Kubernetes manifest authoring with repo-indexed multi-file context — the killer feature for infra-as-code work. Claude Code handles tasks that span verification loops: runbook drafting from code, postmortem authoring from incident timelines, generating one-off remediation scripts. Layer in your observability platform's AI features (Honeycomb AI if you are on Honeycomb, Datadog Bits if Datadog) for incident investigation, and PagerDuty AI Operations for alert grouping if your page volume justifies it.
Can AI write production-quality Terraform?
AI is excellent at Terraform module scaffolding, variable and output blocks, and conventional resource definitions; it is unreliable at IAM policy structure, network-security configuration, and stateful-resource lifecycle (replace vs in-place update). The pattern: let AI scaffold the module, run `terraform validate` and `terraform plan`, then read every resource in the plan output line by line before applying. Never enable auto-apply on infrastructure pull requests; require two human reviewers, one of whom is a senior SRE. The cost of a wrong `terraform apply` against production is unbounded.
How do I use Claude Code for postmortems without producing useless boilerplate?
Feed it the actual incident artifacts — Slack channel export, PagerDuty timeline, dashboard screenshots — not the abstract idea of the incident. Use the team's postmortem template explicitly. Require the AI to mark anything it cannot verify as [VERIFY] so the engineer can fill it in. Critically, rewrite the root-cause section yourself: AI cannot reason about distributed-systems edge cases as well as the engineer who fought the incident. Treat the AI output as a first draft, not a final document. The Etsy and Google SRE doctrine on blameless postmortems still governs the content.
Should I let PagerDuty AI auto-route on-call pages?
Yes for alert grouping and noise suppression on services with high alert volume; no for the actual on-call decision. AI Operations is excellent at recognizing that ten related alerts belong to one incident and at suppressing transient flaps that resolve in under five minutes. It is unreliable at routing pages to the right responder when service ownership is ambiguous or when an incident spans multiple service domains. Treat PagerDuty AI as a triage assistant; the incident commander is always human. Track 'incidents where AI grouped wrong' as an explicit metric and review monthly.
Can AI generate PromQL safely?
Yes, AI-generated PromQL is one of the highest-leverage SRE workflows in 2026. AI translates 'top 10 services by 5-minute 5xx rate' or 'p99 latency for checkout grouped by region' into correct queries reliably. Two failure modes to guard against: `sum by` versus `sum without` confusion (silently wrong cardinality) and rate-window choice (5m vs 1m changes signal-to-noise dramatically). Always graph an AI-generated query for at least an hour before pasting it into an alerting rule. Never paste an AI-generated `for:` duration or alerting threshold without calibrating against historical data.
What is the single hardest SRE guardrail to enforce on AI?
No auto-apply. The pull toward letting AI run `terraform apply`, `kubectl apply`, or `helm upgrade` after a clean plan is significant — the diff looks fine, the plan is green, the AI is confident. But infra mutations are irreversible in the general case, and AI confidence is not calibrated against blast radius. The discipline: AI authors and plans; humans review and apply. Auto-merge on infrastructure pull requests is a bug, not a feature, regardless of how clean the AI-generated PR looks. Two human reviewers, one senior SRE, every time.

Sources

  1. Anthropic — Claude Code launch and capabilities. Canonical for the agent-style CLI workflow used in SRE for runbook + postmortem authoring.
  2. Cursor — Features. Canonical for the AI-first IDE used by senior SREs for Terraform, Helm, and Kubernetes manifest authoring in 2026.
  3. Honeycomb — Engineering Blog. Canonical for Honeycomb AI, BubbleUp, and the high-cardinality observability patterns AI assistance amplifies.
  4. Datadog — Engineering Blog. Canonical for Datadog Bits AI, AI-suggested investigation paths, and APM service-graph dependency analysis.
  5. PagerDuty — Engineering Blog. Canonical for PagerDuty AI Operations, alert grouping, noise suppression, and on-call automation patterns.
  6. Stack Overflow 2024 Developer Survey. Reports 76% of professional developers using or planning to use AI tools — establishes the baseline for AI adoption among SREs.

About the author. Blake Crosley founded ResumeGeni and writes about site reliability engineering, hiring technology, and ATS optimization. More writing at blakecrosley.com.