Site Reliability Engineer III

Bengaluru | April 16, 2026 | Full Time | Workday

As an Observability Engineer on the Site Reliability Engineering team, you will be a crucial part of the team responsible for the availability, performance, and scalability of our cloud platform. You will blend software engineering and systems administration expertise to build and run large-scale, distributed, fault-tolerant systems. Your mission is to ensure our services are reliable and efficient through automation, robust monitoring, and proactive incident response. You will work closely with development teams to build resilient and scalable applications on our Google Cloud Platform (GCP) and Kubernetes-based infrastructure. Strong troubleshooting skills and a methodical approach to problem-solving are a must.

Key Responsibilities:

Infrastructure as Code (IaC): Design, build, and maintain our core cloud infrastructure on GCP using tools like Terraform and Google Config Connector (KCC) within a GitOps framework.
Automation: Use Infrastructure as Code (IaC) with Kubernetes (GKE) and Google Config Connector (KCC). Develop automation scripts and tools (primarily in Python or Go) to reduce operational toil, streamline deployments, and improve system efficiency.
Observability: Implement and manage comprehensive monitoring, logging, and alerting solutions using tools like Prometheus, OpenTelemetry, Grafana, and Google Cloud's operations suite to gain deep insights into system health.
Reliability & SLOs: Define, measure, and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services. Drive initiatives to meet and exceed these objectives. Develop and promote dashboarding and actionable alerting across the organization.
Incident Management: Participate in an on-call rotation to respond to and resolve production incidents. Lead blameless post-mortems to identify root causes and implement lasting solutions.
Collaboration: Partner with software engineering teams throughout the development lifecycle to provide guidance on building reliable, scalable, and secure applications. Help these teams troubleshoot complex issues, improve service performance, and adopt observability best practices.
Enhance Reliability: Analyze observability data to identify trends, uncover potential issues, and drive initiatives to improve system reliability, performance, and cost-efficiency.
Secure and Scale: Manage secrets and system configurations securely using HashiCorp Vault, and ensure the observability platform scales to meet the demands of a growing engineering organization.

Required Qualifications:

Bachelor's degree in computer science, a related technical field, or equivalent practical experience.
3-8 years of experience in a Site Reliability, DevOps, or Software Engineering role.
Strong proficiency in at least one high-level programming language (e.g., Python, Go, Java).
Hands-on experience with cloud platforms, particularly Google Cloud Platform (GCP).
Solid understanding and practical experience with containerization (Docker) and orchestration (Kubernetes).
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Google Config Connector.
Familiarity with CI/CD principles and tools (e.g., GitLab CI, Jenkins).
Knowledge of GitOps principles and tools.
Excellent communication skills and the ability to work effectively in a collaborative team environment.

How to Get Hired at CME Group

Tailor your resume to each specific CME Group role; Workday applications are evaluated per position.
CME Group uses Workday to manage applications; PDF format preserves your formatting through their parser.
