Site Reliability Engineer - SaaSOps

Hyderabad April 15, 2026 Full Time Lever

About the Role:

Responsibilities:

Define and embed SRE best practices across the SaaS platform, ensuring reliability is built into the system from the ground up.
Establish and maintain meaningful SLAs, SLIs, SLOs, and error budgets to protect customer experience and guide engineering priorities.
Design and continuously improve high-availability and disaster recovery strategies.
Automate manual processes, manage incident response, and optimize performance (SLI/SLO).
Bridge the gap between development and IT operations.
Ensure strong tenant isolation and consistent performance within a DB-per-tenant architecture.
Strengthen system resiliency across both Azure and on-prem deployments in our hybrid environment.
Lead incident response efforts with structured troubleshooting and clear communication.
Drive thorough root cause analysis (RCA) and conduct blameless postmortems focused on long-term improvements.
Translate incidents into systemic fixes rather than temporary patches.
Develop and maintain operational runbooks to standardize responses.
Design and maintain a comprehensive observability framework for both cloud and on-prem environments.

Requirements:

Must have at least 3 years of hands-on experience in Site Reliability Engineering (SRE), supporting production-grade, cloud-native enterprise software platforms and applications.
Prior experience as a DevOps engineer, cloud system administrator or software developer.
Strong proficiency in scripting languages such as Python and PowerShell.
Deep hands-on experience working with Microsoft Azure in production environments.
Possess a solid understanding of Terraform, Ansible, and Kubernetes internals, including networking, scheduling, scaling, and resource management.
Have proven experience in PostgreSQL performance tuning and optimization in production systems.
Demonstrate hands-on experience with Azure Monitor, Application Insights, and Log Analytics for cloud-based observability.
Have experience implementing and managing Prometheus and Grafana for Kubernetes and on-prem monitoring.
Understand how to turn metrics, logs, and traces into actionable insights that improve reliability and performance.
Have experience troubleshooting and improving CI/CD pipelines to ensure stable and predictable releases.

Tailor your resume to each specific Valgenesis role — Lever applications are evaluated per-position
Valgenesis uses Lever to manage applications; PDF format preserves your formatting through their parser