Site Reliability Engineer - SaaSOps
About the Role:
Responsibilities:
- Define and embed SRE best practices across the SaaS platform, ensuring reliability is built into the system from the ground up.
- Establish and maintain meaningful SLAs, SLIs, SLOs, and error budgets to protect customer experience and guide engineering priorities.
- Design and continuously improve high-availability and disaster recovery strategies.
- Automate manual processes, manage incident response, and optimize performance against SLIs and SLOs.
- Bridge the gap between development and IT operations.
- Ensure strong tenant isolation and consistent performance within a DB-per-tenant architecture.
- Strengthen system resiliency across both Azure and on-prem deployments in our hybrid environment.
- Lead incident response efforts with structured troubleshooting and clear communication.
- Drive thorough root cause analysis (RCA) and conduct blameless postmortems focused on long-term improvements.
- Translate incidents into systemic fixes rather than temporary patches.
- Develop and maintain operational runbooks to standardize responses.
- Design and maintain a comprehensive observability framework for both cloud and on-prem environments.
Requirements:
- Must have at least 3 years of hands-on experience in Site Reliability Engineering (SRE), supporting production-grade, cloud-native enterprise software platforms and applications.
- Prior experience as a DevOps engineer, cloud system administrator or software developer.
- Strong proficiency in scripting languages such as Python and PowerShell.
- Deep hands-on experience working with Microsoft Azure in production environments.
- Possess a solid understanding of Terraform, Ansible, and Kubernetes internals, including networking, scheduling, scaling, and resource management.
- Have proven experience in PostgreSQL performance tuning and optimization in production systems.
- Demonstrate hands-on experience with Azure Monitor, Application Insights, and Log Analytics for cloud-based observability.
- Have experience implementing and managing Prometheus and Grafana for Kubernetes and on-prem monitoring.
- Understand how to turn metrics, logs, and traces into actionable insights that improve reliability and performance.
- Be able to troubleshoot and improve CI/CD pipelines to ensure stable and predictable releases.