Engineer, Staff
Design, implement, and maintain large-scale Kubernetes clusters optimized for AI inference workloads, with a focus on performance, reliability, and scalability across cloud environments
Deploy and manage containerized AI services using Docker, Kubernetes, and KServe (or similar ML serving platforms), ensuring high availability and optimal resource utilization
Write production-quality Python code to build automation tools, frameworks, and infrastructure management solutions that eliminate manual processes and improve operational efficiency
Lead triaging efforts for complex production incidents, performing deep-dive analysis to identify root causes and implement permanent fixes
Debug sophisticated deployment scenarios at multiple levels - from the application layer through container orchestration to Linux OS and hardware interfaces
Support the full lifecycle of AI inference services - from design and capacity planning through deployment, operation, optimization, and continuous refinement
Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, or similar technologies to ensure reproducible and version-controlled infrastructure
Collaborate with ML engineers, software developers, and infrastructure teams to optimize AI workload deployment and performance

Experience with cloud platforms (AWS, Azure, GCP, or private cloud) and cloud-native architectures
Experience with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or similar platforms
Experience deploying and managing AI/ML inference systems or model serving platforms (KServe, TorchServe, TensorFlow Serving, Triton Inference Server)
Experience with capacity planning and performance optimization for high-throughput systems

Automation-First Mindset: Passion for eliminating repetitive manual work through intelligent automation
Systems Thinking: Ability to understand how complex distributed systems interact and impact each other
Ownership and Accountability: Taking end-to-end responsibility for services and their

Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 4+ years of Software Engineering or related work experience.
OR
Master's degree in Engineering, Information Systems, Computer Science, or related field and 3+ years of Software Engineering or related work experience.
OR
PhD in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.

2+ years of work experience with a programming language such as C, C++, Java, Python, etc.