AI Platforms Leader, Enterprise AI Platforms
Own the AI Platform strategy & roadmap
Define the multi‑year vision for a multi‑tenant, hybrid (on‑prem + cloud) AI platform, aligned to business needs, developer productivity, and cost efficiency.
Establish clear platform SLAs/SLOs, reliability goals, and security/compliance guardrails.
Operate and optimize on‑prem GPU clusters (e.g., Kubernetes + GPU operator and/or Slurm), including capacity planning, scheduling, partitioning, NCCL, and high‑throughput storage/networking.

Deliver MLOps & LLMOps as a product
Provide golden paths for data prep, training/fine‑tuning, model registry, lineage, governance, evaluation, red‑teaming, and safe deployment (batch, online, streaming).
Implement CI/CD for models, prompts, and agents; automate evaluations and rollout/rollback with canaries, A/B, and shadow deployments.

Establish multi‑cloud patterns for portability, resilience, and vendor risk management.
Manage strategic vendors (e.g., cloud, GPU, enterprise AI tooling) and negotiate licenses/SOWs aligned to roadmap and budget.
Lead a ~10‑engineer global team (platform, SRE, MLOps/LLMOps) with global collaboration, 24×7 readiness, and a healthy on‑call rotation.
Drive incident response, post‑mortems, and continuous improvement.
Partner with Security, Legal, and Compliance for model/data governance.

15+ years overall engineering/technology experience, including ~10 years building and operating large‑scale platforms (AI/ML, data, or high‑performance computing).
Leadership: Proven experience leading a team of ~10 engineers for 5+ years, across platform/SRE/MLOps/LLMOps, with coaching, hiring, performance management, and clear execution rhythms.
GPU cluster expertise: Hands‑on operations for on‑prem GPU clusters (Kubernetes + GPU operator and/or Slurm), scheduling, capacity planning, performance tuning, and reliability.
MLOps & LLMOps: Strong experience with model lifecycle (data → training → registry → deployment), model/agent evaluation, safety/guardrails, and observability.
Cloud (AWS/GCP/Azure): Deep experience with AI/ML services and managed Kubernetes (EKS/AKS/GKE), networking, security, identity, and cost management.
DevOps/Platform Engineering: CI/CD, GitOps, IaC (Terraform/Bicep/Helm), containerization (Docker), Kubernetes, and secure SDLC practices.
Agentic AI & MCP: Solid understanding of agent orchestration, A2A patterns, tool abstractions, and operating MCP servers in production.
Operational excellence: Demonstrated success running AI or computing clusters with SLOs, on‑call, incident management, and post‑mortems.
Global collaboration: Experience leading a distributed engineering team across time zones.
Education: Bachelor's degree in Engineering, Computer Science, or related field.

Master's or PhD in CS/EE/Math or related field.
Training & Inference stacks: PyTorch, CUDA/cuDNN, Triton Inference Server, vLLM, KServe, Ray, Slurm.
Data & storage: High‑throughput storage (e.g., Lustre, BeeGFS, Ceph), vector databases (e.g., FAISS, Milvus, Pinecone, Azure AI Search), feature stores (e.g., Feast).
MLOps toolchain: MLflow/Vertex/Azure ML/SageMaker registries, Airflow/Argo, Weights & Biases, LangSmith, prompt/version management.
Security & governance: OIDC/RBAC, policy as code (OPA), secrets management (AWS Secrets Manager/Azure Key Vault), model governance/risk controls, privacy/PII safeguards.
Agentic frameworks: Semantic Kernel, LangChain, CrewAI, AutoGen (or equivalents) and experience integrating enterprise tools via MCP.
Proven track record shipping platform capabilities that enable multiple product teams (self‑service, docs, SDKs, templates, golden paths).
Strong communication with executives and technical leaders; clear metrics, dashboards, and business value storytelling.

Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 8+ years of Software Engineering or related work experience.
OR Master's degree in Engineering, Information Systems, Computer Science, or related field and 7+ years of Software Engineering or related work experience.
OR PhD in Engineering, Information Systems, Computer Science, or related field and 6+ years of Software Engineering or related work experience.

4+ years of work experience with a programming language such as C, C++, Java, Python, etc.