ML Platform Software Engineer - Timisoara, Romania
Bachelor's degree in Engineering, Information Systems, Computer Science, or related field and 4+ years of Software Engineering or related work experience.
OR Master's degree in Engineering, Information Systems, Computer Science, or related field and 3+ years of Software Engineering or related work experience.
OR PhD in Engineering, Information Systems, Computer Science, or related field and 2+ years of Software Engineering or related work experience.
2+ years of work experience with a programming language such as C, C++, Java, Python, etc.

Bachelor's or Master's degree in Computer Science, Engineering, or related field.
Strong proficiency in Python and at least one of Go, C++, Rust, or Bash.
Solid grasp of data structures, algorithms, concurrency, networking, and distributed systems.
Experience with Kubernetes, Helm, Argo Workflows, ArgoCD, and Docker.
Ability to pass coding assessments demonstrating problem-solving and clean code practices.
Exposure to AWS (EKS, EC2, VPC, IAM, S3, EFS, Batch) and CI/CD automation.

Nice-to-have but not required:
Hands-on experience with GPU clusters and ML frameworks (TensorFlow, PyTorch).
Proven experience building large-scale data pipelines (batch/streaming), data orchestration, and storage patterns (S3, EFS, Parquet/ORC).
Familiarity with observability stacks (Prometheus/Grafana), ELK/OpenSearch, and CloudWatch.
Knowledge of ML optimization techniques, GPU memory management, and model deployment at scale.
Experience with security and compliance for ML/data platforms (IAM, policies, isolation).
Prior contributions to platform services (custom controllers/operators, plugins) and developer tooling.
Ability to mentor and influence technical direction while remaining hands-on.

A hands-on technical leader who thrives on solving complex, system-level problems and can execute independently.
Excellent communication, cross-functional collaboration, and pragmatic delivery focus.
Passion for building robust, scalable platforms that accelerate ML and data innovation.

Architect and develop core components of the ML platform and data infrastructure for training, inference, and large-scale data processing.
Design and implement scalable solutions for GPU clusters and distributed data pipelines on-prem and in AWS.
Lead project workstreams, ensuring timely delivery and alignment with the platform roadmap; operate independently and drive outcomes end-to-end.
Build and optimize data pipelines for ingestion, transformation, storage, and retrieval supporting ML workflows (batch and streaming).
Write clean, efficient, and maintainable code (services, operators, automation, tooling) in multiple programming languages.
Collaborate with data science and engineering teams to integrate ML and data workflows seamlessly (feature stores, model registries, artifact stores).
Implement CI/CD for ML and data workflows using Argo Workflows, ArgoCD, and GitHub Actions; champion testability and reproducibility.
Maintain observability (Prometheus, Grafana) and logging (AWS CloudWatch, ELK/OpenSearch); drive SLOs, tracing, and cost-awareness.
Operate AWS services (EKS, EC2, VPC, IAM, S3, EFS, Batch) across hybrid environments; contribute to security and compliance controls.
Continuously improve platform reliability, performance (GPU utilization, throughput), and developer experience; stay current on modern MLOps and data engineering practices.