Infrastructure Systems Engineer
NVIDIA’s Kernel Infrastructure team is looking for a hands-on Systems Engineer to manage the environment readiness, configuration, and long-term health of our next-generation GPU platforms. You will own the key lifecycle phase where early production hardware meets software. Your role ensures our innovative systems are stable, optimized, and continuously maintained for engineering teams.
If you love being hands-on with early-stage computing platforms, debugging complex hardware-to-software environments, and owning the operational stability of fast-evolving infrastructure, join us in Santa Clara, CA.
What you'll be doing:
Early Production Bring-Up & Tuning: Drive early-stage engineering systems to a performance-ready state. Handle firmware/VBIOS flashing, core clock configurations, power-state enablement, and system tuning.
Triage & Cross-Functional Collaboration: Act as the first line of defense for complex system and environment-level issues, coordinating directly with firmware, hardware design, and platform teams to unblock engineering.
Fleet Health & Maintenance: Monitor and optimize the ongoing health of the hardware fleet. Implement proactive health checks, diagnose degrading systems, and provide manual recovery when automated workflows fall short.
Standardization & Allocation: Establish and detail the "golden" system baselines (drivers, firmware, configurations) required for stable engineering execution as the product evolves. Track hardware inventory and manage demands from engineering teams to improve hardware utilization.
What we need to see:
Degree in Computer Engineering, Electrical Engineering, Computer Science, or equivalent experience.
3+ years in systems engineering, infrastructure operations, or hardware validation environments handling early-stage platforms.
Deep Linux and Windows system administration experience, with strong debugging skills across the hardware-to-software stack.
Proficiency in scripting and automation (Shell scripting, Python, Ansible, etc.).
Hands-on experience with Slurm, Kubernetes, or other cluster management platforms.
Strong, clear written and verbal communication skills, including the ability to explain complex technical concepts to non-technical audiences.
Strong problem-solving skills and a collaborative approach.
A self-motivated individual and a great teammate.
Ways to stand out from the crowd:
Experience managing HPC clusters at scale.
A proven track record of configuring and maintaining bring-up systems and early hardware prototypes.
Demonstrated technical curiosity and a drive to innovate.
Mechanically inclined and comfortable with tools and hands-on physical work.
Positive and cooperative, with the determination to help us reach the finish line.
#LI-Hybrid
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 124,000 USD - 195,500 USD. You will also be eligible for equity and benefits.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.