Job details

AI and ML Infra Software Engineer, GPU Clusters

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It's a unique legacy of innovation that's fueled by great technology and amazing people. Today, we're tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what's never been done before takes vision, innovation, and the world's best talent. As an NVIDIAN, you'll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are currently hiring an AI/ML Infrastructure Software Engineer at NVIDIA to join our Hardware Infrastructure team. As an Engineer, you will play a crucial role in boosting productivity for our researchers through implementing advancements across the entire stack. Your primary responsibility will involve working closely with customers to identify and resolve infrastructure gaps, enabling innovative AI and ML research on GPU Clusters. Together, we can create powerful, efficient, and scalable solutions as we shape the future of AI/ML technology!

What you will be doing:

Collaborate closely with our AI and ML research teams to understand their infrastructure needs and obstacles, translating those observations into actionable improvements.
Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
Help define and improve important measures of AI researcher efficiency, ensuring that our actions are in line with measurable results.
Collaborate with diverse teams, including researchers, data engineers, and DevOps professionals, to build a seamless and coordinated AI/ML infrastructure ecosystem.
Stay on top of the latest advancements in AI/ML technologies, frameworks, and effective strategies, and promote their implementation within the company.

What we need to see:

BS or equivalent experience in Computer Science or related field, with 8+ years of proven experience in AI/ML and HPC workloads and infrastructure.
Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure, as well as in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling and orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (e.g., Docker, Enroot).
Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX, along with a deep understanding of AI/ML workflows, encompassing data processing, model training, and inference pipelines.
Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms.
Passion for continual learning and keeping abreast of new technologies and effective approaches in the AI/ML infrastructure field.
Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.

NVIDIA provides competitive salaries and a comprehensive benefits package. Our engineering teams are expanding rapidly due to exceptional growth. If you're a passionate and independent engineer with a love for technology, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 31, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA

NVIDIA is a publicly traded, multinational technology company headquartered in Santa Clara, California. NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, and ignited the era of modern AI.
