NVIDIA is at the forefront of innovations in Artificial Intelligence, High-Performance Computing, and Visualization. Our invention—the GPU—functions as the visual cortex of modern computing and is central to groundbreaking applications from generative AI to autonomous vehicles. We are now looking for a Principal Full Stack Software Engineer to help accelerate the next era of machine learning innovation.
In this role, you will propose and implement engineering solutions that deliver functional, reliable, secure, and performance-optimal GPU clusters to internal researchers. These solutions should let researchers focus on training and development by reducing operational disruption and overhead, and empower them to continuously improve reliability, operational excellence, and performance through self-service. Your work will empower scientists and engineers to train, fine-tune, and deploy the most advanced ML models on some of the world’s most powerful GPU systems.
What You'll Be Doing:
In this position, you will work with coworkers across the Managed AI Research Supercluster organization to understand the pain points of validating, monitoring and operating GPU clusters at scale. Then you will design, develop and maintain engineering solutions to solve those pain points systematically.
You will also research traditional AIOps and emerging agentic AI techniques, and apply them to further reduce operational toil.
You will participate in on-call support for the systems and platforms built and owned by the team.
What We Need To See:
BS/MS in Computer Science, Engineering, or equivalent experience.
15+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed systems.
Proficiency in full-stack development: relational data modeling, database optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service.
Experience in software development lifecycle on Linux-based platforms.
Strong coding skills in languages such as Python, C++ or Rust.
Experience applying AIOps or agentic AI successfully in production environments.
Experience with Docker, Kubernetes, GitLab CI, and automated deployments.
Ways To Stand Out From The Crowd:
Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.
Experience running Slurm or custom scheduling frameworks in production ML environments.
Experience with ML orchestration tools such as Kubeflow, Flyte, Airflow, or Ray.
You will also be eligible for equity and benefits.
NVIDIA is a publicly traded, multinational technology company headquartered in Santa Clara, California. NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, and ignited the era of modern AI.