NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you!
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for over 30 years. It’s a unique legacy of innovation that’s fueled by phenomenal technology and outstanding people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAn, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the AI Infrastructure Production Engineering team and see how you can make a lasting impact on the world.
What You Will Be Doing:
Develop and maintain large-scale systems supporting critical use cases for AI Infrastructure, driving reliability, operability, and scalability across global public and private clouds.
Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead.
Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution, driving continuous improvement in system performance.
Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems to improve team efficiency and system resilience.
Work with engineering teams to deliver innovative solutions, mentor peers, uphold high standards for code and infrastructure, and contribute to hiring for a diverse, high-performing team.
What We Need to See:
Degree in Computer Science or a related field (or equivalent experience), with 8+ years in Software Development, SRE, or Production Engineering.
Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, OCI, Azure, GCP).
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, plus experience with Infrastructure as Code tools (e.g., Terraform CDK).
Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
Commitment to fostering a culture of diversity, curiosity, and continuous improvement.
Ways to stand out from the crowd:
Experience in AI training, inferencing, and data infrastructure services.
Proficiency in deep learning frameworks like PyTorch, TensorFlow, JAX, and Ray.
A strong background in hardware health monitoring and system reliability.
Hands-on expertise in operating and scaling distributed systems with stringent SLAs, ensuring high availability and performance.
Proven experience in incident, change, and problem management processes, fostering continuous improvement in sophisticated environments.
You will also be eligible for equity and benefits.
NVIDIA is a publicly traded, multinational technology company headquartered in Santa Clara, California. NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, and ignited the era of modern AI.