Our team builds, operates, and maintains cloud-hosted services that provide user and service authentication/authorization across NVIDIA. Ensuring continuity of operations is critical to our mission.
We are looking for a highly proficient software engineer with extensive experience in AWS service development, deployment, and observability practices. In this role, you will be responsible for the reliability, performance, and scalability of our services, while providing the team with actionable insights for continuous improvement. You will build, implement, and coordinate observability infrastructure to proactively identify and resolve operational issues across our services.
What you’ll be doing:
Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for our cloud-based services.
Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams.
Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
Collaborate with software, platform, and networking teams to integrate observability at all stages of the application lifecycle, from development to incident response.
Drive automation efforts to reduce manual toil in monitoring, telemetry, and incident response workflows; build and maintain self-service observability tooling.
Address performance and reliability issues by applying root cause analysis, distributed tracing, and log correlation.
Participate in PagerDuty on-call rotations and contribute to post-incident reviews, detailing findings and driving solutions that improve long-term system resilience and visibility.
Develop expertise in the functions and capabilities of our offerings, and assist in managing our support channels for other NVIDIA teams.
What we need to see:
Bachelor’s or master’s degree in computer science, engineering, or equivalent experience in the field.
8+ years in large-scale systems engineering roles, with experience running live services end to end: development, deployment, observability, and on-call.
Hands-on experience with modern monitoring systems (Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry, etc.) within a production environment.
Advanced coding skills in Python, Go, or similar languages for building automation and integrating observability solutions. Comfort with JavaScript frameworks such as React and Next.js.
Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker); experience with configuration-as-code tools (Terraform, Helm, Ansible).
Strong communication and collaboration skills, with experience working in global, cross-disciplinary teams.
Detailed, analytical problem-solving approach and high standards for operational excellence and customer happiness.
Experience with incident management and postmortem processes.
Ways to stand out from the crowd:
Familiarity with the Java Spring Boot framework and hands-on experience with Apache Cassandra and HashiCorp Vault would be very advantageous.
Besides our core duties, our team also manages multiple custom front-end services based on React for admin functions. Having relevant coding experience and being open to supporting development would be a huge plus.
You will also be eligible for equity and benefits.
NVIDIA is a publicly traded, multinational technology company headquartered in Santa Clara, California. NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, and ignited the era of modern AI.