Browse 14 Production Reliability jobs open now. Companies hiring include Crusoe, MLabs, and Fiddler AI, in locations including Minneapolis, Honolulu, and Riverside.
Work on Crusoe’s fleet operations to automate server provisioning, troubleshoot GPU hardware, and help transition infrastructure to Kubernetes for large-scale deployments.
Senior Backend Engineer needed to build and operate high-reliability TypeScript backend services for a mission-driven stablecoin infrastructure startup operating on an EST remote schedule.
Fiddler is hiring a Staff AI Software Engineer to architect and build scalable observability and evaluation systems for LLMs, GenAI, and agentic applications while mentoring engineering teams.
Lead the integration of third-party infrastructure providers into NVIDIA’s operational systems and shape robustness for DGX Cloud as a Senior AI Infrastructure Engineer focused on cloud partnerships.
Lead the architecture and operational strategy for Lightspark’s global production infrastructure to ensure secure, reliable, and scalable payment systems.
ServiceNow is hiring a Staff Software Engineer to build and operate production-grade database provisioning tools that ensure reliable, scalable database operations across global data centers.
Serve as the primary production support engineer for a remote US team, managing incidents, optimizing system performance, and working with cross-functional teams to improve reliability and scalability.
Lead SRE efforts as the founding Site Reliability Engineer at a fast-growing AI company, building scalable, secure, and observable AWS infrastructure and processes.
Applied Labs seeks a Founding Engineer in New York City to design, ship, and own production-grade AI agent systems focused on reliability and real-world performance.
Netflix is looking for a Site Reliability Engineer (L4) to enhance resilience, automation, and incident response for its streaming infrastructure in a remote role.
Netflix seeks a Systems Development Engineer to design, automate, and operate scalable infrastructure and tooling that powers global content production workflows.
Drive platform reliability and observability at Forbright as a Senior SRE, building automated, resilient cloud systems that support the bank's digital banking and commercial lending services.
Lead end-to-end model development and production deployment for fault prediction and autonomous repair as a Model Engineer on the founding AI team of a Series C networking startup in San Francisco.
Lead reliability and automation efforts for ServiceNow cloud operations in a Staff Production Service Engineer role supporting US Public Sector customers from the Orlando office.
Below 50k*: 0 | 50k-100k*: 0 | Over 100k*: 1