We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position requires presence in our San Francisco or Seattle office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
The Lambda Observability team builds and operates large scale monitoring systems for our AI cloud product suite. We deploy observability solutions across the stack, from datacenter infrastructure to our in-house software stack. Keeping those offerings reliable and instantly detecting issues in the latest high-performance AI clusters is what makes us tick.
Along with the Platform Engineering organization, we help to build the foundations that unlock product excellence and a highly reliable experience for our customers.
Our expertise lies at the intersection of:
Scalable Observability Platforms: We build and operate mission-critical platforms for metrics, logs, and traces based on both open-source software and systems developed in-house.
AI Infrastructure Observability: We design observability solutions for large-scale AI clusters running the latest GPU, Networking, and Storage technologies.
Observability Practices: We engage across the company to promote best practices, help teams adopt our platforms, and enable applications that require observability data.
About the Role:
We are seeking a seasoned Observability Engineering Manager with deep experience in development and operation of modern observability platforms. You will hire and guide a team of observability engineers in building out critical pillars of our internal observability stack. You will lead the team in building monitoring solutions for new products, and in measuring and reporting the availability of our products.
Your role is not just to manage people, but to coordinate the delivery of observability solutions to customers inside and outside Lambda. Your leadership will be pivotal in ensuring our ability to deliver a high-quality, reliable product experience.
This is a unique opportunity to work at the intersection of large-scale observability systems and the rapidly evolving field of artificial intelligence infrastructure. You will be building the systems that monitor some of the world’s most advanced AI solutions.
What You’ll Do
Team Leadership & Management:
Hire, grow, lead, and mentor a team of high-performing observability engineers and SREs.
Foster a culture of technical excellence, collaboration, and customer service.
Conduct regular one-on-one meetings, provide constructive feedback, and support career development for team members.
Drive outcomes by managing project priorities, deadlines, and deliverables.
Technical Strategy & Execution:
Work with the engineering team to drive strategy for Lambda's internal and customer-facing observability solutions.
Improve observability of AI infrastructure and develop new monitoring solutions as new products are introduced.
Lead the broader engineering organization in adoption of Observability and SRE practices.
Manage costs of both vendors and internally developed platforms.
Lead the team in the continued development of our existing Metrics solutions based on the Prometheus and OpenTelemetry ecosystems.
Lead the team in delivering new Logging and Tracing solutions based on ClickHouse.
Guide team in problem identification, requirements gathering, solution ideation, and stakeholder alignment on engineering RFCs.
Participate in design of solutions for bringing observability data to our customers.
Identify gaps in our observability posture and drive resolution.
Lead the team in supporting internal customers from across Lambda engineering.
Cross-Functional Collaboration:
Collaborate with the infrastructure and HPC teams on infrastructure monitoring and alerting.
Work closely with Lambda product engineering teams on instrumentation and best-practice usage of our platforms.
Work to understand the needs of engineering teams and drive our Observability solutions towards self-service.
Manage a short list of vendors that provide SaaS solutions in the monitoring space.
You
Experience:
10+ years of experience in observability systems or platform engineering with at least 3 years in a management or lead role.
Demonstrated experience leading a team of engineers and SREs on complex, cross-functional projects in a fast-paced startup environment.
Significant experience in environments that require the monitoring of bare-metal infrastructure is preferred.
Experience with a wide variety of modern open-source observability software.
Strong background in software engineering and the SDLC.
Strong project management skills, leading planning, project execution, and delivery of team outcomes on schedule.
Extensive experience with site reliability engineering and ability to champion improved SRE practices.
Experience building a high-performance team through deliberate hiring, upskilling, performance management, and expectation setting.
Nice to Have
Experience:
Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).
Experience driving organizational improvements (processes, systems, etc.)
Experience with Kubernetes and designing scalable distributed systems.
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About Lambda
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent Cove.
We are experiencing extremely high demand for our systems, with quarter-over-quarter and year-over-year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
A Final Note:
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.
Lambda provides Artificial Intelligence and Machine Learning infrastructure to companies like Apple, Intel, Microsoft, MIT, Harvard, the Federal Government, and the DOD. We're headquartered in the Dogpatch and are a short walk from the 22nd Street ...