AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.

About The Role

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.

You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a proactive, dependable problem-solver with expertise in large-scale compute infrastructure and an advocate for customer success.

Responsibilities

Manage and operate multiple advanced AI compute infrastructure clusters.
Monitor and oversee cluster health, proactively identifying and resolving potential issues.
Maximize compute capacity through optimization and efficient resource allocation.
Deploy, configure, and debug container-based services using Docker.
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
Contribute to the development and improvement of our monitoring and support processes.
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.

Skills And Requirements

6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
Strong proficiency in Python scripting for automation and system administration.
Deep understanding of Linux-based compute systems and command-line tools.
Extensive knowledge of Docker containers and of orchestration and workload-scheduling platforms such as Kubernetes (k8s) and SLURM.
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
Experience with monitoring and alerting systems.
Proven track record of owning challenges and driving them to completion.
Excellent communication and collaboration skills.
Ability to work effectively in a fast-paced environment.
Willingness to participate in a 24/7 on-call rotation.

Preferred Skills And Requirements

Experience operating large-scale GPU clusters.
Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
Familiarity with machine learning frameworks and tools.
Experience with cross-functional team projects.

Location

SF Bay Area.
Toronto, Canada.
Bangalore, India.

Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

Build a breakthrough AI platform beyond the constraints of the GPU.
Publish and open source their cutting-edge AI research.
Work on one of the fastest AI supercomputers in the world.
Enjoy job stability with startup vitality.
Enjoy a simple, non-corporate work culture that respects individual beliefs.

Read our blog: Five Reasons to Join Cerebras in 2025.

Apply today and join the forefront of groundbreaking advancements in AI!

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.



Average salary estimate

$190,000 / yearly (est.)

Min: $160,000

Max: $220,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.
