
AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.


Cerebras' current customers include global corporations across multiple industries, national labs, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields. In August, we launched Cerebras Inference, the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services.


About The Role


We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role, you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.


You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, along with experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and an advocate for customer success.


Responsibilities

  • Manage and operate multiple advanced AI compute infrastructure clusters.
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues.
  • Maximize compute capacity through optimization and efficient resource allocation.
  • Deploy, configure, and debug container-based services using Docker.
  • Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.
  • Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.
  • Contribute to the development and improvement of our monitoring and support processes.
  • Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies.


Skills And Requirements

  • 6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.
  • Strong proficiency in Python scripting for automation and system administration.
  • Deep understanding of Linux-based compute systems and command-line tools.
  • Extensive knowledge of Docker containers, container orchestration platforms such as Kubernetes (k8s), and workload managers such as Slurm.
  • Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.
  • Experience with monitoring and alerting systems.
  • Proven track record of owning challenges and driving them to completion.
  • Excellent communication and collaboration skills.
  • Ability to work effectively in a fast-paced environment.
  • Willingness to participate in a 24/7 on-call rotation.


Preferred Skills And Requirements

  • Experience operating large-scale GPU clusters.
  • Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP.
  • Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).
  • Familiarity with machine learning frameworks and tools.
  • Experience with cross-functional team projects.


Location

  • SF Bay Area.
  • Toronto, Canada.
  • Bangalore, India.


Why Join Cerebras


People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we've reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

  1. Build a breakthrough AI platform beyond the constraints of the GPU.
  2. Publish and open source their cutting-edge AI research.
  3. Work on one of the fastest AI supercomputers in the world.
  4. Enjoy job stability with startup vitality.
  5. Enjoy a simple, non-corporate work culture that respects individual beliefs.


Read our blog: Five Reasons to Join Cerebras in 2025.

Apply today and join the forefront of groundbreaking advancements in AI!


Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.



Average salary estimate

$190,000 / year (est.)
Min: $160,000
Max: $220,000


Employment type: Full-time, onsite
Date posted: August 20, 2025