Job details

Machine Learning Infrastructure Engineer

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll work on the core of cutting-edge foundation model training alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will help develop groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

We're looking for a distributed ML infrastructure engineer to help extend and scale our training systems. You’ll work side-by-side with world-class researchers and engineers to:

• Extend distributed training frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Implement distributed optimizers from mathematical specs

• Build robust config + launch systems across multi-node, multi-GPU clusters

• Own experiment tracking, metrics logging, and job monitoring for external visibility

• Improve training system reliability, maintainability, and performance

While much of the work will support large-scale pre-training, pre-training experience is not required. Strong infrastructure and systems experience is what we value most.

Key Responsibilities

• Distributed Framework Ownership – Extend or modify training frameworks (e.g., DeepSpeed, FSDP) to support new use cases and architectures.

• Optimizer Implementation – Translate mathematical optimizer specs into distributed implementations.

• Launch Config & Debugging – Create and debug multi-node launch scripts with flexible batch sizes, parallelism strategies, and hardware targets.

• Metrics & Monitoring – Build systems for experiment tracking, job monitoring, and logging usable by collaborators and researchers.

• Infra Engineering – Write production-quality code and tests for ML infra in PyTorch or JAX; ensure reliability and maintainability at scale.

Qualifications

Must-Haves:

• 5+ years of experience in ML systems, infra, or distributed training

• Experience modifying distributed ML frameworks (e.g., DeepSpeed, FSDP, FairScale, Horovod)

• Strong software engineering fundamentals (Python, systems design, testing)

• Proven multi-node experience (e.g., Slurm, Kubernetes, Ray) and skill in debugging communication backends (e.g., NCCL, Gloo)

• Ability to implement algorithms across GPUs/nodes based on mathematical specs

• Experience working on an ML platform or infrastructure team and/or a distributed inference optimization team

• Experience with large-scale machine learning workloads (strong ML fundamentals)

Nice-to-Haves:

• Exposure to mixed-precision training (e.g., bf16, fp8) with accuracy validation

• Familiarity with performance profiling, kernel fusion, or memory optimization

• Open-source contributions or published research (MLSys, ICML, NeurIPS)

• CUDA or Triton kernel experience

• Experience with large-scale pre-training

• Experience building custom training pipelines at scale and modifying them for custom needs

• Deep familiarity with training infrastructure and performance tuning

Total compensation target: $300,000–$600,000 (inclusive of base salary and target bonus of up to 30%), commensurate with experience.

• Comprehensive medical, dental, and vision

• 401(k) program

• Generous PTO, sick leave, and holidays

• Paid parental leave and family-friendly benefits

• On-site amenities and perks: Complimentary lunch, gym access, and a short walk to the Sunnyvale Caltrain station

FUNDING

Private

DEPARTMENTS

Engineering

SENIORITY LEVEL REQUIREMENT

Senior Level
