Browse 26 exciting Distributed Training jobs hiring now. Companies hiring include Sciforium, Abridge, and Quizlet, in locations such as Buffalo, Houston, and Irvine.
Sciforium is hiring a Distributed Training Engineer to own and optimize the full ML training stack — from drivers and kernels to JAX/PyTorch — enabling large-scale training and deployment of next-generation LLMs.
Abridge is hiring a Head of AI Platform to lead the team building scalable, secure ML infrastructure and model-serving systems that power its generative-AI healthcare products.
Lead the design and production of large-scale personalization and recommendation systems at Quizlet to improve learning outcomes for millions of students.
An ML Researcher role at Prima Mente to develop and evaluate large-scale biological foundation models that translate into real clinical and mechanistic insights across neuroscience and complex disease.
Lead the ML stack as a founding Machine Learning Engineer at a stealth, self-funded AI group, defining models, training pipelines, and scalable inference for a global consumer product.
Work with NomadicML founders to train and productionize large-scale vision-language models that reason about motion in real-world video for autonomy and robotics.
Senior ML Systems Engineer to own and evolve the training framework and tooling that enables reliable, high-performance large-scale LLM training.
Lead and build the ML cloud platform at an early-stage AI startup in San Francisco, owning end-to-end infrastructure for training and deploying large-scale physics models while remaining deeply technical and customer-facing.
Lead the design and scaling of high-performance ML infrastructure for large generative and predictive molecular AI models, working at the intersection of ML, physics, and computational chemistry.
Quizlet is hiring a Staff Machine Learning Engineer to lead the design and production of scalable personalization and recommendation systems that improve learner engagement across its platform.
Senior Machine Learning Platform Engineer to design and optimize feature pipelines, distributed training, and low-latency inference systems for a remote US team building production ML infrastructure.
NVIDIA seeks a Principal Product Manager to lead AI training and post-training frameworks, building SDKs and tools that maximize large-scale model performance on NVIDIA GPUs.
Basis is hiring an experienced ML Systems Engineer to build, operate, and optimize distributed training and cloud infrastructure that enables large-scale, reproducible AI research.
Lead the architecture and execution of a high-throughput, low-latency ML and simulations platform that enables large-scale model training, inference, and simulation-driven product development.
Lead development of production-grade large-scale ML and deep learning models at a remote-first AI company, turning research innovations into scalable product solutions.
Lead the design and operation of multi-cloud GPU/TPU HPC infrastructure to enable scalable, high-performance distributed training for cutting-edge AI research and products.
Early-career ML Operations / Full-Stack Engineer to help design, deploy, and optimize scalable model serving and training infrastructure for Abridge’s AI-driven healthcare platform.
Coinbase is hiring a Machine Learning Platform Engineer to design and operate low-latency inference, streaming pipelines, and distributed training infrastructure that powers fraud detection, personalization, and blockchain analysis.
Cohere is hiring a Staff Software Engineer to build and operate ML-optimized HPC infrastructure (Kubernetes-based GPU/TPU superclusters) that accelerates research and production training of large AI models.
Build and scale Whatnot's ML infrastructure to productionize cutting-edge models, enable low-latency LLM serving, and support distributed training at consumer scale.
OpenAI's Sora team is hiring a Software Engineer to design and scale distributed data infrastructure that powers large-scale multimodal training and evaluation.
Lead the design and implementation of petabyte-scale training, fine-tuning, and low-latency model serving infrastructure for an early-stage startup building foundation models for physics.
Lead end-to-end development of large-scale AI and deep learning solutions at Thomson Reuters Labs, driving production-grade LLM, retrieval, and data-pipeline capabilities across legal and news products.
OpenBlock Labs is hiring a Research Engineer to develop reinforcement learning and continual learning systems that enable coding agents to learn and adapt from developer feedback and real codebases.
Lead the ML Training Platform engineering team at Zoox to design, operate, and scale distributed model training and lifecycle systems for autonomous driving.
Lead research to design and scale models that power a production web index for AI-first APIs at a fast-moving Bay Area startup.
Jobs by salary: Below 50k: 0 | 50k-100k: 0 | Over 100k: 24