Head of ML Cloud Platform

📍 San Francisco | Work Directly with CEO & Founding Team | Report to CEO | OpenAI for Physics | 🏢 5 Days Onsite

Location: Onsite in San Francisco
Compensation: Competitive Salary + Significant Equity

Who We Are

UniversalAGI is building OpenAI for Physics. We are an AI startup based in San Francisco, backed by Elad Gil (#1 Solo VC), Eric Schmidt (former Google CEO), Prith Banerjee (ANSYS CTO), Ion Stoica (Databricks Founder), Jared Kushner (former Senior Advisor to the President), David Patterson (Turing Award Winner), and Luis Videgaray (former Foreign and Finance Minister of Mexico). We're building foundation AI models for physics that enable end-to-end industrial automation, from initial design through optimization, validation, and production.

We're building a high-velocity team of relentless researchers and engineers that will define the next generation of AI for industrial engineering. If you're passionate about AI, physics, or the future of industrial innovation, we want to hear from you.

About the Role

As the Head of ML Cloud Platform, you'll be in the arena from day one, building and leading the team that creates the backbone for AI-powered physics simulation at scale. This is your chance to own the entire ML infrastructure vision—from training foundation models on petabytes of CFD data to deploying them into mission-critical automotive and maritime production environments.

You'll work directly with the CEO and founding team to build a world-class ML platform organization, recruiting exceptional engineers and researchers while remaining deeply technical yourself. You'll architect systems that train models faster, serve predictions with lower latency, and integrate seamlessly into customers' existing CAE workflows—all while managing a team that ships with the velocity of a startup and the rigor of enterprise infrastructure.

This isn't a pure management role. You're a technical leader who codes, debugs production incidents at 2 AM when needed, and earns respect through hands-on contribution while simultaneously building the team and culture that will scale our platform to serve the world's largest industrial companies.

What You'll Do

Technical Leadership & Architecture

Define the ML platform vision: Architect the end-to-end infrastructure strategy for training, fine-tuning, serving, and deploying foundation models for physics simulation across cloud and on-premise environments
Build for scale and reliability: Design systems that can handle petabyte-scale CFD datasets, multi-day distributed training runs, and real-time inference for customers making million-dollar engineering decisions
Stay hands-on: Write code, debug critical production issues, review pull requests, and make key architectural decisions yourself—you're a technical leader who leads by doing
Bridge research and production: Translate cutting-edge research from our deep learning team into production-grade infrastructure that customers can depend on
Integrate with CAE ecosystems: Ensure our platform works seamlessly with existing simulation tools (Ansys, OpenFOAM, STAR-CCM+), HPC clusters, PLM systems, and enterprise security requirements

Team Building & Management

Recruit world-class talent: Build a team of exceptional ML infrastructure engineers, cloud platform engineers, and MLOps specialists who can execute at the highest level
Develop and mentor: Coach engineers to grow technically and professionally, fostering a culture of deep work, technical excellence, and customer obsession
Scale the organization: Grow the team from founding engineers to a robust platform organization as we scale from early customers to enterprise deployments
Set technical standards: Establish engineering practices, code review processes, and quality bars that enable the team to ship fast without breaking things
Foster collaboration: Work closely with deep learning researchers, product engineers, CFD domain experts, and customer success to ensure platform capabilities align with company needs

Execution & Delivery

Ship relentlessly: Drive the team to deliver infrastructure from prototype to production in weeks, not quarters, iterating based on real customer feedback
Own reliability: Take responsibility for platform uptime, performance, and customer success—when things break, you're in the arena fixing them
Make strategic tradeoffs: Balance innovation with stability, speed with quality, and custom solutions with scalable platforms
Work with customers: Engage directly with automotive and maritime customers to understand their infrastructure requirements, security constraints, and deployment challenges
Build for enterprise: Implement security, compliance, monitoring, and operational practices that meet the standards of Fortune 500 companies

Qualifications

Required Experience

8+ years in ML infrastructure or cloud platform engineering, with at least 3 years in technical leadership roles managing high-performing teams
Proven track record building and scaling ML platforms for training, serving, or deploying models in production environments, ideally at AI-first companies
Deep technical expertise in distributed training (PyTorch Distributed, DeepSpeed, Ray), cloud infrastructure (AWS/GCP/Azure), and containers and orchestration (Docker, Kubernetes)
Hands-on coding ability: Expert-level Python and infrastructure-as-code skills—you can still ship production code yourself and review your team's work deeply
Team building success: Track record of recruiting, developing, and retaining exceptional engineering talent, with experience building teams from 3-4 engineers to 15-20+
Strong product and customer intuition: Experience working closely with customers, understanding their workflows, and translating requirements into technical solutions
Outstanding execution velocity: Proven ability to ship infrastructure rapidly in fast-paced, high-growth environments while maintaining quality

Technical Requirements

ML infrastructure mastery: Deep understanding of training pipelines, model serving, distributed systems, GPU optimization, and the full ML lifecycle
Cloud platform expertise: Strong experience with cloud providers, infrastructure-as-code tools, and building hybrid cloud/on-premise solutions
System design excellence: Can architect complex, scalable systems and make smart tradeoff decisions under uncertainty
Performance optimization: Knowledge of GPU programming, model optimization techniques, and infrastructure cost management
Enterprise infrastructure: Experience with security, compliance, SSO, RBAC, and deploying into regulated or air-gapped environments

Leadership & Communication

Technical credibility: Earns respect through deep technical contribution, not just title or tenure
Clear communicator: Can explain complex technical decisions to customers, executives, researchers, and engineers at all levels
Strategic thinker: Balances short-term execution with long-term platform vision and architectural decisions
Player-coach mentality: Comfortable coding and debugging yourself while also managing, mentoring, and growing a team
High agency: Takes ownership of outcomes, doesn't wait for permission, and drives solutions to completion

Bonus Qualifications

Experience in industrial or scientific ML: Built infrastructure for physics simulation, computational chemistry, drug discovery, or other scientific computing domains
CAE/HPC background: Familiarity with simulation software, job schedulers (SLURM, PBS), parallel file systems, or high-performance computing environments
Founded or led platform teams at AI startups (Seed to Series B) through rapid growth and scaling challenges
Published or presented on ML infrastructure, distributed training, or MLOps topics at major conferences or venues
Experience with foundation models: Built infrastructure for training or serving large-scale pretrained models (LLMs, vision models, multimodal models)
Open-source contributions to major ML infrastructure projects (PyTorch, Ray, Kubernetes, MLflow, etc.)
PhD or MS in Computer Science, ML, or related field (or equivalent industry experience)
Enterprise B2B experience: Sold to or deployed infrastructure for Fortune 500 customers with complex security and compliance requirements

Cultural Fit

Technical Respect: Ability to earn respect through hands-on technical contribution, not just management authority
Intensity: Thrives in our unusually intense culture—willing to grind when needed and expects the same from your team
Customer Obsession: Passionate about solving real customer problems and building infrastructure that enables their success
Deep Work: Values long, uninterrupted periods of focused work and fosters this culture in your team
High Availability: Ready to be deeply involved whenever critical issues arise, whether that's at 2 AM or on weekends
Communication: Can translate complex technical concepts to diverse audiences and bridge engineering, research, and business
Growth Mindset: Embraces continuous learning and develops this mindset in your team
Startup Mindset: Comfortable with ambiguity, rapid change, and wearing multiple hats—you're a builder first, manager second
Work Ethic: Willing to put in the extra hours when needed to hit critical milestones and holds your team to high standards
Low Ego, High Accountability: Collaborative leadership style with focus on outcomes over personal credit

What We Offer

Build the foundation: Shape the ML platform strategy for a rapidly growing foundational AI company from the ground up
Real-world impact: See your infrastructure power physics simulations that optimize automotive aerodynamics, maritime vessel design, and other critical engineering applications
Direct CEO collaboration: Work closely with the founder & CEO, influence company strategy, and have your voice heard on major decisions
Exceptional team: Recruit and work with world-class deep learning researchers, CFD experts, and infrastructure engineers
Competitive compensation: Base salary + significant equity upside as a founding leadership hire
In-person culture: 5 days a week in office with a team that values face-to-face collaboration, deep technical discussions, and building together
World-class network: Access to our investors and advisors including Eric Schmidt, Elad Gil, Ion Stoica, David Patterson, and others

Benefits

Competitive compensation and equity
Competitive health, dental, vision benefits paid by the company
401(k) plan offering
Flexible vacation
Team Building & Fun Activities
Great scope, ownership and impact
AI tools stipend
Monthly commute stipend
Monthly wellness / fitness stipend
Daily office lunch & dinner covered by the company
Immigration support

How We're Different

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again... who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly." - Teddy Roosevelt

At our core, we believe in being "in the arena." We are builders, problem solvers, and risk-takers who show up every day ready to put in the work: to sweat, to struggle, and to push past our limits. We know that real progress comes with missteps, iteration, and resilience. We embrace that journey fully, knowing that daring greatly is the only way to create something truly meaningful.

If you're ready to build the ML platform that will revolutionize physics simulation, lead a world-class team, and deliver transformative impact to industrial engineering, UniversalAGI is the place for you.


Average salary estimate

$275,000 / year (est.)

Min: $200,000
Max: $350,000

