Site Reliability Engineer, AI/ML Infrastructure

About The Role


We're looking for a Senior Site Reliability Engineer to help us run one of the most exciting GPU clusters around: our Toronto datacenter, packed with NVIDIA H100 and A100 GPUs, over 20PB of Ceph storage, terabit networking, and hundreds of servers.


You'll be hands-on with the full lifecycle of HPC infrastructure: planning, building, testing, deploying, and keeping everything running smoothly. That means troubleshooting issues as they arise, monitoring performance, developing automation to make our lives easier, and working closely with engineering and science teams to ensure they have what they need. You'll also help us plan for future capacity and evaluate new technologies as we continue to scale.


Responsibilities
  • Manage and optimize HPC cluster operations
  • Deploy and maintain infrastructure-as-code solutions
  • Support ML/research teams with cluster usage optimization
  • Operate, troubleshoot, and optimize Ceph storage clusters
  • Develop automation and tooling


Minimum Qualifications
  • 5+ years of experience in SRE or HPC operations
  • Proficiency in Linux systems administration (Ubuntu/Debian)
  • Experience with Kubernetes and container orchestration
  • Experience deploying and maintaining Ceph clusters larger than 1PB
  • Knowledge of security best practices in multi-tenant environments
  • Understanding of L2/L3 networking fundamentals
  • Skilled in Python and Bash scripting


Preferred Qualifications
  • Experience with infrastructure-as-code tools (Ansible/Terraform)
  • Experience with GitOps (Helm, ArgoCD)
  • Strong grasp of RDMA, InfiniBand, and GPUDirect technologies
  • Familiarity with deep learning frameworks such as PyTorch and TensorFlow
  • Familiarity with at least one cloud platform: AWS, Azure, or GCP


$150,000 - $250,000 a year

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

Employment type: Full-time, hybrid
Date posted: November 24, 2025