Job details

Senior Cluster Site Reliability Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Cluster Site Reliability Engineer in California (USA).

This role is designed for a highly skilled engineer to ensure the reliability, scalability, and performance of critical research compute clusters. You will maintain and optimize both on-premises and cloud infrastructure while implementing automation and SRE best practices. Working closely with engineering and research teams, you will solve real-time operational issues, drive systemic improvements, and build observability frameworks to monitor cluster health. Your work will directly impact cutting-edge machine learning research, enabling teams to operate efficiently at scale. This position offers the opportunity to apply your technical expertise to complex distributed systems and HPC environments while collaborating with a high-performing, innovative team.

Accountabilities:

  • Act as a first responder to cluster outages or performance issues, triaging and resolving urgent problems efficiently.
  • Maintain high uptime and define, track, and report on SLAs to quantify reliability.
  • Diagnose recurring systemic issues and engineer long-term solutions in collaboration with engineering teams.
  • Develop and maintain observability and monitoring frameworks, including custom metrics for cluster health.
  • Support policy design for fair cluster usage and implement enforcement mechanisms for research teams.
  • Forecast cluster growth, optimize scaling strategies, and improve operational efficiency across cost, performance, and usability dimensions.
  • Collaborate with software and research teams to support distributed computing and machine learning workflows.

Requirements:

  • 5+ years of experience in SRE, DevOps, or similar senior engineering roles.
  • Expertise in HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or ML training systems (Kubeflow, MLflow, Horovod).
  • Proficiency in scripting (Python, Ruby, or similar) and infrastructure-as-code/configuration management (Terraform, Ansible).
  • Hands-on experience with cloud platforms (AWS or GCP) and distributed storage systems (Lustre, Ceph, S3).
  • Strong familiarity with observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry).
  • Bachelor’s degree in Computer Science or equivalent experience.
  • Systematic, automation-driven mindset with a focus on reliability engineering.
  • Experience with HPC frameworks, Kubernetes-based job orchestrators, and distributed computing frameworks (Ray, Dask, Spark).
  • Knowledge of ML frameworks (PyTorch, TensorFlow, JAX, Horovod, DeepSpeed).
  • Experience with hybrid or on-prem/cloud environments and HPC networking (InfiniBand, RDMA).
  • Strong security/IAM understanding, including Zero Trust and cloud IAM.
  • Proficiency with containerization (Docker, Podman, Singularity) for HPC/batch compute environments.

Benefits:

  • Base salary: $205,000 – $235,000 (depending on experience and location).
  • Comprehensive benefits package: medical, dental, and vision coverage; life and AD&D insurance.
  • Paid time off: 20 vacation days and 9 sick days annually.
  • Retirement plan: 401(k) with company match.
  • Opportunities to work on cutting-edge HPC and ML infrastructure at scale.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience, and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.

The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role. Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.

Thank you for your interest!

Average salary estimate

$220,000 / year (est.)
Min: $205,000
Max: $235,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

Jobgether has the ambition to disrupt the recruitment industry as we know it by simplifying it and making it more accurate 🎯 Jobgether platform connects candidates and companies based on: - Skills -... Values - Ambition - Personality The candidat...

Employment type: Full-time, hybrid
Date posted: October 5, 2025