Senior Cluster Site Reliability Engineer

Voleon is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. For more than a decade, we have led our industry and worked at the frontier of applying machine learning to investment management. We have become a multibillion-dollar asset manager, and we have ambitious goals for the future.

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform for researchers to focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will apply infrastructure-as-code (IaC), automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise.
Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
Diagnose systemic or recurring patterns of problems, and engineer precise solutions to them in collaboration with engineering teams
Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do.
Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for those policies
Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability.

Requirements

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
Experience with cloud infrastructure (AWS or GCP)
Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
Experience with distributed storage technologies (Lustre, Ceph, S3)
Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
Bachelor's degree in computer science or equivalent experience

Preferred Qualifications

Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
Familiarity with hybrid/on-prem environments
Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
Experience with HPC networking (InfiniBand, RDMA)
Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

The base salary range for this position is $205,000 to $235,000 in the location(s) of this posting. Individual salaries are determined through a variety of factors, including, but not limited to, education, experience, knowledge, skills, and geography. Base salary does not include other forms of total compensation such as bonus compensation and other benefits. Our benefits package includes medical, dental and vision coverage, life and AD&D insurance, 20 days of paid time off, 9 sick days, and a 401(k) plan with a company match.

“Friends of Voleon” Candidate Referral Program

If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group. Please use this form to submit your referral. For more details regarding eligibility, terms, and conditions, please review the Voleon Referral Bonus Program.

Equal Opportunity Employer

The Voleon Group is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

The Voleon Group

The Voleon Group is a family of companies committed to the development and deployment of cutting-edge technologies in investment management. We specialize in the application of rigorous data-driven techniques to financial markets, driven by our ow...

Funding: Growth
Departments: Software Engineering
Seniority level requirement: Senior Level
Team size: 51-200
