Staff Software Engineer, GPU Infrastructure (HPC)

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer, GPU Infrastructure (HPC) in the United States or Canada.

As a Staff Software Engineer in GPU infrastructure, you will design, build, and operate high-performance computing clusters to accelerate AI and machine learning workloads. You will collaborate closely with researchers and engineers to ensure AI workloads run reliably, efficiently, and at scale across cloud environments. The role includes optimizing infrastructure for cost, performance, and stability, while providing self-service tools for ML teams. You will troubleshoot complex issues, implement automation and observability best practices, and drive innovations in distributed GPU/TPU systems. This position offers opportunities to mentor engineers, influence infrastructure strategy, and directly impact the development of cutting-edge AI models. You will work in a fast-paced, collaborative environment where technical excellence and scalability are key priorities.

Accountabilities:
• Design, deploy, and manage Kubernetes-based GPU/TPU superclusters across multiple clouds for AI/ML workloads.
• Optimize HPC infrastructure for distributed training frameworks such as JAX, PyTorch, and TensorFlow.
• Identify and resolve performance bottlenecks, system failures, and infrastructure issues.
• Build self-service tools to enable researchers to monitor, debug, and optimize AI/ML training jobs independently.
• Implement best practices for automation, observability, and infrastructure-as-code (IaC).
• Collaborate closely with AI researchers and ML engineers to translate emerging needs into robust infrastructure solutions.
• Mentor team members, conduct code reviews, document processes, and foster a culture of knowledge sharing.

Requirements:
• Deep expertise in ML/HPC infrastructure, including GPU/TPU clusters and distributed training frameworks.
• Proven experience with cloud-native Kubernetes deployments at scale.
• Strong programming skills in Python and Go; open-source contributions are a plus.
• Knowledge of Linux internals, RDMA networking, and performance optimization for ML workloads.
• Demonstrated ability to collaborate with research teams and solve complex infrastructure challenges.
• Self-directed problem-solving mindset with ability to drive impact in fast-paced environments.
• Experience in building scalable, resilient, and maintainable infrastructure systems.

Benefits:
• Inclusive and collaborative work culture.
• Opportunities to work on cutting-edge AI research and infrastructure projects.
• Weekly lunch stipend, in-office meals, and snacks.
• Comprehensive health and dental benefits, including mental health budget.
• 100% parental leave top-up for up to six months.
• Personal enrichment benefits for arts, fitness, and workspace improvement.
• Remote-flexible work options, co-working stipend, and offices in major global cities.
• Six weeks of vacation (30 working days).


Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role.
Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.
Thank you for your interest!

 

Average salary estimate

$235,000 / year (est.)
Min: $170,000
Max: $300,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.


Employment type: Full-time, hybrid
Date posted: November 22, 2025