This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Staff Software Engineer, GPU Infrastructure (HPC) in the United States and Canada.
As a Staff Software Engineer in GPU infrastructure, you will design, build, and operate high-performance computing clusters to accelerate AI and machine learning workloads. You will collaborate closely with researchers and engineers to ensure AI workloads run reliably, efficiently, and at scale across cloud environments. The role includes optimizing infrastructure for cost, performance, and stability, while providing self-service tools for ML teams. You will troubleshoot complex issues, implement automation and observability best practices, and drive innovations in distributed GPU/TPU systems. This position offers opportunities to mentor engineers, influence infrastructure strategy, and directly impact the development of cutting-edge AI models. You will work in a fast-paced, collaborative environment where technical excellence and scalability are key priorities.
Accountabilities:
• Design, deploy, and manage Kubernetes-based GPU/TPU superclusters across multiple clouds for AI/ML workloads.
• Optimize HPC infrastructure for distributed training frameworks such as JAX, PyTorch, and TensorFlow.
• Identify and resolve performance bottlenecks, system failures, and infrastructure issues.
• Build self-service tools to enable researchers to monitor, debug, and optimize AI/ML training jobs independently.
• Implement best practices for automation, observability, and infrastructure-as-code (IaC).
• Collaborate closely with AI researchers and ML engineers to translate emerging needs into robust infrastructure solutions.
• Mentor team members, conduct code reviews, document processes, and foster a culture of knowledge sharing.
Requirements:
• Deep expertise in ML/HPC infrastructure, including GPU/TPU clusters and distributed training frameworks.
• Proven experience with cloud-native Kubernetes deployments at scale.
• Strong programming skills in Python and Go; open-source contributions are preferred.
• Knowledge of Linux internals, RDMA networking, and performance optimization for ML workloads.
• Demonstrated ability to collaborate with research teams and solve complex infrastructure challenges.
• Self-directed problem-solving mindset, with the ability to drive impact in fast-paced environments.
• Experience in building scalable, resilient, and maintainable infrastructure systems.
Benefits:
• Inclusive and collaborative work culture.
• Opportunities to work on cutting-edge AI research and infrastructure projects.
• Weekly lunch stipend, in-office meals, and snacks.
• Comprehensive health and dental benefits, including mental health budget.
• 100% parental leave top-up for up to six months.
• Personal enrichment benefits for arts, fitness, and workspace improvement.
• Remote-flexible work options, co-working stipend, and offices in major global cities.
• Six weeks of vacation (30 working days).
Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
When you apply, your profile goes through our AI-powered screening process designed to identify top talent efficiently and fairly.
🔍 Our AI evaluates your CV and LinkedIn profile thoroughly, analyzing your skills, experience and achievements.
📊 It compares your profile to the job’s core requirements and past success factors to determine your match score.
🎯 Based on this analysis, we automatically shortlist the 3 candidates with the highest match to the role.
🧠 When necessary, our human team may perform an additional manual review to ensure no strong profile is missed.
The process is transparent, skills-based, and free of bias — focusing solely on your fit for the role.
Once the shortlist is completed, we share it directly with the company that owns the job opening. The final decision and next steps (such as interviews or additional assessments) are then made by their internal hiring team.
Thank you for your interest!