Inference Optimization Engineer

About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, in the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and raise the performance ceiling of each layer of the stack. Every gain you unlock will flow straight into open source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at large scale.

Responsibilities

  • Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.

  • Benchmarking - Build repeatable tests that model production traffic; track and report the performance of vLLM, SGLang, TRT-LLM, and future runtimes.

  • Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.

  • Serving features - Improve batching, caching, load balancing, and model-parallel execution.

  • Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.

Qualifications

  • Deep understanding of transformer architecture and inference engine internals.

  • Hands-on experience speeding up model serving through batching, caching, and load balancing.

  • Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).

  • Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.

  • Proficiency in CUDA and with profiling tools such as Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.

  • Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.

Why join us

  • Direct impact – your optimizations run in production LLM deployments and cut real GPU costs.

  • Technical scope – operate distributed LLM inference and large GPU clusters worldwide.

  • Customer reach – support organizations around the globe that rely on BentoML.

  • Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.

  • Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.

  • Compensation – competitive salary, equity, learning budget, and paid conference travel.

Average salary estimate

$200,000 / year (est.)
min: $160,000
max: $240,000


Employment type: Full-time, remote
Date posted: September 29, 2025