BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. Backed by investors such as DCM, we serve enterprises around the world that rely on us for consistent scalability and performance in production. Our portfolio includes both open-source and commercial products, and our goal is to help each team build its own competitive advantage through AI.
As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, through the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and lift each layer of the stack to new performance ceilings. Every gain you unlock will flow straight into open-source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at scale.
Example projects:
https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
https://bentoml.com/blog/benchmarking-llm-inference-backends
https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes
What you will do:
Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.
Benchmarking - Build repeatable tests that model production traffic; track and report performance across vLLM, SGLang, TRT-LLM, and future runtimes.
Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.
Serving features - Improve batching, caching, load balancing, and model-parallel execution.
Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.
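To give a flavor of the benchmarking work described above, here is a minimal, hypothetical sketch (standard library only, made-up numbers) of the kind of repeatable latency/throughput summary a benchmark harness might report; a real harness would replay production-shaped traffic against vLLM, SGLang, or TRT-LLM endpoints:

```python
# Hypothetical sketch: summarize per-request latencies and token counts
# into the metrics an inference benchmark typically tracks.
import statistics

def summarize(latencies_s, tokens_out, wall_s):
    """Return p50/p95 latency and aggregate token throughput.

    latencies_s: per-request end-to-end latencies in seconds
    tokens_out:  tokens generated per request
    wall_s:      wall-clock duration of the measurement window
    """
    return {
        "p50_s": statistics.median(latencies_s),
        # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
        "p95_s": statistics.quantiles(latencies_s, n=20)[18],
        "tok_per_s": sum(tokens_out) / wall_s,
    }

# Example: 8 requests completed within a 2-second window (made-up data).
lat = [0.21, 0.25, 0.19, 0.40, 0.22, 0.31, 0.27, 0.24]
toks = [128, 96, 160, 64, 128, 112, 144, 100]
print(summarize(lat, toks, wall_s=2.0))
```

In practice the interesting part is holding the traffic model fixed (prompt lengths, arrival process, concurrency) so that numbers are comparable across runtimes and releases.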
What we are looking for:
Deep understanding of transformer architecture and inference engine internals.
Hands-on experience speeding up model serving through batching, caching, and load balancing.
Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).
Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.
Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.
Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.
What we offer:
Direct impact – cut real GPU costs by optimizing LLM inference for production workloads worldwide.
Technical scope – work across the full stack, from GPU kernels to distributed serving on large GPU clusters.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.
Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
Compensation – competitive salary, equity, learning budget, and paid conference travel.