BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. Backed by investors such as DCM, we serve enterprises around the world that rely on us for consistent scalability and performance in production. Our portfolio includes both open-source and commercial products, and our goal is to help each team build its own competitive advantage through AI.
As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, through the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and lift each layer of the stack to new performance ceilings. Every gain you unlock will flow straight into open-source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at scale.
Example projects:
https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction
https://bentoml.com/blog/benchmarking-llm-inference-backends
https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes
What you will do:
Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.
Benchmarking - Build repeatable tests that model production traffic; track and report performance across vLLM, SGLang, TRT-LLM, and future runtimes.
Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.
Serving features - Improve batching, caching, load balancing, and model-parallel execution.
Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.
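To give a flavor of the benchmarking work described above, here is a minimal, hypothetical sketch (standard library only, made-up numbers) of the kind of repeatable latency/throughput summary a benchmark harness might report; a real harness would replay production-shaped traffic against vLLM, SGLang, or TRT-LLM endpoints:

```python
# Hypothetical sketch: summarize per-request latencies and token counts
# into the metrics an inference benchmark typically tracks.
import statistics

def summarize(latencies_s, tokens_out, wall_s):
    """Return p50/p95 latency and aggregate token throughput.

    latencies_s: per-request end-to-end latencies in seconds
    tokens_out:  tokens generated per request
    wall_s:      wall-clock duration of the measurement window
    """
    return {
        "p50_s": statistics.median(latencies_s),
        # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
        "p95_s": statistics.quantiles(latencies_s, n=20)[18],
        "tok_per_s": sum(tokens_out) / wall_s,
    }

# Example: 8 requests completed within a 2-second window (made-up data).
lat = [0.21, 0.25, 0.19, 0.40, 0.22, 0.31, 0.27, 0.24]
toks = [128, 96, 160, 64, 128, 112, 144, 100]
print(summarize(lat, toks, wall_s=2.0))
```

In practice the interesting part is holding the traffic model fixed (prompt lengths, arrival process, concurrency) so that numbers are comparable across runtimes and releases.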
What we are looking for:
Deep understanding of transformer architecture and inference engine internals.
Hands-on experience speeding up model serving through batching, caching, and load balancing.
Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).
Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.
Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.
Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.
What we offer:
Direct impact – cut real GPU costs by optimizing LLM inference for production workloads worldwide.
Technical scope – work across the full stack, from GPU kernels to distributed serving on large GPU clusters.
Customer reach – support organizations around the globe that rely on BentoML.
Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.
Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.
Compensation – competitive salary, equity, learning budget, and paid conference travel.