Browse 48 open jobs in Model Evaluation. Companies hiring include The College Board, The Browser Company, and Danaher, in locations such as New Orleans, Providence, and St. Petersburg.
Lead the College Board’s strategy and operational implementation of automated scoring and AI/NL measurement to ensure valid, fair, and scalable solutions across major assessment programs.
Lead and grow the ML engineering practice at The Browser Company to build and ship LLM-powered, privacy-aware features that personalize and improve Dia’s browser experience.
Lead the design and execution of evaluation frameworks for multimodal AI systems at Danaher to ensure performance, robustness and safety across life-sciences and diagnostics products.
Netflix is hiring a Machine Learning Scientist (L4) to research and develop generative computer-vision and graphics models that will be integrated into production tools for media creation.
Cartesia is hiring a Product Manager, Model Behavior, to define and elevate TTS/STT quality and evaluation, shaping how voice AI sounds, performs, and delights customers.
Cartesia is looking for a Post-Training Researcher to design and scale preference optimization, evaluation, and feedback-driven learning methods for multimodal foundation models.
Lead and scale a remote ML engineering team to design, deploy, and iterate on LLM-powered features that drive measurable product impact.
Blend seeks an experienced Associate Director of Product Management to lead discovery, specification, and iterative delivery of agentic AI products while aligning cross-functional teams to measurable outcomes.
Cartesia is hiring a senior Product Manager to define and lead the voice AI agent product area, building enterprise-grade speech-driven agents and evaluation standards using cutting-edge audio models.
Lead the development and scaling of LLM-driven product features as an Engineering Manager focused on ML strategy, team growth, and production-quality infrastructure in a remote-first, high-impact startup.
Applied Scientist Intern role at Wealth.com focused on building and productionizing LLM-, NLP- and CV-driven legal/financial AI assistants through hands-on model development and evaluation.
Lead the ML effort at Retell AI to build, evaluate, and deploy real-time LLM and audio models powering high-traffic voice agents.
Provide expert Biology knowledge to a generative AI research team by designing tasks, authoring guidelines, and evaluating model outputs in a long-term remote contract.
Hippocratic AI is hiring an RN Clinical Product Strategist in Palo Alto to turn nursing expertise into clinical evaluation, safety rails, and product guidance for healthcare LLMs.
Lyra Health is seeking a Senior Machine Learning Engineer to build production ML and generative-AI tooling, platforms, and services that support clinical mental-health products and scale across the organization.
Comcast is hiring a Senior Analyst, GenAI, to support its Enterprise BI GenAI Center of Excellence in building, evaluating, and operationalizing generative AI workflows for marketing and creative content.
Lead and scale an AI engineering organization to deliver production-ready foundation models and LLM-powered products that impact millions of users in a fully remote environment.
Director of AI Engineering needed to lead a remote team in designing, deploying, and optimizing large-scale, production AI systems and foundational model infrastructure.
Chicago Cubs are hiring Data Scientists to develop and deploy analytical models and data pipelines that inform player evaluation, development, and strategic decisions across Baseball Operations.
Mercor is hiring an AI Researcher in San Francisco to lead LLM evaluation research, publish benchmark papers, and build dataset and annotation offerings for top AI labs.
Lead strategy and roadmap for AI solutions in Mastercard’s AI Center of Excellence, translating ML capabilities into clear business outcomes while advising senior leaders.
Customer-facing AI engineer focused on designing and operationalizing advanced prompts, prompt chains, and linguistic patterns to produce high-quality, formatted content for WRITER’s enterprise customers.
Lead and grow a hands-on ML systems engineering organization in Basis’s NYC office to build production multi-agent architectures, evaluation pipelines, and observability tooling that power the AI Accountant.
An acquired GenAI-native tax-document platform seeks an experienced Applied AI/ML Engineer to own and scale production ML systems that transform accounting workflows.
FurtherAI is hiring a hands-on Data Scientist to lead evaluation, LLM tuning, and metrics for production AI systems supporting major insurance customers at its San Francisco headquarters.
At Rox, an Applied AI Engineer will build and deploy agentic LLM-powered workflows in production to supercharge revenue teams and iterate rapidly with customers and product partners.
OpenAI is hiring a San Francisco-based software engineer to build evals, harnesses, and pipelines that drive model improvement and product reliability for advanced AI systems.
Lead the QA strategy for GenAI-powered e-learning features by designing prompt validation, HITL review workflows, and measurable evaluation protocols to ensure safe, reliable model behavior.
Lead development of models, datasets, and evaluations that advance theoretical research in mathematics and related fields while partnering with academia and engineering teams.
Work remotely as a Data Analyst producing and evaluating high-quality data and feedback to help improve AI systems for a US-based partner.
Epoch AI is hiring a remote Researcher to provide fast, rigorous analysis and forecasts across the AI pipeline for external partners and consultations.
Work remotely across AI projects to create, evaluate, and refine high-quality written data and prompts that improve model performance and user experience.
WitnessAI seeks a Machine Learning Engineer to design, evaluate, and deploy reliable, interpretable LLMs that power enterprise AI security and governance.
Lead the design, benchmarking, fine-tuning, and production deployment of specialized agents for enterprise customers using state-of-the-art RL and post-training techniques.
Handshake AI is seeking a Senior AI Research Engineer to architect and scale large post-training and evaluation systems for LLMs and lead engineering efforts that translate research into production-grade benchmarks and pipelines.
Itron is hiring an AI Data Science Analyst to help build and deploy AI-agent-driven HR solutions that translate HR data into actionable insights.
Welo Data is hiring a Prompt Engineer & Data Analyst to design and evaluate prompts, curate datasets, and perform rigorous model analysis to enhance LLM capabilities and safety.
Contribute remotely to AI system improvement by creating, evaluating, and refining high-quality content and prompts that enhance model performance and user experience.
Lead the architecture and roadmap of Jiffy's core AI platform, driving model development, inference optimization, APIs, and agentic systems to power novel consumer and developer experiences.
Promise is seeking an AI Researcher to develop and operationalize ML and LLM solutions that streamline access to government benefits and improve service delivery.
Lead frontier LLM experimentation and productionize interpretable AI agent workflows as an early research scientist at DimRed.
Lead a cross-disciplinary data science and ML team to deliver LLM-driven solutions, scalable pipelines, and enterprise analytics for Netflix's Content organization.
Lead high-impact, product-aligned experiments on foundation models using PyTorch and distributed training to improve real-world customer outcomes at Liquid AI.
Lead and grow Compa’s inaugural Applied AI team, driving production ML systems and MLOps practices to power enterprise compensation intelligence.
Lead large-scale LLM training and synthetic data pipelines at Periodic Labs to build scientifically knowledgeable models and scale training across supercomputing infrastructure.
Lead the AI product strategy for an enterprise cloud data protection platform, turning real-world customer needs into high-impact, AI-enabled product features and commercial launches.
Join Mistral AI as a Model Behavior Architect to shape LLM behavior through prompt design, evaluation pipelines, and policy work informed by humanities expertise.
AirOps is hiring a Senior Product Manager to lead the Agents product — designing agent orchestration, evaluation frameworks, and workflows that turn AI insights into publish-ready content at scale.
Below 50k*: 3 | 50k-100k*: 1 | Over 100k*: 40