Browse 55 exciting LLM Evaluation jobs hiring now. Check out companies hiring, such as Beyondsoft Consulting, Cartesia, and Braintrust, in Irvine, Garden Grove, and Orlando.
Beyondsoft is hiring a Data Analyst to prepare training data, anonymize documents, and validate LLM/model outputs for AI projects in a remote US-based role.
Join Cartesia’s in-office SF research-engineering team to design and scale synthetic datasets and systems that power next-generation foundation models.
Design and deliver developer-focused curriculum and hands-on programs that teach evals and agentic AI at Braintrust, working closely with engineers and product teams.
Work as a founding Backend Engineer to build scalable, secure backend infrastructure and data pipelines that power high-impact AI features at an early-stage startup in NYC.
MLabs, a fast-growing research lab supporting foundation model teams, is hiring a Senior Research Engineer to develop scalable RL recipes, modular environments, and production-ready data pipelines for post-training.
Lead the architecture and delivery of generative AI and multimodal systems that enable creative and contextual advertising capabilities across Netflix Ads.
WeRide seeks an AI Simulation Engineer to design AI-based simulation scenarios and agent behaviors that validate and accelerate autonomous vehicle algorithms.
Canva is hiring a Senior Research Engineer to engineer agentic, multimodal evaluation systems that automatically assess and improve the quality and human alignment of generative design models.
Eigenplane is hiring a Founding AI Research Scientist to drive LLM and agent research into scalable, interpretable production systems at an early-stage AI startup.
Lead the technical direction and hands-on engineering for Zapier Agents, building production-grade LLM-driven agent capabilities, integrations, and evaluation systems that scale across thousands of apps and real customers.
Ironclad is looking for a Staff Software Engineer - Applied AI to build and productionize LLMs, RAG systems, and document-understanding services that deliver actionable contract insights.
Unstructured seeks an experienced AI/ML Engineer to design, evaluate, and deploy secure ML solutions for Department of Defense and national security customers on government networks.
Work from the SoHo NYC office as an Applied AI Engineer building production LLMs and ML systems that accelerate bringing new therapies to market.
Tessera Labs seeks a Machine Learning Engineer Intern (Fall 2025, Hybrid in San Jose) to build and fine-tune LLM-driven multi-agent pipelines and enterprise tool integrations.
Lead the design and evaluation of long-term memory systems for LLMs at an early-stage AI startup focused on building self-improving agents.
Work with a top AI research lab to evaluate and improve LLM performance on advanced economics tasks by providing expert, written feedback.
Help shape next-generation AI by evaluating advanced physics solutions and guiding research teams to improve model performance as a contract Physics AI Trainer.
Oura is hiring a Senior AI Engineer to design evaluation systems and build custom LLM and agentic models that power next-generation, actionable health recommendations.
A 12-month AI Fellowship at the Gates Foundation to design, prototype, and deploy responsible AI solutions for global health and development while building capacity across program teams.
Work with Khan Academy to design and deploy generative AI features that improve literacy learning in a 24-month fixed-term Senior AI Engineer role.
OpenAI is hiring a Research Engineer/Scientist to advance personality and model-behavior research and integrate novel methods into products used by hundreds of millions of users.
Lead product and context engineering efforts to improve LLM-driven AI agent performance and user experience for advice-focused client intents within Vanguard's Discretionary Advice Platform.
Handshake AI seeks an experienced Electrical Engineering specialist (contract) to refine and annotate AI model outputs across circuits, signal processing, and embedded-systems domains.
Decagon seeks an Agent Software Engineer intern to build and evaluate production-ready conversational AI agents that improve customer support, working onsite in San Francisco during Summer 2026.
Profound, an NYC AI startup backed by Sequoia, is hiring an AI/ML Engineer to build production-scale NLP and LLM systems for content classification, generation, and measurement.
NBCUniversal is hiring an Analyst, AI Strategy & Innovation to perform market and vendor analysis, build financial/business cases, and support cross-functional pilots and innovation programs across its media businesses.
Mercor is hiring an early-career Data Scientist in San Francisco to drive experiments, metrics, and prototypes that improve hiring match quality and product metrics using SQL, Python, and causal thinking.
Mercor is hiring an Applied AI Engineer to convert real-world human datasets into production-ready signals, deploy and evaluate LLMs, and build integrations and tooling that improve customer outcomes.
OpenAI seeks a Research Engineer to design, build, and iterate frontier evaluations that quantify financial reasoning and related capabilities in large-scale AI models.
Lead the development of large-scale, auditable evaluations for frontier AI models to measure capabilities and steer safety decisions at OpenAI.
BetterUp seeks a product-focused Staff Machine Learning Engineer to design and deliver cutting-edge Generative AI coaching experiences and help scale ML systems in production.
TrustLab is hiring a Senior AI Engineer to develop, tune, and deploy LLM-based content moderation systems that operate at enterprise scale.
Lead development of LLM-driven systems at a mission-driven healthcare startup, focusing on prompt engineering, model optimization, and scalable AI product delivery.
Kiddom is hiring a Research Engineer (GenAI) to design and deploy ML-powered search, personalization, and agentic assistant systems that support teachers and improve student learning.
PointClickCare seeks an experienced Principal AI Engineer to lead architecture and delivery of agentic AI systems that drive safe, scalable AI adoption across its healthcare platform.
MCI is hiring a Prompt Engineer to craft and refine prompts for generative AI models, improving output quality across product and customer-facing applications.
College Board's GenAI Studio is hiring a Data Scientist to prototype and evaluate generative AI solutions that support students, educators, and internal products in a fully remote, mission-driven environment.
MCI is hiring a Prompt Engineer to craft, test, and optimize prompts for generative AI models and integrate prompt engineering into practical BPO and product workflows.
MCI seeks a detail-oriented Prompt Engineer to craft and optimize prompts for generative AI models and integrate them into practical BPO and product workflows.
Eve is hiring an AI Engineer to build, optimize, and ship LLM-powered systems that transform legal workflows and improve outcomes for plaintiff attorneys.
College Board's GenAI Studio is hiring a Data Scientist to prototype, evaluate, and operationalize generative AI solutions that support students, educators, and internal teams.
Lead LMArena’s open-source research program—building reproducible benchmarks, datasets, and evaluation methods that advance transparent, human-centered AI evaluation.
Lead the development and scaling of large language model customization and adaptation as a Principal Machine Learning Engineer on Microsoft's CoreAI - PostTraining team.
Lead cross-functional research programs to discover, evaluate, and mitigate adversarial behaviors in large language models at OpenAI's San Francisco office.
Lead engineering work to productize and deploy frontier AI models into edge and air-gapped defense systems while building evaluation and deployment pipelines for simulation-driven workflows.
Help build and productionize novel LLM-driven lesson experiences and assessment systems at Speak, a fast-growing Series C AI language learning company based in San Francisco.
Yupp seeks an experienced Staff+ AI Engineer in Mountain View to architect and ship scalable LLM applications and lead ML lifecycle work across data, model development, evaluation, and production.
Join Daydream as a Data Scientist to design and deploy LLM-driven stylist features and lead model lifecycle work that reimagines fashion shopping.
Mercor seeks PhD-level STEM experts with scientific Python experience to evaluate and improve LLM-generated code and reasoning in an asynchronous, remote contractor role.
Mercor seeks PhD-level biological scientists to design and evaluate advanced biology problems for a top AI lab in a flexible, remote contractor role.
Below 50k*: 3 | 50k-100k*: 0 | Over 100k*: 4