Browse 46 exciting jobs hiring in AI Evaluation now. Check out companies hiring, such as Weekday AI, Compa, and BAE Systems, in Akron, Milwaukee, and St. Paul.
Experienced wet-lab biology PhDs are needed to assess and annotate experimental failure modes and recommend mitigations for an AI research benchmark.
Lead and grow Compa’s inaugural Applied AI team, driving production ML systems and MLOps practices to power enterprise compensation intelligence.
Lead the engineering and applied-LLM work to improve agent reliability, autonomy, and evaluation pipelines for a fast-moving startup building autonomous business agents.
Lead the AI product strategy for an enterprise cloud data protection platform, turning real-world customer needs into high-impact, AI-enabled product features and commercial launches.
Lead the design and delivery of agent-based AI and orchestration frameworks at Heidi to safely automate clinician workflows and scale clinical impact.
AirOps is hiring a Senior Product Manager to lead the Agents product — designing agent orchestration, evaluation frameworks, and workflows that turn AI insights into publish-ready content at scale.
Adtalem is seeking a Senior Analyst, Market Intelligence & Insights to lead always-on research and translate AI and edtech competitive intelligence into actionable insights and executive briefings for enterprise AI strategy.
Siena is hiring a Product Engineer to build full-stack AI-driven agent capabilities, shape evaluation systems, and deliver integrations that redefine customer experience and e-commerce.
Zillow's Agentic AI team is hiring a Machine Learning Engineer to design, train, evaluate, and ship agentic LLM solutions that improve user understanding and decision-making across the home search experience.
RWS is seeking part-time remote AI Data Specialists in Florida to perform data annotation, evaluation, and tagging tasks that improve AI content quality and safety.
Join Ataraxis AI as a Research Engineer (Data Science) to advance AI-driven precision oncology through rigorous data pipelines, reproducible research, and publication-grade scientific contributions.
Lead the development of agentic LLM systems and domain-specific fine-tuning at Argon to build the next-generation AI OS for pharma from our NYC office.
Amigo is seeking an Applied Scientist to develop evaluation and safety frameworks that ensure AI systems are reliable and safe for healthcare deployment.
Beyondsoft is hiring a Data Analyst to prepare training data, anonymize documents, and validate LLM/model outputs for AI projects in a remote US-based role.
Lead validation and automated assurance for agentic AI systems supporting NGA missions, focusing on benchmark design, regression testing, and CI/CD-integrated verification.
Design and deliver developer-focused curriculum and hands-on programs that teach evals and agentic AI at Braintrust, working closely with engineers and product teams.
Lead the architecture and delivery of generative AI and multimodal systems that enable creative and contextual advertising capabilities across Netflix Ads.
WeRide seeks an AI Simulation Engineer to design AI-based simulation scenarios and agent behaviors that validate and accelerate autonomous vehicle algorithms.
Canva is hiring a Senior Research Engineer to build agentic, multimodal evaluation systems that automatically assess and improve the quality and human alignment of generative design models.
MIRI, a nonprofit focused on reducing existential AI risk, is hiring a Technical Governance Team Manager to lead stakeholder engagement, run projects and people processes, and help produce rigorous technical governance research.
Eigenplane is hiring a Founding AI Research Scientist to drive LLM and agent research into scalable, interpretable production systems at an early-stage AI startup.
Lead the technical direction and hands-on engineering for Zapier Agents, building production-grade LLM-driven agent capabilities, integrations, and evaluation systems that scale across thousands of apps and real customers.
Decagon seeks an experienced QA Lead in San Francisco to build and run QA for AI-powered customer service agents, moving from hands-on evaluation to scalable QA processes and team leadership.
Apply your SEC-filings and financial-analysis expertise remotely at Welo Data to evaluate AI-generated outputs from 10-K filings in a short-term contract role with potential extension.
Ironclad is looking for a Staff Software Engineer - Applied AI to build and productionize LLMs, RAG systems, and document-understanding services that deliver actionable contract insights.
LILT is hiring native Mandarin/Simplified Chinese linguists to perform remote prompt evaluation, multimedia content understanding, and text review for an AI-driven translation project.
DepthFirst AI is hiring a Research Engineer to develop and evaluate AI agents and training pipelines that discover and exploit software vulnerabilities at scale.
Unstructured seeks an experienced AI/ML Engineer to design, evaluate, and deploy secure ML solutions for Department of Defense and national security customers on government networks.
Work from the SoHo NYC office as an Applied AI Engineer building production LLMs and ML systems that accelerate bringing new therapies to market.
Tessera Labs seeks a Machine Learning Engineer Intern (Fall 2025, Hybrid in San Jose) to build and fine-tune LLM-driven multi-agent pipelines and enterprise tool integrations.
Lead the design and evaluation of long-term memory systems for LLMs at an early-stage AI startup focused on building self-improving agents.
Work with a top AI research lab to evaluate and improve LLM performance on advanced economics tasks by providing expert, written feedback.
Help shape next-generation AI by evaluating advanced physics solutions and guiding research teams to improve model performance as a contract Physics AI Trainer.
Handshake AI is hiring a Technical Program Manager, AI Operations to run high-impact AI data programs, ensuring scalable processes, data quality, and excellent customer outcomes.
Lead the design and deployment of cutting-edge 3D computer vision and generative ML models at Dandy to automate and improve dental manufacturing workflows.
Oura is hiring a Senior AI Engineer to design evaluation systems and build custom LLM and agentic models that power next-generation, actionable health recommendations.
A 12-month AI Fellowship at the Gates Foundation to design, prototype, and deploy responsible AI solutions for global health and development while building capacity across program teams.
Work with Khan Academy to design and deploy generative AI features that improve literacy learning in a 24-month fixed-term Senior AI Engineer role.
Technical leader needed to architect and deliver complex, production C++ systems integrating sensors, hardware, and software for high-stakes intelligence programs in Reston, VA.
Lead product and context engineering efforts to improve LLM-driven AI agent performance and user experience for advice-focused client intents within Vanguard's Discretionary Advice Platform.
Decagon seeks an Agent Software Engineer intern to build and evaluate production-ready conversational AI agents that improve customer support, working onsite in San Francisco during Summer 2026.
Profound, an NYC AI startup backed by Sequoia, is hiring an AI/ML Engineer to build production-scale NLP and LLM systems for content classification, generation, and measurement.
Work at the intersection of research and deployment to turn Twelve Labs’ video understanding models into scalable, production solutions for customers.
Build and ship mission-critical conversational AI agents at Decagon, working directly with enterprise customers to create scalable, high-impact solutions.
Dandy is hiring a Senior Machine Learning Engineer to advance 3D computer vision and generative ML models that automate and scale dental appliance manufacturing.
NBCUniversal is hiring an Analyst, AI Strategy & Innovation to perform market and vendor analysis, build financial/business cases, and support cross-functional pilots and innovation programs across its media businesses.
Below 50k*: 0 | 50k-100k*: 0 | Over 100k*: 1