Our client is a well-funded nonprofit research organization focused on measuring frontier AI capabilities, especially agentic/autonomous capabilities and the ability of models to conduct AI R&D. These capabilities can create outsized societal and security risk if they scale faster than our ability to evaluate and govern them.
Their work is unusually “real-world” compared to typical benchmarks: they build evaluations with high realism and measure performance against skilled-human baselines (often multi-hour tasks), and publish research on how quickly models are improving at completing long tasks.
You’d be building the UI that turns messy LLM evaluation outputs into clear, explorable artifacts that researchers can trust.
What you’ll do
- Build React + TypeScript interfaces for exploring LLM evaluation results and experiment outputs.
- Design and implement data visualizations that make model behavior, metrics, and results easy to inspect.
- Build workflows that support end-to-end traceability of LLM runs (prompts → intermediate steps → decisions → outputs).
- Partner closely with researchers; iterate quickly while balancing clarity, accuracy, and performance.
Tech stack / must-haves
- React + TypeScript
- Hands-on experience with at least one major visualization library: D3, Plotly, Vega/Vega-Lite, Visx, Three.js, Highcharts, or ECharts
Why this matters
- Their mission is to give society and AI labs grounded answers to: “What can frontier models actually do?” and “When do capabilities become dangerous?”
- The team includes researchers and engineers from top AI organizations (e.g., OpenAI, DeepMind), along with alumni of Oxford, Caltech, MIRI, and ML interpretability programs.
Location
- Onsite in the San Francisco Bay Area (relocation sponsored).
Contact [email protected].