Anyone AI

Human Data Evals Lead (Remote/US/LATAM)

Posted a month ago

Mexico

⭐ 5-10 years experience

AI Summary

Lead the design and delivery of high-quality data proposals and benchmark packages for AI labs. Manage the end-to-end pilot process, including recruiting subject-matter experts and ensuring rigorous quality control.

Reports to: CEO

Owns: data proposals, sample development, quality, and pilot delivery

Location: Remote / LATAM / US

The role

You will own Anyone AI’s data initiatives and proposals to AI labs, from the initial proposal or response to a lab request through pilot delivery. You own how we build proposals and develop sample packages and benchmarks: frontier-grade packages across reasoning, coding, agents and tool use, multimodal, and other domains, produced in collaboration with subject-matter experts, with expert-verified ground truth, multi-model headroom results, and QC that survives buyer-side scrutiny. You are the person who designs the sample that demonstrates our quality and converts pilots into production engagements. On a small team, this is the operational center of the Human Data Division.

Responsibilities

Proposals & requests. Study public benchmarks and eval targets, and turn them into proposals and sample packages that demonstrate capability and win the work. Respond to lab data requests and pilots.
Sample & benchmark development. Design and build the sample packages, working with subject-matter experts. Every package meets the bar of our current sample set:
- Expert-verified, exact-match-checkable ground truth and gold reasoning trajectories.
- Multi-model evaluation showing real headroom, and proof that the task discriminates between models, not just that it's hard.
- Rigorous QC structure: calibration layers, severity-weighted rubrics, deterministic verifiers, evidence maps, etc.
Subject-matter experts. Recruit, brief, calibrate, and review a pool of experts across coding, agentic/tool-use, and STEM/reasoning. Raise their output to our standard and keep it there; be the arbiter of what "correct" and "frontier-difficulty" mean.
Lab relationships. Be a direct point of contact for lab partners on Slack and calls, with support from the CEO and the wider team. Keep senior lab contacts informed, surface what they actually need, and pull in the CEO and subject-matter experts when the conversation calls for it.
Pilot delivery. Own pilots end to end: scoping, SOW, staffing, production, QC, and delivery. Nothing ships before it's lab-ready, and nothing comes back rejected as "not frontier-level" without us already knowing why.

Experience

Originated data or benchmark proposals for AI labs, translated eval targets into sample tasks that demonstrate capability, and owned the engagement through delivery.
Deep evaluation and quality expertise: LLM benchmarking, with real strength in code-model evaluation.
Built QC processes and artifact standards that met enterprise or lab requirements, and set a quality bar a team of experts was held to.
Thrives in ambiguous, fast-moving environments where the rules are still being written, and delivers under pressure.

Qualifications

5+ years in technical delivery, quality, or program management, with recent experience in AI/ML data, model evaluation, or benchmarking.
Hands-on experience delivering data or evaluation work to AI labs or enterprise ML teams, from scoping through delivery.
Working fluency with how frontier models are evaluated: benchmarks, rubrics, pass rates, headroom, and what makes a task discriminate a model.
Proven people/vendor leadership: you've recruited, calibrated, and held a team or expert pool to a quality standard.
Fluent English. Spanish is a nice to have.
