The Mission
We live in a paradox: AI is accelerating the world’s capabilities, yet the average person feels more financially precarious than ever. Inflation is rising, wages are stagnant, and the traditional “retirement” model is broken. We aren’t building another chatbot. We are building the Financial Answer Machine, an intelligent guide designed to help people navigate a new financial reality.
Underpinned by a proprietary financial system, we are turning “average” advice into personalized, multi-modal financial power. We have closed an oversubscribed seed round and are looking for founding team members to help us build a bridge between the intelligence of AI and the rigid accuracy required for financial freedom. This is a rare opportunity to join at Day Zero and architect a business designed for outsized impact and massive scale.
The Role
Hence is hiring an AI Evaluation Lead to own how we measure the quality of AI-generated financial advice. Getting the advice right matters. A bad output here has real consequences for real people, and this role owns making sure we catch it.
You will work with an AI-generated test case library and automated scoring infrastructure that is already in place. Your job is to make sure we are measuring the right things, interpreting what the results are telling us, and determining what needs to change to keep the system performing well as it scales. Evaluation complexity grows with the platform, and this role grows with it.
This is not a monitoring and reporting role. It requires genuine judgment about AI system behavior, advice quality, and what the data is and is not capturing. You will report to our Head of Revenue & Compliance, work closely with the AI/ML team and founders, and partner with subject matter experts who provide domain judgment on complex or ambiguous cases. But you need enough personal finance literacy to make first-pass quality assessments independently and know when to escalate.
What You’ll Do (The Day-to-Day)
- Define and validate the evaluation set: what cases we should be testing, whether coverage is sufficient across domains, and where the current framework has gaps.
- Analyze scoring results to identify the highest-frequency case types, patterns in what is performing well versus poorly, and anomalies that warrant closer review.
- Assess whether current measures are detecting the right failure modes or whether new measures are needed.
- Review flagged cases and make judgment calls on what the results mean and what should be done about them, drawing on both data and domain knowledge.
- Own the criteria and calibration for when human review is triggered: defining what rises to that level, what does not, and ensuring the threshold stays well-calibrated as the platform scales.
- Partner with subject matter experts on cases that require deeper domain judgment, and incorporate their input into evaluation design.
- Ensure evaluation coverage keeps pace with new domain additions and model changes before they ship.
- Translate findings into specific, actionable recommendations for the AI/ML team on what needs to change in the system.
- Evolve the evaluation framework as the system grows, new domains are added, and user patterns shift.
What We’re Looking For
You have worked on AI or ML system quality in a context where outputs had real stakes. You think analytically about what data is and is not telling you. You are comfortable making judgment calls in ambiguous situations rather than waiting for the answer to be obvious. You have enough AI/ML fluency to reason about why a system is producing what it is producing, not just whether the output looks right.
You bring enough personal finance literacy to read an advice response and have a genuine opinion about whether it is directionally sound. You do not need formal credentials or deep expertise across every domain the system covers. You will partner with subject matter experts for the complex judgment calls. What matters is that your review is substantive rather than mechanical, and that you can have an informed conversation with those experts about what you are seeing in the data.
- Fluency with how LLM-based systems behave in production, including output variance, failure modes, and the limits of automated scoring.
- Ability to assess whether an eval framework is measuring the right things, not just whether it is running correctly.
- Comfort working with behavioral and interaction data to surface patterns and quality signals.
- Familiarity with evaluation and observability tooling.
Backgrounds that tend to fit:
- Model evaluation or QA on a consumer-facing AI product, particularly in a regulated or high-stakes context.
- Model risk or validation with LLM or generative AI exposure.
- Data science or analytics with ownership of production AI system quality.
- Operations quality control built around AI- or ML-generated outputs.
- Financial services or fintech product roles where you developed both analytical depth and personal finance domain familiarity.
This is probably not the right role for you if:
- Your background is primarily in building models rather than evaluating what they produce.
- Personal finance is entirely unfamiliar territory. You do not need to be an expert, but you need enough baseline literacy to assess whether advice is reasonable and to work productively with the SMEs who provide deeper domain judgment.
- You are looking for a well-defined role with stable processes. The framework is in place, but evolving it is a core part of the job.
- You default to manual review rather than thinking systematically about what should be automated and what requires human judgment.
How We Work
We are a fully remote, distributed team. Periodic in-person get-togethers will be integral to our operating cadence. We’re adults who prioritize outcomes and output over set schedules. We value clear writing, high ownership, fast iteration, direct communication, and thoughtful async collaboration.
As an early team member, you should expect broad ownership, frequent context shifts, and a high degree of autonomy. You will help shape not just the product, but also the technical standards and operating cadence of the company.
Compensation
Salary: $120k-$140k, plus early-stage option equity.
Final compensation will depend on level, experience, location, and scope of responsibility.
This role is open to candidates based in the United States.
Equal Opportunity & Accommodations:
Hence is proud to be an equal opportunity employer. We do not discriminate in hiring or any employment decision based on race, color, religion, national origin, age, sex (including pregnancy, childbirth, or related medical conditions), marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other applicable legally protected characteristic. Hence is also committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, please let your recruiter know.