GENERAL DESCRIPTION
The Senior Data Architect owns our canonical data architecture: the schema, contracts, tenancy, and governance that every product and every AI/ML workload builds on. At its center is the canonical data model, a single normalized definition of the core business objects shared across our products. You are its sole owner, and it is the standard the rest of engineering builds against. This is a foundational, hands-on role: you design, prototype, and ship reference implementations and in-repo guardrails, not just diagrams.
Our approach to AI is to build durable, domain-specific data assets rather than commodity model infrastructure: we don't pretrain foundation models, and we don't ship thin wrappers around someone else's models. The differentiated value lies in how our data is modeled, governed, and made trustworthy for AI, and that is the layer you own.
KEY RESPONSIBILITIES
AI/ML readiness
- Architect the data layer so that AI/ML workloads (vector search, embedding pipelines, retrieval-augmented generation, model training) run on a clean, governed substrate.
- Make production data AI-ready: well-modeled, contract-enforced, lineage-tracked, and drift-detectable.
- Design the data-side integration patterns these workloads depend on, such as feature-store and vector-store patterns across document, relational, and embedding data.
Data architecture
- Own the canonical data model — the normalized definition of the core business objects shared across our products — and decide what is canonical versus tenant-specific.
- Establish data architecture standards, data contracts, and schema discipline the rest of engineering builds against, enforced in-repo.
- Exercise strong polyglot-persistence judgment: what belongs in document vs. relational vs. vector stores, and how to migrate between them without big-bang rewrites.
- Define the multi-tenant data architecture: tenancy isolation, data residency posture, and per-tenant cost attribution across storage and compute.
Modernization
- Lead staged modernization toward the right mix of stores and patterns for transactional, analytical, and AI/ML use cases — improving scalability, governance, and usability while minimizing disruption.
- Own the architectural direction of the data pipeline and lake / lakehouse layer: ingestion, transformation, orchestration, and storage tiers.
- Lead the move from homegrown pipelines to proven, industry-standard platforms, balancing build-vs-buy and total cost of ownership.
- Modernize legacy data-access patterns via incremental, strangler-fig migrations that keep production stable.
Technical leadership
- Drive hands-on prototypes, reference implementations, and in-repo guardrails.
- Define the data, storage, and retrieval patterns the rest of engineering builds against.
- Establish data quality, testing, lineage, and observability standards for pipelines and AI/ML serving.
- Mentor engineers on schema discipline, modern data practices, and AI/ML-readiness patterns.
- Make canonical decisions that are time-boxed, written, and defensible; hold teams to disagree-and-commit rather than letting schema debates turn into a standing committee.
- Use AI-assisted development tools (Claude Code, Copilot, Cursor) as a force multiplier for schema design, query tuning, and migration scripting.
Cross-team partnership
- Partner with database engineering on production data health while owning long-term architectural direction.
- Partner with ML and application engineering on their data needs — structuring and governing data so it is retrieval-ready and safe to build on.
- Partner with platform / infrastructure on reliability, disaster recovery, residency, and the multi-tenant operational posture.
QUALIFICATIONS
- 8+ years in data architecture, data engineering, database administration, or analytics engineering, with 3+ years in senior / lead roles.
- Demonstrated ownership of a canonical or enterprise data model / cross-product schema — the model and contracts other teams built against.
- Hands-on MongoDB experience at production scale (Atlas M40+ ideal): document modeling, the aggregation framework, indexing, change streams, sharding, and replica sets, plus the judgment to recognize the Mongo-as-RDBMS anti-pattern.
- Strong polyglot-persistence judgment: deciding what belongs in documents vs. relational vs. a vector store, and migrating between them incrementally.
- Hands-on relational depth: schema design, indexing strategy, and query tuning, plus familiarity with vector search (Atlas Vector Search, pgvector, or equivalent).
- Production experience making data AI/ML-ready: data architecture supporting RAG, semantic search, embeddings / vector pipelines, or agentic workloads.
- Multi-tenant architecture experience, including tenancy isolation, data residency, and per-tenant cost attribution.
- Pipeline / ELT / lake / lakehouse design at scale, with incremental migration strategies that minimize disruption.
- Experience with cloud-native data services (Azure, AWS, or GCP).
- Strong grasp of data quality, testing, lineage, and monitoring — including observability for pipelines and AI/ML serving.
- Comfortable modeling a complex, specialized domain. MEP / AEC / construction experience is a plus; appetite to learn the domain is required.
NICE TO HAVE
- Knowledge-graph, ontology, or semantic-layer experience.
- CDC and cross-engine sync (MongoDB Change Streams, Debezium, or equivalent).
- Lakehouse platforms (Databricks, Snowflake, or open table formats — Iceberg, Delta, Hudi) and feature stores (Feast or equivalent).
- Data governance for AI/agent access to production data: query-cost controls, read-path safety, lineage, and audit for higher-risk use cases.
- SOC 2 and data-classification experience.
- Azure data ecosystem (Data Factory, Synapse, Functions, Event Grid).
- MongoDB certification (Associate DBA / Developer or higher) or substantive MongoDB University coursework.
WHAT SUCCESS LOOKS LIKE — FIRST YEAR
- The canonical data model is owned and enforced: teams build against stable, documented contracts instead of bespoke forks.
- Workloads sit in the right stores, legacy anti-patterns are receding, and reliability targets are holding.
- Tenancy is formalized and per-tenant cost attribution is instrumented, so cost and capacity are observable as we scale.
- The data substrate is AI-ready — model, contracts, and lineage in place — so AI/ML work builds on a solid foundation rather than waiting on data.
- You've done it in partnership: the data tier is healthier, and engineers build against your contracts.
BENEFITS
- Comprehensive and competitive health benefits plan
- Matching 401(k) contributions
- 20 days annual PTO
- Primarily remote work with occasional team onsites
This is a fully remote position open to candidates based in the United States.