Technology Stack
You'll work with and bring opinions on choosing between the following:
| Category | Technologies / Providers |
|---|---|
| LLM Providers & APIs | Anthropic Claude (primary), OpenAI, AWS Bedrock |
| Local / Self-Hosted LLMs | Ollama, LM Studio, llama.cpp, vLLM; open-weight model families (Llama, Qwen, Mistral, etc.) |
| Agent Frameworks | LangChain / LangGraph, LlamaIndex, OpenAI Agents SDK, or equivalent |
| Retrieval & Knowledge | Vector databases (Pinecone, Weaviate, pgvector); RAG, cache-augmented generation, tool-based agentic retrieval, GraphRAG, hybrid approaches |
| Voice AI | ElevenLabs, VAPI, LiveKit, Deepgram |
| LLM Observability & Eval | LangSmith, Braintrust, Phoenix, Helicone, or similar |
| AI-Assisted Development | Claude Code |
| RouteGenie Stack | Python, Django, PostgreSQL, Angular, TypeScript |
Qualifications & Requirements
Required Qualifications:
Experience: 3+ years of software engineering experience.
Production AI: 1+ year hands-on experience with production LLM / AI features shipped to real users (not prototypes or coursework).
Languages: Strong Python skills; comfort with TypeScript.
Frameworks: Hands-on experience with at least one agent framework and multiple retrieval/context-augmentation approaches, alongside the judgment to choose between them.
APIs: Production experience with major LLM provider APIs from our Tech Stack.
Architectural Judgment: Sound judgment on AI architecture choices. Ability to select the right model and execution environment (third-party API, foundational provider, local/self-hosted open-weight, specialized voice or embedding services) against cost, latency, accuracy, and data-residency constraints. Knows when traditional ML or no AI at all is the right call, and can implement classical ML when it fits.
Quality Measurement: Demonstrated experience measuring AI feature quality in production. Ability to describe specific metrics defined, test datasets built, and how regressions were detected and addressed when models, prompts, or data changed.
Communication: Working professional English; strong async written communication for collaboration across Mexico, Europe, and US time zones.
Strongly Preferred:
Voice AI: Experience with voice AI. NEMT (non-emergency medical transportation) dispatch and customer-service flows are voice-heavy, and voice agents will be a major product surface.
Regulated Data: Experience in a healthcare or regulated-data context (HIPAA, PII/PHI handling) and the disciplines that come with it (audit logging, data minimization, access controls).
Self-Hosting: Local / self-hosted LLM experience running open-weight models on-prem or in a VPC. Critical for PHI-sensitive use cases where data cannot leave our infrastructure.
Anthropic Ecosystem: Claude API / Anthropic SDK experience — including Claude-specific patterns (extended thinking, prompt caching, tool use, computer use, Agents SDK).
Preferred (Nice-to-Have):
LLM observability / eval tooling experience (LangSmith, Braintrust, Phoenix, Helicone, or similar).
Cost and latency optimization at LLM scale (prompt caching, model routing, token budgeting).
Traditional ML / data science background (model training, feature engineering, evaluation methodology).
Django / PostgreSQL background.
Multi-tenant SaaS experience.
Open-source AI contributions or public agent projects.