THE ROLE
We're looking for a Senior MLOps Engineer who can set the standard for how we build, ship, and operate ML and AI systems at scale. You sit at the intersection of ML infrastructure and SRE — you'll own the path from model and pipeline to reliable production service, and you'll bring DevOps rigor to systems that are historically under-engineered.
This is not a ticket-processing role, and it's not a research role. You'll tackle hard problems — model serving reliability, inference cost and latency, reproducible pipelines, agentic workload operations — and have the scope to solve them properly. Senior engineers here identify problems before being asked to, and raise the ceiling on what the platform can do.
WHAT YOU'LL WORK ON
Build and operate model and inference serving infrastructure — managing latency, throughput, autoscaling, and reliability for real-time and batch inference across multiple tenants.
Own the ML deployment lifecycle — model registry, versioning, promotion workflows, rollout strategies (canary, shadow, A/B), and safe rollback.
Operate agentic and LLM workloads in production — managing inference providers and gateways, quota and throttling behavior (TPS/TUPS limits), guardrails, prompt/version management, and graceful degradation under load.
Build reproducible, automated ML pipelines — training, evaluation, and deployment pipelines as code, with lineage and reproducibility built in.
Extend infrastructure-as-code to ML systems — Terraform patterns and multi-account design that bring ML infrastructure under the same standards as the rest of the platform.
Operate GitOps for ML workloads — ArgoCD configuration and promotion workflows across environments and tenants.
Run ML and AI workloads on multi-tenant Kubernetes (AWS EKS) — managing GPU/accelerator scheduling, workload placement, tenant isolation, and cost-aware capacity.
Own ML reliability and observability — SLOs for inference services, model and data drift detection, performance regression monitoring, alert quality, on-call ergonomics, and runbook culture.
Drive ML cost efficiency — right-sizing accelerators, managing reserved/spot capacity, and attributing inference cost across tenants and workloads.
Use agentic coding tools for infrastructure and pipeline work — scaffolding environments, generating and reviewing IaC and pipeline code, and accelerating automation.
MUST HAVE
5+ years in platform engineering, SRE, MLOps, or infrastructure — with meaningful time operating production systems at scale.
Hands-on experience deploying and operating ML or AI workloads in production — serving, inference, or training infrastructure that real users depended on.
Strong SRE/DevOps foundation — you've owned reliability for production services, defined and measured SLOs, run post-mortems, and driven measurable improvements.
Deep IaC expertise — you actively manage complex Terraform state and multi-account configurations in production.
Strong GitOps background — you understand declarative infrastructure management at depth and have opinions on how to do it well.
Deep Kubernetes knowledge — you've operated clusters in production, dealt with real failure modes, and understand the system at the control plane level.
Strong AWS background — networking, compute, IAM, storage, multi-account design.
Hands-on experience building and operating CI/CD pipelines — GitHub Actions, CircleCI, GitLab CI, or equivalent — and an understanding of how ML pipelines differ from standard application CI/CD.
Automation-first thinking at a senior level — you implement systems that eliminate entire categories of manual work.
Active user of agentic coding tools — you know how to direct them effectively, review their output critically, and use them to multiply your output.
Strong communicator — you can articulate operational decisions, model performance trade-offs, and incident summaries clearly to engineers and leadership alike.
NICE TO HAVE
Experience with GPU/accelerator scheduling and node lifecycle management in production (e.g., Karpenter).
Experience operating LLM inference at scale — managing provider quotas/throttling (TPS/TUPS), gateways, caching, and guardrails (e.g., AWS Bedrock or equivalent).
Experience with ML pipeline and orchestration tooling — Argo Workflows, Kubeflow, Airflow, SageMaker Pipelines, or equivalent.
Experience with model registries, feature stores, and experiment tracking (e.g., MLflow, Feast, or equivalent).
Familiarity with model and data drift monitoring and ML-specific observability.
Background in FinOps — inference cost attribution and reserved capacity planning.
Familiarity with data infrastructure — object storage, CDC pipelines, or lakehouse patterns.
Experience with multi-tenant infrastructure — isolation patterns, noisy neighbor mitigation, and tenant lifecycle management.
Prior experience scaling ML or platform infrastructure at a startup moving toward enterprise-grade requirements.
WHAT YOU WON'T FIND HERE
A platform team that maintains the status quo. We're actively building — new scale requirements, new architectural domains, and an ML/AI footprint that's growing fast. Senior engineers here shape how the platform evolves, and the tools available to do it are better than they've ever been.
Type: Full-Time, remote
Work hours aligned with EST or PST