Wizdaa

Senior MLOps Engineer - SRE | DevOps

Posted 2 hours ago

Brazil, Iceland

⭐ 5-10 years experience

Please mention DailyRemote when applying

AI Summary

Build and operate scalable ML and AI inference infrastructure, focusing on reliability, latency, and cost efficiency. Own the end-to-end ML deployment lifecycle, including automated pipelines, GitOps workflows, and multi-tenant Kubernetes management.

THE ROLE

We're looking for a Senior MLOps Engineer who can set the standard for how we build, ship, and operate ML and AI systems at scale. You sit at the intersection of ML infrastructure and SRE — you'll own the path from model and pipeline to reliable production service, and you'll bring DevOps rigor to systems that are historically under-engineered.

This is not a ticket-processing role, and it's not a research role. You'll tackle hard problems — model serving reliability, inference cost and latency, reproducible pipelines, agentic workload operations — and have the scope to solve them properly. Senior engineers here identify problems before being asked to, and raise the ceiling on what the platform can do.

WHAT YOU'LL WORK ON

Build and operate model and inference serving infrastructure — managing latency, throughput, autoscaling, and reliability for real-time and batch inference across multiple tenants.
Own the ML deployment lifecycle — model registry, versioning, promotion workflows, rollout strategies (canary, shadow, A/B), and safe rollback.
Operate agentic and LLM workloads in production — managing inference providers and gateways, quota and throttling behavior (TPS/TUPS limits), guardrails, prompt/version management, and graceful degradation under load.
Build reproducible, automated ML pipelines — training, evaluation, and deployment pipelines as code, with lineage and reproducibility built in.
Extend infrastructure-as-code to ML systems — Terraform patterns and multi-account design that bring ML infrastructure under the same standards as the rest of the platform.
Operate GitOps for ML workloads — ArgoCD configuration and promotion workflows across environments and tenants.
Run ML and AI workloads on multi-tenant Kubernetes (AWS EKS) — managing GPU/accelerator scheduling, workload placement, tenant isolation, and cost-aware capacity.
Own ML reliability and observability — SLOs for inference services, model and data drift detection, performance regression monitoring, alert quality, on-call ergonomics, and runbook culture.
Drive ML cost efficiency — right-sizing accelerators, managing reserved/spot capacity, and attributing inference cost across tenants and workloads.
Use agentic coding tools for infrastructure and pipeline work — scaffolding environments, generating and reviewing IaC and pipeline code, and accelerating automation.

MUST HAVE

5+ years in platform engineering, SRE, MLOps, or infrastructure — with meaningful time operating production systems at scale.
Hands-on experience deploying and operating ML or AI workloads in production — serving, inference, or training infrastructure that real users depended on.
Strong SRE/DevOps foundation — you've owned reliability for production services, defined and measured SLOs, run post-mortems, and driven measurable improvements.
Deep IaC expertise — you actively manage complex Terraform state and multi-account configurations in production.
Strong GitOps background — you understand declarative infrastructure management at depth and have opinions on how to do it well.
Deep Kubernetes knowledge — you've operated clusters in production, dealt with real failure modes, and understand the system at the control plane level.
Strong AWS background — networking, compute, IAM, storage, multi-account design.
Hands-on experience building and operating CI/CD pipelines — GitHub Actions, CircleCI, GitLab CI, or equivalent — and an understanding of how ML pipelines differ from standard application CI/CD.
Automation-first thinking at a senior level — you implement systems that eliminate entire categories of manual work.
Active user of agentic coding tools — you know how to direct them effectively, review their output critically, and use them to multiply your output.
Strong communicator — you can articulate operational decisions, model performance trade-offs, and incident summaries clearly to engineers and leadership alike.

NICE TO HAVE

Experience with GPU/accelerator scheduling and node lifecycle management in production (e.g., Karpenter).
Experience operating LLM inference at scale — managing provider quotas/throttling (TPS/TUPS), gateways, caching, and guardrails (e.g., AWS Bedrock or equivalent).
Experience with ML pipeline and orchestration tooling — Argo Workflows, Kubeflow, Airflow, SageMaker Pipelines, or equivalent.
Experience with model registries, feature stores, and experiment tracking (e.g., MLflow, Feast, or equivalent).
Familiarity with model and data drift monitoring and ML-specific observability.
Background in FinOps — inference cost attribution, reserved capacity planning.
Familiarity with data infrastructure — object storage, CDC pipelines, or lakehouse patterns.
Experience with multi-tenant infrastructure — isolation patterns, noisy neighbor mitigation, and tenant lifecycle management.
Prior experience scaling ML or platform infrastructure at a startup moving toward enterprise-grade requirements.

WHAT YOU WON'T FIND HERE

A platform team that maintains the status quo. We're actively building — new scale requirements, new architectural domains, and an ML/AI footprint that's growing fast. Senior engineers here shape how the platform evolves, and the tools available to do it are better than they've ever been.

Type: Full-Time, remote

Work hours aligned with EST or PST
