fal

Operations Engineer, Fleet Reliability

Posted a month ago

Worldwide

⭐ 2-5 years experience


AI Summary

Provision, validate, and triage GPU nodes across various clusters while troubleshooting hardware and software issues. Monitor fleet health and develop or improve operational runbooks to ensure reliability.

fal is the generative media ecosystem powering the next generation of AI products. We build the infrastructure, tools, and model access that teams need to move from idea to production, and do it at scale without compromise. For developers and enterprises, fal is the foundation that makes generative media not just possible, but practical: a unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.

As generative media reshapes industries across a market projected to grow by hundreds of billions over the next decade, fal is becoming the ecosystem that ambitious teams build on.

About the role

As we bring up owned clusters alongside our cloud capacity, we're hiring Operations Engineers to keep the fleet alive. This is a hands-on role. You're first in line when nodes go bad, GPUs throw ECC errors, IB links flap, or a rack stops responding. You'll provision new nodes, validate them, ship them to production, and troubleshoot whatever entropy throws at them. You'll be on-call. You'll be in the weeds.

You're a fit if you've:

Administered Linux systems in the critical path
Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
Worked with observability systems like Grafana and Prometheus
Scripted your way out of repetitive work (Bash, Python, Go, whatever)

Who you are:

Curious. You don't accept "it's flaky" as a root cause
Comfortable with ambiguity. The runbook doesn't exist yet for half of what you'll do
On-call doesn't scare you
You'd rather automate a problem than fix it twice

Responsibilities:

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
Troubleshoot hardware and software issues across compute, network, and storage
Monitor fleet health, take remediation action, and push fixes upstream when needed
Write the runbooks. Improve the ones that exist. Delete the ones that don't work


fal

Employees: 51-200
Industry: Technology, Information and Internet
