Provision, validate, and triage GPU nodes across various clusters while troubleshooting hardware and software issues. Monitor fleet health and develop or improve operational runbooks to ensure reliability.
fal
7 Remote Job Openings at fal
Lead the design, fit-out, and commissioning of high-density data center white space across owned and colocation sites. Build and manage smart hands and break-fix teams while establishing scalable operational standards for a multi-site portfolio.
Monitor and maintain the health and performance of InfiniBand and Ethernet fabrics to ensure stability at scale. Investigate fabric issues such as congestion and NCCL stalls while improving operational tooling and runbooks.
Build and lead the Fleet Reliability team to ensure GPU nodes are provisioned, validated, and operational. Drive the automation roadmap for self-healing and event-driven remediation while owning 24/7 coverage.
Act as the analytical lead for the Go-to-Market organization, owning metrics for pipeline health, sales execution, and revenue growth. Partner with sales leadership to build measurement frameworks and data foundations for rep performance and segment economics.
Own and scale the end-to-end performance marketing engine, driving qualified traffic and revenue through search and paid social channels. Analyze performance data to identify bottlenecks and collaborate with product teams to optimize growth.
Provide advanced technical support to customers by troubleshooting API, integration, and platform issues. Collaborate with engineering teams to document bugs, improve platform reliability, and enhance developer documentation.