Build and maintain software and tooling to manage a large fleet of GPU servers, focusing on provisioning, health monitoring, and recovery. Optimize Linux systems and storage for AI workloads while implementing OS-level security and compliance.
19 Remote Job Openings at fal
Own and operate Kubernetes infrastructure, including cluster lifecycle, networking, and multi-tenant isolation. Build and maintain CI/CD pipelines while leveraging AI to automate production issue resolution and improve system reliability.
Build and maintain software and tooling to manage a large fleet of GPU servers, focusing on provisioning, health monitoring, and recovery. Optimize Linux systems for AI workloads and implement OS-level security and storage management.
Build and evolve a core Python/Rust platform focusing on request routing, AI workload orchestration, and GPU autoscaling. Design systems to handle 100x traffic growth while maintaining low latency and high reliability.
Provide advanced technical support to customers and internal teams by resolving API, UI, and integration issues. Collaborate with engineering to document bugs, improve platform reliability, and maintain technical documentation.
Build and scale a high-performing acquisition engine across search and paid social platforms to drive qualified traffic and revenue. Analyze performance deeply to identify bottlenecks and collaborate with product teams to improve funnel conversion.
Build and operate the data infrastructure and ETL pipelines to track cost, margin, and performance across production systems and vendor APIs. Partner with infrastructure and product teams to define data contracts and implement low-latency analytical write paths.
Build and lead the Fleet Reliability team to ensure GPU nodes are provisioned, validated, and operational. Drive the automation roadmap for self-healing and event-driven remediation while owning 24/7 coverage.
Provision, validate, and triage GPU nodes across various clusters while troubleshooting hardware and software issues. Monitor fleet health and develop or improve operational runbooks to ensure system reliability.
Monitor and maintain the health and performance of InfiniBand and Ethernet fabrics, including switches and HCAs. Investigate fabric issues, support new bring-ups, and improve operational tooling and runbooks.
Lead the design, fit-out, and commissioning of high-density data center white space across owned and colocation sites. Build and manage smart hands and break-fix teams while overseeing vendors and technical specifications for supercomputing clusters.
Lead the design, fit-out, and commissioning of high-density data center white space across owned and colocation sites. Build and manage smart hands and break-fix teams while establishing scalable operational standards for a multi-site portfolio.
Monitor and maintain the health and performance of InfiniBand and Ethernet fabrics to ensure stability at scale. Investigate fabric issues such as congestion and NCCL stalls while improving operational tooling and runbooks.
Provision, validate, and triage GPU nodes across various clusters while troubleshooting hardware and software issues. Monitor fleet health and develop or improve operational runbooks to ensure reliability.
Act as the analytical lead for the Go-to-Market organization, owning metrics for pipeline health, sales execution, and revenue growth. Partner with sales leadership to build measurement frameworks and data foundations for rep performance and segment economics.
Own and scale the end-to-end performance marketing engine, driving qualified traffic and revenue through search and paid social channels. Analyze performance data to identify bottlenecks and collaborate with product teams to optimize growth.
Provide advanced technical support to customers by troubleshooting API, integration, and platform issues. Collaborate with engineering teams to document bugs, improve platform reliability, and enhance developer documentation.