ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.
At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.
We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.
Role overview
As a Network Engineer at ai&, you are the domain expert on the lossless networking fabrics that tie our GPU fleet together. AI at scale lives and dies on the network. Collective communication operations (AllReduce, AllGather, ReduceScatter) are on the critical path of every distributed training and inference workload we run. Your job is to make sure the fabric is fast, lossless, and never the bottleneck.
You will work across RoCE v2 and InfiniBand fabrics, tune NCCL and network interfaces, and own the end-to-end network performance of our compute clusters. You will work closely with the systems, kernel, and inference teams to ensure that what gets built at the physical layer translates directly into performance at the workload layer.
Responsibilities
Lossless Fabric Design & Operations: Design, deploy, and operate lossless networking fabrics across our data centers. Own RoCE v2 and InfiniBand (NDR/XDR) deployments end to end.
NCCL & Interface Tuning: Tune NCCL, NICs, and DPUs to guarantee maximum bandwidth and zero packet loss for distributed AI workloads. Own the performance of collective communication operations across the fleet.
Network Architecture: Design the network architecture for new data center deployments. Make topology, switch, and cabling decisions that scale from current clusters to future multi-site deployments.
Performance Monitoring & Optimization: Instrument the network for observability. Proactively identify and eliminate bottlenecks before they affect workloads. Own network performance benchmarks and drive continuous improvement.
Cross-Team Collaboration: Work closely with the systems, storage, and ML infrastructure teams to ensure the network fabric supports the demands of distributed training and inference at every scale.
You may be a fit if you have the following skills
AI Networking Expertise: Deep experience designing and operating lossless AI networking fabrics. You have worked with InfiniBand and RoCE v2 at scale and you understand the trade-offs between them.
NCCL & Collective Communications: Hands-on experience tuning NCCL for distributed AI workloads. You understand how collective communication patterns interact with network topology and you know how to optimize for both bandwidth and latency.
NIC & DPU Proficiency: Experience configuring and tuning high-performance NICs and DPUs, such as the NVIDIA ConnectX and BlueField series.
Network Architecture Judgment: You make network design decisions that hold up at scale. Fat-tree topologies, rail-optimized designs, congestion control: you have an informed view on all of it.
Great Team Spirit: A mission-driven approach to engineering that values clear communication, hands-on execution, and collective success over individual silos.