Solutions Architect - AI / ML - Training & GPU infra

Posted 2 months ago
     
5-10 years experience

AI Summary

Design and validate production-grade distributed training and large-scale inference architectures on massive GPU clusters. Collaborate with customers to debug, optimize, and scale ML workloads while influencing product roadmaps based on real-world performance requirements.

AI/ML Solutions Architect – Distributed Training & GPU Infrastructure

Company

Join a fast-moving AI infrastructure team working on the cutting edge of large-scale ML workloads. This role is ideal for engineers who enjoy solving deep technical challenges in distributed training, multi-GPU systems, and scalable AI inference infrastructure. You will work directly with AI-focused clients, helping them get the most out of modern GPUs (H100, B200, etc.) and ML frameworks such as PyTorch (and JAX in some environments).

Team & Responsibilities

Work alongside senior AI and infrastructure engineers building large-scale GPU platforms. As part of the customer solutions team, you will:

  • Design and validate production-grade distributed training (primary) and large-scale inference architectures on large GPU clusters, typically tens to thousands of GPUs

  • Work hands-on with customers to debug, optimize, and scale ML workloads across multi-node GPU environments

  • Act as a technical authority on GPU performance, networking, and schedulers, making trade-offs at scale and translating customer needs into concrete platform requirements

  • Collaborate closely with engineering, product, and R&D to influence roadmap decisions based on real-world ML workloads

This is a hands-on, technical role: you are expected to work directly in customer environments, not only advise at a high level.

Required skills and experience

  • Hands-on experience designing and operating enterprise-scale, production-grade, multi-node GPU workloads for training (models with 7B+ parameters) or inference

  • Strong background in distributed deep learning (PyTorch Distributed, DeepSpeed, ...) on GPU clusters

  • Deep understanding of GPU architecture and interconnects (H100/A100 class, NVLink, InfiniBand)

  • Experience with Kubernetes or Slurm

  • Experience with performance tuning using GPU profiling and monitoring tools

This role is not a fit if your experience is limited to single-node training, high-level AI strategy, or non-production research environments. We are looking for engineers and architects who thrive at the intersection of AI workloads and large-scale infrastructure.

What's offered

Location: Remote from anywhere in Europe

Total compensation up to EUR 250k (base + variable / OTE), depending on level and experience
