Context
AS+ provides run support for GPU clusters operated by a cloud infrastructure partner. We are building a support team to handle day-to-day incidents on these clusters. This first role focuses on weekday coverage. The work sits low in the stack (hardware and network diagnosis) rather than high-level HPC or application support.
Responsibilities
Diagnose and triage incidents on GPU compute clusters, determining whether a fault originates on our side or the client's.
Investigate hardware failures: collect and analyze hardware logs, identify failed components, and document findings for resolution or RMA.
Diagnose GPU hardware faults (failure detection and isolation, not performance tuning or porting).
Configure and troubleshoot network connectivity, including InfiniBand fabric.
Work directly with the client as first line of support, in English.
Required skills
Solid system and network fundamentals: low-level networking and connectivity diagnosis.
Hands-on hardware troubleshooting, ideally on Dell server hardware.
Ability to diagnose GPU hardware failures (no deep GPU expertise required).
InfiniBand knowledge (important).
Fluent English (all client communication is in English).
Not required
No advanced OS administration.
No Slurm or workload-scheduler expertise.
No HPC application or GPU-porting background.
Setup
Fully remote.
Weekday coverage (first hire; the team will expand to cover a wider window).