Design, deploy, and operate high-performance networking fabrics for GPU clusters, including InfiniBand and RoCE. Manage customer connectivity, BGP peering, and network automation while coordinating with the NOC for monitoring and alerting.
STN Inc
6 Remote Job Openings at STN Inc
Provide tiered technical support for GPUaaS customers via ticketing, chat, and email channels. Resolve Tier 1 issues using runbooks and escalate complex problems to specialists while maintaining knowledge base documentation.
The NOC Engineer manages 24/7 monitoring and first response for GPUaaS infrastructure to protect customer SLAs. Responsibilities include triaging alerts, executing runbooks, and coordinating with on-call specialists during incidents.
This role owns security operations and maintains the compliance posture for the GPUaaS platform, including its SOC 2 and SOC 3 programs. Key duties include leading incident response, managing vulnerability assessments, and handling customer security questionnaires.
Build and operate the multi-tenant orchestration and scheduling layer to transform raw GPU infrastructure into a cloud service. Design customer-facing APIs, CLIs, and automation for node provisioning and image management.
The SRE owns reliability, observability, and incident response for the GPUaaS platform. Key duties include defining SLOs, building the observability stack, and leading major incident resolution.