Platform and software · shared across customers
Reports to: Director, Site Reliability
Location: Remote (US)
Department: Cloud Platform Engineering / SRE/Reliability
The Site Reliability Engineer (SRE) owns reliability, observability, and incident response for the GPU One (GPUaaS) platform. The SRE defines and enforces SLOs aligned with contractual SLAs, builds the observability stack, and leads major incidents to resolution.
Define and operate Service Level Objectives (SLOs) aligned with customer SLAs
Build and maintain the observability stack including metrics, logs, traces, and alerting
Lead incident response and chair post-incident reviews
Drive automation to reduce toil and improve mean time to recovery (MTTR)
Author and maintain operational runbooks alongside the NOC
Manage the on-call rotation, escalation paths, and incident-management tooling
Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering
Drive chaos engineering, game days, and reliability testing programs
Produce SLA performance reports in coordination with the SLA Manager
Mentor junior engineers and contribute to engineering culture
5+ years in SRE, DevOps, or production engineering roles
Strong programming skills in Go, Python, or both
Hands-on experience operating Kubernetes-based platforms at scale
Deep familiarity with observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry)
Strong incident management experience including major-incident command
GPU or HPC platform operational experience
Familiarity with SLA-driven customer environments and credit calculations
Experience with chaos engineering tools (Gremlin, Litmus, or similar)
Published SRE content or contributions