STN Inc

Site Reliability Engineer

Posted 22 days ago

United States

⭐ 5-10 years experience

Site Reliability Engineer

Platform and software · shared across customers

Reports to: Director, Site Reliability

Location: Remote (US)

Department: Cloud Platform Engineering / SRE/Reliability

Position summary

The Site Reliability Engineer (SRE) owns reliability, observability, and incident response for the GPU One (GPUaaS) platform. The SRE defines and enforces SLOs aligned with contractual SLAs, builds the observability stack, and leads major incidents to resolution.

Key responsibilities

Define and operate Service Level Objectives (SLOs) aligned with customer SLAs
Build and maintain the observability stack including metrics, logs, traces, and alerting
Lead incident response and chair post-incident reviews
Drive automation to reduce toil and improve mean time to recovery (MTTR)
Author and maintain operational runbooks alongside the NOC
Manage on-call rotation, escalation paths, and incident-management tooling
Coordinate cross-functionally with NOC, Platform Engineering, and Network Engineering
Drive chaos engineering, game days, and reliability testing programs
Produce SLA performance reports in coordination with the SLA Manager
Mentor junior engineers and contribute to engineering culture

Required qualifications

5+ years in SRE, DevOps, or production engineering roles
Strong programming skills in Go, Python, or both
Hands-on experience operating Kubernetes-based platforms at scale
Deep familiarity with observability tooling (Prometheus, Grafana, Datadog, OpenTelemetry)
Strong incident management experience including major-incident command

Preferred qualifications

GPU or HPC platform operational experience
Familiarity with SLA-driven customer environments and credit calculations
Experience with chaos engineering tools (Gremlin, Litmus, or similar)
Published SRE content or contributions

STN Inc

Employees: 11-50
Industry: IT Services and IT Consulting
