Site Reliability Engineering (SRE) - 100% Remote

Posted 6 days ago | United States | Salary undisclosed

Job Description

Location: REMOTE
Description: Our client is currently seeking a remote Site Reliability Engineer (SRE).


  • From a practice perspective, the focus will be on defining consistent best practices for teams:
      • Define the SRE framework
      • Define reliable design patterns
      • Define canned reliability user stories for feature delivery
      • Observability: define what good looks like for baseline monitoring and alerting
      • Develop scorecards, gates, and technical debt oversight for the organization
      • Capacity management: define processes and what good looks like, including stress tests and load tests
      • Emergency response: define a consistent problem management process and PIRs
      • Culture: job descriptions, training, common language, definitions
  • From a chapter perspective, SREs will be accountable for:
      • Leading teams in developing SRE playbooks
      • Ensuring reliability is built into new designs
      • Ensuring canned reliability user stories are executed for every feature
      • Performing design reviews of existing apps
      • Performing production readiness reviews
      • Executing capacity management processes
      • Executing chaos testing
      • Identifying operational functions that need to be automated

Minimum Qualifications:

  • Bachelor's Degree in Information Technology or related area
  • 5+ years of SRE experience in a highly customer-focused environment
  • Proficiency in designing resilient app patterns
  • Expertise in 24x7 site monitoring and the ability to own uptime and performance SLAs for large-scale distributed systems
  • Expertise and operational experience in running highly available, scalable, and fault-tolerant systems on container platforms
  • Familiarity with OS tuning, optimization, and system requirements for vertical scaling
  • Proficiency in one or more general-purpose programming languages: Python, Go, shell scripting (Unix/Linux), Java
  • Expertise with automation tools such as Chef, Puppet, and Ansible

Preferred Skills:

  • Strong leadership skills and the ability to motivate teams.
  • Ability to drive change and motivate engineers to develop simple solutions for complex operational challenges.
  • Experience collaborating and partnering effectively with several other teams.
  • Experience leading discussions with senior leadership, and the ability to tailor the level of technical detail to the audience.

This job and many more are available through The Judge Group. Find us on the web at