Site Reliability Engineering (SRE) - 100% Remote

Posted 6 days ago | United States | Salary undisclosed

Job Description

Location: REMOTE
Description: Our client is currently seeking a Site Reliability Engineer (SRE) for a fully remote position.


Responsibilities:

  • From a practice perspective, the focus will be on defining consistent best practices for teams:
      • Define the SRE framework
      • Define reliable design patterns
      • Define canned reliability user stories for feature delivery
      • Observability: define what good looks like for baseline monitoring and alerting
      • Develop scorecards, gates, and technical debt oversight for the organization
      • Capacity management: define processes and what good looks like, including stress tests and load tests
      • Emergency response: define a consistent problem management process and PIRs
      • Culture: job descriptions, training, common language, and definitions
  • From a chapter perspective, SREs will be accountable for:
      • Leading teams in developing SRE playbooks
      • Ensuring reliability is built into new designs
      • Ensuring canned reliability user stories are executed for every feature
      • Performing design reviews of existing apps
      • Performing production readiness reviews
      • Executing capacity management processes
      • Executing chaos testing
      • Identifying operational functions that need to be automated

Minimum Qualifications:

  • Bachelor's Degree in Information Technology or related area
  • 5+ years of SRE experience in a highly customer-focused environment
  • Proficiency in designing resilient app patterns
  • Expertise in 24x7 site monitoring and the ability to own uptime and performance SLAs for large-scale distributed systems
  • Expertise and hands-on experience operating highly available, scalable, and fault-tolerant systems on container platforms
  • Familiarity with OS tuning, optimization, and system requirements for vertical scaling
  • Proficiency in one or more general-purpose programming languages: Python, Go, shell scripting (Unix/Linux), or Java
  • Expertise with automation tools such as Chef, Puppet, or Ansible

Preferred Skills:

  • Strong leadership skills and the ability to motivate teams
  • Ability to drive change and motivate engineers to develop simple solutions for complex operational challenges
  • Experience collaborating and partnering effectively with multiple teams
  • Experience leading discussions with senior leadership and tailoring the level of technical detail to the audience


Contact:
This job and many more are available through The Judge Group. Find us on the web at