1) Can you please provide a summary of the project/initiative which describes what's being done? This SRE / production operations role sits on the provider search and directories platform/operations team, which is responsible for Site Reliability Engineering practices that improve availability, reliability, latency, performance, efficiency, monitoring, emergency response, and capacity planning, forecasting, and management.
2) What does the ideal candidate background look like (ex: healthcare specific background, etc.)? The ideal candidate has experience in Site Reliability Engineering, production operations and optimization practices, and software development, regardless of business domain background.
3) Of the required skills listed, which would you consider the top 3? Please list your expectations regarding years of experience for each requirement.
6+ years of experience in site reliability engineering practices
6+ years of experience in supporting and operating large-scale production systems
6+ years of experience in Unix/Linux shell scripting, infrastructure and application logging, monitoring and observability tools, intelligent alerting, and automated self-healing
4) What experience will set candidates apart from one another? Experience in site reliability engineering practices
5) Are you open to candidates that would need to be 100% remote for the duration of the engagement? Yes, we are open to candidates who need to be 100% remote for the duration of the engagement.
6) Are you open to candidates that cannot convert to FTE without sponsorship? Initially, we are not open to candidates who cannot convert to FTE without sponsorship. I would like to revisit this option after reviewing the profile submissions.
7) What does the team structure look like: how many members, and what is the breakdown of the team's skill sets (ex: 1 PM, 4 Developers, etc.)? 1 Sr Manager, Software Engineering; 1 Sr Systems Mgmt. Analyst; 4 Software Engineers; 1 Sr Site Reliability Engineer (SRE); 1 Lead SRE (this position)
8) What does the interview process look like?
We can revise this further but here are our initial thoughts on the interview process.
Vendor interview (Initial vendor screening)
Round 1: Initial technical screening (manager interview to get an initial sense)
Rounds 2a and 2b: Technical interviews (Engineering Managers / Directors, Software Engineering), minimum 2 hours
Leadership round (VP/Sr Director)
b. Video vs. phone? Video
How technical will the interviews be? Deep dive technical assessment.
When do you anticipate starting the interview process? Immediately, as soon as the vendor identifies a profile.
The Lead Site Reliability Engineer is a technical Subject Matter Expert who proactively drives the technical stability and performance of the applications in the provider technology portfolio. They combine software and systems engineering to design solutions in physical, virtual, and cloud environments that automate fault detection, containment, and resolution without customer impact or human intervention. These solutions typically involve software development for metrics and event collection/correlation across distributed architectures, automation, monitoring, intelligent alerting, random fault injection, and self-healing. Focus areas include:
High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering
Service and Operational Level Agreements
SRE - Standards and best practices
Application scalability/Capacity Management
Technical debt Reduction
Logging, monitoring, intelligent alerting, self-healing
Security Vulnerabilities and Compliance
Application Knowledge Support Artifacts, etc.
Primary Responsibilities:
Responsible for Site Reliability Engineering practices that improve availability, reliability, latency, performance, efficiency, monitoring, emergency response, and capacity planning, forecasting, and management
Design self-healing and resiliency patterns
Responsible for running production systems - ensure applications are available per business SLAs.
Accountable for facilitation, communication, and resolution of high / critical business impact issues, and for driving blameless post-mortems and Root Cause Analysis.
Communicates system related problems and collaborates with other IT teams and managers on solutions, enhancements, and process improvements.
Responsible for production best practices, technical and operating standards, design and implementation of performance and operational enhancements.
Work with engineering teams across SDLC activities to implement best practices to make applications secure and reliable.
Integrate security/compliance tools in deployment pipelines. Apply leadership and teaming skills to coordinate and perform vulnerability assessments using tools, and remediate vulnerabilities within established timeframes.
Responsible for coordination, technical planning, and implementation of product life cycle upgrades, production maintenance, and technology debt reduction activities.
Drive technical features, including intake, prioritization, creation, grooming, and implementation
Drive Chaos Engineering practices to test under real-world conditions
Provide inputs in architectural and design decisions
Design and implement end-to-end monitoring solutions for application and infrastructure components, based on cutting-edge SLO-based telemetry tools
Lead a team of talented software development engineers responsible for a hybrid of software engineering and operations, with an emphasis on reducing operational toil
Manage on-call rotations across continents, using a follow-the-sun model
You will be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role, as well as providing development for other roles you may be interested in.
Required Qualifications:
BS or MS in Computer Science, a related field, or equivalent experience
6+ years of experience in site reliability engineering practices
Experience in supporting and operating large-scale production systems
Experience programming in Java with Spring Boot and APIs
Knowledge of the Unix/Linux shell, ability to write shell scripts, and understanding of Linux internals
Experience with CI/CD and infrastructure automation tools - Jenkins, Terraform, etc.
Experience in infrastructure and application logging, monitoring and observability tools, intelligent alerting, and automated self-healing
Preferred Qualifications:
Experience in public cloud ecosystems (AWS)
Experience in Elasticsearch
Experience in Kafka streaming
Experience with container platforms, such as Kubernetes
Experience with Chaos Engineering