Site Reliability Engineer (SRE)
Number of Resources: 1
Job Type: Contract
Estimated Duration: 6-month contract to start
Desired Experience: 5-7+ years of experience
Location: 100% remote
About The Role
The individual in this role will build, maintain, and support the IR (Information Retrieval) Platform infrastructure for our client's e-commerce platform, with the following responsibilities:
- Evaluate service mesh solutions and create an adoption plan for the Search Platform (this project will be the primary focus of this role).
- Design, build and support the core infrastructure of our search platform.
- Work cross-functionally with various platform teams, ML teams, and product partners to build the next generation of our high-availability search platform in the cloud.
- Build and maintain observability and test tooling for search: logging, monitoring, distributed tracing, alerting, and offline test tools.
- Practice continuous learning and an agile delivery model to stay informed and focused on our deliverables.
- Support GKE services and maintenance, including software upgrades, performance tuning, and GKE cluster tuning and optimization.
- Build GKE tooling for the IR Platform's test environment and automate deployments.
- Plan and test search disaster recovery for zonal failures.
Candidates based in California or Colorado are not eligible.
Advanced development experience in Python
- Another programming language (Java, Scala, or Go) is also acceptable; however, the role will mostly use Python.
- The engineer should have full command of their chosen language and experience developing in it beyond scripting.
- The engineer will ideally have experience working with CI/CD pipelines, using tools such as Jenkins, Buildkite, or GitHub Actions.
Advanced experience with Kubernetes/Docker
- This engineer will lead the team on Kubernetes best practices and technical understanding, and on managing infrastructure at scale.
- This experience should include hands-on work with a cloud provider (AWS, GCP, or Azure).
Advanced (Hands-On) Infrastructure Experience at Scale
- Examples include experience with monitoring, logging, and alerting (e.g., Grafana, Prometheus), distributed tracing, or security.
- Must include experience with Unix/Linux operating systems and the networking stack (e.g., TCP/IP, routing, network topologies, network hardware, and SDN).
- Must include experience with infrastructure at scale (deployed to production and serving significant traffic).
- An engineer with multiple infrastructure consulting engagements is more likely to bring the expertise this team is seeking.
Communication and Collaboration Skills
- Be able to confidently recommend alternative solutions and explain why
- Be a team player who loves to collaborate
- Be highly self-driven and able to take on work independently
- Be a strong collaborator and communicator who leads the engineers around you to grow and learn