Site Reliability Lead Engineer -- DevOps with Google Cloud Platform || St Louis, MO (REMOTE FOR NOW) || Long Term Contract

Posted: 12 days ago | Location: United States | Salary: undisclosed

Job Description

Site Reliability Lead Engineer -- DevOps with Google Cloud Platform

St Louis, MO

Long Term Contract


The client is looking for a Site Reliability Engineer to manage the end-to-end application and system stack and to work with one of the leading financial services organizations in the US. Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering for building and running large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations.

SRE is also an engineering approach to building and running production systems, in which engineers build solutions to operational problems. SREs are responsible for overall system operation and use a breadth of tools and approaches to solve a broad set of problems. Core practices include limiting time spent on operational work, holding blameless post-mortems, and proactively identifying and preventing potential outages.


As a Site Reliability Engineer,

You will engage in and improve the software development lifecycle from inception and design, through development, deployment, operation and refinement

Develop and maintain large-scale infrastructure

Partner with development teams to help them improve the scalability and reliability of the services they own

Own the build tools and CI/CD automation pipeline

You will influence and design infrastructure, architecture, standards and methods for large-scale systems

You will support services prior to production via infrastructure design, software platform development, load testing, capacity planning and launch reviews

You will maintain services during deployment and in production by measuring and monitoring key performance and service level indicators including availability, latency, and overall system health

You will automate system scalability and continually work to improve system resiliency, performance and efficiency

Investigate, diagnose, and resolve performance and reliability problems in a wide range of large-scale and high-throughput services

Collaborate with architects and application engineers to ensure applications are maintainable, scalable, and follow appropriate disaster recovery and high availability strategies

Contribute to the handbook, runbooks, and general documentation

You will remediate tasks within the corrective action plan via sustainable, preventative, and automated measures whenever possible


Requirements:

BS degree in Computer Science or related technical field, or equivalent job experience required

Over 4 years of SRE experience working in Google Cloud Platform

Strong working knowledge of Google Cloud Platform (GCP)

Experience with DevOps, CI/CD pipelines, and build tools such as Jenkins

Experience in software development in one or more of the following: C, C++, Java, Go, Perl, and/or Python

Must have great communication skills

Experience operating a production environment at high scale with an emphasis on availability and latency

Deep knowledge of container and orchestration tools such as Docker and Kubernetes

Hands-on experience with infrastructure-as-code frameworks, such as Terraform and CloudFormation

Familiarity with configuration management and deployment tools such as Chef and Octopus

Strong team player with a "can do" attitude, and the flexibility to jump in wherever needed

Demonstrable cross-functional knowledge with systems, storage, networking, security and databases

System administration skills, including automation and orchestration of Linux/Windows using Chef, Puppet, Ansible, SaltStack and/or containers (Docker, Kubernetes, etc.)

Proficiency with continuous integration and continuous delivery tooling and practices

Strong analytical and troubleshooting skills

Extra Points for any of the following:

You have expertise in designing, analyzing and troubleshooting large-scale distributed systems

You take a system problem-solving approach, coupled with strong communication skills and a sense of ownership and drive

You are passionate about automation with a desire to eliminate toil whenever possible

You've built software or maintained systems in a highly secure, regulated or compliant industry

You thrive in, and have experience with and passion for, working within a DevOps culture and as part of a team

Thanks & Regards,

Hridesh Pathak

Sr Manager- Client Service & Delivery

Okaya Inc.

4949 Expy Dr N, Suite 101, Ronkonkoma, NY 11779


Landline: extn 621



- provided by Dice