DevOps/SRE Engineer - Remote till COVID

Posted 9 days ago | United States | Salary undisclosed

Job Description

Position: DevOps/SRE Engineer
Remote till COVID
Job Type: Full-time

Job Description:
Design and manage infrastructure provisioning using IaC tools like Terraform
Develop automation and scripting in Bash, Python, and PowerShell
Set up observability/monitoring with time-series databases for metrics - Graphite, InfluxDB, Druid
Monitor and report on VALET (Volume, Availability, Latency, Errors, Tickets) metrics
Stand up scalable and fault-tolerant logging, aggregation, and message queuing solutions using ELK, Kafka, Splunk, etc.
Infrastructure monitoring - CPU, memory, disk, network - across large clusters of servers
Create comprehensive SRE dashboards using Grafana, configuring alerts based on critical thresholds and triggering self-healing scripts
Set and configure critical alert thresholds, and route alerts to tools like PagerDuty, Slack, and Teams
Use APM tools like New Relic, Datadog, AppDynamics, etc.
Set up Incident Management, Root Cause Analysis (RCA), blameless postmortems, and remediation processes
Integrate and maintain Ticketing tools/services like Zendesk and CMDB tools like ServiceNow for inventory and configuration management.
Use source code management tools like Git with proper Gitflow and branching strategies.
Create CI/CD pipelines using GitHub Actions, Azure DevOps Pipelines, Spinnaker
Leverage containerization and orchestration using Kubernetes as well as monitoring of Kubernetes cluster nodes, services and applications using Prometheus and Grafana
Set up workflow orchestration using Apache Airflow, Amazon SWF (Simple Workflow Service)
Practice Chaos Engineering with tools like Chaos Monkey and Gremlin, conducting experiments by injecting faults into running production environments.
Run cloud cost optimization exercises leveraging auto scaling and reserved instances for persistent production services

Must-Have Skills:
Bachelor's degree with 4+ years of experience in Development, Operations, DevOps, or SRE teams
IaC using Terraform
Scripting knowledge - Bash, Python, and/or PowerShell
Cloud knowledge - AWS and/or Azure
CI/CD pipelines with GitHub Actions, Azure DevOps
Monitoring and Dashboarding using Grafana
Standing up a scalable and fault-tolerant ELK stack
SRE knowledge - Monitoring, Incident Management
Metrics dashboards using time-series databases like Graphite and InfluxDB with Grafana; showcase VALET metrics
APM tools experience - New Relic, Datadog, AppDynamics
Chaos Engineering tools - Gremlin, Chaos Monkey