Site Reliability Engineer (SRE) - Remote

Posted 2 months ago | United States | Salary undisclosed

Job Description

Role: Site Reliability Engineer (SRE)

Location: Cupertino, CA

Duration: 6+ months

Type of hire: W2

Description:
You will maintain and improve the client's next-generation telemetry system. The system is critical for a wide range of our teams to maintain their services' reliability and health. We expect the candidate to be highly self-motivated, with a passion for excellence, quality, and detail. The candidate will support development and operations (SRE) with a focus on improving the stability and scalability of the overall system.
Key qualifications:

  • Certified Kubernetes Application Developer (CKAD) - Good to have
  • Deploying and triaging large-scale distributed applications on k8s and other cloud platforms - Must have
  • Experience with technologies like Cassandra, Zookeeper, Kafka, Spark - Minimum knowledge
  • Ability to troubleshoot issues across the entire software stack - Must have
  • Experience with Helm, Docker, Terraform and general containerization of microservices - Must have
  • Strong coding in Python or Scala
  • Experience with Prometheus, Grafana and Telegraf - Good to have
  • Knowledge of the Linux operating system and its variants - Good to have (Not looking for sys-admin)
  • Excellent communication skills - Must have
Responsibilities:

  • On-call for applications running on k8s and other platforms
  • P2 Customer PD Incidents
  • Keeping dev, integration, and production clusters online and meeting SLOs
  • Building tooling and automation for various operational tasks
  • Improving overall application health and customer experience
- provided by Dice