Responsible for the architecture, deployment, and maintenance of highly available ELK clusters and Fleet-managed agents. This includes automating operational tasks, optimizing index performance, and implementing robust security and disaster recovery plans.
Job Description:
- Architecting, deploying, managing, and maintaining highly available and fault-tolerant ELK clusters across diverse environments, encompassing Elasticsearch, Logstash, Kibana, and Beats agents.
- Implementing a large-scale, Fleet-managed deployment of Elastic Agents.
- Developing and implementing comprehensive monitoring, alerting, and dashboarding strategies using Kibana visualizations and integrated alerting mechanisms to proactively identify and address system anomalies and performance degradations.
- Automating routine operational tasks, deployment pipelines, and cluster upgrades through sophisticated scripting (e.g., Python, Bash) and infrastructure-as-code principles utilizing tools like Ansible.
- Performing in-depth performance tuning and optimization of Elasticsearch indices, query performance, and underlying hardware/cloud resources to ensure maximum throughput and minimal latency.
- Managing the ingestion pipelines, configuring Logstash filters and outputs, and ensuring efficient data flow from various sources into the Elasticsearch data stores.
- Implementing and enforcing robust security measures across the ELK stack, including access control, encryption (TLS/SSL), and regular vulnerability assessments.
- Troubleshooting complex issues across the entire stack, from data sources and ingestion agents through to the Elasticsearch cluster and Kibana interface, employing systematic diagnostic methodologies.
- Collaborating closely with development and operations teams to understand application requirements, optimize data schemas, and facilitate effective log analysis and troubleshooting.
- Designing and executing disaster recovery and business continuity plans specifically tailored for the ELK platform, ensuring data integrity and service availability.
- Maintaining detailed documentation for system architecture, operational procedures, troubleshooting guides, and configuration standards.
Requirements
- Demonstrable, extensive hands-on experience managing large-scale Elasticsearch clusters, including a deep understanding of index management, shard allocation, replication strategies, and cluster health monitoring.
- Proven expertise in administering and troubleshooting complex Linux systems (e.g., RHEL, Debian), including performance analysis.
- Solid foundational knowledge of web applications, their underlying architectures, and how they interact with logging and monitoring systems.
- A bachelor’s degree in Computer Science, Information Technology, Engineering, or a closely related technical field, or equivalent practical experience.
- Possession of relevant industry certifications such as Elastic Certified Engineer, AWS Certified SysOps Administrator, Red Hat Certified Engineer (RHCE), or equivalent validation of core competencies.
- A minimum of five to seven years of progressive experience in Site Reliability Engineering, Systems Administration, or DevOps roles with a strong focus on large-scale distributed systems.
- Proficiency with essential infrastructure management tools, including configuration management systems (Ansible, Chef, Puppet) and orchestration platforms (OpenShift).
- Expertise in scripting languages such as Bash for automation, system administration tasks, and developing operational tooling.
- Thorough understanding of networking concepts, including TCP/IP, HTTP/S protocols, DNS, load balancing, and firewall configurations relevant to distributed systems.
Preferred Qualifications
- Experience with message queuing technologies like Kafka or RabbitMQ for buffering and decoupling data ingestion processes.
- Hands-on experience with container orchestration systems such as OpenShift, including deploying and managing Logstash within containerized environments.
- Familiarity with various data collection agents beyond Beats, such as Fluentd or Vector, and their respective configuration nuances.
- Knowledge of distributed tracing systems (e.g., Jaeger, Zipkin) and their potential integration or correlation with ELK data.
- Familiarity with CI/CD pipelines and integrating ELK stack deployments and updates into automated release processes.
- A strong grasp of system security best practices, including intrusion detection, vulnerability management, and security hardening techniques for distributed systems.
Benefits
- Competitive salaries and comprehensive health benefits.
- Flexible work hours and remote work options.
- Professional development and training opportunities.
- A supportive and inclusive work environment.