ClickHouse is looking for an experienced engineer to join our Observability team. We build and operate the telemetry platform that powers both internal monitoring and the observability features our customers rely on. Our systems ingest trillions of events per day with sustained throughput in the tens of millions per second. Engineers on the team are hybrid software, systems, and infrastructure engineers who ensure this platform is reliable, scalable, and efficient. We work closely with product and infrastructure teams and play a key role in major engineering initiatives across the company.
We're looking for someone who thrives in fast-paced environments, isn't afraid to get hands-on during incidents, and knows when to automate the pain away. While experience in roles like Software Engineer, SRE, Systems Engineer, or DevOps is valuable, we care most about your problem-solving skills and mindset. If you enjoy tackling complex challenges across system design, infrastructure, automation, and incident response—while helping us scale with confidence—you’ll fit right in.
What you’ll do
- Design, build, and operate distributed systems that power observability across ClickHouse Cloud
- Own reliability, performance, and cost-efficiency of our telemetry pipeline and storage systems
- Take part in the on-call rotation and help drive root-cause resolution and long-term fixes
- Build tooling and automation to eliminate repetitive operational work
- Help shape the roadmap for observability by identifying bottlenecks and scaling challenges
- Collaborate with other engineering teams to improve their observability posture
- Contribute to design discussions and architecture reviews, and mentor teammates
What we’re looking for
- Strong bias for action and ownership — you ship, fix, and improve systems proactively
- Great production debugging skills and a problem-solving mindset
- Strong communication skills; comfortable working in a remote, async-friendly team
- Experience balancing system performance, reliability, and cost
- Ability to iterate quickly: build MVPs, collect feedback, and improve continuously
Requirements
- 5+ years building and running production systems at scale
- Proficiency in at least one programming language (we use Go, but C++, Rust, Python, etc. are fine)
- Experience with Kubernetes, Helm, ArgoCD, and Terraform or similar IaC tools
- Experience working with at least one major cloud provider (AWS, GCP, or Azure)
- Familiarity with OpenTelemetry, Prometheus, Grafana, or similar tools
- Experience with ClickHouse