Ensure the reliability, observability, and performance of high-load AI platform services across cloud and on-prem environments. Establish monitoring systems, manage incidents, and collaborate with developers to optimize infrastructure and scaling.
We currently have several large-scale projects and are expanding our infrastructure team. Our product is an advanced platform for creating and managing AI agents. It can be deployed directly inside a customer’s infrastructure and delivered as an enterprise solution, while also being available as a SaaS version.
Under the hood are real-time voice and telephony, GPU and LLM inference, and streaming analytics, all running both in the cloud and on-prem, including in banking environments. There is a lot of infrastructure; it is complex, interesting, and sometimes at the edge of what is possible. That is why we are looking for a strong SRE who, like us, cares about making systems transparent, reliable, and built the right way.
This is a role for a strong, independent engineer: a Senior SRE with real influence and a voice in how things are built and operated.
You will also handle DevOps tasks for the team, but your main focus and area of expertise should be SRE: reliability, observability, incident management, and performance under load.
Requirements
- 5+ years in SRE/DevOps. You have not just seen production; you have been responsible for the reliability of high-load production systems.
- Deep, practical understanding of Docker and Kubernetes. You have operated them in production, not just used them in tutorials.
- Mature understanding of metrics and alerts, with real hands-on experience writing, tuning, and maintaining them.
- Practical experience with Prometheus, Alertmanager, and Grafana.
- Ability and willingness to build dashboards and make them clear, useful, and easy to work with.
- Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
- Experience with load testing and basic capacity planning.
- Python: you can write code and confidently read and modify other people’s code for automation, exporters, tooling, and related tasks.
- Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge at an operational level.
- DevOps fundamentals: CI/CD and infrastructure as code, including GitHub Actions, Terraform, Ansible, and similar tools.
- Willingness to understand and support the product in customer environments, including on-prem deployments.
- Ownership mindset: you take responsibility for a task, drive it to completion, and think one step ahead.
- Friendly, non-toxic, and pleasant to work with.
- Strong communication with developers: you can clearly and constructively explain your position, defend it when needed, and find common ground.
- Willingness and ability to mentor, teach, and share knowledge with others.
- Analytical mindset: you dig down to the root cause instead of just treating symptoms.
- Proactivity: you would rather prevent an outage than heroically fight it later.
- Strong attention to detail and reliability.
Nice to have
- Experience using AI agents for routine and recurring tasks.
- Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC.
- GPU/ML serving: Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM; understanding of the specifics of deploying LLM/ML models.
- Streaming data and analytics: Kafka, ClickHouse.
- Deep experience with IaC and GitOps, such as Terraform, Ansible, ArgoCD; logging with Loki/ELK; gRPC.
- Experience working in isolated and highly secure environments.
- Experience preparing systems for significant growth in load.
Responsibilities
- You will be responsible for the reliability of our services: SLIs/SLOs, availability, and identifying and eliminating bottlenecks across the system.
- You will set up monitoring for services: metrics, alerts, and dashboards. This will rarely come as a clearly defined task; more often, you will decide what is important to measure and turn it into a clear, usable view.
- You will build and maintain Grafana dashboards that both our team and our customers actually use.
- You will run load testing, analyze the results, and provide recommendations on resources and scaling.
- You will investigate incidents, participate in on-call rotations, write and lead postmortems, and ensure the same failure does not happen again.
- You will work closely with developers: communicate and defend your position, challenge technical decisions, and find win-win solutions.
- You will develop and support Kubernetes-based infrastructure across our clouds, including GCP and AWS, automate routine work, and help with CI/CD and general team tasks.
- You will take part in delivering and supporting the platform for customers, including on-prem deployments.
- You will mentor colleagues and help raise the engineering bar across the team.
What we offer
- The team has built award-winning AI products for tech corporations: devices, voice assistants, and other products that are actually out in the world
- Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), and a voice-first agentic architecture with privacy-first, on-premises deployment
- High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
- Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
- Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
- Fully remote across Europe
- 21 vacation days + public holidays + 5 sick days
- Private English lessons via Preply