SRE

 Posted 3 hours ago
     
5-10 years experience

AI Summary

Ensure the reliability, observability, and performance of high-load AI platform services across cloud and on-prem environments. Establish monitoring systems, manage incidents, and collaborate with developers to optimize infrastructure and scaling.

We currently have several large-scale projects and are expanding our infrastructure team. Our product is an advanced platform for creating and managing AI agents. It can be deployed directly inside a customer’s infrastructure and delivered as an enterprise solution, while also being available as a SaaS version.

Under the hood, there is real-time voice and telephony, GPU and LLM inference, streaming analytics, and all of this runs both in the cloud and on-prem, including in banking environments. There is a lot of infrastructure; it is complex, interesting, and sometimes at the edge of what is possible. That is why we are looking for a strong SRE who, like us, cares about making systems transparent, reliable, and built the right way.

This is a role for a strong, independent engineer: a Senior SRE with real influence and a voice in how things are built and operated.

You will also handle DevOps tasks for the team, but your main focus and area of expertise should be SRE: reliability, observability, incident management, and performance under load.


Requirements

  • 5+ years in SRE/DevOps. You have not just seen production; you have been responsible for the reliability of high-load production systems.
  • Deep, practical understanding of Docker and Kubernetes. You have operated them in production, not just used them in tutorials.
  • Mature understanding of metrics and alerts, with real hands-on experience writing, tuning, and maintaining them.
  • Practical experience with Prometheus, Alertmanager, and Grafana.
  • Ability and willingness to build dashboards and make them clear, useful, and easy to work with.
  • Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
  • Experience with load testing and basic capacity planning.
  • Python: you can write code and confidently read and modify other people’s code for automation, exporters, tooling, and related tasks.
  • Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge at an operational level.
  • DevOps fundamentals: CI/CD and infrastructure as code, including GitHub Actions, Terraform, Ansible, and similar tools.
  • Willingness to understand and support the product in customer environments, including on-prem deployments.
  • Ownership mindset: you take responsibility for a task, drive it to completion, and think one step ahead.
  • Friendly, non-toxic, and pleasant to work with.
  • Strong communication with developers: you can clearly and constructively explain your position, defend it when needed, and find common ground.
  • Willingness and ability to mentor, teach, and share knowledge with others.
  • Analytical mindset: you dig down to the root cause instead of just treating symptoms.
  • Proactivity: you would rather prevent an outage than heroically fight it later.
  • Strong attention to detail and reliability.


Nice to have

  • Experience using AI agents for routine and recurring tasks.
  • Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC.
  • GPU/ML serving: Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM; understanding of the specifics of deploying LLM/ML models.
  • Streaming data and analytics: Kafka, ClickHouse.
  • Deep experience with IaC and GitOps, such as Terraform, Ansible, ArgoCD; logging with Loki/ELK; gRPC.
  • Experience working in isolated and highly secure environments.
  • Experience preparing systems for significant growth in load.



Responsibilities

  • You will be responsible for the reliability of our services: SLIs/SLOs, availability, and identifying and eliminating bottlenecks across the system.
  • You will set up monitoring for services, metrics, alerts, and dashboards. This will rarely come as a clearly defined task; more often, you will decide what is important to measure and turn it into a clear, usable view.
  • You will build and maintain Grafana dashboards that people actually use, both our team and our customers.
  • You will run load testing, analyze the results, and provide recommendations on resources and scaling.
  • You will investigate incidents, participate in on-call rotations, write and lead postmortems, and ensure the same failure does not happen again.
  • You will work closely with developers: communicate and defend your position, challenge technical decisions, and find win-win solutions.
  • You will develop and support Kubernetes-based infrastructure across our clouds, including GCP and AWS, automate routine work, and help with CI/CD and general team tasks.
  • You will take part in delivering and supporting the platform for customers, including on-prem deployments.
  • You will mentor colleagues and help raise the engineering bar across the team.

What we offer

  • The team has built award-winning AI products for tech corporations: devices, voice assistants, and other products that are actually in the world
  • Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
  • High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
  • Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would elsewhere
  • Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
  • Fully remote across Europe
  • 21 vacation days + public holidays + 5 sick days 
  • Private English lessons via Preply
