Deploy and evolve a microservices-based platform across AWS, GCP, and on-prem environments using Kubernetes. Build and maintain CI/CD pipelines, observability stacks, and GPU-based ML inference services.
We are looking for a DevOps/SRE Engineer to strengthen our team!
Requirements
- Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
- Strong hands-on experience with Linux system administration
- Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
- Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
- Experience with ML inference on GPU/CPU is a strong plus
- Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
- Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
- Advanced expertise in Terraform, Ansible, and Python
- Comfortable working in high-uncertainty environments: we are building a new product, requirements evolve quickly, and the ability to rapidly learn new technologies and patterns is essential
- Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
- Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises
Responsibilities
- Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
- Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
- Build and maintain Docker images for all microservices and ensure a stable service lifecycle
- Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
- Develop, maintain, and evolve custom Helm charts for each service
- Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
- Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
- Manage cluster access via NetBird VPN, implementing role-based access control using group policies
- Deploy and manage infrastructure using IaC practices with Terraform and Ansible
- Develop and continuously improve observability systems:
  - Grafana & Prometheus for metrics
  - ELK stack for centralized log storage and analysis
- Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
- Work with a technology stack that includes Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, and Ansible
What we offer
- The team has built award-winning AI products for tech corporations — devices, voice assistants, products that are actually in the world
- Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
- High engineering bar and real ownership — the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
- Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
- Startup pace with enterprise stability — real clients, real revenue, no bureaucracy
- Fully remote
- 21 vacation days + public holidays + 5 sick days
- Private English lessons via Preply