MADIFF Polska

Site Reliability Engineer (AI)

Posted a month ago

Worldwide

⭐ 5-10 years experience

AI Summary

Build and maintain a central monitoring and alerting layer for AI applications and pipelines to ensure system stability. Manage incidents through triage and root cause analysis while optimizing CI/CD processes and telemetry standards.

This is a remote position.

We are looking for a Senior Site Reliability Engineer to support advanced AI platforms responsible for production-grade applications and pipelines. The role focuses on building and maintaining reliability, scalability, and operational excellence across multiple AI-driven systems.

The engineer will work on a central operational layer for monitoring and managing AI workloads, improving system stability, and reducing incidents. This is a hands-on role requiring direct involvement in diagnosing production issues, implementing fixes, and optimising monitoring, alerting, and CI/CD processes.

The position requires close collaboration with engineering teams to improve release quality, standardise telemetry, and ensure stable and predictable system behaviour in a distributed cloud environment.

Responsibilities

• Build and maintain a central monitoring and alerting layer for AI applications and pipelines

• Define and implement SLIs, alerts, and operational dashboards

• Manage incidents, including triage, coordination, root cause analysis, and prevention

• Standardise telemetry across systems, including latency, throughput, and failures

• Optimise CI/CD pipelines and introduce quality gates for reliability

• Work closely with engineering teams to reduce recurring issues and improve stability

Requirements

• At least 5 years of experience in SRE, Platform, or Production Engineering

• Strong hands-on experience with Kubernetes and production environments

• Experience with Azure and Azure DevOps

• Experience with monitoring tools such as Datadog

• Strong understanding of incident management and root cause analysis

• Ability to build practical monitoring and alerting systems

Nice to have

• Experience with AI or LLM pipelines

• Experience building monitoring platforms across multiple systems

• Experience with Grafana

• Experience working in large-scale or distributed environments

Expectations

• Strong ownership mindset and accountability for system stability

• Proactive approach to identifying risks and improvements

• Hands-on engineer who works directly with systems rather than only coordinating

• Comfortable working in dynamic and evolving environments

Benefits

• Solid, competitive salary

• Work in a multinational environment on international projects

• Comprehensive healthcare

• Long-term B2B contract with a stable project pipeline

• Work model: fully remote

MADIFF Polska

Employees: 51-200

Industry: IT Services and IT Consulting
