Hard Rock Digital

Senior Site Reliability Engineer

Posted 2 months ago

Poland

5-10 years of experience

AI Summary

Maintain the reliability and performance of high-traffic Java applications while pioneering AI-driven operations. Design and build autonomous AI agents to automate incident response, alert triage, and observability workflows.

Location: Poland only, fully remote

Job Type: B2B, full time

Overview

Hard Rock Digital is a team focused on becoming the best online sportsbook, casino, and social gaming company in the world. We care about each customer's interaction, experience, behaviour, and insight, and we strive to ensure we’re always acting authentically.

Rooted in the kindred spirits of the Seminole Tribe of Florida, the new Hard Rock Digital taps a brand known all over the world as the leader in gaming, entertainment, and hospitality. We’re taking that foundation of success and bringing it to the digital space.

What’s the position?

We are looking for a Senior Site Reliability Engineer who combines deep infrastructure expertise with a forward-thinking approach to AI-driven operations. In this role you will maintain and improve the reliability, scalability, and performance of our Java-based applications while pioneering the use of large language models (LLMs), agentic workflows, and intelligent automation to transform how we monitor, respond to, and prevent incidents.

You will design and build autonomous and semi-autonomous AI agents that consume observability data, triage alerts, generate runbooks, automate incident response steps, and surface actionable insights—reducing toil and accelerating mean time to resolution. This is a hands-on engineering role for someone who is equally comfortable tuning a JVM, writing PromQL, and prototyping an agentic pipeline with tool-calling LLMs.

Key Responsibilities

Application Reliability & Performance

Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
Troubleshoot and resolve complex issues across production and non-production environments.
Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.

Monitoring, Observability & AIOps

Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
Implement and refine observability strategies that enhance visibility into application and infrastructure health.
Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.

AI & Agentic Workflow Engineering

Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.

Incident Management & Root Cause Analysis

Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
Document and share lessons learned, contributing to a culture of continuous improvement.

Automation & Toil Reduction

Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
Measure and report on toil reduction metrics to quantify the impact of automation initiatives.

Collaboration & Cross-functional Support

Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
Collaborate with DevOps and NOC teams to support the application platform.
Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
Provide feedback on application performance, potential improvements, and observability metrics.

Why This Role Is Different

This is not a traditional SRE position with AI bolted on as an afterthought. We are building a team that treats AI and agentic automation as core competencies—on par with Kubernetes expertise or observability design. You will have the autonomy to experiment with cutting-edge AI tools, the backing of leadership to deploy them in production, and a mandate to measurably reduce operational toil through intelligent systems.

What are we looking for?

Core SRE & Infrastructure (Required)

Degree in Computer Science or a related field, or equivalent professional experience.
5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
Proficiency in PromQL and experience with Loki for log aggregation and analysis.
Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
Cloud platform expertise (AWS preferred; GCP or Azure also valued).
Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
ArgoCD proficiency for GitOps workflows and continuous deployment.
Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
Proven track record with on-call rotations, incident response, and root cause analysis.

AI, Automation & Agentic Systems (Required)

1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

Preferred / Bonus

Hands-on experience with vector databases (Pinecone, Weaviate, pgvector) for RAG-based knowledge retrieval.
Experience with LLM evaluation frameworks (e.g., Galileo, LangSmith, Braintrust) for monitoring agent quality in production.
Contributions to open-source AI/ML or SRE tooling projects.
Background in data engineering or ML pipelines that complements SRE responsibilities.

Soft Skills

Strong communication skills (written and verbal) with the ability to translate complex AI and infrastructure concepts for diverse audiences.
Proactive problem-solver with a bias toward automation and continuous improvement.
Ability to mentor junior team members on both traditional SRE practices and emerging AI-driven approaches.
Positive attitude and openness to constructive feedback.
