Design and evolve Azure landing zones and production-grade AKS clusters aligned with Microsoft frameworks. Implement secure networking, identity management, and GitOps-based CI/CD pipelines to ensure high availability and scalability.
The XTIUM global team is made up of a group of diverse and talented professionals who are all driven by the same goal: excellence and continuous improvement. We are all about embracing challenges, keeping the lines of communication open and working together. We take ownership of our work, focus on learning and growing and hold ourselves accountable to our colleagues and customers. Together, we strive to push boundaries, make an impact and inspire each other to reach our full potential.
Job Description:
Key Responsibilities
Azure Landing Zone & Governance
- Design and evolve Azure landing zones aligned with the Microsoft Cloud Adoption Framework and Azure Well-Architected Framework.
- Own subscription, management group, resource group, naming, tagging, RBAC, and governance standards.
- Establish Azure Policy, Defender for Cloud, Sentinel, Log Analytics, and diagnostic standards.
- Review the current cloud architecture and produce prioritized remediation plans.
- Design highly available, secure, multi-region, and zone-redundant architectures for the AI platform.
- Architect and deploy scalable Azure PaaS services, including managed Kubernetes and databases, to meet enterprise performance, data integration, and security requirements.
- Design and manage enterprise Azure API Management (APIM) instances to securely expose, govern, and scale microservices and PaaS endpoints.
Azure Kubernetes Service (AKS)
- Design and operate production-grade AKS clusters, including node pool strategies, autoscaling, and node image upgrade pipelines.
- Configure AKS networking: Azure CNI, Azure CNI Overlay, or Azure CNI Powered by Cilium; private cluster endpoints; egress controls via Azure Firewall; and ingress via Application Gateway Ingress Controller (AGIC) or NGINX with WAF.
- Implement AKS security hardening: Entra workload identity federation, Azure RBAC for Kubernetes, admission controllers (Azure Policy / Gatekeeper), and Key Vault secrets injection via CSI driver.
- Establish GitOps-based cluster configuration management using Flux CD or Argo CD; define cluster upgrade and patching cadence.
- Set up AKS observability: Container Insights, Prometheus/Grafana, and distributed tracing via Application Insights or OpenTelemetry.
- Define AKS backup and disaster recovery using Velero, and integrate AKS cost governance into overall cloud cost management practices.
Networking & Connectivity
- Design secure Azure networking: hub-spoke topology, private endpoints, private DNS zones, ingress, and egress controls.
- Define peering, ExpressRoute, and VPN patterns for hybrid and multi-region connectivity.
- Govern DNS resolution for AKS private clusters, including CoreDNS customization and Azure Private DNS integration.
- Implement Azure load balancing with Traffic Manager, Application Gateway, and Azure Load Balancer to distribute traffic for performance and high availability.
Identity, Security & Compliance
- Partner with security and engineering teams to strengthen identity, access, secrets management, monitoring, and compliance controls.
- Design Entra ID integration, RBAC, managed identities, service principals, workload identity federation, and privileged access patterns.
- Co-implement high-priority security, reliability, and governance improvements.
Infrastructure as Code & CI/CD
- Develop and maintain Terraform / OpenTofu IaC modules for AKS clusters, node pools, networking, RBAC, and add-on configuration.
- Establish GitOps workflows for Kubernetes manifests and Helm chart promotion across environments.
- Integrate cluster provisioning and application deployment into GitHub Actions or Azure DevOps pipelines.
Reliability, DR & Operations
- Define and validate backup, disaster recovery, data retention, and restore practices for AKS and supporting infrastructure to support business continuity and meet agreed RTO/RPO targets.
- Implement chaos engineering and failure injection practices to validate cluster resilience.
- Build documentation, architecture diagrams, runbooks, and operational standards.
- Mentor DevOps, platform, and engineering teams on Azure and AKS best practices.
Initial Focus (Phase 1)
In the first engagement phase, this person will:
- Review the current Azure environment, AKS footprint, and deployment approach.
- Identify security, reliability, governance, and operational gaps — including AKS cluster configuration, node security, and workload identity posture.
- Design a target-state Azure landing zone with AKS as a first-class platform component.
- Establish IaC standards for AKS cluster lifecycle, node pool management, and add-on configuration.
- Implement or guide deployment of foundational AKS and Azure networking infrastructure.
- Create documentation and knowledge-transfer materials for internal teams.
Required Qualifications
- 8+ years of experience in cloud infrastructure, platform engineering, DevOps, or security architecture.
- 5+ years of hands-on Microsoft Azure experience.
- 3+ years of production AKS experience — cluster networking, security, GitOps, and workload operations.
- Strong experience designing Azure landing zones or enterprise Azure environments.
- Hands-on experience with Terraform, OpenTofu, Bicep, or similar IaC tooling.
- Strong knowledge of Azure networking, private connectivity, DNS, routing, and ingress patterns.
- Experience with Entra workload identity, managed identities, Azure RBAC for Kubernetes, and admission control.
- Experience with Azure Policy, Defender for Cloud (including Defender for Containers), Sentinel, and Log Analytics.
- Experience designing backup, restore, resiliency, and disaster recovery practices.
- Ability to create clear technical documentation, diagrams, standards, and runbooks.
- Strong communication skills across engineering, security, and leadership teams.
Preferred Qualifications
- Azure Solutions Architect Expert certification.
- Azure Security Engineer or Cybersecurity Architect certification.
- Certified Kubernetes Administrator (CKA) or equivalent hands-on AKS experience.
- Terraform Associate or equivalent IaC experience.
- Experience with GitHub Enterprise, GitHub Actions, or Azure DevOps for CI/CD pipelines.
- Experience in regulated, SaaS, multi-tenant, or enterprise production environments.
- Familiarity with CIS Azure Benchmarks, CIS Kubernetes Benchmark, Microsoft Cloud Security Benchmark, and Zero Trust principles.
- Experience mentoring platform or DevOps engineers on Azure and Kubernetes best practices.
Location: Remote