We are looking for a Staff Infrastructure Engineer to lead the technical direction and execution of balenaCloud’s infrastructure and reliability architecture. As our customer base and device fleets expand globally, we need a dedicated technical lead to drive our transition into multi-region hosting and single-tenant dedicated instances, natively within Amazon Web Services (AWS).
At balena, we don't have traditional managers or hierarchy; we rely on high levels of trust, autonomy, and alignment. You will be joining at the Staff Level (Tactical scope / Domain Leader). Given the company strategy (the Why), you define the Tactics and the What, design the How, and heavily participate in the Do.
This role represents a dual leadership mandate: you will operate across both Infrastructure Engineering (planning for immense scale, multi-region hosting, and deep AWS automation) and Reliability Engineering (designing the observability tooling, defining operational procedures, and scaling the team's ability to debug and improve the system). Our infrastructure is deeply rooted in AWS, and we need an engineer who can drop in and be highly effective within this ecosystem immediately.
As a Staff Level engineer, you are one of the most experienced team members in your domain. You are not a "ticket solver"; you gain significant autonomy but own the responsibility for your architectural decisions.
AWS-Native Architecture: Architect, automate, and optimize deeply integrated AWS environments. You will leverage the right AWS services to build a system that hosts balenaCloud reliably, delivering maximum performance and deep cost/resource optimization on a per-device basis.
Infrastructure & Reliability: Bridge the gap between building for scale and running for stability. You will not only design the infrastructure but also lead the reliability practices for our growing systems, driving continuous improvement, robust feedback loops, and incident resilience.
Architect for Massive B2B Scale: Design infrastructure capable of handling enterprise-level loads: billions of requests per week (more than 30 million per hour) and terabytes of data per day. Your mental model should align with massive B2B platforms rather than B2C media streaming.
Multi-Region & Single-Tenant Hosting: Own the technical tactics and execution to deploy single-tenant, single-region balenaCloud instances (e.g., dedicated instances in the EU, Australia, US, or Japan) to satisfy strict customer data sovereignty needs.
Kubernetes at Scale: Architect and manage multiple balenaCloud stacks simultaneously, overseeing the deployment and orchestration of many independent Kubernetes clusters for various customers.
Decade-Long Reliability: We are responsible for physical devices in the real world that will stay deployed for decades. Short-term, fragile infrastructure solutions are unacceptable, as they risk rendering devices lost in the field. Your designs and implementations must meet our >10-year durability bar.
Team Enablement & Async Collaboration: You will scale your knowledge across an overwhelmed engineering team. You will document, articulate, and demonstrate decision proposals based on objective facts and empirical evidence, minimizing the need for synchronous calls.
Experience: Minimum of 6 years of highly relevant professional work experience in infrastructure and reliability engineering.
Deep AWS Expertise: Proven, hands-on mastery of the AWS ecosystem. You must be able to navigate, architect, and optimize AWS services with immediate effectiveness.
Observability & Reliability: Deep understanding of Site Reliability Engineering principles. You have proven experience building highly usable observability tooling, metrics, and monitoring systems from the ground up to support high availability.
Exceptional Documentation Skills: Strong, hands-on ability to write clear, actionable, and maintainable technical documentation, scaling plans, and onboarding materials for the team.
Distributed Systems: Proven experience in multiple geolocation hosting with distributed data and processing, specifically in multi-tenant SaaS environments.
Core Stack & Automation: Deep expertise with Kubernetes deployments at scale, managing massive Postgres/RDS databases, and proven mastery of Infrastructure as Code and infrastructure automation.
Scale Testing: Extensive experience in load and scale testing, specifically handling magnitudes of 10k–100k simultaneous connections.
Remote & Async Communication: Fluent English. Intrinsic motivation to prioritize open, text-based communication in a public knowledge base. You actively work to reduce synchronous call time to respect scarce overlapping hours across global timezones.
Abstract Thinking: Ability to identify, research, and advocate for solutions to complicated problems with minimal technical guidance, working from a defined company strategy.
Compliance: Experience deploying solutions into special compliance environments (e.g., federal services, FedRAMP, GovCloud).
AWS Certifications: High-level AWS certifications (e.g., AWS Certified Solutions Architect - Professional) are a strong bonus.
To succeed in this role, you should fit the following profile based on our internal leveling guide (Tactical level / Domain Leader):
Given: The Company Strategy and Environment (e.g., "We need to scale our AWS infrastructure to support dedicated regional hosting to satisfy global data sovereignty laws, while improving overall fleet reliability").
You: Define the What and the How (researching AWS networking options, advocating for a specific EKS cluster architecture, writing the scaling plans, observability specs, and IaC), and heavily participate in the Do (hands-on coding and infrastructure provisioning).
Enable: You elevate the entire company. You remove systemic friction, prevent architectural dead-ends by identifying doomed approaches early, and mentor Domain Contributors. You back up your decision-making with solid reasoning. You execute within architectural decisions that hold up over a 10+ year horizon, and raise flags early when tactics or designs threaten that durability.
Competitive salary
Autonomous vacation allowance
12 weeks of paid parental leave for new parents
Equipment of your choice and hardware for side projects
Books of your choice to help you in your work
Annual company gathering in an international location (Balena Summit 2024)
Working with a talented and globally distributed team
Flexible schedules by default
Balena is a highly distributed team that has embraced a remote-first approach since 2013. We are a group of individuals from across the globe working together to achieve our mission: “Enable people to leverage technology to address the real world challenges of our time.”
Balena wants to do good in the world and here is our why. Our focus is on enabling team members to be the best they can be rather than controlling what everyone does from the top down, and this creates challenges that require just as much creative thinking as our product.
We have been remote-first since 2013 and have team members in different corners of the world who work and communicate asynchronously.
We like to think from first principles and are usually resistant to using ready-made solutions unless we deeply understand the rationale.
We organize ourselves based on the best use of our collective abilities to solve our highest priority problems at any given time, rather than by a strict hierarchy. Read more about our Intentional Work Framework.
We practice radical candor and transparency with open, honest, and clear communication.
We’re not afraid to fail as long as we learn from our mistakes.
We’re always looking for common patterns that allow us to reduce complexity.
We embrace short-term pain for long-term gain, building products that will stand the test of time.
Does any of this sound interesting to you? Work with us, and we will offer you the opportunity to add value by finding and solving problems while constantly learning your craft. We will enable and support your growth, while you should also be open and flexible to figure things out and challenge yourself.