Computing power is an extremely valuable resource that can determine a technology’s ability to scale. We believe self-driving technology will someday soon be ubiquitous, and to deploy Aurora’s products broadly, we’ve invested heavily in the design of a flexible and efficient computing infrastructure that scales with our business.
In Part 2 of our series on Aurora’s Supercomputer, Batch API, we’ll focus on how we supply and orchestrate the massive computing resources we require. We will look at cluster management tools and discuss why Kubernetes is the best choice for Aurora, why we decided to build our own cluster autoscaler for batch workloads, and how we can seamlessly deploy resources across multiple clusters and multiple regions.
These features are what make Batch API the perfect fit for our needs at Aurora, allowing our teams to move quickly and take full advantage of the benefits of batch-processing in the AWS cloud ecosystem.
Disclosure: Amazon has invested in Aurora.
As we mentioned in Part 1, we use Kubernetes to manage the nitty-gritty details of the actual computing infrastructure that Batch API leverages to execute workloads. Why did we decide that Kubernetes was the best choice for Aurora and how does it help us scale our system to meet our users’ (the engineers, scientists, and developers building and supporting the Aurora Driver autonomy system) never-ending needs?
It really came down to three main selling points:
1. A vast open-source ecosystem
Kubernetes is open source, but more than that it has a very large and active ecosystem built around it. This makes it a great starting point for many different components that add value to our compute infrastructure, such as:
Many of the tools and components that are available through this active open source ecosystem can be used out of the box with minimal development effort.
Compute Infrastructure is a small team within Aurora, and trying to manage the entire Kubernetes stack along with the Batch API system means that, when we run into problems with lower-level components, we need fast solutions that don’t add to our overall support overhead. For example, we recently had a problem scaling our clusters due to timeout errors from CoreDNS, the centralized DNS component of Kubernetes, due to the limits of kube-proxy. In order to keep scaling, we needed a more distributed solution. Fortunately, we were able to simply deploy node-local-dns to cache DNS entries locally on each node. If we’d had to develop such a solution on our own, it would have caused a lot of delays and resulted in yet another custom component to maintain.
2. Flexible and extensible APIs
Kubernetes APIs were designed with flexibility and extensibility in mind, which has given us the ability to customize our batch processes in ways we wouldn’t otherwise be able to. For example, we chose to make each task a simple unix process that we execute on the user’s behalf. This is done in a worker, which is represented by a Kubernetes pod, and the Kubernetes PodSpec provides many options for running code as a unix process.
We can also leverage features such as the ability to add arbitrary labels and annotations to pods to store all sorts of data that we can expose to our users, such as grouping together pods that belong to the same job or allowing users to query for all workers with a user-defined label. We also add labels that our cloud teams can use to collect cost-accounting data, allowing users to see exactly how much it costs to run that 1 million-task job!
Such features give us tools to enforce standards and provide common information to all pods. Kubernetes’ mutating webhook interface means we can modify worker pods at runtime to add useful environment variables such as the job name, the cluster it belongs to, if it is for a production or staging environment, etc. And we can also reject any pods that don’t have the required labels to make sure all pods are accounted for and can be traced back to an owning team.
We have even been able to provide different resource service tiers to users by leveraging Kubernetes node labels and taints. We can provide users with cheaper resources at the cost of predictability and robustness by allowing them to request AWS Spot instances for their workers or run specific GPU workloads to ensure they get exactly what they need.
3. Managed control plane support in AWS
Aurora uses AWS for our computing environments, and AWS provides managed control plane support for Kubernetes with their Elastic Kubernetes Service (EKS).
In the past, we had run Kubernetes clusters in EC2 with self-hosted control planes. But when EKS became available, we made the decision to migrate our Kubernetes clusters and let someone else handle the day-to-day management of the control plane. Now we no longer have to worry about managing or scaling the API servers or etcd service.
EKS has a lot of great built-in integrations that give us a better starting point with minimal effort. For example, audit logs are sent to CloudWatch automatically so we can review past data and trace problems to their source, without having to manage the log storage ourselves. And, because we are already leveraging so much of AWS, we can take advantage of features like IAM roles for service accounts (IRSA) and drop less optimal and difficult to maintain systems.
Kubernetes has been crucial in allowing us to operate computing at this scale reliably and effectively. But it’s just the starting point for managing all of these clusters and resources. In order to scale our business, we need to be able to (literally) scale the clusters.
Like many users of Kubernetes, we run our clusters in the cloud. This allows us to manage our costs by only buying what we need, and leverage the “scale-to-zero” paradigm where we can scale the cluster down to essentially zero nodes when we aren’t using it. Of course, this means we need a cluster “autoscaler”—a component that looks at the current demand for resources (pending and running pods) and resource availability (provisioned nodes) and then determines if it needs to provision more nodes or terminate ones we aren’t using.
Batch clusters that have highly variable workloads and scale-to-zero capabilities like ours tend to see excessive “node churn” where nodes are provisioned and destroyed at a rapid pace as demand rises and falls. When this rate of change got too large, we found that the open source Kubernetes Cluster Autoscaler (which we’ll simply call the OSS Autoscaler for brevity) we were using was unable to keep up and would either exceed its memory limits (requiring manual intervention and restart) or stop responding altogether.
We also run many heterogeneous workloads in the same clusters and need a wide variety of node types to maintain efficiency. The demand profile for batch clusters is not fixed, nor is it stable or easy to predict. Some tasks may need a single CPU and a few GB of RAM, while others may need 12 CPUs, hundreds of GB of RAM, and a GPU. Shared public cloud capacities are unpredictable, so we configured the OSS Autoscaler to request any combination of dozens of different instance types. But because we run both spot and on-demand varieties, this quickly caused the number of different combinations of instance types the autoscaler needed to keep track of to become very large. As shown below in Figure 2, we sometimes have dozens of different instance types running at the same time.
For example, to make an EC2 request for a fairly common type, the r5.8xlarge, we’d have the autoscaler ask EC2 for some number of r5.8xlarge instances, and if it didn’t work, ask for something else. But EC2 has a lot of dimensions when requesting an instance type beyond simply what type and what region, such as availability zone, spot vs. on-demand, etc.
So for the r5.8xlarge instance in a single region, we’d use an equation like this:
(Instance variations¹) x (Availability Zones) x (Spot v.s OnDemand)
and end up with 40 different configurations (4 x 5 x 2 = 40). Multiply that by the different instance types available, and you quickly end up with hundreds of options!
We found that the OSS Autoscaler was insufficient for our needs given the complexity of our configuration. Adding additional types to support new workloads led to long queue times, unhappy users, and stressed on-call engineers trying to help everyone.
We needed to come up with an autoscaler solution that fit our specific needs and aligned with Aurora’s business. Thus, we developed the Batch Autoscaler with a few goals in mind:
1. Cost efficiency - When given options, select the most cost-effective strategy.
2. Fine-grained limits - Allow limits to be set globally, as well as per launch configuration.
3. Faster scale up - When demand rises quickly, react quickly.
Like the OSS Autoscaler, the Batch Autoscaler keeps track of EC2 instance prices (both spot and on-demand) using AWS APIs. The Batch Autoscaler then uses this information to create a prioritized list of instance types that match the current demand so that, when multiple launch configs match the given demand profile, we request the most cost-effective options first.
The Batch Autoscaler also allows for customized prioritization strategies depending on our goals. For example, during the run-up to Aurora Illuminated, we made modifications to the prioritization strategy to weight the autoscaler toward larger instance types and types that we had reserved via AWS on-demand capacity reservations (ODCRs). During this time, we were seeing large increases in workloads and needed to keep up with the demand without causing user delays or over-stressing our Kubernetes clusters with many small nodes. So we temporarily sacrificed some cost efficiency in the name of overall throughput and increased compute density.
Where the OSS Autoscaler had limits per launch config, the Batch Autoscaler has a powerful set of limits that can be leveraged to not only control the maximum number of nodes in the system, but also allow for much more flexibility in how clusters react. Limits can be set at a global level in addition to any specific launch-configuration level, and can be set either as simple node count limits or as limits on EC2 spend rate:
Max cluster size (nodes) - Keeps a cluster from growing too large.
Max cluster spend ($/hr) - Keeps overall cluster spend rate predictable and capped.
Max node addition/removal rates - Reduces the strain on Kubernetes by limiting the number of nodes added (or removed) at once.
While the OSS Autoscaler is meant to handle different cloud environments, we optimized the Batch Autoscaler for the primary cloud environment we run in: AWS. We also optimized the Batch Autoscaler algorithm for the kinds of spiky demands we tend to see in our batch-processing system. Because of this, the Batch Autoscaler can react to changes extremely quickly and scale nodes up much faster than the OSS Autoscaler could.
When running in AWS environments, the OSS Autoscaler relies on AWS Auto-Scaling Groups (ASGs) to create or destroy nodes in the cluster based on demand. ASGs provide a simple way of creating EC2 instances and they are well-suited for clusters with more stable resource demands (such as clusters with long-running services). But the unpredictable nature of batch-processing workloads means we need to react in different ways on an almost minute-by-minute basis. Since ASGs do not offer quick feedback if a certain instance type is unavailable, the OSS Autoscaler struggled to keep up with the cluster demand, often trying the same ASG over and over again.
We designed the Batch Autoscaler to instead interact with the EC2 APIs directly by sending requests for instances using EC2 Launch Templates to define the configuration of each EC2 type. The on-demand EC2 API will respond with how many instances it was able to provision, and if the number we get back is less than what we asked for, we can temporarily mark that config as “capacity-limited” and skip it next time, speeding up the request time immensely. This allows for much more fine-grained control over what instances we want and much faster feedback of capacity constraints, giving the Batch Autoscaler the information it needs to react to ever-changing cloud markets.
Spot instances, however, are trickier to manage because the spot market is highly volatile and the EC2 APIs do not easily tell you if there is availability for any given launch config. Therefore, the Batch Autoscaler uses Spot Fleet Requests to create spot instances and cycle through instance types as necessary.
Autoscaling is an extremely important tool for managing the cost of resources in cloud environments, especially when attempting to balance the needs of users against the cost of running large computing systems. It allows us to pay for only what we need and temporarily burst to larger sizes than would otherwise be feasible if attempting to run a large static cluster. But autoscaling really only focuses on scaling resources for a single cluster, and anytime the word “single” appears when talking about running systems at scale, you have a weak point in your design. In order to scale further while maintaining high levels of reliability, we need to be able to manage resources for multiple teams across multiple clusters.
Batch API runs heterogeneous workloads in a multi-tenant environment, which means that different teams must coexist within the same Kubernetes clusters and must share the same, often volatile, pool of resources. As we scaled up, we kept running into the following concerns:
1. How do we keep one user from consuming too many resources and starving out others 2. How can we maintain functionality in the event of a bad configuration change, a botched upgrade, or a cluster control plane failure? 3. How can we continue to scale when we hit the limits (e.g., too many pods, too many nodes) of a single Kubernetes cluster?
Flexible resource management
The first idea that might come to mind when we talk about managing resources for different teams in a Kubernetes cluster is to just use the built-in resource quota feature. Kubernetes resource quotas work great if you are running in a single cluster and have fairly homogenous workloads such that you don’t need any additional flexibility in defining quotas. In a cluster that only runs CPU-based jobs or hosts long-lived services, this is a pretty simple approach and will work for most use cases.
But we run heterogeneous workloads on multiple Kubernetes clusters, many of which require exotic resources such as GPUs. Teams need access to a variety of GPU types and need to have control over the resource types their jobs use (e.g., a P4d EC2 instance is vastly different from a G4dn EC2 instance). This kind of flexibility isn’t possible with simple Kubernetes quotas that treat all resources as homogenous.
So for our purposes, resource quotas exist at a team level. If a team has a quota of 100 CPUs, they can use all 100 CPUs in a single cluster or allow them to be split between clusters.
With EKS (or any managed control plane system) you gain reduced support burdens and better overall stability at the expense of full insight and control over the backend systems. For example, as the total number of Kubernetes objects grows, the EKS control planes in our batch clusters become increasingly unstable. Too many pods and too many nodes in a single cluster cause timeouts and failures, slowing down our users’ work.
To scale our computing systems, we must find these danger zones and devise ways to either work around them or make them less dangerous. But how do we do that in a seamless way that is invisible to our users?
In this case, a single cluster was a single point of failure, so we needed to add redundancy. To prevent our systems from hitting this critical limit, we added additional Kubernetes clusters to spread the load among multiple control plane deployments.
And because resources—even in cloud environments—are not infinite, we also needed something that could:
Act like a quartermaster and keep track of supply and demand for each team to prevent any one person or team from impacting others.
Act like a traffic controller to direct workloads to different Kubernetes clusters based on some sort of load balancing strategy.
Allow for fast mitigation of outages due to a cluster failure.
This led us to develop a new, supplemental, customizable service for managing the placement of workloads between teams and across clusters as our needs change. In keeping with our tradition of wildly original naming conventions, we call this placement management engine...Placement API.
Placement API sits behind Batch API and is not a user-facing service. In a nutshell, whenever Batch API wants to create a worker, it sends a request to Placement API describing how many resources it needs. Each team has one fixed pool of resources that can be deployed in whatever Kubernetes clusters are available. Teams can define what CPUs, RAM, and GPUs they want, and Batch API will communicate those resource requirements to Placement API. Placement API will then check against its internal model of the entire computing environment and either accept the request (if there are sufficient resources) or reject it. Using an internal load-balancing mechanism, Placement API will decide where that worker should be placed, and then Batch API will create a pod in the assigned cluster.
Take a look at Figure 5, below. The top graph shows Placement API “reservations” for CPUs and how much each of our two production clusters are currently using. The bottom graph shows CPU reservations by team across clusters. Placement API is balancing the load between the active clusters and also managing each team’s resources so that they are distributed evenly across all clusters.
We can disable any cluster on-the-fly to stop traffic and Placement API will automatically start balancing traffic among the remaining clusters. This allows us to not only mitigate outages more quickly, but also reduce inherent maintenance operations risks by directing traffic elsewhere when we deploy potentially disruptive changes. Test environments are smaller and have less overall load, so issues often go unnoticed. But with Placement API, we can deploy changes in production in an isolated way and quickly react if we find problems.
Managing large batch computing systems with inherently unpredictable workloads and use cases is an extremely hard problem, especially when coupled with the complexities of running in cloud environments. Aurora’s Compute Infrastructure team has developed extremely powerful tools for the Batch API platform in order to take full advantage of cloud-native systems like Kubernetes and AWS EC2 resources. These custom tools, such as Placement API and the Batch Autoscaler, ensure that our supercomputer meets our specific needs as a company with massive, frequently unpredictable computing demands.
In Part 3, we’ll discuss how we approach system analytics and debugging at scale, and also explore what the future holds as Batch API approaches a billion tasks per day.