Topology Aware Scheduling in Kubernetes Part 1: The High Level Business Case

February 18, 2021 | The OpenShift Team | 3-minute read

This post was written by: Swati Sehgal, Alexey Perevalov, Killian Muldoon & Francesco Romani

How do you get the most out of your bare-metal hardware? Believe it or not, the physical layout in a computer of the resources a workload uses, from memory and CPU to storage and I/O, can have a dramatic impact on performance. Until recently Kubernetes users had no direct way to influence this key interaction between hardware and software, commonly called Resource Topology.

This blog post series describes Topology Aware Scheduling, a feature being rolled out in Kubernetes in 2021. Topology Aware Scheduling enables the Kubernetes control plane to respect Resource Topology constraints when placing Pods on Nodes. This approach complements Topology Manager, the node-level Resource Topology enforcer in the kubelet, which was first introduced as an alpha feature in Kubernetes 1.16 (more on that later).

Why does resource topology matter?

Non-Uniform Memory Access (NUMA) is a compute platform architecture in which a CPU reaches different regions of memory at different speeds, depending on how close that memory is to it. The relative locations of CPUs, memory, and PCI devices are what we’re talking about when we say Resource Topology.

This architecture has a major advantage: any CPU core can potentially access all memory on a system. The pitfall is performance. For example, in the diagram below, memory close to CPU core 1 is quicker for CPU core 1 to access than memory close to CPU core 7.

FIGURE 1: A Non-uniform Memory Access (NUMA) system
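
On a Linux host, this layout can be inspected directly. The following is a minimal sketch, assuming the standard sysfs directory /sys/devices/system/node, where each nodeN/cpulist file lists the CPUs local to NUMA node N. The helper names parse_cpulist and read_numa_layout are ours, chosen for illustration, and the parser handles only plain ids and ranges such as "0-3,8-11".

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as "0-3,8-11" into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_layout(root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node id to its local CPU ids.

    Returns an empty dict when the directory does not exist
    (for example, on a non-Linux machine).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    layout: dict[int, list[int]] = {}
    for node_dir in sorted(base.glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            node_id = int(node_dir.name[len("node"):])
            layout[node_id] = parse_cpulist(cpulist.read_text())
    return layout


if __name__ == "__main__":
    for node_id, cpus in read_numa_layout().items():
        print(f"NUMA node {node_id}: CPUs {cpus}")
```

On a single-socket laptop this typically reports one NUMA node. On a multi-socket bare-metal server it reports one entry per NUMA node, which is the information the rest of this post is concerned with.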

It’s straightforward so far, and the underlying operating system will manage most of this, even in a Kubernetes cluster. When you’re trying to squeeze low-latency performance from bare metal, though, you need to dedicate isolated resources to specific applications. As we add new kinds of resources, things get increasingly complicated.

For I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, like those running the 5G network, can’t operate to spec under these conditions.

Take a pod requesting 2 CPUs and a PCI device as an example. FIGURE 2 shows a scenario where these resources are not NUMA aligned, whereas FIGURE 3 shows a scenario where they are:

FIGURE 2: A NUMA System with no Resource Alignment

FIGURE 3: A NUMA System with Resource Alignment
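
The difference between the two figures can be expressed as a simple check. This is a sketch under stated assumptions, not Kubernetes code: the Allocation class, the 8-core layout (cores 0-3 on NUMA node 0, cores 4-7 on NUMA node 1), and the specific core ids are hypothetical and chosen only to mirror the 2 CPUs plus one PCI device example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """Resources handed to one pod (hypothetical model for illustration)."""
    cpus: tuple[int, ...]               # CPU core ids assigned to the pod
    device_numa_nodes: tuple[int, ...]  # NUMA node of each assigned PCI device


def numa_nodes_used(alloc: Allocation, cpu_to_numa: dict[int, int]) -> set[int]:
    """Return every NUMA node that the allocation touches."""
    nodes = {cpu_to_numa[cpu] for cpu in alloc.cpus}
    nodes.update(alloc.device_numa_nodes)
    return nodes


def is_numa_aligned(alloc: Allocation, cpu_to_numa: dict[int, int]) -> bool:
    """An allocation is aligned when all its resources share one NUMA node."""
    return len(numa_nodes_used(alloc, cpu_to_numa)) <= 1


# Assumed layout: cores 0-3 on NUMA node 0, cores 4-7 on NUMA node 1.
CPU_TO_NUMA = {cpu: 0 if cpu < 4 else 1 for cpu in range(8)}

# 2 CPUs split across both NUMA nodes, PCI device on node 1: not aligned.
unaligned = Allocation(cpus=(1, 5), device_numa_nodes=(1,))
# 2 CPUs and the PCI device all on node 1: aligned.
aligned = Allocation(cpus=(4, 5), device_numa_nodes=(1,))

if __name__ == "__main__":
    print(is_numa_aligned(unaligned, CPU_TO_NUMA))  # False
    print(is_numa_aligned(aligned, CPU_TO_NUMA))    # True
```

In the unaligned case, one of the pod's CPUs has to cross to the other NUMA node on every access to the device, which is the latency penalty described above.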

Without handling Resource Topology, Kubernetes as it exists in 1.20 can’t meet the needs of these sorts of applications. End users can find (and have found!) ways around this by adding constraints to their clusters. One option is to replace bare-metal deployments with VMs, while another is to limit the pod configs available to developers.

Does the Kubernetes default scheduler consider Resource Topology when assigning pods to nodes?

Kubernetes Topology Manager allows workloads to run in an environment optimized for low latency. Performance-critical workloads in industries like telecommunications, High Performance Computing (HPC), and Internet of Things (IoT) need topology information so that they can use co-located CPU cores and devices, but the current native scheduler does not select a node based on its topology. The scheduler has no knowledge of Resource Topology, which can lead to unpredictable application performance. In general, this means underperformance. In the worst case, the pod’s resource requests are a complete mismatch for the kubelet policies on the chosen node: the scheduler places a pod that is destined to fail, and the pod can enter a failure loop.

Exposing cluster-level topology to the scheduler empowers it to make intelligent, NUMA-aware placement decisions that optimize the cluster-wide performance of workloads.
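
To make the gap concrete, the sketch below contrasts a topology-unaware check with a NUMA-aware one. It is an illustration under assumptions, not the scheduler's implementation: the data model (a list of free resources per NUMA zone for each node), the resource name example.com/nic, and the function names are all hypothetical. The NUMA-aware check models a policy in which the whole request must fit inside a single NUMA zone.

```python
Resources = dict[str, int]


def fits_in_total(zones: list[Resources], request: Resources) -> bool:
    """Topology-unaware check: compare the request to node-wide totals."""
    return all(
        sum(zone.get(name, 0) for zone in zones) >= quantity
        for name, quantity in request.items()
    )


def fits_in_one_zone(zones: list[Resources], request: Resources) -> bool:
    """NUMA-aware check: some single zone must satisfy the whole request."""
    return any(
        all(zone.get(name, 0) >= quantity for name, quantity in request.items())
        for zone in zones
    )


def feasible_nodes(cluster: dict[str, list[Resources]], request: Resources) -> list[str]:
    """Keep only the nodes where the request can be NUMA aligned."""
    return [node for node, zones in cluster.items() if fits_in_one_zone(zones, request)]


# Hypothetical cluster: free resources per NUMA zone on each node.
CLUSTER = {
    "node-a": [{"cpu": 1, "example.com/nic": 1}, {"cpu": 3, "example.com/nic": 0}],
    "node-b": [{"cpu": 4, "example.com/nic": 1}, {"cpu": 0, "example.com/nic": 0}],
}
REQUEST = {"cpu": 2, "example.com/nic": 1}

if __name__ == "__main__":
    # node-a has 4 free CPUs and 1 free NIC in total, so the unaware check passes...
    print(fits_in_total(CLUSTER["node-a"], REQUEST))  # True
    # ...but no single zone on node-a holds 2 CPUs and the NIC together.
    print(feasible_nodes(CLUSTER, REQUEST))           # ['node-b']
```

A scheduler that only sees node-wide totals may pick node-a, where a kubelet enforcing single-zone alignment would reject the pod. With per-zone information, it can choose node-b from the start.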

What is the business case for enabling Topology Aware Scheduling in Kubernetes?

A company can build a business by running a public cloud or by selling a cloud solution to third parties (for example, to telecom operators for NFV use cases). In the public cloud case, the provider can offer, through its end-user agreement or public offer, only tariffs with a fixed set of resources. The problem of resource alignment is then solved at the IaaS level: the resource counts (NICs, GPUs) in each tariff are chosen to match the counts available per NUMA node.

The other case is a company that sells cloud solutions to clients who demand more flexibility. For these clients, flexibility means the ability to work on bare metal and to request any number and kind of resources. A solution that makes kube-scheduler topology aware is therefore of most interest to companies that sell cloud solutions to third parties.

In the next part of this blog post series, we talk about Topology Manager and explain the design of Topology Aware Scheduling in more detail.