Introduction
This blog assumes that the reader is familiar with the OpenShift sandboxed containers documentation and has installed Kata Containers via the OpenShift sandboxed containers operator on their OpenShift Container Platform (OCP) deployment. From there, we will take you on a short journey through OpenShift sandboxed containers, running sandboxed workloads with Kata Containers in practice. We will show how to create and inspect sandboxed containers and their workloads.
We will then compare standard workloads with sandboxed workloads. From the user perspective, running these workloads is largely the same. From the system perspective, however, there are fundamental differences in how standard and sandboxed workloads run, and we will discuss those as well.
This is a hands-on blog, and you should be familiar with OCP and Linux commands to get the most out of it.
Prerequisites
- Familiarity with the OpenShift sandboxed containers documentation
- A deployed OpenShift 4.8+ cluster with Kata Containers installed via the OpenShift sandboxed containers operator
- The OpenShift “oc” command-line client installed and authenticated against the cluster
- Familiarity with accessing and authenticating to the web console of the OCP cluster
Creating Kata Containers Workloads
Let's start with the fundamentals and show you how to create workloads from the OpenShift web interface and the CLI. A few points to mention:
- We will be using single container pods, so when we refer to a pod, we also refer to the single workload container running within it.
- The “default” namespace is used in these guidelines. If you deploy a pod in a different namespace, make sure you adjust the oc commands and YAML files accordingly.
- The application deployed in this example is a simple HTTP server. Accessing it is not required for this demonstration, but if you want to allow external traffic to reach the application, follow the OCP instructions for ingress traffic flow (a minimal sketch follows this list).
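For reference only, here is a minimal, illustrative sketch of a Service and Route that could expose such an HTTP server. It assumes you add a label such as app: example-fedora to the pod metadata (the pod definitions in this blog do not include labels), and all object names here are our own placeholders:
apiVersion: v1
kind: Service
metadata:
  name: example-fedora-svc
spec:
  selector:
    app: example-fedora   # assumes the pod carries this label
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example-fedora-route
spec:
  to:
    kind: Service
    name: example-fedora-svc
  port:
    targetPort: 8080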
Creating a Kata Container Workload From the OCP UI
- Log in to the web console using your credentials (username and password).
- Go into Workloads > Pods and then click “Create Pod.”
- You are now presented with an example YAML pod definition.
- To make this workload run as a sandboxed (Kata) container within that pod, you should use the “kata” runtime class. A runtime class tells OCP which runtime to use. Add runtimeClassName: kata under “spec:” (a quick way to verify that this runtime class exists on your cluster follows these steps).
- Alternatively, you can paste the following YAML file to create a sandboxed container named “example-fedora” pulling the fedora image from the registry.fedoraproject.org container registry:
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora
spec:
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora
    ports:
    - containerPort: 8080
    command: ["python3"]
    args: ["-m", "http.server", "8080"]
  runtimeClassName: kata
- Note that, as usual for YAML files, the indentation here matters, so make sure runtimeClassName is at the same indentation level as “containers:”. Pay attention: the editor will not warn you if it is misplaced.
- When the YAML file is ready, hit the “Create” button and a new window will open for this pod.
- You should look at the status as it changes from “CreatingContainer” to “Running” (possibly with a few additional states on the way).
- If you hit a problem, you will see an error state in the status (red marking) such as an “ErrImagePull.” When there is an error, you should investigate the cause of the error, address it, and then delete and re-create the pod.
- At any point, you can delete the pod by simply hitting the delete button.
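Whichever interface you use, you can confirm up front that the “kata” runtime class exists on the cluster. This is standard Kubernetes tooling, and the output below is illustrative (columns and values may vary with your version):
$ oc get runtimeclass kata
NAME   HANDLER   AGE
kata   kata      2d
If this command reports that the resource is not found, the sandboxed containers operator installation has not completed.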
Creating a Kata Container Workload From the CLI
- Open a terminal and log in to your cluster with your username and password. For example:
$ oc login -u kubeadmin -p ktqrs-mLIDh-oQAoL-CmCjd
- Create a file with a pod definition.
- For example, create an example-fedora.yaml file with the following:
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora
spec:
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora
    ports:
    - containerPort: 8080
    command: ["python3"]
    args: ["-m", "http.server", "8080"]
  runtimeClassName: kata
- This YAML file is identical to the one we provided above for the web console.
- Note again that the essential attribute identifying a Kata containers workload is the additional “kata” runtime class.
- Now, using this YAML you can create a pod with the following command:
$ oc create -f example-fedora.yaml
- Check the status of the pod and make sure it is in a “Running” state:
$ oc get pod example-fedora
You should get a status similar to the following:
NAME             READY   STATUS    RESTARTS   AGE
example-fedora   1/1     Running   0          4m13s
- Check the full pod details with the following command:
$ oc describe pod example-fedora
We expect to see an event sequence and additional pod information similar to the following:
...
Events:
  Type    Reason          Age   From               Message
  ----    ------          ---   ----               -------
  Normal  Scheduled       83s   default-scheduler  Successfully assigned default/example-fedora to kluster0-zqnvb-worker-0-qcpfg
  Normal  AddedInterface  79s   multus             Add eth0 [10.128.2.78/23]
  Normal  Pulling         72s   kubelet            Pulling image "registry.fedoraproject.org/fedora"
  Normal  Pulled          41s   kubelet            Successfully pulled image "registry.fedoraproject.org/fedora" in 20.575421044s
  Normal  Created         39s   kubelet            Created container example-fedora
  Normal  Started         39s   kubelet            Started container example-fedora
- Check that the runtimeClass is “kata” with the following command (you should see kata):
$ oc get pod example-fedora -o yaml | grep runtimeClass
- To delete the pod, you should simply run:
$ oc delete -f example-fedora.yaml
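As an aside, when scripting the creation steps above, you can block until the pod is ready instead of polling with oc get pod. oc wait is generic Kubernetes tooling, not specific to sandboxed containers:
$ oc wait --for=condition=Ready pod/example-fedora --timeout=120s
pod/example-fedora condition met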
Pod Level Metrics and Statistics for Kata Containers Workloads
Pod-level metrics allow you to see the resource consumption of pods at runtime. A complete view of these metrics is available through the OCP web console.
It should be noted that the pod-level metrics account for all resources consumed by the pod infrastructure. In the case of Kata Containers workloads, it also accounts for various ancillary processes that are required to sandbox your workload, notably the whole virtual machine that was created for the pod.
Metrics and statistics in the web interface
- In the OCP web console, resource usage metrics are presented as live, updating charts. To view your pod’s metrics, go to Workloads > Pods (under the pod’s namespace) and select the pod you would like to inspect.
- As you can see there, memory, CPU, filesystem, and networking statistics are presented.
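You can also sample the same metrics from the CLI with oc adm top, which is standard OpenShift tooling (the numbers below are purely illustrative):
$ oc adm top pod example-fedora
NAME             CPU(cores)   MEMORY(bytes)
example-fedora   2m           120Mi
Note that, as discussed above, the memory figure for a sandboxed pod accounts for the VM and its ancillary processes, so expect it to be noticeably higher than for an equivalent standard pod.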
Under the Hood of Kata Containers Workloads: QEMU Virtual Machines
Workloads in OCP run within a pod. In the case of Kata containers, pods are bound to a QEMU virtual machine (VM), which provides an additional layer of isolation. In this section, we will guide you on how to observe the VMs and QEMU processes associated with this pod.
To do this, we will create a standard container workload in addition to the Kata containers workload we already created, which will help identify the differences between the two runtimes. Note that each pod will contain a single container workload.
Observing That a Container Is Sandboxed and Running Within a Virtual Machine
The workload runs within a VM with its own kernel. Let’s see how we can confirm that the kernel used in the Kata containers is different from the one used on the host:
- Using oc exec, you can invoke a command inside a running container. We will use this to compare values between sandboxed containers and standard containers.
Note: By default, if a pod contains a single container, the exec command is invoked on that container. Otherwise, the target container within the pod must be specified explicitly.
- In addition to the running sandboxed workload, we will now create a standard container workload for comparison. An example of creating such a workload is as follows:
apiVersion: v1
kind: Pod
metadata:
  name: example-fedora-vanilla
spec:
  containers:
  - name: example-fedora
    image: registry.fedoraproject.org/fedora
    ports:
    - containerPort: 8080
    command: ["python3"]
    args: ["-m", "http.server", "8080"]
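Assuming you saved this definition as example-fedora-vanilla.yaml (the file name is our choice), create the pod the same way as before and check that both pods are running; the output below is illustrative:
$ oc create -f example-fedora-vanilla.yaml
$ oc get pods
NAME                     READY   STATUS    RESTARTS   AGE
example-fedora           1/1     Running   0          10m
example-fedora-vanilla   1/1     Running   0          30s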
Let’s look at three approaches for seeing the differences between the kernel used in the VM and the kernel used by the host/standard containers:
- Option 1 - Check the uptime. This indicates how long the kernel has been running.
- For the sandboxed workload, run:
$ oc exec example-fedora -- cat /proc/uptime
38.97 37.52
- And, similarly, for the standard container, run:
$ oc exec example-fedora-vanilla -- cat /proc/uptime
3457796.46 39635167.42
You should spot a significant difference. The uptime of the standard container’s kernel is essentially the uptime of the node hosting it, while the uptime of the sandboxed workload is that of its sandbox (VM), which was created at pod creation time.
- Option 2 - Compare the kernels’ command lines from /proc/cmdline:
- For the sandboxed workload, run:
$ oc exec example-fedora -- cat /proc/cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 iommu=off quiet panic=1 nr_cpus=12 agent.use_vsock=true scsi_mod.scan=none systemd.unified_cgroup_hierarchy=0
- And, similarly, for the standard container:
$ oc exec example-fedora-vanilla -- cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.11.16-200.fc33.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap intel_iommu=on iommu=pt
As these are two different kernel instances running in different environments (a node machine and a lightweight VM), their command lines provide some clear hints. You should be able to spot indications such as Kata agent parameters on the VM kernel and OSTree-related parameters on the standard container kernel (which is essentially the node’s kernel).
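One quick way to pick out such hints is to split the command line into one parameter per line and filter it; this is plain shell, nothing Kata-specific:
$ oc exec example-fedora -- cat /proc/cmdline | tr ' ' '\n' | grep agent
agent.use_vsock=true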
- Option 3 - Compare the number of CPUs:
As Kata is configured by default to run with one vCPU per VM, you will likely see a difference in the number of CPUs each container sees. Try that by running:
- For the sandboxed workload, run:
$ oc exec example-fedora -- nproc
1
- And, similarly, for the standard container:
$ oc exec example-fedora-vanilla -- nproc
32
- In spite of these differences, the Kata containers used by OpenShift sandboxed containers run with the very same kernel version in the VM as the underlying OS (RHCOS) runs with. The VM image is generated at host startup, making sure it is compatible with the kernel currently used by the host. You can validate that by running:
- For the sandboxed workload, run:
$ oc exec example-fedora -- uname -r
4.18.0-305.el8.x86_64
- And, similarly, for the standard container:
$ oc exec example-fedora-vanilla -- uname -r
4.18.0-305.el8.x86_64
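To close the loop, you can read the kernel version directly from the node hosting the pods and confirm it matches. oc debug node is standard OpenShift tooling; substitute your node name (the next section shows how to find it):
$ oc debug node/<node_name> -- chroot /host uname -r
4.18.0-305.el8.x86_64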
Viewing the QEMU Process Associated with the Sandboxed Workload
OpenShift sandboxed containers workloads are sandboxed by a VM run by QEMU. If we look at the node where the workload runs, we can verify that a corresponding QEMU process is running there.
Follow these steps to observe the associated QEMU process:
- Find the node which your workload pod has been assigned to:
$ oc get pod/example-fedora -o=jsonpath='{.spec.nodeName}'
- Get its CRI `containerID:`
$ oc get pod/example-fedora -o=jsonpath='{.status.containerStatuses[*].containerID}'
Save it for later (ignore the “crio://” prefix).
- Get into the node identified above using a debug pod:
$ oc debug node/<node_name>
Then, as mentioned in the prompt you see, run `chroot /host` to use the host binaries.
- Now that you have a shell on the node hosting the sandboxed workload, you can probe around.
- Look for the running qemu processes:
sh-4.4# ps aux | grep qemu
You should see a qemu process running for each pod running sandboxed workloads on that host.
- If you want to confirm that a QEMU process is indeed running the container we inspected, you can get the CRI `sandboxID` associated with your `containerID` (which we fetched in a previous step) by running:
sh-4.4# crictl inspect <container_id> | jq -r '.info.sandboxID'
This ID should match the name of the guest defined in the QEMU command you grepped in the previous step. It looks as follows:
/usr/libexec/qemu-kiwi -name sandbox-<sandboxID> ...
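Putting the two steps together, you can match the sandbox directly to its process by grepping for the sandbox name (here <sandbox_id> is the value retrieved above):
sh-4.4# ps aux | grep "sandbox-<sandbox_id>"
You should see exactly one qemu-kiwi process for that pod.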
Summary
In this blog, we have given you a 101 course on how to play with sandboxed workloads (backed by Kata Containers) and observe their internals. A key point to convey here is that, from the point of view of an OCP user, sandboxed workloads and standard workloads look and feel the same. The differences all happen in the implementation at the Linux and virtual machine level, which a typical OCP user will not notice.
From the system view, however, we saw differences. In the case of a sandboxed container, the pod is bound to a virtual machine with its own kernel. This means, for example, that resources and devices given to the container can be finely controlled. Additionally, in contrast to standard containers, a container is exposed only to the virtual machine’s resources and not to the host’s.
This means that containers that require host access, often referred to as “privileged containers,” will not function as expected with OpenShift sandboxed containers, and this is by design. Such containers can access host resources to, for example, install software on the host, control devices, or even reboot the node. When you run a workload under OpenShift sandboxed containers, you specifically prevent the container from doing any of these things.
As a consequence of this design, sandboxed workloads require virtualization. Therefore, sandboxed workloads can currently only be deployed on bare metal clusters, while standard workloads will run on any type of cloud offering.
We hope this blog helps to shed some light on running Kata container workloads. Stay tuned for the next blogs in the series.