Building containers by hand: The PID namespace

30 aprile 2021Steve Ovens9 minuti (tempo di lettura)

Continuing with the namespaces series, this article covers the PID namespace. If you want a general overview of all the namespaces, check out the first article. Previously, you created a new mnt namespace. Interestingly, as you discovered, even after creating a new mnt namespace, you still had access to the original host's process IDs (PIDs). When you tried to mount the /proc namespace, you received the rather perplexing permission denied error, as seen below:

root@new-mnt$ mount -t proc proc /proc
mount: permission denied (are you root?)

root@new-mnt$ whoami
root

While you could create all sorts of mounts in the new mount namespace, you couldn't interact or change /proc. In this article, I step through the PID namespace and demonstrate how you can use it, along with the mnt namespace, to secure your fledgling container further.

[ Readers also liked: How Linux PID namespaces work with containers ]

What are process IDs?

Before jumping straight into the PID namespace, I think it's a good idea to provide just a little bit of background as to why this namespace is important.

When a process is created on most Unix-like operating systems, it is given a specific numeric identifier called a process ID (PID). This PID helps to identify a process uniquely even if there are two processes that share the same human-readable name. For instance, if there are multiple ssh sessions active on a system and you need to close a specific connection, the PID provides a way for the administrator to ensure the correct session is closed.

All of these processes are tracked in a special file system called procfs. While this file system can technically be mounted anywhere, most tooling (and conventions) expect the procfs to be mounted under /proc. If you do a listing of /proc, you will see a folder for every process currently running on your system. Inside this folder are all sorts of special files used for tracking various aspects of the process. For the purpose of this article, these files are not important. It is sufficient to know that /proc is where most Unix-like systems store information regarding processes on a running system.

The PID namespace

One of the main reasons for the PID namespace is to allow for process isolation. More specifically, as the man page says:

PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID.

This is important because it means that processes can be guaranteed not to have a conflicting PID with any other process. When considering a single system, of course, there is no chance of PIDs conflicting because the system continuously increments the process ID number and never assigns the same number twice. When dealing with containers across multiple machines, this issue becomes more salient. As described in the man page:

PID namespaces allow containers to provide functionality such as suspending/resuming the set of processes in the container and migrating the container to a new host while the processes inside the container maintain the same PIDs.

Aside from isolation, the PID system works almost identically to that outside of the namespace. The process IDs inside the new namespace start at 1, with the first process considered the init process. The init process is handled very differently from all the other PIDs on a host. This has particular implications for the running system that are outside the scope of this series. If you are interested in more information, see the "Signals and the init process" section of the LWN Namespaces article.

What is of note, however, is that whichever process has PID 1 is vital to the namespaces' longevity. If PID 1 is terminated for any reason, the kernel will send a SIGKILL to all remaining processes in the namespace, effectively shutting down that namespace.

Exploring PID namespaces

If you are wondering, like I was, about whether you can nest PID namespaces, the answer is yes. In fact, the kernel makes room for up to 32 nested PID namespaces. This is considered a one-way relationship. That means that the parent can see the PIDs of children, grandchildren, etc. However, it can't see any of the PIDs of its ancestors. Consider the following:

[user@localhost ~] sudo unshare -fp /bin/bash
[root@localhost ~] sleep 90000 &

[root@localhost ~] ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
[truncated ]
.....
root       11627   11620  0 09:16 pts/0    00:00:00 sudo unshare -fp /bin/bash
root       11633   11627  0 09:17 pts/0    00:00:00 unshare -fp /bin/bash
root       11634   11633  0 09:17 pts/0    00:00:00 /bin/bash
root       11639   11634  0 09:17 pts/0    00:00:00 sleep 90000
root       11641   11634  0 09:17 pts/0    00:00:00 ps -ef

[root@localhost ~] sudo unshare -fp /bin/bash
[root@localhost ~] sleep 8000 &

[root@localhost ~] ps -ef
[truncated ]
.....

UID          PID    PPID  C STIME TTY          TIME CMD
root       11650   11634  0 09:17 pts/0    00:00:00 sudo unshare -fp /bin/bash
root       11654   11650  0 09:17 pts/0    00:00:00 unshare -fp /bin/bash
root       11655   11654  0 09:17 pts/0    00:00:00 /bin/bash
root       11661   11655  0 09:17 pts/0    00:00:00 sleep 8000
root       11671   11655  0 09:17 pts/0    00:00:00 ps -ef

You will note that I truncated the output because the PID namespace appears to have full access to all of the PIDs in /proc. Observe what happens if you attempt to stop a process that is in an ancestor:

[root@localhost ~] kill -9 11361
bash: kill: (11361) - No such process

Why is this? Simply put, traditional tools like ps are not namespace-aware and actually read from the /proc directory. If you did an ls /proc, you would still see all of the folders and files from before because, as discussed in the last article, the PID namespace inherits all of the mnt namespace mounts. I address this situation later on. For now, return to the example at hand.

In another shell, identify the sleeping processes:

[user@localhost ~] ps -ef |grep sleep
root       11639   11634  0 09:17 pts/0    00:00:00 sleep 90000
root       11661   11655  0 09:17 pts/0    00:00:00 sleep 8000

If you want to verify that these processes are in different namespaces, you will have to find the bash process PID. Remember, since you ran sudo unshare -fp /bin/bash, the bash process is the init process in the new namespace. Therefore, it is the PID that will be linked to the namespace ID. Let's grab the PIDs:

[root@localhost ~] ps -ef |grep bash
root       11627   11620  0 09:16 pts/0    00:00:00 sudo unshare -fp /bin/bash
root       11633   11627  0 09:17 pts/0    00:00:00 unshare -fp /bin/bash
root       11634   11633  0 09:17 pts/0    00:00:00 /bin/bash
root       11650   11634  0 09:17 pts/0    00:00:00 sudo unshare -fp /bin/bash
root       11654   11650  0 09:17 pts/0    00:00:00 unshare -fp /bin/bash
root       11655   11654  0 09:17 pts/0    00:00:00 /bin/bash

You can see the PIDs 11634 and 11655 in the output. If you compare this to the output of lsns (list namespaces), you will see the following:

[root@localhost ~] lsns |grep bash
        NS TYPE   NPROCS    PID USER             COMMAND
4026532952 pid         4 11634 root             /bin/bash
4026532954 pid         4 11655 root             /bin/bash

As you can see, the namespace IDs are different, and thus the processes are in different namespaces.

Now that you've established the namespaces are indeed different, let's look at the PID ancestry mentioned before. You can do this by identifying the NSpid attribute of a given PID in the /proc directory, as seen below:

sudo cat /proc/11655/status |grep NSpid
NSpid:    11655    6    1

The columns are read from left to right and indicate the PID in their respective namespaces. The left-most PID is the primary or root namespace. In this case, it has a PID of 11655, a secondary PID of 6, and a tertiary PID of 1. Since the namespaces own each descendant PID namespace, you can think of it like this:

On the host, the bash process running the sleep 8000 command has a PID of 11655.
Inside the first "container," the bash process running the sleep 8000 command has a PID of 6.
Inside the nested second "container," the PID is 1. This is the container that actually launched the process.

Each one of these bash commands was created inside its own namespace but is visible to the parent (in this case, the root namespace).

/proc, PID, and unprivileged users

The astute reader would have noticed that, in the last two articles, a regular user could create both user and mnt namespaces. In this article, I have been using the sudo command. This is because you cannot create a PID namespace on its own with an unprivileged user. The answer to this is to combine multiple namespace creations into one event. There are a few different solutions to mounting /proc as an unprivileged user.

If you simply try to create a new user namespace, you'll get a strange result:

[ user@localhost ~] unshare -Urp
-bash: fork: Cannot allocate memory
-bash-5.1# ps -ef
-bash: fork: Cannot allocate memory
-bash-5.1# ls
-bash: fork: Cannot allocate memory

What is happening here? Remember how the first process inside a new PID namespace becomes the init process? In this case, the current shell cannot move namespaces. It exists in the root namespace, and when you created a new PID namespace, the system did not know how to handle it. The solution to this is to have the process fork itself. This allows the current shell to become a child process of the unshare command. Using the -f flag results in the namespace being created:

[ user@localhost ~] unshare -Urfp
[ root@localhost ~]

However, you still see contamination of the /proc mount point. There are two solutions to this. First, you could create a new mnt namespace and then remount /proc yourself:

[ user@localhost  ~]$ unshare -Urpmf
[ root@localhost ~]# mount -t proc proc /proc
[ root@localhost ~]# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 09:31 pts/0    00:00:00 -bash
root          10       1  0 09:31 pts/0    00:00:00 ps -ef

Indeed, for many years this was the only option but, a flag --mount-proc was created some time ago to do this in a single step. The man page reads:

Just before running the program, mount the proc filesystem at mountpoint (default is /proc). This is useful when creating a new PID namespace. It also implies creating a new mount namespace since the /proc mount would otherwise mess up existing programs on the system. The new proc filesystem is explicitly mounted as private (with MS_PRIVATE|MS_REC).

So, therefore, you may see references to the following command:

unshare -Urpf --mount-proc

This creates a new mnt namespace while mounting /proc for you.

Entering a namespace

To reduce complexity, I have exited the namespaces created earlier. I have created a new namespace with the following command:

unshare -Urfp --mount-proc

I have also created a different sleep process just to help identify the namespace. Since I only have a single new namespace, I can use the lsns command to determine the correct PID:

[ user@localhost  ~]$ lsns |grep bash
4026532965 pid         2 13142 user -bash

Then run the nsenter command:

sudo nsenter -t 13142 -a

The -a flag tells the nsenter command to enter all namespaces of that PID. sudo is required with the -a flag, or else you will not be able to change to all the appropriate namespaces. You should now be able to list all the PIDS in this NS:

[ root@localhost  ~]$ ps -ef

UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 09:54 pts/0    00:00:00 -bash
root           8       1  0 09:54 pts/0    00:00:00 sleep 99999
root          25       0  0 10:15 pts/1    00:00:00 -bash
root          31      25  0 10:15 pts/1    00:00:00 ps -ef

[ Learn the basics of using Kubernetes in this free cheat sheet. ]

Wrapping up

The PID namespace is an important one when it comes to building isolated environments. It allows processes to have their own PIDs regardless of the host system. In a world where multiple hosts may be involved in orchestrating isolated environments (containers), it becomes crucial to have a facility that guarantees unique PIDs when freezing and migrating processes. On top of that, for security reasons, if you are running namespaces for application isolation, the PID namespace is vital for preventing information leaks by way of which processes a host may be running.

When combined with the user and mnt namespaces, the PID namespace provides a great deal of protection without requiring root privileges. Modern browsers such as Firefox and Vivaldi make use of namespaces to provide browser sandboxing. In the next article, I'll demonstrate the net namespace and see how you can continue to construct your container by hand by adding in discrete network components.

Sull'autore

Steve Ovens

Steve is a dedicated IT professional and Linux advocate. Prior to joining Red Hat, he spent several years in financial, automotive, and movie industries. Steve currently works for Red Hat as an OpenShift consultant and has certifications ranging from the RHCA (in DevOps), to Ansible, to Containerized Applications and more. He spends a lot of time discussing technology and writing tutorials on various technical subjects with friends, family, and anyone who is interested in listening.

Read full bio

Ricerca per canale

Esplora tutti i canali

Piattaforme

Prova e acquista

In primo piano

Per settore

In primo piano

Argomenti

Articoli

Scopri di più

Per i clienti

Per i partner

Chi siamo

Open source

Per saperne di più

Suggerimenti

Seleziona la tua lingua

Seleziona la tua lingua

Building containers by hand: The PID namespace

What are process IDs?

The PID namespace

Exploring PID namespaces

/proc, PID, and unprivileged users

Entering a namespace

Wrapping up

Sull'autore

Steve Ovens

Altri risultati simili a questo

Ricerca per canale

Prodotti

Strumenti

Prova, acquista, vendi

Comunica

Informazioni su Red Hat

Seleziona la tua lingua

Red Hat legal and privacy links

Red Hat legal and privacy links