Building a container by hand using namespaces: The mount namespace

Check out some theory and practice around the mount namespace

Posted: March 23, 2021 by Steve Ovens (Sudoer, Red Hat)

Image by Deedee86 from Pixabay

This article looks at the mount namespace and is the third in the Linux Namespace series. In the first article, I introduced the seven most commonly used namespaces, laying the groundwork for the hands-on work started in the user namespaces article. My goal is to build out some fundamental knowledge of how the underpinnings of Linux containers work. If you're interested in how Linux controls the resources on a system, check out the CGroup series I wrote earlier. Hopefully, by the time you're done with the namespaces hands-on work, I can tie CGroups and namespaces together in a meaningful way, completing the picture for you.

For now, however, this article examines the mount namespace and how it can help you get closer to understanding the isolation that Linux containers bring to sysadmins and, by extension, to platforms like OpenShift and Kubernetes.

The mount namespace

The mount namespace doesn't behave as you might expect after creating a new user namespace. By default, if you were to create a new mount namespace with unshare -m, your view of the system would remain largely unchanged and unconfined. That's because whenever you create a new mount namespace, a copy of the mount points from the parent namespace is created in the new mount namespace. That means that any action taken on files inside a poorly configured mount namespace will impact the host.
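One way to see the "copy" behavior is to compare the namespace identifier and the number of visible mounts inside and outside a new namespace. The following is a minimal sketch, assuming util-linux's unshare is installed and unprivileged user namespaces are enabled; the helper name count_mounts is made up for illustration.

```shell
#!/bin/sh
# Count the mounts visible to the current process.
count_mounts() {
    wc -l < /proc/self/mountinfo
}

# The symlink target identifies the mount namespace, in the form "mnt:[<inode>]".
host_ns=$(readlink /proc/self/ns/mnt)
echo "here:  ${host_ns} with $(count_mounts) mounts"

# Repeat the same checks in a fresh user + mount namespace, if that is allowed.
if unshare -Urm true 2>/dev/null; then
    unshare -Urm sh -c \
        'echo "child: $(readlink /proc/self/ns/mnt) with $(wc -l < /proc/self/mountinfo) mounts"'
else
    echo "child: skipped (unshare -Urm is not permitted on this system)"
fi
```

If the second line is printed, the namespace identifier should differ from the first while the list of mounts is carried over from the parent.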

Some setup steps for mount namespaces

So what use is the mount namespace then? To help demonstrate this, I use an Alpine Linux tarball.

In summary, download it and untar it into a new directory, then give ownership of that directory tree to an unprivileged user:

[root@localhost ~]# export CONTAINER_ROOT_FOLDER=/container_practice
[root@localhost ~]# mkdir -p ${CONTAINER_ROOT_FOLDER}/fakeroot
[root@localhost ~]# cd ${CONTAINER_ROOT_FOLDER}
[root@localhost ~]# wget https://dl-cdn.alpinelinux.org/alpine/v3.13/releases/x86_64/alpine-minirootfs-3.13.1-x86_64.tar.gz
[root@localhost ~]# tar xvf alpine-minirootfs-3.13.1-x86_64.tar.gz -C fakeroot
[root@localhost ~]# chown -R container-user: ${CONTAINER_ROOT_FOLDER}/fakeroot

The fakeroot directory needs to be owned by the user container-user because once you create a new user namespace, the root user in the new namespace is mapped to container-user outside of the namespace. A process inside the new namespace believes it has the capabilities required to modify its files, but the host's file system permissions still apply. Without the chown, they would prevent the container-user account from changing the Alpine files from the tarball (which have root as the owner).
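The mapping between root inside the namespace and the unprivileged user outside can be inspected through /proc/self/uid_map. This is a small sketch, assuming util-linux's unshare and enabled unprivileged user namespaces; show_uid_map is a hypothetical helper name.

```shell
#!/bin/sh
# Print the UID map of the calling process.
# Each line reads: <first ID inside> <first ID outside> <length of range>.
show_uid_map() {
    cat /proc/self/uid_map
}

echo "current namespace:"
show_uid_map

# -U creates a user namespace; -r maps the calling user to root inside it.
if unshare -Ur true 2>/dev/null; then
    echo "new user namespace:"
    unshare -Ur cat /proc/self/uid_map
else
    echo "new user namespace: skipped (unshare -Ur is not permitted here)"
fi
```

Run as container-user, the second listing should contain a single line that maps UID 0 inside the namespace to that user's UID on the host (1000 in the findmnt output shown later in this article).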

So what happens if you simply start a new mount namespace?

PS1='\u@new-mnt$ ' unshare -Umr

Now that you're inside the new namespace, you might expect not to see any of the original mount points from the host. However, this isn't the case:

root@new-mnt$ df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/cs-root   36G  5.2G   31G  15% /
tmpfs                737M     0  737M   0% /sys/fs/cgroup
devtmpfs             720M     0  720M   0% /dev
tmpfs                737M     0  737M   0% /dev/shm
tmpfs                737M  8.6M  728M   2% /run
tmpfs                148M     0  148M   0% /run/user/0
/dev/vda1            976M  197M  713M  22% /boot


root@new-mnt$ ls /
bin   container_practice  etc   lib    media  opt   root  sbin  sys  usr
boot  dev                 home  lib64  mnt    proc  run   srv   tmp  var

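The reason for this is that a new mount namespace starts with a copy of its parent's mount points, and systemd defaults to recursively sharing those mount points. What happens to mounts made afterward is a separate question. If you mount a tmpfs filesystem somewhere inside the new mount namespace, for example on /mnt, can the host see it?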

root@new-mnt$ mount -t tmpfs tmpfs /mnt

root@new-mnt$ findmnt |grep mnt
└─/mnt     tmpfs               tmpfs      rw,relatime,seclabel,uid=1000,gid=1000

The host, however, doesn't see this:

[root@localhost ~]# findmnt |grep mnt

So at the very least, you know that the mount namespace is functioning correctly. The mount stays local because unshare marks the mount points in a new mount namespace as private unless told otherwise with its --propagation option. This is a good time to take a small detour to discuss the propagation of mount points. I'm only summarizing briefly, so if you are interested in a greater understanding, have a look at Michael Kerrisk's LWN article as well as the man page for the mount namespace. I don't normally rely so much on the man pages, as I often find that they're not easily digestible. However, in this case, they are full of examples and written in (mostly) plain English.

Theory of mountpoints

Mounts propagate by default because of a kernel feature called shared subtrees. It allows every mount point to have its own propagation type associated with it. This metadata determines whether new mounts under a given path are propagated to other mount points. The example given in the man page is that of an optical disk: if it were automatically mounted under /cdrom, the contents would only be visible in other namespaces if the appropriate propagation type were set.

Peer groups and mount states

The kernel documentation says that a "peer group is defined as a group of vfsmounts that propagate events to each other." Events are things such as mounting a network share or unmounting an optical device. Why is this important, you ask? Well, when it comes to the mount namespace, peer groups are often the deciding factor as to whether or not a mount is visible and can be interacted with. A mount state determines whether a member in a peer group can receive the event. According to the same kernel documentation, there are five mount states:

shared - A mount that belongs to a peer group. Any changes that occur will propagate through all members of the peer group.
slave - One-way propagation. The master mount point will propagate events to a slave, but the master will not see any actions the slave takes.
shared and slave - Indicates that the mount point has a master, but it also has its own peer group. The master will not be notified of changes to a mount point, but any peer group members downstream will.
private - Does not receive or forward any propagation events.
unbindable - Does not receive or forward any propagation events and cannot be bind mounted.

It's important to note that the mount point state is per mount point. This means that if you have / and /boot, for example, you'd have to separately apply the desired state to each mount point.
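The per-mount state is recorded in the optional fields of /proc/self/mountinfo: shared:N marks membership in peer group N, master:N marks a slave of peer group N, and unbindable is spelled out; a mount with none of these tags is private. The sketch below turns those tags into the state names used above. The function name propagation_states is invented for this example, and it assumes awk is available.

```shell
#!/bin/sh
# Print "<mount point> <state>" for every line of a mountinfo-format file.
# Defaults to the mounts of the current process.
propagation_states() {
    awk '{
        state = ""
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            tag = ""
            if ($i ~ /^shared:/)         tag = "shared"
            else if ($i ~ /^master:/)    tag = "slave"
            else if ($i == "unbindable") tag = "unbindable"
            if (tag != "") state = state (state ? "," : "") tag
        }
        if (state == "") state = "private"   # no propagation tags at all
        print $5, state
    }' "${1:-/proc/self/mountinfo}"
}

if [ -r /proc/self/mountinfo ]; then
    propagation_states
fi
```

On systems with a recent util-linux, findmnt -o TARGET,PROPAGATION reports the same information without any parsing.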

In case you're wondering about containers, most container engines use private mount states when mounting a volume inside a container. Don't worry too much about this for now. I just want to provide some context. If you want to try some specific mounting scenarios, look at the man pages as the examples are quite good.
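For experimenting with the states themselves, mount has a --make-* option per state (--make-shared, --make-slave, --make-private, --make-unbindable, plus recursive r variants). Here is a hedged sketch that changes the state of a scratch tmpfs inside a throwaway namespace so the host is not affected. It assumes unshare -Urm is permitted and that findmnt supports the PROPAGATION column; the demo variable is just a convenient way to pass the commands to the inner shell.

```shell
#!/bin/sh
# Commands to run inside a throwaway user + mount namespace.
demo='
    set -e
    dir=$(mktemp -d)
    mount -t tmpfs tmpfs "$dir"         # a scratch mount to experiment on
    mount --make-shared "$dir"          # join a new peer group
    findmnt -n -o TARGET,PROPAGATION "$dir"
    mount --make-private "$dir"         # stop sending and receiving events
    findmnt -n -o TARGET,PROPAGATION "$dir"
    mount --make-unbindable "$dir"      # private, and cannot be bind mounted
    findmnt -n -o TARGET,PROPAGATION "$dir"
    umount "$dir"
    rmdir "$dir"
'

if unshare -Urm true 2>/dev/null; then
    unshare -Urm sh -c "$demo"
else
    echo "skipped: unshare -Urm is not permitted on this system"
fi
```

The slave state is left out on purpose: it only makes a visible difference when the mount already has a master to receive events from, which a freshly created tmpfs does not.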

Creating our mount namespace

If you're using a programming language like Go or C, you could use raw system calls to create the appropriate environment for your new namespace(s). However, since the intent here is to help you understand how to interact with a container that already exists, you'll have to do some bash trickery to get your new mount namespace into the desired state.

First, create the new mount namespace as a regular user:

unshare -Urm

Once you're inside the namespace, look at the findmnt output for the mapper device, which contains the root file system (for brevity, I removed most of the mount options from the output):

findmnt |grep mapper

/       /dev/mapper/cs-root      xfs           rw,relatime,[...]

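There is only one mount point backed by the root device mapper. This is important because pivot_root, used below, requires its new root to be a mount point, and the Alpine directory is currently just a directory on the root file system. Bind mounting the Alpine directory onto itself turns it into a mount point: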

export CONTAINER_ROOT_FOLDER=/container_practice
mount --bind ${CONTAINER_ROOT_FOLDER}/fakeroot ${CONTAINER_ROOT_FOLDER}/fakeroot
cd ${CONTAINER_ROOT_FOLDER}/fakeroot

The bind mount is needed because you're using a utility called pivot_root to perform a chroot-like action. pivot_root takes two arguments: new_root and old_root (sometimes referred to as put_old). It moves the root file system of the current process to the directory put_old and makes new_root the new root file system.

IMPORTANT: A note about chroot. chroot is often thought of as having extra security benefits. To some extent, this is true, as it takes a more significant amount of expertise to break free of it. A carefully constructed chroot can be very secure. However, chroot does not modify or restrict Linux capabilities, which I touched on in the previous namespace article. Nor does it limit system calls to the kernel. This means that a sufficiently skilled aggressor could potentially escape a chroot that has not been well thought through. The mount and user namespaces help to solve this problem.

If you use pivot_root without the bind mount, the command responds with:

pivot_root: failed to change root from `.' to `old_root/': Invalid argument
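The error arises because pivot_root requires new_root to be a mount point. Before pivoting, you can check whether a directory qualifies by looking for it in the mount point column (field 5) of /proc/self/mountinfo. This is a sketch only: is_mount_point is a hypothetical helper, and it does not handle paths containing spaces, which mountinfo stores in escaped form.

```shell
#!/bin/sh
# Succeed if the directory $1 is listed as a mount point.
# $2 optionally names a mountinfo-format file (defaults to the current process).
is_mount_point() {
    # Resolve symlinks so the path matches what the kernel reports.
    resolved=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    awk -v d="$resolved" '$5 == d { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}

candidate=${CONTAINER_ROOT_FOLDER:-/container_practice}/fakeroot
if is_mount_point "$candidate"; then
    echo "$candidate is a mount point and can be used as new_root"
else
    echo "$candidate is not a mount point; bind mount it onto itself first"
fi
```

The mountpoint utility, where installed, performs a similar check with mountpoint -q.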

To switch to the Alpine root filesystem, first make a directory for old_root and then pivot into the intended (Alpine) root filesystem. Since the Alpine Linux root filesystem doesn't have symlinks for /bin and /sbin, you'll have to add those directories to your path. Finally, unmount old_root:

mkdir old_root
pivot_root . old_root
PATH=/bin:/sbin:$PATH
umount -l /old_root

You now have a nice environment where the user and mount namespaces work together to provide a layer of isolation from the host. You no longer have access to binaries on the host. Try issuing the findmnt command that you used before:

root@new-mnt$ findmnt
-bash: findmnt: command not found

You can also look at the root filesystem or attempt to see what's mounted:

root@new-mnt$ ls -l /
total 12
drwxr-xr-x    2 root     root          4096 Jan 28 21:51 bin
drwxr-xr-x    2 root     root            18 Feb 17 22:53 dev
drwxr-xr-x   15 root     root          4096 Jan 28 21:51 etc
drwxr-xr-x    2 root     root             6 Jan 28 21:51 home
drwxr-xr-x    7 root     root           247 Jan 28 21:51 lib
drwxr-xr-x    5 root     root            44 Jan 28 21:51 media
drwxr-xr-x    2 root     root             6 Jan 28 21:51 mnt
drwxrwxr-x    2 root     root             6 Feb 17 23:09 old_root
drwxr-xr-x    2 root     root             6 Jan 28 21:51 opt
drwxr-xr-x    2 root     root             6 Jan 28 21:51 proc
drwxr-xr-x    2 root     root             6 Feb 17 22:53 put_old
drwx------    2 root     root            27 Feb 17 22:53 root
drwxr-xr-x    2 root     root             6 Jan 28 21:51 run
drwxr-xr-x    2 root     root          4096 Jan 28 21:51 sbin
drwxr-xr-x    2 root     root             6 Jan 28 21:51 srv
drwxr-xr-x    2 root     root             6 Jan 28 21:51 sys
drwxrwxrwt    2 root     root             6 Feb 19 16:38 tmp
drwxr-xr-x    7 root     root            66 Jan 28 21:51 usr
drwxr-xr-x   12 root     root           137 Jan 28 21:51 var


root@new-mnt$ mount
mount: no /proc/mounts

Interestingly, there is no proc filesystem mounted by default. Try to mount it:

root@new-mnt$ mount -t proc proc /proc
mount: permission denied (are you root?)

root@new-mnt$ whoami
root

Because proc is a special type of mount related to the PID namespace, you can't mount it even though you're in your own mount namespace. This goes back to the capability inheritance that I discussed earlier. I'll pick up this discussion in the next article when I cover the PID namespace. However, as a reminder about inheritance, have a look at the diagram below:

In the next article, I'll rehash this diagram, but if you've followed along since the beginning, you should be able to make some inferences before then.

Wrapping up

In this article, I covered some deeper theory around the mount namespace. I discussed peer groups and how they relate to the mount states that are applied to each mount point on a system. For the hands-on part, you downloaded a minimal Alpine Linux file system and then walked through how to use the user and mount namespaces to create an environment that looks a lot like chroot except potentially more secure.

For now, test mounting file systems inside and outside of your new namespace. Try creating new mount points that use the shared, private, and slave mount states. In the next article, I'll use the PID namespace to continue building out the primitive container to gain access to the proc file system and process isolation.

The 7 most used Linux namespaces

Check out this brief overview of what the seven most used Linux namespaces are.

Building a Linux container by hand using namespaces

How user namespaces in Linux relate to container security.

A Linux sysadmin's introduction to cgroups

Defining cgroups and how they help with resource management and performance tuning in this first article kicking off a four-part series covering cgroups and resource management.
