Container permission denied: How to diagnose this error

Learn what is causing a container permissions error and how to work around the issue without resorting to the --privileged flag.

Posted: April 29, 2022 by Dan Walsh (Red Hat)

Test tubes (Photo by Kindel Media from Pexels)

Over the years, I have often given a talk using the story of Goldilocks and the Three Bears and how it compares to container security.

In the story, Goldilocks complains that Papa Bear's porridge is too hot, Mama Bear's is too cold, and Baby Bear's is just right. In the next section, she finds Papa Bear's bed is too hard, Mama Bear's bed is too soft, and Baby Bear's bed is just right.

If you set the security on containers too tight, many containers will not run.

If you set the security on containers too loose, you didn't really secure them.

When I want to lock down containers, I look for the Goldilocks level, where containers are as secure as possible but most of them still run within the default constraints.

Why does your container fail with "permission denied"?

Many users' only choice is to run with --privileged mode. When the container runs fine with --privileged, users need to understand what those privileges mean: They mean you are beyond Mama Bear's territory. The --privileged flag turns off all security separation on the container. The container processes get the same privilege as if they were run directly by the user. If the user is root, the processes get full root privileges.

Note: Even in --privileged mode, containers are still subject to namespace protections, including the user namespace. I will cover those later in this article.

This article explains how to figure out what the container is trying to do that is blocked by container security and how to run your container with more protection than --privileged.

Why Podman?

Because I work on Podman, most of the rest of this article covers using it to secure containers, but the concepts and separation apply to other container engines like Buildah, Docker, CRI-O, and containerd.

Podman uses many security mechanisms for isolating containers from the host system and other containers. These security mechanisms can cause a permission-denied error, and sadly only the kernel knows which one is blocking access to the container process. I saw this problem coming, and back in 2013, I opened a feature discussion called FriendlyEPERM. Lots of security features were being added to the Linux kernel that could cause a process to get EPERM, and there would be no reasonable way for the user or administrator to figure out what happened. Only the kernel would know. And it might spread some crumbs around the system to help diagnose the issue, but it didn't do this consistently.

FriendlyEPERM's goal was to have the kernel write the reason for EPERM into the /proc filesystem to allow logging tools to inform the user why the process was denied access. FriendlyEPERM never happened because it would be inherently racy, and no one ever figured out a way to have the kernel reveal to a process why it was denied access.

Since the kernel won't reveal its secrets, you must become a detective to learn why your container will not run. The rest of this article goes through the different security mechanisms, how to diagnose what is causing the problem, and how to work around the issue without requiring the --privileged flag.

1. Confirm the problem is security

Use the --privileged flag to ensure it is a security problem. Sometimes the problem is related to something other than security, such as namespaces. I cover namespaces at the end of this article. If your container runs with the --privileged flag, the problem is likely a security issue. If it still does not run, the problem may be with namespaces.

If the container runs in --privileged mode, here are the security mechanisms I would try.

a. Is SELinux the issue?

SELinux is a labeling system that protects the filesystem from container processes. If the content on the host system leaks into a container or a container process escapes, then SELinux blocks access. SELinux can easily cause permission-denied errors, especially when you're using volumes. Many articles have been written on SELinux, container volumes, and the use of the :z and :Z flags.

SELinux can be diagnosed relatively quickly by checking for Access Vector Cache (AVC) messages in the /var/log/audit/audit.log or running the container in permissive mode with sudo setenforce 0. Another alternative is running a container without SELinux separation:

$ podman run --security-opt label=disable …

Of course, I would never recommend disabling SELinux, but understanding that it is causing the failures makes problems easier to diagnose.

The classic SELinux issue is that a process inside the container is not allowed to write to a volume:

$ mkdir /tmp/data

$ podman run -v /tmp/data:/data fedora touch /data/content
touch: cannot touch '/data/content': Permission denied

If you run the container with --privileged, it works:

$ podman run --privileged -v /tmp/data:/data fedora touch /data/content

So you now know that this is a privilege problem.

If you look in the audit.log using ausearch, you see an AVC record:

$ sudo ausearch -m avc -ts recent

----

time->Wed Mar 30 10:39:57 2022

type=AVC msg=audit(1648651197.081:7952): avc: denied { write } for pid=1235318 comm="touch" name="data" dev="tmpfs" ino=1518754 scontext=system_u:system_r:container_t:s0:c236,c859 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=dir permissive=0

It seems to be an SELinux issue.
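AVC records are dense, so it can help to pull out just the fields that matter. The following is a minimal sketch, not part of ausearch or Podman: the function name avc_summary is hypothetical, and it assumes one AVC record per line in the format shown above.

```shell
# avc_summary: hypothetical helper, not part of the audit tools or Podman.
# Reads AVC records on stdin and prints, for each denial, the denied
# permission(s), the source and target SELinux contexts, and the object class.
# Lines that are not AVC denials are ignored.
avc_summary() {
    sed -n 's/.*denied *{ *\([^}]*[^ }]\) *}.* scontext=\([^ ]*\) tcontext=\([^ ]*\) tclass=\([^ ]*\).*/perm=\1 source=\2 target=\3 class=\4/p'
}
```

You could feed it with sudo ausearch -m avc -ts recent | avc_summary. For the record above, the summary shows a container_t process being denied write on a directory labeled user_tmp_t, which is the kind of label mismatch that relabeling a volume addresses.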


You could have also set the SELinux system in permissive mode:

$ sudo setenforce 0

$ podman run -v /tmp/data:/data fedora touch /data/content

$ sudo setenforce 1

And the command works.

Finally, you can use the :Z option and tell Podman to relabel the content so that it is private to the container and run the container fully locked down with SELinux in enforcing mode:

$ podman run -v /tmp/data:/data:Z fedora touch /data/content

If SELinux is not the issue, turn the security back on (sudo setenforce 1), and check another security mechanism.

b. Is AppArmor the issue?

AppArmor is similar to SELinux in that rules are added to the kernel to control process access to the system. Like SELinux, AppArmor could cause a permission-denied error. You can verify whether it is the problem by turning off AppArmor separation:

$ podman run --security-opt apparmor=unconfined …

Our team has heard of cases where unconfined is still not working. In those cases, you can try disabling the AppArmor profile or AppArmor itself.

c. Test capabilities

Podman drops Linux capabilities when it starts a container. Podman runs root processes with the following capabilities by default:

CHOWN
DAC_OVERRIDE
FOWNER
FSETID
KILL
NET_BIND_SERVICE
SETFCAP
SETGID
SETPCAP
SETUID
SYS_CHROOT

Imagine running a build with a Containerfile that attempts to create a device node:

$ cat /tmp/Containerfile
from fedora
run mknod /dev/mynull c 1 3

Running rootful podman build on this Containerfile fails:

# podman build /tmp
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
mknod: /dev/mynull: Operation not permitted
Error: error building at STEP "RUN mknod /dev/mynull c 1 3": error while running runtime: exit status 1

Since podman build does not even have a --privileged flag, you need to start diagnosing a workaround. Check whether SELinux is causing the problem:

# setenforce 0

# podman build /tmp
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
mknod: /dev/mynull: Operation not permitted
Error: error building at STEP "RUN mknod /dev/mynull c 1 3": error while running runtime: exit status 1

# setenforce 1

Nope. The podman build command still fails while in permissive mode, so the problem is not likely to be SELinux. You could try adding all capabilities. (Note: Podman running with --privileged mode turns on all capabilities.)

You can turn on all capabilities for running a container by executing the following command:

# podman build --cap-add all /tmp/
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
COMMIT
--> ee04c826eb9
ee04c826eb9bd8726fb234c83d9cc4c9218c433f56a804e0f06bbefa43fcf586

Because the container runs fine with all capabilities, you need to figure out which capability is required.

The most powerful Linux capability is SYS_ADMIN, so attempt that one:

# podman build --no-cache --cap-add sys_admin /tmp
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
mknod: /dev/mynull: Operation not permitted
Error: error building at STEP "RUN mknod /dev/mynull c 1 3": error while running runtime: exit status 1

That attempt failed, so SYS_ADMIN is not the missing capability. Because the build succeeded with all capabilities added, one of the others must be the one it needs.

During diagnosis, ask what the service was attempting to do when it got permission denied. If it has something to do with the network, search the capabilities list for network-related capabilities and try adding those (NET_BIND_SERVICE, NET_BROADCAST, NET_ADMIN, NET_RAW). In this case, the build is attempting to create a device node, so check that capability:

# podman build --no-cache --cap-add mknod /tmp
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
COMMIT
--> e24c8234f10
e24c8234f10fc2f8284ae91253891077dc038c30587f5a4a09f8e315218e7f14

Obviously, CAP_MKNOD is the missing capability.
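Trying capabilities one at a time gets tedious, so the search can be scripted. This is a rough sketch rather than a Podman feature: find_missing_cap and try_build are hypothetical names, and the list of candidate capabilities is whatever you pass in.

```shell
# find_missing_cap: hypothetical helper. Runs the given command once per
# capability, passing the capability name as its only argument, and prints
# every capability for which the command succeeds.
# Usage: find_missing_cap COMMAND CAP...
find_missing_cap() {
    try=$1
    shift
    for cap in "$@"; do
        if "$try" "$cap" >/dev/null 2>&1; then
            echo "$cap"
        fi
    done
}

# try_build: example command wrapping the build used in this article.
# --no-cache forces the RUN step to execute again on every attempt.
try_build() {
    podman build --no-cache --cap-add "$1" /tmp
}
```

For the Containerfile above, running find_missing_cap try_build sys_admin net_admin net_raw mknod as root should report only mknod, which matches the manual result. If nothing is printed, the container may need more than one capability at the same time.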

Sometimes users have problems with a Podman container, and they tell me that it works with Docker. One reason for this is that Podman runs with tighter security and fewer Linux capabilities than Docker. Podman drops a few capabilities that Docker allows by default. If a container runs with Docker but not Podman, try adding the missing capabilities: NET_RAW, SYS_CHROOT, AUDIT_WRITE, MKNOD.

# podman build --no-cache --cap-add sys_chroot,net_raw,audit_write,mknod /tmp
STEP 1/2: FROM fedora
STEP 2/2: run mknod /dev/mynull c 1 3
COMMIT
--> 3faef5a3084
3faef5a3084427343165c3d322e410108e18ba2b86f45ccee0ab6771a654fcf

If you want to really get down and dirty, you can use strace to attempt to get the actual syscall that is being denied.
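Once you have a trace saved to a file, the denied calls can be picked out with standard tools. The sketch below assumes a trace captured with something like strace -f -o /tmp/trace.log followed by the failing command; the function name denied_syscalls and the log path are hypothetical.

```shell
# denied_syscalls: hypothetical helper. Reads strace output on stdin and
# prints the unique names of system calls that returned EPERM or EACCES.
# An optional leading PID column (as written by "strace -f -o FILE") is
# skipped. Calls split across "unfinished"/"resumed" lines are not reported.
denied_syscalls() {
    grep -E ' = -1 (EPERM|EACCES) ' |
        sed -n -E 's/^([0-9]+ +)?([a-z0-9_]+)\(.*/\2/p' |
        sort -u
}
```

Running denied_syscalls < /tmp/trace.log narrows a long trace down to a handful of syscall names. A denied mknod or mknodat call points at the MKNOD capability, for example.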


d. Test SECCOMP

Podman uses SECCOMP to limit the number of system calls available within a container. The list of syscalls is shipped in the /usr/share/containers/seccomp.json file. Working with seccomp files is a little advanced, so I usually just tell people to see if the container runs with seccomp separation disabled. You can disable SECCOMP easily and see if the container runs:

$ podman run --security-opt seccomp=unconfined …

Sometimes SECCOMP denials show up in /var/log/audit/audit.log. Instead of turning off SECCOMP entirely, generate a profile for the specific workload and container. Please refer to Improving Linux container security with seccomp to learn how to do that with Podman.
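When SECCOMP records do appear in the audit log, they identify the system call by number. The following sketch extracts those numbers. It assumes records that carry a syscall=N field, as SECCOMP audit records normally do, and seccomp_syscalls is a hypothetical name.

```shell
# seccomp_syscalls: hypothetical helper. Reads audit records on stdin and
# prints the unique syscall numbers found in type=SECCOMP records, in
# ascending numeric order. Other record types are ignored.
seccomp_syscalls() {
    grep 'type=SECCOMP' |
        sed -n 's/.* syscall=\([0-9][0-9]*\).*/\1/p' |
        sort -un
}
```

You might run it as sudo ausearch -m seccomp -ts recent | seccomp_syscalls. If the ausyscall tool from the audit package is installed, it can translate each number into a syscall name for your architecture.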

e. Test masked kernel filesystems

Podman masks over several kernel filesystems to prevent processes within the container from certain activities on the kernel filesystems. Sometimes the processes inside the container might need to access one of these masked kernel filesystems. When running in --privileged mode, Podman does not mask any of the kernel filesystems. You can also run containers without the masks by executing:

$ podman run --security-opt unmask=all …

Use man podman run to display the unmask options:

● unmask=ALL or /path/1:/path/2, or shell expanded paths (/proc/*): Paths to unmask separated by a colon. If set to ALL, it will unmask all the paths that are masked or made read only by default. The default masked paths are /proc/acpi, /proc/kcore, /proc/keys, /proc/latency_stats, /proc/sched_debug, /proc/scsi, /proc/timer_list, /proc/timer_stats, /sys/firmware, and /sys/fs/selinux. The default paths that are read only are /proc/asound, /proc/bus, /proc/fs, /proc/irq, /proc/sys, /proc/sysrq-trigger, /sys/fs/cgroup.
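The checks in this section can also be run in one pass, relaxing one mechanism at a time. This is a sketch under stated assumptions, not a Podman feature: try_relaxations and run_case are hypothetical names, and run_case wraps the failing volume example from the SELinux section.

```shell
# try_relaxations: hypothetical helper. Runs the given command once per
# relaxation, passing the relaxation flags as its arguments, and reports
# each relaxation that makes the command succeed.
try_relaxations() {
    run=$1
    for flags in \
        "--security-opt label=disable" \
        "--security-opt apparmor=unconfined" \
        "--cap-add all" \
        "--security-opt seccomp=unconfined" \
        "--security-opt unmask=all"
    do
        # $flags is intentionally unquoted so it splits into two arguments.
        if "$run" $flags >/dev/null 2>&1; then
            echo "works with: $flags"
        fi
    done
}

# run_case: example command; "$@" receives the relaxation flags.
run_case() {
    podman run --rm "$@" -v /tmp/data:/data fedora touch /data/content
}
```

Calling try_relaxations run_case prints the relaxations that let the container run. If nothing is printed even though --privileged works, more than one mechanism is probably blocking the container, and you need to combine flags.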

2. Namespace issues

I have covered all the standard security separations. Next, I will look at namespaces.


a. Is user namespace the issue?

One of the most common issues and bug reports our team gets is that the XYZ container image works fine with Docker but blows up with Podman. This is almost invariably because the user is running rootful Docker and rootless Podman. Rootless Podman uses the user namespace, which restricts what the container can do on the host and can cause permission to be denied. You can diagnose this by running the container as root, which matches the default experience with Docker:

$ sudo podman run …

For example, examine what happens if you try to run a MariaDB image:

$ mkdir /tmp/data

$ podman run --env MARIADB_ROOT_PASSWORD=passwd -v /tmp/data/:/var/lib/mysql mariadb
Error: open /tmp/data/mysql: permission denied

Remember from the first section of this article that SELinux blocks access to random content on disk, so you need to add the :Z option:

$ podman run --env MARIADB_ROOT_PASSWORD=passwd -v /tmp/data/:/var/lib/mysql:Z mariadb
Error: open /tmp/data/mysql: permission denied

Nope, still broken. You can try it as root:

$ sudo podman run --env MARIADB_ROOT_PASSWORD=passwd -v /tmp/data/:/var/lib/mysql:Z mariadb
2022-04-06 20:28:15+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.7.3+maria~focal started.
2022-04-06 20:28:15+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2022-04-06 20:28:15+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.7.3+maria~focal started.
…


When it works in rootful mode but not rootless mode, there is a good chance the issue is with user namespace. User namespace tends to cause issues when volumes are mounted into containers, similar to the SELinux problems. Usually, the problem is a non-root user inside a container, say the MySQL UID 999, trying to access a volume mounted from the host user's home directory. By default, the UID of the host user is treated as UID 0 inside the container. The MySQL user of the MariaDB container (UID 999) is not allowed to read and write from it.

In a user namespace, this UID is not simply UID==999. It is offset by the range of UIDs in /etc/subuid. On my system, this UID 999 inside the container is mapped to UID 100998 outside the user namespace. For this issue, Podman makes it easy by adding a :U option. The :U tells Podman to recursively chown the volume to match the default user found inside the user namespaced container.

First, stop the rootful container from running, and then remove and recreate the /tmp/data directory since the actual root user owns the content in this directory:

$ sudo podman stop --latest

$ sudo rm -rf /tmp/data

$ mkdir /tmp/data

Now run the container again in rootless mode, this time with the :U option:

$ podman run --env MARIADB_ROOT_PASSWORD=passwd -v /tmp/data/:/var/lib/mysql:Z,U mariadb
2022-04-06 16:30:53-04:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.7.3+maria~focal started.
2022-04-06 16:30:53-04:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2022-04-06 16:30:53-04:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.7.3+maria~focal started.
…

If you'd like more information, I wrote about volumes and user namespace in Dealing with user namespaces and SELinux on rootless containers.

Another common issue with the user namespace is using a UID that is not mapped within the user namespace. By default, rootless users only use 65537 UIDs. If you use a UID greater than that, the user namespace treats it as undefined, and it will not be allowed. You can see the user namespace mappings of the container with the podman unshare command:

$ podman unshare cat /proc/self/uid_map
0 3267 1
1 100000 65536
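Each line of that output reads "first UID inside the namespace, first UID outside it, number of UIDs in the range". The arithmetic for translating a container UID can be sketched as follows; host_uid is a hypothetical helper name, not a Podman command.

```shell
# host_uid: hypothetical helper. Takes a container UID as $1 and uid_map
# lines ("inside outside count") on stdin. Prints the matching UID outside
# the user namespace, or "unmapped" if no range covers the container UID.
host_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    '
}
```

With the mapping shown above, podman unshare cat /proc/self/uid_map | host_uid 999 gives 100998, because UID 999 sits 998 places into the range that starts at 100000. A UID such as 70000 falls outside both ranges and is reported as unmapped.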

When a container fails because of an unmapped user, you can either run the container as root or expand the number of UIDs mapped for the user in the /etc/subuid and /etc/subgid files.

Note: If you ever modify those files, you need to run podman system migrate afterward to use them in a logged-in session.
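Before editing those files, it helps to know how many subordinate IDs a user already has. The sketch below assumes the usual name:start:count line format of /etc/subuid and /etc/subgid; subid_count is a hypothetical helper name.

```shell
# subid_count: hypothetical helper. Prints the total number of subordinate
# IDs allocated to user $1 in the subuid-format file $2 (lines of the form
# name:start:count). Prints 0 if the user has no entry.
subid_count() {
    awk -F: -v user="$1" '$1 == user { total += $3 } END { print total + 0 }' "$2"
}
```

For example, subid_count "$(id -un)" /etc/subuid prints 65536 for a user with one default-sized range. Adding the user's own UID gives the 65537 UIDs mentioned above.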

b. Is network namespace the issue?

Sometimes the issue is caused by the network namespace; you can disable this and use the host's network namespace using the --net=host flag:

$ podman run --net=host …

This runs the container in the host's network. Note that you still won't have full access to the network; if you are running as rootless, some access is prevented even if you have added all caps. For example, rootless users are not allowed to bind to ports < 1024:

$ podman run -p 80:80 ubi8/httpd-24
Error: rootlessport cannot expose privileged port 80, you can add 'net.ipv4.ip_unprivileged_port_start=80' to /etc/sysctl.conf (currently 1024), or choose a larger port number (>= 1024): listen tcp 0.0.0.0:80: bind: permission denied

This happens so often that Podman tells the user about it, and even describes a special sysctl that can be set to allow non-root users to bind to port 80:

$ sudo sysctl -w net.ipv4.ip_unprivileged_port_start=80
net.ipv4.ip_unprivileged_port_start = 80

$ podman run -p 80:80 ubi8/httpd-24
=> sourcing 10-set-mpm.sh ...
=> sourcing 20-copy-config.sh ...
=> sourcing 40-ssl-certs.sh ...
…
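You can check the threshold yourself before starting a container. This is a small sketch, with can_bind_rootless as a hypothetical helper name. It only compares a port number with the sysctl value and does not try to bind anything.

```shell
# can_bind_rootless: hypothetical helper. Succeeds if port $1 is at or above
# the unprivileged port threshold. The threshold is $2 if given; otherwise it
# is read from the running kernel.
can_bind_rootless() {
    port=$1
    start=${2:-$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)}
    [ "$port" -ge "$start" ]
}
```

For instance, can_bind_rootless 80 || echo "port 80 needs the sysctl change or a rootful container" warns you before podman run fails.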

c. Issues with PID or IPC namespaces

As with the network namespace, you could have issues with containers caused by the PID or IPC namespaces. It is simple to turn off the separation on these by executing with the --pid=host and --ipc=host options:

$ podman run --pid=host --ipc=host …

3. Try rootful containers

Some containers just require root. Usually, very privileged containers that want to modify the system will not work in rootless mode. Luckily, these are very rare. To run a container that mounts different types of filesystems, you need to run it in rootful mode.

The bottom line is that in rootless mode, you can only change system parameters related to namespaces and can only do what a normal user can do. Podman does not add anything special to the system, but it takes advantage of the namespaces in clever ways.

Wrap up

It should rarely be necessary to run with --privileged mode; if you spend a small amount of time investigating which of the subsystems is failing, you should be able to run with tighter security. You might be moving away from Papa Bear, but you don't need to go all the way to Mama Bear.