
All recent versions of the most popular Linux distributions are using systemd to boot the machine and manage system services. Systemd provides several features to make the starting of services easier and more secure. This is a rare combination, and this article shows why it is useful to let systemd manage the resources and sandboxing of a service.

Justification

So, why should we use systemd for security sandboxing? First, one might argue that each bit of this functionality is already exposed through existing and well-known tools, which can be scripted and combined in arbitrary ways. Second, particularly in case of programs written in C/C++ and other low-level languages, appropriate system calls can be used directly, achieving a lean implementation carefully tailored to the needs of a particular service.

There are four main reasons:

1. Security is hard. A centralized implementation in the service manager means that a service that takes advantage of it can be significantly simplified. No doubt, this centralized implementation is complex, but because of its wide use, it is well tested. If we consider that it is reused over thousands of services, the overall complexity of the system is reduced.

2. Security primitives vary between systems. Systemd smooths over the differences between hardware architectures, kernel versions, and system configurations.

The functionality that provides hardening of services is implemented to the extent possible on a given system. For example, a systemd unit may contain both AppArmor and SELinux configurations. The first is used on Ubuntu/Debian systems, the second on Fedora/RHEL/CentOS, and neither on distributions that don't enable any MAC system. The other side of this flexibility is that those features cannot be relied on as the only containment mechanism, unless such services are only used on systems that support all required features.

3. Security requires low-level fiddling with the system. Features provided by the service manager are independent of the implementation language of the service, so it is easy to write a service in a high-level language, e.g., shell or Python or whatever is convenient, and still lock it down.

4. Security requires privileges. This is a paradox, but privileges are required to take away privileges. For example, we often need to be root to set up a custom mount namespace to limit a view of the filesystem. As another example, an HTTP daemon is often started as root only to be able to open a low-numbered port, and low-numbered ports are restricted in the name of security. The service manager needs to run with the highest privileges anyway, but services shouldn't, and the hardening setup is often the only reason to require higher privileges. Any bugs in the implementation of the service in this phase can be dangerous. By offloading setup to the service manager, services can start without this early phase of elevated privileges.

To put this in context, the recently released Fedora 32 contains almost 1800 different unit files for starting services written in C, C++, Python, Java, OCaml, Perl, Ruby, Lua, Tcl, Erlang, and so on, and just one systemd.
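
As a sketch of the second point above, a single unit could carry settings for both MAC systems. The binary path, profile name, and SELinux context below are hypothetical placeholders. The leading "-" asks systemd to tolerate a failure to apply the setting, so the same file can be shipped to systems with either MAC system or none:

```ini
[Service]
# Hypothetical daemon; replace with a real binary.
ExecStart=/usr/bin/mydaemon
# Has an effect only where AppArmor is enabled; "-" ignores errors such as a missing profile.
AppArmorProfile=-mydaemon
# Ignored where SELinux is disabled; "-" ignores a failure to set the context.
SELinuxContext=-system_u:system_r:mydaemon_t:s0
```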


A few equivalent ways to start a service

Most commonly, systemd services are defined through a unit file: a text file in ini format that declares the commands to execute and various settings. After this unit file is edited, systemctl daemon-reload should be called to make the manager load the new settings. The output from the daemon lands in the journal, and a separate command is used to view it. When running commands interactively, all of that is not very convenient.

The systemd-run command tells the manager to start a command on behalf of the user and is a great alternative for interactive use. The command to execute is specified similarly to sudo: the first positional argument and everything after it is the actual command, and any preceding options are interpreted by systemd-run itself. The systemd-run command has dedicated options for some settings, such as --uid and --gid for the user and group. The -E option sets an environment variable, while the "catch-all" option -p accepts arbitrary key=value pairs, as in a unit file.

$ systemd-run whoami
Running as unit: run-rbd26afbc67d74371a6d625db78e33acc.service
$ journalctl -u run-rbd26afbc67d74371a6d625db78e33acc.service
-- Logs begin at Thu 2020-04-23 19:31:49 CEST, end at Mon 2020-04-27 13:22:35 CEST. --
Apr 27 13:22:18 fedora systemd[1]: Started run-rbd26afbc67d74371a6d625db78e33acc.service.
Apr 27 13:22:18 fedora whoami[520662]: root
Apr 27 13:22:18 fedora systemd[1]: run-rbd26afbc67d74371a6d625db78e33acc.service: Succeeded.
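
For comparison, a roughly equivalent unit file might look like the sketch below. The file name hello.service, its location, and the user foobar are assumptions made for illustration. The User=, Group=, and Environment= lines mirror what systemd-run would be told with --uid, --gid, and -E.

```ini
# /etc/systemd/system/hello.service (hypothetical name)
[Unit]
Description=Print the user the service runs as

[Service]
ExecStart=/usr/bin/whoami
# Same effect as systemd-run --uid=foobar --gid=foobar (the user and group must exist)
User=foobar
Group=foobar
# Same effect as systemd-run -E LANG=C
Environment=LANG=C
```

After saving the file, systemctl daemon-reload followed by systemctl start hello.service runs the command, and journalctl -u hello.service shows its output.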

systemd-run -t connects the standard input, output, and error streams of the command to the invoking terminal. This is great for running commands interactively (note that the service process is still a child of the manager).

$ systemd-run -t whoami
Running as unit: run-u53517.service
Press ^] three times within 1s to disconnect TTY.
root

Consistent environment

A unit always starts in a carefully defined environment. When we start a unit using systemctl or systemd-run, the command is always invoked as a child of the manager. The environment of the shell does not affect the environment in which the service commands run. Not all settings which can be specified in a unit file are supported by systemd-run, but most are, and as long as we stick to that subset, invocation through a unit file and systemd-run are equivalent. In fact, systemd-run creates a temporary unit file on the fly.

For example:

$ sudo systemd-run -M rawhide -t /usr/bin/grep PRETTY_NAME= /etc/os-release

Here, sudo talks to PAM to allow privilege escalation, and then executes systemd-run as root. Next, systemd-run makes a connection to a machine (container) named rawhide, where it talks to the system manager (PID 1 in the container) over D-Bus. The manager invokes grep, which does its job. The grep command prints to stdout, which is connected to the pseudo-terminal from which sudo was invoked.

Security settings

Users and dynamic users

Without further ado, let's talk about some specific settings, starting with the simplest and most powerful primitives.

First, the oldest, most basic, and possibly the most useful privilege separation mechanism: users. You can set the user with User=foobar in the [Service] section of a unit file, or systemd-run -p User=foobar, or systemd-run --uid=foobar. It might seem obvious (on Android, every application gets its own user), but in the Linux world, we still have too many services that needlessly run as root.

Systemd provides a mechanism to create users on demand. When invoked with DynamicUser=yes, a unique user number is allocated for the service. This number resolves to a temporary user name. This assignment is not stored in /etc/passwd, but is instead generated on the fly by an NSS module whenever the number or corresponding name is queried. After the service is stopped, the number might be reused later for another service.

When should a regular static user be used for a service, and when is a dynamic one preferred? Dynamic users are great when the user identity is ephemeral, and no integration with other services in the system is needed. But when we have a policy in the database to allow specific user access, directories shared with a particular group, or any other configuration where we want to refer to the user name, dynamic users are probably not the best option.

Mount namespaces

In general, it should be noted that systemd is often only wrapping functionality that is provided by the kernel. For example, various settings that limit access to the file system tree, making parts of it read-only or inaccessible, are accomplished by arranging the appropriate filesystems in an unshared mount namespace.

Several useful settings are implemented like this. The two most useful and general ones are ProtectHome= and ProtectSystem=. The first uses an unshared mount namespace to make /home either read-only or entirely inaccessible. The second is about protecting /usr, /boot, and /etc.

A third useful, though more specific, setting is PrivateTmp=. It uses mount namespaces to make a private directory visible as /tmp and /var/tmp for the service. The service's temporary files are hidden from other users to avoid any issues due to filename collisions or wrong permissions.
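
A minimal sketch of these three settings in a unit file follows. The values shown are one possible choice rather than a recommendation; systemd.exec(5) lists the values accepted by the systemd version in use.

```ini
[Service]
# yes: /home, /root, and /run/user become inaccessible; read-only: visible but not writable
ProtectHome=read-only
# yes: /usr and the boot loader directories are read-only; full: /etc as well
ProtectSystem=full
# The service gets its own private /tmp and /var/tmp
PrivateTmp=yes
```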

The file system view can be managed at the level of individual directories through InaccessiblePaths=, ReadOnlyPaths=, ReadWritePaths=, BindPaths=, and BindReadOnlyPaths=. The first two settings take away all access or just write access to parts of a file system hierarchy. The third is about restoring access, which is useful when we want to give full access only to some specific directory deep in the hierarchy. The last two allow moving directories, or, more precisely speaking, privately bind-mounting them in a different location.
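
The sketch below combines these path-level settings. All paths are hypothetical. The "-" prefix tells systemd to skip a path that does not exist instead of failing the unit, and the read-only bind setting is spelled BindReadOnlyPaths= in systemd.exec(5).

```ini
[Service]
# No access at all to this subtree
InaccessiblePaths=-/srv/private
# Everything under /srv/www is visible but not writable...
ReadOnlyPaths=-/srv/www
# ...except this one directory, where write access is restored
ReadWritePaths=-/srv/www/uploads
# Make /srv/www/content appear read-only at /var/www inside the service's namespace
BindReadOnlyPaths=-/srv/www/content:/var/www
# A writable bind mount uses the same source:destination syntax
BindPaths=-/srv/data:/data
```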

Returning to the subject of DynamicUser=yes, such transient users are only possible when the service is not allowed to create permanent files on disk. If such files were visible to other users, they would be shown as having no owner, or worse, they could be accessed by the new transient user with the same number, leading to an information leak or an unintended privilege escalation. Systemd uses mount namespaces to make most of the file system tree unwritable to the service. To allow permanent storage, a private directory is mounted into the file system tree visible to the service.

Note that those protections are independent of the basic file access control mechanism using file ownership and the permission mask. If a file system is mounted read-only, even users who could modify specific files based on standard permissions cannot do so until the filesystem is remounted read-write. This provides a safeguard against mistakes in file management (after all, it is not unheard of for users to set the wrong permission mask occasionally) and is a layer of a defense-in-depth strategy.

Protections implemented using mount namespaces are generally very efficient because the kernel implementation is efficient. The overhead of setting them up is usually negligible too.

Automatic creation of directories for a service

A relatively new feature that systemd provides for services is the automatic management of directories. Different paths of the filesystem have different storage characteristics and intended uses, but they fall into a few standard categories. The FHS specifies that /etc is for configuration files, /var/cache is for non-permanent storage, /var/lib/ is for semi-permanent storage, /var/log for the logs, and /run for volatile files. A service often needs a subdirectory in each of those locations. Systemd sets that up automatically, as controlled by the ConfigurationDirectory=, CacheDirectory=, StateDirectory=, LogsDirectory=, and RuntimeDirectory= settings. The service's user owns those directories, except for the configuration directory, which stays owned by root. The runtime directory is removed by the manager when the service stops. The general idea is to tie the existence of those filesystem assets to the lifetime of the service. They don't need to be created beforehand and they are cleaned up appropriately after the service stops running.

$ sudo systemd-run -t -p User=user -p ConfigurationDirectory=foo -p CacheDirectory=foo -p StateDirectory=foo -p RuntimeDirectory=foo -p PrivateTmp=yes ls -ld /run/foo /var/cache/foo /var/lib/foo /etc/foo /tmp/
Running as unit: run-u45882.service
Press ^] three times within 1s to disconnect TTY.
drwxr-xr-x  2 user user   40 Apr 26 08:21 /run/foo        ← automatically created and removed
drwxr-xr-x  2 user user 4096 Apr 26 08:20 /var/cache/foo  ← automatically created
drwxr-xr-x. 2 user user 4096 Nov 13 21:50 /var/lib/foo    ← automatically created
drwxr-xr-x. 2 root root 4096 Nov 13 21:50 /etc/foo        ← automatically created, but not owned by the user, since the service (usually) shall not modify its own configuration
drwxrwxrwt  2 root root   40 Apr 26 08:21 /tmp/           ← "sticky bit" is set, but this directory is not the one everyone else sees

Of course, those seven locations (counting PrivateTmp= as two) don't cover the needs of every service, but they should be enough for most situations. For other cases, manual setup or an appropriate configuration in tmpfiles.d is always an option.
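
Expressed in a unit file, the directory settings might look like the sketch below. The name foo, the user, and the ExecStart= command are placeholders. Recent systemd versions also pass the resulting paths to the service in environment variables such as $STATE_DIRECTORY and $RUNTIME_DIRECTORY, which is why the example simply prints its environment.

```ini
[Service]
User=user
# /etc/foo, owned by root
ConfigurationDirectory=foo
# /var/cache/foo
CacheDirectory=foo
# /var/lib/foo
StateDirectory=foo
# /var/log/foo
LogsDirectory=foo
# /run/foo, removed when the service stops
RuntimeDirectory=foo
PrivateTmp=yes
# Placeholder command: prints the environment, including the *_DIRECTORY variables
ExecStart=/usr/bin/env
```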

Automatic directory management ties nicely with the DynamicUser= setting and automatically-created users, by providing a service that runs as a separate user and is not allowed to modify most of the file system tree (even if file access permissions would allow that). The service may still access select directories and store data there, without any setup other than the unit file configuration.

For example, a Python web service might be run as:

$ systemd-run -p DynamicUser=yes -p ProtectHome=yes -p StateDirectory=webserver --working-directory=/srv/www/content python3 -m http.server 8000

or through the equivalent unit file:

[Service]
DynamicUser=yes
ProtectHome=yes
StateDirectory=webserver
WorkingDirectory=/srv/www/content
ExecStart=python3 -m http.server 8000

We make sure that the service runs as a transient user without the ability to modify the file system or have any access to user data.

The settings described here can be considered "high level." Even though the implementation might be tricky, the concepts themselves are easily understood, and the effect on the service is clear. There are a large number of other settings to take away various permissions and capabilities, lock down network protocols and kernel tunables, and even disable individual system calls. These are outside of the scope of this short article. Refer to the extensive reference documentation.

Putting all this to use

When we have a good understanding of what the service does and needs, we can consider what privileges are required and what we can take away. The obvious candidates are running as an unprivileged user and limiting access to user data under /home. The more we allow systemd to set things up for us (for example, by using StateDirectory= and friends), the more likely that the service can successfully run as an unprivileged user. Often the service needs access to a specific subdirectory, and we can achieve that using ReadWritePaths= and similar settings.
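
As an illustration of this approach, an existing service can be tightened with a drop-in file rather than by editing its unit. Everything named below (the service myapp, its user, and the paths) is hypothetical, and which directories need to stay writable depends entirely on the service.

```ini
# /etc/systemd/system/myapp.service.d/hardening.conf (hypothetical service and file name)
[Service]
# Run as an unprivileged user that already exists on the system
User=myapp
# Hide user data under /home
ProtectHome=yes
# Let systemd create /var/lib/myapp, owned by the service user
StateDirectory=myapp
# Mount the file system hierarchy read-only, then re-open one specific subdirectory
ProtectSystem=strict
ReadWritePaths=/srv/myapp/spool
```

The drop-in takes effect after systemctl daemon-reload and a restart of the service.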

Adding security measures in any sort of automatic way is impossible. Without a good understanding of what the service needs in different configuration scenarios and for different operations, we cannot define a useful sandbox. This means that the sandboxing of services is best done by their authors or maintainers.

Evaluation and status quo

The number of possible settings is large, and new ones are added with each release of systemd. Keeping up with that is hard. Systemd provides a tool to evaluate the use of sandboxing directives in the unit file. The results should be considered hints — after all, as mentioned above, automatic creation of a security policy is hard, and any evaluation is just counting what is used and what is not, without any deep understanding of what matters for a given service.

$ systemd-analyze security systemd-resolved.service
  NAME                    DESCRIPTION                                                       EXPOSURE
...
✓ User=/DynamicUser=      Service runs under a static non-root user identity                       
✗ DeviceAllow=            Service has a device ACL with some special devices                0.1
✓ PrivateDevices=         Service has no access to hardware devices                                
✓ PrivateMounts=          Service cannot install system mounts                                     
  PrivateTmp=             Service runs in special boot phase, option does not apply                
✗ PrivateUsers=           Service has access to other users                                 0.2
  ProtectHome=            Service runs in special boot phase, option does not apply                
✓ ProtectKernelLogs=      Service cannot read from or write to the kernel log ring buffer          
✓ ProtectKernelModules=   Service cannot load or read kernel modules                               
✓ ProtectKernelTunables=  Service cannot alter kernel tunables (/proc/sys, …)                      
  ProtectSystem=          Service runs in special boot phase, option does not apply                
✓ SupplementaryGroups=    Service has no supplementary groups                                      
...

→ Overall exposure level for systemd-resolved.service: 2.1 OK 🙂

$ systemd-analyze security httpd.service
  NAME                    DESCRIPTION                                                        EXPOSURE
...
✗ User=/DynamicUser=      Service runs as root user                                          0.4
✗ DeviceAllow=            Service has no device ACL                                          0.2
✗ PrivateDevices=         Service potentially has access to hardware devices                 0.2
✓ PrivateMounts=          Service cannot install system mounts                                  
✓ PrivateTmp=             Service has no access to other software's temporary files             
✗ PrivateUsers=           Service has access to other users                                  0.2
✗ ProtectHome=            Service has full access to home directories                        0.2
✗ ProtectKernelLogs=      Service may read from or write to the kernel log ring buffer       0.2
✗ ProtectKernelModules=   Service may load or read kernel modules                            0.2
✗ ProtectKernelTunables=  Service may alter kernel tunables                                  0.2
✗ ProtectSystem=          Service has full access to the OS file hierarchy                   0.2
  SupplementaryGroups=    Service runs as root, option does not matter                          
...

→ Overall exposure level for httpd.service: 9.2 UNSAFE 😨

Again, this doesn't mean that the service is insecure, but that it is not using the systemd security primitives.
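
To connect the report to the unit file, here is a sketch of a drop-in that turns on several of the settings marked ✗ for httpd.service above. It has not been tested against httpd: whether the daemon still works with these restrictions depends on its configuration, so treat it as a starting point for experiments rather than a vetted policy.

```ini
# /etc/systemd/system/httpd.service.d/hardening.conf (hypothetical file name)
[Service]
PrivateDevices=yes
ProtectHome=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectSystem=full
```

After systemctl daemon-reload and a restart, running systemd-analyze security httpd.service again shows how the exposure score changes.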

Looking at the level of the whole distribution:

$ systemd-analyze security '*'
[Figure: systemd-analyze security exposure scores for services in Fedora]

We see that most services score very high (i.e., bad). We cannot gather such statistics about various in-house services, but it seems reasonable to assume that they are similar. There is certainly a lot of low-hanging fruit, and applying some relatively simple sandboxing would make our systems safer.

Wrap up

Letting systemd manage services and sandboxing can be a great way of adding a layer of security to your Linux servers. Consider testing the configurations above to see what might benefit your organization.

In this article, we studiously avoided any mention of networking. This is because the second installment is going to talk about socket activation and sandboxing of services using the network.



About the author

I work in the "Plumbers Team" of Red Hat, taking care of upstream systemd development and maintenance of systemd in Fedora. Currently a member of FESCo (the Fedora Engineering Steering Committee).
