June 23, 2019
By Leo Dorrendorf

Linux Kernel Isolation Features

With guest writer Donald Tevault


This article is the third in the series of articles on isolation and integrity techniques, which we outlined in a previous blog entry. The series as a whole focuses on embedded Linux and Internet of Things (IoT) devices, and describes the security techniques we can use to isolate the services and applications running on them, so that attackers who take over a network-facing service do not end up compromising the entire device.

In this article, we will cover several methods based on the features of the Linux kernel. These include control groups (cgroups), namespace isolation, the kernel "capabilities" feature, and seccomp. These tools provide us with the basic means to limit the system resources a process can access and restrict the amount of each resource that it can consume, enforcing the "least privilege" principle. This should enable a device vendor or a system administrator to isolate selected server processes, services or daemons, and contain any damage caused by an attacker who gains a foothold in one of them. Used separately, all the features described in this article are low-level and require great care in deployment; we will look into easier ways to configure and deploy them in combination, through jail packages, in the next article.

Namespace Isolation

Namespaces are a security feature that’s built into the Linux kernel. The cool thing about them is that they allow a process to use a specific set of computer resources, which are completely segregated from the resources that another process would use. In total, there are seven different kinds of namespaces, with the first one having been introduced with kernel version 2.4.19 way back in 2002. Having the concept of namespaces is very handy for when you need to serve multiple customers from one physical server. Most importantly, processes running under a restricted account can receive a full set of capabilities inside their namespace; you can effectively convince services that they are running with root privileges, while actually isolating them from the rest of your environment. Processes can also get their own set of network interfaces and mount points. All these are central features for isolation and containerization.

The seven kinds of namespaces are as follows:

  • Mount (mnt) - This was the first namespace to be created, back in kernel version 2.4.19. In a nutshell, you can use this namespace to create a root filesystem that only one process can see. So, if you have files and directories that you only want for one process to see, you can create a mount namespace. But if you want to, you can share a mount namespace with other processes.
  • UTS (uts) - This namespace allows each process to have its own specific hostname and domain name.
  • Process ID (pid) - This allows each running process to have its own unique set of Process ID numbers, with the first process in each namespace as number 1 (a special PID normally reserved for init). A parent process can see its child processes in any namespace, and since PID namespaces are nested, the top-level namespace is able to see the PIDs for all child PID namespaces. However, a child PID namespace can’t see the PIDs for any sibling or parent PID namespace.
  • Network (net) - This allows you to create an entire virtual network stack for each process. Each network namespace that you create can have its own set of virtual network adapters, its own range of private IP addresses, its own routing table, and its own firewall.
  • Interprocess Communication (ipc) - This namespace separates processes with respect to SysV IPC APIs. Processes in different IPC namespaces cannot talk with each other using SysV IPCs, or create a memory space to share with other processes.
  • Control Group (cgroup) - This namespace conceals the identity of the cgroup of which a process is a member.
  • User (user) - A user namespace allows a user to have different levels of privileges with different processes. For example, you can allow one user to have administrative privileges for one process, but not for others.

It should be noted that only the User namespace can be created by any process regardless of privileges. To create any other namespace, root privileges, or a special capability - CAP_SYS_ADMIN - are necessary. We cover capabilities in the next section.

Although it is possible to manually create namespaces for your programs, as a normal user, you probably won’t. Some programs, such as the newer web browsers, use namespaces technology to enhance product security. As a normal user, you might want a way to use this technology to enhance the security of any program that you might want to run. Containerization programs, such as Docker, and sandbox programs, such as Firejail and Minijail, also include this technology. We will look into those in some detail in the following articles.

You can take a look at the namespaces for each running process by entering the /proc filesystem. For example, let’s look at the namespaces for PID 1.

user@host:~$ cd /proc/1
user@host:/proc/1$ ls -ld ns
dr-x--x--x 2 root root 0 May 23 19:48 ns

Here, we see the ns directory that contains the namespace information. What seems a bit unusual is the permissions settings. We can cd into it as a normal user, but we can’t view anything in it without sudo privileges. Let’s see what we have there.

user@host:/proc/1$ cd ns
user@host:/proc/1/ns$ sudo ls -l
total 0
lrwxrwxrwx 1 root root 0 May 23 19:55 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 23 19:55 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 May 23 19:55 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 root root 0 May 23 19:55 net -> 'net:[4026531993]'
lrwxrwxrwx 1 root root 0 May 23 19:55 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 23 19:55 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 23 19:55 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 23 19:55 uts -> 'uts:[4026531838]'

You can see that we have all seven namespaces represented here, plus a pid_for_children entry, which refers to the PID namespace in which the child processes of PID 1 will be created (here, the same namespace as PID 1 itself). Other namespaces are being considered for addition to Linux, such as the Security namespace, which would allow configuring security features per container.
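The same information can be read from a program. The following is a minimal sketch, written for Python 3 and assuming a Linux /proc filesystem; the function names parse_ns_link and read_namespaces are our own, chosen for illustration. Two processes share a namespace when the inode numbers for that entry are equal.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like 'pid:[4026531836]'.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Return {entry name: inode} for a PID.

    Reading another user's process (such as PID 1) needs root,
    just like the 'sudo ls -l' shown above.
    """
    ns_dir = "/proc/%s/ns" % pid
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


if __name__ == "__main__":
    # Print the namespaces of the current process.
    for name, inode in read_namespaces().items():
        print("%-18s %d" % (name, inode))
```

Running it inside and outside a container or sandbox, and comparing the numbers, is a quick way to confirm which namespaces have actually been separated.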

Namespace isolation comes with minor security caveats: as a kernel feature, it can be circumvented by exploiting vulnerabilities in the kernel itself. Some of these vulnerabilities are only exploitable by a process with root capabilities in its own namespace - for example, CVE-2016-9793. Others are generic, and some can even be stopped by deploying seccomp and other mechanisms described here. Regardless of any isolation features you deploy, make sure your kernel is up to date and apply security patches as necessary.

Kernel Capabilities

On a Linux system, the root user is nearly all-powerful and ignores most access restrictions. Sometimes though, having processes owned by the all-powerful root user can pose a security problem. Quite a few years ago, someone got the idea of allowing the Linux kernel to divide the root user’s capabilities into distinct units. Appropriately enough, these units are called capabilities. There are about 40 different capabilities which can all be independently enabled or disabled for a given program. Capabilities can allow a program to do things like connect to a privileged network port or to change ownership of files and directories, without running under the root account. We won’t list all of the capabilities here, except for a few examples to give you a sense of this mechanism. Here are some capabilities, and what a process enabled with those capabilities can do:

  • CAP_DAC_OVERRIDE - override any DAC (Discretionary Access Controls) settings on files - in other words, the process can read, write, and execute files regardless of their permissions.
  • CAP_SYS_MODULE - load and unload kernel modules.
  • CAP_SYS_PTRACE - trace, inspect, and read/write the memory of other processes, effectively acting as a debugger and taking full control of another process.
  • CAP_MKNOD - create device nodes.
  • CAP_SYS_ADMIN - perform a wide range of administrative, unsafe, and sensitive operations, override some permissions, directly interact with device drivers, and more.

As you can probably tell from the descriptions, the capabilities listed above are extremely powerful, effectively grant root-equivalent powers, and can be used to obtain complete control of the system. Many other capabilities are much narrower and can be granted to less-trusted processes with far less risk; for example, the CAP_NET_BIND_SERVICE capability allows binding to ports in the reserved range (port numbers below 1024), and can be granted to web servers and other Internet-facing services which should not be running as root. We'll provide a detailed example of its usage below.

You can see the whole list of capabilities by entering:

$ man capabilities
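To see which capabilities a running process actually holds, you can decode the CapEff bitmask that the kernel exposes in /proc/<pid>/status. Below is a minimal Python 3 sketch; the helper names decode_caps and effective_caps are our own, and the table deliberately covers only the capabilities mentioned in this article (bit positions as defined in the kernel's linux/capability.h header), so any other set bit is reported by its number.

```python
# Bit positions from <linux/capability.h>. Only the capabilities named in
# this article are listed; other set bits are reported as "bit N".
CAP_BITS = {
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    27: "CAP_MKNOD",
}


def decode_caps(mask):
    """Turn a capability bitmask into a list of names, lowest bit first."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_BITS.get(bit, "bit %d" % bit))
        bit += 1
    return names


def effective_caps(pid="self"):
    """Read and decode the effective capability set of a process."""
    with open("/proc/%s/status" % pid) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(int(line.split()[1], 16))
    return []


if __name__ == "__main__":
    print(effective_caps())
```

For an ordinary unprivileged process the list is normally empty, while a root process shows every bit set.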

So, how does this work? You can add one or more capabilities to a program’s executable file. Let’s say for example, that we want to set up a simple webserver that uses the SimpleHTTPServer module that comes with Python, and we don’t want to have to use sudo or root privileges in order to run it. We’ll use the following command to call in the module and start the server. Note that you’ll need to stop the Apache webserver service before this can work.

$ python -m SimpleHTTPServer 80
Traceback (most recent call last):
  File "/usr/lib/python2.7/", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/", line 235, in <module>
    test()
  File "/usr/lib/python2.7/", line 231, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib/python2.7/", line 606, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib/python2.7/", line 417, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib/python2.7/", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib/python2.7/", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied

At the very end of the output, you see an error message because we’re trying to have Python listen on a privileged port. Privileged network ports are those with numbers below 1024.

Next, let’s use the getcap command to see if any capabilities are enabled for the python2.7 executable file:

$ sudo getcap /usr/bin/python2.7

There’s no output, which means that no capabilities are enabled. Let’s now add a capability that will allow this Python module to work with a privileged port.

$ sudo setcap 'CAP_NET_BIND_SERVICE+ep'  /usr/bin/python2.7         
$ sudo getcap /usr/bin/python2.7
/usr/bin/python2.7 = cap_net_bind_service+ep

This time, the getcap command shows that we have one capability assigned to the Python executable file. Now, what happens if we try to run the SimpleHTTPServer module again?

$ python -m SimpleHTTPServer 80
Serving HTTP on port 80 …

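It now works fine, and we can actually connect to it from the web browser of another machine. Keep in mind that the capability is attached to the interpreter binary itself, so any script run with /usr/bin/python2.7 can now bind to privileged ports; on a production system you would rather grant it to a dedicated executable.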
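If you want to check from code whether the current process is allowed to bind to a given port, a small probe is enough. This is a minimal Python 3 sketch (the article's transcript uses Python 2.7); is_privileged_port and can_bind are hypothetical helper names, and the 1024 threshold is the Linux default, which some systems change through the net.ipv4.ip_unprivileged_port_start sysctl.

```python
import socket


def is_privileged_port(port):
    """Default Linux rule: ports 1-1023 need CAP_NET_BIND_SERVICE."""
    return 0 < port < 1024


def can_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket; report success instead of raising."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Covers 'Permission denied' as well as 'Address already in use'.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    for port in (80, 8080):
        print(port, is_privileged_port(port), can_bind(port))
```

Note that a False result for port 80 can also mean that another server, such as Apache, is already listening there.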

To give another example, the well-known ping network utility uses the SUID bit to run under the root account, no matter which account actually invokes it. Normally all it does is open a raw socket so it can send "ping" packets and observe the responses. It's possible to strip it of the SUID bit and grant it the CAP_NET_RAW necessary to open raw sockets, so it can do that without having it run as root. Then, even if the ping process gets exploited by an attacker (which has happened in the past), the exploit wouldn't result in a complete system takeover.

In some cases, such as when you start up a Docker container, you can specify the capabilities that you want to either use or not use. We will discuss Docker in one of the following articles.

The main problem with setting capabilities manually is that it’s not always obvious which specific capabilities are needed. In the above example, it is obvious that we needed an added network capability in order for the program to access a privileged port. In other cases though, getting the capabilities right could require a bit of trial-and-error. If you need to work with capabilities, your best bet would be to be familiar with how the program in question operates, so that you can make a better judgment about which set of kernel capabilities to try using for the initial setup.

We conclude the capabilities topic for now. We will review them again once we start to put all the features together, in the following articles on jails and sandboxing.

Control Groups (cgroups)

A cgroup is a collection of processes that are grouped together for some specific purpose. With a cgroup, you can do the following:

  • Set resource limits for the cgroup - Specifically, you can set limits for CPU usage, memory usage, and I/O usage.
  • Perform accounting functions - In other words, you can measure resource usage for the different cgroups. This can come in very handy, including for when you need to bill customers for resource usage.
  • Set prioritization - If you have one customer that is hogging so many resources that other customers can’t do what they need to do, you can simply adjust the resource limits for that specific customer.
  • Freeze, checkpoint, and restart - This just means that you can stop the processes in a cgroup, take a snapshot to use as a backup, and restore the cgroup from the backup in case of problems. This can be quite handy in troubleshooting scenarios.

In the days before systemd, you would have had to manually create cgroups yourself. Now, with systemd, everything already runs in a cgroup, but without restrictions. If you need to set limits for a particular service, you can either hand-edit the service file, or use the systemctl utility to set limits from the command-line. For a quick demo, let’s look at the Apache webserver service that we installed on our Ubuntu 18.04 machine.

We can use a systemctl show command to view the status of any cgroup accounting or limits. Here, we’re looking at the status of memory accounting and memory limits for the Apache service:

$ sudo systemctl show -p MemoryLimit apache2.service
$ sudo systemctl show -p MemoryAccounting apache2.service

By default, cgroups don’t have any limits on the amount of memory that they’re allowed to use, and all accounting is turned off. Let’s now say that we want to turn accounting on, and to limit memory usage to only 500 Mbytes.

$ sudo systemctl set-property apache2.service MemoryAccounting=1
$ sudo systemctl set-property apache2.service MemoryLimit=500M  
$ sudo systemctl show -p MemoryAccounting apache2.service
$ sudo systemctl show -p MemoryLimit apache2.service

Running these commands causes an apache2.service.d directory to be created within the /etc/systemd/system.control directory.

user@host:/etc/systemd/system.control$ ls -l
total 4
drwxr-xr-x 2 root root 4096 May 22 19:18 apache2.service.d

Within this new directory, we see a configuration file for each cgroup parameter that we’ve set:

user@host:/etc/systemd/system.control/apache2.service.d$ ls -l
total 12
-rw-r--r-- 1 root root 152 May 22 19:17 50-MemoryAccounting.conf
-rw-r--r-- 1 root root 153 May 22 19:18 50-MemoryLimit.conf

When we look at the contents of one of these files, we see that it’s just an extra parameter that effectively gets inserted into the active service file for the Apache service.

# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
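A note on units: systemd documents the K, M, G and T suffixes in memory settings as base-1024 multipliers, so the 500M we set is 500 × 1024 × 1024 bytes, and that byte count is what systemctl show should report. The arithmetic can be sketched in a few lines of Python 3; memory_limit_bytes is a hypothetical helper of ours, not part of systemd.

```python
# Base-1024 multipliers, as documented for systemd memory settings.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def memory_limit_bytes(value):
    """Convert a setting such as '500M' into a number of bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)  # no suffix: the value is already in bytes


if __name__ == "__main__":
    print(memory_limit_bytes("500M"))  # 524288000
```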

This is just one example of how to set cgroup limits on a systemd-type operating system. You would set CPU limits (CPUShares) and block device input/output limits (BlockIOWeight, BlockIODeviceWeight, BlockIOReadBandwidth, BlockIOWriteBandwidth) in pretty much the same way. Refer to official documentation to see which cgroups subsystems are available on your Linux distribution and version.
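To confirm which cgroup a process has ended up in, you can read /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path (see the cgroups man page). Here is a minimal Python 3 sketch; parse_cgroup_line and cgroups_of are names we made up for illustration.

```python
def parse_cgroup_line(line):
    """Parse one 'hierarchy-ID:controller-list:cgroup-path' line."""
    hierarchy, controllers, path = line.rstrip("\n").split(":", 2)
    # The controller list is empty for the cgroup v2 unified hierarchy.
    return int(hierarchy), [c for c in controllers.split(",") if c], path


def cgroups_of(pid="self"):
    """Return the parsed cgroup memberships of a process."""
    with open("/proc/%s/cgroup" % pid) as handle:
        return [parse_cgroup_line(line) for line in handle if line.strip()]


if __name__ == "__main__":
    for hierarchy, controllers, path in cgroups_of():
        print(hierarchy, ",".join(controllers) or "-", path)
```

On a systemd machine, a process belonging to the Apache service would typically show a path ending in apache2.service.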

seccomp and System Calls

A system call, or syscall, is a mechanism used by regular programs to request a service from the operating system. System calls are necessary for actions on all system-controlled resources, such as opening or closing files, starting a new process in the system, allocating memory, talking to hardware devices, and so on. In total, there are approximately 330 different system calls built into the Linux kernel. We say “approximately”, because new system calls occasionally get added. To demonstrate, I’ve just done a simple ls -l command here on the Fedora Linux system that I’m using to write this. Running an strace command and filtering its output appropriately shows us a list of system calls that were involved in this simple ls -l operation.

[user@host ~]$ strace -c -f -S name ls -l 2>&1 1>/dev/null | tail -n +3 | head -n -2 | awk '{print $(NF)}'

As you can see, just that one simple operation required quite a few syscalls. Something else to note is that this list comes from a specific machine. Different OS versions will have slight differences in the set of supported syscalls, or their internal identifiers.
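If the shell pipeline above looks cryptic: strace -c prints a summary table, tail and head drop the two header lines and the two footer lines, and awk keeps the last column, which holds the syscall name. The same filtering is sketched below in Python 3. The SAMPLE table is a shortened, made-up illustration of the summary layout, not output captured from a real run, and syscall_names is a helper name of our own.

```python
# Illustrative input only: a shortened, made-up 'strace -c' summary.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         3           close
  0.00    0.000000           0         2         1 openat
  0.00    0.000000           0         1           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                     6         1 total
"""


def syscall_names(summary):
    """Return the last column of each data row in a 'strace -c' table."""
    lines = summary.strip("\n").splitlines()
    rows = lines[2:-2]  # same rows as: tail -n +3 | head -n -2
    return [row.split()[-1] for row in rows if row.split()]


if __name__ == "__main__":
    print(syscall_names(SAMPLE))
```

A list gathered this way, over a realistic workload, is a reasonable starting point for a seccomp whitelist.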

seccomp, which stands for "Secure Computing", gives us a way to either disable certain system calls that we don’t want a process to use (blacklisting), or to just enable a certain set of system calls that we do want a process to use (whitelisting). In addition, seccomp allows you to filter syscalls based on the arguments to the syscall. When a process isolated by seccomp attempts a blocked system call, it can either receive a "request failed" response, receive a signal (SIGSYS), or be killed, depending on how its filter is set up. For processes whose behavior is well-known in advance, this is a powerful tool. For example, if a process is only expected to read from a few files or sockets, and never to write anything, it's possible to block the write system call (which is used for files, sockets, pipes and terminal output alike). Should the process become compromised and attempt to write files, the call will fail or the process will be killed, depending on the filter.
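To get a feel for the "be killed" outcome without writing a filter, we can use seccomp's original strict mode, which permits only read, write, exit and sigreturn. Note that this is a different and much cruder mechanism than the filter mode described above. The following Python 3 sketch is Linux-only and makes some assumptions: it calls prctl through ctypes with the constants from the kernel headers, run_strict_child is a name of our own, and the "unavailable" result covers environments where strict mode cannot be enabled, for instance when a seccomp filter is already installed by a container runtime.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22       # from <linux/prctl.h>
SECCOMP_MODE_STRICT = 1   # from <linux/seccomp.h>


def run_strict_child():
    """Fork a child that enters strict mode, then makes another syscall."""
    libc = ctypes.CDLL(None, use_errno=True)  # load libc before forking
    pid = os.fork()
    if pid == 0:
        # Child: from here on only read, write, exit and sigreturn are legal.
        rc = libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_STRICT),
                        ctypes.c_ulong(0), ctypes.c_ulong(0),
                        ctypes.c_ulong(0))
        if rc != 0:
            os._exit(2)  # strict mode could not be enabled
        os.getpid()      # any syscall outside the allowed four is fatal
        os._exit(0)      # exit_group is not allowed either
    # Parent: find out how the child ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL:
        return "killed"
    if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 2:
        return "unavailable"
    return "survived"


if __name__ == "__main__":
    print(run_strict_child())
```

On a regular Linux host we would expect this to print "killed": the kernel terminates the child with SIGKILL as soon as it attempts a system call outside the permitted set.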

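To give an example of how seccomp can be used, let's take a real-world vulnerability, CVE-2016-9793. In this case, the kernel mishandled negative buffer-size values passed to the setsockopt system call with the SO_SNDBUFFORCE and SO_RCVBUFFORCE options, which could result in attackers crashing or taking over the system. A seccomp filter can inspect only the numeric arguments of a system call and cannot follow the pointer to the option value, so it cannot check for the negative value itself. However, setting up seccomp to deny setsockopt calls that name the SO_SNDBUFFORCE or SO_RCVBUFFORCE option, for processes that don't need them, would prevent the exploitation of this vulnerability.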

To deploy seccomp correctly, you need to thoroughly profile the target process in advance, putting it through every kind of behavior you expect it to engage in. Otherwise seccomp will block rarely-used functionality. For example, if a service normally operates without writing files, but will sometimes (and only rarely) attempt to update its own executables, seccomp could block the update, and the service could become stuck in an infinite loop of failed updates.

Another reason why seccomp is difficult to set up is that its system call filters are specific to the kernel architecture and version it targets. In addition, it incurs a performance penalty. It is one of those tools where you really have to know what you're doing in order to succeed. We will not examine a direct approach to seccomp; interested readers can refer to other sources. Instead, we will look into uses of seccomp in combination with jails and containers, in the following two articles.


We've reviewed several useful security features that Linux offers for process isolation and restriction. These can be very helpful when deploying restricted processes. However, each of these tools requires careful configuration, and can break the functionality of the target process if deployed incorrectly. Applying these protections in combination can be a painstaking and difficult task. In the next article, we will review a better method to apply software isolation, which uses the Linux "jails" - Firejail and Minijail. These software packages provide user-friendly wrappers to all the Linux features outlined above, with a greatly simplified configuration interface, making the deployment of software isolation a much easier task.
