Notes about namespaces
Namespaces on Linux are pretty cool. They let you do certain things that previously required virtual machines, and also things that virtual machines cannot do at all, especially with shared resources.
Note: A lot of the tricks described here are untested. Use at your own risk. If you know whether something works or not, please contact me.
General / Miscellaneous
- There are eight types of namespaces as of Linux 5.6 -- cgroup, IPC, mount, network, PID, time, user, and UTS (hostname).
- Creating a namespace requires the CAP_SYS_ADMIN capability (i.e. root access), with the exception of user namespaces, which can be created by an unprivileged (non-root) user.
- Namespaces can be created using the CLONE_NEW* flags of the clone() syscall or the unshare() syscall.
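- A user namespace grants a kind of root access for creating other namespaces. The process that creates a user namespace holds a full set of capabilities inside it, so it can go on to create the other namespace types; however, those capabilities only apply to resources owned by that user namespace and to its own processes.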
- Although a process can only be in one network or mount namespace, it can still hold directory file descriptors, pipes, sockets, or other file descriptors received from other namespaces, whether by keeping them alive through a namespace transition or being passed down via a UNIX domain socket. This allows for fine-grained resource sharing that is unobtainable from virtual machines.
- Unix domain sockets identified by a pathname can be accessed by any process with filesystem and mode access to that socket, regardless of its user, mount, or network namespaces. For example, if there is a MySQL instance listening on the mysqld.sock file in the filesystem, and a different process on the system creates new user and network namespaces, then it can still access the MySQL socket as long as it has access to that socket.
- To use clone() in a way that acts like fork() but with additional CLONE_* flags, use syscall(SYS_clone, [flags] | SIGCHLD, 0, 0, 0, 0). Including SIGCHLD as the exit signal lets the parent reap the child with an ordinary wait() or waitpid(), as it would after fork().
- Since containerized processes run like any other normal process on the host system from the perspective of the kernel (i.e. they are visible in top), they do not suffer from the same slowdown effects often found in virtual machines.
- To some extent, this is almost like thinking with portals. You have to figure out what syscalls to make to bring the process into a container. We call this "think with namespaces."
- To fully understand namespaces in Linux you not only have to be familiar with the underlying concepts like user/group IDs, capabilities, mount points, network devices and bridging, and process IDs, but you also have to fully understand their behavior.
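As a rough illustration of the eight namespace types and their CLONE_NEW* flags, here is a minimal Python sketch. The flag values are copied from <linux/sched.h>; the helper names (`unshare_flags`, `parse_ns_link`, `namespaces_of`) are made up for this example and are not part of any library.

```python
import os

# Flag values as defined in <linux/sched.h>; keys follow the names in /proc/<pid>/ns.
CLONE_FLAGS = {
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "mnt": 0x00020000,     # CLONE_NEWNS
    "net": 0x40000000,     # CLONE_NEWNET
    "pid": 0x20000000,     # CLONE_NEWPID
    "time": 0x00000080,    # CLONE_NEWTIME
    "user": 0x10000000,    # CLONE_NEWUSER
    "uts": 0x04000000,     # CLONE_NEWUTS
}


def unshare_flags(*kinds):
    """OR together the CLONE_NEW* flags for the given namespace types."""
    flags = 0
    for kind in kinds:
        flags |= CLONE_FLAGS[kind]
    return flags


def parse_ns_link(target):
    """Split a /proc/<pid>/ns symlink target such as 'net:[4026531992]'."""
    kind, _, rest = target.partition(":[")
    return kind, int(rest.rstrip("]"))


def namespaces_of(pid="self"):
    """Map each entry of /proc/<pid>/ns to the inode number of its namespace."""
    base = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(base):
        _, inode = parse_ns_link(os.readlink(os.path.join(base, name)))
        result[name] = inode
    return result
```

The resulting mask can be handed to unshare(), for example `os.unshare(unshare_flags("user", "net"))` on Python 3.12 or later. Two processes share a namespace exactly when `namespaces_of()` reports the same inode for that entry.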
- If a process that creates a user namespace needs privileged access to a file on the host, but doesn't want any access to it after the process has finished initializing, it can perform the following steps:
1. Give the privileged file a group ID that lies outside the user namespace's GID map.
2. Before creating the user namespace, call setgroups() to set the supplementary group IDs to include the GID of the privileged file, along with any other group IDs that the process may need access to.
3. Call clone(CLONE_NEWUSER) to create a new user namespace.
4. Write the UID and GID maps as usual. Note that you might need to use clone() and not unshare() to create the new user namespace, since you need CAP_SETGID in the original user namespace to write the gid_map file. Note also that if you write "deny" to the setgroups file, you will not be able to perform step 6.
5. Perform any operations that require the privileged file. If you open the file, you can keep operating on it even after step 6, as long as the file descriptor is kept open.
6. Call setgroups(0, NULL) to drop group memberships. Since the GID of the file is outside the user namespace's GID map, it cannot be named in any later setgroups() or setgid() call to regain access.
- A process that enters a new user namespace via clone() or unshare() with CLONE_NEWUSER loses all capabilities in the original user namespace. For this reason, it can't simply write an arbitrary UID and GID map for itself afterwards, because it no longer has the necessary capabilities (CAP_SETUID, CAP_SETGID) in the parent user namespace. (The one exception is a single-line map of its own effective UID or GID, which the kernel accepts without those capabilities; for gid_map, "deny" must be written to the setgroups file first.) There are two ways around this:
1. Call clone(CLONE_NEWUSER) and have the child process wait until uid_map is set before it runs the containerization routines. While the child waits, the parent opens the child's UID and GID map files and writes the mappings. Note that unless the disposition of SIGCHLD is set to SIG_IGN, the child's /proc/PID directory continues to exist even if the child terminates for any reason, so there is no risk of PID reuse as long as wait(), waitid(), or waitpid() has not been called (which you shouldn't do before the maps are written).
2. Call clone(CLONE_NEWUSER), then call pause() in the child. Meanwhile, have the parent write the UID and GID maps in the child's /proc/PID directory. Next, open the child's /proc/PID/ns/user file in the parent, then use kill() to terminate the child. Finally, call setns() on the open /proc/PID/ns/user file descriptor to enter the new user namespace.
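The first approach can be sketched in Python roughly as follows. This is an untested-in-anger sketch under a few assumptions: it uses fork() followed by unshare(CLONE_NEWUSER) in the child instead of a raw clone(), it synchronizes with two pipes, and it only maps the caller's own effective UID and GID to 0. The function names are hypothetical.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>


def format_id_map(entries):
    """Render (inside_id, outside_id, count) tuples in uid_map/gid_map syntax."""
    return "".join(f"{inside} {outside} {count}\n" for inside, outside, count in entries)


def write_id_maps(pid, uid_entries, gid_entries):
    """Write the maps of process `pid` from its parent user namespace."""
    # An unprivileged writer must deny setgroups() before touching gid_map.
    with open(f"/proc/{pid}/setgroups", "w") as f:
        f.write("deny")
    with open(f"/proc/{pid}/uid_map", "w") as f:
        f.write(format_id_map(uid_entries))
    with open(f"/proc/{pid}/gid_map", "w") as f:
        f.write(format_id_map(gid_entries))


def run_in_new_userns(child_main):
    """Run child_main() as UID/GID 0 of a fresh user namespace; return its exit code."""
    libc = ctypes.CDLL(None, use_errno=True)
    ready_r, ready_w = os.pipe()    # child -> parent: namespace created
    mapped_r, mapped_w = os.pipe()  # parent -> child: maps written
    uid, gid = os.geteuid(), os.getegid()
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            os.close(ready_r)
            os.close(mapped_w)
            if libc.unshare(CLONE_NEWUSER) == 0:
                os.write(ready_w, b"x")
                # Block until the parent reports that the maps are in place.
                if os.read(mapped_r, 1) == b"x":
                    code = int(child_main() or 0)
        finally:
            os._exit(code)
    os.close(ready_w)
    os.close(mapped_r)
    try:
        if os.read(ready_r, 1) == b"x":
            write_id_maps(pid, [(0, uid, 1)], [(0, gid, 1)])
            os.write(mapped_w, b"x")
    finally:
        os.close(ready_r)
        os.close(mapped_w)  # EOF releases the child if anything went wrong
    # Reap the child only after the maps have been written (or the attempt failed).
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exit_code(status)
```

For example, `run_in_new_userns(os.getuid)` should return 0 on a system that permits unprivileged user namespaces, since the child then sees itself as UID 0. A wider map (such as a subordinate ID range) would need CAP_SETUID and CAP_SETGID in the parent namespace, and in that case the "deny" write can be dropped.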
- (Untested) If a process creates or joins a new user namespace and then calls execve(), then it will no longer have capabilities unless the process has UID 0 in that namespace. To fix that without writing the uid_map, prior to calling execve(), use capget() and capset() to copy all capabilities from the permitted set into the inheritable set, then raise all ambient capabilities in the permitted set by calling prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, [n], 0, 0) for each bit position that is 1 in the permitted set that you originally retrieved with capget(). Privileges can later be dropped by clearing both the inheritable and permitted sets. If you use this technique and want to ensure that the process does not gain capabilities later because uid_map was written and the mapped UID of the process became 0, then you can set the SECBIT_NOROOT securebit and either lock it or remove CAP_SETPCAP from the bounding set. Note that any existing locked securebits will not interfere with this operation, since all securebits are cleared and unlocked when entering a new user namespace.
- To restrict the capability set of a root (effective UID=0 in its user namespace) process, you can:
1. Clear the capabilities from the bounding set. Note that this only takes effect once you call execve(). If you want the change to take effect immediately, also clear the capabilities from the permitted set. If you clear the permitted set without clearing the bounding set and then call execve(), the process simply regains those capabilities.
2. Alternatively, follow the steps for a non-root process below, but first set the SECBIT_NOROOT securebit and lock it, so that UID 0 is no longer special.
- For a non-root process that wants to have capabilities, you can perform the following steps:
1. Before switching to a non-root user, use the PR_SET_KEEPCAPS prctl() operation so that the permitted set is preserved across the UID change.
2. Change the process's UID.
3. Use capget() and capset() to add the desired capabilities to the inheritable set.
4. Use prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, ...) to add the same capabilities to the ambient set.
5. Call execve() on the new program.
6. (Untested) If the program you are running actually checks that the UID is 0 and refuses to run if it is not (even if the selected capabilities allow its otherwise root-only operations to succeed), then you can install a seccomp filter that makes every call to get[e]uid() return 0 via the SECCOMP_RET_ERRNO return value. It is sadly not possible to do the same with getresuid().
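The ambient-raising step can be sketched in Python as below. This is only a sketch: it assumes the inheritable set has already been populated (for instance with capset()), it reads the current sets from /proc/self/status instead of calling capget(), and the helper names are invented for this example. The prctl() constants are the values from <linux/prctl.h>.

```python
import ctypes

PR_CAP_AMBIENT = 47        # from <linux/prctl.h>
PR_CAP_AMBIENT_RAISE = 2   # from <linux/prctl.h>


def cap_bits(mask):
    """Return the capability numbers whose bits are set in a capability mask."""
    return [n for n in range(64) if (mask >> n) & 1]


def read_cap_sets(status_text):
    """Parse the Cap* lines of /proc/<pid>/status into integers."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"):
            sets[key] = int(value.strip(), 16)
    return sets


def raise_ambient(caps):
    """Try to raise each capability into the ambient set; return the ones that failed."""
    libc = ctypes.CDLL(None, use_errno=True)
    failed = []
    for cap in caps:
        rc = libc.prctl(PR_CAP_AMBIENT,
                        ctypes.c_ulong(PR_CAP_AMBIENT_RAISE),
                        ctypes.c_ulong(cap),
                        ctypes.c_ulong(0),
                        ctypes.c_ulong(0))
        if rc != 0:
            failed.append(cap)
    return failed


def raise_all_raisable():
    """Raise every capability that is both permitted and inheritable."""
    with open("/proc/self/status") as f:
        sets = read_cap_sets(f.read())
    # The kernel only allows a capability to become ambient if it is in both sets.
    return raise_ambient(cap_bits(sets["CapPrm"] & sets["CapInh"]))
```

Calling `raise_all_raisable()` just before execve() would correspond to step 4; an empty return list means every candidate capability was raised.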
- The process's UID and GID after a user namespace switch are completely unchanged, even if they are not mapped to any UID or GID in the target user namespace.
- On Linux, the supplementary group ID list operates completely independently of any other process GID (effective, real, saved, or filesystem GID). The init process starts out with an empty supplementary GID list. However, if root logs in or a command is executed with "sudo", then the supplementary GID list contains GID 0. Combined with the above points, this means that a process could have access to group IDs (such as the original effective group ID) outside its user namespace if its supplementary GID list is not fully cleared prior to or after entering the new user namespace.
- (Untested) On Linux, in addition to the real, effective, and saved-set UIDs, there is also the filesystem UID. It was originally used by the NFS server to access files as if it were running with a certain UID, before the rules of kill() regarding UIDs were changed, and it is now regarded as obsolete. However, there is one interesting corner case that would require the filesystem UID. Suppose that a process is in a certain parent user namespace and has the CAP_SETUID capability but not the CAP_SYS_ADMIN capability, and it wants to enter a descendant user namespace whose owner UID does not match the process's current effective UID, and that effective UID is not mapped in the target user namespace. To do so, it can change its effective UID to match the owner UID of the target user namespace and then switch to it using setns(). However, in doing so, it cannot return to its original effective UID, even if that UID matches its real or saved UID, since it is not mapped in the target user namespace. If the process changes its filesystem UID back to the original effective UID prior to calling setns(), then it can still assume that UID for accessing files in the filesystem (if its mount namespace has changed), even though it is not mapped in the target user namespace. Another way of doing this is to fork a new process with CLONE_FILES set, change its effective UID, call setns() on the target user and mount namespaces, open the root directory as a file descriptor, and then operate on that file descriptor in the original process. But this might not work if accessing files in the target mount namespace requires some of the capabilities in the target user namespace.
- (Untested) The specific case of having a process be in the initial user namespace and setting its network, IPC, etc. namespace from another user namespace is quite interesting. The author uses this technique to inspect the network configuration inside a Docker container without having to include the necessary commands in the container. This requires CAP_SYS_ADMIN in the initial user namespace. However, if after doing so, the process changes its effective UID to match the owner UID (which can be non-zero) of the user namespace, then it can assume all capabilities in the network or IPC namespace owned by the descendant user namespace, even though it is not running as UID 0. Of course, if it also needs capabilities from the initial user namespace, then it can set its ambient capabilities. Interestingly enough, in lieu of ambient capabilities, if it uses PR_SET_KEEPCAPS prior to changing effective UID, then it can have certain capabilities in the permitted set prior to calling execve(), but even though its capabilities in the initial user namespace are gone after calling execve(), it still has the capabilities in the descendant user namespace.
- (Untested) It might be useful to use SELinux to make otherwise world-writable directories inaccessible from containerized processes.
- Sometimes I like to think of the user namespace hierarchy like this:
- The initial user namespace has a tree height of 0. There is only one such namespace on the system.
- Every user namespace whose parent user namespace is the initial user namespace has a tree height of 1.
- Every user namespace whose parent user namespace has a tree height of 1, has a tree height of 2.
- Every user namespace whose parent user namespace has a tree height of n, has a tree height of n+1.
- Systems are designed so that user namespaces are not nested too deeply, keeping the maximum tree height small (the kernel itself caps nesting at 32 levels). The degree of each user namespace (its number of direct descendant user namespaces) is unlimited.
|Highest Privilege Level|
|Root in the initial user namespace|
|[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities]|
|Non-root in the initial user namespace|
|Root in a user namespace of height 1|
|[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities]|
|Non-root in a user namespace of height 1|
|Root in a user namespace of height 2|
|[drop capabilities; may be reversed by executing a setuid root program, or by file capabilities]|
|Non-root in a user namespace of height 2|
|Lowest Privilege Level|
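The height rules and the privilege ladder above can be modelled with a toy Python sketch. The namespace names and the `parent_of` dictionary are hypothetical, and the ranking is a simplification that follows the ladder only; real privilege also depends on which capabilities a process actually holds.

```python
def userns_height(ns, parent_of):
    """Tree height of `ns`; parent_of maps each namespace to its parent (None for the initial one)."""
    height = 0
    while parent_of[ns] is not None:
        ns = parent_of[ns]
        height += 1
    return height


def privilege_rank(ns, is_root, parent_of):
    """Sort key for the ladder: a smaller tuple means a higher privilege level."""
    return (userns_height(ns, parent_of), 0 if is_root else 1)


# Hypothetical hierarchy: "init" is the initial user namespace.
parents = {"init": None, "web": "init", "build": "web"}
ladder = sorted(
    [("build", False), ("web", True), ("init", False), ("init", True)],
    key=lambda entry: privilege_rank(entry[0], entry[1], parents),
)
```

Here `ladder` comes out ordered from root in the initial namespace down to non-root in the height-2 namespace, matching the table.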
There are several ways to block the creation of a new user namespace:
- Put the process in a chroot environment.
- Put the process in a descendant user namespace with UIDs and GIDs outside the namespace's UID and GID maps, and remove the CAP_SETUID and CAP_SETGID capabilities. This is only possible if the process started in the initial user namespace. In this mode, however, the process cannot send its credentials over a UNIX domain socket (though the peer may still be able to read them with SO_PEERCRED).
- Use seccomp() to block the unshare() call (not recommended).
- Don't use chroot(). Instead, use pivot_root(). You can create a new bind mount or tmpfs mount in the new mount namespace as your new root, then call umount2([oldroot], MNT_DETACH) (or run umount -l /oldroot) to detach the old root from the current mount namespace.
- If you create a new mount namespace, have a removable disk attached to the system at the time of mount namespace creation, and either 1. call mount --make-rprivate / (but not mount --make-rslave /) (unshare -m does this by default) or 2. the propagation type of the previous mount namespace is set to slave or private, then when the removable disk is later unmounted, it will still be mounted in the new mount namespace. This has actually caused filesystem corruption twice for the author.
- When calling setns() on a mount namespace, the process's root directory returns to the root of the mount namespace, even if the process is chroot'ed. (Untested) This means that if the /proc filesystem is mounted in the chroot(), then any process with the CAP_SYS_CHROOT and CAP_SYS_ADMIN capability in the process's mount namespace can escape from the chroot by calling setns() on the /proc/self/ns/mnt file. Note that an unprivileged process cannot do the same, since it is not allowed to create a new user namespace in a chroot().
- Not exactly namespace-related, but instead of using a PID file, the /proc/PID directory can be bind-mounted somewhere else. This has an advantage that if the process terminates and the PID is reused, then the /proc/PID bind mount will no longer have files, whereas simply keeping track of PIDs will only result in a false positive.
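Since the removable-disk problem above depends on each mount's propagation type, it can help to check what a mount namespace actually has. The sketch below parses one line of /proc/self/mountinfo, whose optional fields carry `shared:N`, `master:N`, or `unbindable` tags; the function names are invented for this example.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one line of /proc/<pid>/mountinfo."""
    fields = mountinfo_line.split()
    # Optional fields sit between the mount options (index 5) and the "-" separator.
    optional = fields[6:fields.index("-")]
    kinds = []
    for tag in optional:
        name = tag.partition(":")[0]
        if name == "shared":
            kinds.append("shared")
        elif name == "master":
            kinds.append("slave")
        elif name == "unbindable":
            kinds.append("unbindable")
    # A mount with none of these tags is private.
    return fields[4], ",".join(kinds) or "private"


def list_propagation(path="/proc/self/mountinfo"):
    """Map every mount point in the current mount namespace to its propagation type."""
    with open(path) as f:
        return dict(propagation_of(line) for line in f if line.strip())
```

A mount reported as "private" or "slave" will not forward its own unmount events to other namespaces, and a private mount does not receive them either, which is the situation in which the disk stays mounted elsewhere.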
- You can create virtual Ethernet devices, tun/tap devices (untested), and bring up the loopback interface without any privileges from the host.
- Moving a network device from one namespace to another via ip link set <dev> netns <ns> or similar requires CAP_NET_ADMIN in both the current network namespace and in the target network namespace, in contrast with nsenter/setns which requires CAP_SYS_ADMIN.
- If you want to intercept network connections without creating devices (sort of have a virtual TCP/IP stack), then you can run the commands
    ip route add local ::/0 dev lo
    ip route add local 0.0.0.0/0 dev lo
and then open up a wildcard-bound socket to intercept connections to any and all IP addresses, provided that they were made in that network namespace. You can use getsockname() to obtain the original destination IP address. This might be useful for transparent proxying.
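A minimal Python sketch of the listening side might look like this. It assumes the two local routes above are already installed in the network namespace, uses a dual-stack IPv6 socket so one listener covers both address families, and picks port 8080 arbitrarily; the function names are hypothetical.

```python
import ipaddress
import socket


def original_destination(conn):
    """Return the (ip, port) that the client originally dialled."""
    host, port = conn.getsockname()[:2]
    addr = ipaddress.ip_address(host.split("%")[0])
    # On a dual-stack socket, IPv4 destinations appear as ::ffff:a.b.c.d.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return str(addr), port


def serve_once(port=8080):
    """Accept one connection on the wildcard address and report where it was headed."""
    with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
        srv.bind(("::", port))
        srv.listen()
        conn, peer = srv.accept()
        with conn:
            return original_destination(conn), peer[:2]
```

With the routes in place, a client in the same namespace connecting to any address on that port should end up here, and `original_destination()` reports the address it believed it was reaching.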
- iptables-legacy may not work well in a network namespace owned by an unprivileged user namespace. Use iptables-nft instead.
- One particularly interesting use of network namespaces is that it can allow a computer to have multiple IP addresses on a wired network in a way that is transparent to the application, without the need to create separate bindings (this is mostly useful for IPv6, where SLAAC and privacy extensions may make it difficult to bind to a subset of every address that exists on the system). To do this, first create a bridge interface, and then add the original Ethernet interface to it. Configure the bridge interface's IP address like you normally would with the Ethernet interface. Next, create a virtual Ethernet device pair, and then add one of the two newly created interfaces onto the bridge and bring the new interface up. Create the new network namespace, and then move the other virtual network interface into that namespace. Set up the IP address of the virtual Ethernet device in the new network namespace, which should be different from the IP address in the original namespace. This setup is as if you connected another computer onto the same network that the physical network card is plugged into, but with the advantage that they can share the same system environment. If you also unshare the PID namespace, you can also run an init process in it, allowing you to spawn multiple processes in it. Busybox init works well here; though if you want to have multiple inittab files, you will need to create a new mount namespace too and bind mount a separate inittab file onto /etc/inittab in each mount namespace. The author actually uses this technique to run his web-based email interface and GitLab in separate containers, both of which have their own (private) IP address, all on one computer, and without any NAT, host firewall, proxy ARP, or routing configuration. 
One main disadvantage here is that since the new network namespaces are on the same layer 2 broadcast domain as the host, ARP spoofing attacks are possible if the containerized processes are untrusted, even if a new user namespace is created. This can be mitigated by removing the CAP_NET_RAW and CAP_NET_ADMIN capabilities from the bounding set prior to running the containerized application.
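The bridge and veth setup described above can be written down as an ordered list of `ip` invocations. The Python sketch below only builds that list; every interface name, the namespace name, and the address are placeholders, and it assumes a named namespace created with `ip netns add` (the author's own setup may create the namespace differently). Assigning the bridge's own address (DHCP, SLAAC, or static) is left out.

```python
import subprocess


def bridge_netns_plan(phys="eth0", bridge="br0", veth_host="veth0",
                      veth_ns="veth1", netns="web", address="192.0.2.10/24"):
    """Return the `ip` commands for the bridged-namespace setup, in order."""
    return [
        ["ip", "link", "add", bridge, "type", "bridge"],
        ["ip", "link", "set", phys, "master", bridge],
        ["ip", "link", "set", bridge, "up"],
        ["ip", "link", "add", veth_host, "type", "veth", "peer", "name", veth_ns],
        ["ip", "link", "set", veth_host, "master", bridge],
        ["ip", "link", "set", veth_host, "up"],
        ["ip", "netns", "add", netns],
        ["ip", "link", "set", veth_ns, "netns", netns],
        ["ip", "-n", netns, "addr", "add", address, "dev", veth_ns],
        ["ip", "-n", netns, "link", "set", veth_ns, "up"],
        ["ip", "-n", netns, "link", "set", "lo", "up"],
    ]


def apply_plan(plan, dry_run=True):
    """Print the commands, or run them (needs CAP_NET_ADMIN) when dry_run is False."""
    for argv in plan:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Running `apply_plan(bridge_netns_plan())` prints the commands for review; note that enslaving the physical interface to the bridge will interrupt connectivity until the bridge has its address configured.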
- Since the loopback interface is independent in each network namespace, moving applications to a different network namespace can prevent a certain class of security vulnerabilities which exploit the ability to make arbitrary connections and bindings in a common loopback interface.
- (Untested) If a process changes IPC namespace, then it only affects new calls to shmget(). Any existing shared memory segments mapped in the process's memory will continue to exist until it has been detached with shmdt(). However, it will not be possible to remove this shared memory segment unless IPC_RMID is used prior to changing IPC namespace.
- IPC namespaces operate only on System V IPC objects (shared memory segments, semaphores, and message queues, i.e. shmget/semget/msgget) and POSIX message queues. They do not operate on POSIX semaphores or shared memory obtained with sem_open or shm_open. To isolate those, use a mount namespace to overmount a tmpfs instance on /dev/shm.
- PID namespaces are not useful to keep alive because once the last process terminates, it can no longer be joined with setns().
- (Untested) By using the set_tid option of clone3(), you can have "reserved"/"fixed" PIDs in a PID namespace, thus eliminating PID files. This is accomplished by the following steps.
1. Create a new PID namespace and, if using unshare(), fork a new process to join that namespace.
2. Mount the /proc filesystem.
3. Write the number 301 to /proc/sys/kernel/ns_last_pid.
4. To create new processes with "reserved" PIDs, use clone3() and set the first element of the set_tid array to the desired "reserved" PID, which can range from 2 to 299 inclusive. This requires CAP_SYS_ADMIN (or, since Linux 5.9, CAP_CHECKPOINT_RESTORE) in the user namespace that owns the PID namespace.
This may be useful because it can eliminate PID files since the daemon uses a known fixed PID. To terminate a daemon with a fixed PID, simply kill the fixed PID. No need to find it anymore in /proc. And if the daemon is already running, then step 4 will fail since the PID is already in use.
The magic number 300 appears here because, once PIDs reach the maximum (32768 or 4194304), allocation wraps around to 300 rather than to 1, so PIDs 2-299 are never assigned as a result of PID reuse. It is not known whether the kernel will assign PIDs below 300 to new processes once every PID from 300 upward is in use; if you're paranoid, make sure that the number in /proc/sys/kernel/pid_max is at least 350 + the value in /proc/sys/kernel/threads-max, or use the PID cgroup controller.
This works best in a descendant PID namespace, since 2-299 in the initial PID namespace are already taken by kernel threads.
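The numeric conditions above are easy to check up front. The following Python sketch encodes the 2-299 range and the "pid_max is at least 350 + threads-max" rule of thumb from the text; the helper names are made up, and it does not perform the clone3() call itself.

```python
RESERVED_PIDS = 300  # wrap-around floor described above


def is_reservable(pid):
    """True if `pid` lies in the range that is never handed out after wrap-around."""
    return 2 <= pid < RESERVED_PIDS


def pid_max_is_safe(pid_max, threads_max, margin=350):
    """Apply the paranoid rule: pid_max must be at least margin + threads-max."""
    return pid_max >= margin + threads_max


def read_int(path):
    """Read a single integer from a /proc/sys file."""
    with open(path) as f:
        return int(f.read().strip())


def check_system():
    """Evaluate the rule against the running kernel's current settings."""
    return pid_max_is_safe(read_int("/proc/sys/kernel/pid_max"),
                           read_int("/proc/sys/kernel/threads-max"))
```

A daemon launcher could call `is_reservable()` on its configured PID and `check_system()` once at startup before attempting to claim the PID.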
- PID 1 is always a stable reference in any PID namespace. In this case, the PID namespace file descriptor can sometimes serve as a PID file descriptor, and entering the PID namespace followed by referencing PID 1 will always refer to that process.
(TODO: code snippets to perform the routines above)