Socketbox

Socketbox works by having a single "master" daemon accept connections for multiple IP/port combinations, then send the connected sockets to other server processes using the SCM_RIGHTS control message. Socketbox determines which server a socket should be sent to by calling getsockname() on the connection socket, then applying its internal rules to direct the connection to the right server.

Socketbox is a replacement for the classic "inetd" daemon. It operates similarly to inetd in that it accepts incoming connections and passes them on to a program, but instead of spawning a new program for each connection, the socket is sent to an existing daemon over a Unix domain socket using the SCM_RIGHTS control message; the exact routing is determined by a configuration file. For example, you can specify that sockets with a server IP address of 2001:db8::1 go to one program, and sockets with a server IP address of 2001:db8::2 go to another. Essentially, the socket demultiplexing normally done in kernel space is performed in user space instead, in a way that can span multiple network namespaces.

https://gitlab.peterjin.org/_/socketbox
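
As a rough sketch of the mechanism described above (not socketbox's actual code), the C loop below accepts connections on a single wildcard socket, recovers the original destination with getsockname(), and hands the connected socket to a backend over an AF_UNIX socket with SCM_RIGHTS. The port number (8022) and the path /run/socketbox/demo.sock are made-up examples; the real daemon chooses the Unix socket path from its configured rules.

/*
 * Sketch only: single wildcard listener + getsockname() + SCM_RIGHTS
 * fd passing.  Not socketbox itself; port and path are placeholders.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Pass one file descriptor over a connected AF_UNIX socket. */
static int send_fd(int unix_fd, int fd_to_send)
{
    char dummy = 0;
    struct iovec iov = { &dummy, 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd_to_send, sizeof(int));

    return sendmsg(unix_fd, &msg, 0) < 0 ? -1 : 0;
}

int main(void)
{
    int lfd = socket(AF_INET6, SOCK_STREAM, 0);
    struct sockaddr_in6 any = { .sin6_family = AF_INET6,
                                .sin6_port = htons(8022) };
    bind(lfd, (struct sockaddr *)&any, sizeof(any));
    listen(lfd, 16);

    for (;;) {
        int conn = accept(lfd, NULL, NULL);
        if (conn < 0)
            continue;

        /* The local address of the accepted socket is the address the
         * client actually connected to; socketbox matches its rules on it. */
        struct sockaddr_in6 dst;
        socklen_t dlen = sizeof(dst);
        char txt[INET6_ADDRSTRLEN];
        getsockname(conn, (struct sockaddr *)&dst, &dlen);
        inet_ntop(AF_INET6, &dst.sin6_addr, txt, sizeof(txt));
        fprintf(stderr, "connection for [%s]:%u\n", txt,
                (unsigned)ntohs(dst.sin6_port));

        /* Hand the fd to a backend daemon; a real rule engine would pick
         * the Unix socket path based on dst instead of hard-coding it. */
        int ufd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un sun = { .sun_family = AF_UNIX };
        strncpy(sun.sun_path, "/run/socketbox/demo.sock",
                sizeof(sun.sun_path) - 1);
        if (connect(ufd, (struct sockaddr *)&sun, sizeof(sun)) == 0)
            send_fd(ufd, conn);
        close(ufd);
        close(conn); /* drop our reference; the backend keeps its own */
    }
}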

Remember from the Notes about namespaces page that even though a process can only be in one network namespace, it can still hold sockets or other file descriptors obtained from other namespaces via a Unix domain socket. This allows containers to have a different addressing scheme than the server IP address endpoints. For example, on a network with both global and unique local IPv6 addresses, the containers could be addressed with global addresses, whereas the server sockets would be restricted to the unique local addresses.
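
On the receiving side, a daemon in the container's namespace just has to pull the descriptor out of the control message. A minimal sketch, with error handling trimmed and the function name made up:

/*
 * Sketch of the receiving end: pull the TCP socket out of the
 * SCM_RIGHTS control message arriving on an AF_UNIX socket.
 */
#include <string.h>
#include <sys/socket.h>

/* Receive one file descriptor from a connected AF_UNIX socket;
 * returns -1 if the message carried no descriptor. */
int recv_fd(int unix_fd)
{
    char dummy;
    struct iovec iov = { &dummy, 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(unix_fd, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    /* fd is a fully connected TCP socket that was accepted in another
     * network namespace; read(), write() and getpeername() work as usual. */
    return fd;
}

A descriptor received this way behaves just like one returned by a local accept(), which is what makes the cross-namespace handoff transparent to the server.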

History

Socketbox started out because on one of my machines, I was running "throwaway boxes" in some Docker containers. Each container had a globally unique IPv6 address, an SSH server on that address, and a persistent home directory, and they were called "throwaway" boxes because I could easily delete and recreate them if something went wrong.

The problem with this was that my ISP's IPv6 prefix delegation is dynamic, and if the prefix were to change, then I would have to delete and recreate all the containers again, as well as update all the containers' IP addresses.

ULAs were added to mitigate this problem, but at that time I did not know of a way to have more than one IPv6 prefix in a Docker network. But I still had the ULA static route set up, so what could I do? Well, I could use the AnyIP trick to make all of those addresses locally bound.

And so it was. Initially, I modified the SSH server to run in inetd mode, then set up an inetd-like daemon that listened on a Unix domain socket instead of a TCP socket. Then, with some experimentation, I discovered that if the "AnyIP" trick is used with a wildcard socket, the server will accept connections to any IP address within that subnet, and the specific address out of that entire subnet can still be retrieved by calling getsockname() on the socket returned by accept() as part of a normal TCP server flow. Using this method, along with the "geo" module in nginx to match the server IP address, I set up a very interesting reverse proxy for the SSH servers.

But there were some limitations. First off, the client IP address was never recorded in the SSH logs. While this is not too important on a local network, it does, of course, have implications for the public Internet. Nginx also had to be used as a middleman, so performance was not that great.

So what could I do? Initially, I had a program that created a socket bound to a ULA in the host network namespace, and then forwarded any sockets returned by accept() to another program in a different network namespace using the SCM_RIGHTS control message. However, there were still some limitations:

  • Sockets had to be bound to individual addresses rather than the wildcard address. This meant that I would need to run N programs and open N sockets if I were to have N ssh servers. Obviously, this wouldn't scale.
  • The server program had to have an "inetd mode" or similar. Due to the added complexity, most server programs no longer provide an inetd mode.

Socketbox extends this concept and removes the above limitations. By moving the socket dispatch routine into user space, a single "universal" socket can accept connections for multiple destination IP/port combinations and forward them throughout the system by applying rules to the destination IP address. In addition, the "socketbox-preload" utility adapts virtually any TCP server program to work seamlessly with socketbox's dispatch routine, so the "inetd mode" restriction no longer exists for the most part.

Since then, the issue with the Docker network prefixes has gone away, because I've moved on to custom container software, which offers much more flexibility in network configuration. However, the ideas of AnyIP (which now includes IPv6 Things) and inter-netns sockets are still compelling, so socketbox still has its uses.

Use with ip6tables TPROXY

Socketbox supports ip6tables TPROXY as a means of allowing a single socket to listen on multiple addresses, thus eliminating the need to use poll() on multiple sockets.

As I understand it, TPROXY does the following: for the purposes of socket demultiplexing (finding a listening socket), TPROXY overrides which address a daemon has to listen on to accept the connection[1]. For example, an incoming TCP SYN packet with destination IP/port [2001:db8::1]:80 would normally require a listening socket on [::]:80 or [2001:db8::1]:80. If this SYN packet matches a TPROXY rule with --on-ip ::1 --on-port 80, the kernel looks for a listening socket on [::1]:80 instead. Multiple TCP/IP flows can thus be reduced to one listening address, allowing a single socket to accept connections for any set of IPs or port numbers determined by iptables rules.

Dual-stack TPROXY

ip route add local 2001:db8::/64 dev lo
ip route add local 192.0.2.0/24 dev lo
ip6tables[-nft] -t mangle -A PREROUTING -d 2001:db8::/64 -p tcp -m multiport --dports 22,80,443 -j TPROXY --on-ip ::ffff:127.100.100.1 --on-port 1
iptables[-nft] -t mangle -A PREROUTING -d 192.0.2.0/24 -p tcp -m multiport --dports 22,80,443 -j TPROXY --on-ip 127.100.100.1 --on-port 1

Socketbox must be configured to bind to [::ffff:127.100.100.1]:1 with IPV6_TRANSPARENT set in order for this to work. You must use the IPv4-mapped IPv6 notation on an IPv6 socket; a plain IPv4 socket, even with IP_TRANSPARENT set, will not work.

This works because the ::ffff:127.100.100.1 address can be interpreted from the perspective of the kernel as both an IPv6 and an IPv4 address.
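
A minimal sketch of such a bind is shown below, assuming the process has CAP_NET_ADMIN (required for transparent binding); the function name and backlog are arbitrary:

/*
 * Sketch: create the transparent listener that the TPROXY rules above
 * redirect to.  Requires CAP_NET_ADMIN.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPV6_TRANSPARENT
#define IPV6_TRANSPARENT 75   /* value from <linux/in6.h>, for older libc headers */
#endif

int make_tproxy_listener(void)
{
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    int one = 1;

    /* IPV6_TRANSPARENT allows binding to (and accepting for) addresses
     * that are not configured locally, which is what TPROXY relies on. */
    setsockopt(fd, IPPROTO_IPV6, IPV6_TRANSPARENT, &one, sizeof(one));

    struct sockaddr_in6 sa = { .sin6_family = AF_INET6,
                               .sin6_port = htons(1) };
    /* The IPv4-mapped form of 127.100.100.1, matching --on-ip in both
     * the ip6tables and iptables rules above. */
    inet_pton(AF_INET6, "::ffff:127.100.100.1", &sa.sin6_addr);

    bind(fd, (struct sockaddr *)&sa, sizeof(sa));
    listen(fd, 16);
    return fd;
}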

Permissions for bound Unix domain sockets

Socketbox requires access to a directory of sockets created by the end daemons. To ensure the security of that directory, the author recommends the following procedure:

Let's say that the directory is /run/socketbox, and there are two containers, one with UID/GID map 100000-100999 and another with 101000-101999. The socketbox daemon runs with its own dedicated user and group ID, plus a supplementary group (socketbox-access).

We want to create two subdirectories in that root directory, one for each container: /run/socketbox/00001 and /run/socketbox/00002. Make the /run/socketbox directory mode 2755 with owner/group as root:socketbox-access. Make each of the 00001 and 00002 directories mode 2750 with owner/group root:socketbox-access, but also set an ACL such that the 00001 directory has read-write-execute privileges for UID 100000, and 00002 has read-write-execute for UID 101000.

mkdir -pm 2755 /run/socketbox
chown root:socketbox-access /run/socketbox

# The presence of the set-group-ID bit on /run/socketbox means that these
# directories will also be set-group-ID with a group of socketbox-access
mkdir -pm 2750 /run/socketbox/00001 /run/socketbox/00002

setfacl -m u:100000:rwx /run/socketbox/00001
setfacl -m u:101000:rwx /run/socketbox/00002

This setup is the safest, because:

  • The 00001 and 00002 directories are restricted to three parties: root, UID 100000 (the container's root user, with rwx privileges), and the socketbox-access group (with r-x privileges). This allows sockets to be created by the container's root user without restriction. Because the directory is set-group-ID, any socket created in it will have a group of socketbox-access; this works even though that group ID is outside the container's GID map. The end daemon chmods its socket to 0660, so that the socket is also accessible by socketbox (which has socketbox-access as a supplementary group).
  • From the perspective of the containers, the sockets have a group of "nogroup" (i.e. outside the GID map), so they effectively behave as if they had a mode of 0600. Similarly, the 00001 and 00002 directories appear as mode 0700 (read-write-execute only for the container's root user).
  • The socketbox-access group has no write access to the directories; this means that socketbox is free to access the sockets, but cannot create new ones or delete existing ones.
  • All directories are owned by the host's root, so changing permissions from the containers isn't possible.

Essentially, the set-group-ID bit on the directory is used here as a means for a process in a container to specify access for a GID that is not otherwise mapped in a container's user namespace, without forcing it into "other".
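
For completeness, a sketch of what an end daemon inside a container might do with this directory is shown below; only the bind and chmod steps described above are shown, and the socket name (sshd_socket) and backlog are made-up examples:

/*
 * Sketch of the end-daemon side inside a container: bind the AF_UNIX
 * listener in the per-container directory, then chmod the socket to
 * 0660 as described above.
 */
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

int make_backend_socket(void)
{
    const char *path = "/run/socketbox/00001/sshd_socket";
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    struct sockaddr_un sun = { .sun_family = AF_UNIX };
    strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);
    unlink(path);                 /* remove a stale socket, if any */
    bind(fd, (struct sockaddr *)&sun, sizeof(sun));

    /* The setgid directory gives the socket the socketbox-access group;
     * 0660 then lets socketbox connect while keeping "other" locked out. */
    chmod(path, 0660);

    listen(fd, 16);
    return fd;
}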

How we use socketbox

A service (like my Matrix homeserver) in an isolated network namespace can be made to safely accept connections directly from the LAN, by using socketbox in the main network namespace (or another namespace that's directly connected to the LAN) to send connection sockets over to the server in the other network namespace.

As a means of preventing leaks, the server network that is part of our home LAN is in a completely isolated network namespace[2] which is not connected in any way to the rest of the LAN. However, traffic from the LAN to those servers would then theoretically need to take the full path through the Internet, which is very inefficient. Socketbox can be used to safely "bridge" these two namespaces such that they stay isolated in terms of routing policy, but the servers in the other network namespace can still be accessed from the main network namespace without any tunneling overhead. The server IP address is different, so we use split-horizon DNS to ensure that this configuration remains fully transparent to clients on the LAN.

The config used for this looks like this:

max-maps:100
max-rules:100
rule-set:0 size:100
ip:fd01:db8:1:40::/96 port:443 jump-map:2
ip:fd01:db8:1:40::/96 port:22 jump-map:3
ip:fd01:db8:1:40:0:64::/96 jump-map:1

map-set:1 size:100 match-lport:1
port:80 jump-unix:/nat64/nat64.sock
port:443 jump-unix:/nat64/nat64.sock
default:fail

map-set:2 size:100 match-lip:128
ip:fd01:db8:1:40::1:10 jump-unix:/vm7/sb-00443
ip:fd01:db8:1:40::1:11 jump-unix:/vm7/apache-00443
default:fail

map-set:3 size:100 match-lip:128
ip:fd01:db8:1:40::1:10 jump-unix:/vm7/sshd_socket
default:fail

The effect of all this is pretty interesting, since it gives the illusion that each address corresponds to its own host:

$ ssh root@fd01:db8:1:40::1:10
-bash-5.0# exit
logout
Connection to fd01:db8:1:40::1:10 closed.
$ ssh root@fd01:db8:1:40::1:11
kex_exchange_identification: Connection closed by remote host
$ ssh root@fd01:db8:1:40::1:f
kex_exchange_identification: Connection closed by remote host
$ ssh root@fd01:db8:1:40::1:10
-bash-5.0# who
root     pts/0        2020-12-18 22:19 (fd01:db8:1:21:8d64:3353:1f4c:55e)
-bash-5.0# exit
logout
Connection to fd01:db8:1:40::1:10 closed.
$ ssh root@fd01:db8:1:40::1:0
kex_exchange_identification: Connection closed by remote host
$ wget https://[fd01:db8:1:40::1:f]
--2020-12-18 18:23:37--  https://[fd01:db8:1:40::1:f]/
Connecting to [fd01:db8:1:40::1:f]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:10]
--2020-12-18 18:23:38--  https://[fd01:db8:1:40::1:10]/
Connecting to [fd01:db8:1:40::1:10]:443... connected.
    ERROR: certificate common name ‘www2.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:10’.
To connect to fd01:db8:1:40::1:10 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1:11]
--2020-12-18 18:23:39--  https://[fd01:db8:1:40::1:11]/
Connecting to [fd01:db8:1:40::1:11]:443... connected.
    ERROR: certificate common name ‘apps-vm7-www.srv.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:11’.
To connect to fd01:db8:1:40::1:11 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1:12]
--2020-12-18 18:23:42--  https://[fd01:db8:1:40::1:12]/
Connecting to [fd01:db8:1:40::1:12]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:1f]
--2020-12-18 18:23:44--  https://[fd01:db8:1:40::1:1f]/
Connecting to [fd01:db8:1:40::1:1f]:443... connected.
Unable to establish SSL connection.
$ wget https://[fd01:db8:1:40::1:11]
--2020-12-18 18:23:47--  https://[fd01:db8:1:40::1:11]/
Connecting to [fd01:db8:1:40::1:11]:443... connected.
    ERROR: certificate common name ‘apps-vm7-www.srv.peterjin.org’ doesn't match requested host name ‘fd01:db8:1:40::1:11’.
To connect to fd01:db8:1:40::1:11 insecurely, use `--no-check-certificate'.
$ wget https://[fd01:db8:1:40::1234:5678]
--2020-12-18 18:23:53--  https://[fd01:db8:1:40::1234:5678]/
Connecting to [fd01:db8:1:40::1234:5678]:443... connected.
Unable to establish SSL connection.

See also