Securing a C++ Websocket Server with Libseccomp for Fun and Profit

Cybersecurity is about attack surface reduction and defense in depth. Or, in other words: Give an attacker the least possible amount of room to work in.

This is also called the principle of least privilege. For example, a service must only run under a dedicated user account that has just enough privileges to run the service itself. And nothing beyond that. If the service is compromised, the attacker is constrained to the privileges of the user. But often, these privileges are more than enough to do real damage. This is where seccomp comes to save the day.

Seccomp - Dropping privileges with precision

With seccomp, a process can instruct the Linux kernel to kill the process if it uses a system call that was not explicitly whitelisted before. If an attacker injects code, for example by writing to the stack through a vulnerability like a buffer overflow, the attacker can therefore only use whitelisted system calls. This can severely hinder further exploitation.

The best time to install a seccomp filter in the lifetime of a process is after its initialization, and before any user input is accepted. In a networked service, this would be right before we accept connections from the outside.

Using strace to generate a list of system calls

I have a websocket server running on hextws.thomastrapp.com. Its functionality is fairly simple: Accept incoming connections, accept requests, parse their content, chase some pointers and respond with JSON. The service is free software and available on GitHub.

We can run the service under the diagnostics tool strace to list all system calls that our service uses. We only care about the system calls issued after the process is initialized and waiting for connections.

# log all system calls of hextws to the file strace-output
strace --follow-forks -o strace-output ./hextws [...]
# wait for the service to fully initialize
# in background, empty the log file
truncate -s 0 strace-output
# now run the test suite against the service in parallel
( for i in $(seq 1230 1550) ; do \
    WS_WEBSOCAT_LOCAL_PORT=$i WS_WEBSOCAT_FLAGS=[...] \
      ./blackbox.sh \
        wss://[...] \
        case/*hext &
  done ; ) >/dev/null
# strace-output now contains the system calls that we
# need to explicitly allow
mv strace-output whitelist-system-calls
# list system calls by usage count
cat whitelist-system-calls \
  | sed 's/^[0-9]\+[ ]\+//' \
  | grep -v '^<' \
  | grep -Eo '^[^(]+' \
  | sort \
  | uniq -c \
  | sort -h
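
The counting step of this pipeline can also be sketched in C++, which is handy if you want to post-process the log in a test harness. This is a minimal sketch, not part of hextws: the function name CountSyscalls is made up for illustration, and it assumes the line format that strace writes with --follow-forks and -o (a PID column followed by the call).

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Count system call names in strace output, one call per line.
// Mirrors the sed/grep pipeline: drop the leading PID column, skip
// "<... name resumed>" continuation lines, keep the text before '('.
std::map<std::string, int> CountSyscalls(const std::string& strace_log)
{
  std::map<std::string, int> counts;
  std::istringstream lines(strace_log);
  std::string line;
  while( std::getline(lines, line) )
  {
    std::size_t pos = 0;
    // strip "<pid><spaces>"
    while( pos < line.size() && std::isdigit(static_cast<unsigned char>(line[pos])) )
      ++pos;
    while( pos < line.size() && line[pos] == ' ' )
      ++pos;
    // resumed calls were already counted at their "<unfinished ...>" line
    if( pos < line.size() && line[pos] == '<' )
      continue;
    const std::size_t paren = line.find('(', pos);
    if( paren == std::string::npos || paren == pos )
      continue; // not a system call line, e.g. "+++ exited with 0 +++"
    const std::string name = line.substr(pos, paren - pos);
    // only accept identifier-like names
    bool is_identifier = true;
    for( unsigned char c : name )
      if( !std::isalnum(c) && c != '_' )
        is_identifier = false;
    if( is_identifier )
      ++counts[name];
  }
  return counts;
}
```

Unlike the grep variant, this sketch drops lines that contain no opening parenthesis at all, such as exit notices.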

For example, we might arrive at the following list of system calls that are used while the service is running.

# usage count -> system call
    1 openat
    1 read
    3 mmap
    3 munmap
   11 madvise
   26 brk
   41 ioctl
   42 close
   56 accept
   97 timerfd_settime
  225 getpid
  259 mprotect
  593 epoll_ctl
 3043 epoll_wait
 3895 sendmsg
 7740 recvmsg
10815 futex

The openat system call immediately caught my eye. Digging deeper:

openat(AT_FDCWD, "/proc/sys/vm/overcommit_memory", O_RDONLY|O_CLOEXEC) = 33

Another interesting one is ioctl:

ioctl(7, FIONBIO, [1])
ioctl(8, FIONBIO, [1])
ioctl(9, FIONBIO, [1])
ioctl(10, FIONBIO, [1])
...

We now have a general idea what the process is doing at the system call level, when handling connections and requests.

Hardening my C++ websocket server with seccomp

When the service starts up, it sets up the SSL context, binds to a port and spawns some threads, each with its own main loop in which incoming connections are accepted and handled (see ws/main.cpp).

In the code below, we can see that after a thread is spawned, the seccomp filters are installed with ws::SetupSeccomp(). Because the thread is already running at this point, we can also disallow system calls related to thread management.

std::vector<std::thread> threads;
for(auto i = 0; i < num_threads; ++i)
  // start threads
  threads.emplace_back([&ioc]{
    try
    {
      // setup seccomp rules for this thread
      ws::SetupSeccomp();
    }
    catch( const std::runtime_error& e )
    {
      std::cerr << e.what() << "\n";
      std::abort();
    }
    // main loop that accepts connections
    ioc.run();
  });

Using libseccomp

Let’s take a look at ws::SetupSeccomp in detail (see ws/SetupSeccomp.cpp for the full code).

Seccomp itself is part of the kernel API and cumbersome to use. Libseccomp, on the other hand, is an abstraction of this API and simplifies installing seccomp filters tremendously.
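
To give an impression of what "cumbersome" means, here is a rough sketch of a whitelist filter written directly against the kernel interface, without libseccomp. It is not taken from hextws. The function names are made up, the architecture check is hardcoded to x86_64, and it ignores details such as the x32 ABI, so treat it as an illustration only.

```cpp
#include <linux/audit.h>    // AUDIT_ARCH_X86_64
#include <linux/filter.h>   // sock_filter, sock_fprog, BPF_STMT, BPF_JUMP
#include <linux/seccomp.h>  // seccomp_data, SECCOMP_RET_*, SECCOMP_MODE_FILTER
#include <sys/prctl.h>      // prctl, PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP
#include <sys/syscall.h>    // SYS_* system call numbers
#include <cstddef>          // offsetof
#include <vector>

// Build a classic BPF program: kill the process unless the system call
// number is in `allowed`. Assumes an x86_64 target.
std::vector<sock_filter> BuildAllowListFilter(const std::vector<int>& allowed)
{
  std::vector<sock_filter> prog;
  // kill on an unexpected architecture
  prog.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                          static_cast<__u32>(offsetof(seccomp_data, arch))));
  prog.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0));
  prog.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));
  // load the system call number
  prog.push_back(BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                          static_cast<__u32>(offsetof(seccomp_data, nr))));
  for( int nr : allowed )
  {
    // on a match fall through to ALLOW, otherwise skip it
    prog.push_back(BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                            static_cast<__u32>(nr), 0, 1));
    prog.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW));
  }
  // default action for everything else
  prog.push_back(BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));
  return prog;
}

// Install the program for the calling thread. Returns false on failure.
bool InstallFilter(std::vector<sock_filter>& prog)
{
  sock_fprog fprog{};
  fprog.len = static_cast<unsigned short>(prog.size());
  fprog.filter = prog.data();
  if( prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 )
    return false;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) == 0;
}
```

Even this stripped-down version needs hand-counted jump offsets, and argument checks like the ioctl rule further down would add more of them. Libseccomp generates such programs for us.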

Employing the scope guard pattern, we create an automatically deleted handle to the libseccomp filter, which we initialize with a default action of SCMP_ACT_KILL_PROCESS, which kills the process if the calling thread initiates any unwanted system call.

// scope guard for seccomp_{init,release}
using SeccompGuard = std::unique_ptr<void, decltype(&seccomp_release)>;
SeccompGuard ctx(seccomp_init(SCMP_ACT_KILL_PROCESS), seccomp_release);

if( !ctx )
  throw SetupSeccompError("seccomp_init failed");
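
The same scope guard idea can be tried out without libseccomp. In the sketch below, DemoInit and DemoRelease are hypothetical stand-ins for seccomp_init and seccomp_release that only count releases, so we can observe that the handle is released even when a later setup step throws.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins for seccomp_init()/seccomp_release().
static int demo_release_count = 0;
static int demo_handle_storage = 0;

void* DemoInit(bool succeed)
{
  return succeed ? &demo_handle_storage : nullptr;
}

void DemoRelease(void*)
{
  ++demo_release_count;
}

// scope guard for Demo{Init,Release}
using DemoGuard = std::unique_ptr<void, decltype(&DemoRelease)>;

// Acquire a handle, then fail halfway through the setup.
void SetupThatThrows()
{
  DemoGuard ctx(DemoInit(true), DemoRelease);
  if( !ctx )
    throw std::runtime_error("DemoInit failed");

  // ctx is released during stack unwinding
  throw std::runtime_error("a later setup step failed");
}
```

Note that std::unique_ptr does not call the deleter for a null handle, so a failed init needs no special handling.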

Now we define a whitelist of system calls that this thread is explicitly allowed to use. WS_SYS_PAIR is a helper macro that produces an initializer list, with the libseccomp system call specification as its first member, and the string representation as its second member. The string representation is only used in error reporting.

std::pair<int, const char *> whitelist[] = {
  WS_SYS_PAIR(accept),            // accept a connection on a socket
  WS_SYS_PAIR(brk),               // memory management
  WS_SYS_PAIR(close),             // close a file descriptor/connection
  WS_SYS_PAIR(epoll_ctl),         // epoll management
  WS_SYS_PAIR(epoll_wait),        // epoll blocking wait
  WS_SYS_PAIR(futex),             // locking mechanism
  WS_SYS_PAIR(getpid),            // process management
  WS_SYS_PAIR(madvise),           // memory management
  WS_SYS_PAIR(mmap),              // memory management
  WS_SYS_PAIR(mprotect),          // memory management
  WS_SYS_PAIR(munmap),            // memory management
  WS_SYS_PAIR(read),              // read from a file descriptor
  WS_SYS_PAIR(recvmsg),           // receive message from a socket
  WS_SYS_PAIR(rseq),              // per-core thread management
  WS_SYS_PAIR(sendmsg),           // send a message on a socket
  WS_SYS_PAIR(timerfd_settime),   // connection timeout management
};
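
The definition of WS_SYS_PAIR is not shown here. A macro of this kind could look roughly like the following sketch, where DEMO_SYS_PAIR is a made-up name and SYS_##name from sys/syscall.h stands in for libseccomp's SCMP_SYS(name), so that the example compiles without libseccomp.

```cpp
#include <string>
#include <sys/syscall.h>  // SYS_read, SYS_close, SYS_brk
#include <utility>

// Hypothetical helper: expands to { <system call number>, "<name>" }.
// With libseccomp, SCMP_SYS(name) would presumably take the place of SYS_##name.
#define DEMO_SYS_PAIR(name) { SYS_##name, #name }

const std::pair<int, const char*> demo_whitelist[] = {
  DEMO_SYS_PAIR(read),   // read from a file descriptor
  DEMO_SYS_PAIR(close),  // close a file descriptor
  DEMO_SYS_PAIR(brk),    // memory management
};

// Look up the name for a system call number, e.g. for error messages.
std::string DemoSyscallName(int number)
{
  for( const auto& [sys_value, sys_name] : demo_whitelist )
    if( sys_value == number )
      return sys_name;
  return "unknown";
}
```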

Next, we add these system calls to the seccomp filter with SCMP_ACT_ALLOW, which allows the thread to issue these system calls unconditionally.

for( const auto& [sys_value, sys_name] : whitelist )
  if( int rc = seccomp_rule_add(ctx.get(), SCMP_ACT_ALLOW, sys_value, 0) )
    throw SetupSeccompError(RuleAddErrorString(sys_name, sys_value, rc));
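
RuleAddErrorString is a helper from the full source and not shown here. As a rough idea of what such a helper might do, the sketch below formats a message from the name, the number and the return code. The name DemoRuleAddErrorString is made up, and it assumes that rc is a negative errno value, which is how libseccomp reports failures.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical error formatter; `rc` is assumed to be a negative errno value.
std::string DemoRuleAddErrorString(const char* sys_name, int sys_value, int rc)
{
  return std::string("seccomp_rule_add failed for ") + sys_name
    + " (" + std::to_string(sys_value) + "): " + std::strerror(-rc);
}
```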

As we saw earlier, the service uses openat just once. While freeing heap memory for the first time, glibc opens the file /proc/sys/vm/overcommit_memory to check whether memory overcommit is disabled and adjusts its behavior accordingly (see sysdeps/unix/sysv/linux/malloc-sysdep.h in glibc’s sources for details). Luckily for us, if glibc fails to open this file, glibc will behave as if overcommit_memory is enabled, which is the default on most systems.

Therefore, we add a rule for openat that does not kill the process, but always lets the call fail with the error code EACCES.

if( int rc = seccomp_rule_add(ctx.get(), SCMP_ACT_ERRNO(EACCES), SCMP_SYS(openat), 0) )
  throw SetupSeccompError(
      ErrorString("seccomp_rule_add SCMP_ACT_ERRNO(EACCES) failed for openat", rc));
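
To see why a failing openat is harmless for this particular check, here is a simplified sketch of the logic described above. It is not glibc's actual code, and the function name is made up: the check only reports "overcommit disabled" if the file can be opened and starts with '2' (strict accounting), so any open failure lands in the default branch.

```cpp
#include <fcntl.h>   // open, O_RDONLY, O_CLOEXEC
#include <unistd.h>  // read, close

// Simplified sketch: returns true only if the file can be opened and its
// first byte is '2', i.e. memory overcommit is disabled.
bool DemoOvercommitDisabled(const char* path)
{
  const int fd = open(path, O_RDONLY | O_CLOEXEC);
  if( fd < 0 )
    return false; // e.g. EACCES from the seccomp rule: assume overcommit is on

  char value = 0;
  const ssize_t n = read(fd, &value, 1);
  close(fd);
  return n > 0 && value == '2';
}
```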

Similarly, we want to allow the ioctl system call, but only with the request code FIONBIO, which is used to enable non-blocking I/O in Boost.Asio.

// Allow ioctl only with request FIONBIO, i.e. `ioctl(any, FIONBIO, any)`.
if( int rc = seccomp_rule_add(ctx.get(),
                              SCMP_ACT_ALLOW,
                              SCMP_SYS(ioctl),
                              1,
                              SCMP_A1_64(SCMP_CMP_EQ,
                                         static_cast<scmp_datum_t>(FIONBIO))) )
  throw SetupSeccompError(ErrorString("seccomp_rule_add allow failed for ioctl with FIONBIO", rc));
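
For reference, this is what a call that passes the rule looks like from the application side. The sketch below is not Boost.Asio's code and the function name is made up. It switches one end of a socket pair to non-blocking mode with FIONBIO and then checks the file status flags.

```cpp
#include <fcntl.h>       // fcntl, F_GETFL, O_NONBLOCK
#include <sys/ioctl.h>   // ioctl, FIONBIO
#include <sys/socket.h>  // socketpair
#include <unistd.h>      // close

// Enable non-blocking I/O via ioctl(fd, FIONBIO, &on) and report whether
// O_NONBLOCK is set afterwards. Returns false if any call fails.
bool DemoFionbioSetsNonBlocking()
{
  int fds[2];
  if( socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0 )
    return false;

  int on = 1;
  // FIONBIO is the second argument (index 1), which the rule compares
  const bool ioctl_ok = ioctl(fds[0], FIONBIO, &on) == 0;
  const int flags = fcntl(fds[0], F_GETFL);

  close(fds[0]);
  close(fds[1]);
  return ioctl_ok && flags != -1 && (flags & O_NONBLOCK) != 0;
}
```

Any other request code in that position, for example one that queries terminal settings, would not match the rule and would therefore hit the filter's default action.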

Finally, we pass the filter to the kernel with seccomp_load:

if( int rc = seccomp_load(ctx.get()) )
  throw SetupSeccompError(ErrorString("seccomp_load failed", rc));

And that’s it. The calling thread is now protected with seccomp. Attackers are constrained to the very narrow list of system calls we have defined here. For example, typical shellcode will fail outright, leaving the attacker having to overcome seccomp first.

Watching the audit log

Seccomp violations lead to the termination of the service. Additionally, the incident is logged with auditd.

We can use systemd’s journalctl to list all seccomp violations:

journalctl _AUDIT_TYPE_NAME=SECCOMP

For example, I use this command in cron to send an hourly log of possible incidents by mail, if any violations occurred:

journalctl --no-pager --all --quiet --since="1 hour ago" _AUDIT_TYPE_NAME=SECCOMP | mail -E -s "seccomp violation" ...

Caveat: Just a downgrade of code execution to denial-of-service

Seccomp doesn’t magically plug the holes in our software. But it does downgrade the severity of a code execution bug to a denial-of-service attack, which is a huge win. On the flip side, if we accidentally omitted a system call that is used in one of the intended code paths of our service, we have just made it easier to cause disruption.

Shortly after deploying the service, I had to whitelist three more system calls whose absence brought down the service after a couple of hours of runtime. At the time of writing, however, the service has been stable for weeks.

Caveat: Finicky and prone to break

Our service has lots of dependencies. Suppose one of these dependencies receives a security update that introduces a new use of a benign system call. Our service might break immediately. More succinctly: The set of system calls a library uses is not covered by semantic versioning.

In a similar manner, code and deployment are now intrinsically linked. Different environments might trigger different code paths in our service or its dependencies, which may use a system call that we have not thought of when hardcoding the whitelist. But such is life!

Last updated on 2024-07-18 ⚬ Published on 2024-06-27