Securing a C++ Websocket Server with Libseccomp for Fun and Profit
Cybersecurity is about attack surface reduction and defense in depth. Or, in other words: Give an attacker the least possible amount of room to work in.
This is also called the principle of least privilege. For example, a service must only run under a dedicated user account that has just enough privileges to run the service itself. And nothing beyond that. If the service is compromised, the attacker is constrained to the privileges of the user. But often, these privileges are more than enough to do real damage. This is where seccomp comes to save the day.
Seccomp - Dropping privileges with precision
With seccomp, a process can instruct the Linux kernel to kill the process if it uses a system call that was not explicitly whitelisted before. If an attacker injects code, for example by writing to the stack through a vulnerability like a buffer overflow, the attacker can therefore only use whitelisted system calls. This can severely hinder further exploitation.
The best time to install a seccomp filter in the lifetime of a process is after its initialization, and before any user input is accepted. In a networked service, this would be right before we accept connections from the outside.
Using strace to generate a list of system calls
I have a websocket server running on hextws.thomastrapp.com. Its functionality is fairly simple: Accept incoming connections, accept requests, parse their content, chase some pointers and respond with JSON. The service is free software and available on GitHub.
We can run the service under the diagnostics tool strace to list all system calls that our service uses. We only care about the system calls issued after the process is initialized and waiting for connections.
# log all system calls of hextws to the file strace-output
strace --follow-forks -o strace-output ./hextws [...]
# wait for the service to fully initialize,
# then, from a second shell, empty the log file
truncate -s 0 strace-output
# now run the test suite against the service in parallel
( for i in $(seq 1230 1550) ; do \
    WS_WEBSOCAT_LOCAL_PORT=$i WS_WEBSOCAT_FLAGS=[...] \
      ./blackbox.sh \
        wss://[...] \
        case/*hext &
  done ; ) >/dev/null
# strace-output now contains the system calls that we
# need to explicitly allow
mv strace-output whitelist-system-calls
# list system calls by usage count
cat whitelist-system-calls \
| sed 's/^[0-9]\+[ ]\+//' \
| grep -v '^<' \
| grep -Eo '^[^(]+' \
| sort \
| uniq -c \
| sort -h
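The shell pipeline above is all that is needed for a one-off tally. If the same count has to run inside a test harness, it can also be sketched in C++. This is a minimal sketch and not part of hextws: the names `SyscallName` and `CountSyscalls` are made up for illustration, and it assumes the `PID name(args) = result` line layout that strace writes when started with `--follow-forks -o`.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Extract the system call name from one line of the strace log.
// Returns an empty string for lines that do not start a system call,
// such as "<... futex resumed>" continuations or exit notices.
std::string SyscallName(const std::string& line)
{
  std::size_t pos = 0;
  // skip the leading PID column: digits followed by spaces
  while( pos < line.size() && std::isdigit(static_cast<unsigned char>(line[pos])) )
    ++pos;
  while( pos < line.size() && line[pos] == ' ' )
    ++pos;

  // the name runs up to the first opening parenthesis
  const std::size_t paren = line.find('(', pos);
  if( paren == std::string::npos || paren == pos )
    return std::string();

  const std::string name = line.substr(pos, paren - pos);
  for( char c : name )
    if( !std::isalnum(static_cast<unsigned char>(c)) && c != '_' )
      return std::string();
  return name;
}

// Count how often each system call appears in the log.
std::map<std::string, int> CountSyscalls(std::istream& log)
{
  std::map<std::string, int> counts;
  for( std::string line; std::getline(log, line); )
  {
    const std::string name = SyscallName(line);
    if( !name.empty() )
      ++counts[name];
  }
  return counts;
}
```

Passing a `std::ifstream` opened on the log file to `CountSyscalls` should give roughly the same counts as the `sort | uniq -c` pipeline, keyed by system call name.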
For example, we might arrive at the following list of system calls that are used while the service is running.
# usage count -> system call
1 openat
1 read
3 mmap
3 munmap
11 madvise
26 brk
41 ioctl
42 close
56 accept
97 timerfd_settime
225 getpid
259 mprotect
593 epoll_ctl
3043 epoll_wait
3895 sendmsg
7740 recvmsg
10815 futex
The openat system call immediately caught my eye. Digging deeper:
openat(AT_FDCWD, "/proc/sys/vm/overcommit_memory", O_RDONLY|O_CLOEXEC) = 33
Another interesting one is ioctl:
ioctl(7, FIONBIO, [1])
ioctl(8, FIONBIO, [1])
ioctl(9, FIONBIO, [1])
ioctl(10, FIONBIO, [1])
...
We now have a general idea of what the process is doing at the system call level when handling connections and requests.
Hardening my C++ websocket server with seccomp
When the service starts up, it sets up the SSL context, binds to a port and spawns several threads. Each thread has its own main loop in which incoming connections are accepted and handled (see ws/main.cpp).
In the code below, we can see that each thread installs the seccomp filters with ws::SetupSeccomp() right after it is spawned. Because the thread is already running at this point, we can also disallow syscalls related to thread management.
std::vector<std::thread> threads;
for(auto i = 0; i < num_threads; ++i)
  // start threads
  threads.emplace_back([&ioc]{
    try
    {
      // setup seccomp rules for this thread
      ws::SetupSeccomp();
    }
    catch( const std::runtime_error& e )
    {
      std::cerr << e.what() << "\n";
      std::abort();
    }
    // main loop that accepts connections
    ioc.run();
  });
Using libseccomp
Let’s take a look at ws::SetupSeccomp in detail (see ws/SetupSeccomp.cpp for the full code).
Seccomp itself is part of the kernel API and cumbersome to use. Libseccomp, on the other hand, is an abstraction of this API and simplifies installing seccomp filters tremendously.
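To give an impression of what libseccomp abstracts away, here is a rough sketch of a hand-written allow-list filter using only the kernel interface (`prctl` and classic BPF). It is not taken from hextws. The helper names `Stmt`, `Jump`, `BuildAllowListFilter` and `InstallFilter` are made up for this illustration, and the sketch only matches on the system call number, not on arguments.

```cpp
#include <cassert>
#include <cstddef>          // offsetof
#include <cstdint>
#include <linux/audit.h>    // AUDIT_ARCH_* constants
#include <linux/filter.h>   // sock_filter, sock_fprog, BPF_* opcodes
#include <linux/seccomp.h>  // seccomp_data, SECCOMP_RET_*, SECCOMP_MODE_FILTER
#include <sys/prctl.h>      // prctl, PR_SET_NO_NEW_PRIVS, PR_SET_SECCOMP
#include <sys/syscall.h>    // SYS_* system call numbers
#include <vector>

// Hypothetical helpers mirroring the kernel's BPF_STMT and BPF_JUMP macros.
static sock_filter Stmt(std::uint16_t code, std::uint32_t k)
{
  return sock_filter{code, 0, 0, k};
}

static sock_filter Jump(std::uint16_t code, std::uint32_t k,
                        std::uint8_t jt, std::uint8_t jf)
{
  return sock_filter{code, jt, jf, k};
}

// Build a BPF program that allows only the listed system calls on the
// given architecture and kills the process for everything else.
std::vector<sock_filter> BuildAllowListFilter(std::uint32_t audit_arch,
                                              const std::vector<long>& allowed)
{
  const std::uint32_t arch_offset = offsetof(seccomp_data, arch);
  const std::uint32_t nr_offset = offsetof(seccomp_data, nr);
  std::vector<sock_filter> prog;

  // kill on system calls made through a different ABI
  prog.push_back(Stmt(BPF_LD | BPF_W | BPF_ABS, arch_offset));
  prog.push_back(Jump(BPF_JMP | BPF_JEQ | BPF_K, audit_arch, 1, 0));
  prog.push_back(Stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));

  // load the system call number and compare it with each allowed entry
  prog.push_back(Stmt(BPF_LD | BPF_W | BPF_ABS, nr_offset));
  for( long nr : allowed )
  {
    // on a match fall through to ALLOW, otherwise skip one instruction
    prog.push_back(Jump(BPF_JMP | BPF_JEQ | BPF_K,
                        static_cast<std::uint32_t>(nr), 0, 1));
    prog.push_back(Stmt(BPF_RET | BPF_K, SECCOMP_RET_ALLOW));
  }

  // default action for everything that did not match
  prog.push_back(Stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS));
  return prog;
}

// Install the program for the calling thread. Returns true on success.
bool InstallFilter(std::vector<sock_filter>& prog)
{
  sock_fprog fprog{};
  fprog.len = static_cast<unsigned short>(prog.size());
  fprog.filter = prog.data();

  // unprivileged processes must set this before installing a filter
  if( prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 )
    return false;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog) == 0;
}
```

On x86-64, a caller would pass `AUDIT_ARCH_X86_64` and a list such as `{SYS_read, SYS_write, SYS_exit_group}`. Argument checks like the ioctl rule later in this article would need further hand-written jump offsets, which is the kind of bookkeeping libseccomp takes over.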
Employing the scope guard pattern, we create an automatically released handle to the libseccomp filter. We initialize it with the default action SCMP_ACT_KILL_PROCESS, which kills the whole process if the calling thread issues any system call that is not explicitly allowed.
// scope guard for seccomp_{init,release}
using SeccompGuard = std::unique_ptr<void, decltype(&seccomp_release)>;
SeccompGuard ctx(seccomp_init(SCMP_ACT_KILL_PROCESS), seccomp_release);
if( !ctx )
  throw SetupSeccompError("seccomp_init failed");
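The scope guard matters because the setup code reports failures by throwing. The following sketch shows the pattern in isolation, so it builds without libseccomp. `FakeInit` and `FakeRelease` are hypothetical stand-ins for `seccomp_init` and `seccomp_release`, and the counter exists only to make the cleanup observable.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Number of handles that were created but not yet released.
static int g_live_handles = 0;

// Hypothetical stand-ins for seccomp_init and seccomp_release.
void* FakeInit()
{
  ++g_live_handles;
  return &g_live_handles;
}

void FakeRelease(void*)
{
  --g_live_handles;
}

// Same shape as the SeccompGuard alias above.
using Guard = std::unique_ptr<void, decltype(&FakeRelease)>;

// Simulates a setup step that fails after the handle was created.
// Returns true if the handle was released during stack unwinding.
bool HandleIsReleasedOnThrow()
{
  try
  {
    Guard ctx(FakeInit(), FakeRelease);
    throw std::runtime_error("simulated seccomp_rule_add failure");
  }
  catch( const std::runtime_error& )
  {
    return g_live_handles == 0;
  }
}
```

Because the deleter runs when `ctx` goes out of scope, no error path has to call the release function by hand.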
Now we define a whitelist of system calls that this thread is explicitly allowed to use.
WS_SYS_PAIR is a helper macro that produces an initializer list, with the libseccomp system call specification as its first member, and the string representation as its second member.
The string representation is only used in error reporting.
std::pair<int, const char *> whitelist[] = {
  WS_SYS_PAIR(accept),          // accept a connection on a socket
  WS_SYS_PAIR(brk),             // memory management
  WS_SYS_PAIR(close),           // close a file descriptor/connection
  WS_SYS_PAIR(epoll_ctl),       // epoll management
  WS_SYS_PAIR(epoll_wait),      // epoll blocking wait
  WS_SYS_PAIR(futex),           // locking mechanism
  WS_SYS_PAIR(getpid),          // process management
  WS_SYS_PAIR(madvise),         // memory management
  WS_SYS_PAIR(mmap),            // memory management
  WS_SYS_PAIR(mprotect),        // memory management
  WS_SYS_PAIR(munmap),          // memory management
  WS_SYS_PAIR(read),            // read from a file descriptor
  WS_SYS_PAIR(recvmsg),         // receive message from a socket
  WS_SYS_PAIR(rseq),            // per-core thread management
  WS_SYS_PAIR(sendmsg),         // send a message on a socket
  WS_SYS_PAIR(timerfd_settime), // connection timeout management
};
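The definition of WS_SYS_PAIR is not shown here. A macro of that kind could look roughly like the sketch below. `SKETCH_SYS_PAIR`, `kSketchWhitelist` and `FindSyscallName` are hypothetical names, and the sketch uses the `SYS_*` constants from `<sys/syscall.h>` instead of libseccomp's `SCMP_SYS()` so that it builds without the library.

```cpp
#include <cassert>
#include <cstring>
#include <sys/syscall.h>  // SYS_* system call numbers (Linux)
#include <utility>

// Hypothetical stand-in for WS_SYS_PAIR: expands to a braced initializer
// with the system call number first and its name as a string second.
// The real macro presumably uses SCMP_SYS(name) for the first member.
#define SKETCH_SYS_PAIR(name) { SYS_##name, #name }

// A shortened whitelist in the same shape as the one above.
const std::pair<int, const char*> kSketchWhitelist[] = {
  SKETCH_SYS_PAIR(read),   // read from a file descriptor
  SKETCH_SYS_PAIR(close),  // close a file descriptor
};

// Look up the name for a system call number, e.g. for an error message.
// Returns nullptr if the number is not on the list.
const char* FindSyscallName(long number)
{
  for( const auto& entry : kSketchWhitelist )
    if( entry.first == number )
      return entry.second;
  return nullptr;
}
```

The stringizing operator `#name` keeps the number and its printable name in sync, so a typo cannot make an error message name the wrong system call.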
Next, we add these system calls to the seccomp filter with SCMP_ACT_ALLOW, which allows the thread to issue these system calls unconditionally.
for( const auto& [sys_value, sys_name] : whitelist )
  if( int rc = seccomp_rule_add(ctx.get(), SCMP_ACT_ALLOW, sys_value, 0) )
    throw SetupSeccompError(RuleAddErrorString(sys_name, sys_value, rc));
As we saw earlier, the service uses openat just once.
While freeing heap memory for the first time, glibc opens the file /proc/sys/vm/overcommit_memory to check whether memory overcommit is disabled and adjusts its behavior accordingly (see sysdeps/unix/sysv/linux/malloc-sysdep.h in glibc’s sources for details). Luckily for us, if glibc fails to open this file, it behaves as if memory overcommit is enabled, which is the default on most systems.
Therefore, we add a rule for openat that always lets it fail with the error code EACCES instead of killing the process.
if( int rc = seccomp_rule_add(ctx.get(), SCMP_ACT_ERRNO(EACCES), SCMP_SYS(openat), 0) )
  throw SetupSeccompError(
    ErrorString("seccomp_rule_add SCMP_ACT_ERRNO(EACCES) failed for openat", rc));
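To see why a failing openat is harmless here, it helps to look at the shape of the check. The following is a simplified sketch modeled loosely on glibc's logic, not a copy of it. The function name `OvercommitDisabled` is made up, and the path is a parameter only to make the fallback easy to try out.

```cpp
#include <cassert>
#include <fcntl.h>   // open, O_RDONLY, O_CLOEXEC
#include <unistd.h>  // read, close

// Returns true only if the file can be read and its first byte is '2',
// the value that stands for strict accounting (overcommit disabled).
// Any failure to open or read the file is treated as "overcommit enabled".
bool OvercommitDisabled(const char* path = "/proc/sys/vm/overcommit_memory")
{
  const int fd = open(path, O_RDONLY | O_CLOEXEC);
  if( fd < 0 )
    return false;  // e.g. EACCES injected by the seccomp rule above

  char val = 0;
  const ssize_t n = read(fd, &val, 1);
  close(fd);
  return n > 0 && val == '2';
}
```

With the seccomp rule in place, the open fails, the function takes the early return, and the caller proceeds as if the system used the default overcommit policy.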
Similarly, we want to allow the ioctl system call, but only with the request FIONBIO, which is used to enable non-blocking I/O in Boost.Asio.
// Allow ioctl only with request FIONBIO, i.e. `ioctl(any, FIONBIO, any)`.
if( int rc = seccomp_rule_add(ctx.get(),
                              SCMP_ACT_ALLOW,
                              SCMP_SYS(ioctl),
                              1,
                              SCMP_A1_64(SCMP_CMP_EQ,
                                         static_cast<scmp_datum_t>(FIONBIO))) )
  throw SetupSeccompError(ErrorString("seccomp_rule_add allow failed for ioctl with FIONBIO", rc));
And in the end, pass the filter to the kernel with seccomp_load:
if( int rc = seccomp_load(ctx.get()) )
  throw SetupSeccompError(ErrorString("seccomp_load failed", rc));
And that’s it. The calling thread is now protected with seccomp. Attackers are constrained to the very narrow list of system calls we have defined here. For example, typical shellcode will fail outright, leaving the attacker having to overcome seccomp first.
Watching the audit log
Seccomp violations lead to the termination of the service. Additionally, the incident is logged with auditd.
We can use systemd’s journalctl to list all seccomp violations:
journalctl _AUDIT_TYPE_NAME=SECCOMP
For example, I use this command in cron to send an hourly log of possible incidents by mail, if any violations occurred:
journalctl --no-pager --all --quiet --since="1 hour" _AUDIT_TYPE_NAME=SECCOMP | mail -E -s "seccomp violation" ...
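When a violation does show up, the first question is which system call was missing from the whitelist. The sketch below pulls the number out of a record. It assumes that each record carries a `syscall=<number>` field, which is how the kernel usually formats SECCOMP audit records. The function name `AuditSyscallNumber` is made up for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Extract the numeric value of the "syscall=" field from one audit record.
// Returns -1 if the field is missing or not followed by a number.
long AuditSyscallNumber(const std::string& record)
{
  const std::string key = " syscall=";
  const std::size_t pos = record.find(key);
  if( pos == std::string::npos )
    return -1;

  const char* begin = record.c_str() + pos + key.size();
  char* end = nullptr;
  const long value = std::strtol(begin, &end, 10);
  return end == begin ? -1 : value;
}
```

The number is specific to the architecture named in the same record. To turn it back into a name, libseccomp's `seccomp_syscall_resolve_num_arch` should be able to do the translation.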
Caveat: Just a downgrade of code execution to denial-of-service
Seccomp doesn’t magically plug the holes in our software. But it does downgrade the severity of a code execution bug to a denial-of-service attack, which is a huge win. On the flip side, if we accidentally omitted a system call that is used in one of the intended code paths of our service, we have just made it easier to cause disruption.
Shortly after deploying the service, I had to whitelist three more system calls whose absence from the whitelist had brought down the service after a couple of hours of runtime. At the time of writing, however, the service has been stable for weeks.
Caveat: Finicky and prone to break
Our service has lots of dependencies. Suppose one of these dependencies receives a security update that introduces a new use of a benign system call. Our service might break immediately. More succinctly: semantic versioning makes no promises about the set of system calls a library uses.
In a similar manner, code and deployment are now intrinsically linked. Different environments might trigger different code paths in our service or its dependencies, which may use a system call that we have not thought of when hardcoding the whitelist. But such is life!