Tags: huffman zlib bruteforce 

Rating: 5.0

# Discovery
## Interacting with the judge server
Connecting with Netcat, we are welcomed by a prompt asking us a seed and a string. It then returns the name of few compression libraries alongside a number.
We can then provide a new seed and string and the program loops on that.
```
Please send: seed:string
I'll then show you the compression benchmark results!
Note: Flag has format DrgnS{[A-Z]+}
1:A // This is our input
none 26
zlib 34
bzip2 63
lzma 84
```

? We learn that the flag only contains uppercase characters inside the brackets. ?

## Peeking at server.py
### What is happening
We are provided a server.py file, a copy of it is available at the end of this WU, under Attachments.
The interesting part is under `handle_connection`:
```python
seed, string = line.split(b':', 1)

flag = bytearray(FLAG)
random.seed(int(seed))
random.shuffle(flag)
test_string = string + bytes(flag)

response = []
for cfunc in COMPRESSION_FUNCS:
res = cfunc(test_string)
response.append(f"{cfunc.__name__:>8} {res:>4}")
```

Every time we provide a `seed` and a `string`, it takes the `FLAG`, shuffles it based on our `seed` and prepend the shuffled flag by our `string`.
For example, considering `FLAG = DrgnS{AAAAAAAAAAAAA}`, `seed = 1`, `string = a`, then `test_string = aA{A}ADArAAAAAAAnAgAS`, and the result of the benchmark is:
```
none 21
zlib 24
bzip2 56
lzma 80
```

### What are the output numbers
```python
def none(v):
return len(v)

def zlib(v):
return len(codecs.encode(v, "zlib"))

def bzip2(v):
return len(codecs.encode(v, "bz2"))

def lzma(v):
return len(lz.compress(v))
```

? Those always correspond to the number of bytes after compression of the shuffled and prepended flag. ?

# Exploit
## Size of the FLAG
First thing first, the "none" compression is... not compressing. So it returns the size of the flag + the size of the string we prepend.

Remind our discovery, we now know ? the flag is 25 characters long ?, including 7 characters of the flag format (i.e. "DrgnS{}"), so 18 characters to find!
```
Please send: seed:string
I'll then show you the compression benchmark results!
Note: Flag has format DrgnS{[A-Z]+}
1:A // This is our input
none 26
zlib 34
bzip2 63
lzma 84
```

## Not so dumb brute force
### General idea
Not sure it is the intended way to go forward. We focused on zlib results.
What's great with zlib (and also bzip2) is that they use [Huffman Coding](https://en.wikipedia.org/wiki/Huffman_coding) and we remembered how it worked overall.

It basically split its input into tokens and counts how many times they appear.
Tokens are then be converted to bits in a way to be as efficient as possible. Those that appear the most will use as few bits as possible.

Our vector of attack will be to have long tokens that take very few bits.

### Testing our main vector of attack
At that point, we tried to validate our idea. For that we updated server.py to add debug info:
```python
print("Initial flag: " + str(FLAG))
flag = bytearray(FLAG)
random.seed(int(seed))
random.shuffle(flag)
test_string = string + bytes(flag)
print("Random + prefixed: " + str(test_string))
```

We then started it with a custom flag: `DrgnS{ABCEFGHIJKLMNOPQRT}`. Our server says that with `seed = 1`, the shuffled flag is `FP}QT{RILEGMDBrAHOJKnCgNS`.
```
1:YZ
none 27
zlib 35
bzip2 69
lzma 84

1:FP
none 27
zlib 35
bzip2 66
lzma 84

1:FP}QT{RI
none 33
zlib 35
bzip2 72
lzma 92

1:abcdefhi
none 33
zlib 41
bzip2 76
lzma 92
```

What we see is that with few letters there is no difference. That's kind of expected, Huffman Coding tokenizes bits, not characters.
However with longer strings, we can see that having a good prefix helps zlib stay at the same value, while unusual prefix will make it grow.

### Second vector of attack
In order for the main vector of attack to work, we actually need to know a substring of the shuffled flag.
This can be done by combining two piece of information.
1/ We actually know the flag format is `DrgnS{??????????????????}`, so we know 7 characters: `'D', 'r', 'g', 'n', 'S', '{', '}'`.
2/ Flag shuffling is fully deterministic and only depends on the seed we provide.

We created a script seed_unshuffle.py (cf Attachment) to find seeds for which the shuffled flag starts with `DrgnS`.

### Shuffle reversing
Our attack is done on the shuffled flag. So in order to find the flag, we actually need to unshuffle it.
This part is a matter of keeping track of how to swap characters. Using a fake flag where all characters are unique (e.g. `DrgnS{ABCEFGHIJKLMNOPQRT}`), we can see our letters are swapped.

We did it manually to win time in solving this challenge.

### Let's brute force locally!
#### V1
Cf attachment "exploit.py (V1)".
Using seed `2430447`, our local flag is DrgnS{ABCEFGHIJKLMNOPQRT} the shuffled flag is `DrgnSHJFCNTLM{BEPQ}OAIGKR` which starts with "DrgnS".
In this first brute force, we take the flag prefix ("DrgnS") and add one character among uppercase characters, `{` and `}` and check the zlib size. The character providing the lowest zlib size is then kept and we try the next one.
Result? => ❌ `DrgnSHJFCNTAAAAAAAAAAAAAA` ❌

The beginning is okay... until it only chooses A.
Checking manually what happened, after `DrgnSHJFCNT`, whatever the character being added, zlib always returns the same size.

#### V2
We don't remember exactly how tokenization works in Huffman Coding, and zlib also uses LZ77 which we don't know.
We thought of learning more about those. Then thought that anyway we are in Python here so we can just read the code and went to [gzip source code](https://github.com/python/cpython/blob/aaf42222cfd5774d23ca48ff304ace1f64426201/Lib/gzip.py#L576). And then we decided to be lazy and update our current brute force algorithm to not determine one character at a time, but two!

This exploit works!! ❌ Or so we thought ❌
Locally it works great, but remember our flag? There is no character being duplicated. The actual flag on the server contains duplication.

With this seed, there is quickly 3 `I`. Checking manually, those seems to be normal. But afterwards the exploit gets completely broken because such consecutive characters impacts the tokenisation of the Huffman Coding algorithm.

At that point we still kept track of what we learnt about the flag. We manually reversed the shuffle manually.
Here is the information we gathered with this test:
```
# DrgnS{??????????????????} -- Initial step, no brute force done
# DrgnS{??????A???????????} -- First letter brute forced
# DrgnS{??????A?R?????????} -- Second letter brute forced
# DrgnS{????I?A?R?????????} -- ...
# DrgnS{????I?A?R?????????} -- ...
# DrgnS{??I?I?A?R???I?????} -- Last letter brute forced, after this one we can't do anything
#2430447:DrgnSARIII
```

#### V2 second attempt
What do you do when brute-forcing is not working? Try again.
The next seed for which the shuffled flag starts with "DrgnS" is 3723664, and it actually starts with "DrgnS}", so that's even better, we know a longer prefix than expected.

We still have a similar issue, but we only miss 3 characters this time! We are also comforted with our brute force since found letters are the same as with our first try.
```
# DrgnS{??????????????????}
# DrgnS{????I?????????????}
# DrgnS{????IS????????????}
# DrgnS{????IS?????E??????}
# DrgnS{????IS?????E?????S}
# DrgnS{????IS????ME?????S}
# DrgnS{T???IS????ME?????S}
# DrgnS{T???ISA???ME?????S}
# DrgnS{TH??ISA???ME?????S}
# DrgnS{TH??ISA???MEI????S}
# DrgnS{TH??ISA???MEIG???S}
# DrgnS{TH??ISA???MEIG??SS}
# DrgnS{TH??ISA?R?MEIG??SS}
# DrgnS{TH??ISACR?MEIG??SS}
# DrgnS{TH??ISACR?MEIGU?SS}
# DrgnS{THI?ISACR?MEIGU?SS}
#3723664:DrgnS}ISESMTAHIGSRCUISESM
```

#### Guessing the flag
What do you do when brute-forcing is not working? Stop brute-forcing and guess. It often works when trying to brute force passwords so that seems legit here ?

The flag looks like a sentence and the first one that came to our mind is "This is a crime I guess". => ? DrgnS{THISISACRIMEIGUESS} ?

# Personal after thoughts
I found the challenge super fun. It reminded me of a course on compression I took 7 years ago at school. Time flies!
It also gave me vibes of "you can see that in real life" somehow. The most probable one would be in some Blind SQL Injection.
Maybe another would be related to HTTP calls that would store user input data, suffix it with some metadata, and returns a compressed size in headers. This kind of attack would allow an attacker to read the metadata. That seems a bit far fetched though.

Thank you Dragon Sector :)

# Attachments
## server.py
```python
#!/usr/bin/env python3
import threading
import socket
import random
import codecs
import lzma as lz

with open("flag.txt", "rb") as f:
FLAG = f.read().strip()

def none(v):
return len(v)

def zlib(v):
return len(codecs.encode(v, "zlib"))

def bzip2(v):
return len(codecs.encode(v, "bz2"))

def lzma(v):
return len(lz.compress(v))

COMPRESSION_FUNCS = [
none,
zlib,
bzip2,
lzma
]

def handle_connection(s, addr):
s.sendall(
("Please send: seed:string\\n\n"
"I'll then show you the compression benchmark results!\n"
"Note: Flag has format DrgnS{[A-Z]+}\n").encode())

data = b''
while True:
idx = data.find(b'\n')
if idx == -1:
if len(data) > 128:
s.shutdown(socket.SHUT_RDWR)
s.close()
return

d = s.recv(1024)
if not d:
s.close()
return
data += d
continue

line = data[:idx]
data = data[idx+1:]

seed, string = line.split(b':', 1)

flag = bytearray(FLAG)
random.seed(int(seed))
random.shuffle(flag)
test_string = string + bytes(flag)

response = []
for cfunc in COMPRESSION_FUNCS:
res = cfunc(test_string)
response.append(f"{cfunc.__name__:>8} {res:>4}")

response.append('')
response.append('')
s.sendall('\n'.join(response).encode())

s.shutdown(socket.SHUT_RDWR)
s.close()

def main():
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(('0.0.0.0', 1337))
s.listen(256)

while True:
conn, addr = s.accept()
print(f"Connection from: {addr}")

th = threading.Thread(
target=handle_connection,
args=(conn, addr),
daemon=True
)
th.start()

if __name__ == "__main__":
main()
```

## seed_unshuffle.py
```python
#!/usr/bin/env python3
import random

with open("flag.txt", "rb") as f:
FLAG = f.read().strip()

def main():
print("Initial flag: " + str(FLAG))

found = 0
seed = 2430448
while (found == 0):
flag = bytearray(FLAG)
random.seed(int(seed))
random.shuffle(flag)
test_string = bytes(flag)

if(str(test_string).startswith("b'DrgnS")):
print("Seed found: " + str(seed))
found = 1

if(seed % 100000 == 0):
print(seed)

seed += 1

if __name__ == "__main__":
main()
```

## exploit.py (V1)
```
import re
import requests
import socket
import string
from pwn import *

flag = "DrgnS"
seed = "3723664"
#nc = remote('compresstheflag.hackable.software', 1337)
nc = remote('127.0.0.1', 1337)
nc.recvuntil(b"Note: Flag has format DrgnS{[A-Z]+}\n")

def send_flag(payload):
nc.write(bytes(payload, "utf-8") + b"\n")
nc.recvuntil(b"none")
nc.recvline()
recv = nc.recvline()
nc.recvuntil(b"lzma")
nc.recvline()

numbers = [int(word) for word in recv.split() if word.isdigit()]
return int(numbers[0])

while len(flag) < 25:
min_size = 100000
best_payload = ""
for i in string.ascii_uppercase + "{}":
payload = flag + i
print(payload)
size = send_flag(seed + ":" + payload)
if size < min_size:
min_size = size
best_payload = payload

flag = best_payload
```