Rating:
Solution
========
Task was easy, but that's not what I wanted to share
So there was this +300M brainfuck file... (329 080 541 to be precise).
It was a multilayer, so every next layer was also brainfuck file but a lot smaller:
329 080 541 original
41 145 939 file1
5 148 580 file2
646 203 file3
82 092 file4
10 991 file5
1 914 file6
221 file7
last file was following python script:
```python
import hashlib
class Tolkien:
def __init__(self):
self.quote = 'Not all those who wander are lost'
if __name__ == '__main__':
t = Tolkien()
print(hashlib.md5(t.quote.encode('utf-8')).hexdigest())
```
(If solution is the only thing that you're interested in, you can stop reading here)
----
AOT v1
======
So the biggest problem was probably making interpreter that is able to crunch that 300M file.
But without much hassle, I got an interpreter that could do that in **30s.**
That was certainly fine for CTF, but it bothered me for some time and I thought that's lame, so I've come up with an idea
of making an AOT for this task. I have bad experiances with asmjit, and I was already familiar with luajit's dynasm
(http://github.com/gim913/dynasm-jit-playground), so I thought about using it.
I came up with following code (boilerplate dynasm code removed) [first.cpp](first.cpp)
That's where the problems started. Internally dynasm is handling some sort of sections and it's getting problematic, if you'll
have to many labels (and that certainly will be the case for a brainfuck to x64 binary translation).
The problem comes from the fact how it's calculating so called 'positions' into pointers within a sections.
I've tried requesting some help on luajit mailing list, but without much luck.
I was to lazy to analyze the code and do a proper fix, so instead I've changed the macros that handle calculations, so I've turned following
macros from `dynasm_x86.h`:
```
/* Macros to convert positions (8 bit section + 24 bit index). */
#define DASM_POS2IDX(pos) ((pos)&0x00ffffff)
#define DASM_POS2BIAS(pos) ((pos)&0xff000000)
#define DASM_SEC2POS(sec) ((sec)<<24)
#define DASM_POS2SEC(pos) ((pos)>>24)
#define DASM_POS2PTR(D, pos) (D->sections[DASM_POS2SEC(pos)].rbuf + (pos))
```
into (I know I have only one section, so we'll use only single bit for section, and 31 bits for index):
```cpp
#define DASM_POS2IDX(pos) ((pos)&0x7fffffff)
#define DASM_POS2BIAS(pos) ((pos)&0x80000000)
#define DASM_SEC2POS(sec) ((sec)<<31)
#define DASM_POS2SEC(pos) ((pos)>>31)
#define DASM_POS2PTR(D, pos) (D->sections[DASM_POS2SEC(pos)].rbuf + pos)
```
Now that itself did not work, as the input file was big enough to cause integer calculations, so I've fixed few more things:
```cpp
typedef struct dasm_Section {
int *rbuf; /* Biased buffer pointer (negative section bias). */
int *buf; /* True buffer pointer. */
size_t bsize; /* Buffer size in bytes. */
size_t /*int*/ pos; /* Biased buffer position. */
size_t /*int*/ epos; /* End of biased buffer position - max single put. */
int ofs; /* Byte offset into section. */
} dasm_Section;
```
pos and epos in struct above became `size_t`, and at beginnig of `dasm_put`, there was also `pos` that became `size_t`
`size_t pos = sec->pos;`
now there's one more thing that might be a problem if you don't have 4G of free mem ^^
basically dasm rellocation always try to double memory, I didn't need that, but you might need following hack,
that slows down allocation a bit once you hit 512M:
```cpp
#ifndef DASM_M_GROW
#define DASM_M_GROW(ctx, t, p, sz, need) \
do { \
size_t _sz = (sz), _need = (need); \
if (_sz < _need) { \
if (_sz < 16) _sz = 16; \
if (_sz < 1024 * 1024 * 1024) \
while (_sz < _need) _sz += _sz; \
else \
while (_sz < _need) _sz += 128 * 1024 * 1024; \
(p) = (t *)realloc((p), _sz); \
if ((p) == NULL) exit(1); \
(sz) = _sz; \
} \
} while(0)
#endif
```
So with all of that in place it finally worked, and I was able to get get output of 300M file in time between **16-26 secs**
(depending on avail memory, allocations, system load etc.)
I've tried playing a bit with dynasm allocator (not to reallocate section), but that did not give much boost.
I collected bit more details, about time taken (in seconds):
| action | time |
| --------------- | -------------:|
| code generation | 5.19 |
| dynasm link | 1.73 |
| encode time | 3.25 |
| execution time | 5.22 |
| --------------- | ------------- |
| overall | 15.39 |
At this point I thought, I should be able to beat 3 first parts, as dynasm needs to take care about handling labels.
But brainfuck is simple, so I thought it shouldn't be that hard to beat that value, right?
AOT v2
======
And it isn't :)
We can generate all the code on the fly in a single pass.
But first reminder how brainfuck loops work like.
* `[` - if `byte(data pointer) == 0`, jump after `]`
* `]` - jump back to correpsonding opening `[`
So we will need simple trick when generating [, we'll gonna make following code:
```asm
loop_start:
cmp [rdi], 0
jnz continue_loop
jmp dummy_value
continue_loop:
... code goes here
; now handle closing brace
jmp loop_start
loop_end:
```
When generating, I'm gonna track all loop starts on the stack, and when generating
jump at closing brace:
* get from the stack position of `loop_start`, and generate `jmp loop_start`
* calculate proper value to fill `dummy_value`, so that the jump will go to `loop_end`
That is what [second.cpp](second.cpp) does, easy peasy, time **6.53** :)
| action | time |
| --------------- | -------------:|
| code generation | 1.25 |
| execution time | 5.28 |
| --------------- | ------------- |
| overall | 6.53 |
Optimize
========
Now, next question is, can this get any better? Sure, with just a little bit of cheating.
The key is an observation, that there are many loops in the file that look like:
`[---->++<]`
I'm storing this as a new insturction that will have arguments `M(count1, count2)`
count1 is negative if there were `-` and positive it it were `+`, but in the
files provided count1 is always negative.
The preprocessing step is bit lengthy, but it's not that complicated.
Take a look at [optimizer.cpp](optimizer.cpp)
So I'm changing whole loop into `M(a,b)` filling the rest with blanks.
This can be used to calculate value of next cell as follows (pseudo code):
```
M(a,b)
starting = cell[ current ]
a = -a
while (starting % a)
starting += 256;
result = starting / a * b;
cell[ current ] = 0
cell[ current + 1 ] += result;
```
How to use it is along with [second.cpp](second.cpp) is left as an excercise.
Now time for the final number **time: 2.8s** :)
Summary
=======
Obviously last step is a cheat, and if applied directly
to interpreter, it would improve it's speed as well.
- Is it an overkill? Who am I to argue.
- Was it fun? OFC
Point was, that even from dumb task, you can get somewhat valuable knowledge.
Erratum
=======
So I've applied optimizations as well, and collected times from few runs,
here are final numbers:
Interpreter
-----------
| |optimizer|interpreter|total|
| --- | --- | --- | --- |
| | 0.866119| 1.18828| 2.05485|
| | 0.776477| 1.18891| 1.9658|
| | 0.711183| 1.18157| 1.89319|
| | 0.721473| 1.19604| 1.91792|
| | 0.680486| 1.18171| 1.86262|
| | 0.677328| 1.18375| 1.8615|
| | 0.657009| 1.17936| 1.83681|
| | 0.669505| 1.18598| 1.85591|
| | 0.670191| 1.18677| 1.85737|
| | 0.669077| 1.18755| 1.8571|
| **average** | **0.7098848** | **1.185992** | **1.896307**|
AOT
---
| |optimizer|x64 generation|execution|total|
| --- | --- | --- | --- | --- |
| | 0.731768| 0.87682| 1.00158| 2.64302|
| | 0.677816| 0.853774| 1.00222| 2.567|
| | 0.628973| 0.843486| 1.00134| 2.50852|
| | 0.615942| 0.854813| 1.00879| 2.51432|
| | 0.619122| 0.846164| 1.01687| 2.51639|
| | 0.615045| 0.855564| 1.02103| 2.52412|
| | 0.616443| 0.84457| 1.0055| 2.49951|
| | 0.616248| 0.842644| 1.01138| 2.52595|
| | 0.631289| 0.847689| 1.01104| 2.52361|
| | 0.618018| 0.84542| 1.01319| 2.51087|
| **average** | **0.6370664** | **0.8510944** | **1.009294** | **2.533331** |
So it's pretty clear interpreter is faster ^^
If you compare exectution only vs interpreter the difference is not that big.
Was it worth it? Certainly :)