The audio file contains not only "instructions" in voice form, but also keyboard tapping. The space key has a distinct long sound; the audio begins like (5 keys)(space)(3 keys)(space)(3 or 4 keys, not very clear)(space)(more tapping)(voice saying "hello you have been pwned..."). So the voice is fully synthesized by some text-to-speech engine (and is not e.g. a distorted version of the hacker's voice), and most of the tapping corresponds to what the voice says.

Aside from the space key, the other keys sound the same to the human ear (at least, to my ear), but maybe an analysis can distinguish more? Comparing the raw bytes of the sound waves is practically meaningless; probably the simplest invariant (for same-volume sounds) is how long the signal stays above a certain threshold.
```
#include <stdio.h>
#include <stdlib.h>

#define N 0xA10C8B  // number of 16-bit samples to read
#define LIM 4096    // amplitude threshold separating taps from silence

// Returns 1 if the next 1024 samples (fewer at the end of the data)
// all stay strictly inside (-LIM, LIM), i.e. the current sound has ended.
int done(size_t pos, const short* data)
{
    for (size_t i = 0; i < 1024 && pos + i < N; i++)
        if (data[pos+i] <= -LIM || data[pos+i] >= LIM)
            return 0;
    return 1;
}

int main()
{
    FILE* f = fopen("theroxxorlord2010.wav", "rb");
    if (!f) {
        fprintf(stderr, "Failed to open input\n");
        return 1;
    }
    // skip the header (0x2C = 44 bytes, the canonical WAV header size)
    fseek(f, 0x2C, SEEK_SET);
    static short data[N];
    if (fread(data, 1, sizeof(data), f) != sizeof(data)) {
        fclose(f);
        fprintf(stderr, "Read error\n");
        return 1;
    }
    fclose(f);
    size_t pos = 0;
    for (;;) {
        // skip silence until the signal crosses the threshold
        while (pos < N && data[pos] > -LIM && data[pos] < LIM) {
            ++pos;
            // [13s, 18s] and [53s, 1min5s] are voice, skip them
            if (pos == 13*44100)
                pos += 5*44100;
            if (pos == 53*44100)
                pos += 12*44100;
        }
        if (pos >= N)
            break;
        // the sound lasts until 1024 consecutive quiet samples follow
        size_t start = pos;
        while (!done(pos, data))
            ++pos;
        printf("%lf -> %lf [%d]\n", start / 44100., pos / 44100., (int)(pos - start));
    }
    return 0;
}
```
The results are far from perfect, but possible to work with:
```
1.051202 -> 1.057551 [280]
1.144898 -> 1.153016 [358]
h
1.282993 -> 1.287007 [177]
1.353946 -> 1.360295 [280]
e
1.538390 -> 1.547234 [390]
1.609501 -> 1.615465 [263]
l
1.677710 -> 1.686553 [390]
1.795261 -> 1.801224 [263]
l
1.910476 -> 1.915170 [207]
2.007664 -> 2.012222 [201]
o
2.188549 -> 2.193719 [228]
2.200454 -> 2.200544 [4]
2.281837 -> 2.297710 [700]
[space]
2.467642 -> 2.476032 [370]
2.561610 -> 2.568753 [315]
y
2.723175 -> 2.726825 [161]
2.820363 -> 2.824943 [202]
o
2.885624 -> 2.891020 [238]
3.005918 -> 3.011610 [251]
u
3.163787 -> 3.168957 [228]
3.175692 -> 3.175941 [11]
3.280136 -> 3.296168 [707]
[space]
3.397234 -> 3.405351 [358]
h
3.535351 -> 3.541769 [283]
3.646122 -> 3.652925 [300]
a
3.674626 -> 3.679320 [207]
3.748753 -> 3.754989 [275]
v
3.837166 -> 3.842449 [233]
3.931361 -> 3.937687 [279]
e
3.999705 -> 4.005442 [253]
4.011610 -> 4.013401 [79]
4.092993 -> 4.108866 [700]
[space]
4.208685 -> 4.215215 [288]
4.328118 -> 4.331270 [139]
b
4.464694 -> 4.468141 [152]
4.535079 -> 4.541429 [280]
e
4.696327 -> 4.700340 [177]
4.744059 -> 4.752290 [363]
e
4.858844 -> 4.864785 [262]
4.951723 -> 4.951859 [6]
4.958005 -> 4.960952 [130]
n
5.067823 -> 5.072063 [187]
5.137891 -> 5.153764 [700]
[space]
5.276916 -> 5.280045 [138]
5.284308 -> 5.284331 [1]
5.394626 -> 5.402268 [337]
p
5.509002 -> 5.513991 [220]
5.601905 -> 5.601927 [1]
5.608753 -> 5.612925 [184]
w
5.741202 -> 5.747143 [262]
5.810862 -> 5.810998 [6]
5.817143 -> 5.820091 [130]
n
5.950249 -> 5.954218 [175]
6.044376 -> 6.050703 [279]
e
6.182449 -> 6.187596 [227]
6.256054 -> 6.256077 [1]
6.259977 -> 6.265261 [233]
d
6.344921 -> 6.350091 [228]
6.357052 -> 6.357098 [2]
6.438209 -> 6.454104 [701]
[space]
7.296939 -> 7.305805 [391]
7.373469 -> 7.380839 [325]
t
7.506349 -> 7.512698 [280]
7.623265 -> 7.631383 [358]
h
7.761338 -> 7.765374 [178]
7.855533 -> 7.861859 [279]
e
8.016757 -> 8.021927 [228]
8.028662 -> 8.028685 [1]
8.110045 -> 8.125918 [700]
[space]
```
Usually keys have two distinct sounds, one for pressing and one for releasing, although sometimes the threshold splits one of those into two, making three ranges in total. When pairs/triples of lengths are close enough, they correspond to the same key (e.g. 390:263 corresponds to 'l', 262:6:130 corresponds to 'n'), but not vice versa: the same key can give different lengths (in the snippet above, 'e' is around 175:280 three times, but also 152:280, 177:363 and 233:279). A very useful source for many keys is the bitcoin address in the second phrase; it is nearly impossible to fully recognize from the voice, but the hacker has also attached a text file
```
-----BEGIN BITCOIN SIGNED MESSAGE-----
Coucou
-----BEGIN SIGNATURE-----
1Lbe5hegWuGxukDQg3xQ77vs5TC3xiMVLQ
IPWuZOUZ0UmNdsofHibn+oADosWBikc/u5dI7D6HZxc3oJvv9ESneeT+/yw/o++FKqqeEMI1I53zKLNZFgOBPlQ=
-----END BITCOIN SIGNED MESSAGE-----
```
that contains the address. Capital letters and, strangely, digits are typed with Shift, so the sequence of keypresses is (press Shift)(1)(l)(release Shift)(b)(e)(press Shift)...
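The expected event sequence for the address can be written out mechanically, which helps when lining it up against the detected taps. The following is only a sketch: the helper `key_events` is a made-up name, and it assumes that Shift stays held across consecutive capitals/digits, as the start of the sequence above suggests.
```python
ADDRESS = "1Lbe5hegWuGxukDQg3xQ77vs5TC3xiMVLQ"

def key_events(text):
    """Expand text into keyboard events (hypothetical helper).

    Assumption: capitals and digits are typed with Shift held down, and
    Shift is kept pressed across consecutive shifted characters.
    """
    events = []
    shift_down = False
    for ch in text:
        needs_shift = ch.isupper() or ch.isdigit()
        if needs_shift and not shift_down:
            events.append("press Shift")
        elif shift_down and not needs_shift:
            events.append("release Shift")
        shift_down = needs_shift
        events.append(ch.lower())
    if shift_down:
        events.append("release Shift")
    return events

events = key_events(ADDRESS)
print("".join(f"({e})" for e in events[:7]))
# (press Shift)(1)(l)(release Shift)(b)(e)(press Shift)
```
Each Shift press and release is a separate sound, so the number of taps to expect is `len(events)`, not the length of the address.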

Another, more stable way to check whether two sounds are the same is the Fourier transform:
```
>>> import librosa # I'm definitely not going to write it from scratch
>>> import numpy as np
>>> f = librosa.load('theroxxorlord2010.wav', sr=None)
>>> d = librosa.stft(f[0], hop_length=128)
>>> dabs = np.abs(d)
>>> np.linalg.norm(dabs[:,int(1.677710*44100/128)] - dabs[:,int(1.538390*44100/128)])
3.3174179 # two 'l'-s from above are quite close
>>> np.linalg.norm(dabs[:,int(1.910476*44100/128)] - dabs[:,int(1.677710*44100/128)])
46.886997 # 'o' and 'l' from above are not so close
>>> def printgood(target, limit=16):
...     for i in range(target-10):
...         x = np.linalg.norm(dabs[:,i] - dabs[:,target])
...         if x < limit:
...             print(i,'=',i*128/44100,x)
...
>>> printgood(int(1.677710*44100/128))
# what previous fragments are sufficiently close to the second 'l'?
529 = 1.5354195011337868 9.13513
530 = 1.5383219954648526 3.3174179
531 = 1.5412244897959184 5.5610113
532 = 1.5441269841269842 8.659018
533 = 1.5470294784580498 13.392412
>>> printgood(int(3.837166*44100/128))
# can it actually figure out that two first 'e'-s are close?
440 = 1.2770975056689342 14.873772
441 = 1.28 8.230451
442 = 1.2829024943310658 5.801723
443 = 1.2858049886621314 5.989423
444 = 1.2887074829931973 7.3282714
445 = 1.291609977324263 11.990934
# yes, it can
```
I mainly used lengths, which took a lot of manual work to match, and only started to research how one does FFTs and to check results with FFT when the letters just refused to spell words. In retrospect, this has probably cost me a significant amount of time.
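The length matching itself is easy to automate; the hard part is that it misses a lot. A rough sketch, where the tolerance of 10 samples and the helper names `same_key` and `guess` are my assumptions and the reference signatures are taken from the listing above:
```python
def same_key(sig_a, sig_b, tolerance=10):
    """True if both signatures have the same number of ranges and every
    pair of lengths differs by at most `tolerance` samples."""
    return len(sig_a) == len(sig_b) and all(
        abs(a - b) <= tolerance for a, b in zip(sig_a, sig_b)
    )

# reference signatures (lengths in samples) from the listing above
known = {
    "l": (390, 263),
    "n": (262, 6, 130),
    "e": (177, 280),
    "o": (207, 201),
}

def guess(sig, known, tolerance=10):
    """All known keys whose signature is close enough to `sig`."""
    return [key for key, ref in known.items() if same_key(sig, ref, tolerance)]

print(guess((390, 263), known))  # ['l']
print(guess((175, 279), known))  # ['e']
print(guess((233, 279), known))  # [] although the listing says this is also 'e'
```
The last line shows why a match is informative but a mismatch is not: one key needs several reference signatures, and collecting those is where the manual work goes.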

Anyway, after a while the voice says something about signing and a message "coucou", after which the audio has a long fragment of tapping that does not correspond to anything that the voice says. The fragment can be divided into two distinct parts: the part from 01:40 to 01:53 has relatively low taps with pauses inside, and the part from 01:56 to 02:50 contains a steady flow of taps. Upon closer investigation, the second part has 23 spaces making for 24 words, which is exactly the length of the highest-entropy BIP39 mnemonic (and the first part presumably deals with launching the signer and entering the message).
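The word-count check can be sketched as follows. The 600-sample cut-off for a space is an assumption read off the listing above (spaces are around 700 samples, everything else stays below 400), and `count_words` and `entropy_bits` are made-up names; the arithmetic follows BIP39, where each word carries 11 bits and one checksum bit is added per 32 bits of entropy.
```python
VALID_WORD_COUNTS = (12, 15, 18, 21, 24)  # mnemonic lengths allowed by BIP39

def count_words(burst_lengths, space_threshold=600):
    """Number of words, assuming every burst of at least `space_threshold`
    samples is a space and the text does not end with a space."""
    spaces = sum(1 for n in burst_lengths if n >= space_threshold)
    return spaces + 1

def entropy_bits(word_count):
    """Entropy in bits of a BIP39 mnemonic with `word_count` words."""
    if word_count not in VALID_WORD_COUNTS:
        raise ValueError(f"{word_count} is not a valid BIP39 length")
    total_bits = word_count * 11  # each word encodes 11 bits
    return total_bits * 32 // 33  # 1 checksum bit per 32 entropy bits

# first bursts of "hello you have ..." from the listing above
bursts = [280, 358, 177, 280, 390, 263, 390, 263, 207, 201, 228, 4, 700,
          370, 315, 161, 202, 238, 251, 228, 11, 707, 358]
print(count_words(bursts))  # 3
print(entropy_bits(24))     # 256
```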

After some work, matching all keys between 01:56 and 02:50 to earlier known keys indeed reveals a valid mnemonic. The rest is straightforward:
```
>>> import mnemonic
>>> from Crypto.Cipher import AES
>>> mnemo = mnemonic.Mnemonic('english')
>>> entropy = mnemo.to_seed('smoke cave often gasp junk stone student high dragon fiction still awesome believe muffin dynamic faith gather couple upgrade credit night fame dumb surge')
>>> AES.new(key=bytes.fromhex('ee8879f2db202c4cefd1810e174989b09bd121eb4a84dea382ca98bb6d4e63e5'), mode=AES.MODE_ECB).decrypt(b'r\x8cd\x11\xbf\xf8U_\xea\x92\xb4\xba\xad*{.\xdc\x08\x98\x92\\\xfb\x97\xa6\x01/_(W\xf4\xf3\xe3')
b'CTF{Thereare*explodingaroundyou}'
```