Tags: sekaictf2022 vocaloid_heardle
Well, it’s just too usual to hide a flag in stegano, database, cipher, or server. What if we decide to sing it out instead?
Author: pamLELcu
See challenge here: https://ctf.sekai.team/challenges#Vocaloid-Heardle-23
After looking at the files, it became clear that vocaloid_heardle.py is the file used to generate flag.mp3.
Let's imagine that the flag is SEKAI{THIS_IS_MY_FIRST_WRITEUP}. Given this flag, the Python script:
Removes the enclosing SEKAI{...} to get the inner substring THIS_IS_MY_FIRST_WRITEUP
Converts each character to its Unicode code point:
ord('T') = 84
ord('H') = 72
ord('I') = 73
For each code point, picks a random track among all musics whose musicId equals it, downloads that track to tracks/<musicId>.mp3, and stores the musicIds in the array tracks
import random
import requests

# returns a random assetbundleName from the list of all musics whose musicId equals the given input mid
# (`resources`, the list of all musics, and `flag`, the inner substring, are defined earlier in the script)
def get_resource(mid):
    return random.choice([i for i in resources if i["musicId"] == mid])["assetbundleName"]

# downloads the short version of the chosen track and saves it as tracks/<mid>.mp3
def download(mid):
    resource = get_resource(mid)
    r = requests.get(f"https://storage.sekai.best/sekai-assets/music/short/{resource}_rip/{resource}_short.mp3")
    filename = f"tracks/{mid}.mp3"
    with open(filename, "wb") as f:
        f.write(r.content)
    return mid

tracks = [download(ord(i)) for i in flag]
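# here is what tracks looks like after execution (download returns mid, so it holds the musicIds):
# tracks = [
#     84,  # 'T', saved as tracks/84.mp3 (audio from e.g. vs_0084_01)
#     72,  # 'H', saved as tracks/72.mp3 (audio from e.g. 0072_01)
#     73,  # 'I', saved as tracks/73.mp3 (audio from e.g. se_0073_01)
#     ...
# ]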
Stitches together the downloaded music files using ffmpeg to generate flag.mp3
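# stage 1
inputs = sum([["-i", f"tracks/{i}.mp3"] for i in tracks], [])
# stage 2
# (the last line of the expression is cut off in this excerpt; the concat filter below is an
#  assumption, chosen because it is the usual way to produce the [a] stream mapped in stage 3)
filters = "".join(f"[{i}:a]atrim=end=3,asetpts=PTS-STARTPTS[a{i}];" for i in range(len(tracks))) + \
          "".join(f"[a{i}]" for i in range(len(tracks))) + \
          f"concat=n={len(tracks)}:v=0:a=1[a]"
# stage 3
subprocess.run(["ffmpeg"] + inputs + ["-filter_complex", filters, "-map", "[a]", "flag.mp3"])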
# stage 1:
# inputs = [
#     '-i', 'tracks/84.mp3',
#     '-i', 'tracks/72.mp3',
#     '-i', 'tracks/73.mp3',
#     ...
# ]
# stage 2:
# filters = '[0:a]atrim=end=3,asetpts=PTS-STARTPTS[a0];[1:a]atrim=end=3,asetpts=PTS-STARTPTS[a1];[2:a]atrim=end=3,asetpts=PTS-STARTPTS[a2]; ...'
# stage 3:
# ffmpeg -i tracks/84.mp3 -i ... -filter_complex <filters> -map [a] flag.mp3
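To make the three steps concrete, here is a minimal sketch that reproduces the flag-to-musicId conversion and the stage 2 filter string without downloading anything or calling ffmpeg. The helper names inner_flag and build_filter_graph are hypothetical, and the trailing concat=n=...:v=0:a=1[a] part is an assumption about how the trimmed streams are joined into the [a] stream.

```python
def inner_flag(flag: str) -> str:
    """Strip the enclosing SEKAI{...} wrapper."""
    assert flag.startswith("SEKAI{") and flag.endswith("}")
    return flag[len("SEKAI{"):-1]


def build_filter_graph(n: int, seconds: int = 3) -> str:
    """Build a filter_complex string for n inputs (hypothetical helper).

    Each input is cut to its first `seconds` seconds and its timestamps are
    reset; the concat part at the end is an assumption (see lead-in).
    """
    trims = "".join(
        f"[{i}:a]atrim=end={seconds},asetpts=PTS-STARTPTS[a{i}];" for i in range(n)
    )
    labels = "".join(f"[a{i}]" for i in range(n))
    return trims + labels + f"concat=n={n}:v=0:a=1[a]"


music_ids = [ord(c) for c in inner_flag("SEKAI{THIS_IS_MY_FIRST_WRITEUP}")]
print(music_ids[:3])          # [84, 72, 73]
print(build_filter_graph(2))  # filter string for a two-track example
```

Printing the filter string for a small n is a quick way to read the graph: one trim chain per input, then a single node that consumes every [aN] label.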
Having understood how the flag is generated, the obvious next step is to (1) figure out which music files make up flag.mp3, (2) get the corresponding musicIds from the file names, and (3) convert the musicIds from Unicode code points back to characters.
I needed to Google a bit to figure out what exactly the ffmpeg instruction was doing.
I stumbled upon a Stack Overflow post that shows how to concatenate two audio files via ffmpeg's filter_complex.
Here is how the ffmpeg command works for the above example: ffmpeg accepts some input files via the -i option, then applies a series of filters via the -filter_complex option (the filter chains are separated by semicolons), and finally saves the final output stream [a] as flag.mp3.
You can learn more about how ffmpeg works here.
Diving deeper into filter_complex, I learned what the atrim and asetpts filters do in stage 2:
"atrim=end=3" keeps the audio only up to the 3-second mark
"asetpts=PTS-STARTPTS" resets the timestamps so that each trimmed stream starts at 0
Thus, stage 2 is essentially trimming every audio file down to its first 3 seconds and concatenating the results in input order:
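# stage 2
# (the last line is cut off in this excerpt; the concat filter is an assumption,
#  chosen because it is the usual way to join the trimmed streams into [a])
filters = "".join(f"[{i}:a]atrim=end=3,asetpts=PTS-STARTPTS[a{i}];" for i in range(len(tracks))) + \
          "".join(f"[a{i}]" for i in range(len(tracks))) + \
          f"concat=n={len(tracks)}:v=0:a=1[a]"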
A quick sanity check confirms that our hypothesis is true: flag.mp3 is an audio file that lasts for 33 seconds (a multiple of 3), and while playing it we hear the music change every 3 seconds.
Thus, the inner substring of the flag must contain 33 / 3 = 11 characters!
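The same check can be scripted. The sketch below is one way to do it, assuming ffprobe (shipped with ffmpeg) is on the PATH; the function names are hypothetical. The count is rounded because the duration reported for an MP3 is rarely an exact multiple of 3.

```python
import subprocess


def audio_duration_seconds(path: str) -> float:
    """Ask ffprobe for the container duration of an audio file, in seconds."""
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1",
         path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def clip_count(duration: float, clip_seconds: float = 3.0) -> int:
    """Number of fixed-length clips that make up a file of the given duration."""
    return round(duration / clip_seconds)


# Usage (requires ffprobe and the challenge file):
# print(clip_count(audio_duration_seconds("flag.mp3")))
print(clip_count(33.0))  # 11
```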
Why not immediately make use of all the knowledge we've gained about ffmpeg?
A quick Google search taught me how to split an audio file into equal segments using ffmpeg:
ffmpeg -i flag.mp3 -f segment -segment_time 3 -c copy flag_char_%03d.mp3
This generated precisely 11 files:
vocaloid_heardle
┣ flag.mp3
┣ vocaloid_heardle.py
┗ flag_chars
  ┣ flag_char_000.mp3
  ┣ flag_char_001.mp3
  ┣ flag_char_002.mp3
  ┣ flag_char_003.mp3
  ┣ flag_char_004.mp3
  ┣ flag_char_005.mp3
  ┣ flag_char_006.mp3
  ┣ flag_char_007.mp3
  ┣ flag_char_008.mp3
  ┣ flag_char_009.mp3
  ┗ flag_char_010.mp3
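The split can also be driven from Python, which makes it easy to write the segments straight into the flag_chars folder. This is only a sketch: the helper names segment_command and split are hypothetical, and the ffmpeg flags are exactly the ones from the command above. Since -c copy does not re-encode, cuts can only land on MP3 frame boundaries, so each segment may be a few milliseconds off 3 seconds.

```python
import subprocess
from pathlib import Path


def segment_command(src: str, out_dir: str, seconds: int = 3) -> list:
    """Build the ffmpeg argument list that splits src into fixed-length segments."""
    pattern = f"{out_dir}/flag_char_%03d.mp3"
    return ["ffmpeg", "-i", src,
            "-f", "segment", "-segment_time", str(seconds),
            "-c", "copy", pattern]


def split(src: str = "flag.mp3", out_dir: str = "flag_chars") -> None:
    """Create the output folder and run the split (requires ffmpeg on the PATH)."""
    Path(out_dir).mkdir(exist_ok=True)
    subprocess.run(segment_command(src, out_dir), check=True)


# split()  # uncomment to run against the challenge file
print(" ".join(segment_command("flag.mp3", "flag_chars")))
```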
One hour into the challenge and I was determined to solve this CTF. My teammates probably got tired of hearing me repeatedly play flag.mp3. It is about time for me to tell them "I found the flag!!"
Okay, we got the individual audio files. Now we need to know which musicId each of the flag_char_XXX.mp3 files corresponds to.
What's the most algorithmically efficient way to do that?
Brute force. Brute force is the way.
And so that's what I did:
I scraped and downloaded all 638 music files (>500MB) listed in resources.json:
import json
import requests

# get all possible assetbundleNames from resources.json
def scrape():
    with open("resources.json", "r") as f:
        resources = json.load(f)
    # download all possible assetbundleNames
    for resource in resources:
        asset = resource["assetbundleName"]
        print("getting asset:", asset)
        r = requests.get(f"https://storage.sekai.best/sekai-assets/music/short/{asset}_rip/{asset}_short.mp3")
        # write to a new file
        filename = f"tracks/{asset}.mp3"
        with open(filename, "wb") as f:
            f.write(r.content)
        print(f"wrote to file: {filename}")

# sit and wait
scrape()
Now my folder looks like this:
vocaloid_heardle
┣ flag.mp3
┣ vocaloid_heardle.py
┣ flag_chars
┃ ┣ flag_char_000.mp3
┃ ┣ flag_char_001.mp3
┃ ┗ ...
┗ tracks # 638 MP3s (>500MB)
  ┣ 0001_01.mp3
  ┣ 0002_01.mp3
  ┗ ...
Here comes the hard part: figuring out which audio file maps to each of the flag_char_XXX.mp3 files.
Attempt 1: I tried using SequenceMatcher from Python's difflib, but was not able to find matching audio files. My guess is that because ffmpeg re-encodes the audio when generating flag.mp3, the bytes of each segment do not line up with the bytes of the original track.
import json
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def brute_force_flag_char(file_name):
    with open(file_name, "rb") as f:
        file_to_brute = f.read()
    # loop through all track files
    with open("resources.json", "r") as f:
        resources = json.load(f)
    for resource in resources:
        asset = resource["assetbundleName"]
        # compare the raw bytes of this flag segment with the 3-second trim of the track
        with open(f"trim_tracks_mp3/{asset}_3s.wav", "rb") as f:
            file2 = f.read()
        sim_ratio = similar(file_to_brute, file2)
        if sim_ratio > 0:
            print(f"{sim_ratio}: {asset}")

# sit and wait
for i in range(11):
    brute_force_flag_char(f"flag_chars/flag_char_{i:03d}.mp3")
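If the guess about misaligned bytes is right, comparing decoded audio rather than file bytes should hold up better. The sketch below is one such alternative; it is not the method used in this writeup and has not been run against the challenge files. It decodes the first 3 seconds of each file to mono PCM with ffmpeg, reduces it to a coarse loudness envelope (RMS per 50 ms window), and ranks tracks by the Pearson correlation of the envelopes. All function names, the 8 kHz sample rate, and the window size are assumptions.

```python
import math
import subprocess
import sys
from array import array

SAMPLE_RATE = 8000  # assumption: a low mono rate is enough for a coarse comparison


def decode_pcm(path, seconds=3.0):
    """Decode the first `seconds` of an audio file to mono 16-bit PCM via ffmpeg."""
    raw = subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path, "-t", str(seconds),
         "-ac", "1", "-ar", str(SAMPLE_RATE), "-f", "s16le", "-"],
        capture_output=True, check=True,
    ).stdout
    samples = array("h")
    samples.frombytes(raw[: len(raw) - len(raw) % 2])
    if sys.byteorder == "big":  # s16le is little-endian
        samples.byteswap()
    return samples


def envelope(samples, window=400):
    """RMS energy per window (400 samples = 50 ms at 8 kHz)."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + window]) / window)
        for i in range(0, len(samples) - window + 1, window)
    ]


def similarity(a, b):
    """Pearson correlation of two envelopes, truncated to the shorter one."""
    n = min(len(a), len(b))
    if n < 2:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_match(clip_path, track_paths):
    """Return the track whose opening envelope correlates best with the clip."""
    clip_env = envelope(decode_pcm(clip_path))
    return max(track_paths,
               key=lambda p: similarity(clip_env, envelope(decode_pcm(p))))
```

The envelope is deliberately coarse: a shift of a few milliseconds introduced by MP3 encoding moves raw samples but barely changes 50 ms energy windows.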
Attempt 2: I then tried using audiodiff, but again it didn't work.
At this point I felt defeated.
Maybe I implemented something wrongly...
Then, after some more Googling, I stumbled upon gold: Sononym
It is free software that lets you find similar-sounding samples in a sample collection with a simple drag-and-drop UI.
I downloaded the software. Dragged my tracks folder containing 638 audio files into the app. Then dragged flag_char_000.mp3 in as well.
Lo and behold, an instant 99% match on vs_0118_01.mp3, which corresponds to musicId 118, or chr(118), which is the letter v.
Now quickly repeat this for all 11 characters:
flag segment | matched track | code point | character |
flag_char_000.mp3 | vs_0118_01.mp3 | 118 | v |
flag_char_001.mp3 | 0048_01.mp3 | 48 | 0 |
flag_char_002.mp3 | 0067_01.mp3 | 67 | C |
flag_char_003.mp3 | vs_0097_01.mp3 | 97 | a |
flag_char_004.mp3 | vs_0108_01.mp3 | 108 | l |
flag_char_005.mp3 | vs_0111_01.mp3 | 111 | o |
flag_char_006.mp3 | 0073_01.mp3 | 73 | I |
flag_char_007.mp3 | vs_0100_01.mp3 | 100 | d |
flag_char_008.mp3 | 0060_01.mp3 | 60 | < |
flag_char_009.mp3 | vs_0051_01.mp3 | 51 | 3 |
flag_char_010.mp3 | vs_0117_01.mp3 | 117 | u |
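The last two steps (file name to musicId, musicId to character) are mechanical, so they can be scripted. This sketch assumes, as the table suggests, that the four-digit number in each file name is the musicId; the helper name music_id is hypothetical.

```python
import re

# Matched tracks in segment order, copied from the table above.
matches = [
    "vs_0118_01.mp3", "0048_01.mp3", "0067_01.mp3", "vs_0097_01.mp3",
    "vs_0108_01.mp3", "vs_0111_01.mp3", "0073_01.mp3", "vs_0100_01.mp3",
    "0060_01.mp3", "vs_0051_01.mp3", "vs_0117_01.mp3",
]


def music_id(filename: str) -> int:
    """Extract the four-digit musicId from a name like 'vs_0118_01.mp3'."""
    m = re.search(r"(\d{4})_\d+", filename)
    if m is None:
        raise ValueError(f"no musicId found in {filename!r}")
    return int(m.group(1))


flag = "SEKAI{" + "".join(chr(music_id(name)) for name in matches) + "}"
print(flag)
```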
And at last, the flag has been found: SEKAI{v0CaloId<3u}
The end must justify the means.
-- Written By Zi Nean Teoh