Tags: sekaictf2022 vocaloid_heardle 

Rating:

SEKAI CTF 2022: Vocaloid Heardle

Problem Description

Well, it’s just too usual to hide a flag in stegano, database, cipher, or server. What if we decide to sing it out instead?

Author: pamLELcu

See challenge here: https://ctf.sekai.team/challenges#Vocaloid-Heardle-23

Files Provided

  1. vocaloid_heardle.py
  2. flag.mp3

Step 1: How the flag is generated

After looking at the files, it became clear that vocaloid_heardly.py is the file used to generate flag.mp3:

Let's imagine that the flag is SEKAI{THIS_IS_MY_FIRST_WRITEUP}. Given this flag, the python script:

  1. Removes the enclosing SEKAI{...} to get the inner substring THIS_IS_MY_FIRST_WRITEUP.

  2. Converts each character to unicode:

    ord(' T ') =  84
    ord(' H ') =  72
    ord(' I ') =  73
    ...
    
  3. Gets all musics with musicId equal to the characters' unicodes and downloads it, storing them into the array tracks:

    # returns a random assetbundleName from the list of all musics with musicId equals to the given input mid
    def get_resource(mid):
        return random.choice([i for i in resources if i["musicId"] == mid])["assetbundleName"]
    
    def download(mid):
        resource = get_resource(mid)
        r = requests.get(f"https://storage.sekai.best/sekai-assets/music/short/{resource}_rip/{resource}_short.mp3")
        filename = f"tracks/{mid}.mp3"
        with open(filename, "wb") as f:
            f.write(r.content)
        return mid
    
    tracks = [download(ord(i)) for i in flag]
    
    # here is how tracks look like after execution:
    # tracks = [
    #   'vs_0084_01',       --> musicId = 84 ('T')
    #   '0072_01',          --> musicId = 72 ('H')
    #   'se_0073_01'        --> musicId = 73 ('I')
    #   ...
    # ]
    
  4. Stitches together the given music files using ffmpeg to generate flag.mp3:

    # stage 1
    inputs = sum([["-i", f"tracks/{i}.mp3"] for i in tracks], [])
    # stage 2
    filters = "".join(f"[{i}:a]atrim=end=3,asetpts=PTS-STARTPTS[a{i}];" for i in range(len(tracks))) + \
          "".join(f"[a{i}]" for i in range(len(tracks))) + \
          f"concat=n={len(tracks)}:v=0:a=1[a]"
    # stage 3
    subprocess.run(["ffmpeg"] + inputs + ["-filter_complex", filters, "-map", "[a]", "flag.mp3"])
    
    # stage 1:
    # inputs = [
    #   '-i', 'tracks/vs_0084_01.mp3',
    #   '-i', 'tracks/0071_01.mp3',
    #   '-i', 'tracks/se_0073_01.mp3',
    #   ...
    # ]
    
    # stage 2:
    # filters = '[0:a]atrim=end=3,asetpts=PTS-STARTPTS[a0];[1:a]atrim=end=3,asetpts=PTS-STARTPTS[a1];[2:a]atrim=end=3,asetpts=PTS-STARTPTS[a2]; ...'
    
    # stage 3:
    # ffmpeg -i tracks/vs_0084_01.mp3 -i ... -filter_complex <filters> -map [a] flag.mp3
    

Step 2: Reversing the code

Having understood how the flag is generated, the obvious next step is to somehow figure out (1) which music files make up flag.mp3, (2) get the corresponding musicIds from the file names, and (3) convert the musicIds from unicode to ASCII.

2.1 Understand how FFMPEG works

I needed to Google a bit to figure out what the ffmpeg instruction was doing specifically.

I stumbled upon a stackoverflow post that teaches us to concatenate two audio files via ffmpeg's filter_complex.

Here is a visualization of how ffmpeg commands work for the above example:

how ffmpeg works

ffmpeg accepts some input files via the -i option, then performs a series of filters via the -filter_complex option (which are separated by semicolons), and finally outputs & saves the final output stream [a] as flag.mp3.

You can learn more about how ffmpeg works here.

Diving deeper into filter_complex, I learned what the atrim and asetpts filters do in stage 2:

"atrim=end=3" will stop trimming at 3 seconds "asetpts=PTS-STARTPTS" will specify to start at the first frame

Thus, stage 2 is essentially trimming the first 3 seconds of all the audio files and concatenating them together in the order of input:

# stage 2
filters = "".join(f"[{i}:a]atrim=end=3,asetpts=PTS-STARTPTS[a{i}];" for i in range(len(tracks))) + \
      "".join(f"[a{i}]" for i in range(len(tracks))) + \
      f"concat=n={len(tracks)}:v=0:a=1[a]"

A quick sanity check confirms that our hypothesis is true: flag.mp3 is an audio file that lasts for 33 seconds (multiple of 3), and while playing the audio file we learn that every 3 seconds the music changes.

Thus, the inner substring of the flag must contain 33 / 3 = 11 characters!

2.2 When life gives you ffmpeg, make a bunch of audio files

Why not immediately make use of all the knowledge we've learned about ffmpeg?

A quick google search taught me how to split an audio file into equal segments using ffmpeg:

ffmpeg -i flag.mp3 -f segment -segment_time 3 -c copy flag_char_%03d.mp3

This generated precisely 11 files:

? vocaloid_heardle
┣ ? flag.mp3
┣ ? vocaloid_heardle.py
┗ ? flag_chars
  ┣ ? flag_char_000.mp3
  ┣ ? flag_char_001.mp3
  ┣ ? flag_char_002.mp3
  ┣ ? flag_char_003.mp3
  ┣ ? flag_char_004.mp3
  ┣ ? flag_char_005.mp3
  ┣ ? flag_char_006.mp3
  ┣ ? flag_char_007.mp3
  ┣ ? flag_char_008.mp3
  ┣ ? flag_char_009.mp3
  ┗ ? flag_char_010.mp3

2.3 The end justifies the means

One hour into the challenge and I was determined to solve this CTF. My teammates probably got tired of hearing me repeatedly play flag.mp3. It is about time for me to tell them "I found the flag!!"

Okay, we got the individual audio files. Now we need to know which musicId each of the flag_char_XXX.mp3 corresponds to.

What's the most algorithmically efficient way to do that?

Brute force. Brute force is the way.

And so that's what I did:

  • I scraped and downloaded all 638 music files (>500MB) provided by resources.json:

    # get all possible resourceID from resources.josn
    def scrape():
        with open("resources.json", "r") as f:
            resources = json.load(f)
    
        # download all possible assetBundleNames
        for resource in resources:
            ass = resource["assetbundleName"]
            print("getting asset:", ass)
            r = requests.get(f"https://storage.sekai.best/sekai-assets/music/short/{ass}_rip/{ass}_short.mp3")
    
            # write to a new file
            filename = f"tracks/{ass}.mp3"
            with open(filename, "wb") as f:
                f.write(r.content)
            print(f"wrote to file: tracks/{ass}.mp3")
    
    # sit and wait
    scrape()
    

Now my folder looks like this:

? vocaloid_heardle
┣ ? flag.mp3
┣ ? vocaloid_heardle.py
┣ ? flag_chars
┃ ┣ ? flag_char_000.mp3
┃ ┣ ? flag_char_001.mp3
┃ ┗ ...
┗ ? tracks             # 638 MP3s (>500MB)
  ┣ ? 0001_01.mp3
  ┣ ? 0002_01.mp3
  ┗ ...

Here comes the hard part: figuring out which audio file maps to each of the flag_char_XXX.mp3.

  • Attempt 1: I tried using python difflib's SequenceMatcher, but was not able to find matching audio files. My guess is that while performing ffmpeg the sequence of bytes may not necessarily align perfectly.

    # DID NOT WORK
    from difflib import SequenceMatcher
    
    def compare():
        # loop through all track files
        with open("resources.json", "r") as f:
            resources = json.load(f)
    
        for resource in resources:
            ass = resource["assetbundleName"]
            # use ffmpeg to compare file with all 12 flags
    
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
    
    def brute_force_flag_char(file_name):
        file_to_brute = open(file_name, "rb").read()
        # loop through all track files
        with open("resources.json", "r") as f:
            resources = json.load(f)
    
        for resource in resources:
            ass = resource["assetbundleName"]
            # use ffmpeg to compare file with all 12 flags
            file2 = open(f"trim_tracks_mp3/{ass}_3s.wav", "rb").read()
            sim_ratio = similar(file_to_brute, file2)
            if sim_ratio > 0:
                print(f"{sim_ratio}: {ass}")
    
    # sit and wait
    for i in range(11):
        brute_force_flag_char(f"flags/flag_char_{i:03}.mp3")
    
  • Attempt 2: I then tried using audiodiff, but again it didn't work.

At this point I felt defeated.

Maybe I implemented someting wrongly...

Step 3: Stumbling upon gold

Then, after some more Googling, I stumbled upon gold: Sononym

It is a free software that allows you to find similar sounding samples in a sample collection with simple drag-and-drop UI:

Demo of Sononym Similarity Search

I downloaded the software. Dragged my tracks folder containing 638 audio files into the app. Then dragged flag_char_000.mp3 in as well.

Lo and behold, an instant 99% match on vs_0118_01.mp3 which corresponds to musicId: 118 or chr(118) which is the letter v!

Step 4: Getting the flag

Now quickly repeat this for all 11 characters:

flag charactersmusicId filesunicodeascii
flag_char_000.mp3vs_0118_01.mp3118v
flag_char_001.mp30048_01.mp3480
flag_char_002.mp30067_01.mp367C
flag_char_003.mp3vs_0097_01.mp397a
flag_char_004.mp3vs_0108_01.mp3108l
flag_char_005.mp3vs_0111_01.mp3111o
flag_char_006.mp30073_01.mp373I
flag_char_007.mp3vs_0100_01.mp3100d
flag_char_008.mp30060_01.mp360<
flag_char_009.mp3vs_0051_01.mp3513
flag_char_010.mp3vs_0117_01 .mp3117u

And at last, the flag has been found:

SEKAI{v0CaloId<3u}

The end must justify the means.

-- Written By Zi Nean Teoh

Original writeup (https://github.com/zineanteoh/ctf/blob/main/vocaloid_heardle.md).