CTFtime.org / CTFZone 2018 Quals / Help Mars! / Writeup

> Hello!
> The Martians need your help. They are in contact with the H5N1 virus.
> We know that there is a universal vaccine (locus HW306977) on our planet. Find the substances on their planet that can be used to synthesize the vaccine. A large sample database is at your disposal.
> mars_dna_samples.zip.
> Your task is to select the right combination of samples (recipe). Result is the shortest (lowest count of samples used to synthesize the vaccine). If you find more than one shortest recipe - select the one, which has the longest code in each sample from start to end.
>
> Example:
> Code: 123456
> Samples (id,code):
>
> 1,4
> 2,6
> 3,12
> 4,34
> 5,56
> 6,45
> 7,123
>
> Available combinations
>
> 12-34-56
> 123-45-6
> 123-4-56
>
> Solving: 123 + 45 + 6
> Result: 7,6,2
>
> Flag is ctfzone{md5(Result)}

## Solution in short
Find the vaccine sequence online, store the samples in a Trie structure and use DP (Memoization) to find the solution (a fast algorithm is required in this problem).

## Finding the vaccine sequence
This is the OSINT part. There is a patent about this: [link](https://patents.google.com/patent/US9555093B2/en) with a downloadable pdf and a link with the content as [normal text](http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/9555093). Use the pdf to get a well formatted version and the other page to extract the string (this was a bit annoying to filter this long messed up string..) .

The sequences are at the end. Sequence 3 is the required one (1707 characters).

## Trie structure
In short: We have a tree where each node has exactly for children (one for each base ATGC in a dna sequence). Each string can be seen as a path from the root downwards using the i'th character in the string to choose the appropriate child at level i. The node where a string ends is annotated, in our case with the id of the string (sample). The trie is constructed by inserting all samples one by one (for each sample, descend the tree according to the string starting at the root, extend the tree if necessary and annotate the final node).

For a detailed description about Tries, use your OSINT skills. E.g., [Wikipedia](https://en.wikipedia.org/wiki/Trie)

## Finding the (best) solution
*We call the expected solution simply 'the best solution' . All other 'solutions' are valid combinations to form the sequence but need more samples or have shorter samples in the beginning -> they are 'worse'*

Let's say we want to build the string s = s1 s2 s3 ... sn.

Every possibility to build the string s is a combination of a prefix p = s1 s2 ... si and a suffix q = si+1 si+2 ... sn, where p is a sample and q is a string that can be built by using at least 0 samples (q can be empty, exactly one sample or a combination of samples). Assume that we already know the best solution to build a specific q, the best solution to build s for a given q is simply by using p first followed the solution for q. We get the best solution for s for trying all i's (splitting position, therefore responsible for p and q) that are valid (p is a sample and q can be built). Save this solution for later.

To find all valid i's we descend our previously built tree starting at the root with the character s1. Whenever we hit an annotated node, the path so far is p and the level is i.

After completing the computation for s, extend s by one character to the left s' = s0 s1 s2 s3 ... sn and do the same again. By applying the algorithm for increasing suffixes of the complete string, we ensure to always have the best solution for q already computed earlier.

## Result
The computation takes less than a second to find the solution
```
10532,111808,9066,64465,34874,89064,137,4547,25145,10824,122193,6169,10895,40896,187356,4721,1732,46101,39262,103437,8324,
9661,61254,17561,17562,635,183688,24275,24285,172470,54159,90283,812,145888,26216,67897,24483,60020,8749,7176,7233,225944
```

## Complexity
* Constructing the trie is done in O(Length of all samples concatenated)
* For a single string s, iterate over all splitting positions i: O(Length of s) = O(Length of final string)
* For all suffixes (and therefore to get the result): O(number of suffixes * O(Length of suffix)) = O(n^2) where n is the length of the final string (1707)

## The code
(for completeness the whole code. The real stuff is done in the short solve2() method)
```
#include <bits/stdc++.h>

using namespace std;
using ll = long long;
using vl = vector<ll>;
#define mp make_pair
#define pb push_back

#define FOR(i,a,b) for(int i = (a); i < (b); i++)

vector<pair<string,int>> v;

int counter = 1;

vector<vector<int>> sols;

vector<int> current;

map<char, int> m = {
{'T', 0},
{'G', 1},
{'A', 2},
{'C', 3}
};

struct Node {
vector<Node*> n;
Node() : n(4) {}
int id = -1;
};

Node * head = new Node();

// Insert s into the trie
void insert(string s, int id) {
Node * h = head;
int i = 0;
while (i < s.size()) {
int idx = m[s[i]];
if (h->n[idx] == nullptr) {
h->n[idx] = new Node();
}
h = h->n[idx];
i++;
}
h->id = id;
}

vector<vector<int>> besten;

void solve2(string rem) {
Node * h = head;
int i = 0;
// descend the trie
while (h != nullptr && i < rem.size()) {
int idx = m[rem[i]];
h = h->n[idx];
i++;
if (h != nullptr && h->id != -1) { // annotated node
int ol = besten[rem.size() - i].size();
if (ol +1 <= besten[rem.size()].size()) {
// same length should implicitly be better because longer substring in first element
besten[rem.size()] = besten[rem.size() - i];
besten[rem.size()].pb(h->id);
}
}
}
}
int main() {
string target;
cin >> target; // final string wanted
// the best solution found so far for all lengths of suffixes.
// currently really really bad (100000 samples 'needed')
besten.resize(target.size() + 1, vector<int>(100000));
ifstream seq;
// this file has one line per sample: string <TAB> id
// (this is for historical reasons ...)
seq.open("samples");
string s;
int id;
while (seq >> s >> id) {
insert(s, id); // insert sample with id
}
// an empty string is easy
besten[0] = vector<int>(0);
// now go over all suffixes
for (int i = target.size() - 1; i>= 0; i--) {
solve2(target.substr(i));
}

bool first = true;
for (int i = besten[target.size()].size() - 1; i >= 0; i--) {
if (first) first = false;
else cout << ",";
cout << besten[target.size()][i];
}
}
```

Help Mars!

Comments

Sign in with