P
R
O
G
R
A
M
M
I
N
G
 
C
O
M
P
E
T
I
T
I
O
N
 

Gene Sequencing

Task #13
Advanced
14 Points

Task Description
You are working on the human genome project, and your team is closing in on the last few base pairs needed to complete the project and go on to certain glory and perhaps a Nobel prize. Researchers on your team have managed to obtain several sets of segments needed to sequence the last remaining gene, but your help is urgently needed to lead the effort.
     Your job is to write a program which can reconstruct the gene sequence using the given fragmentary segments of the larger gene. For each set of segments given, you must devise an algorithm to sequence them in the correct order. If you succeed, your name will appear in all biology textbooks from this day forward. If you are too slow, someone else will take your place. Good luck.

source: The MATLAB Programming Contest - http://www.mathworks.com/contest/sequence.cgi/rules.html

Program Input
The gene segments given to you have been chemically snipped from a single gene and sequenced individually. Your job is to reassemble them jigsaw-style into the most likely single sequence they could have come from. The gene segments will be represented in the input file as a character (or string) matrix.
     Only four letters will be used. Each stands for one of the nitrogenous bases in DNA: A (adenine), C (cytosine), G (guanine), and T (thymine). The segments in any single test sequence have the same length (since they're passed as a char matrix). But that length, as well as the number of segments, will vary across the many tests included in our test suite - in other words, your program will be given inputs of different, unknown sizes. In one case, we might have 5 segments of length 4. Another test case might provide 10 segments of length 12. For our purposes we will limit all test cases to a maximum of 10 segments with a maximum segment length of 16.
     To provide a sense of scale, the entire human genome is approximately three billion base pairs long.
     The first line of the input file indicates the total number of segment sets. Each set begins with a line of two integers specifying the number of segments N and the segment size Z, respectively. The next N lines will be strings of length Z.
     

2
5 4
TCGG
GCAG
ATCG
CAGC
GATC
4 16
AGCTTGCTGAAACGTA
CTTGCTGAAACGTATC
TATCGATGCATCGATC
AAACGTATCGATGCAT

Program Output
Your program should determine the shortest gene that can be made from these pieces. The rules for assembling a sequence are: (1) the sequence must use all the segments (and only the segments) provided, (2) you cannot flip any of the segments, (3) the result must be a one row char array, (4) the best answer is the shortest answer, (5) speed counts. For example, if we slide around the above segments we can get:

  TCGG
     GCAG
 ATCG
      CAGC
GATC
==========
GATCGGCAGC
This solution has length 10. It should be obvious that there will be many solutions for each set. The easiest solution is just to stick all the pieces placed end to end. While this sequence is valid, it's unlikely to be a faithful reconstruction of the original gene and would therefore be penalized because of its length. The team with the shortest solution will threrefor earn a 3 point bonus. Keep in mind that length isn't the only criteria. Your entry will be judged also on its speed. The program must execute in less than 30 seconds on a K6 or Pentium at 200MHz or it will be marked incorrect. The program output file should show each solution on a separate line.

GATCGGCAGC
AGCTTGCTGAAACGTATCGATGCATCGATC