Darkness 2000-03-20 00:00:00 UTC
 
Twilight 2000-03-21 00:00:00 UTC
 
Daylight 2000-03-22 00:00:00 UTC
 
Finish 2000-03-27 00:00:00 UTC

Gene Sequencing - Winners

In the recent week long MATLAB programming contest, about 800 entries went through the contest queue, resulting in three intermediate prize winners and one final winner.

In addition to looking at this analysis (which focuses on the techniques used in the winning entry), you might want to look at the discussion areas to see other interesting threads related to the MATLAB Contest.

The Scoring Function

We wanted the scoring function to strongly favor sequence length, and to consider time factors only when separating entries with similar performance. The scoring function looked like:

 
score = cputime + totalscore;
  • "cputime" is the runtime of your function, measured with CPUTIME.
  • "totalscore" is the sum of the lengths of the segments your function returned
Since the sequence lengths in "totalscore" were orders of magnitude higher than the cputimes, optimizing the sequence length much more significantly affected each entry's ranking.

Instead of displaying "totalscore" on our Rankings page, we displayed "result", a normalized measure how how short your sequence was. "result" was computed with:

 
result = totalscore/Our_Length*100;
  • "Our_Length" is the sum of the lengths of the original segments used to create the fragments given to your function
"result" then returns a normalized value letting you know how long your result was with respect to the segment we used to create the fragments. Scores of less than 100 indicate a result shorter than the original sequence, while scores of over 100 indicate a longer sequence. Normalizing the score made it easier to judge how close to matching the original sequence an entry was, since a score above 100 meant that there was definitely room for improvement. And to some extent, this allowed us to keep you from guessing on the lengths of the sequences in the test suite and customizing your code for those lengths.

(As an aside, our experience with previous contests shows that hiding details like this doesn't always prevent the entries from "evolving" to be suited to the specific trials presented. However, it's not clear if that happened here, since the best entries did fairly well across many different types of tests.)

The Test Suite

A MAT-file containing the segments and the original piece can be found here.

This MAT file contains two cell arrays. The one_answer{i} contains the original sequence used to create the fragments stored in testsuite{i}. There were 29 tests, with original lengths ranging from 10 up to 23456 characters.

Overview of the Techniques Used in the Top Entries

The majority of the contest entries used some type of brute force method to solve this problem, which is generally thought to be NP complete.

This typically meant comparing all the segments to each other to determine how to make the best match. While time wasn't weighted nearly as strongly as the scoring function, time was initially important because it was difficult to get a slow brute force algorithm to pass the testsuite without timing out.

Here were a few common ideas we saw in the top entries; we've pointed out the code related to these ideas further below where we've "dissected" the code.

  1. Remove identical segments

    To avoid unnecessary work, many entries first removed duplicate fragments. One way of doing this is to use SORTROWS followed by DIFF to remove duplicates. Since SORTROWS operates on all columns of the data set, this does more work than is necessary and takes too long. A few entries "inlined" the SORTROWS algorithm, but this still performed unnecessary work.

    A clever idea involved removing the SORTROWS algorithm entirely. All that's needed is to remove identical segments; we don't need to sort all the rows. With this in mind, a simple hash was performed, which mapped the n-space SORTROWS and DIFF problem (where n is the length of a segment) into a 1-d SORT and DIFF problem. Because only one column of data is sorted, this saves time. This mapping was done using a vector of random numbers; while this wasn't 100% foolproof, the likelihood of random numbers not producing the desired behavior is small and this method worked well.

    Here's that mapping as used in the winning entry:

     
       lSegs = size(Segments, 2);
     
       [SortedHash Order]  = sort(Segments * randn(lSegs, 1));
       startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';
      
  2. Match the segments

    Matching the segments is where most of the algorithms spent most of their time, matching up every pair with every possible overlap; a natural way of matching the segments is to use STRCMP/STRNCMP and FIND.

    As before, a simple hash was used to quickly perform the comparison, saving a significant amount of CPU time.

  3. Update the information

    As the segments were combined, the running length of the final output grew, but the rate at which it grew couldn't be determined ahead of time. This presented a challenge in efficiently storing the result as it's formed.

    A common approach to solving this problem used cell arrays. For problems with large segments, one successful method was to track the information needed to combine the segments without actually doing the combining. Only at the end of the process were the segments combined into the final result. This turned out to be especially efficient with larger segments. The final winning entry used this method for longer segments and would combine segments for shorter segments; an IF statement was used to compare the segment length and switch between methods as needed.

Let's look at the winning entry.

The Winning Entry

The first puzzle related to analyzing the winning entry was determining what GFD stood for! Francois Glineur, our winner, told us that GFD stood for Gene-Francois-Deborah (the latter being his wife's first name).

The code for the winning entry is below. In some places, we've reformatted the code with "..." and newlines to make it more readable. The code begins with:

 
 function Sequence = GFD6(Segments)
 
 if size(Segments, 1) > 200
This IF statement is the cutoff swich statement to determine which algorithm to use for accumulating the final result.

Next we had:

 
   lSegs = size(Segments, 2);
   
   [SortedHash Order]  = sort(Segments * randn(lSegs, 1));
   startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';
These lines remove the duplicate fragments, according to the hashing-type algorithm we discussed earlier. Notice that the sort is only performed on a 1-d vector, instead of on an lSegs-d array. The multiplication 'Segments * randn(lSegs,1)' maps lSegs dimensions into 1 dimension.

 
   endSegs = startSegs;
   
   nSegs = length(startSegs);
These lines set up some work vectors to be carried around for updating information.

Next we need to set up a cell array to carry information about the combinations needed to produce our final result:

 
   Description = cell(nSegs, 1);
   for index = startSegs
     Description{index} = index;
   end
While this code works well, a minor problem in this code is that this pre-allocated size may not be large enough to hold the final result.

Below we begin the routine for finding which segments have the best (greatest) overlap:

 
   matchLength = lSegs-1; leftSegs = nSegs;
Since there are no identical fragments, the longest match possible is lSegs-1. Now we begin to loop over the possible shifts to find the best match:

 
   while matchLength > 0
     RandWeight = randn(matchLength, 1);
     LeftHash = Segments(endSegs, lSegs-matchLength+1:end) * RandWeight;
     RightHash = (Segments(startSegs, 1:matchLength) * RandWeight)';
The RandWeight vector is the vector which maps the lSegs matching problem into a 1-D problem, similar to what was done to remove duplicate entries.

Next we'll loop over the uncombined segments and check to see if there are any matches:

 
     indexLeft = 1; 
     while indexLeft <= leftSegs 
       Match = findstr(LeftHash(indexLeft), RightHash);
       if Match
         indexRight = Match(1);
After checking for matches, we check for a special case:

 
         if (indexRight == indexLeft) & (length(Match) > 1) 
                  % Rare case, happens e.g. with strvcat('AAA','AAG','AGC') 
           indexRight = Match(2); 
         end
Did you remember that we're still inside an IF statement here? In this section of the IF, we don't actually combine the segments when a match is found. Instead we store information about how to combine the segments and do the combination at the end of the routine.

The following code updates the Description array to record the combining pairs and the amount by which they're shifted; it also updates the related working vectors:

 
         if indexRight ~= indexLeft
           Description{startSegs(indexLeft)} = cat(2,...
               Description{startSegs(indexLeft)}, lSegs-matchLength, ...
               Description{startSegs(indexRight)}); 
           startSegs(indexRight) = startSegs(indexLeft); 
           startSegs(indexLeft) = []; endSegs(indexLeft) = [];
           leftSegs = leftSegs - 1;
           RightHash(indexRight) = RightHash(indexLeft);
           RightHash(indexLeft) = []; LeftHash(indexLeft) = [];
If there's nothing left to track, we break out of this routine:

 
           if leftSegs == 1 
             matchLength = 1;
             break;   
           end
         else
Here's the rest of the "if indexRight ~= indexLeft" conditional:

 
           indexLeft = indexLeft + 1;
         end
       else
         indexLeft = indexLeft + 1;
And now we end the "if Match" conditional:

 
       end
Now we can end the "while indexLeft <= leftSegs" loop:

 
     end
One more step to reduce matchLength:

 
     matchLength = matchLength - 1;
Now we can end the "while matchLength > 0" conditional.

 
   end
But we're not finished yet! Now we need to combine the segments according to the Description cell array created. We'll start by attaching all nonmatching fragments to the end of the sequence:

 
   if length(startSegs) > 1 
        % If disjoint segments in the end, we have to merge them !
     for index = 2:length(startSegs)
       Description{startSegs(1)} = [Description{startSegs(1)},...
                                    lSegs, Description{startSegs(index)}];
     end
     startSegs = startSegs(1);
   end
Now we actually combine the segments - check out the line used here:

 
   Sequence(repmat(cumsum(cat(2, 0, ...
                              Description{startSegs}(2:2:2*nSegs-2)))',...
                              1, lSegs) + repmat(1:lSegs, nSegs, 1))...
            = Segments(Description{startSegs}(1:2:2*nSegs-1), :);
Pretty impressive! This code snippet is a great example of vectorizing. REPMAT, CUMSUM, and the colon operator are all used in this line, making for a very efficient implementation.

Keeping in mind we're still inside an IF conditional, the above section of code only executes for segments of length greater than 200. The following block of code beginning with the ELSE describes the algorithm for used for shorter segments; it combines segments as soon as a match is found:

 
 else
   [nSegs, lSegs] = size(Segments);
   
   [SortedHash Order]  = sort(Segments * randn(lSegs,1))  ;
   Segments = Segments(Order(diff(cat(1,SortedHash,0)) ~= 0), :);
   [nSegs, lSegs] = size(Segments);
   
   theSegments = cell(nSegs, 1);
   theSegs = 1:nSegs;
   for index = theSegs,
     theSegments{index} = Segments(index, :);
   end
   copySegments = theSegments;
   
   matchLength = lSegs-1;
   while matchLength > 0
     Segments(:, 1) = [];
     for leftSeg = theSegs
The following line is a very computationally expensive line. If you profile this function, you'll see that a majority of the time spent in this routine is spend on this line:

 
       Match = find(strncmp(Segments(leftSeg, :), copySegments, matchLength));
 
 
       if Match
         Select = Match(1);
         rightSeg = theSegs(Select);
         if rightSeg == leftSeg 
           if length(Match) > 1 % Rare case
             temp = Match(Match ~= Select);
             Select = temp(1);
             rightSeg = theSegs(Select); 
             
Here's one of the code sections that does the actual combining:

 
             theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
                                         theSegments{rightSeg}(1+matchLength:end));
             i = find(theSegs == leftSeg);
             theSegs(i) = [];
             if length(theSegs) == 1 
               Sequence = theSegments{theSegs};
               return;
             end
             copySegments(Select) = copySegments(i);
             copySegments(i) = [];
           end
This section of code is then followed by:

 
         else
           theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ...
                                       theSegments{rightSeg}(1+matchLength:end));
           i = find(theSegs == leftSeg);
           theSegs(i) = [];
           if length(theSegs) == 1 
             Sequence = theSegments{theSegs};
             return;
           end
           copySegments(Select) = copySegments(i);
           copySegments(i) = [];
         end
       end
     end
     matchLength = matchLength - 1;
   end
   Sequence = [theSegments{theSegs}];
 end
Take a look at these two sections of code. They have a lot in common; their structure is very similar. It'd be interesting to see if it's possible to remove any redundancies between the code to improve the performance yet again.

We hope you've enjoyed this analysis of the code. If you have more questions on it, you may want to contact Francois, his contact information is listed next to his winning entry.