**
Winner Francois Glineur
**
(GFD6e)

In the recent week long MATLAB programming contest, about 800 entries went through the contest queue, resulting in three intermediate prize winners and one final winner.

In addition to looking at this analysis (which focuses on the techniques used in the winning entry), you might want to look at the discussion areas to see other interesting threads related to the MATLAB Contest.

score = cputime + totalscore;

- "cputime" is the runtime of your function, measured with CPUTIME.
- "totalscore" is the sum of the lengths of the segments your function returned

Instead of displaying "totalscore" on our Rankings page, we displayed "result", a normalized measure how how short your sequence was. "result" was computed with:

result = totalscore/Our_Length*100;

- "Our_Length" is the sum of the lengths of the original segments used to create the fragments given to your function

(As an aside, our experience with previous contests shows that hiding details like this doesn't always prevent the entries from "evolving" to be suited to the specific trials presented. However, it's not clear if that happened here, since the best entries did fairly well across many different types of tests.)

This MAT file contains two cell arrays. The one_answer{i} contains the original sequence used to create the fragments stored in testsuite{i}. There were 29 tests, with original lengths ranging from 10 up to 23456 characters.

This typically meant comparing all the segments to each other to determine how to make the best match. While time wasn't weighted nearly as strongly as the scoring function, time was initially important because it was difficult to get a slow brute force algorithm to pass the testsuite without timing out.

Here were a few common ideas we saw in the top entries; we've pointed out the code related to these ideas further below where we've "dissected" the code.

#### Remove identical segments

To avoid unnecessary work, many entries first removed duplicate fragments. One way of doing this is to use SORTROWS followed by DIFF to remove duplicates. Since SORTROWS operates on all columns of the data set, this does more work than is necessary and takes too long. A few entries "inlined" the SORTROWS algorithm, but this still performed unnecessary work.

A clever idea involved removing the SORTROWS algorithm entirely. All that's needed is to remove identical segments; we don't need to sort all the rows. With this in mind, a simple hash was performed, which mapped the n-space SORTROWS and DIFF problem (where n is the length of a segment) into a 1-d SORT and DIFF problem. Because only one column of data is sorted, this saves time. This mapping was done using a vector of random numbers; while this wasn't 100% foolproof, the likelihood of random numbers not producing the desired behavior is small and this method worked well.

Here's that mapping as used in the winning entry:

lSegs = size(Segments, 2); [SortedHash Order] = sort(Segments * randn(lSegs, 1)); startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';

#### Match the segments

Matching the segments is where most of the algorithms spent most of their time, matching up every pair with every possible overlap; a natural way of matching the segments is to use STRCMP/STRNCMP and FIND.

As before, a simple hash was used to quickly perform the comparison, saving a significant amount of CPU time.

#### Update the information

As the segments were combined, the running length of the final output grew, but the rate at which it grew couldn't be determined ahead of time. This presented a challenge in efficiently storing the result as it's formed.

A common approach to solving this problem used cell arrays. For problems with large segments, one successful method was to track the information needed to combine the segments without actually doing the combining. Only at the end of the process were the segments combined into the final result. This turned out to be especially efficient with larger segments. The final winning entry used this method for longer segments and would combine segments for shorter segments; an IF statement was used to compare the segment length and switch between methods as needed.

The code for the winning entry is below. In some places, we've reformatted the code with "..." and newlines to make it more readable. The code begins with:

function Sequence = GFD6(Segments) if size(Segments, 1) > 200This IF statement is the cutoff swich statement to determine which algorithm to use for accumulating the final result.

Next we had:

lSegs = size(Segments, 2); [SortedHash Order] = sort(Segments * randn(lSegs, 1)); startSegs = Order(diff(cat(1, SortedHash, 0)) ~= 0)';These lines remove the duplicate fragments, according to the hashing-type algorithm we discussed earlier. Notice that the sort is only performed on a 1-d vector, instead of on an lSegs-d array. The multiplication 'Segments * randn(lSegs,1)' maps lSegs dimensions into 1 dimension.

endSegs = startSegs; nSegs = length(startSegs);These lines set up some work vectors to be carried around for updating information.

Next we need to set up a cell array to carry information about the combinations needed to produce our final result:

Description = cell(nSegs, 1); for index = startSegs Description{index} = index; endWhile this code works well, a minor problem in this code is that this pre-allocated size may not be large enough to hold the final result.

Below we begin the routine for finding which segments have the best (greatest) overlap:

matchLength = lSegs-1; leftSegs = nSegs;Since there are no identical fragments, the longest match possible is lSegs-1. Now we begin to loop over the possible shifts to find the best match:

while matchLength > 0 RandWeight = randn(matchLength, 1); LeftHash = Segments(endSegs, lSegs-matchLength+1:end) * RandWeight; RightHash = (Segments(startSegs, 1:matchLength) * RandWeight)';The RandWeight vector is the vector which maps the lSegs matching problem into a 1-D problem, similar to what was done to remove duplicate entries.

Next we'll loop over the uncombined segments and check to see if there are any matches:

indexLeft = 1; while indexLeft <= leftSegs Match = findstr(LeftHash(indexLeft), RightHash); if Match indexRight = Match(1);After checking for matches, we check for a special case:

if (indexRight == indexLeft) & (length(Match) > 1) % Rare case, happens e.g. with strvcat('AAA','AAG','AGC') indexRight = Match(2); endDid you remember that we're still inside an IF statement here? In this section of the IF, we don't actually combine the segments when a match is found. Instead we store information about how to combine the segments and do the combination at the end of the routine.

The following code updates the Description array to record the combining pairs and the amount by which they're shifted; it also updates the related working vectors:

if indexRight ~= indexLeft Description{startSegs(indexLeft)} = cat(2,... Description{startSegs(indexLeft)}, lSegs-matchLength, ... Description{startSegs(indexRight)}); startSegs(indexRight) = startSegs(indexLeft); startSegs(indexLeft) = []; endSegs(indexLeft) = []; leftSegs = leftSegs - 1; RightHash(indexRight) = RightHash(indexLeft); RightHash(indexLeft) = []; LeftHash(indexLeft) = [];If there's nothing left to track, we break out of this routine:

if leftSegs == 1 matchLength = 1; break; end elseHere's the rest of the "if indexRight ~= indexLeft" conditional:

indexLeft = indexLeft + 1; end else indexLeft = indexLeft + 1;And now we end the "if Match" conditional:

endNow we can end the "while indexLeft <= leftSegs" loop:

endOne more step to reduce matchLength:

matchLength = matchLength - 1;Now we can end the "while matchLength > 0" conditional.

endBut we're not finished yet! Now we need to combine the segments according to the Description cell array created. We'll start by attaching all nonmatching fragments to the end of the sequence:

if length(startSegs) > 1 % If disjoint segments in the end, we have to merge them ! for index = 2:length(startSegs) Description{startSegs(1)} = [Description{startSegs(1)},... lSegs, Description{startSegs(index)}]; end startSegs = startSegs(1); endNow we actually combine the segments - check out the line used here:

Sequence(repmat(cumsum(cat(2, 0, ... Description{startSegs}(2:2:2*nSegs-2)))',... 1, lSegs) + repmat(1:lSegs, nSegs, 1))... = Segments(Description{startSegs}(1:2:2*nSegs-1), :);Pretty impressive! This code snippet is a great example of vectorizing. REPMAT, CUMSUM, and the colon operator are all used in this line, making for a very efficient implementation.

Keeping in mind we're still inside an IF conditional, the above section of code only executes for segments of length greater than 200. The following block of code beginning with the ELSE describes the algorithm for used for shorter segments; it combines segments as soon as a match is found:

else [nSegs, lSegs] = size(Segments); [SortedHash Order] = sort(Segments * randn(lSegs,1)) ; Segments = Segments(Order(diff(cat(1,SortedHash,0)) ~= 0), :); [nSegs, lSegs] = size(Segments); theSegments = cell(nSegs, 1); theSegs = 1:nSegs; for index = theSegs, theSegments{index} = Segments(index, :); end copySegments = theSegments; matchLength = lSegs-1; while matchLength > 0 Segments(:, 1) = []; for leftSeg = theSegsThe following line is a very computationally expensive line. If you profile this function, you'll see that a majority of the time spent in this routine is spend on this line:

Match = find(strncmp(Segments(leftSeg, :), copySegments, matchLength)); if Match Select = Match(1); rightSeg = theSegs(Select); if rightSeg == leftSeg if length(Match) > 1 % Rare case temp = Match(Match ~= Select); Select = temp(1); rightSeg = theSegs(Select);Here's one of the code sections that does the actual combining:

theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ... theSegments{rightSeg}(1+matchLength:end)); i = find(theSegs == leftSeg); theSegs(i) = []; if length(theSegs) == 1 Sequence = theSegments{theSegs}; return; end copySegments(Select) = copySegments(i); copySegments(i) = []; endThis section of code is then followed by:

else theSegments{rightSeg} = cat(2,theSegments{leftSeg}, ... theSegments{rightSeg}(1+matchLength:end)); i = find(theSegs == leftSeg); theSegs(i) = []; if length(theSegs) == 1 Sequence = theSegments{theSegs}; return; end copySegments(Select) = copySegments(i); copySegments(i) = []; end end end matchLength = matchLength - 1; end Sequence = [theSegments{theSegs}]; endTake a look at these two sections of code. They have a lot in common; their structure is very similar. It'd be interesting to see if it's possible to remove any redundancies between the code to improve the performance yet again.

We hope you've enjoyed this analysis of the code. If you have more questions on it, you may want to contact Francois, his contact information is listed next to his winning entry.