Winner Markus (ones 8)

Darkness 2007-11-07 12:00:00 UTC
 
Twilight 2007-11-08 12:00:00 UTC
 
Daylight 2007-11-09 12:00:00 UTC
 
Finish 2007-11-14 12:00:00 UTC

Gene Splicing - Rules

Arrow_down  Download Sample Code

Gene Splicing is the 16th MATLAB Online Programming Contest.

The Problem

This contest is based on the bioinformatics of gene transposition.

We can think of genes as linear sequences stored in a database. Physically, this database is contained in long strands of DNA. Put another way, DNA is the "filing cabinet" in which genes are stored. When needed, these sequences are retrieved from the DNA database and used to create functional proteins. But the DNA database is far from static. Sometimes the physical locations of the genes move around inside the DNA filing cabinet. Similar in concept to disk defragmentation in a computer, this phenomenon is known as transposition. Thus two organisms that are very similar physically might have substantially different DNA because the order of their genes has been scrambled over time. Working out how one genome transforms into another one is of great interest biologically, because it reveals how two organisms are related. Since it is both computationally non-trivial and biologically significant, it is an active area of bioinformatics research.

How does transposition work?

In transposition, a strip of DNA is snipped out of one position of the genome and spliced back in somewhere else. As an extra complication, it has a chance of being flipped front to back during the process.


Image courtesy of the University of Wisconsin-Madison

For the purposes of this contest, you will be given two strands of "DNA": the test sequence and the target sequence. In practice, each strand will be a vector of positive (nonzero) numbers. Your job is to transform the test sequence into the target sequence as efficiently as possible using gene transposition. We can think of the target sequence as the genome of a related individual. The relevant question is: how many transposition events are required to make the two sequences identical?

Details

  • The test sequence and the target sequence are always the same length.
  • It may not be possible to transform the test sequence to be an exact match for the target sequence.

An Example

Imagine we have an input sequence a and a target sequence t.
a = [ 1  5  6  8  5  4  3  2  7 ]
t = [ 1  2  3  4  5  6  6  7  8 ]
There are few things to notice right away. First of all, the starting vector has two fives and one six, whereas the target vector has two sixes and one five. So it's not possible to get a perfect alignment in this case. Second, nothing prevents us from simply returning a as the answer vector b. In that case, the score would be 21, as calculated by this code.
b = a;
score = sum(abs(b-t));
But clearly we can do better. We can see that the fragment [5 4 3 2] can be flipped and inserted earlier in the sequence. Suppose we have the following gene splicing function.
function b = splice(a, len, ai, bi, flip)
%SPLICE  Perform the gene splice
%   b = splice(a, len, ai, bi, flip) where a is the input sequence, 
%   len is the length of the extracted fragment, ai is the fragment's
%   starting index in a, bi is the fragment's starting index in b, and 
%   flip is a boolean flag (true = flip, false = don't flip)

b = zeros(size(a));

aSpan = ai + (1:len) - 1;
bSpan = aSpan - ai + bi;

if flip
    b(bSpan) = fliplr(a(aSpan));
else
    b(bSpan) = a(aSpan);
end
 
a(aSpan) = 0;
b(~b) = a(~~a);
Using our splicing function, we want to take a fragment of length 4 from position 5 of vector a and re-insert it at position 2. Thus
b = splice(a, 4, 5, 2, true)
Now
b = [ 1  2  3  4  5  5  6  8  7 ]
which corresponds to a score of 3. Much better!

Here's what that process looks like algorithmically.

This represents one transposition event. With one more, we can get very close to the target vector. The following code swaps the last two elements. Note that there are a several different commands that will have the same result. For the purposes of scoring, it is unimportant which one we choose.

b = splice(a, 1, 8, 9, false)
The score has now been reduced to 1, and it is impossible to do any better. Thus it has taken us two transpositions to get from the original vector to the best possible match. If we have a transposition budget of two or more, then it doesn't matter if we make our biggest improvement in the first or the second move. If, on the other hand, we are constrained to a budget of exactly one transposition, then clearly we should make the move with the best improvement in score.

The test suite will include many cases in which you are not given enough moves to make a perfect match with the target, so choose your moves accordingly.

Calling Syntax

Your code will be called with three input arguments and one output argument like so:
moves = solver(sequence, target, budget)
The input vector is sequence, the goal is target, and budget is the maximum number of transpositions you are allowed. You want to get the best possible score given a limited number of transpositions. Your answer must be returned in the form of the four column matrix moves in which each row is a set of transposition instructions ordered like so:
[len, ai, bi, flip]
where len, ai, bi, and flip are defined as in the splice function described above. The transpositions will be processed from top to bottom. If there are more than budget rows in the moves matrix, they will be ignored.

Scoring

The overall score for your entry is a combination of three things:
  • Your average score across all the problems
  • How fast your code runs
  • The cyclomatic complexity of your code (more information about this below)
Since all of these are to be minimized, the lowest overall score at the end of the contest wins.

NEW! Late-Stage Twilight

At noon on Monday, we will do a score-neutral swap of the test suite and hide all code for 24 hours. This means we will halt the queue briefly and switch in a new test suite. For the purposes of score continuity, we will then add an offset factor to the score of the leader so that the score for that entry is the same using the old and new test suites. That offset factor will remain constant for the rest of the contest.

Cyclomatic Complexity

Cyclomatic complexity, also known as McCabe complexity, is a measure of the number of independent paths through a program's source code. Typically, as this number gets higher, it becomes more difficult to understand what's happening in a program. This makes it harder to test, modify, and refactor.

Since a file can contain multiple functions, the complexity for any given file is defined as the MAXIMUM complexity of any functions contained in it. A good practice is to keep the complexity for each function below 10, so for this contest your overall score will increase according to the complexity in excess of 10. So there is no complexity penalty for submissions in which all functions have a complexity of 10 or less.

You can measure the cyclomatic (or McCabe) complexity of any function in MATLAB using the "cyc" switch for mlint. Try this, for example:

>> mlint -cyc magic.m
We have also included a getComplexity.m function with the contest distribution to make it easier to find the complexity for your entry.

Time and Size Limits

Your entry will time out and be disqualified if it takes more than 180 seconds (three minutes).

The code is limited in size by the database architecture. The column in our MySQL database that stores the M-code is of type text, which is limited of 65535 characters. Submissions longer than this limit will fail.

Developing Your Entry

The files you need to get started on the contest are in a ZIP-file available on the MATLAB Central File Exchange. If you download and uncompress this ZIP-file, you will have the files necessary to join the contest. The main routine is solver.m. The syntax is described above. Keep in mind that these functions must have the right signature. Variable names are unimportant. To test this function with the test suite in the ZIP-file, run runcontest.m.
 >> runcontest

About the Contest

The contest is divided into three segments. Most of the week will run as usual, with free sharing of code, but the first two days of the contest will hide some of the information about each entry.
  • Darkness (Wed. 12 PM to Thurs. 12 PM). During the first day of the contest, you can't see the code or scores for any of the entries.
  • Twilight (Thurs. 12 PM to Fri. 12 PM). During the second day of the contest, you can see scores but no code.
  • Daylight (Fri. 12 PM to Wed. 12 PM). For the remainder of the contest (with the exception of Late-Stage Twilight) you can see scores and code for all entries.
  • NEW Late-Stage Twilight (Mon. 12 PM to Tues. 12 PM).
All times are Eastern Standard Time, US (GMT-5).

Collaboration and Editing Existing Entries

Once an entry has been submitted, it cannot be changed. However, any entry can be viewed, edited, and resubmitted as a new entry. You are free to view and copy any entry in the queue. If your modification of an existing entry improves its score, then you are the "author" for the purpose of determining the winners of this contest. We encourage you to examine and optimize existing entries.

We also encourage you to discuss your solutions and strategies with others. You can do this by posting to the thread that we've started from our newsreader.

Fine Print

The allowable functions are those contained in the basic MATLAB package available in $MATLAB/toolbox/matlab, where $MATLAB is the root MATLAB directory. Functions from other toolboxes will not be available. Entries will be tested against MATLAB version 7.5 (R2007b).

The following are prohibited:

  • MEX-files
  • Java commands or object creation
  • eval, feval, inline, function handles, etc.
  • Shell escape such as !, dos, unix
  • Handle Graphics commands
  • ActiveX commands
  • File I/O commands
  • Debugging commands
  • Printing commands
  • Simulink commands
  • Benchmark commands such as tic, toc, flops, clock, pause
  • error, clear, persistent

Hacking

Entries that compromise the contest machinery are no longer allowed. We've all learned some interesting MATLAB tricks in the past by contestants figuring out how to pass information from one entry to the next, or finding clever ways to execute disallowed functions, but it's too hard for the few of us running the contest to keep ahead of the group intelligence. In short, out of consideration for everyone participating in the contest, we ask that you not abuse the system.

Extraction of puzzles in the test suite by manipulating the score, runtime, or error conditions is also forbidden. In the small scale, this has been an element of many past contests, but in the Blockbuster Contest, Alan Chalker turned this into a science. Tuning the entry to the contest test suite via tweak bombing or other techniques is still allowed, but we ask that you not overwhelm the queue.

FAQ

Check our FAQ for answers to frequently asked questions about the contest.

External Links

About named visibility periods

Contests are divided into segments where some or all of the scores and code may be hidden for some users. Here are the segments for this contest:

  • Darkness - You can't see the code or scores for any of the entries.
  • Twilight - You can see scores but no code.
  • Daylight - You can see scores and code for all entries.
  • Finish - Contest end time.