Darkness 2000-03-20 00:00:00 UTC
 
Twilight 2000-03-21 00:00:00 UTC
 
Daylight 2000-03-22 00:00:00 UTC
 
Finish 2000-03-27 00:00:00 UTC

Gene Sequencing - Rules

Arrow_down  Download Sample Code

The Challenge

You are working on the human genome project, and your team is closing in on the last few base pairs needed to complete the project and go on to certain glory and perhaps a Nobel prize. Researchers on your team have managed to obtain several sets of segments needed to sequence the last remaining gene, but your help is urgently needed to lead the effort. For each set of segments given, you must devise an algorithm to sequence them in the correct order. If you succeed, your name will appear in all biology textbooks from this day forward. If you are too slow, someone else will take your place. Good luck.
 
 

Problem

Your job is to write a MATLAB M-file function which can reconstruct the gene sequence using the given fragmentary segments of the larger gene. Note: While this problem is posed in the spirit of a real-life challenge, the problem has been modified to suit a MATLAB contest rather than test your knowledge of gene sequencing techniques. Make sure to look over the rules below!
 
 

Gene Segments

The gene segments given to you have been chemically snipped from a single gene and sequenced individually. Your job is to reassemble them jigsaw-style into the most likely single sequence they could have come from. The gene segments will be presented to your function as a char (or string) matrix. For example, an input could look like:
>> Segments
ans =
 TCGG
 GCAG
 ATCG
 CAGC
 GATC
Only four letters will be used. Each stands for one of the nitrogenous bases in DNA: A (adenine), C (cytosine), G (guanine), and T (thymine). The segments in any single test given to your function have the same length (since they're passed as a char matrix). But that length, as well as the number of segments, will vary across the many tests included in our test suite - in other words, your function will be given inputs of different, unknown sizes. In this case, we have 5 segments of length 4. Another test case might provide 20 segments of length 12 (To provide a sense of scale, the entire human genome is approximately three billion base pairs long).

Your function should return the shortest gene that can be made from these pieces. The rules for assembling a sequence are:

  • The sequence must use all the segments (and only the segments) provided
  • You cannot flip any of the segments.
  • The result must be a one row char array.
  • The best answer is the shortest answer.
  • Speed counts.


For example, if we slide around the above segments we can get:
 
 

         TCGG
    GCAG
        ATCG
  CAGC
       GATC
==============.
  CAGCAGATCGG


This gives a final sequence of length 11. Another possible solution is:
 
 

    TCGG
       GCAG
   ATCG
        CAGC
  GATC
==============.
  GATCGGCAGC


This solution has length 10. Both are consistent with the segments provided, but according to the rules of the contest, the second solution would receive a better score, since it's shorter. The easiest solution is:
 
 

TCGG
    GCAG
        ATCG
            CAGC
                GATC
====================.
TCGGGCAGATCGCAGCGATC


This is just all the pieces placed end to end. While this sequence is valid, it's unlikely to be a faithful reconstruction of the original gene and would therefore be penalized because of its length.

Keep in mind that length isn't the only criteria. Your entry will be judged also on its speed. So while a shorter result is better, entries that take a long time will also be penalized.
 
 

Function header

In MATLAB syntax, the function header for your solution should look like this:
function Sequence = mygene(Segments)
Your function should return a 1 row char array representing its solution.

Developing and Testing

To help in developing your sequencing program, we've created a file you can use for running trials of your entry.  Just click on the link below to download it:
    runtrials.m: This function allows you to run your own trials of the contest.

Also, if you click on Submit an entry at the top left of this page, you'll see one example of a very simple entry.
 
Example:
Download the files above and place it on your MATLAB path, then write your contest entry.  Here we're assuming you're calling it testsoln.

Now you can test your entry with:

%   Run 10 trials using testsoln.m
Result = runtrials('testsoln',10);
Result will be a matrix with your score and runtimes.  The example here uses a fixed input size to your entry, so you may want to experiment with different size inputs to see how your code performs.  The contest test suite contains a fixed (nonrandom) set of inputs of various sizes.

Judging Criteria

There are two criteria for the judging of this contest:

    length, the length of the sequence that you produced (shorter is better)
    time, the execution time of your algorithm (faster is better)
We will run your algorithm through a suite of tests. To receive a better score, you should minimize both the length of the sequence you return and the execution time of your function:

    score = (alpha * length) + (beta * time)

where alpha and beta are constant scaling factors.

The winning entry is the one with the lowest average score over the test suite.

The MATLAB Contest server will also place a cap on the maximum time allowed for each entry over the test suite. Entries which exceed 180 seconds for the test suite will time out and fail.

The test suite will be made available after the contest has been completed. You might also want to look at the Contest FAQ for some commonly asked questions which you might have. If your questions aren't answered there, you can e-mail contest@mathworks.com and we'll get back to you as soon as possible.
 
 

Optimizing your code

Try using the MATLAB profile command to help you optimize your code. For more information on the MATLAB profiler, type helpwin profile at the command line or doc profile to read the online documentation. The profiler has been enhanced significantly in MATLAB 5.3 (R11).

Deadline

The contest will close on Friday, March 24, 2000 at 5 PM EDT (21:00 GMT).

At that time, the contest queue will be closed to new entries. All remaining entries in the queue will continue to be processed, after which we will declare a winner.

Collaboration and editing existing entries

Once an entry has been submitted, it cannot be changed. However, any entry can be viewed, edited, and resubmitted as a new entry. You are free to view and modify any entries in the queue. The contest server maintains a history for each modified entry. If your modification of an existing entry improves its score, then you are the "author" for the purpose of determining the winners of this contest. We encourage you to examine and optimize existing entries.

We also encourage you to discuss your solutions and strategies with others. Consider using the newsreader to share ideas.

The Prizes

In addition to the grand prize, interim prizes will be awarded at various times throughout the period of the contest to the final author of the current highest ranking entry. Watch the Message of the Day for details and deadlines.

The author of the final winning entry in the contest will receive their choice of:

  • A pair of high-quality blue jeans and a MathWorks jeans shirt
  • Introduction to Computational Molecular Biology by Joao Carlos Setubal
  • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield
Winning entries must be original or must be improvements on other entries. The contest administrators reserve the right the disqualify inconsequential changes which happen to result in better scores.
 

Fine Print

The allowable functions are those contained in the basic MATLAB package available in $MATLAB/toolbox/matlab, where $MATLAB is the root MATLAB directory. Functions from other toolboxes will not be available. Entries will be tested against MATLAB version 5.3.1 (R11.1).

The following are prohibited:

    MEX-files
    eval, feval, etc.
    Shell escape such as !, dos, unix
    Handle Graphics commands
    ActiveX commands
    File I/O commands

    Debugging commands

    Printing commands

    Simulink commands

    Benchmark commands such as tic, toc, and flops

FAQ

Check our FAQ for answers to frequently asked questions about the contest.

Disclaimer

This contest is not supported or endorsed by the Genome Data Base in any way. Further, we do not claim that this contest represents a real problem faced by the Genome Data Base. For more information on the GDB, visit the Genome Data Base Page.

About named visibility periods

Contests are divided into segments where some or all of the scores and code may be hidden for some users. Here are the segments for this contest:

  • Darkness - You can't see the code or scores for any of the entries.
  • Twilight - You can see scores but no code.
  • Daylight - You can see scores and code for all entries.
  • Finish - Contest end time.