Gene Sequencing - Rules
The Challenge
You are working on the human genome project, and your team is closing in on the last few
base pairs needed to complete the project and go on to certain glory and perhaps a Nobel prize. Researchers on your team have managed to obtain several sets of segments needed to sequence the last remaining gene, but your help is urgently needed to lead the effort. For each set of segments given, you must devise an algorithm to sequence them in the correct order. If you succeed, your name will appear in all biology textbooks from this day forward. If you are too slow, someone else will take your place. Good luck.
Problem
Your job is to write a MATLAB M-file function which can reconstruct the
gene sequence using the given fragmentary segments of the larger gene.
Note: While this problem is posed in the spirit of a real-life challenge, the problem has been modified to suit a MATLAB contest rather than test your knowledge of gene sequencing techniques. Make sure to look over the rules below!
Gene Segments
The gene segments given to you have been chemically snipped from a single gene and sequenced individually. Your job is to reassemble them jigsaw-style into the most likely single sequence they could have come from.
The gene segments will be presented to your function as a
char (or string) matrix. For example, an input could look like:
>> Segments
ans =
TCGG
GCAG
ATCG
CAGC
GATC
Only four letters will be used. Each stands for one of the nitrogenous bases in DNA: A (adenine), C (cytosine), G (guanine), and T (thymine).
The segments in any single test given to your function have the same length (since they're passed as a
char matrix). But that length, as well as the number of segments, will vary across the many tests included in our test suite - in other words, your function will be given inputs of different, unknown sizes. In this case, we have 5 segments of length 4. Another test case might provide 20 segments of length 12 (To provide a sense of scale, the entire human genome is approximately three billion base pairs long).
Your function should return the shortest gene that can be made from these pieces. The rules for assembling
a sequence are:
- The sequence must use all the segments (and only the segments) provided
- You cannot flip any of the segments.
- The result must be a one row char array.
- The best answer is the shortest answer.
- Speed counts.
For example, if we slide around the above segments we can get:
TCGG
GCAG
ATCG
CAGC
GATC
==============.
CAGCAGATCGG
This gives a final sequence of length 11. Another possible solution
is:
TCGG
GCAG
ATCG
CAGC
GATC
==============.
GATCGGCAGC
This solution has length 10. Both are consistent with the segments provided, but according to the rules of the contest, the second solution would receive a
better score, since it's shorter. The easiest solution is:
TCGG
GCAG
ATCG
CAGC
GATC
====================.
TCGGGCAGATCGCAGCGATC
This is just all the pieces placed end to end. While this sequence is valid, it's unlikely to be a faithful reconstruction of the original gene and would therefore be penalized because of its length.
Keep in mind that length isn't the only criteria. Your
entry will be judged also on its speed. So while a shorter result is better,
entries that take a long time will also be penalized.
Function header
In MATLAB syntax, the function header for your solution should look like
this:
function Sequence = mygene(Segments)
Your function should return a 1 row
char array representing its solution.
Developing and Testing
To help in developing your sequencing program, we've created a file you
can use for running trials of your entry. Just click on the link
below to download it:
Also, if you click on
Submit an entry at the top left of this page, you'll see one example of a very simple entry.
Example:
Download the files above and place it on your MATLAB path, then write
your contest entry. Here we're assuming you're calling it
testsoln.
Now you can test your entry with:
% Run 10 trials using testsoln.m
Result = runtrials('testsoln',10);
Result will be a matrix with your score and runtimes. The
example here uses a fixed input size to your entry, so you may want to
experiment with different size inputs to see how your code performs.
The contest test suite contains a fixed (nonrandom) set of inputs of various
sizes.
Judging Criteria
There are two criteria for the judging of this contest:
length, the length of the sequence
that you produced (shorter is better)
time, the execution time of your algorithm (faster is better)
We will run your algorithm through a suite of tests. To receive a better
score, you should minimize both the length of the sequence you return
and the execution time of your function:
score = (alpha * length) + (beta * time)
where alpha and beta are constant scaling factors.
The winning entry is the one with the lowest average score over the
test suite.
The MATLAB Contest server will also place a cap on the maximum time
allowed for each entry over the test suite. Entries which exceed 180 seconds
for the test suite will time out and fail.
The test suite will be made available after the contest has been completed.
You might also want to look at the Contest FAQ for some commonly asked
questions which you might have. If your questions aren't answered there,
you can e-mail contest@mathworks.com
and we'll get back to you as soon as possible.
Optimizing your code
Try using the MATLAB
profile command to help you optimize your
code. For more information on the MATLAB profiler, type
helpwin profile
at the command line or
doc profile to read the online documentation.
The profiler has been enhanced significantly in MATLAB 5.3 (R11).
Deadline
The contest will close on Friday, March 24, 2000 at 5 PM EDT (21:00 GMT).
At that time, the contest queue will be closed to new entries. All remaining
entries in the queue will continue to be processed, after which we will declare a winner.
Collaboration and editing existing entries
Once an entry has been submitted, it cannot be changed. However, any entry
can be viewed, edited, and resubmitted as a new entry. You are free to
view and modify any entries in the queue. The contest server maintains
a history for each modified entry. If your modification of an existing
entry improves its score, then you are the "author" for the purpose of
determining the winners of this contest. We encourage you to examine and
optimize existing entries.
We also encourage you to discuss your solutions and strategies with others.
Consider using the
newsreader
to share ideas.
The Prizes
In addition to the grand prize, interim prizes will be awarded at various times throughout the period of the contest to the final author of the current highest ranking entry. Watch the Message of the Day for details and deadlines.
The author of the final winning entry in the contest will receive their
choice of:
- A pair of high-quality blue jeans and a MathWorks jeans shirt
- Introduction to Computational Molecular Biology by Joao Carlos Setubal
- Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield
Winning entries must be original or must be improvements
on other entries. The contest administrators reserve the right the disqualify
inconsequential changes which happen to result in better scores.
Fine Print
The allowable functions are those contained in the basic MATLAB package
available in $MATLAB/toolbox/matlab, where $MATLAB is the root MATLAB directory.
Functions from other toolboxes will not be available. Entries will be tested
against MATLAB version 5.3.1 (R11.1).
The following are prohibited:
MEX-files
eval, feval, etc.
Shell escape such as !, dos, unix
Handle Graphics commands
ActiveX commands
File I/O commands
Debugging commands
Printing commands
Simulink commands
Benchmark commands such as tic, toc, and flops
FAQ
Check our
FAQ for answers to frequently
asked questions about the contest.
Disclaimer
This contest is not supported or endorsed by the Genome Data Base in any
way. Further, we do not claim that this contest represents a real problem
faced by the Genome Data Base. For more information on the GDB, visit the
Genome Data Base Page.
About named visibility periods
Contests are divided into segments where some or all of the scores and code may be hidden for some users. Here
are the segments for this contest:
- Darkness - You can't see the code or scores for any of the entries.
- Twilight - You can see scores but no code.
- Daylight - You can see scores and code for all entries.
- Finish - Contest end time.