Protein Folding - Rules
Protein Folding Contest Rules
This contest is similar to the last one in that it deals with molecules, but the problem is more current: protein folding. Proteins, which are initially synthesized in the cell as long chains of amino acid building blocks, automatically fold down into more compact working shapes. Exactly how they do this, and how they do it quickly and consistently, is poorly understood.
The problem is so notoriously difficult that even dramatic simplifications are difficult to solve. One stripped-down version of the protein-folding problem is the so-called HP lattice model introduced by Lau and Dill in 1989 (See the references at the bottom of this page). Our contest is based on a version of this problem.
This contest concerns a simplified two-dimensional version of the protein folding problem. Given the amino acid sequence of a simplified protein, your job is to determine a final "optimally folded" configuration. Each potential arrangement of the protein sequence is associated with a free energy level, which we call the "result" of your folding algorithm. An optimal folding (and there may be more than one) corresponds to the configuration with the lowest possible result, which can be thought of as a minimum energy configuration. In our idealized world, there are only two kinds of amino acids: hydrophobic amino acids (denoted by 1) and hydrophilic amino acids (denoted by 0). Since proteins exist in a watery environment, the hydrophobic amino acids prefer to cling to each other, compact and apart from the water as much as possible.
The sequence for the protein shown in the diagram below is
[0 1 1 1 1 0]
where hydrophobic amino acids are shown in red and hydrophilic amino acids are shown in blue.
Think of the amino acids that make up the protein as links in a chain. Each link may bend by a multiple of 90 degrees, but the distance between links may not change (and is assumed to be 1). No self-intersections are permitted, and the final result is constrained to lie in a single plane. Thus you can think of the chain as a line snaking through a Manhattan-style grid, one amino acid at a time. This is sometimes referred to as a self-avoiding walk.
We will refer to the complex plane to simplify the configuration of any given chain. The folding sequence your function returns will define the turn taken at each link by means of the complex numbers 1 (east), i (north), -1 (west), and -i (south). If you are given the amino acid sequence a
>> a = [0 1 1 1 1 0];
you might return the configuration sequence
>> c = [1 1 1 1 1];
which defines the straight line configuration shown in the diagram above. The configuration sequence is necessarily one element shorter than the sequence itself, since each element defines the relationship between two amino acids. If we assume the first amino acid is always anchored at 0, then the position of each amino acid can be succintly calculated as
>> p = [0 cumsum(c)]
0 1 2 3 4 5
So c can be interpreted as: "from your starting location, go east five consecutive times."
To introduce a small turn in the configuration, we can put a northward turn (given by i) in the fourth element of the configuration vector. We expect to see a picture like the one below.
And indeed, the following MATLAB code confirms our expectation.
>> c = [1 1 i 1 1];
>> p = [0 cumsum(c)]
0 1 2 2+1i 3+1i 4+1i
>> axis([-2 6 -2 3])
Complex arithmetic makes the folding calculations straightforward. If we want to fold the entire right half of the original sequence up by 90 degrees, we can multiply the original c vector by a fold vector that bends it like so
>> c = [1 1 1 1 1];
>> f = [1 1 i i i];
>> c2 = c.*f
1 1 i i i
>> p2 = [0 cumsum(c2)]
0 1 2 2+1i 2+2i 2+3i
thereby making the protein shown below.
By applying successive folding vectors, your code should try to minimize the result.
Here is one final configuration vector that tries to minimize the fold energy by bringing the four red amino acids close together. We apply yet another 90 degree turn to the last two elements.
>> f2 = [1 1 1 i i];
>> c3 = c2.*f2
1 1 i -1 -1
>> p3 = [0 cumsum(c3)]
0 1 2 2+1i 1+1i 1i
The factors, in order of importance, that determine your score are
- Free energy level (which we call the "result")
- Runtime of your code
The total score is a combination of these two, with linear dependence on the energy level and an exponential dependence on the runtime. Lower scores are better. Your code will time out and fail after 180 seconds, so beware of combinatoric explosions in your algorithm. Furthermore, the exponential contribution of the runtime penalty gets quite high as you approach 180 seconds.
The free energy (result) is calculated by totaling the distance between all pairs of hydrophobic amino acids. Your best configuration will have the hydrophobic amino acids in the most compact mass. The code to do this calculation is shown below. Suppose you have the last sequence shown above. We ignore the location of the hydrophilic amino acids, since they do not contribute to the result, and then total up the distance matrix for the remaining elements.
The first line removes the hydrophilic elements
>> p3 = p3(logical(a));
>> [x,y] = meshgrid(1:length(p3));
>> dist = abs(p3(x) - p3(y))
0 1.0000 1.4142 1.0000
1.0000 0 1.0000 1.4142
1.4142 1.0000 0 1.0000
1.0000 1.4142 1.0000 0
Here is the distance matrix. The result is the sum of all remaining distances: 1 + 1 + 1 + 1 + sqrt(2) + sqrt(2)
>> result = sum(sum(dist))/2
This result, along with the associated runtime, gets passed to our scoring algorithm before returning a final score, according to the equation
score = k1 * result + k2* e (k3*runtime)
We don't publish the values k1
, and k3
Developing your entry
The files you need to get started on the contest are included in a ZIP-file available here at the MATLAB Central File Exchange. If you download and uncompress this zip-file, you will have the files described below.
The routine that does the algorithmic work is solver.m, and the answer it returns is very simple: a list of ones. This means, take the amino acid vector a and string it out eastward.
function c = solver(a)
% Direction: 1 = east, i = north, -1 = west, -i = south
c = [ones(length(a)-1,1)];
A few points to keep in mind about this function and your contest entries:
- The function must have the right signature: one input argument, one output argument. Variable names are unimportant.
function c = solver(a)
- The function must return a configuration vector c of the appropriate length (length(a)-1) containing only 1, -1, i, and -i values.
To test this function with the test suite in the zip-file, run runcontest.m.
>> [output, message] = runcontest
The first column of results contains your result, and the second column gives you an estimate of the runtime. message returns a string describing the overall statistics of your submission.
It's sometimes useful to visualize what your entry is doing; you can do this by passing runcontest an input of 1.
>> [output, message] = runcontest(1)
This provides a plot of the resulting "protein" as each test is completed.
Collaboration and editing existing entries
Once an entry has been submitted, it cannot be changed. However, any entry
can be viewed, edited, and resubmitted as a new entry. You are free to
view and modify any entries in the queue. The contest server maintains
a history for each modified entry. If your modification of an existing
entry improves its score, then you are the "author" for the purpose of
determining the winners of this contest. We encourage you to examine and
optimize existing entries.
We also encourage you to discuss your solutions and strategies with
others. You can do this by posting to the comp.soft-sys.matlab thread that we've started from our newsreader.
Some notes on MATLAB 6.5
The version of MATLAB used for this contest, MATLAB 6.5, uses new JIT-Accelerator technology for fast execution of code. This can have a fairly dramatic effect on the speed of your code. In particular, some things, like loops, that you would have avoided in the past, might be acceptably fast now. If you are not running MATLAB 6.5, you may notice surprising differences between the runtime on your machine and the performance on the contest machine. Read this analysis
for more information on what's change.
The allowable functions are those contained in the basic MATLAB package
available in $MATLAB/toolbox/matlab, where $MATLAB is the root MATLAB directory.
Functions from other toolboxes will not be available. Entries will be tested
against MATLAB version 6.5 (R13).
The following are prohibited:
Java commands or object creation
eval, feval, etc.
Shell escape such as !, dos, unix
Handle Graphics commands
File I/O commands
Benchmark commands such as tic, toc, and
Check our FAQ
for answers to frequently
asked questions about the contest.
Lau K.F. and Dill K.A. (1989) A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules
Lau K.F. and Dill K.A. (1990) Theory for protein mutability and biogenesis. Proceedings of the National Academy of Sciences USA
About named visibility periods
Contests are divided into segments where some or all of the scores and code may be hidden for some users. Here
are the segments for this contest:
- Darkness - You can't see the code or scores for any of the entries.
- Twilight - You can see scores but no code.
- Daylight - You can see scores and code for all entries.
- Finish - Contest end time.