Certain patterns on DNA and protein sequences are strongly conserved by evolution, meaning they likely have important biological functions.
In DNA sequences, promoter binding sites are typically marked by a pattern of 5-10 nucleotides shortly upstream of each co-regulated gene.
In protein sequences, highly conserved regions among similar proteins likely define the most important folds or binding sites.
Highly conserved does not mean completely conserved, and such short sequence matches may not stand out in any pairwise alignment.
Thus it is important to be able to identify possible motifs from a set of related sequences, and be able to test whether a given string contains a given motif.
There are two standard ways to summarize a consensus pattern across multiple sequences.
A profile is a matrix of probabilities, where the rows represent possible bases, and the columns represent consecutive sequence positions.
A | 2 | 95 | 26 | 59 | 51 | 1 |
C | 9 | 2 | 14 | 13 | 20 | 3 |
G | 10 | 1 | 16 | 15 | 13 | 0 |
T | 79 | 3 | 44 | 13 | 17 | 96 |
motif | T | A | T | A | A | T |
A perfect profile has one base with probability 1.0 in each column, while a perfectly useless profile has entries of equal probability.
A motif is a concise representation of the highest probability characters of the profile, as a string or regular expression.
Motifs are suggestive of function preserved by evolution.
Protein sequences containing a given motif are typically clustered in the same family, and protein databases (PROSITE and Swiss-Prot) are typically organized around motifs.
When a motif is represented by a single contiguous string, exact text search methods suffice to see if a given sequence contains it.
When a profile does not contain indels, the probability that a given sequence containing the motif can be easily determined by multiplying the probabilities of each character in each position of each window.
If one has a probabilistic model of how typical DNA sequences are generated, one can easily compute the probability that a given profile score is significant.
When profiles or motifs have gaps, the optimal alignment can be computed using a dynamic programming-type algorithm.
Motifs stand out as highly conserved regions in a multiple sequence alignment
Fragment assembly programs must have a multiple sequence alignment code to extract a consensus call to define each base in the final sequence, but base calling accuracy is high enough that this sequence alignment problem is much easier than for conserved regions under evolution.
Sequence ID Description 1 SEQ#01 Humcetp 2 SEQ#02 Hupltp 3 SEQ#03 Rrrya3 4 SEQ#04 Bovbpi 5 SEQ#05 Ratlbp *** Heuristic Multiple Alignment *** 45213 --------MLAATVLTLALLGNAHACSKGTSHEAGIVCRITKPALLVLNHETA---KVIQTAFQRASYPD--ITG -------MALFGALF-LALLAGAH------AEFPGCKIRVTSKALELVKQEGL---RFLEQELETITIPD--LRG ---------MMPGVYALLLLWGLATPCLGLLETVGTLARIDKDELGKAIQNSLVGGPILQNVLGTVTSVNQGLLG MARGPDTARRWATLVVLAALGTAVT-----TTNPGIVARITQKGLDYACQQGV---LTLQKELEKITIPN--FSG --MKSATGPLLPTLLGLLLLSIPRT----QGVNPAMVVRITDKGLEYAAKEGL---LSLQRELYKITLPD--FSG *** Optimal Multiple Alignment *** --MLAATVLTLA-LLGNAHACSKGTS-HEAGIVCRITKPALLVLNHETAK---VIQTAFQRASYPD--ITGEKAM --MALFGALFLA-LLAGAHA-------EFPGCKIRVTSKALELVKQEGLR---FLEQELETITIPD--LRGKE-- --MMPGVYALLL-LWGLATPCLGLL--ETVGTLARIDKDELGKAIQNSLVGGPILQNVLGTVTSVNQGLLGAGGL MARGPDTARRWATLVVLAALGTAVTT-TNPGIVARITQKGLDYACQQGVL---TLQKELEKITIPN--FSGNFKI --MKSATGPLLPTLLGLLLLSIPRTQGVNPAMVVRITDKGLEYAAKEGLL---SLQRELYKITLPD--FSGDFKI
Pairwise sequence comparison algorithms based on dynamic programming can be natural generalized to 3 or more sequences, but at a cost.
The possible states of our dynamic programming algorithm are all subsets
of positions in each sequence, so the dynamic programming matrix of
sequences each of length
has size
.
Since the last column of each state can contain any non-empty
subset of sequences, the time to fill this matrix is .
This quickly gets hopeless as and
get large, e.g.
and
.
Further, even more information must be stored if we are to charge lower
per-base deletion penalties for gaps.
For each position state, we must keep track of the cost for all
possibilities of ending each sequences alignment with a blank/gap.
How best to score the quality of a multiple alignment is a tricky business.
Entropy methods score each column based on the probability distribution of the characters in it.
Tree alignment metrics assume knowledge of an existing phylogenic tree, and weight differences between closely related sequence pairs as more important than distant pairs.
The most popular metric is sum of pairs cost.
We sum up the cost of the pairs
of symbols in each column.
A standard PAM-type matrix defines the cost of each symbol pair, but what is the cost of an indel vs. an indel?
If we set the indel-indel cost to 0, then the sum of pair cost of a
multiple alignment is the sum of the scores of the
induced pairwise alignments, where the induced alignment
deletes all columns with just indels in the rows associated with the
two sequences.
Sum of pairs is a nice measure because (1) it is quite natural, and (2) a good upper bound on the best possible alignment follows by finding all pairwise alignments and summing the scores.
Finding the optimal sum of pairs alignment is NP-complete.
However, by exploiting the lower bounds given by each pairwise DP matrix, one can heuristically reduce the number of states in the multiple dynamic programming matrix and hope to find the optimal alignment of (say) 6-7 sequences of (say) 200 characters in a reasonable amount of time.
Since the most interesting applications of multiple alignment lie beyond the range of exact solution algorithms, we must use heuristic methods.
A natural method exploits our ability to do pairwise alignments, and inserts sequences one by one into the alignment.
Note that aligning two consensus alignments is not conceptually different than aligning two sequences, since we can treat indels as just a space character.
Thus heuristics differ according to the order in which they merge the sequences, which is defined by a tree.
Reasonable trees are (1) based on known phylogenic trees, or similarity clustering dendograms, (2) of fixed topologies, such as paths or binary trees.
Fixed topologies require associating each node with an input sequence.
In star alignments, one sequence is selected as the center, and all other sequences merged on a pairwise basis into it.
Picking as the center the sequence ``most similar to the others'' gives the best results.
Motifs tend to be fairly weakly conserved across sequences, so identifying them requires analyzing and aligning multiple sequences.
A rigorous definition of the problem is searching for -motifs,
where we seek a fixed but unknown sequence
of length
such
that each
of our input sequences contains a substring of length
with at most
substitutions from
.
If , then suffix trees would provide an efficient way to identify
such a motif.
But it isn't.
If , there are not likely to be long common substrings
you can find that are shared by all the input sequences.
If is sufficiently short, we can exhaustively try all
possible
DNA or
protein motifs.
But usually it isn't, say
.
If is sufficently close to
, then almost certainly many such
motifs exist, so finding the `right' one becomes impossible.
Typical sizes of hard motif problems are ,
and
.
Also popular are what I call sampling methods, that repeatedly insert and delete sequences to tune an alignment.
A profile is naturally represented as a hidden Markov model, since both define probabilities of character transitions, so one can envision training a HMM to discover motifs.
Gibbs sampling methods insert and delete sequences according to ``free energy'' calculations. These are analogous to simulated annealing methods of combinatorial optimization, and apparently work well in practice.
In a Gibbs sampler, we maintain a multiple alignment (initially randomly or heuristicly selected). In each iteration, we randomly pick a sequence for deletion from the alignment, and then reinsert it into the alignment at the position which either (1) maximizes the score, or (2) with probabilities biased toward the maximum score.
Even the previously discussed heuristic alignments might benefit by deleting and reinserting sequences, and moving characters to different columns as a post processing step.
Another approach is random projections, hashing all -substrings
according to random sets of
positions.
Motifs should lurk around buckets with more collisions than expected.