This page has been accessed
0 times since 26-Mar-99 CronCount
Multiple alignment is a powerful method
for identifying and visualizing conserved regions in family of homologous
proteins. Moreover, all phylogenetic methods need multiple alignment
to start with. The 'correctness' of this alignment determines the
'correctness'
of resulting phylogenetic tree.
Different approaches:
* Maximizing (or minimizing)
score. What score?
Find a good function and try to align
sequences in order to optimize this function.
MSA
SAGA
* Progressive pairwise
or others
Align most similar pairs with each other,then
merge and/or optimize those alignments.
Some of them start with finding local
similarities in aligned sequences (local-global approach).
* Probabilistic approach
(Profiles and Hidden Markov Models)
Find model, align each sequence to model
independently.
ClustalW - does simple profile
alignment
MaxHom - iterative search of database
with profiles (Sander,1991)
HMMER
2.0 package (HMMbuild + HMMalign) by Sean Eddy et al., WashU
COVE
for RNA alignment by Sean Eddy and Richard Durbin, WashU (Eddy,1994)
SAM
by University of California, Santa Cruz
Detailed description for each method follows:
Possible scoring
functions for multiple alignments: |
The scoring matrices and gap costs that are
used in pairwise alignments cannot be directly and easily applied to multiple
alignments. The methods that are dependent on optimal score should first
find out how to estimate this score of 'goodness'.
Ideally we would like that best alignment
would have highest score. But what is the best alignment? What criterion
can be used for comparing the 'goodness' of alignments? Most of the recent
papers that compare multiple alignment methods use structural alignments
as "criterion of truth". These are based on protein families whose
crystal structure has been established and that can be aligned based on
their secondary structure regions (See HSSP
database or FSSP
database for examples). The score is then given as percentage of residues
that are correctly aligned in structurally conserved regions.
The classical way for calculating
multiple alignment score is to sum all possible pairwise scores.
The main problem is that this system does not guarantee that the best score
is reflecting biologically best alignment.
An alternative method for scoring
multiple alignments was published recently: the COFFEE
method. This method tries to achieve maximum agreement with all pairwise
alignments. Thus it is fairly similar to the classical sum of pairs method.
The COFFEE is reported to result in slightly better
alignments when sequences are only moderately related (<40% identical).
The idea for using optimization methods for
multiple alignment is derived from analogy with pairwise alignment: all
pairwise alignment methods are based on optimization of alignment score.
It would be pretty natural to transfer the same logic to multiple alignment.
However, the dynamic programming algorithm gets too complex when used on
more than 2 sequences. The main concern here is computer memory - it needs
memory in order of Nk where N is average length and k is number
of sequences. In practice this means that for aligning of 6 average protein
sequences you would need more than million of Megabytes of memory. The
approximation algorithms (like FASTA in pairwise alignments) can be used,
but they still cannot handle more than 8 sequences with maximum length
of 100. The MSA program is the best example of such efforts
(available from BCM,
Texas). It still needs up to 100 Mb memory if you run it on your
own computer.
So why should we use those memory-hungry
methods? Well, keep in mind that dynamic programming is guaranteed
to return the best alignment (with given scoring matrix and gap cost).
So it is worth trying. MSA still gives you this best alignment, with little
tweaking of parameters (parameter epsilon determines the scope of search
and is giving you compromise between search speed and probability of getting
best alignment).
A different approach to optimization method
has been taken recently. The idea is that we can find the best score without
trying all possibilities. This sounds illogical, but is nearly true. The
random testing methods have widespread use in solving complicated problems
like this. More specifically, GENETIC ALGORITHM was used for finding best
alignment in program SAGA. With genetic algoritms,
the multiple alignment is changed and improved in evolution-like manner,
eventually evolving to (hopefully) best scoring alignment. It takes some
time (1 hour for 6 proteins of 330 amino-acids on RS10000 processor), but
is still much more affordable, mainly because of tiny memory requirements.
All optimization methods are dependent
on correct choice of gap costs and scoring matrices.
Progressive alignment
methods: |
Global alignment methods
So, we have difficulties with exhaustive alignment
of large number of sequences. In 1987 an alternative approach was
suggested (Feng, 1987). This method utilizes
the progressive pairwise alignment algorithm
iteratively to achieve the multiple alignment of a set of protein sequences
and to construct an evolutionary tree depicting their relationship. The
closest sequences according to this evolutionary tree are aligned first.
Then this pairwise alignment is aligned to other sequences or to
other alignments and so on until the final multiple alignment is put together.
The sequences are assumed a priori to share a common ancestor. This
approach is fast and relatively meaningful. By the nature, progressive
alignment does not optimize any score and does not guarantee the optimal
alignment. Nevertheless, the method is fast and with proper settings gives
nice alignments with all kind of sequences. The masterpiece among the multiple
alignment programs is ClustalW (and more
recently ClustalX with graphical interface).
The detailed analysis of ClustalW algorithm deserves a separate page, but
it's main advantages are:
Automatically variable scoring matrix
that depend on:
-
similarity of sequences (measured from guide
tree)
Automatically variable gap penalties
that depend on:
-
scoring matrix used
-
similarity of sequences
-
context (hydrophilicity, existence of gaps
in other sequences)
-
length of sequences (Gap opening penalty)
-
difference in sequence length (Gap extension
penalty)
Particularly important seems to be the adjustment
of gap penalty according to sequence context. The graphical interface
of ClustalX allows realignment of subset of sequences, manual rearrangement
of bases and alignment quality estimates of alignments generated
by ClustalW.
Local alignment methods
The other approach to progressive multiple
alignment would be local multiple alignment.
The
fact is that biological sequences are not equally conserved over their
entire lenght. Only short blocks that code for important structural or
functional elements of protein remain conserved in course of evolution.
Thus, many programs try to identify and align only conserved blocks from
sequences and put those together to form overall alignment. The unconserved
and lowly conserved areas between blocks may remain unaligned. This approach
is promising. We should consider also that local alignment methods have
been ruling among pairwise alignments since their invention. Those aligned
blocks should reflect important structural and functional regions reasonably
well. People who do multiple alignments because they are interested in
common structure or function of aligned proteins may find them useful.
But what if you are interested in phylogenetic analysis. Most phylogenetic
methods require global multiple alignment as a prerequisite. May be the
use of ONLY well conserved blocks is solution here too! The less conserved
areas can be wrongly aligned and cause wrong phylogenetic trees. So why
use them?
At least in cases where conserved areas
offer enough phylogenetic information to build tree, the unconserved areas
should be ignored. Particularly because their accepted mutation rate might
be unnacceptably high and thus saturated. Fortunately there exist several
alignment editor programs, that let user manually edit the alignment or
choose proper areas before proceeding with phylogenetic analysis (SeaView,
PhyloWin, Macaw).
As said, local alignment programs try
to solve this problem WHILE doing alignment. Some of them allow gaps
within conserved block, some don't. The main problem with local alignment
programs is in difficulty of defining the boundaries of conserved block.
Which sequences are conserved enough to be included in multiple alignment?
Certainly programs that do local multiple alignment have room for improvement.
From practical side, I would suggest to
try some local multiple alignment programs and see how do you like them.
The outcome will depend on your sequences and there is no good criterions
to say one is better than others so you have to try.
You may want to test:
MacaW
(Windows and Macintosh), that is statistically well grounded and allows
editing of alignment,
Iteralign,
that has the WWW-based service,
prrp
by Osamu Gotoh. The latter is UNIX program that uses iterations to make
alignment, phylogenetic tree and sequence weights mutually consistent.
Because of optimizing iterations, it could also be classified as optimizing
alignment method. Claimed to be good but is not simple to use.
Under this category I describe profile and
Hidden Markov Model based alignment methods. These can be considered as
fine tools, because they use more information for creating multiple alignment
than other methods. The immediate question is where this information is
coming from? The answer: the information is coming from existing
multiple alignments. So, probabilistic methods can be used only in cases
where similar sequences are already aligned and some specific information
can be extracted from existing alignment.
The simplest profile alignment is implemented
in ClustalW, option 3 (Profile/structure alignments). This method
needs only an old alignment as source of information. and then adds new
sequences to this alignment.
This is:
1. faster than starting alignment with
all sequences de novo
2. allows manual editing of the old alignment
to get better results.
More complicated methods code the information
of old alignment into the profile and can use such profiles for alignment.
Basically profiles are probabilities of finding given amino acid at given
position. Using profile for scoring alignment is technically almost as
easy as using usual scoring matrices for scoring each alignment position.
Usual scoring matrices also reflect probabilities of one amino acid changing
to another. But PAM, BLOSUM and Gonnet matrices contain generalized
information from all proteins. The main advantage of profiles is that they
specify these probabilities for each amino acid position of protein separately.
Of course they are not as general as scoring matrices, you need to create
different profile for each family of proteins. But the gain in alignment
quality is well worth the effort. A good example of intelligent use
of profiles for sequence alignment is PSI-BLAST
program, that creates alignment and profile to extract more members of
given family from database.
Even more advanced methods are Hidden Markov
Model (HMM) based methods. They use a sequence model to store probabilities
of amino acid substitutions and gap frequencies. They are
similar to profile methods, except that their model can contain more information
than a simple profile. Most often the gap penalty values are coded
into the HMM model, thus eliminating the need for user defined gap penalties,
the source of many if not most errors in multiple alignments. The most
used program for building HMM models and aligning sequences according to
HMM model is HMMer.
The collection of thousands of existing HMM models can be found from Pfam
database.
Beside the well-grounded gap-penalty handling,
HMM's have another advantage in multiple alignments. The speed of alignment.
The speed of alignment is in linear order with number of sequences as opposed
to exponential order for most other alignment methods. Thus, HMMs are particularly
useful for protein families with many members.
HMM models are not limited to profile and
gap modelling. For aligning RNA molecules, the incorporation of secondary
structure information into the HMM model is possible. Nevertheless, the
number of calculations increases with additional information, thus this
program COVE
currently does not handle RNA sequences longer than few hundred nucleotides.
References and suggested
reading: |
For viewing and editing your alignments
I would suggest the recent Java alignment viewer Jalview.
Netscape 4.05, Internet Explorer 4 or HotJava1.1 is needed to use it. It
can do alignment itself by clustalw algorithm. This viewer has nice color
scheme, based on structural features, identity level...Viewing of sequence
trees. Removal of redundant sequences and so on. Try also right mouse click
on different places of tree
Brocchieri
L, Karlin S (1998) A symmetric-iterated multiple alignment of
protein sequences. J Mol Biol 276: 249-64
Feng
DF, Doolittle RF (1987) Progressive sequence alignment as a
prerequisite to correct phylogenetic trees. J Mol Evol 25:
351-60
Gotoh
O (1996) Significant improvement in accuracy of multiple protein
sequence alignments by iterative refinement as assessed by reference to
structural alignments. J Mol Biol 264: 823-38
Eddy
S and Durbin R (1994) RNA Sequence Analysis Using Covariance
Models. Nucl. Acids Res. 22: 2079-2088
Morgenstern
B, Dress A, Werner T (1996) Multiple DNA and protein sequence
alignment based on segment-to-segment
comparison. Proc Natl Acad Sci U S A
93:12098-103
Neuwald
AF, Liu JS, Lipman DJ, Lawrence CE (1997) Extracting protein
alignment models from the sequence database. Nucleic Acids Res 25:
1665-1677
Notredame
C, Higgins DG (1996) SAGA: sequence alignment by genetic algorithm.
Nucleic Acids Res 24: 1515-24
Notredame
C, Holm L, Higgins DG (1998) COFFEE: an objective function for
multiple sequence alignments. Bioinformatics 14: 407-22
Sander
C, Schneider R (1991) Database of homology-derived protein structures
and the structural meaning of sequence alignment. Proteins 9:56-68
Schuler
GD, Altschul SF, Lipman DJ (1991) A workbench for multiple alignment
construction and analysis. Proteins 9:180-90
Thompson
JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Res 22: 4673-80
Thompson
JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The
CLUSTAL_X windows interface: flexible strategies for multiple sequence
alignment aided by quality analysis tools. Nucleic Acids Res 25:
4876-82