Multiple alignments

This page has been accessed 0 times since 26-Mar-99 CronCount

Multiple alignments:

Multiple alignment is a powerful method for identifying and visualizing conserved regions in family of homologous proteins. Moreover, all phylogenetic methods need multiple alignment to start with. The 'correctness' of this alignment determines the 'correctness' of resulting phylogenetic tree.

Alignment methods:

Different approaches:

* Maximizing (or minimizing) score. What score?
Find a good function and try to align sequences in order to optimize this function.
MSA
SAGA

* Progressive pairwise or others
Align most similar pairs with each other,then merge and/or optimize those alignments.
Some of them start with finding local similarities in aligned sequences (local-global approach).

Global Local-global

ClustalX (Thompson,1994)
ClustalW (Thompson,1994)
PileUp
MAP DIALIGN (Morgenstern, 1996)
DFALIGN
Macaw (Schuler,1991)
IterAlign (Brocchieri,1998)
PIMA
prrp (Gotoh, 1996)

* Probabilistic approach (Profiles and Hidden Markov Models)
Find model, align each sequence to model independently.
ClustalW - does simple profile alignment
MaxHom - iterative search of database with profiles (Sander,1991)
HMMER 2.0 package (HMMbuild + HMMalign) by Sean Eddy et al., WashU
COVE for RNA alignment by Sean Eddy and Richard Durbin, WashU (Eddy,1994)
SAM by University of California, Santa Cruz

Detailed description for each method follows:

Possible scoring functions for multiple alignments:

The scoring matrices and gap costs that are used in pairwise alignments cannot be directly and easily applied to multiple alignments. The methods that are dependent on optimal score should first find out how to estimate this score of 'goodness'.
Ideally we would like that best alignment would have highest score. But what is the best alignment? What criterion can be used for comparing the 'goodness' of alignments? Most of the recent papers that compare multiple alignment methods use structural alignments as "criterion of truth". These are based on protein families whose crystal structure has been established and that can be aligned based on their secondary structure regions (See HSSP database or FSSP database for examples). The score is then given as percentage of residues that are correctly aligned in structurally conserved regions.
The classical way for calculating multiple alignment score is to sum all possible pairwise scores. The main problem is that this system does not guarantee that the best score is reflecting biologically best alignment.
An alternative method for scoring multiple alignments was published recently: the COFFEE method. This method tries to achieve maximum agreement with all pairwise alignments. Thus it is fairly similar to the classical sum of pairs method. The COFFEE is reported to result in slightly better alignments when sequences are only moderately related (<40% identical).

Optimization methods:

The idea for using optimization methods for multiple alignment is derived from analogy with pairwise alignment: all pairwise alignment methods are based on optimization of alignment score. It would be pretty natural to transfer the same logic to multiple alignment. However, the dynamic programming algorithm gets too complex when used on more than 2 sequences. The main concern here is computer memory - it needs memory in order of N^k where N is average length and k is number of sequences. In practice this means that for aligning of 6 average protein sequences you would need more than million of Megabytes of memory. The approximation algorithms (like FASTA in pairwise alignments) can be used, but they still cannot handle more than 8 sequences with maximum length of 100. The MSA program is the best example of such efforts (available from BCM, Texas). It still needs up to 100 Mb memory if you run it on your own computer.
So why should we use those memory-hungry methods? Well, keep in mind that dynamic programming is guaranteed to return the best alignment (with given scoring matrix and gap cost). So it is worth trying. MSA still gives you this best alignment, with little tweaking of parameters (parameter epsilon determines the scope of search and is giving you compromise between search speed and probability of getting best alignment).
A different approach to optimization method has been taken recently. The idea is that we can find the best score without trying all possibilities. This sounds illogical, but is nearly true. The random testing methods have widespread use in solving complicated problems like this. More specifically, GENETIC ALGORITHM was used for finding best alignment in program SAGA. With genetic algoritms, the multiple alignment is changed and improved in evolution-like manner, eventually evolving to (hopefully) best scoring alignment. It takes some time (1 hour for 6 proteins of 330 amino-acids on RS10000 processor), but is still much more affordable, mainly because of tiny memory requirements.

All optimization methods are dependent on correct choice of gap costs and scoring matrices.

Progressive alignment methods:

Global alignment methods

So, we have difficulties with exhaustive alignment of large number of sequences. In 1987 an alternative approach was suggested (Feng, 1987). This method utilizes the progressive pairwise alignment algorithm iteratively to achieve the multiple alignment of a set of protein sequences and to construct an evolutionary tree depicting their relationship. The closest sequences according to this evolutionary tree are aligned first. Then this pairwise alignment is aligned to other sequences or to other alignments and so on until the final multiple alignment is put together. The sequences are assumed a priori to share a common ancestor. This approach is fast and relatively meaningful. By the nature, progressive alignment does not optimize any score and does not guarantee the optimal alignment. Nevertheless, the method is fast and with proper settings gives nice alignments with all kind of sequences. The masterpiece among the multiple alignment programs is ClustalW (and more recently ClustalX with graphical interface). The detailed analysis of ClustalW algorithm deserves a separate page, but it's main advantages are:
Automatically variable scoring matrix that depend on:

similarity of sequences (measured from guide tree)

Automatically variable gap penalties that depend on:

scoring matrix used
similarity of sequences
context (hydrophilicity, existence of gaps in other sequences)
length of sequences (Gap opening penalty)
difference in sequence length (Gap extension penalty)

Particularly important seems to be the adjustment of gap penalty according to sequence context. The graphical interface of ClustalX allows realignment of subset of sequences, manual rearrangement of bases and alignment quality estimates of alignments generated by ClustalW.

Local alignment methods

The other approach to progressive multiple alignment would be local multiple alignment. The fact is that biological sequences are not equally conserved over their entire lenght. Only short blocks that code for important structural or functional elements of protein remain conserved in course of evolution. Thus, many programs try to identify and align only conserved blocks from sequences and put those together to form overall alignment. The unconserved and lowly conserved areas between blocks may remain unaligned. This approach is promising. We should consider also that local alignment methods have been ruling among pairwise alignments since their invention. Those aligned blocks should reflect important structural and functional regions reasonably well. People who do multiple alignments because they are interested in common structure or function of aligned proteins may find them useful. But what if you are interested in phylogenetic analysis. Most phylogenetic methods require global multiple alignment as a prerequisite. May be the use of ONLY well conserved blocks is solution here too! The less conserved areas can be wrongly aligned and cause wrong phylogenetic trees. So why use them?
At least in cases where conserved areas offer enough phylogenetic information to build tree, the unconserved areas should be ignored. Particularly because their accepted mutation rate might be unnacceptably high and thus saturated. Fortunately there exist several alignment editor programs, that let user manually edit the alignment or choose proper areas before proceeding with phylogenetic analysis (SeaView, PhyloWin, Macaw).
As said, local alignment programs try to solve this problem WHILE doing alignment. Some of them allow gaps within conserved block, some don't. The main problem with local alignment programs is in difficulty of defining the boundaries of conserved block. Which sequences are conserved enough to be included in multiple alignment? Certainly programs that do local multiple alignment have room for improvement.

From practical side, I would suggest to try some local multiple alignment programs and see how do you like them. The outcome will depend on your sequences and there is no good criterions to say one is better than others so you have to try.
You may want to test:
MacaW (Windows and Macintosh), that is statistically well grounded and allows editing of alignment,
Iteralign, that has the WWW-based service,
prrp by Osamu Gotoh. The latter is UNIX program that uses iterations to make alignment, phylogenetic tree and sequence weights mutually consistent. Because of optimizing iterations, it could also be classified as optimizing alignment method. Claimed to be good but is not simple to use.

Probabilistic approach:

Under this category I describe profile and Hidden Markov Model based alignment methods. These can be considered as fine tools, because they use more information for creating multiple alignment than other methods. The immediate question is where this information is coming from? The answer: the information is coming from existing multiple alignments. So, probabilistic methods can be used only in cases where similar sequences are already aligned and some specific information can be extracted from existing alignment.
The simplest profile alignment is implemented in ClustalW, option 3 (Profile/structure alignments). This method needs only an old alignment as source of information. and then adds new sequences to this alignment.
This is:
1. faster than starting alignment with all sequences de novo
2. allows manual editing of the old alignment to get better results.

More complicated methods code the information of old alignment into the profile and can use such profiles for alignment. Basically profiles are probabilities of finding given amino acid at given position. Using profile for scoring alignment is technically almost as easy as using usual scoring matrices for scoring each alignment position. Usual scoring matrices also reflect probabilities of one amino acid changing to another. But PAM, BLOSUM and Gonnet matrices contain generalized information from all proteins. The main advantage of profiles is that they specify these probabilities for each amino acid position of protein separately. Of course they are not as general as scoring matrices, you need to create different profile for each family of proteins. But the gain in alignment quality is well worth the effort. A good example of intelligent use of profiles for sequence alignment is PSI-BLAST program, that creates alignment and profile to extract more members of given family from database.

Even more advanced methods are Hidden Markov Model (HMM) based methods. They use a sequence model to store probabilities of amino acid substitutions and gap frequencies. They are similar to profile methods, except that their model can contain more information than a simple profile. Most often the gap penalty values are coded into the HMM model, thus eliminating the need for user defined gap penalties, the source of many if not most errors in multiple alignments. The most used program for building HMM models and aligning sequences according to HMM model is HMMer. The collection of thousands of existing HMM models can be found from Pfam database.
Beside the well-grounded gap-penalty handling, HMM's have another advantage in multiple alignments. The speed of alignment. The speed of alignment is in linear order with number of sequences as opposed to exponential order for most other alignment methods. Thus, HMMs are particularly useful for protein families with many members.

HMM models are not limited to profile and gap modelling. For aligning RNA molecules, the incorporation of secondary structure information into the HMM model is possible. Nevertheless, the number of calculations increases with additional information, thus this program COVE currently does not handle RNA sequences longer than few hundred nucleotides.

References and suggested reading:

For viewing and editing your alignments I would suggest the recent Java alignment viewer Jalview. Netscape 4.05, Internet Explorer 4 or HotJava1.1 is needed to use it. It can do alignment itself by clustalw algorithm. This viewer has nice color scheme, based on structural features, identity level...Viewing of sequence trees. Removal of redundant sequences and so on. Try also right mouse click on different places of tree

Brocchieri L, Karlin S (1998) A symmetric-iterated multiple alignment of protein sequences. J Mol Biol 276: 249-64
Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25: 351-60
Gotoh O (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264: 823-38
Eddy S and Durbin R (1994) RNA Sequence Analysis Using Covariance Models. Nucl. Acids Res. 22: 2079-2088
Morgenstern B, Dress A, Werner T (1996) Multiple DNA and protein sequence alignment based on segment-to-segment
comparison. Proc Natl Acad Sci U S A 93:12098-103
Neuwald AF, Liu JS, Lipman DJ, Lawrence CE (1997) Extracting protein alignment models from the sequence database. Nucleic Acids Res 25: 1665-1677
Notredame C, Higgins DG (1996) SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res 24: 1515-24
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-22
Sander C, Schneider R (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9:56-68
Schuler GD, Altschul SF, Lipman DJ (1991) A workbench for multiple alignment construction and analysis. Proteins 9:180-90
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-80
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25: 4876-82

Global	Local-global
ClustalX (Thompson,1994) ClustalW (Thompson,1994) PileUp MAP	DIALIGN (Morgenstern, 1996) DFALIGN Macaw (Schuler,1991) IterAlign (Brocchieri,1998) PIMA prrp (Gotoh, 1996)