This page has been accessed
0 times since 26-Mar-99 CronCount
Comparison of sequences is a central method
in whole bioinformatics. Almost all programs deal in one or another way
with comparing and aligning sequences. Particularly, pairwise alignment
is most important method - most methods in bioinformatics rely heavily
on pairwise comparisons.
There are many methods and programs to
align sequences but the basic process of alignment is the same. Each comparison
method itself can be separated to two key parts:
1. A method to estimate the goodness of
each possible alignment between those two sequences (the scoring
problem).
2. An algorithm to set the two sequences
against each other in best possible way (the actual alignment problem).
Scoring matrix is a table that is used to
evaluate alignment score. It contains scores (numbers) for each position
(match or mismatch) in alignment. The simplest scoring system is identity
matrix. This is often used for DNA alignments. For example, BLAST scores
1 for nucleotide match and 0 for mismatch.
Protein scoring matrixes are usually tables
of 20x20 position (the score for changing from one amino acid to another).
There are many different ways to score protein alignments:
1. Identity scoring (1 for match and 0
for mismatch). This is simplest and most logical way to score alignments.
2. Genetic code scoring (Fitch matrix).
Considers the minimum number of DNA changes (1, 2 or 3) that would be required
to convert one amino acid to another. This matrix is not usable in most
cases because the 3rd position in codons is usually saturated.
3. Chemical properties matrix. In this
matrix amino acids are intuitively classified on the basis of their chemical
properties (polarity, hydrophobity, size). It was expected that change
of amino acid to chemically similar amino acid is easier to happen and
should score higher. This matrix does not really reflect real evolution
and is currently gone to history.
4. Matrices based on observed frequency
of substitutions. These are statistical reflections of real evolution and
give best alignments. In fact, all currently used protein matrixes are
this kind of matrices. They are also known as log-odds matrices.
But there are many varieties of those:
Dayhoff matrix
PAM-family
Gonnet-family
BLOSUM-family
Choice of proper scoring method is crucial
for proper alignment results. The choice of scoring matrix will depend
of sequences you compare. If you compare similar sequences, you should
use one scoring matrix, for distantly related sequences another matrix.
Scoring matrix is an estimate of sequence evolution. The evolution of sequences
is dependent of sequence composition, function, structure and other factors,
therefore no matrix will be absolutely correct. Fortunately, for most database
searches you can use average scoring matrix (default) without worries -
the choice is optimized to give you the best similarities. For better alignments,
advanced programs use separate matrix for each family of sequences (see
section Advanced programs). Specific matrices can be
generated beforehand for each family of similar sequences or generated
dynamically, repeating the database search or alignment and making better
matrix each round.
Each family of matrices consist some range
of matrices that reflect evolution of sequences at different level of similarity.
Typical examples are PAM40, PAM120, PAM250 or BLOSUM40, BLOSUM60 and BLOSUM80.
To learn more about how different matrices are built read an advanced
lesson about scoring matrices.
What is the difference between different
matrices?
Dayhoff matrix
was created in 1978 based on few closely related (> 85% identity) sequences
available this time (1500 aligned amino-acids).
PAM-family of matrices
is a simple update of the original Dayhoff matrix.
Gonnet matrices
were created by exhaustive alignment of all database sequences at 1992.
BLOSUM series
is based on local similarities of proteins rather than overall alignments.
My personal suggestion is (based on theory
and matrix-testing papers) to use Gonnet matrix if possible, or BLOSUM
if possible, or PAM if possible.
Scoring for gaps (deletions).
What
should be the cost of deletion or insertion in given position of alignment?
This is difficult question and has no simple answer. Because the
biological mechanism of deletions is very different from simple point mutations,
the gap cannot be included in ordinary scoring matrix. For most matrixes,
gap cost (or gap penalty) is determined in two parts: gap opening penalty
and gap extension penalty. Gap opening penalty will given for first position
in gap and others will score for gap extension penalty. This reflects the
biological appearance of deletions - they usually delete more than one
amino acid. Patrick Argos has published a comprehensive
analysis of insertions/deletions in protein structures.
The basic idea of sequence comparison is simple:
try every possible alignment between two sequences and give each aligned
position a score according to the scoring mastrix. The alignment with highest
score is the best. Unfortunately, trying all possible combinations of one
sequence against another is not possible due to enormous amount of
combinations (in order of 2(lengthA + lengthB) calculations).
Consider also that in biological sequences can always be deletions or insertions
of different size - those should be considered also when doing exhaustive
alignment. Therefore, since appearance of first biological sequences, people
have developed various methods to make alignment process applicable in
relatively short time.
There are two types of alignment methods:
global alignment methods and local alignment methods.
Global alignment methods try to
align 2 sequences in their whole length. If one sequence is much shorter
than other, then at least shorter sequence must be fully aligned.
Local alignment is searching only
partial similarities in two sequences that reach above certain threshold
(defined by programmer or by user). The rest of sequence may remain unaligned.
Global alignments are appropriate for
sequences that are known to be homologous and to share similarity over
their whole length. They are often used in phylogenetic analysis for calculations
of evolutionary distance between two genes (or species). Local alignment
methods are appropriate when the sequences may show isolated regions of
similarity, like conserved domains. This is useful for database scanning
and other applications when there is no a priori knowledge that
the proteins are similar.
Simple pairwise alignments: |
The first reasonable alignment method was
a dynamic programming method developed in 1970 by Needleman
& Wunch. It expected that the sequences are somehow related and
are about the same length. It tried to align sequences within their entire
length, therefore it was an global alignment method. A modification of
that method to allow for search of local alignments was given in 1981 by
Smith
& Waterman. Both dynamic alignment algorithms are guaranteed to
find best alignment (within limits of used scoring system) take time that
is in order of (sequence1 length * sequence2 length).
* Best-known program that uses Needleman-Wunch
(NW) global alignment algorithm is gap.
* Popular programs that use Smith-Waterman
(SW) algorithm are: BestFit ,
The dynamic programming is applicable for
one pairwise alignment, but not for database scanning where thousands of
comparisons need to be made within seconds. Therefore some modifications
have made to classical dynamic programming algorithm over time to make
your life easier:
1. Dynamic programming algorithms on specialized
computers. [MPsrch]
[MPsrch]
[DeCypher
II]
MPsrch is
a specialized program that enables one alignment to be calculated on many
processors simultaneously (massively parallel computing). In DeCypher-type
computers, calculations are done on level of hardware, which gives speed
increase several orders of magnitude, comparing to ordinary computers.
2. Not allowing gaps in alignments. Compare+DotPlot
,
BLAST1
Compare and
DotPlot
are commercial programs from GCG package that do very quick global alignment
by finding short similar "words" in both sequences and plot them in XY-coordinate
graph.
BLAST was
originally developed in 1990 by Altschul and coworkers. The first version
did not allow for gaps. In 1997 it was replaced by faster and more sensitive
BLAST2
program.
3. Using other heuristic methods. FASTA,
BLAST2
Heuristic methods are methods that make
some guess which alignments to try. They do not search all possible alignments
between 2 sequences and are therefore NOT guaranteed to give best alignment
(within limits of current scoring system). Nevertheless, in practice they
are optimized to get reasonably sensitive results (you will get your best
alignment and not too many false positive alignments). Thanks to their
limitations in sensitivity, they gain huge amount in speed of search. Both
programs use only order of (sequence1 length + sequence2 length) calculations,
so they are very fast comparing to dynamic programming algorithms.
FASTA
was
developed in 1988 by Pearson and Lipman. Fasta-collection also includes
programs ALIGN and LALIGN
that do more rigorous global and local alignment respectively. Current
BLAST2
algorithm
is very similar to FASTA algorithm. BLAST2 is throughly optimized to speed
and is several times faster than FASTA. Nevertheless, FASTA might be more
sensitive for DNA comparisons.
On-line search of single sequences can
be done in few seconds using WWW servers for FASTA
or BLAST2
BLAST2 algorithm is principially similar
to FASTA but is designed more professionally and I think it deserves more
longer explanation. There I also try to explain
how to interpret BLAST output (E-values, Score, alignments et al.).
For more information about BLAST and FASTA,
read BLAST-FAQ, BLAST-FAQ
at NCBI, BLAST-Manual or FASTA-Manual
Advanced
programs for better alignments: |
A general problem with all alignments is a
choice of scoring matrix - usually it is determined empirically from the
whole database. Nevertheless, the evolution of the protein that YOU are
interested in might be quite different from this average model. The scoring
matrix is specific to each separate sequence and even for separate domains
of proteins. The alignments using genefamily-specific scoring matrices
are more sensitive and more correct. All of the following programs take
advantage of that fact. They run iterative searches against set of sequences
(database) and try to improve scoring matrix after each round.
Psi-BLAST
This program is based on BLAST2 algorithm and is meant for more sensitive
database searches. It recalculates scoring matrix according to matches
from previous search.
Most
by Roman Tatusov. Another iterative database searcher.
It was a predecessor of Psi-BLAST.
HMMER
from Sean Eddy's lab
The last programs is based on Hidden Markov
Models. It builds a model of sequence which is
position-dependent:
it is like having different scoring matrix for each position (amino-acid)
of gene. In many ways, these hidden Markov models correspond to profiles.
The HMM models are built from a set of aligned sequences. It is important
to have only relevant sequences in this building stage, because nonrelated
sequences would give a corrupt profile and accordingly wrong alignment
afterwards. Once the HMM or profile model is available it can be used for
extremely sensitive database searches or even for multiple alignment. Hundreds
of prebuilt HMM models can be found from Pfam
database
Often user needs to align two (or more) protein
coding genes (given DNA sequence). This is useful for comparing nucleotide
substitution rates in synonymous codons or for measuring phylogenetic distance
between closely related sequences. In this case you should consider that
program should compare the codons (amino-acids) and gaps
should be introduced in chunks of 3, to keep the alignment between homologous
codons if possible.
The common Fasta or ClustalW programs
are not appropriate for this task. Thus, for alignment of protein coding
genes,
use protein based DNA alignment programs (Tblastx
or Genal).
Multiple alignments
and comparative analysis of multiple alignment programs.
Here are some links
to Internet lectures that are certainly worth reading. Some of the materials
above is adopted from those pages:
Protein
Sequence Alignment and Database Scanning by Geoffrey J. Barton
Best tutorial on topic.
BioComputing
Hypertext Coursebook from VSNS and University
of Bielefild Little longer and more serious explanation
to bioinformatics.
References and suggested
reading: |
Altschul
SF (1991) Amino acid substitution matrices from an information
theoretic perspective. J. Mol. Biol. 219: 555-565.
Altschul
SF, Boguski MS, Gish W and Wootton JC (1994) Issues in searching
molecular sequence databases. Nature Genetics 6: 119-129.
Altschul
SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25: 3389-3402
Dayhoff
MO (1978) Atlas of Protein Sequence and Structure (Natl. Biomed.
Res. Found., Washington), Vol. 5, Suppl. 3, pp. 345-352.
Dayhoff
MO, Barker WC, Hunt LT (1983) Establishing homologies in protein
sequences. Methods Enzymol 91: 524-545
Gonnet
GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire
protein sequence database. Science 256: 1443-1445
Henikoff
S, Henikoff JG (1992) Amino acid substitution matrices from
protein blocks. Proc Natl Acad Sci USA 1992 89: 10915-10919
Needleman
SB, Wunsch CD (1970) A general method applicable to the search
for similarities in the amino acid
sequence of two proteins. J Mol Biol 48:
443-453
Pascarella
S & Argos P (1992) Analysis of insertions/deletions in protein
structures. J.Mol.Biol. 224: 461-471
Smith
TF, Waterman MS (1981) Identification of common molecular subsequences.
J Mol Biol 147: 195-197
Pearson
WR, Lipman DJ (1988) Improved tools for biological sequence
comparison. Proc Natl Acad Sci USA 85: 2444-2448
J Mol Biol 1981 Mar 25;147(1):195-197Proc
Natl Acad Sci USA 91: 12091-12095
Maido Remm, 1998