Comparison of sequences

This page has been accessed 0 times since 26-Mar-99 CronCount

Comparing sequences.

Comparison of sequences is a central method in whole bioinformatics. Almost all programs deal in one or another way with comparing and aligning sequences. Particularly, pairwise alignment is most important method - most methods in bioinformatics rely heavily on pairwise comparisons.
There are many methods and programs to align sequences but the basic process of alignment is the same. Each comparison method itself can be separated to two key parts:
1. A method to estimate the goodness of each possible alignment between those two sequences (the scoring problem).
2. An algorithm to set the two sequences against each other in best possible way (the actual alignment problem).

Scoring methods:

Scoring matrix is a table that is used to evaluate alignment score. It contains scores (numbers) for each position (match or mismatch) in alignment. The simplest scoring system is identity matrix. This is often used for DNA alignments. For example, BLAST scores 1 for nucleotide match and 0 for mismatch.
Protein scoring matrixes are usually tables of 20x20 position (the score for changing from one amino acid to another). There are many different ways to score protein alignments:
1. Identity scoring (1 for match and 0 for mismatch). This is simplest and most logical way to score alignments.
2. Genetic code scoring (Fitch matrix). Considers the minimum number of DNA changes (1, 2 or 3) that would be required to convert one amino acid to another. This matrix is not usable in most cases because the 3rd position in codons is usually saturated.
3. Chemical properties matrix. In this matrix amino acids are intuitively classified on the basis of their chemical properties (polarity, hydrophobity, size). It was expected that change of amino acid to chemically similar amino acid is easier to happen and should score higher. This matrix does not really reflect real evolution and is currently gone to history.
4. Matrices based on observed frequency of substitutions. These are statistical reflections of real evolution and give best alignments. In fact, all currently used protein matrixes are this kind of matrices. They are also known as log-odds matrices. But there are many varieties of those:

Dayhoff matrix

PAM-family

Gonnet-family

BLOSUM-family

Choice of proper scoring method is crucial for proper alignment results. The choice of scoring matrix will depend of sequences you compare. If you compare similar sequences, you should use one scoring matrix, for distantly related sequences another matrix. Scoring matrix is an estimate of sequence evolution. The evolution of sequences is dependent of sequence composition, function, structure and other factors, therefore no matrix will be absolutely correct. Fortunately, for most database searches you can use average scoring matrix (default) without worries - the choice is optimized to give you the best similarities. For better alignments, advanced programs use separate matrix for each family of sequences (see section Advanced programs). Specific matrices can be generated beforehand for each family of similar sequences or generated dynamically, repeating the database search or alignment and making better matrix each round.

Each family of matrices consist some range of matrices that reflect evolution of sequences at different level of similarity. Typical examples are PAM40, PAM120, PAM250 or BLOSUM40, BLOSUM60 and BLOSUM80. To learn more about how different matrices are built read an advanced lesson about scoring matrices.

What is the difference between different matrices?
Dayhoff matrix was created in 1978 based on few closely related (> 85% identity) sequences available this time (1500 aligned amino-acids).
PAM-family of matrices is a simple update of the original Dayhoff matrix.
Gonnet matrices were created by exhaustive alignment of all database sequences at 1992.
BLOSUM series is based on local similarities of proteins rather than overall alignments.
My personal suggestion is (based on theory and matrix-testing papers) to use Gonnet matrix if possible, or BLOSUM if possible, or PAM if possible.

Scoring for gaps (deletions). What should be the cost of deletion or insertion in given position of alignment? This is difficult question and has no simple answer. Because the biological mechanism of deletions is very different from simple point mutations, the gap cannot be included in ordinary scoring matrix. For most matrixes, gap cost (or gap penalty) is determined in two parts: gap opening penalty and gap extension penalty. Gap opening penalty will given for first position in gap and others will score for gap extension penalty. This reflects the biological appearance of deletions - they usually delete more than one amino acid. Patrick Argos has published a comprehensive analysis of insertions/deletions in protein structures.

Alignment algorithms:

The basic idea of sequence comparison is simple: try every possible alignment between two sequences and give each aligned position a score according to the scoring mastrix. The alignment with highest score is the best. Unfortunately, trying all possible combinations of one sequence against another is not possible due to enormous amount of combinations (in order of 2^{(lengthA + lengthB)}calculations). Consider also that in biological sequences can always be deletions or insertions of different size - those should be considered also when doing exhaustive alignment. Therefore, since appearance of first biological sequences, people have developed various methods to make alignment process applicable in relatively short time.

Alignment methods:

There are two types of alignment methods: global alignment methods and local alignment methods.
Global alignment methods try to align 2 sequences in their whole length. If one sequence is much shorter than other, then at least shorter sequence must be fully aligned.
Local alignment is searching only partial similarities in two sequences that reach above certain threshold (defined by programmer or by user). The rest of sequence may remain unaligned.
Global alignments are appropriate for sequences that are known to be homologous and to share similarity over their whole length. They are often used in phylogenetic analysis for calculations of evolutionary distance between two genes (or species). Local alignment methods are appropriate when the sequences may show isolated regions of similarity, like conserved domains. This is useful for database scanning and other applications when there is no a priori knowledge that the proteins are similar.

Simple pairwise alignments:

The first reasonable alignment method was a dynamic programming method developed in 1970 by Needleman & Wunch. It expected that the sequences are somehow related and are about the same length. It tried to align sequences within their entire length, therefore it was an global alignment method. A modification of that method to allow for search of local alignments was given in 1981 by Smith & Waterman. Both dynamic alignment algorithms are guaranteed to find best alignment (within limits of used scoring system) take time that is in order of (sequence1 length * sequence2 length).
* Best-known program that uses Needleman-Wunch (NW) global alignment algorithm is gap.
* Popular programs that use Smith-Waterman (SW) algorithm are: BestFit ,

The dynamic programming is applicable for one pairwise alignment, but not for database scanning where thousands of comparisons need to be made within seconds. Therefore some modifications have made to classical dynamic programming algorithm over time to make your life easier:
1. Dynamic programming algorithms on specialized computers. [MPsrch] [MPsrch] [DeCypher II]
MPsrch is a specialized program that enables one alignment to be calculated on many processors simultaneously (massively parallel computing). In DeCypher-type computers, calculations are done on level of hardware, which gives speed increase several orders of magnitude, comparing to ordinary computers.
2. Not allowing gaps in alignments. Compare+DotPlot , BLAST1
Compare and DotPlot are commercial programs from GCG package that do very quick global alignment by finding short similar "words" in both sequences and plot them in XY-coordinate graph.
BLAST was originally developed in 1990 by Altschul and coworkers. The first version did not allow for gaps. In 1997 it was replaced by faster and more sensitive BLAST2 program.

3. Using other heuristic methods. FASTA, BLAST2
Heuristic methods are methods that make some guess which alignments to try. They do not search all possible alignments between 2 sequences and are therefore NOT guaranteed to give best alignment (within limits of current scoring system). Nevertheless, in practice they are optimized to get reasonably sensitive results (you will get your best alignment and not too many false positive alignments). Thanks to their limitations in sensitivity, they gain huge amount in speed of search. Both programs use only order of (sequence1 length + sequence2 length) calculations, so they are very fast comparing to dynamic programming algorithms.
FASTA was developed in 1988 by Pearson and Lipman. Fasta-collection also includes programs ALIGN and LALIGN that do more rigorous global and local alignment respectively. Current BLAST2 algorithm is very similar to FASTA algorithm. BLAST2 is throughly optimized to speed and is several times faster than FASTA. Nevertheless, FASTA might be more sensitive for DNA comparisons.
On-line search of single sequences can be done in few seconds using WWW servers for FASTA or BLAST2

BLAST2 algorithm is principially similar to FASTA but is designed more professionally and I think it deserves more longer explanation. There I also try to explain how to interpret BLAST output (E-values, Score, alignments et al.).

For more information about BLAST and FASTA, read BLAST-FAQ, BLAST-FAQ at NCBI, BLAST-Manual or FASTA-Manual

Advanced programs for better alignments:

A general problem with all alignments is a choice of scoring matrix - usually it is determined empirically from the whole database. Nevertheless, the evolution of the protein that YOU are interested in might be quite different from this average model. The scoring matrix is specific to each separate sequence and even for separate domains of proteins. The alignments using genefamily-specific scoring matrices are more sensitive and more correct. All of the following programs take advantage of that fact. They run iterative searches against set of sequences (database) and try to improve scoring matrix after each round.

Psi-BLAST This program is based on BLAST2 algorithm and is meant for more sensitive database searches. It recalculates scoring matrix according to matches from previous search.
Most by Roman Tatusov. Another iterative database searcher. It was a predecessor of Psi-BLAST.
HMMER from Sean Eddy's lab
The last programs is based on Hidden Markov Models. It builds a model of sequence which is position-dependent: it is like having different scoring matrix for each position (amino-acid) of gene. In many ways, these hidden Markov models correspond to profiles. The HMM models are built from a set of aligned sequences. It is important to have only relevant sequences in this building stage, because nonrelated sequences would give a corrupt profile and accordingly wrong alignment afterwards. Once the HMM or profile model is available it can be used for extremely sensitive database searches or even for multiple alignment. Hundreds of prebuilt HMM models can be found from Pfam database

Alignments of genes:

Often user needs to align two (or more) protein coding genes (given DNA sequence). This is useful for comparing nucleotide substitution rates in synonymous codons or for measuring phylogenetic distance between closely related sequences. In this case you should consider that program should compare the codons (amino-acids) and gaps should be introduced in chunks of 3, to keep the alignment between homologous codons if possible.
The common Fasta or ClustalW programs are not appropriate for this task. Thus, for alignment of protein coding genes, use protein based DNA alignment programs (Tblastx or Genal).

Multiple alignments:

Multiple alignments and comparative analysis of multiple alignment programs.

Tutorials:

Here are some links to Internet lectures that are certainly worth reading. Some of the materials above is adopted from those pages:
Protein Sequence Alignment and Database Scanning by Geoffrey J. Barton Best tutorial on topic.
BioComputing Hypertext Coursebook from VSNS and University of Bielefild Little longer and more serious explanation to bioinformatics.

References and suggested reading:

Altschul SF (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555-565.
Altschul SF, Boguski MS, Gish W and Wootton JC (1994) Issues in searching molecular sequence databases. Nature Genetics 6: 119-129.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Dayhoff MO (1978) Atlas of Protein Sequence and Structure (Natl. Biomed. Res. Found., Washington), Vol. 5, Suppl. 3, pp. 345-352.
Dayhoff MO, Barker WC, Hunt LT (1983) Establishing homologies in protein sequences. Methods Enzymol 91: 524-545
Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256: 1443-1445
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992 89: 10915-10919
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid
sequence of two proteins. J Mol Biol 48: 443-453
Pascarella S & Argos P (1992) Analysis of insertions/deletions in protein structures. J.Mol.Biol. 224: 461-471
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147: 195-197
Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85: 2444-2448
J Mol Biol 1981 Mar 25;147(1):195-197Proc Natl Acad Sci USA 91: 12091-12095

Maido Remm, 1998