Patterns in sequences
Searching for information within sequences.
Most common problems and their solutions:
Question: I have a gene sequence. I want to know where the restriction sites are.
Solution: Search your DNA for restriction sites
[WebCutter]
Question: I have a gene sequence. I want to
know where the coding regions are.
This is a non-trivial problem, particularly for higher eukaryotes with 
complex exon-intron structure and highly variable GC-content.  
Simple solution:
Search for Open Reading Frames (ORFs). 
[ORF Finder]  
[Alces Webtranslator] 
[FramePlot] 
[AAT]  
Simple programs look only for start and stop codons and show
you the regions that could, in principle, code for protein. This method fails in
eukaryotes because of introns, and even in prokaryotes the existence of an ORF
does not prove the existence of a protein-coding gene. Both ORF Finder
and FramePlot let you search your ORF product against
general databases to find homologous genes. A database hit gives you additional
evidence that the coding region is real.
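What such a simple program does can be sketched in a few lines of Python. This is only an illustration, not the algorithm of ORF Finder or FramePlot: the function names and the `min_codons` cutoff are assumptions made for the example, and only ATG is treated as a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end). Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

Every stretch reported this way is only a candidate: the scan knows nothing about introns, promoters or codon statistics.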
In GC-rich bacterial genomes stop codons are rare and the beginnings of ORFs are
extremely difficult to find. In this case it is wise to use FramePlot. It
calculates the GC content at the 3rd position (GC3) of all codons in a potential
ORF and plots it against the average GC content. Regions with higher GC3 are
likely to be coding regions.
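The GC3 idea can be illustrated with a short Python sketch. It computes the GC content at each codon position of a sequence that is assumed to be in frame; the function names are hypothetical, and FramePlot itself may compute and smooth these values differently.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) for an in-frame coding sequence."""
    cds = cds.upper()
    cds = cds[: len(cds) - len(cds) % 3]  # drop an incomplete last codon
    return tuple(gc_fraction(cds[pos::3]) for pos in range(3))


orf = "ATGGCCGGC"
print("total GC:", gc_fraction(orf))
print("GC1, GC2, GC3:", gc_by_codon_position(orf))
```

A candidate ORF whose GC3 is clearly above the overall GC content of the genome is a better coding candidate than one whose GC3 matches the background.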
The real solution:  Find ORFs with complex
mathematical methods.  [See special section on this
page]
FIND FUNCTIONAL DOMAINS, PROMOTERS, SPLICING SITES, SIGNALS AND PATTERNS IN YOUR SEQUENCE
 In some cases the sequence of an unknown protein is too distantly related to any
protein of known structure to detect its resemblance by overall sequence
alignment, but it can be identified by the occurrence in its sequence of a
particular cluster of residue types which is variously known as a pattern,
motif, signature, or fingerprint. These motifs arise because of particular
requirements on the structure of specific region(s) of a protein which may be
important, for example, for their binding properties or for their enzymatic
activity. These requirements impose very tight constraints on the evolution of
those limited (in size) but important portion(s) of a protein sequence.
Question: I have a protein sequence.
I want to know where it might contain functional domains, phosphorylation
sites, transport signals, ATP binding sites, active sites of enzymes, etc.
Solution 1: 
Search protein against PROSITE database. 
[PPSRCH]
The current release of PROSITE
contains 997 documentation entries that describe 1335
different patterns, rules and profiles/matrices. PROSITE is curated by hand: all
patterns and profiles are verified to represent real biological information.
Some servers also offer the possibility to search the PROSITE prerelease database,
whose entries have not yet been confirmed.
Sequence motifs are represented
(and searched) in the form of a pattern, e.g. [FY]-C-R-N-P-[DNR], or a profile.
A profile is a table of the probabilities of each amino acid occurring at a given
position. Profiles are more sensitive and usually longer. Patterns are simpler
and shorter, but they do not detect rare exceptions to the common consensus sequence.
Some examples of patterns:
     C-x-[DN]-x(4)-[FY]-x-C-x-C 
     Aspartic acid and Asparagine hydroxylation site 
     x(k) means ANY k amino acids 
     [] means ANY of the enclosed amino acids is permitted 
     {DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C 
     Prokaryotic membrane lipoprotein lipid attachment site. 
     {} means NONE of the enclosed amino acids is permitted 
     (k) means that the previous amino acid type is repeated k times 
     [KRHQSA]-[DENQ]-E-L> 
     Endoplasmic reticulum targeting sequence 
     > means the C-terminal 
          Glutamine amidotransferases class-II active site 
     < means the N-terminal 
     (k1,k2) means from k1 to k2 occurrences of the previous amino acid type 
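The notation above maps almost directly onto regular expressions. The following Python sketch translates the elements listed here (x, [], {}, (k), (k1,k2), < and >) into a regex; `prosite_to_regex` is a name made up for this example, and less common parts of the PROSITE syntax are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the residue specification from an optional (k) or (k1,k2).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":               # any amino acid
            core = "."
        elif core.startswith("{"):    # none of the enclosed amino acids
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")
print(regex)
print(re.search(regex, "MAAKDEL"))
```

A regex match only says that the pattern occurs; as with PROSITE itself, short patterns will also match by chance, so hits still need to be checked against the documentation entry.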
PPSRCH allows multiple sequences in input. Analyze results carefully: all
those databases contain both eukaryotic and prokaryotic patterns.
Solution 2: 
Search protein against BLOCKS database. 
[Blocks WWW server]
Blocks are multiply aligned
ungapped segments corresponding to the most highly conserved regions of
proteins. Block Searcher, Get Blocks and BlockMaker are aids to the detection
and verification of protein sequence homology. They compare a protein
or DNA sequence to a database of protein blocks, retrieve blocks,
and create new blocks, respectively. BLOCKS also lets you search for a
suitable conserved region in which to design PCR primers!
BLOCKS (together with the smaller PRINTS database) currently consists of 5188
entries representing 1204 protein groups. Blocks are generated automatically
from protein database entries. BLOCKS entries are not manually checked,
so they may be of lower quality; on the other hand, BLOCKS is
currently the most comprehensive motif database.
Block Searcher does not accept multiple sequence entries, but it does accept DNA.
BLOCKS database entries are poorly documented; fortunately, the search result
file contains links to similar entries from the PROSITE or PRINTS databases, which are
well documented.
Solution 3: Search your protein against protein
superfamilies (Pfam) database
[Pfam] 
 Pfam is a database of multiple alignments of protein domains
or conserved protein regions. Each alignment represents an evolutionarily conserved
structure that has implications for the protein's function. Pfam is
built in two separate ways. Pfam-A consists of accurate, manually curated multiple
alignments, whereas Pfam-B is an automatic clustering of the rest of SwissProt
made with the program Domainer. The Pfam-A database contains ca. 570 protein superfamilies
(they cover 50% of the SwissProt database). Pfam entries are usually longer than
BLOCKS or PROSITE patterns or motifs. Unlike PROSITE, Pfam also contains a multiple
alignment for each conserved domain.
Question: Where could my protein be located in the cell?
Solution: Search for protein sorting signals.
[Psort]
Predicts localization of both eukaryotic and prokaryotic proteins.
Question: I have a gene sequence. I want to know
where the transcription factor binding sites and promoter(s) are.
Solution A: Search your DNA sequence against 
TRANSFAC or TFD database.
[SignalScan]
[FastM]
TRANSFAC is a database of
eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
It covers the whole range from yeast to human. It currently contains information
on 4602 binding sites and 2285 transcription factors.
The FastM server lets you search for interesting combinations, e.g. in which
genes binding sites for A and B occur within a defined distance of each other.
Solution B: Analyze your DNA
with some promoter-finding program.
[NNPP]
Question: I have a gene sequence. I want to know how good (typical) the translation
start and stop codons of that gene are. Will my gene be expressed normally in ...
cells?
Solution: 
Compare your gene start and end to other genes from the same species.
[TransTerm]
The TransTerm database contains
statistics and species-specific consensus sequences around the start and stop
codons of genes from several hundred different organisms. Remember that
translation is initiated differently in eukaryotes and prokaryotes. Eukaryotic
ribosomes recognize the start codon itself. In human cells, a good consensus
for starting translation is RNNATGG... In prokaryotic genes, ribosomes
bind to the so-called Shine-Dalgarno sequence, which is located 10-20 nucleotides
upstream of the start codon.
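As a small illustration of checking a start codon against the RNNATGG consensus (R = A or G), the Python sketch below tests the three bases before an ATG and the base after it. The function name is hypothetical, and this is only a yes/no test, not the species-specific statistics that TransTerm provides.

```python
def matches_rnnatgg(seq, atg_index):
    """True if the ATG at atg_index (0-based) sits in an RNNATGG context."""
    if atg_index < 3:
        return False  # not enough upstream sequence to judge
    context = seq.upper()[atg_index - 3:atg_index + 4]
    return (
        len(context) == 7
        and context[0] in "AG"      # R at position -3
        and context[3:6] == "ATG"   # the start codon itself
        and context[6] == "G"       # G at position +4
    )


print(matches_rnnatgg("GCCACCATGGCG", 6))
print(matches_rnnatgg("GCCTCCATGACG", 6))
```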
IDENTIFY SEQUENCE PATTERN IN PROTEIN FAMILY.
Question: I have a family of protein sequences. 
All of them contain common motif (active site or binding site for other
 molecules or transport signal, etc). I want to extract the general consensus 
sequence of this motif.
Solution: Identify pattern in family of sequences
You could try the following specialized pages:
  
[Blockmaker] 
BLOCKS server. Uses the same algorithms as those used for preparing BLOCKS
database entries.
[eMOTIF] in Stanford.
Needs aligned sequences for input.
[PRATT] in EBI.
Pattern identification from non-aligned sequences.
SEARCH DNA PATTERN AGAINST DATABASE.
Question: I am working with a transcription
factor. I have identified (refined) its binding site. I want to know what other
genes could contain this binding site.
Solution:
Search the pattern against a DNA database.
Currently the only program (that I know of) that does this is
PATSCAN.
PATSCAN accepts both patterns and matrices.
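For a single sequence on your own computer, a pattern search of this kind can be sketched in Python. The example assumes the binding site is written as an IUPAC consensus and searches both strands; it is not PATSCAN's syntax or algorithm, and the consensus and sequence used are made up.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def find_sites(consensus, seq):
    """Return sorted (position, strand) hits of an IUPAC consensus.

    Positions are 0-based on the given strand. Minus-strand sites are found
    by searching for the reverse complement of the consensus.
    """
    consensus = consensus.upper()
    revcomp = consensus.translate(COMPLEMENT)[::-1]
    hits = set()
    for strand, pattern in (("+", consensus), ("-", revcomp)):
        # The lookahead lets overlapping sites be reported as well.
        regex = re.compile("(?=" + "".join(IUPAC[c] for c in pattern) + ")")
        hits.update((m.start(), strand) for m in regex.finditer(seq.upper()))
    return sorted(hits)


print(find_sites("TGACGY", "AATGACGTCAGG"))
```

A palindromic site is reported once for each strand at the same position.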
ANALYSIS OF CODON USAGE. CORRESPONDENCE ANALYSIS OF GENES.
Question:
I have many genes from one species. I want to know which codons are
preferred and how the genes are distributed by their codon usage.
Solution:
Do codon usage analysis and/or correspondence analysis
on your genes.
 Codon usage
Correspondence analysis is a statistical
method that tells you about the distribution and similarity of your genes based on
codon usage. The Lyon server
can perform correspondence analysis on many genes, but it is complicated to use.
If you need to do a lot of codon usage analysis, I suggest downloading and installing the
program codonW.
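The basic counting step behind any codon usage analysis is simple enough to sketch in Python. The function names are assumptions for this example, the proline family is used only as an illustration, and codonW computes many more indices than this.

```python
from collections import Counter

PROLINE = ["CCU", "CCC", "CCA", "CCG"]


def codon_counts(cds):
    """Count codons of an in-frame coding sequence (reported as RNA codons)."""
    rna = cds.upper().replace("T", "U")
    return Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))


def family_fractions(counts, family):
    """Relative usage of each synonymous codon within one codon family."""
    total = sum(counts[codon] for codon in family)
    return {codon: (counts[codon] / total if total else 0.0) for codon in family}


counts = codon_counts("ATGCCGCCGCCATAA")
print(counts)
print(family_fractions(counts, PROLINE))
```

With real genes the fractions become meaningful only when the family is represented by enough codons, so short genes are best pooled before comparing organisms.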
COMPLEX ANALYSIS OF LONG SEQUENCES. AUTOMATIC IDENTIFICATION OF GENES AND PROMOTERS
Question: I have a eukaryotic gene sequence. I
want to know whether there are any exons, promoters, splice sites,
transcription factor binding sites, polyadenylation sites or other eukaryotic
gene features in it.
Solution: 
Victor Solovyov's collection of programs 
in EBI, England 
in BMC, Texas, US
Both collections include well-known eukaryotic
exon-finding programs like Grail II, FGENEH, Genie and Hexon, as well as programs
for predicting promoters, splice sites and transcription factor binding sites.
Most of these programs try to find potential splice junctions, open reading
frames and promoters in combination. They are based on advanced learning-recognition
methods like Hidden Markov Models and Neural Networks, which use sample datasets for
training. Therefore, they work well only for DNA from the species they were trained on
(mostly human and other mammalian DNA).
Question: I have a prokaryotic gene sequence. I
want to know where the ORFs, promoters, ribosome binding sites or other
prokaryotic gene features are.
Solution: GeneMark by M. Borodovsky
in EBI, England
in GIT, Georgia, US
GeneMark is a learning program:
it needs to see at least 10 kb of sequence before it can make correct predictions.
Fortunately, data for the most common prokaryotes and lower eukaryotes have
already been included in the program. The EBI version also has data for human and
A. thaliana genes, based on their GC content.
For a short overview of sequence comparisons and alignments, read the
additional tutorial.
Collection of programs in Pasteur Institute, France 
Collection of
programs in CMS, Italy
Recent paper in
Nature Genetics, with hyperlinks.
  
 
1. Download complete genes for elongation factor Tu (tuf) 
and elongation factor G (fus) from the following organisms:
- Rickettsia prowazekii
- Chlamydia trachomatis
- Mycoplasma pneumoniae
- Bacillus subtilis
- Treponema pallidum
- Escherichia coli
- Mycobacterium leprae
- Mycobacterium tuberculosis
There may be more than one gene from each organism.
Hint: After retrieving the gene from the
DNA databank, the entry may contain additional
sequences or other genes. Click on the CDS link beside the correct gene to
retrieve only the coding part of the gene.
Use the following
form to calculate
basic characteristics for those genes: total GC content, GC content at each codon
position and the observed Nc (effective number of codons) value. There are two Nc values
in the form output: the first is the expected Nc (theoretically calculated from GC3
content), the second is the observed Nc.
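For orientation, the expected Nc is usually derived from GC3s with Wright's (1990) approximation, Nc = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3s fraction. The Python sketch below assumes that the form uses this formula; this has not been checked against the form, and the function name is made up.

```python
def expected_nc(gc3s):
    """Expected effective number of codons for a given GC3s fraction (0..1).

    Uses Wright's approximation, which assumes that codon usage is shaped
    only by base composition at synonymous third positions.
    """
    return 2 + gc3s + 29 / (gc3s ** 2 + (1 - gc3s) ** 2)


for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, round(expected_nc(s), 1))
```

Genes whose observed Nc lies well below this curve use fewer codons than their base composition alone would predict.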
Make two plots from the data:
Plot A. GC1, GC2 and GC3 plotted against total GC of the gene.
Plot B. Observed Nc plotted against GC3s.
Send or show the plots. On plot A, draw 3 lines
 through all GC1, GC2 and GC3 points. Do they have different slopes? Why?
On plot B, see how much the observed codon usage differs from the theoretical
codon usage. What could cause the difference from the theoretical value?
Which genome has the most significant difference between theoretical
and observed codon usage?
2. Analyze the frequency of codons (codon usage) in the tuf genes using the
links mentioned above. Which codons are preferred
in your genes? Are the preferred codons the same in each organism?
Hint:
Here is the codon table to help you find which codons code for the same
amino acid. You do not have to compare all codon families; take 2 or 3 (Proline
and Threonine are good examples for analyzing codon usage).
Phe  UUU    Ser  UCU    Tyr  UAU    Cys  UGU
     UUC         UCC         UAC         UGC
Leu  UUA         UCA    TER  UAA    TER  UGA
     UUG         UCG         UAG    Trp  UGG

Leu  CUU    Pro  CCU    His  CAU    Arg  CGU
     CUC         CCC         CAC         CGC
     CUA         CCA    Gln  CAA         CGA
     CUG         CCG         CAG         CGG

Ile  AUU    Thr  ACU    Asn  AAU    Ser  AGU
     AUC         ACC         AAC         AGC
     AUA         ACA    Lys  AAA    Arg  AGA
Met  AUG         ACG         AAG         AGG

Val  GUU    Ala  GCU    Asp  GAU    Gly  GGU
     GUC         GCC         GAC         GGC
     GUA         GCA    Glu  GAA         GGA
     GUG         GCG         GAG         GGG
 
Hint:
If you are using the Alces server for
codon usage analysis, remember that it has three different modes:
A. Translate
B. Codon Table (what you probably need)
C. CAI values
To change the output mode, use the button at the end of the form, just
above the submit button. Another trap is the first button at the top of the form:
it has to be set to "Raw" if you use copy-paste.
In any case, all codon usage programs require FASTA format
or just plain sequence; position numbers have to be removed. SRS5 has these options
(save sequence in FASTA format), but they might be difficult to find. The SRS
forms are currently being changed, so their layout is not entirely consistent.
Try the various buttons in the SRS form to convert your output
sequence to FASTA format, or use the sequence converters in Singapore or
 in Texas.
 
3. Identify Open Reading Frames (ORFs) on your personal contig that was
assigned for the previous homework.
 Try several different ORF-finding programs. Choose 3 ORFs of your own choice
for further analysis.
 Hint:
FramePlot might not be able to return the GIF image if your
sequence is longer than 35 kb. In that case, use only half of your contig for
finding ORFs.
 Which program was most convenient for finding ORFs? Why?
 
4. Find patterns in protein sequences. First read some good
documentation pages about the
PROSITE,
BLOCKS and
Pfam
databases.
Follow the links and try to understand what patterns are, what
matrices are and how they are selected for the databases.
Now find potential active sites and other patterns using the PROSITE, BLOCKS and
Pfam databases. Take one sample protein
and see what results you can get from those databases.
After getting the results, try to identify the sample protein by a BLAST search.
Compare the sequence description in the database with the data you got from your pattern
search. Are the discovered patterns mentioned in the
database description?
Now try to identify patterns in your own sequences.
 Translate some (at least 3) ORFs from task 3 into protein sequences.
Alternatively, use some proteins from your own scientific project.
Which homologies do you find in each database? Did you get any useful
information from those searches? Can you predict protein function based on this
search?
 
5. Test the promoter identification programs: try to identify promoters in that
piece of eukaryotic DNA. 
 
6. Generate a result file with answers to each task. Answer the questions marked in red. Send it to me
 by email or with the form below.
Feel free to email or see me if you have any questions about interpreting your
results, understanding the program input, output or algorithms etc. 
A form for sending your results:
Maido Remm