
Patterns in sequences
Searching for information within sequences.

Most common problems and their solutions:

Question: I have a gene sequence. I want to know where the restriction sites are.
Solution: Search your DNA for restriction sites [WebCutter]
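
As a rough illustration of what such a search does, the sketch below scans a DNA string for a few recognition sequences. The enzyme table is a small hand-picked sample and the function name find_sites is made up for this example; WebCutter itself uses a much larger enzyme list and may work differently.

```python
# Recognition sequences for three common enzymes. All three are palindromic,
# so scanning one strand is enough for this small sample.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(dna, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions]} for each recognition site."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)
            start = dna.find(site, start + 1)  # step by 1 to catch overlaps
        hits[name] = positions
    return hits

print(find_sites("ttGAATTCaaGGATCCccGAATTC"))
```

Positions are reported as the first base of the recognition sequence, not as the actual cut position inside it.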

Question: I have a gene sequence. I want to know where the coding regions are.
This is a non-trivial problem, particularly for higher eukaryotes with complex exon-intron structure and highly variable GC-content.
Simple solution: Search for Open Reading Frames (ORFs). [ORF Finder] [Alces Webtranslator] [FramePlot] [AAT]
Simple programs look only for start and stop codons and show you the regions that in principle CAN code for protein. This method fails in eukaryotes because of introns, and even in prokaryotes the existence of an ORF is not proof that a protein-coding gene exists. Both ORF Finder and FramePlot let you search your ORF product against general databases to find homologous genes. A database hit gives you additional evidence that the coding region is real.
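
The start-and-stop scan described above can be sketched in a few lines. This is only a minimal illustration, assuming ATG as the only start codon and the three standard stop codons; the function names and the min_codons parameter are invented here and do not correspond to any of the listed servers.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def orfs_in_strand(dna, min_codons=3):
    """Yield (frame, start, end) for ATG...stop stretches.

    Coordinates are 0-based, end exclusive, and the length in codons
    counts the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first ATG opens the ORF
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None                   # look for the next ORF

def find_orfs(dna, min_codons=3):
    """Scan both strands; minus-strand coordinates refer to the reverse complement."""
    dna = dna.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(dna, min_codons)]
    found += [("-", f, s, e)
              for f, s, e in orfs_in_strand(reverse_complement(dna), min_codons)]
    return found

print(find_orfs("CCATGAAATAGCC"))
```

Like the simple programs it imitates, this sketch knows nothing about introns or alternative start codons, so its output is only a list of candidates.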
In GC-rich bacterial genomes stop codons are rare and the beginnings of ORFs are extremely difficult to find. In this case it is wise to use FramePlot. It calculates the GC content at the 3rd position (GC3) of all codons of a potential ORF and plots it against the average GC content. Regions with higher GC3 are likely to be coding regions.
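
The GC3 idea can be tried by hand on a short window of sequence. The sketch below is an assumption about how such a calculation might look, not FramePlot's actual code: it reports the GC fraction at the third codon position for each of the three forward reading frames, so the frame with the highest value is the most likely coding frame in a GC-rich genome.

```python
def gc3_by_frame(window):
    """GC fraction at the third codon position for forward frames 0, 1 and 2."""
    window = window.upper()
    result = []
    for frame in range(3):
        thirds = window[frame + 2::3]   # every third base, starting at codon position 3
        if thirds:
            result.append(sum(base in "GC" for base in thirds) / len(thirds))
        else:
            result.append(0.0)
    return result

print(gc3_by_frame("ATGGCCGCGTAA"))
```

In practice the window would slide along the sequence and the three values would be plotted against position, next to the overall GC content.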
The real solution: Find ORFs with complex mathematical methods. [See special section on this page]

FIND FUNCTIONAL DOMAINS, PROMOTERS, SPLICING SITES, SIGNALS AND PATTERNS IN YOUR SEQUENCE

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence.
Question: I have a protein sequence. I want to know where functional domains, phosphorylation sites, transport signals, ATP binding sites, active sites of enzymes, etc. could be.
Solution 1: Search protein against PROSITE database. [PPSRCH]
The current release of PROSITE contains 997 documentation entries that describe 1335 different patterns, rules and profiles/matrices. PROSITE is curated by hand: all patterns and profiles are verified to represent real biological information. Some servers also offer the possibility to search the PROSITE prerelease database, whose entries have not yet been confirmed.
Motifs are presented (and searched) in the form of a pattern, e.g. [FY]-C-R-N-P-[DNR], or a profile. A profile is a table of the probabilities of each amino acid occurring at a given position. Profiles are more sensitive and usually longer. Patterns are simpler and shorter, but they do not detect rare exceptions to the common consensus sequence.
Some examples of patterns:

C-x-[DN]-x(4)-[FY]-x-C-x-C
Aspartic acid and Asparagine hydroxylation site
x(k) means ANY k amino acids
[] means ANY of the enclosed amino acids is permitted

{DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C
Prokaryotic membrane lipoprotein lipid attachment site.
{} means NONE of the enclosed amino acids is permitted
(k) means that the previous amino acid type is repeated k times

[KRHQSA]-[DENQ]-E-L>
Endoplasmic reticulum targeting sequence
> means the C-terminal

Glutamine amidotransferases class-II active site
< means the N-terminal
(k1,k2) means from k1 to k2 occurrences of the previous amino acid type
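
The notation rules listed above map almost one-to-one onto regular expressions. The following sketch shows one possible translation; the function name prosite_to_regex is invented for this example, and it handles only the symbols explained on this page (x, [], {}, (k), (k1,k2), < and >), not every feature of the real PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):            # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (4) or (2,5).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        residue, repeat = match.group(1), match.group(2)
        if residue == "x":
            residue = "."                          # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # none of the enclosed residues
        # [..] groups and single letters are already valid regex syntax.
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + residue + count + suffix)
    return "".join(parts)

regex = prosite_to_regex("[KRHQSA]-[DENQ]-E-L>")
print(regex, bool(re.search(regex, "MAAAKDEL")))
```

The resulting expression can be used with re.search or re.finditer on a protein sequence written in one-letter code.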

PPSRCH accepts multiple sequences as input. Analyze the results carefully: these databases contain both eukaryotic and prokaryotic patterns.

Solution 2: Search protein against BLOCKS database. [Blocks WWW server]
Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. Block Searcher, Get Blocks and BlockMaker are aids to the detection and verification of protein sequence homology. They compare a protein or DNA sequence to a database of protein blocks, retrieve blocks, and create new blocks, respectively. BLOCKS also lets you search for a suitable conserved region in which to design PCR primers! BLOCKS (together with the smaller PRINTS database) currently consists of 5188 entries representing 1204 protein groups. Blocks are generated automatically from protein database entries. They are not checked manually, so they might be of lower quality, but on the other hand BLOCKS is currently the most comprehensive motif database.
Block Searcher does not accept multiple sequence entries, but it does accept DNA!
BLOCKS database entries are poorly documented. Fortunately, the search result file contains links to similar entries from the PROSITE or PRINTS databases, which are well documented.
Solution 3: Search your protein against protein superfamilies (Pfam) database [Pfam]
Pfam is a database of multiple alignments of protein domains or conserved protein regions. They represent evolutionarily conserved structures that have implications for the protein's function. Pfam is built in two separate ways. Pfam-A consists of accurate, human-crafted multiple alignments, whereas Pfam-B is an automatic clustering of the rest of SwissProt using the program Domainer. The Pfam-A database contains about 570 protein superfamilies (they cover 50% of the SwissProt database). Pfam entries are usually longer than BLOCKS entries or PROSITE patterns. Unlike PROSITE, Pfam also contains a multiple alignment for each conserved domain.

Question: Where could my protein be located in the cell?
Solution: Search for protein sorting signals. [Psort] Predicts localization of both eukaryotic and prokaryotic proteins.

Question: I have a gene sequence. I want to know where the transcription factor binding sites and promoter(s) are.
Solution A: Search your DNA sequence against TRANSFAC or TFD database. [SignalScan] [FastM]
TRANSFAC is a database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human. It currently contains information on 4602 binding sites and 2285 transcription factors.
The FastM server lets you search for interesting combinations, e.g. in which genes binding sites for A and B occur within a defined distance of each other.
Solution B: Analyze your DNA with some promoter-finding program. [NNPP]

Question: I have a gene sequence. I want to know how good (typical) the translation start and stop codons of that gene are. Will my gene be expressed normally in ... cells?
Solution: Compare your gene start and end to other genes from the same species. [TransTerm]
The TransTerm database contains statistics and species-specific consensus sequences around the start and stop codons of genes from several hundred different organisms. Remember that translation is initiated differently in eukaryotes and prokaryotes. Eukaryotic ribosomes recognize the start codon itself. In human cells, a good consensus for starting translation is RNNATGG... In prokaryotic genes, ribosomes bind to the so-called Shine-Dalgarno sequence, which is located 10-20 nucleotides in front of the start codon.
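
To check a single start codon against the RNNATGG consensus by hand, something like the sketch below is enough. It is only an illustration: the function name is made up, R is read as A or G, and nothing beyond the seven positions of the consensus is examined.

```python
import re

# R (A or G) three bases before the ATG, any two bases, then ATG followed by G.
START_CONTEXT = re.compile(r"[AG][ACGT]{2}ATGG")

def has_good_start_context(mrna, atg_index):
    """True if the ATG at 0-based atg_index sits in an RNNATGG context."""
    if atg_index < 3:
        return False                      # not enough upstream sequence
    sequence = mrna.upper().replace("U", "T")
    window = sequence[atg_index - 3:atg_index + 4]
    return START_CONTEXT.fullmatch(window) is not None

print(has_good_start_context("GCCACCATGGCG", 6))
```

A start codon that fails this test can still be used by the cell; the consensus only describes what is typical for well-expressed genes.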

IDENTIFY SEQUENCE PATTERN IN PROTEIN FAMILY.

Question: I have a family of protein sequences. All of them contain a common motif (an active site, a binding site for other molecules, a transport signal, etc.). I want to extract the general consensus sequence of this motif.
Solution: Identify the pattern in the family of sequences.
You could try the following specialized pages:
[Blockmaker] on the BLOCKS server. Uses the same algorithms that are used for preparing BLOCKS database entries.
[eMOTIF] at Stanford. Needs aligned sequences as input.
[PRATT] at EBI. Pattern identification from non-aligned sequences.
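
For a small, already aligned and gap-free set of motif instances, a consensus pattern can also be built by hand, column by column. The sketch below is a deliberately naive illustration and not the algorithm used by Blockmaker, eMOTIF or PRATT; the function name and the max_alternatives cutoff are assumptions made for this example.

```python
def consensus_pattern(aligned, max_alternatives=3):
    """Build a PROSITE-style pattern from equal-length, gap-free aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                   # fully conserved
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]") # a few alternatives
        else:
            elements.append("x")                           # too variable
    return "-".join(elements)

print(consensus_pattern(["FCRNPD", "YCRNPN", "FCRNPR"]))
```

With only a handful of sequences such a pattern is usually too strict; the specialized servers weigh residues and test the pattern against a database to avoid that.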

SEARCH DNA PATTERN AGAINST DATABASE.

Question: I am working with a transcription factor. I have identified (refined) its binding site. I want to know what other genes could contain this binding site.
Solution: Search the pattern against a DNA database. Currently the only program that I know of that does this is PATSCAN.
PATSCAN accepts both patterns and matrices.
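
On a single sequence, a pattern search of this kind can be sketched with regular expressions. The example below is not PATSCAN and does not use its pattern syntax; it assumes the binding site is written with standard IUPAC nucleotide codes, and the function name scan is invented here.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def scan(dna, site):
    """Return (strand, 1-based start on the given strand of dna) for every match.

    The minus strand is searched by looking for the reverse complement of
    the site, so all positions are in forward-strand coordinates.
    A palindromic site is therefore reported once per strand.
    """
    dna, site = dna.upper(), site.upper()
    hits = []
    for strand, motif in (("+", site), ("-", site.translate(COMPLEMENT)[::-1])):
        # A lookahead is used so that overlapping matches are all reported.
        regex = re.compile("(?=" + "".join(IUPAC[base] for base in motif) + ")")
        hits += [(strand, m.start() + 1) for m in regex.finditer(dna)]
    return hits

print(scan("AAGGTCATT", "RGGTCA"))
```

Searching a whole database this way would be slow; the sketch is meant only to show what "pattern against DNA" means in practice.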

ANALYSIS OF CODON USAGE. CORRESPONDENCE ANALYSIS OF GENES.

Question: I have many genes from one species. I want to know which codons are preferred and how the genes are distributed by their codon usage.
Solution: Do codon usage analysis and/or correspondence analysis on your genes. Codon usage
Correspondence analysis is a statistical method that tells you about the distribution and similarity of your genes based on codon usage. The Lyon server can run correspondence analysis on many genes, but it is complicated to use. If you need to do a lot of codon usage analysis, I suggest downloading and installing the program codonW.
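
The counting step behind any codon usage analysis is simple enough to sketch. The example below assumes the standard genetic code and a coding sequence that starts in frame; the function name codon_usage is made up, and the output is just raw counts grouped by amino acid, without the statistics that codonW or the servers add on top.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds):
    """Count codons in a coding sequence and group the counts by amino acid."""
    cds = cds.upper().replace("U", "T")
    codons = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    usage = {}
    for codon, count in codons.items():
        if codon in CODON_TABLE:       # skip codons containing N or other symbols
            usage.setdefault(CODON_TABLE[codon], {})[codon] = count
    return usage

print(codon_usage("ATGCCGCCGCCATAA"))
```

Comparing the counts within one amino acid (for example the four Pro codons) shows which synonymous codon a gene prefers.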

COMPLEX ANALYSIS OF LONG SEQUENCES. AUTOMATIC IDENTIFICATION OF GENES AND PROMOTERS

Question: I have a eukaryotic gene sequence. I want to know whether there are any exons, promoters, splicing sites, transcription factor binding sites, polyadenylation sites or other eukaryotic gene features in it.
Solution: Victor Solovyov's collection of programs
in EBI, England
in BMC, Texas, US
Both collections include famous eukaryotic exon-finding programs like Grail II, FGENEH, Genie and Hexon, as well as programs for predicting promoters, splicing sites and transcription factor binding sites. Most of those programs try to find potential splice junctions, open reading frames and promoters in combination. They are based on advanced learning and recognition methods, like Hidden Markov Models and Neural Networks, that use sample datasets for training. Therefore, they are good only for DNA from the species they were trained on (mostly human and other mammals).

Question: I have a prokaryotic gene sequence. I want to know where are the ORFs, promoters, ribosome binding sites or other prokaryotic gene features in it.
Solution: GeneMark by M. Borodovsky
in EBI, England
in GIT, Georgia, US
GeneMark is a learning program: it needs to see at least 10 kb of sequence before it can make correct predictions. Fortunately, data for the most common prokaryotes and lower eukaryotes have already been included in the program. The EBI version also has data for human and A. thaliana genes, based on their GC content.

For a short overview of sequence comparisons and alignments, read the additional tutorial.

Assignments for week 3:

Collection of programs in Pasteur Institute, France
Collection of programs in CMS, Italy
Recent paper in Nature Genetics, with hyperlinks.

1. Download complete genes for elongation factor Tu (tuf) and elongation factor G (fus) from the following organisms:

Rickettsia prowazekii
Chlamydia trachomatis
Mycoplasma pneumoniae
Bacillus subtilis
Treponema pallidum
Escherichia coli
Mycobacterium leprae
Mycobacterium tuberculosis

There may be more than one gene from each organism.
Hint: After you retrieve the gene from the DNA databank, the entry may contain additional sequences or other genes. Click on the CDS link beside the correct gene to retrieve only the coding part of the gene.
Use the following form to calculate basic characteristics for those genes: total GC content, GC content at each codon position and the observed Nc (effective number of codons) value. There are two Nc values in the form output: the first is the expected Nc (calculated theoretically from the GC3 content), the second is the observed Nc. Make two plots from the data:
Plot A. GC1, GC2 and GC3 plotted against total GC of the gene.
Plot B. Observed Nc plotted against GC3s. See how much observed codon usage differs from the theoretical codon usage.
Send or show the plots. On plot A, draw 3 lines through all GC1, GC2 and GC3 points. Do they have different slopes? Why? On plot B, what could cause the difference from the theoretical value? Which genome has the most significant difference between theoretical and observed codon usage?
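
If you want to cross-check the numbers from the form, the two quantities needed for the plots can be computed with a short script. This is a sketch under stated assumptions: the function names are invented, GC3 is taken over all codons (the form may use GC3s, which excludes Met, Trp and stop codons), and the expected Nc uses the approximation published by Wright (1990), Nc = 2 + s + 29 / (s^2 + (1 - s)^2) with s = GC3s, which may not be exactly what the form implements.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3) as fractions over the complete codons of a CDS.

    The sequence must contain at least one complete codon.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    n_codons = usable // 3
    return tuple(
        sum(base in "GC" for base in cds[pos:usable:3]) / n_codons
        for pos in range(3)
    )

def expected_nc(gc3s):
    """Expected effective number of codons when only GC3s shapes codon usage."""
    return 2 + gc3s + 29 / (gc3s ** 2 + (1 - gc3s) ** 2)

print(gc_by_position("ATGGCCGCGTAA"), expected_nc(0.5))
```

Plotting expected_nc over the range 0 to 1 gives the theoretical curve for plot B; genes that fall well below it use fewer codons than their GC3 content alone would predict.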

2. Analyze the frequency of codons (codon usage) in the tuf genes using the links mentioned above. Which codons are preferred in your genes? Are the preferred codons the same in each organism?
Hint: Here is the codon table to help you find which codons code for the same amino acid. You do not have to compare all codon families; take 2 or 3 (Proline and Threonine are good examples for analyzing codon usage).

Phe     UUU     Ser     UCU     Tyr     UAU     Cys     UGU
        UUC             UCC             UAC             UGC
Leu     UUA             UCA     TER     UAA     TER     UGA
        UUG             UCG             UAG     Trp     UGG

        CUU     Pro     CCU     His     CAU     Arg     CGU
        CUC             CCC             CAC             CGC
        CUA             CCA     Gln     CAA             CGA
        CUG             CCG             CAG             CGG

Ile     AUU     Thr     ACU     Asn     AAU     Ser     AGU
        AUC             ACC             AAC             AGC
        AUA             ACA     Lys     AAA     Arg     AGA
Met     AUG             ACG             AAG             AGG

Val     GUU     Ala     GCU     Asp     GAU     Gly     GGU
        GUC             GCC             GAC             GGC
        GUA             GCA     Glu     GAA             GGA
        GUG             GCG             GAG             GGG

Hint: If you are using the Alces server for codon usage analysis, remember that it has three different modes:
A. Translate
B. Codon Table (what you probably need)
C. CAI values
To change the output, change the button at the end of the form, just above the submit button. Another trap is the first button at the top of the form: it has to be set to "Raw" if you use copy and paste.
For all codon usage programs you have to use fasta format or just plain sequence; numbers have to be removed. SRS5 has those options (e.g. saving the sequence in fasta format), but they might be difficult to find. They are now changing their forms, so their appearance is not entirely consistent. Try the various buttons in the SRS form to convert your output sequence to fasta format, or try the sequence converters in Singapore or in Texas.
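
If no converter is at hand, removing the numbers yourself is not difficult. The sketch below is a hypothetical helper, not part of SRS or of any converter mentioned here: it drops everything that is not a letter (position numbers, spaces, line breaks) and writes the rest as a fasta record.

```python
import re

def to_fasta(name, raw, width=60):
    """Strip digits, whitespace and other non-letters, then wrap as fasta."""
    sequence = re.sub(r"[^A-Za-z]", "", raw)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

numbered = "  1 atggct aaagaa\n 13 aaattt"
print(to_fasta("tuf", numbered, width=10))
```

Note that symbols such as * or - are removed as well, so this is suitable for plain DNA or protein sequence but not for alignments.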

3. Identify Open Reading Frames (ORFs) on the personal contig that was assigned to you for the previous homework. Try several different ORF-finding programs. Choose 3 ORFs of your own choice for further analysis.
Hint: FramePlot might not be able to return the GIF image if your sequence is longer than 35 kb. In this case use only half of your contig for finding ORFs. Which program was most convenient for finding ORFs? Why?

4. Find patterns in protein sequences. First read some good documentation pages about the PROSITE, BLOCKS and Pfam databases. Follow the links and try to understand what patterns are, what matrices are and how they are selected for the databases. Now find potential active sites and other patterns using the PROSITE, BLOCKS and Pfam databases. Take one sample protein and see what results you can get from those databases. After getting the results, try to identify the sample protein by a BLAST search. Compare the sequence description in the database with the data you got from your pattern search. Are the discovered patterns mentioned in the database description?
Now try to identify patterns in your own sequences.
Translate some (at least 3) ORFs from task 3 to protein sequence. Alternatively, use some proteins from your own scientific project. Which homologies do you find in each database? Did you get any useful information from those searches? Can you predict protein function based on this search?

5. Test the promoter identification programs: try to identify promoters in that piece of eukaryotic DNA.

6. Generate a result file with answers to each task. Answer the questions marked in red. Send it to me by email or with the form below. Feel free to email or see me if you have any questions about interpreting your results or understanding the program input, output, algorithms, etc.

A form for sending your results:

Maido Remm
