Analysis of biological sequences (Lesk, chapter 4)
Sequence alignment
• Sequence assembly
• Classification
• Prediction of function
• Comparative genomics
• Phylogeny / Evolutionary history
Pattern matching / Recognition of signals / statistical properties / character relationships
• Prediction of protein function
• Identification of transcription regulatory sites
• Gene prediction
• RNA and protein secondary structure prediction
Regulatory elements
Promoter
Translation start
Transcription stop
polyA signal
Transcription start
Translation stop
Exons
Introns
Expression from a eukaryotic gene
Transcription
Translation
DNA
RNA (primary transcript)
RNA (spliced)
Protein
Two-dimensional weight matrices are used in identification of splicing signals

         Exon   |    Intron
%G    11    74  |  100     0    29
%A    64     9  |    0     0    61
%U    13    12  |    0   100     7
%C    11     6  |    0     0     2
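A weight matrix like the one above can be used to score candidate sites in a sequence. The following is a minimal sketch, not the method of any particular gene-prediction program: it multiplies the per-position frequencies from the table, and the function names `site_score` and `best_sites` as well as the example RNA string are made up for illustration.

```python
# Per-position base frequencies (percent), copied from the weight matrix above.
WEIGHT_MATRIX = {
    "G": [11, 74, 100, 0, 29],
    "A": [64, 9, 0, 0, 61],
    "U": [13, 12, 0, 100, 7],
    "C": [11, 6, 0, 0, 2],
}


def site_score(site):
    """Product of the per-position frequencies (as fractions) for a 5-nt site."""
    if len(site) != 5:
        raise ValueError("site must be 5 nucleotides long")
    score = 1.0
    for position, base in enumerate(site):
        score *= WEIGHT_MATRIX[base][position] / 100.0
    return score


def best_sites(rna, top=3):
    """Score every 5-nt window and return the best (score, start, window) tuples."""
    windows = [(site_score(rna[i:i + 5]), i, rna[i:i + 5])
               for i in range(len(rna) - 4)]
    return sorted(windows, reverse=True)[:top]


if __name__ == "__main__":
    # Hypothetical test sequence, not taken from the slides.
    for score, start, window in best_sites("CCAGGUAAGUCC"):
        print(start, window, round(score, 4))
```

Real programs usually work with log-odds scores and pseudocounts rather than raw products, so that a single zero in the matrix does not rule a site out completely.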
Prediction of RNA secondary structure
5’ GCCUCUUGGC 3’
[Figure: the same sequence drawn folded into a stem-loop (hairpin), 5’ to 3’]
Problem of sequence alignment - interaction between molecular biology / computer science / statistics
* What biological problems are addressed?
* Algorithm (dynamic programming)
* ‘Simple’ implementation / source code / compiling
* Common implementations in molecular biology software packages
* Statistics and probability theory of alignments
Biological aspects of sequence alignments
Why do we want to align 2 sequences?
As one example, consider this common application:
We have a ‘new’ sequence. Is it similar to a previously known sequence?
Alignment to all previously known sequences. (Many of these have annotation such as a description of function )
similarity
?
no similarity
• Prediction of function
• Phylogeny / evolutionary history
Basic concepts of protein sequence alignments
Proteins are homologous if they are related by divergence from a common ancestor.
Two kinds of homology:
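Orthologs: homologous proteins in different species that arose by speciation; they typically carry out the same function
Paralogs: homologous proteins that arose by gene duplication; they typically perform different but related functions within one organism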
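[Figure: Speciation. Gene X of the ancestral organism is inherited by organism A and organism B; the two resulting copies (X1, X2) are orthologs.]
[Figure: Gene duplication. Gene X is duplicated within one organism; the two resulting copies (Xa, Xb) are paralogs.]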
Mouse trypsin       -- orthologs --  Human trypsin
      |                                   |
   paralogs                            paralogs
      |                                   |
Mouse chymotrypsin  -- orthologs --  Human chymotrypsin
Ortholog / paralog relationships may be identified using local alignment algorithms such as Smith-Waterman
But: databases are huge, current nucleotide database = 100 billion nucleotides
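To make the cost of a full local alignment concrete, here is a minimal Smith-Waterman sketch. It assumes a simple match/mismatch score and a linear gap penalty (the values 2, -1 and -2 are arbitrary choices for illustration); real protein searches use a substitution matrix and affine gap penalties. Every cell of the matrix is filled, so the work grows with the product of the two sequence lengths, which is why this is too slow for scanning a whole database.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; scores never drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, x, y = smith_waterman("MAKLQGALGKRY", "MAKIQGALAKRY")
    print(score)
    print(x)
    print(y)
```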
Comparing 2 sequences - Dotplot analysis

  M A K L Q G A L G K R Y
M *
A   *         *
K     *             *
I
Q         *
G           *     *
A   *         *
L       *       *
A   *         *
K     *             *
R                     *
Y                       *

Sequence alignment

M A K L Q G A L G K R Y
* * *   * * * *   * * *
M A K I Q G A L A K R Y
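The dot plot for the two example sequences can be reproduced with a few lines of code. This is only a sketch of the idea (a mark wherever two residues are identical); programs such as dottup compare words rather than single residues. The function name `dotplot` is made up for this example.

```python
def dotplot(horizontal, vertical):
    """Return a text dot plot with '*' wherever the two residues are identical."""
    lines = ["  " + " ".join(horizontal)]
    for v in vertical:
        cells = ["*" if v == h else " " for h in horizontal]
        lines.append((v + " " + " ".join(cells)).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(dotplot("MAKLQGALGKRY", "MAKIQGALAKRY"))
```

The long diagonal corresponds to the alignment of the two sequences; the off-diagonal marks come from residues that occur more than once (A, K, L, G).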
Searching databases with FASTA / BLAST
Improvement of speed as compared to local alignment algorithm:
Initial search is for short words. Word hits are then extended in either direction.
First step in BLAST - obtaining a list of words based on the query sequence
Query sequence: FSGTWAMA ....
Words derived from query sequence: FSG, SGT, GTW, TWA .... etc
GTW (6+5+11=22)
GSW (6+1+11=18)
GNW (6+0+11=17)
GAW (6+0+11=17)
ATW (0+5+11=16)
DTW (-1+5+11=15)
GTF (6+5+1=12)
------ threshold ------
GTM (6+5-1=10)
DAW (-1+0+11=10)
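The word list and the scores above can be reproduced with a short sketch. The pair scores below are only the ones that appear in the arithmetic on the slide (they agree with BLOSUM62); a real implementation would use the full matrix and enumerate all possible words. The threshold value of 11 is an assumption, chosen only because it separates the words scoring 12 or more from those scoring 10.

```python
# Substitution scores read off the sums shown above (a small subset of a full matrix).
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("N", "T"): 0, ("A", "T"): 0,
    ("A", "G"): 0, ("D", "G"): -1, ("F", "W"): 1, ("M", "W"): -1,
}


def pair_score(a, b):
    """Look up a symmetric substitution score."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def query_words(sequence, word_size=3):
    """All overlapping words of the given size in the query."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def word_score(query_word, candidate):
    """Sum of the pair scores over the aligned positions of two words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Candidates scoring at least `threshold` against the query word, best first."""
    scored = [(word_score(query_word, c), c) for c in candidates]
    return sorted([s for s in scored if s[0] >= threshold], reverse=True)


if __name__ == "__main__":
    print(query_words("FSGTWAMA")[:4])
    candidates = ["GTW", "GSW", "GNW", "GAW", "ATW", "DTW", "GTF", "GTM", "DAW"]
    for score, word in neighborhood("GTW", candidates, threshold=11):
        print(word, score)
```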
Output from Fasta
Fasta searches a protein or DNA sequence data bank
 version 3.3t04 January 25, 2000
Please cite:
 W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448

../seq/ramp4.seq: 75 aa
 >ramp4.seq
 vs /vol1/gcgdata/ncbi_nr/nr.dat library
searching /vol1/gcgdata/ncbi_nr/nr.dat library

173831120 residues in 553635 sequences
 statistics extrapolated from 60000 to 552908 sequences
 Expectation_n fit: rho(ln(x))= 4.8232+/-0.0004; mu= 0.7959+/- 0.022;
 mean_var=53.2306+/- 9.966, 0's: 686 Z-trim: 26 B-trim: 2227 in 1/63
 Kolmogorov-Smirnov statistic: 0.0519 (N=29) at 46

FASTA (3.34 January 2000) function [optimized, BL50 matrix (15:-5)] ktup: 2
 join: 36, opt: 24, gap-pen: -12/ -2, width: 16
 Scan time: 102.010
The best scores are:                                        opt bits E(552908)
gi|4585827|emb|CAB40910.1| (AJ238236) ribosome as  (  75)  483  130  1.9e-30
gi|7657552|ref|NP_055260.1| stress-associated end  (  66)  426  116  3.7e-26
gi|7504801|pir||T23009 hypothetical protein F59F4  (  65)  251   71  8.5e-13
gi|9802529|gb|AAF99731.1|AC004557_10 (AC004557) F  (  77)  145   45  0.00012
gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|2  ( 136)  105   35  0.22
gi|1800061|dbj|BAA16538.1| (D90891) similar to [S  ( 217)  105   35  0.33
gi|6319639|ref|NP_009721.1| involved in the secre  (  65)   92   31  1.2
gi|2498674|sp|Q56109|NRDI_SALTY NRDI PROTEIN gi|1  ( 136)   93   32  1.8
How do we know from an alignment if two sequences are evolutionarily related?
This seems convincing:
GWFTREKLREEDHIKK
GWFTKEKIREEDHIKK
But what about this:
VAKTSRNAPEEKASVG
IASGNRNFGEAYGRAG ?
We need some input from statistics / probability theory
For instance, alignment methods like BLAST will ask: What is the probability that this match occurs by chance only?
The Expect value (E)
Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is.

>>gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associa  (75 aa)
 initn: 483 init1: 483 opt: 483 Z-score: 682.4 bits: 130.3 E(): 1.9e-30
Smith-Waterman score: 483; 100.000% identity in 75 aa overlap (1-75:1-75)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gi|458 MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
               10        20        30        40        50        60

               70
ramp4. GSAIFQIIQSIRMGM
       :::::::::::::::
gi|458 GSAIFQIIQSIRMGM
               70

>>gi|7504801|pir||T23009 hypothetical protein F59F4.2 -  (65 aa)
 initn: 227 init1: 143 opt: 251 Z-score: 365.3 bits: 71.4 E(): 8.5e-13
Smith-Waterman score: 251; 53.846% identity in 65 aa overlap (10-74:1-64)

               10        20        30        40        50        60
ramp4. MVGAGGAAKMVAKQRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVC
                :. :::. .::.. :::...::::::. . : :.:  ..:::..::.::::
gi|750          MAPKQRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVC
                        10        20        30         40        50

               70
ramp4. GSAIFQIIQSIRMGM
       :::.:.::. ..::
gi|750 GSAVFEIIRYVKMGW
               60

>>gi|2498673|sp|Q47415|NRDI_ECOLI NRDI PROTEIN gi|212121  (136 aa)
 initn: 66 init1: 41 opt: 105 Z-score: 160.3 bits: 34.6 E(): 0.22
Smith-Waterman score: 105; 30.488% identity in 82 aa overlap (3-75:50-125)

                                           10        20        30
ramp4.                             MVGAGGAAKMVAKQRIRMANEKHSKNITQRGN
                                     :.::.:  : .: ::. :..:.. .  ::
gi|249 RLGLPAVRIPLNERERIQVDEPYILIVPSYGGGGTAGAVPRQVIRFLNDEHNRALL-RGV
      20        30        40        50        60        70

             40                 50        60        70
ramp4. VAKTSRNAPEEKASVG---------PWLLALFIFVVCGSAIFQIIQSIRMGM
       .:. .::  :  . .:         :::   . : . :.   . :...: :.
gi|249 IASGNRNFGEAYGRAGDVIARKCGVPWL---YRFELMGTQ--SDIENVRKGVTEFWQRQP
       80        90       100          110         120       130
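The relation between a bit score, the search space and the Expect value reported in listings like the one above can be sketched with the standard Karlin-Altschul approximation E = m * n * 2^(-S'), where m is the query length, n the database length and S' the bit score. The sketch below uses raw lengths; real programs apply length corrections and, in the case of FASTA, estimate the statistics from the search itself, so reported values will differ somewhat. The function name `expect_value` is made up for this example.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Each additional bit of score halves the expected number of chance hits,
    # and doubling the database size doubles it.
    for bits in (20, 30, 40):
        print(bits, expect_value(bits, 75, 140_871_481))
```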
Output from Blast
BLASTP 2.0.11 [Jan-20-2000]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= ramp4.seq (75 letters)
Database: nr 457,798 sequences; 140,871,481 total letters
Searching..................................................done
                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

gi|4585827|emb|CAB40910.1| (AJ238236) ribosome associated membr...   126  2e-29
gi|3851666 (AF100470) ribosome attached membrane protein 4 [Rat...   126  2e-29
gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder;...    74  1e-13
gi|3935169 (AC004557) F17L21.12 [Arabidopsis thaliana]                46  3e-05
gi|3935171 (AC004557) F17L21.14 [Arabidopsis thaliana]                36  0.048
gi|5921764|sp|O13394|CHS5_USTMA CHITIN SYNTHASE 5 (CHITIN-UDP A...    29  3.6

>gi|3877972|emb|CAB03157.1| (Z81095) predicted using Genefinder; cDNA EST
 EMBL:D71338 comes from this gene; cDNA EST EMBL:D74010 comes from this gene;
 cDNA EST EMBL:D74852 comes from this gene; cDNA EST EMBL:C07354 comes from
 this gene; cDNA EST EMBL:C0...
          Length = 65

 Score = 74.1 bits (179), Expect = 1e-13
 Identities = 33/61 (54%), Positives = 48/61 (78%), Gaps = 1/61 (1%)

Query: 14 QRIRMANEKHSKNITQRGNVAKTSRNAPEEKASVGPWLLALFIFVVCGSAIFQIIQSIRM 73
          QR+ +AN++ SKN+  RGNVAK+ + A E+K    PWL+ LF+FVVCGSA+F+II+ ++M
Sbjct: 5  QRMTLANKQFSKNVNNRGNVAKSLKPA-EDKYPAAPWLIGLFVFVVCGSAVFEIIRYVKM 63

Query: 74 G 74
          G
Sbjct: 64 G 64
The different variants of BLAST

Program   Query     Database
blastp    Protein   Protein
blastn    DNA       DNA
tblastn   Protein   DNA
blastx    DNA       Protein
tblastx   DNA       DNA
Basic BLAST command line
blastall -i input_sequence -d database -p blast_version
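When many searches are run from a script, the command line above can be assembled programmatically. The sketch below only builds the argument list from the three options shown (`-i`, `-d`, `-p`); the function name `blastall_command` and the file and database names in the usage example are placeholders, and the program is not actually executed.

```python
import shlex

# Program names taken from the table of BLAST variants.
BLAST_PROGRAMS = ("blastp", "blastn", "tblastn", "blastx", "tblastx")


def blastall_command(input_sequence, database, program):
    """Return the argument list for the basic blastall call shown above."""
    if program not in BLAST_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program}")
    return ["blastall", "-i", input_sequence, "-d", database, "-p", program]


if __name__ == "__main__":
    # "ramp4.seq" and "nr" are placeholder names for the query file and database.
    print(shlex.join(blastall_command("ramp4.seq", "nr", "blastp")))
```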
In a BLAST search, low-complexity regions in the query sequence are filtered out by default.
Regions with low-complexity sequence have an unusual composition and this can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT. Filters are used to remove low-complexity sequence because it can cause artifactual hits. In BLAST searches performed without a filter, often certain hits will be reported with high scores only because of the presence of a low-complexity region. Most often, this type of match cannot be thought of as the result of homology shared by the sequences. Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences that are not truly related.
Another reason why hits to low-complexity regions in proteins should be filtered out is that such regions often have a disordered 3D structure and are not associated with well-defined biological functions.
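The idea of "unusual composition" can be made measurable, for instance with the Shannon entropy of the residues in a sliding window. The sketch below is not the algorithm used by the BLAST filters; the window length, the entropy cutoff and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits per residue; low values mean a biased composition."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(sequence, window=12, cutoff=2.2, mask_char="X"):
    """Replace every window whose entropy is below the cutoff with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < cutoff:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    print(round(shannon_entropy("PPCDPPPPPKDKKKKDDGPP"), 2))
    print(mask_low_complexity("MVGAGGAAKMVAKQRIRMANPPCDPPPPPKDKKKKDDGPP"))
```

Masked residues are written as X, which is also how filtered regions appear in the query line of a BLAST alignment.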
BLAST and filtering of low-complexity sequence
Query:295  DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
           DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD
Sbjct:87   DDIFGELSSGKNAPKTGGGAKGNNASPAGSGNTKNNGASGADINNYAGQIKSAIESKFYD

Query:355  ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDXXXXXXXXXXXXXXXXXXXXSQAVYEVFKN
           ASSYAGKTCTLRIKLAPDGMLLDIKPEGGD                    SQAVYEVFKN
Sbjct:147  ASSYAGKTCTLRIKLAPDGMLLDIKPEGGDPALCQAALAAAKLAKIPKPPSQAVYEVFKN

Query:415  APLDFKP 421
           APLDFKP
Sbjct:207  APLDFKP 213
Introduction to practicals - biological sequences
Translation of a nucleotide sequence using ‘sixpack’

        M  A  K  R  K  L  K  K  N  L  K  T  F  V  A  F  S  A  I  T    F1
         W  Q  R  E  S  *  K  R  T  *  K  L  L  L  H  L  V  L  L  L   F2
          G  K  E  K  V  K  K  E  L  K  N  F  C  C  I  *  C  Y  Y  C  F3
      1 ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT 60
        ----:----|----:----|----:----|----:----|----:----|----:----|
      1 TACCGTTTCTCTTTCAATTTTTTCTTGAATTTTTGAAAACAACGTAAATCACGATAATGA 60
         X  A  F  L  F  N  F  F  F  K  F  V  K  T  A  N  L  A  I  V   F6
        X  P  L  S  F  T  L  F  S  S  L  F  K  Q  Q  M  *  H  *  *    F5
          H  C  L  S  L  *  F  L  V  *  F  S  K  N  C  K  T  S  N  S  F4

        A  L  L  L  T  N  G  I  P  I  S  A  L  T  Q  S  S  N  T  T    F1
         L  Y  C  *  L  M  V  F  Q  L  V  L  *  L  S  L  P  I  Q  L   F2
          F  I  V  N  *  W  Y  S  N  *  C  F  N  S  V  F  Q  Y  N  *  F3
     61 GCTTTATTGTTAACTAATGGTATTCCAATTAGTGCTTTAACTCAGTCTTCCAATACAACT 120
        ----:----|----:----|----:----|----:----|----:----|----:----|
     61 CGAAATAACAATTGATTACCATAAGGTTAATCACGAAATTGAGTCAGAAGGTTATGTTGA 120
         A  K  N  N  V  L  P  I  G  I  L  A  K  V  *  D  E  L  V  V   F6
        Q  K  I  T  L  *  H  Y  E  L  *  H  K  L  E  T  K  W  Y  L    F5
          S  *  Q  *  S  I  T  N  W  N  T  S  *  S  L  R  G  I  C  S  F4

        E  I  T  S  Q  A  T  T  G  L  R  N  V  M  Y  Y  G  D  W  S    F1
         R  L  L  H  K  L  L  Q  G  Y  V  M  *  C  I  M  V  T  G  L   F2
          D  Y  F  T  S  Y  Y  R  V  T  *  C  N  V  L  W  *  L  V  Y  F3
    121 GAGATTACTTCACAAGCTACTACAGGGTTACGTAATGTAATGTATTATGGTGACTGGTCT 180
        ----:----|----:----|----:----|----:----|----:----|----:----|
    121 CTCTAATGAAGTGTTCGATGATGTCCCAATGCATTACATTACATAATACCACTGACCAGA 180
         S  I  V  E  C  A  V  V  P  N  R  L  T  I  Y  *  P  S  Q  D   F6
        Q  S  *  K  V  L  *  *  L  T  V  Y  H  L  T  N  H  H  S  T    F5
          L  N  S  *  L  S  S  C  P  *  T  I  Y  H  I  I  T  V  P  R  F4
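What a six-frame translation computes can be sketched in a few lines. This is an illustration using the standard genetic code, not a reimplementation of sixpack: the frame labels (+1 to +3, -1 to -3) and the function names are choices made for this example, and the reverse frames are simply read from the start of the reverse complement, so their labels need not correspond to sixpack's F4 to F6.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string (upper-case A, C, G, T)."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, table=CODON_TABLE):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(table.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translations of the three forward and three reverse reading frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    seq = "ATGGCAAAGAGAAAGTTAAAAAAGAACTTAAAAACTTTTGTTGCATTTAGTGCTATTACT"
    for name, protein in six_frames(seq).items():
        print(name, protein)
```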
Plotorf to show open reading frames
Ribosomal protein S16 1771-2019
Deviations from the standard genetic code
# Yeast mitochondria
UGA = Trp:W
CUU = Thr:T
CUC = Thr:T
CUA = Thr:T
CUG = Thr:T
AUA = Met:M

# Mammalian mitochondria
UGA = Trp:W
AUU = Ile:I
AUC = Ile:I
AUA = Met:M
AGA = * :*
AGG = * :*

# Drosophila mitochondria
UGA = Trp:W
AUU = Ile:I
AUA = Met:M
AGA = Ser:S
AGG = Ser:S

# Mycoplasma
UGA = Trp:W

# Ciliate protozoa
UAA = Gln:Q
UAG = Gln:Q
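A variant code can be handled as the standard table plus a small set of overrides. The sketch below applies the mammalian mitochondrial changes listed above to a standard RNA codon table; the names `STANDARD`, `variant_table` and `translate` and the short test sequence are made up for this example.

```python
from itertools import product

# Standard genetic code on the RNA alphabet, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Codons whose meaning differs from the standard code (from the list above).
MAMMALIAN_MITO_OVERRIDES = {"UGA": "W", "AUA": "M", "AGA": "*", "AGG": "*"}


def variant_table(overrides):
    """Copy of the standard table with the given codons reassigned."""
    table = dict(STANDARD)
    table.update(overrides)
    return table


def translate(rna, table):
    """Translate complete codons from the start of an RNA string."""
    return "".join(table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


if __name__ == "__main__":
    rna = "AUGUGAAUAAGA"  # hypothetical sequence: AUG UGA AUA AGA
    print(translate(rna, STANDARD))
    print(translate(rna, variant_table(MAMMALIAN_MITO_OVERRIDES)))
```

The same sequence reads as M, stop, I, R with the standard code but as M, W, M, stop with the mitochondrial table, which is why the correct table must be chosen before looking for open reading frames.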
EMBOSS
sixpack
plotorf
water - Smith-Waterman alignment
needle - Needleman-Wunsch alignment
dottup - dotplot analysis
Alignment of mRNA sequence to genomic DNA sequence with needle
effect of gap parameters
Dot plot analysis (dottup) reveals repeats
Alignment of protein sequence to DNA sequence (genewise)

hprt_mouse         1 mptrspsvvisddepgydldlfcipnhyaedlekvfiphglimdrterl
                     +++++++++++++++++++++++++++++++++++++++++++++++++
                     MPTRSPSVVISDDEPGYDLDLFCIPNHYAEDLEKVFIPHGLIMDRxERL
gi|26145909|dbj   36 acacacaggaagggcgtgcgtttacactgggtgagtaccgcaagaagac
                     tccggcgtttgaaacgaatattgtcaaacaataatttcagtttagNagt
                     ggcctcccgtcttaattcatgttattttcgtgaagttttagtgcgtaat

hprt_mouse        50 ardvmkemgghhivalcvlkggykffadlldyikalnrnsdrsipmtvd
                     ++++++++++++++ +++++++++++++ +++++++++++++++++++
                     ARDVMKEMGGHHIV!LCVLKGGYKFFAD!LDYIKALNRNSDRSIPMTVH
gi|26145909|dbj  183 gcggaagaggccag4ctgcaggtattgg4cgtaagcaaaagatacaagc
                     cgattaatggaatt tgttaggaattca taatactagagagctctcta
                     tatcggggactctg ctgcggctgcttc gtctaagtatttacttgtat

hprt_mouse        99 firlksycndqstgdikviggddlstltgk
                      +++++++++++++++++++++++++++++
                     SIRLKSYCNDQSTGDIKVIGGDDLSTLTGK
gi|26145909|dbj  332 taacaattagctaggaagaggggctataga
                     ctgtagagaaaccgatattggaatcctcga
                     tcaggcctttgaggcaatttattcatatag
Substitution matrices
Each amino acid change has a characteristic probability
Aligning two sequences using the BLAST algorithm:
bl2seq -i sequence_1 -j sequence_2 -p blastn
BLAST and word size
blastall -i ........ -W7 (default is W11)
GTCAAGTGGCAACTCCGTCAG
********** **********
GTCAAGTGGCTACTCCGTCAG
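The two sequences above share no exact match longer than 10 nucleotides, so a search that requires an exact word of 11 to start an alignment finds nothing, while a word size of 7 does. The sketch below checks this directly by listing shared exact words; it illustrates only the seeding step, and the function name `shared_words` is made up for this example.

```python
def shared_words(a, b, word_size):
    """Return (position in a, word) for every exact word of a that also occurs in b."""
    words_in_b = {b[i:i + word_size] for i in range(len(b) - word_size + 1)}
    return [(i, a[i:i + word_size])
            for i in range(len(a) - word_size + 1)
            if a[i:i + word_size] in words_in_b]


if __name__ == "__main__":
    a = "GTCAAGTGGCAACTCCGTCAG"
    b = "GTCAAGTGGCTACTCCGTCAG"
    print(len(shared_words(a, b, 11)))  # no seed with word size 11
    print(len(shared_words(a, b, 7)))   # several seeds with word size 7
```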
‘seg’ - NCBI utility to identify low-complexity regions
‘fastacmd’ retrieves sequences from BLAST-formatted databases:
fastacmd -s accession_number -d database
Query= gi|28872819|ref|NP_057849.4| Gag-Pol [Human immunodeficiency virus 1]
         (1435 letters)

Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples
           2,506,223 sequences; 849,940,114 total letters

Searching..................................................done

                                                                   Score     E
Sequences producing significant alignments:                        (bits)  Value

ref|NP_057849.4| Gag-Pol [Human immunodeficiency virus 1]            2849  0.0
gb|AAG28737.1| gag-pol fusion protein [synthetic construct]          2770  0.0
gb|AAD03191.1| gag-pol fusion polyprotein [Human immunodeficienc...  2768  0.0
dbj|BAB85751.1| Gag-pol fusion polyprotein [Human immunodeficien...  2759  0.0
gb|AAD03200.1| gag-pol fusion polyprotein [Human immunodeficienc...  2745  0.0
gb|AAG30116.1| gag-pol fusion polyprotein [Human immunodeficienc...  2741  0.0
gb|AAD03217.1| gag-pol fusion polyprotein [Human immunodeficienc...  2727  0.0
dbj|BAC77511.1| Gag-Pol fusion protein [Human immunodeficiency v...  2710  0.0
gb|AAD03326.1| gag-pol fusion polyprotein [Human immunodeficienc...  2704  0.0
dbj|BAC77477.1| Gag-Pol fusion polyprotein [Human immunodeficien...  2702  0.0
dbj|BAC77486.1| Gag-Pol fusion polyprotein [Human immunodeficien...  2693  0.0
gb|AAD03241.1| gag-pol fusion polyprotein [Human immunodeficienc...  2692  0.0
gb|AAD03225.1| gag-pol fusion polyprotein [Human immunodeficienc...  2684  0.0
gb|AAD03233.1| gag-pol fusion polyprotein [Human immunodeficienc...  2680  0.0
gb|AAD03209.1| gag-pol fusion polyprotein [Human immunodeficienc...  2679  0.0
gb|AAN73492.1| gag-pol fusion polyprotein [Human immunodeficienc...  2664  0.0
gb|AAN73736.1| gag-pol fusion polyprotein [Human immunodeficienc...  2657  0.0
emb|CAD59561.1| gag-pol fusion protein [Human immunodeficiency v...  2653  0.0
FASTA:
      30        40        50        60        70        80
AF1862 GAUAGUCCAGGACUAUUGGAUUUAAUUCCAAAUGCUCCUGAGAGCUCCAUAGAGCGGAA-
                                     :::::::::::::::: : : : ::::::
AF1862 GUGCGUCUUUCGGGGCGCGCGGGGCGAAAGAAUGCUCCUGAGAGCUUCCU-GGGCGGAAA
      20        30        40        50        60         70

            90       100       110       120       130       140
AF1862 -----GCUCUGGACGAAGCCAUCAGAAAAAUCGCUUACUUGUGAAGUGAUGGGCCACUCU
             : : :: ::  :::::::::::
AF1862 UAUUUCCGCCGGGCGUCGCCAUCAGAAAUUCAGCAGGCUAUGCUUGCAUGGGAGGCGGCG
       80        90       100       110       120       130
BLAST:
Query: 60  aatgctcctgagagct 75
           ||||||||||||||||
Sbjct: 50  aatgctcctgagagct 65

 Score = 22.3 bits (11), Expect = 1.0
 Identities = 11/11 (100%)
 Strand = Plus / Plus

Query: 101 gccatcagaaa 111
           |||||||||||
Sbjct: 96  gccatcagaaa 106