Pairwise alignment
Lesson 6
Based on a presentation by Irit Gat-Viks, which is based on a presentation by Amir Mitchel, Introduction to Bioinformatics course, Bioinformatics Unit, Tel Aviv University, and Benny Shomer, Bar-Ilan University.
Definition
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical characters in the sequences.
VLSPADKTNVKAAWAKVGAHAAGHG
||| | | |||| | ||||
VLSEAEWQLVLHVWAKVEADVAGHG
Sequence comparisons
Goal: similarity search on sequence database
Multiple pairwise comparisons
We wish to optimize for speed, not accuracy
BLAST, FASTA programs
Next goal: refine the database search. Are the reported matches really interesting?
Goal: Comparing two specific sequences
Single pairwise comparisons
We wish to optimize for accuracy, not speed
Dynamic programming methods (Smith-Waterman, Needleman-Wunsch)
Identify homologs, common domains, common active sites, etc.
How similar are two sequences?
• The common measure of sequence similarity is their alignment score
• Simpler measures, e.g., % identity, are also common
• These require algorithms that compute the optimal alignment between the sequences
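As a small illustration of the simpler measure, the sketch below computes % identity for two already-aligned rows, using the example alignment from the Definition slide. The function name `percent_identity` is hypothetical, and dividing by the alignment length is one convention among several (others divide by the shorter sequence or by the number of ungapped columns).

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of alignment columns in which both rows hold the same residue.

    Assumes the two rows are already aligned (equal length, '-' for gaps).
    The denominator is the alignment length, which is an assumption here.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        raise ValueError("alignment is empty")
    # A column counts as an identity only if it is not a gap column.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * matches / len(row_a)


top = "VLSPADKTNVKAAWAKVGAHAAGHG"
bottom = "VLSEAEWQLVLHVWAKVEADVAGHG"
print(percent_identity(top, bottom))  # 14 identical columns out of 25
```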
Comparison methods
• Global alignment – Finds the best alignment across the entire length of both sequences.
• Local alignment – Finds regions of similarity in parts of the sequences.
(Figure: schematic contrasting global alignment, which spans both sequences, with local alignment, which matches only sub-regions.)
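The difference between the two methods can be made concrete with a score-only dynamic program. The sketch below is a minimal illustration, not the implementation of any particular tool: the function name `best_score`, the scoring values (match +1, mismatch -1, gap -2), and the test sequences are all assumptions. The local variant differs from the global one only in starting every row and column at 0 and never letting a cell drop below 0.

```python
def best_score(s: str, t: str, local: bool,
               match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score of the best global (local=False) or local (local=True) alignment.

    Uses a linear gap penalty and keeps only two rows of the table.
    All scoring values are illustrative assumptions.
    """
    # Row 0: all zeros for local, cumulative gap penalties for global.
    prev = [0 if local else j * gap for j in range(len(t) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            v = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            if local:
                v = max(v, 0)  # a local alignment may restart anywhere
            cur.append(v)
            best = max(best, v)
        prev = cur
    # Global: bottom-right cell. Local: best cell anywhere in the table.
    return best if local else prev[-1]


s, t = "TTTTGATTACA", "GATTACACCCC"
print(best_score(s, t, local=True))   # rewards the shared GATTACA region
print(best_score(s, t, local=False))  # penalized by the unrelated flanks
```

The two sequences share only one region, so the local score is high while the global score is dragged down by the flanking characters that must still be aligned or gapped.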
Pairwise Alignment - Scoring
• The final score of the alignment is the sum of the positive scores and the penalty scores:
+ Number of identities
+ Number of similarities
- Number of gap insertions
- Number of gap extensions
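The four terms above can be counted directly from a finished alignment. The following is a sketch under stated assumptions: the weights, the function name `alignment_score`, and the small `SIMILAR` set of "similar" residue pairs are hypothetical placeholders (in practice a substitution matrix such as PAM or BLOSUM supplies these values).

```python
# Hypothetical set of residue pairs treated as "similar"; illustration only.
SIMILAR = {frozenset("IL"), frozenset("DE"), frozenset("KR"), frozenset("ST")}


def alignment_score(row_a: str, row_b: str, w_id: float = 1.0,
                    w_sim: float = 0.5, gap_open: float = 2.0,
                    gap_ext: float = 0.5) -> float:
    """Score = identities + similarities - gap insertions - gap extensions.

    Rows must be aligned (equal length, '-' for gaps). Weights are assumed.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    ids = sims = opens = exts = 0
    prev_gap_row = None  # row that held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            if prev_gap_row == gap_row:
                exts += 1   # continuing an existing gap
            else:
                opens += 1  # a new gap insertion
            prev_gap_row = gap_row
        else:
            prev_gap_row = None
            if x == y:
                ids += 1
            elif frozenset((x, y)) in SIMILAR:
                sims += 1
    return w_id * ids + w_sim * sims - gap_open * opens - gap_ext * exts


# 3 identities, 1 similarity (D/E), 1 gap insertion, 1 gap extension
print(alignment_score("ACD--E", "ACEGGE"))
```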
Alignment score
Intuition of Dynamic Programming
If we already have the optimal solution to:
XY
AB
then we know the next pair of characters will be one of:
XYZ      XY-      XYZ
ABC  or  ABC  or  AB-
(where “-” indicates a gap).
So we can extend the match by determining which of these has the highest score.
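V(i,j) := optimal score of the alignment of S’=s1…si and T’=t1…tj (0 ≤ i ≤ n, 0 ≤ j ≤ m), where σ(x,y) is the score of aligning character x with character y.
V(i,j) has the following properties:
• Base conditions (aligning against 0 elements means spacing only):
– $V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -)$
– $V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$
• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$
• The three cases correspond to:
– S’=s1...si-1 with T’=t1...tj-1, and si with tj.
– S’=s1...si-1 with T’=t1...tj, and si with ‘-’.
– S’=s1...si with T’=t1...tj-1, and ‘-’ with tj.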
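The base conditions and the recurrence for V(i,j) translate almost line by line into a table fill. The sketch below is illustrative: the function names `sigma` and `fill_table` and the scoring values (match +1, mismatch -1, space -2) are assumptions, not values given in the slides.

```python
def sigma(a: str, b: str) -> int:
    """Assumed scoring function: match +1, mismatch -1, space -2."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_table(s: str, t: str, score=sigma):
    """Return the (n+1) x (m+1) table with V[i][j] as defined above."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing is all spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    # Recurrence: best of the three ways to end the alignment.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # s_i with t_j
                V[i - 1][j] + score(s[i - 1], "-"),           # s_i with '-'
                V[i][j - 1] + score("-", t[j - 1]),           # '-' with t_j
            )
    return V


V = fill_table("AC", "AGC")
for row in V:
    print(row)
# V[n][m] holds the optimal global score
print(V[-1][-1])
```

Python strings are 0-indexed, so `s[i - 1]` plays the role of s_i in the recurrence.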
Optimal Alignment - Tabular Computation
• Add back pointer(s) from cell (i,j) to the parent cell(s) realizing V(i,j).
• Trace back the pointers from (n,m) to (0,0)
• Needleman-Wunsch, ‘70
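The pointer bookkeeping and the trace-back can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name, the scoring values (match +1, mismatch -1, gap -2), and the choice to keep a single pointer per cell (preferring the diagonal on ties) are assumptions, so only one of possibly several optimal alignments is returned.

```python
def needleman_wunsch(s: str, t: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment with back pointers; returns (score, row_s, row_t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0], ptr[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        V[0][j], ptr[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            options = [
                (V[i - 1][j - 1] + sub, "diag"),
                (V[i - 1][j] + gap, "up"),
                (V[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first maximal option, so ties favor "diag".
            V[i][j], ptr[i][j] = max(options, key=lambda o: o[0])
    # Trace back from (n, m) to (0, 0), building the rows in reverse.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, a, b = needleman_wunsch("AC", "AGC")
print(score)
print(a)
print(b)
```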
Backtracking the alignment
PAM vs. BLOSUM
• Choosing n
– Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from blocks in which sequences sharing at least 62% identity are clustered together). Larger n → smaller evolutionary distance.
– A single PAM matrix (PAM1) was constructed from a dataset of sequences with at least 85% identity. The other PAM matrices were computationally derived from it. Larger n → larger evolutionary distance.
• BLOSUM uses more sequences
Observed % difference   Evolutionary distance (PAM)   BLOSUM
 1                         1                          99
10                        11                          90
20                        23                          80
30                        38                          70
40                        56                          60
50                        80                          50
60                       120                          40
70                       159                          30
80                       250                          20
Marked on the slide: 62 (BLOSUM); 120 and 250 (PAM).
DNA scoring matrices
• Non-uniform substitution scores across nucleotides:
From \ To    A     G     C     T
A            2
G           -4     2
C           -6    -6     2
T           -6    -6    -4     2
Match: 2. Mismatch, transition: -4. Mismatch, transversion: -6.
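The matrix above follows a simple rule: a transition stays within the purines (A, G) or within the pyrimidines (C, T), while a transversion crosses between the two classes. A minimal sketch of that rule as a scoring function, with the hypothetical name `dna_score`:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x: str, y: str) -> int:
    """Score from the matrix above: match 2, transition -4, transversion -6."""
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected one of A, G, C, T")
    if x == y:
        return 2
    pair = {x, y}
    # Transition: both bases belong to the same chemical class.
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    return -4 if is_transition else -6


for x, y in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(x, y, dna_score(x, y))
```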
Topics to be Covered
• Introduction
• Comparison methods – Global, local alignment
• Alignment parameters
• Alignment scoring matrices – proteins
• Alignment scoring matrices – DNA
• Evaluation
• Comparison programs
• Choosing between Global / local alignment
Example: Global or local?
• Two human transcription factors:
1. SP1 factor, binds to GC rich areas.
2. EGR-1 factor, active at differentiation stage
(FASTA formats from http://us.expasy.org/sprot/)
>sp|P08047|SP1_HUMAN Transcription factor Sp1 - Homo sapiens (Human).
MSDQDHSMDEMTAVVKIEKGVGGNNGGNGNGGGAFSQARSSSTGSSSSTGGGGQESQPSP
LALLAATCSRIESPNENSNNSQGPSQSGGTGELDLTATQLSQGANGWQIISSSSGATPTS KEQSGSSTNGSNGSESSKNRTVSGGQYVVAAAPNLQNQQVLTGLPGVMPNIQYQVIPQFQ TVDGQQLQFAATGAQVQQDGSGQIQIIPGANQQIITNRGSGGNIIAAMPNLLQQAVPLQG LANNVLSGQTQYVTNVPVALNGNITLLPVNSVSAATLTPSSQAVTISSSGSQESGSQPVT SGTTISSASLVSSQASSSSFFTNANSYSTTTTTSNMGIMNFTTSGSSGTNSQGQTPQRVS GLQGSDALNIQQNQTSGGSLQAGQQKEGEQNQQTQQQQILIQPQLVQGGQALQALQAAPL SGQTFTTQAISQETLQNLQLQAVPNSGPIIIRTPTVGPNGQVSWQTLQLQNLQVQNPQAQ TITLAPMQGVSLGQTSSSNTTLTPIASAASIPAGTVTVNAAQLSSMPGLQTINLSALGTS GIQVHPIQGLPLAIANAPGDHGAQLGLHGAGGDGIHDDTAGGEEGENSPDAQPQAGRRTR REACTCPYCKDSEGRGSGDPGKKKQHICHIQGCGKVYGKTSHLRAHLRWHTGERPFMCTW SYCGKRFTRSDELQRHKRTHTGEKKFACPECPKRFMRSDHLSKHIKTHQNKKGGPGVALS VGTLPLDSGAGSEGSGTATPSALITTNMVAMEAICPEGIARLANSGINVMQVADLQSINI SGNGF
>sp|P18146|EGR1_HUMAN Early growth response protein 1 (EGR-1) (Krox-24 protein) (ZIF268) (Nerve growth factor-induced protein A) (NGFI-A) (Transcription factor ETR103) (Zinc finger protein 225) (AT225) - Homo sapiens (Human).
MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGAPEGSGS NSSSSSSGGGGGGGGGSNSSSSSSTFNPQADTGEQPYEHLTAESFPDISLNNEKVLVETS YPSQTTRLPPITYTGRFSLEPAPNSGNTLWPEPLFSLVSGLVSMTNPPASSSSAPSPAAS SASASQSPPLSCAVPSNDSSPIYSAAPTFPTPNTDIFPEPQSQAFPGSAGTALQYPPPAY PAAKGGFQVPMIPDYLFPQQQGDLGLGTPDQKPFQGLESRTQQPSLTPLSTIKAFATQSG SQDLKALNTSYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRSDELTRHIR IHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLR QKDKKADKSVVASSATSSLSSYPSPVATSYPSPVTTSYPSPATTSYPSPVPTSFSSPGSS TYPSPVHSGFPSPSVATTYSSVPPAFPAQVSSFPSSAVTNSFSASTGLSDMTATFSPRTI EIC
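Before feeding records like these to an alignment routine, the headers and sequences have to be separated. The sketch below is a minimal FASTA reader, assuming each ">" header sits on its own line as in standard FASTA; the function name `parse_fasta` is hypothetical, and the demo text uses only the first residues of each protein (truncated here for brevity).

```python
def parse_fasta(text: str) -> dict:
    """Map each header (without '>') to its sequence.

    Whitespace inside sequence lines is removed, since database exports
    often group residues in blocks of ten.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append("".join(line.split()))
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Truncated prefixes of the two sequences, for illustration only.
demo = """>sp|P08047|SP1_HUMAN
MSDQDHSMDE MTAVVKIEKG
>sp|P18146|EGR1_HUMAN
MAAAKAEMQL
MSPLQISDPF
"""
records = parse_fasta(demo)
for name, seq in records.items():
    print(name, len(seq))
```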
SP1 at swissprot
EGR1 at swissprot
Available software
• http://en.wikipedia.org/wiki/Sequence_alignment_software
• http://fasta.bioch.virginia.edu/fasta_www/home.html
– LAlign (local alignment), PLalign (dot plot)
– PRSS / PRFX (significance by Monte Carlo)
• http://bioportal.weizmann.ac.il/toolbox/overview.html (many useful programs), Needle, Water.
• Bl2seq (NCBI)
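The idea behind assessing significance by Monte Carlo can be sketched as a shuffle test: score the real pair, then score the first sequence against many shuffled copies of the second and see how often chance does as well. This is a simplified stand-in and not the actual procedure used by PRSS or PRFX; the function names, scoring values (match +1, mismatch -1, gap -2), trial count, and the empirical p-value formula are all assumptions.

```python
import random


def local_score(s: str, t: str, match: int = 1,
                mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty (assumed values)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for a in s:
        cur = [0]
        for j, b in enumerate(t, start=1):
            sub = match if a == b else mismatch
            v = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(v)
            best = max(best, v)
        prev = cur
    return best


def shuffle_p_value(s: str, t: str, trials: int = 200, seed: int = 0):
    """Return (observed score, empirical p-value) from shuffling t.

    Shuffling keeps the composition of t but destroys its order, so the
    shuffled scores approximate what chance alone would produce.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = local_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if local_score(s, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


observed, p = shuffle_p_value("HEAGAWGHEE", "HEAGAWGHEE", trials=50)
print(observed, p)
```

A small p-value suggests the observed score is unlikely to arise from composition alone; with few trials the estimate is coarse.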
Using LAlign
• http://www.ch.embnet.org/software/LALIGN_form.html
• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_006758.2
• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_066300.1
Bl2Seq at NCBI
http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi
Bl2seq results
Conclusions
• The proteins share only a limited region of sequence similarity. Therefore, the use of local alignment is recommended.
• We found a local alignment that pointed to a possible structural similarity, which points to a possible functional similarity.
• Reasons to perform a global alignment:
– Checking minor differences between close homologs.
– Analyzing polymorphism.
– A good reason