Pairwise alignment

Lesson 6

Based on a presentation by Irit Gat-Viks, which is based on a presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University, and Benny Shomer, Bar-Ilan University.

Definition

Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical characters in the sequences.

VLSPADKTNVKAAWAKVGAHAAGHG
||| |    |   |||| |  ||||
VLSEAEWQLVLHVWAKVEADVAGHG

Sequence comparisons

Goal: similarity search on sequence database

Multiple pairwise comparisons

We wish to optimize for speed, not accuracy

BLAST, FASTA programs

Next goal: refine the database search. Are the reported matches really interesting?

Goal: Comparing two specific sequences

Single pairwise comparisons

We wish to optimize for accuracy, not speed

Dynamic programming methods (Smith-Waterman, Needleman-Wunsch)

Identify homologs, common domains, common active sites, etc.

How similar are two sequences?

• The common measure of sequence similarity is their alignment score

• Simpler measures, e.g., % identity are also common

• Both require algorithms that compute the optimal alignment between the sequences
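As a small illustration of the simpler measure, the sketch below computes % identity for the example alignment shown in the definition. The helper names `match_line` and `percent_identity` are made up for this example, and dividing by the number of alignment columns is only one possible convention (some tools divide by the length of the shorter sequence instead).

```python
def match_line(a: str, b: str) -> str:
    """Return '|' under each identical (non-gap) column and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Identical columns as a percentage of all alignment columns (assumed convention)."""
    return 100.0 * match_line(a, b).count("|") / len(a)


top = "VLSPADKTNVKAAWAKVGAHAAGHG"
bottom = "VLSEAEWQLVLHVWAKVEADVAGHG"

print(top)
print(match_line(top, bottom))
print(bottom)
print(f"{percent_identity(top, bottom):.1f}% identity")  # 56.0% identity (14 of 25 columns)
```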

Comparison methods

• Global alignment – Finds the best alignment across the whole two sequences.

• Local alignment – Finds regions of similarity in parts of the sequences.
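The difference between the two modes shows up directly in the scoring recurrence. As a minimal sketch, assuming a simple match/mismatch score and a linear gap penalty (illustrative values, not taken from the slides), a Smith-Waterman-style local score lets every cell fall back to 0, so the best-scoring region can start and end anywhere in either sequence:

```python
def local_alignment_score(s: str, t: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman style, linear gap penalty)."""
    prev = [0] * (len(t) + 1)          # row i-1 of the table
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)      # row i; column 0 stays 0
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The 0 option is what makes the alignment local:
            # a poor prefix is simply discarded.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared region ACGT is found even though the flanks do not match.
print(local_alignment_score("XXXACGTYYY", "ZZACGTWW"))  # 4
```

A global score would drop the 0 option, initialise the borders with gap penalties, and read the bottom-right cell instead of the maximum over the table.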

[Figure: schematic of global vs. local alignment]

Pairwise Alignment - Scoring

• The final score of the alignment is the sum of the positive scores and the penalty scores:

+ Number of identities

+ Number of similarities

- Number of gap insertions

- Number of gap extensions

= Alignment score
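The sum above can be sketched as a function that scores an already-gapped alignment. Everything numeric here is an assumption for illustration: the weights (`identity=2`, `similarity=1`, `gap_open=-5`, `gap_extend=-1`), the `similar` set of residue pairs, and the convention that the first position of a gap pays the insertion penalty while each further position pays the extension penalty. Mismatches that are not "similar" contribute nothing, mirroring the list above.

```python
def alignment_score(a: str, b: str, similar=frozenset(), identity: int = 2,
                    similarity: int = 1, gap_open: int = -5,
                    gap_extend: int = -1) -> int:
    """Score a gapped pairwise alignment given as two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None                      # which sequence currently carries a gap
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Same row as the previous column: extension; otherwise a new gap.
            score += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            gap_row = None
            if x == y:
                score += identity
            elif frozenset((x, y)) in similar:
                score += similarity
    return score


# Hypothetical similarity set: D and E treated as similar residues.
similar_pairs = {frozenset("DE")}
print(alignment_score("VLDPA--DK", "VLEPAGHDK", similar=similar_pairs))  # 7
```

In the example, six identities (12) plus one similarity (1) minus one gap insertion (5) and one gap extension (1) give 7.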

Intuition of Dynamic Programming

If we already have the optimal solution to:

XY
AB

then we know the next pair of characters will be one of:

XYZ      XY-      XYZ
ABC  or  ABC  or  AB-

(where “-” indicates a gap).

So we can extend the match by determining which of these has the highest score.

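V(i,j) := optimal score of the alignment of $S' = s_1 \ldots s_i$ and $T' = t_1 \ldots t_j$ ($0 \le i \le n$, $0 \le j \le m$), where $\sigma(x,y)$ is the score of aligning character $x$ with character $y$ and '-' denotes a space.

V(i,j) has the following properties:

• Base conditions (alignment with 0 elements, i.e., against spaces only):

– $V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -)$

– $V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$

• Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:

\[
V(i,j) = \max
\begin{cases}
V(i-1,j-1) + \sigma(s_i, t_j) \\
V(i-1,j) + \sigma(s_i, -) \\
V(i,j-1) + \sigma(-, t_j)
\end{cases}
\]

The three cases correspond to:

– $S' = s_1 \ldots s_{i-1}$ aligned with $T' = t_1 \ldots t_{j-1}$, and $s_i$ with $t_j$.

– $S' = s_1 \ldots s_{i-1}$ aligned with $T' = t_1 \ldots t_j$, and $s_i$ with '-'.

– $S' = s_1 \ldots s_i$ aligned with $T' = t_1 \ldots t_{j-1}$, and '-' with $t_j$.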

Optimal Alignment - Tabular Computation

• Add back pointer(s) from cell (i,j) to the parent cell(s) realizing V(i,j).

• Trace back the pointers from (n,m) to (0,0)

• Needleman-Wunsch, ‘70
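The tabular computation and the traceback can be sketched as follows. This is a minimal version under stated assumptions: a match/mismatch score and a linear gap penalty with illustrative values stand in for a real scoring matrix, and instead of storing back pointers the traceback re-checks which neighbouring cell realizes V(i,j), which returns one of the possibly several optimal alignments.

```python
def needleman_wunsch(s: str, t: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment: returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # base condition: s prefix vs. spaces
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):           # base condition: spaces vs. t prefix
        V[0][j] = V[0][j - 1] + gap

    def sub(i: int, j: int) -> int:
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sub(i, j),   # s_i with t_j
                          V[i - 1][j] + gap,             # s_i with '-'
                          V[i][j - 1] + gap)             # '-' with t_j

    # Trace back from (n, m) to (0, 0).
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub(i, j):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return V[n][m], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 2
print(top)     # ACGT
print(bottom)  # A-GT
```

Both the time and the memory of this sketch grow with n·m, which is why database searches rely on faster heuristic programs.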

Backtracking the alignment

PAM vs. BLOSUM

• Choosing n

– Different BLOSUM matrices are derived from blocks with different identity percentages (e.g., BLOSUM62 is derived from an alignment of sequences that share at least 62% identity). Larger n means smaller evolutionary distance.

– A single PAM matrix was constructed from a dataset with at least 85% identity. The different PAM matrices were computationally derived from it. Larger n means larger evolutionary distance.

• BLOSUM uses more sequences
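One way to read "computationally derived", following the Dayhoff construction as it is usually described: the mutation probability matrix for n PAM units is the n-th power of the 1-PAM matrix, and the scores are scaled log-odds values rounded to integers. The symbols $M_n$, $f_b$ and $s_n$ are notation introduced here for illustration and do not appear on the slides.

```latex
\[
M_n = (M_1)^n ,
\qquad
s_n(a,b) = 10 \log_{10} \frac{M_n(a,b)}{f_b}
\]
```

Here $M_1(a,b)$ is the probability that residue $a$ is replaced by residue $b$ over one PAM unit, and $f_b$ is the background frequency of $b$.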

Observed % difference   Evolutionary distance (PAM)   BLOSUM
 1                        1                           99
10                       11                           90
20                       23                           80
30                       38                           70
40                       56                           60
50                       80                           50
60                      120                           40
70                      159                           30
80                      250                           20

Highlighted on the slide: BLOSUM 62, PAM 120, PAM 250.

DNA scoring matrices

• Non-uniform substitutions in all nucleotides:

From \ To    A    G    C    T
A            2
G           -4    2
C           -6   -6    2
T           -6   -6   -4    2

Match: 2. Mismatch, transition: -4. Mismatch, transversion: -6.
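The matrix above can be expressed compactly by classifying each substitution as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion. The sketch below uses the values from the matrix; the function names `dna_score` and `ungapped_score` are hypothetical helpers for this example.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def dna_score(x: str, y: str, match: int = 2,
              transition: int = -4, transversion: int = -6) -> int:
    """Score one aligned pair of nucleotides with the matrix shown above."""
    x, y = x.upper(), y.upper()
    if x not in BASES or y not in BASES:
        raise ValueError("expected one of A, C, G, T")
    if x == y:
        return match
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


def ungapped_score(a: str, b: str) -> int:
    """Sum of per-column scores for two equal-length, gap-free sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(dna_score(x, y) for x, y in zip(a, b))


print(dna_score("A", "G"))                   # -4 (transition)
print(dna_score("A", "C"))                   # -6 (transversion)
print(ungapped_score("GATTACA", "GACTATA"))  # 2
```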

Topics to be Covered

• Introduction

• Comparison methods – Global, local alignment

• Alignment parameters

• Alignment scoring matrices – proteins

• Alignment scoring matrices – DNA

• Evaluation

• Comparison programs

• Choosing between global / local alignment

Example: Global or local?

• Two human transcription factors:

1. SP1 factor, binds to GC rich areas.

2. EGR-1 factor, active at differentiation stage

(FASTA formats from http://us.expasy.org/sprot/)

>sp|P08047|SP1_HUMAN Transcription factor Sp1 - Homo sapiens (Human).
MSDQDHSMDEMTAVVKIEKGVGGNNGGNGNGGGAFSQARSSSTGSSSSTGGGGQESQPSP
LALLAATCSRIESPNENSNNSQGPSQSGGTGELDLTATQLSQGANGWQIISSSSGATPTS
KEQSGSSTNGSNGSESSKNRTVSGGQYVVAAAPNLQNQQVLTGLPGVMPNIQYQVIPQFQ
TVDGQQLQFAATGAQVQQDGSGQIQIIPGANQQIITNRGSGGNIIAAMPNLLQQAVPLQG
LANNVLSGQTQYVTNVPVALNGNITLLPVNSVSAATLTPSSQAVTISSSGSQESGSQPVT
SGTTISSASLVSSQASSSSFFTNANSYSTTTTTSNMGIMNFTTSGSSGTNSQGQTPQRVS
GLQGSDALNIQQNQTSGGSLQAGQQKEGEQNQQTQQQQILIQPQLVQGGQALQALQAAPL
SGQTFTTQAISQETLQNLQLQAVPNSGPIIIRTPTVGPNGQVSWQTLQLQNLQVQNPQAQ
TITLAPMQGVSLGQTSSSNTTLTPIASAASIPAGTVTVNAAQLSSMPGLQTINLSALGTS
GIQVHPIQGLPLAIANAPGDHGAQLGLHGAGGDGIHDDTAGGEEGENSPDAQPQAGRRTR
REACTCPYCKDSEGRGSGDPGKKKQHICHIQGCGKVYGKTSHLRAHLRWHTGERPFMCTW
SYCGKRFTRSDELQRHKRTHTGEKKFACPECPKRFMRSDHLSKHIKTHQNKKGGPGVALS
VGTLPLDSGAGSEGSGTATPSALITTNMVAMEAICPEGIARLANSGINVMQVADLQSINI
SGNGF

>sp|P18146|EGR1_HUMAN Early growth response protein 1 (EGR-1) (Krox-24 protein) (ZIF268) (Nerve growth factor-induced protein A) (NGFI-A) (Transcription factor ETR103) (Zinc finger protein 225) (AT225) - Homo sapiens (Human).
MAAAKAEMQLMSPLQISDPFGSFPHSPTMDNYPKLEEMMLLSNGAPQFLGAAGAPEGSGS
NSSSSSSGGGGGGGGGSNSSSSSSTFNPQADTGEQPYEHLTAESFPDISLNNEKVLVETS
YPSQTTRLPPITYTGRFSLEPAPNSGNTLWPEPLFSLVSGLVSMTNPPASSSSAPSPAAS
SASASQSPPLSCAVPSNDSSPIYSAAPTFPTPNTDIFPEPQSQAFPGSAGTALQYPPPAY
PAAKGGFQVPMIPDYLFPQQQGDLGLGTPDQKPFQGLESRTQQPSLTPLSTIKAFATQSG
SQDLKALNTSYQSQLIKPSRMRKYPNRPSKTPPHERPYACPVESCDRRFSRSDELTRHIR
IHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGRKFARSDERKRHTKIHLR
QKDKKADKSVVASSATSSLSSYPSPVATSYPSPVTTSYPSPATTSYPSPVPTSFSSPGSS
TYPSPVHSGFPSPSVATTYSSVPPAFPAQVSSFPSSAVTNSFSASTGLSDMTATFSPRTI
EIC

SP1 at Swiss-Prot

EGR1 at Swiss-Prot

Available software

• http://en.wikipedia.org/wiki/Sequence_alignment_software

• http://fasta.bioch.virginia.edu/fasta_www/home.html

– LAlign (local alignment), PLalign (dot plot)

– PRSS / PRFX (significance by Monte Carlo)

• http://bioportal.weizmann.ac.il/toolbox/overview.html (many useful tools): Needle, Water.

• Bl2seq (NCBI)

Using LAlign

• http://www.ch.embnet.org/software/LALIGN_form.html

• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_006758.2

• http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_066300.1

Bl2seq at NCBI

http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Bl2seq results

Conclusions

• The proteins share only a limited area of sequence similarity. Therefore, the use of local alignment is recommended.

• We found a local alignment that points to a possible structural similarity, which in turn points to a possible functional similarity.

• Reasons to use global alignment:

– Checking minor differences between close homologs.

– Analyzing polymorphism.

– A good reason
