93
Sequence Alignment and Phylogenetic Analysis

Sequence Alignment and Phylogenetic Analysis

  • Upload
    lajos

  • View
    75

  • Download
    0

Embed Size (px)

DESCRIPTION

Sequence Alignment and Phylogenetic Analysis. Evolution. Sequence Alignment. AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC. - AG G CTATCAC CT GACC T C CA GG C CGA -- TGCCC --- T AG - CTATCAC -- GACC G C -- GG T CGA TT TGCCC GAC. Definition - PowerPoint PPT Presentation

Citation preview

Page 1: Sequence Alignment and Phylogenetic Analysis

Sequence Alignment and Phylogenetic Analysis

Page 2: Sequence Alignment and Phylogenetic Analysis

Evolution

Page 3: Sequence Alignment and Phylogenetic Analysis

Sequence Alignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

DefinitionGiven two strings x = x1x2...xM, y = y1y2…yN,

an alignment is an assignment of gaps to positions0,…, N in x, and 0,…, N in y, so as to line up each

letter in one sequence with either a letter, or a gapin the other sequence

AGGCTATCACCTGACCTCCAGGCCGATGCCCTAGCTATCACGACCGCGGTCGATTTGCCCGAC

Page 4: Sequence Alignment and Phylogenetic Analysis

What is a good alignment?AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT- 6 matches, 3 mismatches, 1 gapAGCGAAGTTT

AGGCTA-GTT- 7 matches, 1 mismatch, 3 gapsAG-CGAAGTTT

AGGC-TA-GTT- 7 matches, 0 mismatches, 5 gapsAG-CG-AAGTTT

Page 5: Sequence Alignment and Phylogenetic Analysis

Scoring Function• Sequence edits:

AGGCCTC

• Mutations AGGACTC

• Insertions AGGGCCTC

• Deletions AGG . CTC

Scoring Function:Match: +mMismatch: -sGap: -d

Score F = (# matches) m - (# mismatches) s – (#gaps) d

Page 6: Sequence Alignment and Phylogenetic Analysis

G -

A G T A

0 -1 -2 -3 -4

A -1 1 0 -1 -2

T -2 0 0 1 0

A -3 -1 -1 0 2

F(i,j) i = 0 1 2 3 4

Examplex = AGTA m = 1y = ATA s = -1

d = -1

j = 0

1

2

3

F(1, 1) = max{F(0,0) + s(A, A), F(0, 1) – d, F(1, 0) – d} =

max{0 + 1, -1 – 1, -1 – 1} = 1

AA

TT

AA

Page 7: Sequence Alignment and Phylogenetic Analysis

The Needleman-Wunsch Matrixx1 ……………………………… xMy

1 ……

……

……

……

……

……

yN

Every nondecreasing path

from (0,0) to (M, N)

corresponds to an alignment of the two sequences

An optimal alignment is composed of optimal subalignments

Page 8: Sequence Alignment and Phylogenetic Analysis

8

Example

H E A G A W G H E E

0 -8 -16

-24

-32

-40

-48

-56

-64

-72

-80

P -8 -2 -9 -17

-25

-33

-42

-49

-57

-65

-73

A -16

W -24

H -32

E -40

A -48

E -56

A E G H W

A 5 -1 0 -2 -3

E -1 6 -3 0 -3

H -2 0 -2 10 -3

P -1 -1 -2 -2 -4

W -3 -3 -3 -3 15

Page 9: Sequence Alignment and Phylogenetic Analysis

9

H E A G A W G H E E

0 -8 -16

-24

-32

-40

-48

-56

-64

-72

-80

P -8 -2 -9 -17

-25

-33

-42

-49

-57

-65

-73

A -16

-10

-3 -4 -12

-20

-28

-36

-44

-52

-60

W -24

-18

-11

-6 -7 -15

-5 -13

-21

-29

-37

H -32

-14

-18

-13

-8 -9 -13

-7 -3 -11

-19

E -40

-22

-8 -16

-16

-9 -12

-15

-7 3 -5

A -48

-30

-16

-3 -11

-11

-12

-12

-15

-5 2

E -56

-38

-24

-11

-6 -12

-14

-15

-12

-9 1

A E G H W

A 5 -1 0 -2 -3

E -1 6 -3 0 -3

H -2 0 -2 10 -3

P -1 -1 -2 -2 -4

W -3 -3 -3 -3 15

Page 10: Sequence Alignment and Phylogenetic Analysis

PAMX

• PAMx = PAM1x

• PAM250 = PAM1250

• PAM250 is a widely used scoring matrix:

Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ...Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ...Arg R 3 17 4 3 2 5 3 2 6 3 2 9Asn N 4 4 6 7 2 5 6 4 6 3 2 5Asp D 5 4 8 11 1 7 10 5 6 3 2 5Cys C 2 1 1 1 52 1 1 2 2 2 1 1Gln Q 3 5 5 6 1 10 7 3 7 2 3 5...Trp W 0 2 0 0 0 0 0 0 1 0 1 0Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1Val V 7 4 4 4 4 4 4 4 5 4 15 10

Page 11: Sequence Alignment and Phylogenetic Analysis

The Blosum50 Scoring Matrix

Page 12: Sequence Alignment and Phylogenetic Analysis

Affine Gap Penalties

• In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events:

Normal scoring would give the same score for both alignments

This is more likely.

This is less likely.

Page 13: Sequence Alignment and Phylogenetic Analysis

Affine gaps(n) = d + (n – 1)e

| | gap gap open extend

To compute optimal alignment,

F(i, j): score of alignment x1…xi to y1…yjifif xi aligns to yj

G(i, j): score ifif xi aligns to a gap after yjH(i, j): score ifif yj aligns to a gap after xi

V(i, j) = best score of alignment x1…xi to y1…yj

de

(n)

Page 14: Sequence Alignment and Phylogenetic Analysis

Needleman-Wunsch with affine gaps

Initialization:V(i, 0) = d + (i – 1)eV(0, j) = d + (j – 1)e

Iteration:V(i, j) = max{ F(i, j), G(i, j), H(i, j) }

F(i, j) = V(i – 1, j – 1) + s(xi, yj)

V(i, j – 1) – d G(i, j) = max

G(i, j – 1) – e

V(i – 1, j) – d H(i, j) = max

H(i – 1, j) – e

Termination: similar

Page 15: Sequence Alignment and Phylogenetic Analysis

Pairwise Alignment Tools

Page 16: Sequence Alignment and Phylogenetic Analysis

Some Typical Dot-plot Comparisons

• Divergent sequences where only a segment is homologous

• Long insertions and deletions• Tandem repeats

• The square shape of the pattern is characteristic of these repeats

Page 17: Sequence Alignment and Phylogenetic Analysis

Using Dotlet• Dotlet is one of the handiest tools for making

dot plots• Dotlet is a Java applet• Open and download the applet at the following

site: http://myhits.isb-sib.ch/cgi-bin/dotlet• Use Firefox or IE

Page 18: Sequence Alignment and Phylogenetic Analysis

Window size

Dot plot window

Alignment window

Threshold window for fine tuning

Page 19: Sequence Alignment and Phylogenetic Analysis

Window size

Dot plot window

Alignment window

Threshold window for fine tuning

Page 20: Sequence Alignment and Phylogenetic Analysis

Window size

Dot plot window

Alignment window

Threshold window for fine tuning

Page 21: Sequence Alignment and Phylogenetic Analysis

Looking at Repeated Domains with Dotlet

• The square shape is typical of tandem repeats

• The repeats are not perfect because the sequences have diverged after their duplication

Page 22: Sequence Alignment and Phylogenetic Analysis

Comparing a Gene and Its Product

• Eukaryotic genes are transcribed into RNA

• The RNA is then spliced to remove the introns’ sequences

• It may be necessary to compare the gene and its product

• Dotlet makes this comparative analysis easy

Page 23: Sequence Alignment and Phylogenetic Analysis

Lalign and BLAST• Lalign is like a very precise BLAST

• It works on only two sequences at a time

• You must provide both sequences

Page 24: Sequence Alignment and Phylogenetic Analysis

LaLignhttp://www.ch.embnet.org/software/LALIGN_form.html

Page 25: Sequence Alignment and Phylogenetic Analysis
Page 26: Sequence Alignment and Phylogenetic Analysis

Lalign Output• Lalign produces an output

similar to the alignment section of BLAST

• The E-value indicates the significance of each alignment

• Low E-value good alignment

Page 27: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment

Page 28: Sequence Alignment and Phylogenetic Analysis

Example

Page 29: Sequence Alignment and Phylogenetic Analysis

4 Ways of Using MSAs . . .

Page 30: Sequence Alignment and Phylogenetic Analysis

4 More Ways of Using MSAs

Page 31: Sequence Alignment and Phylogenetic Analysis

Generalizing the Notion of Pairwise Alignment

• Alignment of 2 sequences is represented as a 2-row matrix• In a similar way, we represent alignment of 3 sequences

as a 3-row matrix

A T _ G C G _ A _ C G T _ A A T C A C _ A

• Score: more conserved columns, better alignment

Page 32: Sequence Alignment and Phylogenetic Analysis

Aligning Three Sequences• Same strategy as aligning

two sequences• Use a 3-D “”, with each

axis representing a sequence to align

• For global alignments, go from source to sink

source

sink

Page 33: Sequence Alignment and Phylogenetic Analysis

Architecture of 3-D Alignment Cell(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k) (i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

Page 34: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment: Dynamic Programming

• si,j,k = max

(x, y, z) is an entry in the 3-D scoring matrix

si-1,j-1,k-1 + (vi, wj, uk)

si-1,j-1,k + (vi, wj, _ )

si-1,j,k-1 + (vi, _, uk)

si,j-1,k-1 + (_, wj, uk)

si-1,j,k + (vi, _ , _)

si,j-1,k + (_, wj, _)

si,j,k-1 + (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

Page 35: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is 7n3; O(n3)

• For k sequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk)

• Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

Page 36: Sequence Alignment and Phylogenetic Analysis

Sum of Pairs Score(SP-Score)

• Consider pairwise alignment of sequences ai and aj

imposed by a multiple alignment of k sequences • Denote the score of this suboptimal (not necessarily

optimal) pairwise alignment as s*(ai, aj)• Sum up the pairwise scores for a multiple alignment:

s(a1,…,ak) = Σi,j s*(ai, aj)

Page 37: Sequence Alignment and Phylogenetic Analysis

SP-Score: Examplea1

.ak

ATG-C-AATA-G-CATATATCCCATTT

ji

jik aaSaaS,

*1 ),()...(

2

nPairs of Sequences

A

A A11

1

G

C G1

Score=3 Score = 1 –

Column 1 Column 3

s s*(

To calculate each column:

Page 38: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Page 39: Sequence Alignment and Phylogenetic Analysis

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAGy: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that inducesthem?

Page 40: Sequence Alignment and Phylogenetic Analysis

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAGy: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that inducesthem? NOT ALWAYS

Pairwise alignments may be inconsistent

Page 41: Sequence Alignment and Phylogenetic Analysis

Profile Representation of Multiple Alignment

- A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G

A 1 1 .8 C .6 1 .4 1 .6 .2G 1 .2 .2 .4 1T .2 1 .6 .2- .2 .8 .4 .8 .4

Page 42: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment: Greedy Approach

• Choose most similar pair of strings and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat

• This is a heuristic greedy method

u1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

uk = CCGGCCGGCCGG…

kk-1

Page 43: Sequence Alignment and Phylogenetic Analysis

Greedy Approach: Example

• Consider these 4 sequencess1 GATTCAs2 GTCTGAs3 GATATTs4 GTCAGC

Page 44: Sequence Alignment and Phylogenetic Analysis

Greedy Approach: Example (cont’d)

• There are = 6 possible alignments

2

4

s2 GTCTGAs4 GTCAGC (score = 2)

s1 GAT-TCAs2 G-TCTGA (score = 1)

s1 GAT-TCAs3 GATAT-T (score = 1)

s1 GATTCA--s4 G—T-CAGC(score = 0)

s2 G-TCTGAs3 GATAT-T (score = -1)

s3 GAT-ATTs4 G-TCAGC (score = -1)

Page 45: Sequence Alignment and Phylogenetic Analysis

Greedy Approach: Example (cont’d)

s2 and s4 are closest; combine:

s2 GTCTGAs4 GTCAGC

s2,4 GTCt/aGa/cA (profile)

s1 GATTCAs3 GATATTs2,4 GTCt/aGa/c

new set of 3 sequences:

Page 46: Sequence Alignment and Phylogenetic Analysis

Progressive Alignment

• Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.

• Progressive alignment works well for close sequences, but deteriorates for distant sequences• Gaps in consensus string are permanent• Use profiles to compare sequences

Page 47: Sequence Alignment and Phylogenetic Analysis

ClustalW

• Popular multiple alignment tool today• ‘W’ stands for ‘weighted’ (different parts

of alignment are weighted differently).• Three-step process

1.) Construct pairwise alignments2.) Build Guide Tree3.) Progressive Alignment guided by the tree

Page 48: Sequence Alignment and Phylogenetic Analysis

Step 1: Pairwise Alignment

Page 49: Sequence Alignment and Phylogenetic Analysis

Step 2: Guide Tree

• Create Guide Tree using the similarity matrix

• ClustalW uses the neighbor-joining method

• Guide tree roughly reflects evolutionary relations

Page 50: Sequence Alignment and Phylogenetic Analysis
Page 51: Sequence Alignment and Phylogenetic Analysis

Step 3: Progressive Alignment• Start by aligning the two most similar

sequences• Following the guide tree, add in the next

sequences, aligning to the existing alignment• Insert gaps as necessary

Page 52: Sequence Alignment and Phylogenetic Analysis
Page 53: Sequence Alignment and Phylogenetic Analysis

Multiple Alignment: History

1975 SankoffFormulated multiple alignment problem and gave dynamic programming solution

1988 Carrillo-LipmanBranch and Bound approach for MSA

1990 Feng-DoolittleProgressive alignment

1994 Thompson-Higgins-Gibson-ClustalWMost popular multiple alignment program

1998 Morgenstern et al.-DIALIGNSegment-based multiple alignment

2000 Notredame-Higgins-Heringa-T-coffeeUsing the library of pairwise alignments

2004 MUSCLE

Page 54: Sequence Alignment and Phylogenetic Analysis

Practice of MSA

Page 55: Sequence Alignment and Phylogenetic Analysis

Choosing the Right Sequences• When building an alignment, it is your job to select the sequences• Two main factors when selecting sequences:

• Number of sequences• Nature of the sequences

• A reasonable number of sequences: 20 to 50• Ideal for most methods• Small alignments are easy to display and analyze

• Types of sequences• Well-selected sequences informative alignment

Page 56: Sequence Alignment and Phylogenetic Analysis

Some Guidelines for Choosing the Right Sequences

Page 57: Sequence Alignment and Phylogenetic Analysis

DNA or Proteins?• DNA sequences are harder to align than proteins

• DNA-comparison models are less sophisticated

• Most methods work for both DNA and proteins • The results are less useful for DNA

• If your DNA is coding, work on the translated proteins

• If sequences are homologous . . .• Along their entire length use progressive alignment methods (next slide)• In terms of local similarity use motif-discovery methods (end of chapter)

Page 58: Sequence Alignment and Phylogenetic Analysis

Choosing Sequences That Are Different Enough

• An alignment is useful if . . .• The sequences are correctly aligned • It can be used to produce trees, profiles, and structure predictions

• To obtain this result, the sequences must be• Not too similar • Not too different

• Sequences that are very similar . . .• Are easy to align correctly• Are not informative useless trees and profiles, bad predictions

• Sequences that are very different . . . • Are difficult to align • Are very informative good trees and profiles, good predictions

Page 59: Sequence Alignment and Phylogenetic Analysis

Steps

• Gathering right sequences• Compute MSA using servers/local programs• Evaluate the results visually• If it is hard to interpret

• Closer examination, remove trouble makers• Redo and trim if needed

Page 60: Sequence Alignment and Phylogenetic Analysis

Gathering Sequences with BLAST

• The most convenient way to select your sequences is to use a BLAST server

• Some BLAST servers are integrated with multiple-alignment methods:• www.expasy.ch (protein only)• srs.ebi.ac.uk (DNA/protein)• npsa-pbil.ibcp.fr

Page 61: Sequence Alignment and Phylogenetic Analysis

Gathering Sequences with BLAST

• Select some of the top sequences

• Evenly select some sequences down to the bottom

• The idea is to have many intermediate sequences

Page 62: Sequence Alignment and Phylogenetic Analysis

ExPASY• www.expasy.ch/tools/blast

Page 63: Sequence Alignment and Phylogenetic Analysis
Page 64: Sequence Alignment and Phylogenetic Analysis
Page 65: Sequence Alignment and Phylogenetic Analysis
Page 66: Sequence Alignment and Phylogenetic Analysis
Page 67: Sequence Alignment and Phylogenetic Analysis
Page 68: Sequence Alignment and Phylogenetic Analysis

>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2 MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES >sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2 MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS >sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1 PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES >sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1 SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES >sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1 SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA >sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2 MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS >sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1 MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ >sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1 AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG >sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1 AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG

Page 69: Sequence Alignment and Phylogenetic Analysis

If Know Protein Sequences• www.expasy.ch/sprot/sprot-retrieve-

list.html

Page 70: Sequence Alignment and Phylogenetic Analysis

Aligning Your Sequences

• Aligning sequences correctly is very difficult• It’s hard to align protein sequences with less than 25%

identity (70% identity for DNA)

• All methods are approximate• Alignment methods use the progressive algorithm

• Compares the sequences two by two• Builds a guide tree• Aligns the sequences in the order indicated by the tree

Page 71: Sequence Alignment and Phylogenetic Analysis

Selecting a Method

• Many alternative methods exist for MSAs

• Most of them use the progressive algorithm• They all are approximate methods• None is guaranteed to deliver the best alignments

• All existing methods have pros and cons• ClustalW is the most popular (21,000 citations)• T-Coffee and ProbCons are more accurate but slower• MUSCLE is very fast, ideal for very large datasets

Page 72: Sequence Alignment and Phylogenetic Analysis

Selecting a Method (cont’d.)• It’s impossible to guess in advance which method will do

best.• Accuracy is merely an average estimation

• Methods are tested on reference datasets• Their accuracy is the average accuracy obtained on the reference

• The most accurate method can always be outperformed by a less accurate method on a given dataset.

• An alternative: Use consensus methods such as MCOFFEE

Page 73: Sequence Alignment and Phylogenetic Analysis

ClustalW

• www.ebi.ac.uk/clustalw• pir.georgetown.edu/pirwww/search/

multialn.shtml• www.ddbj.nig.ac.jp/search/clustalw-e.html

Page 74: Sequence Alignment and Phylogenetic Analysis
Page 75: Sequence Alignment and Phylogenetic Analysis
Page 76: Sequence Alignment and Phylogenetic Analysis
Page 77: Sequence Alignment and Phylogenetic Analysis
Page 78: Sequence Alignment and Phylogenetic Analysis

Tcoffee

• TCOFFEE: www.tcoffee.org• CORE: evaluate MSA• MCOFFEE: run many and combine• EXPRESSO: with structural information

Page 79: Sequence Alignment and Phylogenetic Analysis
Page 80: Sequence Alignment and Phylogenetic Analysis
Page 81: Sequence Alignment and Phylogenetic Analysis
Page 82: Sequence Alignment and Phylogenetic Analysis

Running Many Methods at Once• MCOFFEE is a a meta-method

• It runs all the individual MSA methods• It gathers all the produced MSAs• It combines the MSAs into a single MSA

• MCOFFEE is more accurate than any individual method

• Its color output lets you estimate the reliability of your MSA

• MCOFFEE is available on www.tcoffee.org

Page 83: Sequence Alignment and Phylogenetic Analysis

MCOFFEE Color Output• Red and orange residues are probably well

aligned• Yellow should be treated with caution• Green and blue are probably incorrectly aligned

Page 84: Sequence Alignment and Phylogenetic Analysis

MCOFFEE

Page 85: Sequence Alignment and Phylogenetic Analysis

TCOFFEE

Page 86: Sequence Alignment and Phylogenetic Analysis

TCOFFEE Results

Page 87: Sequence Alignment and Phylogenetic Analysis

Interpreting Your MSA• Don’t put blind trust in the output of the servers

• Specialists always edit their MSAs by hand

• You must always estimate the biological accuracy of your MSA• Use the color code of Tcoffee • Use the conservation patterns of ClustalW:

• ‘*’ Completely conserved position• ‘:’ Highly conserved position • ‘.’ Conserved position

• Use experimental knowledge of your proteins

Page 88: Sequence Alignment and Phylogenetic Analysis

Understanding Conserved Positions

Page 89: Sequence Alignment and Phylogenetic Analysis

Finding Information from Alignment

• Conserved regions• Insert/delete• Phylogenetic Reconstruction• Motif• …

Page 90: Sequence Alignment and Phylogenetic Analysis

>sp|P02586|TNNC2_RABIT Troponin C, skeletal muscle OS=Oryctolagus cuniculus GN=TNNC2 PE=1 SV=2 MTDQQAEARSYLSEEMIAEFKAAFDMFDADGGGDISVKELGTVMRMLGQTPTKEELDAII EEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEIF RASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2 MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES >sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2 MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS >sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1 PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES >sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1 SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES >sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1 SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA >sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2 MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS >sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1 MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ >sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1 AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG >sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1 AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG

Page 91: Sequence Alignment and Phylogenetic Analysis
Page 92: Sequence Alignment and Phylogenetic Analysis

When Sequences Are Hard to Align

• Most MSA programs assume your sequences are related along their whole length

• When this assumption is not true, the progressive approach will not work

• The only alternative is to compare multiple sequences locally

Page 93: Sequence Alignment and Phylogenetic Analysis

Local Multiple-Comparison Methods

• Gibbs Sampler• Will make a local multiple alignment• Will ignore unrelated segments of your sequences• Ideal for finding DNA patterns such as promoters

• Motif discovery methods• Will look for motifs conserved in your sequences• The sequences do not need to be aligned

• The most popular motif-discovery methods:• TEIRESIAS, MEME, SMILE, PRATT