Lecture 5 : Phylogenies


9/16/09

Translated blast = protein vs translated database

Blasting Genbank - blastn

Z. bruijni - long beaked echidna
T. aculeatus - echidna
T. rostratus = honey possum

AX8GS9DG01S

Blasting Genbank - discont megablast - exactly same as blastn

Z. bruijni - long beaked echidna
T. aculeatus - echidna
T. rostratus = honey possum

AX9N23U7014

Blasting Genbank - megablast - same species but different order

Z. bruijni - long beaked echidna
T. aculeatus - echidna
T. rostratus = honey possum

AX9TUM1G016

Blasting Genbank - tblastn

AX9DYYTE01N

T. aculeatus - echidna
S. brachyurus - quokka
S. crassicaudata - fat tailed dunnart
M. fasciatus - numbat
I. obesulus - quenda

Species found by BLAST

I. obesulus = quenda = bandicoot

T. aculeatus = echidna

M. fasciatus = numbat

T. rostratus = honey possum

S. crassicaudata = fat tailed dunnart

O. anatinus = platypus

S. brachyurus = quokka

Z. bruijni - Long beaked echidna

Homologene - can be reached from NCBI home page

Scroll down - they are listed alphabetically

Questions

Phylogenies - what are they?

1. How do we build them?

2. What do they tell us?

Phylogeny

Evolutionary history of a group of organisms, especially as depicted in a family tree

Haeckel, 1879

Things trees might tell you:

How are organisms with particular trait related?

Did trait evolve multiple times or only once?

What is the evolutionary pathway?
Of organisms
Of genes

Molecules can be used to learn how organisms are related

To learn about vertebrate evolution: Compare >600 genes

1998

Used genes to measure time

1) Time since common ancestor with human

2) Time since two groups diverged

More recent version of vertebrate evolution which shows divergence times on the animal tree

Ponting 2008

[Figure: vertebrate tree with divergence times. Species shown: orangutan, human, chimp, rhesus monkey, mouse, rat, dog, cat, horse, cow, opossum, wallaby, platypus, anole, chicken, frog, fish (medaka, fugu, Tetraodon, zebrafish), elephant shark, lamprey]

Divergence times marked on the tree:
Primates 25 MY
Mammals 100 MY
Fish 320 MY
Tetrapods 420 MY
All vertebrates 550 MY

Molecular clock

Molecules change at a steady rate
We can calibrate how fast they change using fossils
The molecules then become a timepiece to measure how recently different groups split off from each other
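
A minimal sketch of the clock idea, assuming a strictly constant rate and a simple pairwise distance. The function names and the numbers are hypothetical, chosen only to show the arithmetic: a fossil-dated split calibrates the rate, and the rate then converts another distance into a time.

```python
def substitution_rate(distance, calibration_time_my):
    """Changes per site per million years along one lineage.

    distance: sequence difference between two groups whose split
    is dated by a fossil at calibration_time_my million years.
    Both lineages change after the split, hence the factor of 2.
    """
    return distance / (2.0 * calibration_time_my)


def divergence_time(distance, rate):
    """Million years since two groups split, given a calibrated rate."""
    return distance / (2.0 * rate)


# Hypothetical calibration: two groups differ at 20% of sites and
# fossils date their split to 100 million years ago.
rate = substitution_rate(distance=0.20, calibration_time_my=100.0)

# Hypothetical query: two other groups differ at 5% of sites.
time_my = divergence_time(distance=0.05, rate=rate)
print(f"rate = {rate:.4f} per site per MY, split about {time_my:.0f} MY ago")
```

The linear relation only holds for small differences; at larger distances repeated changes at the same site make the raw difference underestimate the time.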

Sequence conservation may be high

Gene might code for a protein which is highly constrained

Might have to interact with lots of other proteins

Selection might be quite strong

Sequence conservation may be low

Not much constraint

Few sites of interaction

Selection might be weak

Phylogeny steps

Align sequences so homologous AA can be compared

Determine the similarity between sequences

Use this to generate a relationship between sequences

Clustalw2 to align sequences

Put sequences in FASTA file

>TetraodonG1
MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIMFKMLALYMFFLICTGTPINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTITITSAINGYFILGATACAVEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTHAAVGVLFTWIMAFACAGPPLFGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVVHFFVPVFLIFFTYGSLVLTVRAAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYATFSGWIFMNKGAAFHPLTAALCAFFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGGAVDDETSVSASKTEVSSVS

>ZebrafishG1
MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGFPINVLTLVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVMEGFFATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPLFGWSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKAAAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAVPAFFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA

>CichlidG1
MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICTGTPINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGSTFCAIEGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMACAAPPLFGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYGSLVMTVKAAAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGASFSALTAAIPAFFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGGMVEDETSVSTSKTEVSSVS
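
In a FASTA file each record is a ">" header line with the sequence name, followed by one or more lines of sequence. The sketch below is a small hand-written reader, not part of Clustalw2; `parse_fasta` and the two toy records are hypothetical names and data used only to show the format.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text.

    A line starting with '>' opens a new record; the name is the first
    word after '>'. Sequence lines are joined until the next header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Toy input: sequences may be wrapped over several lines.
example = """>seqA
MNGTEGSNFY
IPMSNRTG

>seqB
MNGTEGKNFY
"""

sequences = parse_fasta(example)
for seq_name, seq in sequences.items():
    print(seq_name, len(seq), seq)
```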

Aligned sequences .aln ; Jalview gives colored version

Funky tree .dnd (need special program to draw)

Scroll down this page for tree (use Phylogram)

CLUSTAL W (1.83) multiple sequence alignment

TetraodonG1     MVWDGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYPQYYLVDPIMFKMLALYMFFLICTGT 60
CichlidG1       MAWEGGIEPNGTEGKNFYIPMSNRTGIVRSPFEYTQYYLADPIFFKLLAFYMFFLICTGT 60
ZebrafishG1     --------MNGTEGSNFYIPMSNRTGLVRSPYDYTQYYLAEPWKFKALAFYMFLLIIFGF 52
                         *****.***********:****::*.****.:*  ** **:***:**  *

TetraodonG1     PINGLTLLVTAQNKKLRQPLNYILVNLAVAGLIMCAFGFTITITSAINGYFILGATACAV 120
CichlidG1       PINSLTLFVTAQNKKLRQPLNYILVNLAVAGLIMCCFGFTITITSAFNGYFILGSTFCAI 120
ZebrafishG1     PINVLTLVVTAQHKKLRQPLNYILVNLAFAGTIMVIFGFTVSFYCSLVGYMALGPLGCVM 112
                *** ***.****:***************.** **  ****::: .:: **: **.  *.:

TetraodonG1     EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFTGTHAAVGVLFTWIMAFACAGPPL 180
CichlidG1       EGFMATLGGEVALWSLVVLAIERYIVVCKPMGSFKFSGAHAGAGVLFTWIMAMACAAPPL 180
ZebrafishG1     EGFFATLGGQVALWSLVVLAIERYIVVCKPMGSFKFSANHAMAGIAFTWFMACSCAVPPL 172
                ***:*****:**************************:. ** .*: ***:** :** ***

TetraodonG1     FGWSRYLPEGMQCSCGPDYYTLAPGYNNESYVIYMFVVHFFVPVFLIFFTYGSLVLTVR- 239
CichlidG1       FGWSRYIPEGMQCSCGPDYYTLAPGFNNESYVIYMFVVHFFVPVFIIFFTYGSLVMTVKA 240
ZebrafishG1     FGWSRYLPEGMQTSCGPDYYTLNPEYNNESYVMYMFSCHFCIPVTTIFFTYGSLVCTVKA 232
                ******:***** ********* * :******:***  ** :**  ********* **:

TetraodonG1     AAAQQQESESTQKAQREVTRMCILMVLGFLVAWTPYATFSGWIFMNKGAAFHPLTAALCA 299
CichlidG1       AAAQQQDSASTQKAEKEVTRMCVLMVMGFLIAWTPYASFAGWIFMNKGASFSALTAAIPA 300
ZebrafishG1     AAAQQQESESTQKAEREVTRMVILMVLGFLFAWVPYASFAAWIFFNRGAAFSAQAMAVPA 292
                ******:* *****::***** :***:***.**.***:*:.***:*:**:* . : *: *

TetraodonG1     FFAKSSALYNPVIYVLMNKQFRNCMLSTFGMGG--AVDDETS-VSASKTEVSSVS-- 351
CichlidG1       FFAKSSALYNPVIYVLMNKQFRNCMLSTIGMGG--MVEDETS-VSTSKTEVSSVS-- 352
ZebrafishG1     FFSKTSAVFNPIIYVLLNKQFRSCMLNTLFCGKSPLGDDESSSVSTSKTEVSSVSPA 349
                **:*:**::**:****:*****.***.*:  *     :**:* **:*********

Alignment is key

Any other analysis that you do is only as good as your alignment

If your alignment is bad subsequent analyses will be bad

Junk in = Junk out

Alignments

Tell you about sequence conservation
How much is there?
Where is it?

Calculate sequence similarities

Zebrafish M--------NGTEGSNFYIPMSNR
Trout     M------Q-NGTEGSNFYIPMSNR
Medaka    M------E-NGTEGKNFYIPMNNR
Cod       M----RMEANGTEGKNFYIPMSNR
Halibut   MVWDGGIEPNGTEGKNFYIPMSNR
Tetraodon MVWDGGIEPNGTEGKNFYIPMSNR
Goldfish  M--------NGTEGNNFYVPLSNR
Killifish M---GYG-PNGTEGNNFYIPMSNK
          *        *****.***:*:.*:

Pairwise comparisons
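
A sketch of how pairwise similarity can be read off an alignment, using four of the aligned fish sequences above. One assumption here is how gaps are treated: columns where either sequence has a gap are skipped, which is only one of several conventions. `fraction_identity` is a hypothetical helper name.

```python
from itertools import combinations


def fraction_identity(a, b):
    """Fraction of aligned columns with the same residue in both sequences.

    Columns where either sequence has a gap ('-') are ignored
    (an assumption; other gap treatments are possible).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return matches / len(pairs)


aligned = {
    "Zebrafish": "M--------NGTEGSNFYIPMSNR",
    "Halibut":   "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Tetraodon": "MVWDGGIEPNGTEGKNFYIPMSNR",
    "Goldfish":  "M--------NGTEGNNFYVPLSNR",
}

# Every pair of sequences gets one similarity value.
identities = {
    (n1, n2): fraction_identity(aligned[n1], aligned[n2])
    for n1, n2 in combinations(aligned, 2)
}

for (n1, n2), value in identities.items():
    print(f"{n1:10s} {n2:10s} {value:.4f}")
```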

Use tree to show sequence relationships

Short branches mean sequences are more similar
Long branches mean there are more differences

Q3. How do we build phylogenies?

Assume the relationships involve bifurcating branches

[Figure: bifurcating tree relating the sequences ATC, ATG, ACG, CCG and CCC]

Methods to determine similarities

Parsimony

Distance

Maximum likelihood

Bayesian

Parsimony

The least complex explanation is the most likely to be correct (Occam’s razor)

The preferred phylogenetic tree is the one that requires the fewest changes
Count up the number of changes for all possible trees
Find the shortest one
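
A sketch of the counting step, assuming rooted bifurcating trees written as nested pairs. The minimum number of changes per site is found with Fitch's set-intersection rule, which is one standard way to count and is not named in the lecture. The tip names A to D are hypothetical labels for four sequences of the kind used in the parsimony example.

```python
def fitch(site_tree):
    """Return (possible ancestral states, minimum changes) for one site.

    site_tree: a leaf is a one-letter string; an internal node is a
    (left, right) pair of subtrees.
    """
    if isinstance(site_tree, str):
        return {site_tree}, 0
    left_states, left_changes = fitch(site_tree[0])
    right_states, right_changes = fitch(site_tree[1])
    shared = left_states & right_states
    if shared:
        # Children can agree on a state: no change needed at this node.
        return shared, left_changes + right_changes
    # No shared state: one change is required on a branch below this node.
    return left_states | right_states, left_changes + right_changes + 1


def site_states(tree, sequences, i):
    """Replace each tip name in the tree with its base at site i."""
    if isinstance(tree, str):
        return sequences[tree][i]
    return (site_states(tree[0], sequences, i),
            site_states(tree[1], sequences, i))


def tree_length(tree, sequences):
    """Total number of changes the tree requires, summed over all sites."""
    n_sites = len(next(iter(sequences.values())))
    return sum(fitch(site_states(tree, sequences, i))[1]
               for i in range(n_sites))


sequences = {"A": "ATCG", "B": "ATCG", "C": "ACCG", "D": "ACCG"}
candidate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
lengths = [tree_length(t, sequences) for t in candidate_trees]
most_parsimonious = candidate_trees[lengths.index(min(lengths))]
print(lengths, most_parsimonious)
```

With four tips all three groupings can be listed by hand; the number of possible trees grows very quickly with more sequences, so real programs search rather than enumerate.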

Trees based on parsimony

[Figure: candidate trees for four sequences (ATCG, ATCG, ACCG, ACCG), with each C/T change marked on a branch; the tree requiring the fewest changes is labeled most parsimonious]

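Trees based on parsimony

[Figure: the same candidate trees shown for the single variable site (T, T, C, C), with each C/T change marked on a branch; the tree requiring the fewest changes is labeled most parsimonious]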

Can’t always distinguish tree topologies

[Figure: two trees with tip states T, T, C, C, each marked with one C/T change; labeled equally parsimonious]

Other limitations

All changes are weighted the same
C-T same as C-A
Same no matter how long it takes for the change to occur

Distance methods

Calculate a numerical value for sequence differences
Do for all pairwise combinations

Build tree by joining most similar sequences and then more divergent

Distance methods

Fast
Pretty robust
Only deals with data in pairs

Pairwise distances

Taxa1 AACGGTCATGGCGTTGCATT
Taxa2 AACGGTCAGGGCGTTGCATT
Taxa3 AACGGTCACGCCGCTGCATT

      1     2     3
1     0    .05   .15
2    .05    0    .15
3    .15   .15    0
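
The distance table above can be reproduced by counting mismatches for every pair of sequences. This is a minimal sketch that assumes the sequences are already aligned and of equal length; `p_distance` is a hypothetical helper name.

```python
def p_distance(a, b):
    """Fraction of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}

names = list(taxa)
# Square matrix: row and column order follow the order of names.
matrix = [[p_distance(taxa[r], taxa[c]) for c in names] for r in names]

for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.2f}" for value in row))
```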

Distance, d

p is fractional similarity of sequence

Simplest form of distance: d = 1 - p

AACGGTCATGGCGTTGCATT
AACGGTCACGGCGTTGCATT

p = 19/20 = 0.95, so d = 0.05

Tree building

Neighbor joining
Join most similar pair of sequences
Add more divergent after

      1     2     3
1     0    .05   .15
2    .05    0    .15
3    .15   .15    0

[Figure: tree of sequences 1, 2 and 3]
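
A sketch of the "join the most similar pair, then add the more divergent" idea, applied to the distance table above. Note the assumption: this is a simplified clustering that averages distances after each join (UPGMA-style). It is not the full neighbor joining algorithm, which corrects for unequal rates before choosing a pair. `join_closest` is a hypothetical name.

```python
def join_closest(names, distances):
    """Build a tree as nested pairs by repeatedly joining the closest pair.

    distances: dict mapping frozenset({a, b}) -> distance between a and b.
    After a join, the distance from the new cluster to every other
    cluster is the size-weighted average of its two members' distances.
    """
    clusters = list(names)
    sizes = {name: 1 for name in names}
    dist = dict(distances)
    while len(clusters) > 1:
        candidate_pairs = [(a, b)
                           for i, a in enumerate(clusters)
                           for b in clusters[i + 1:]]
        a, b = min(candidate_pairs, key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        sizes[merged] = sizes[a] + sizes[b]
        remaining = [c for c in clusters if c not in (a, b)]
        for c in remaining:
            dist[frozenset((merged, c))] = (
                sizes[a] * dist[frozenset((a, c))]
                + sizes[b] * dist[frozenset((b, c))]
            ) / sizes[merged]
        clusters = remaining + [merged]
    return clusters[0]


distances = {
    frozenset(("1", "2")): 0.05,
    frozenset(("1", "3")): 0.15,
    frozenset(("2", "3")): 0.15,
}
tree = join_closest(["1", "2", "3"], distances)
print(tree)
```

Sequences 1 and 2 are joined first because their distance is smallest; sequence 3 attaches to that pair afterwards.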

How different can 2 sequences get?

At infinite time, two sequences are the same only by random chance
Probability a base is the same = 1/4 (DNA only has 4 bases)
Certain sites will start to change multiple times
Need to account for these multiple hits

Random sequences

Write down 20 bases of sequence

Compare your sequence to this one

AGTCCGATTACGGCTAGCAG

What fraction of sites are the same in the two sequences?
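
The classroom exercise can be repeated many times in code. This sketch assumes all four bases are equally likely and uses a fixed random seed so the run is repeatable; the function names are hypothetical.

```python
import random


def fraction_same(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def random_sequence(length, rng):
    """Random DNA sequence with each base drawn with probability 1/4."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(0)  # fixed seed, chosen arbitrarily
reference = "AGTCCGATTACGGCTAGCAG"

# Compare many random 20-base sequences to the reference sequence.
trials = [
    fraction_same(reference, random_sequence(len(reference), rng))
    for _ in range(10000)
]
mean_similarity = sum(trials) / len(trials)
print(f"mean fraction of matching sites: {mean_similarity:.3f}")
```

Any single 20-base comparison can land well above or below one quarter; the average over many comparisons is what settles near 0.25.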

Sequence similarity decays to 25% over long times

[Figure: sequence similarity (y axis, 0 to 1.2) plotted against time (x axis, 0 to 3.5)]

Sequence difference maxes at 0.75

[Figure: sequence difference (y axis, 0 to 1) plotted against time (x axis, 0 to 3.5)]

Sequence change accumulates linearly with time at beginning

[Figure: sequence difference (y axis, 0 to 1) plotted against time (x axis, 0 to 3.5)]

DNA models

Use different DNA models to account for how sequences evolve with time
Allows you to apply different molecular clocks
Relate sequence change to time
Clock is not linear except for small changes and short times
Models are the same as those used in maximum likelihood methods
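
As an illustration, the simplest such model is Jukes-Cantor, which assumes all base changes are equally likely. The lecture does not say which model was used, so treat this choice as an assumption. The sketch shows both directions: the observed difference expected after a given amount of change, and the correction that recovers the amount of change from an observed difference.

```python
import math


def jc69_expected_difference(d):
    """Observed fraction of differing sites after d substitutions per site.

    Rises almost linearly for small d, then levels off at 0.75.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p_diff):
    """Substitutions per site implied by an observed difference p_diff.

    Corrects for multiple hits at the same site. Undefined at or
    above 0.75, where the sequences look random to each other.
    """
    if not 0.0 <= p_diff < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


for observed in (0.05, 0.15, 0.50, 0.70):
    corrected = jc69_distance(observed)
    print(f"observed {observed:.2f} -> corrected {corrected:.3f}")
```

For small differences the correction is slight, which is why the clock looks linear at short times; near 0.75 the corrected distance grows very large.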

How good is your tree?

Bootstrap approach
Run the same method multiple times
Subsample data each time (use 50% of data)
See how reproducible the trees are
Count how many times a particular grouping occurs
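
A sketch of the counting described above, using the three Taxa sequences from the pairwise distance example. Each replicate keeps a random 50% of the alignment columns, as on the slide; the classic bootstrap instead draws columns with replacement up to the full length. As a simplifying assumption, a grouping counts as supported only when that pair is strictly the closest pair in the replicate. All function names are hypothetical.

```python
import random
from itertools import combinations


def supports_group(seqs, group, columns):
    """True if the two taxa in group are strictly the closest pair
    when only the given alignment columns are compared."""
    def differences(a, b):
        return sum(1 for i in columns if seqs[a][i] != seqs[b][i])

    target = frozenset(group)
    scores = {frozenset(p): differences(*p) for p in combinations(seqs, 2)}
    return all(scores[target] < score
               for pair, score in scores.items() if pair != target)


def grouping_support(seqs, group, replicates=200, fraction=0.5, seed=0):
    """Percent of subsampled replicates in which the grouping appears."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_sites = len(next(iter(seqs.values())))
    n_keep = max(1, int(n_sites * fraction))
    hits = 0
    for _ in range(replicates):
        columns = rng.sample(range(n_sites), n_keep)
        if supports_group(seqs, group, columns):
            hits += 1
    return 100.0 * hits / replicates


taxa = {
    "Taxa1": "AACGGTCATGGCGTTGCATT",
    "Taxa2": "AACGGTCAGGGCGTTGCATT",
    "Taxa3": "AACGGTCACGCCGCTGCATT",
}
support_12 = grouping_support(taxa, ("Taxa1", "Taxa2"))
support_13 = grouping_support(taxa, ("Taxa1", "Taxa3"))
print(f"Taxa1+Taxa2: {support_12:.0f}%  Taxa1+Taxa3: {support_13:.0f}%")
```

With only three variable sites the support for Taxa1 plus Taxa2 falls short of 100%, because some replicates happen to miss the columns that separate Taxa3.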

Distance tree for rod and cone transducin alpha subunit

Branch lengths are proportional to sequence differences

Bootstrap values are given for each node; they tell how reproducible that grouping is

[Figure: distance tree with a bootstrap value at each node: 58, 100, 100, 95, 98, 72, 69, 72, 98, 86, 98, 68, 97]
