Whole-Genome Prokaryote Phylogeny without Sequence Alignment. Bailin HAO and Ji QI. T-Life Research Center, Fudan University, Shanghai 200433, China; Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China. http://www.itp.ac.cn/~hao/


Page 1: bai2

Whole-Genome Prokaryote Phylogeny without Sequence Alignment

Bailin HAO and Ji QI

T-Life Research Center, Fudan University, Shanghai 200433, China

Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China

http://www.itp.ac.cn/~hao/

Page 2: bai2

Classification of Prokaryotes: A Long-Standing Problem

• Traditional taxonomy: too few features
• Morphology: spherical, helical, rod-shaped, …
• Metabolism: photosynthesis, N-fixing, desulfurization, …
• Gram staining: positive and negative

• SSU rRNA Tree (Carl Woese et al., 1977):
– 16S rRNA: ancient conserved sequences of about 1500 bases
– Discovery of the three domains of life: Archaea, Bacteria and Eucarya
– Endosymbiont origin of mitochondria and chloroplasts

Page 3: bai2

The SSU rRNA Tree of Life:

Major progress in the molecular phylogeny of prokaryotes, as evidenced by the history of Bergey’s Manual

Page 4: bai2

Bergey’s Manual Trust: Bergey’s Manual

• 1st Ed. “Determinative Bacteriology”: 1923
• 8th Ed. “Determinative Bacteriology”: 1974
• 1st Ed. “Systematic Bacteriology”: 1984-1989, 4 volumes
• 9th Ed. “Determinative Bacteriology”: 1994
• 2nd Ed. “Systematic Bacteriology”: 2001-200?, 5 volumes planned; On-Line “Taxonomic Outline of Procaryotes” by Garrity et al. (October 2003)

(26 phyla: A1-A2, B1-B24)

Page 5: bai2

Our Final Result

• 132 organisms (16A + 110B + 6E)
• Input: genome data
• Output: phylogenetic tree
• No selection of genes, no alignment of sequences, no fine adjustment whatsoever
• See the tree first. Story follows.

Page 6: bai2
Page 7: bai2

Protein Tree for 145 Organisms from 82 Genera (K=5)

16 Archaea (11 genera, 16 species)
123 Bacteria (65 genera, 98 species)
6 Eukaryotes

Page 8: bai2
Page 9: bai2

Complete Bacterial Genomes Appeared since 1995

Early Expectations:

• More support to the SSU rRNA Tree of Life

• Add details to the classification (branchings and groupings)

• More hints on taxonomic revisions

Page 10: bai2

Confusion brought by the hyperthermophiles

– Aquifex aeolicus (Aquae) 1998: 1 551 335 bp
– Thermotoga maritima (Thema) 1999: 1 860 725 bp
– “Genome Data Shake tree of life”, Science 280 (1 May 1998) 672
– “Is it time to uproot the tree of life?” Science 284 (21 May 1999) 130
– “Uprooting the tree of life”, W. Ford Doolittle, Scientific American (February 2000) 90

Page 11: bai2

Debate on Lateral Gene Transfer

• Extreme estimate: 17% in E. coli
– Limitations of the above approach: B. Wang, J. Mol. Evol. 53 (2001) 244
• “Phase transition” and “crystallization” of species (C. Woese 1998)
• Lateral transfer within smaller gene pools as an innovative agent
• Composition vector may incorporate LGT within small gene pools

Page 12: bai2

Alignment-Based Molecular Phylogeny

• TCAGACGC
• TCGGAGT

T C A G A C G C
T C G G A - G T

• Scoring scheme
• Gap penalty
• 16S rRNA tree was based on sequence alignment
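To make the roles of the scoring scheme and the gap penalty concrete, the sketch below scores the alignment shown on this slide column by column. The values match = +1, mismatch = -1 and gap = -2 are illustrative assumptions, not the scheme used for the 16S rRNA tree, and `score_alignment` is a hypothetical helper name.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum column scores of two already-aligned sequences ('-' marks a gap).

    The default scores are assumptions chosen only for illustration.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # one sequence has a gap in this column
        elif x == y:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score


# Alignment from the slide: 5 matches, 2 mismatches, 1 gap
print(score_alignment("TCAGACGC", "TCGGA-GT"))  # 5*1 + 2*(-1) + 1*(-2) = 1
```

An alignment program searches over all placements of gaps for the highest such score; this search is what becomes impractical for whole genomes.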

Page 13: bai2

– Problem: sequence alignment cannot be readily applied to complete genomes
– Homology -> alignment
– Different genome size, gene content and gene order

[Diagram: Gene A, B, C in the 1st species compared with A’, Gene B’, ? in the 2nd species]

Page 14: bai2

Our Motivations:

• Develop a molecular phylogeny method that makes use of complete genomes (no selection of particular genes)
• Avoid sequence alignment
• Try to reach higher resolution to provide an independent comparison with other approaches such as SSU rRNA trees
• Make comparison with bacteriologists’ systematics as reflected in Bergey’s Manual (2001, 2002)
• Our paper accepted by J. Molecular Evolution

Page 15: bai2

Other Whole-Genome Approaches

• Gene content
• Presence or absence of COGs
• Conserved Gene Pairs
• “Information” distances
• Domain order in proteins (Ken Nishikawa’s talk at InCoB2003)
• …

Page 16: bai2

Comparison of Complete Genomes/Proteomes

• Compositional vectors

Nucleotides: a, t, c, g

aatcgcgcttaagtc

Di-nucleotide (K=2) distribution:
{aa at ac ag ta tt tc tg ca ct cc cg ga gt gc gg}
{ 2, 1, 0, 1, 1, 1, 2, 0, 0, 1, 0, 2, 0, 1, 2, 0}
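The di-nucleotide count for the example sequence can be reproduced with a few lines of code. This is a minimal sketch: `composition_vector` is a hypothetical helper name, and the alphabet order "atcg" is chosen to match the ordering of the 16 string types listed on the slide. Note that tc occurs twice in this sequence (at positions 3-4 and 14-15), so the 14 overlapping di-nucleotides sum to 14.

```python
from collections import Counter
from itertools import product


def composition_vector(seq, k, alphabet="atcg"):
    """Count overlapping K-strings in seq, reported for every possible K-string."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product(alphabet, repeat=k)]
    return {w: counts[w] for w in words}


vec = composition_vector("aatcgcgcttaagtc", 2)
print(list(vec.keys()))
print(list(vec.values()))
print(sum(vec.values()))  # number of overlapping 2-strings: 15 - 2 + 1 = 14
```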

Page 17: bai2

K-strings make a composition vector

• DNA sequence -> vector of dimension 4^K
• Protein sequence -> vector of dimension 20^K
• Given a genomic or protein sequence -> a unique composition vector
• The converse: a vector -> one or more sequences?
• K big enough -> uniqueness
• Connection with the number of Eulerian loops in a graph (a separate study available as a preprint at arXiv:physics/0103028 and from Hao’s webpage)

Page 18: bai2

A Key Improvement: Subtraction of Random Background

• Mutations took place randomly at the molecular level
• Selection shaped the direction of evolution
• Many neutral mutations remain as random background
• At the single amino acid level, protein sequences are quite close to random
• Highlighting the role of selection by subtracting a random background

Page 19: bai2

Frequency and Probability

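• A sequence of length $L$
• A K-string $\alpha_1\alpha_2\cdots\alpha_K$
• Frequency of appearance $f(\alpha_1\alpha_2\cdots\alpha_K)$
• Probability

$$p(\alpha_1\alpha_2\cdots\alpha_K) = \frac{f(\alpha_1\alpha_2\cdots\alpha_K)}{L-K+1}$$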

Page 20: bai2

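Predicting #(K-strings) from the counts of (K-1)- and (K-2)-strings

Joint probability vs. conditional probability:

$$p(\alpha_1\alpha_2\cdots\alpha_K) = p(\alpha_K \mid \alpha_1\alpha_2\cdots\alpha_{K-1})\, p(\alpha_1\alpha_2\cdots\alpha_{K-1})$$

Making the weakest Markov assumption:

$$p(\alpha_K \mid \alpha_1\alpha_2\cdots\alpha_{K-1}) \approx p(\alpha_K \mid \alpha_2\cdots\alpha_{K-1})$$

Another joint probability:

$$p(\alpha_2\cdots\alpha_K) = p(\alpha_K \mid \alpha_2\cdots\alpha_{K-1})\, p(\alpha_2\cdots\alpha_{K-1})$$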

Page 21: bai2

(K-2)-th Order Markov Model

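$$p^0(\alpha_1\alpha_2\cdots\alpha_K) = \frac{p(\alpha_1\alpha_2\cdots\alpha_{K-1})\, p(\alpha_2\alpha_3\cdots\alpha_K)}{p(\alpha_2\alpha_3\cdots\alpha_{K-1})}$$

Change to frequencies:

$$f^0(\alpha_1\alpha_2\cdots\alpha_K) = \frac{f(\alpha_1\alpha_2\cdots\alpha_{K-1})\, f(\alpha_2\alpha_3\cdots\alpha_K)}{f(\alpha_2\alpha_3\cdots\alpha_{K-1})} \cdot \frac{(L-K+1)(L-K+3)}{(L-K+2)^2}$$

Normalization factor may be ignored when L>>K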

Page 22: bai2

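Construct compositional vectors using these modified string counts:

For the i-th string type of species A we use

$$a_i \;\to\; \frac{a_i - a_i^0}{a_i^0}$$

where $a_i$ is the observed count of the i-th K-string and $a_i^0$ is its Markov prediction.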
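The subtraction procedure of the last few slides can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function names are hypothetical, a single sequence stands in for a whole proteome, and components are computed for every K-string whose Markov prediction is non-zero (so a predicted but unobserved string gets the value -1).

```python
from collections import Counter, defaultdict


def kmer_counts(seq, k):
    """Counts of overlapping k-strings in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def subtracted_vector(seq, k):
    """Components (f - f0) / f0, with f0 predicted by a (K-2)-th order Markov model."""
    if k < 3:
        raise ValueError("this sketch assumes k >= 3")
    L = len(seq)
    f_k = kmer_counts(seq, k)
    f_k1 = kmer_counts(seq, k - 1)
    f_k2 = kmer_counts(seq, k - 2)
    # Normalization factor; close to 1 when L >> K
    norm = (L - k + 1) * (L - k + 3) / (L - k + 2) ** 2

    # Index (K-1)-strings by their first K-2 letters to find overlapping pairs
    by_prefix = defaultdict(list)
    for v in f_k1:
        by_prefix[v[:-1]].append(v)

    vec = {}
    for u in f_k1:                      # u = alpha_1 ... alpha_{K-1}
        for v in by_prefix[u[1:]]:      # v = alpha_2 ... alpha_K
            w = u + v[-1]               # candidate K-string
            f0 = f_k1[u] * f_k1[v] / f_k2[u[1:]] * norm
            vec[w] = (f_k.get(w, 0) - f0) / f0
    return vec


print(subtracted_vector("ABCABD", 3))
```

K-strings that never appear as an overlap of two observed (K-1)-strings have prediction zero and are simply left out, which corresponds to a zero component.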

Page 23: bai2

Composition Distance

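• Define correlation between two compositional vectors by the cosine of the angle between them
– From two complete proteomes:

A: {a1, a2, …, an}, n = 20^5 = 3 200 000
B: {b1, b2, …, bn}

$$C(A,B) = \frac{\sum_i a_i b_i}{\left(\sum_j a_j^2 \, \sum_j b_j^2\right)^{1/2}}, \qquad C(A,B) \in [-1,1]$$

• Distance

$$D(A,B) = \frac{1 - C(A,B)}{2}, \qquad D(A,B) \in [0,1]$$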
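The correlation and distance defined on this slide are straightforward to compute. The sketch below assumes each composition vector is stored sparsely as a dictionary from K-string to component (missing keys count as zero); the function names and the toy vectors are illustrative only.

```python
import math


def correlation(a, b):
    """Cosine of the angle between two sparse composition vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def distance(a, b):
    """D = (1 - C) / 2, which lies in [0, 1]."""
    return (1.0 - correlation(a, b)) / 2.0


# Toy vectors over three hypothetical 5-strings
A = {"GKSTL": 0.6, "HAMSC": 2.0}
B = {"GKSTL": 0.5, "AAAAA": -1.0}
print(distance(A, A))  # close to 0 for identical vectors
print(distance(A, B))
```

The resulting pairwise distances form the matrix that is passed to a distance-based tree builder such as Neighbor-Joining.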

Page 24: bai2

Materials: Genomes from NCBI (ftp.ncbi.nih.gov/genomes/Bacteria/)

Not the original GenBank files

          Phyla  Classes  Orders  Families  Genera  Species  Strains
Archaea       2        8      11        11      11       16       16
Bacteria     13       18      37        46      58       88      110
Total        15       26      48        57      69      104      126

6 Eucaryote genomes were included for reference

Tree construction: Neighbor-Joining in Phylip

Page 25: bai2

Protein Tree for 132 species (K=5)

16 Archaea (11 genera, 16 species)
110 Bacteria (57 genera, 88 species)
6 Eukaryotes

Page 26: bai2
Page 27: bai2

Protein Tree for 132 species (K=6)

16 Archaea (11 genera, 16 species)
110 Bacteria (57 genera, 88 species)
6 Eukaryotes

Page 28: bai2
Page 29: bai2

Protein Class vs. Whole Proteome

• Trees based on collection of ribosomal proteins (SSU + LSU): ribosomal proteins are interwoven with rRNA to form functioning complex; results consistent with SSU rRNA trees

• Trees based on collection of aminoacyl-tRNA synthetases (AARS). Trees based on single AARS were not good. Trees based on all 20 AARSs much better but not as good as that based on rProteins.

Page 30: bai2

Genus Tree based on Ribosomal Proteins

Page 31: bai2

A Genus Tree based on Aminoacyl tRNA synthetases

Page 32: bai2

Chloroplast Tree

• Sequences of about 100 000 bp

• Tree of the endosymbiont partners

• Paper accepted by Molecular Biology and Evolution on 12 August 2003

Page 33: bai2

Chloroplast tree

Page 34: bai2

Coronaviruses including Human SARS-CoV

• Sequences of tens of kilobases

• SARS sequence: about 29730 bases

• Paper published in Chinese Science Bulletin on 26 June 2003

Page 35: bai2

Coronavirus tree

Page 36: bai2

Understanding the Subtraction Procedure: Analysis of Extreme Cases in E. coli

• There are 1 343 887 5-strings belonging to 841 832 different types.
• Maximal count before subtraction: 58 for the 5-peptide GKSTL. 58 reduces to 0.646 after subtraction.
• Maximal component after subtraction: 197 for the 5-peptide HAMSC. The number 197 came from a single count of 1 before the subtraction.

Page 37: bai2

GKSTL: how does 58 reduce to 0.646?

• #(GKST)=113
• #(KSTL)=77
• #(KST)=247

• Markov prediction: 113*77/247=35.23

• Final result: (58-35.23)/35.23=0.646

Page 38: bai2

HAMSC: how does 1 grow to 197?

• #(HAMS)=1
• #(AMSC)=1
• #(AMS)=198

• Markov prediction: 1*1/198=1/198

• Final result: (1-1/198)/(1/198)=197
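The arithmetic for the two extreme cases, GKSTL and HAMSC, can be checked with a short script. As on the slides, the normalization factor is ignored; `subtracted_component` is a hypothetical helper name.

```python
def subtracted_component(f, f_left, f_right, f_middle):
    """(f - f0) / f0 with Markov prediction f0 = f_left * f_right / f_middle."""
    f0 = f_left * f_right / f_middle
    return (f - f0) / f0


# GKSTL: #(GKSTL)=58, #(GKST)=113, #(KSTL)=77, #(KST)=247
print(round(subtracted_component(58, 113, 77, 247), 3))  # 0.646

# HAMSC: #(HAMSC)=1, #(HAMS)=1, #(AMSC)=1, #(AMS)=198
print(round(subtracted_component(1, 1, 1, 198)))  # 197
```

A frequent string that is well predicted by its substrings is suppressed, while a rare string built from a common middle part and rare flanking parts is strongly amplified.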

Page 39: bai2

6121 Exact Matches of GKSTL in PIR Rel.1.26 with >1.2 Mil Proteins

• These 6121 matches came from a diverse taxonomic assortment, from viruses to bacteria to fungi to plants and animals, including humans

• In the parlance of classic cladistics GKSTL contributes to plesiomorphic characters that should be eliminated in a strict phylogeny

• The subtraction procedure did the job.

Page 40: bai2

15 Exact Matches of HAMSC in PIR Rel.1.26 with >1.2 Mil Proteins

• 1 match from a eukaryotic protein
• 4 matches (the same protein) from viruses
• 10 matches from prokaryotes, among which:
– 3 from Shigella and E. coli (HAMSCAPDKE)
– 3 from Salmonella (HAMSCAPERD)

HAMSC is characteristic for prokaryotes
HAMSCA is specific for enterobacteria

Page 41: bai2

Stable Topology of the Tree

• K=1: makes some sense!
• K=2,3,4: topology gradually converges
• K=5 and K=6: present calculation
• K=7 and more: too high resolution; star-tree or bush expected

Page 42: bai2

Statistical Test of the Tree

• Bootstrap versus Jackknife
• Bootstrap in sequence alignments
• “Bootstrap” by random selections from the AA-sequence pool
• A time-consuming job
• 180 bootstraps for 72 species

Page 43: bai2

About 70% of the genes of every species were selected in one bootstrap
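One resampling round of this kind might look like the sketch below. The slides do not specify the exact sampling scheme, so this assumes a replicate is formed by drawing a random 70% subset of each species' proteins without replacement; the function name, the `fraction` parameter and the toy data are all hypothetical.

```python
import random


def bootstrap_replicate(proteomes, fraction=0.7, seed=None):
    """Draw a random subset of proteins for every species (assumed scheme).

    proteomes: dict mapping species name -> list of protein sequences.
    """
    rng = random.Random(seed)
    replicate = {}
    for species, proteins in proteomes.items():
        n_keep = max(1, round(fraction * len(proteins)))
        replicate[species] = rng.sample(proteins, n_keep)  # without replacement
    return replicate


# Toy input: 10 placeholder "proteins" per species
toy = {
    "species_1": [f"protein_{i}" for i in range(10)],
    "species_2": [f"protein_{i}" for i in range(10)],
}
rep = bootstrap_replicate(toy, seed=0)
print({name: len(selected) for name, selected in rep.items()})
```

Each replicate would then go through the full pipeline (composition vectors, distance matrix, tree), and the support of a branch is the fraction of replicate trees in which it appears.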

Page 44: bai2

“K-string Picture” of Evolution

• K=5 -> 3 200 000 points in the space of 5-strings
• K=6 -> 64 000 000 points
• In the primordial soup: short polypeptides of a limited assortment
• Evolution by growth, fusion, mutation leads to diffusion in the string space
• String space not saturated yet
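The sizes of the string space, and how sparsely a single proteome fills it, follow from simple arithmetic. The sketch below uses the E. coli figure quoted elsewhere in this talk (841 832 distinct 5-string types); the function names are illustrative.

```python
AMINO_ACIDS = 20


def string_space_size(k):
    """Number of possible K-strings over the 20 amino acids."""
    return AMINO_ACIDS ** k


def occupancy(distinct_types, k):
    """Fraction of the K-string space occupied by the observed string types."""
    return distinct_types / string_space_size(k)


print(string_space_size(5))               # 3200000
print(string_space_size(6))               # 64000000
print(round(occupancy(841832, 5), 3))     # 0.263 for the E. coli 5-strings
```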

Page 45: bai2

The Problem of Higher Taxa

• 1974: Bacteria as a separate kingdom
• 1994: Archaea and Bacteria as two domains

• The relation of higher taxa?

Page 46: bai2

Summary

• As composition vectors do not depend on genome size and gene content, the use of whole-genome data is straightforward
• Data independent of that of 16S rRNA
• Method different from that based on SSU rRNA
• Results agree with SSU rRNA trees and Bergey’s Manual
• Hint on groupings of higher taxa
• A method without “free parameters”: data in, tree out
• Possibility of an automatic and objective classification tool for prokaryotes

Page 47: bai2

Conclusion: The Tree of Life is saved!

There is phylogenetic information in the prokaryotic proteomes.

Time to work on molecular definition of taxa.

Thank you!

Page 48: bai2
Page 49: bai2
Page 50: bai2

Protein Tree for 132 species (K=5)

16 Archaea (11 genera, 16 species)
110 Bacteria (57 genera, 88 species)
6 Eukaryotes

Page 51: bai2
Page 52: bai2
Page 53: bai2

A Failed Attempt Using Avoidance Signatures

Page 54: bai2
Page 55: bai2

Comparison with the Bergey’s Manual

Page 56: bai2

• Tree Construction
– phylip package of J. Felsenstein (Neighbor-Joining)
– The Fitch method is not feasible here
– Non-distance-matrix methods (MP, ML, etc.)

• Material
– ftp://ncbi.nlm.nih.gov/genomes/Bacteria/

          Phyla  Classes  Orders  Families  Genera  Species  Strains
Archaea       2        7       9         9       9       13       13
Bacteria      9       14      23        28      37       46       57
Total        11       21      32        37      46       59       70

Number of possible unrooted tree topologies for n taxa:

$$N = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}, \qquad N_{72} \approx 10^{120}$$
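The tree-count formula on this slide, $N = (2n-5)!/(2^{n-3}(n-3)!)$, gives the number of unrooted bifurcating trees for n taxa and can be evaluated exactly with integer arithmetic. The sketch below (with a hypothetical function name) shows why an exhaustive search over topologies is out of the question for 72 species.

```python
import math


def unrooted_tree_count(n):
    """Number of unrooted bifurcating tree topologies for n labelled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))


for n in (4, 5, 10):
    print(n, unrooted_tree_count(n))

# For 72 taxa the count is on the order of 10^120
print(f"{unrooted_tree_count(72):.2e}")
```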

Page 57: bai2

Early expectation from genome data

• Was there intensive lateral gene transfer?
• Gene tree cannot be equated to the real tree of life
• Genome data: 10^6 to 10^7
• Difficult to align whole genome data

Page 58: bai2

• Prokaryote and Eukaryote
• Three Kingdoms (Carl Woese, 16S rRNA)
– Archaea
– Eubacteria
– Eukarya

• Five Kingdoms (Lynn Margulis)
– Bacteria (Archaea, Eubacteria)
– Protoctista
– Animalia
– Fungi
– Plantae

Page 59: bai2

Common features of Archaea and Eubacteria: small cells, no nuclear membrane, circular DNA, no cap at the 5’ end of mRNA, presence of Shine-Dalgarno (S-D) sequences

Many proteins associated with replication, transcription, and translation are common to Archaea and Eukaryotes

Features of Archaea: lack of some enzymes, insensitive to some antibiotics

Page 60: bai2

“Compositional Representation of Protein Sequences and the Number of Eulerian Loops” by Bailin Hao, Huimin Xie, Shuyu Zhang

• K=5: 76.7% of proteins have a unique reconstruction
• K=6: 94.0%
• K=10: >99%

Checked 2820 AA-seqs from pdb.seq, a special selection of SWISS-PROT

See Los Alamos National Lab E-Archive: physics/0103028

Page 61: bai2

Subtraction of Random Background

• Using a (K-2)-order Markov Model
• K=2: genomic signature by Karlin and Burge
• May be justified by using the Maximum Entropy Principle with appropriate constraints (Hu & Wang, 2001)

$$P^0(a_1, a_2, \ldots, a_K) = \frac{P(a_1, a_2, \ldots, a_{K-1})\, P(a_2, a_3, \ldots, a_K)}{P(a_2, a_3, \ldots, a_{K-1})}$$

Page 62: bai2

What to do next

• Detailed comparison with traditional taxonomy
• Add more eukaryotes
• Elucidation of the foundation and limitations of the compositional approach
• Software and web interface
• Problem of lateral gene transfer
• Viruses?

Page 63: bai2

• Confusion brought by the hyperthermophiles
– Aquifex aeolicus (Aqua) 1998: 1 551 335
– Thermotoga maritima (Tmar) 1999: 1 860 725
– “Genome Data Shake tree of life”, Science 280 (1 May 1998) 672
– “Is it time to uproot the tree of life?” Science 284 (21 May 1999) 130
– “Uprooting the tree of life”, Sci. Amer. (February 2000) 9

• Problem of Lateral Gene Transfer (LGT): tree or network

• Problem of higher taxa