15
7/15/01 1 WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES Gary W. Stuart Department of Life Sciences Indiana State University Terre Haute, IN 47809 [email protected] [email protected] Jeffery J. Leader Department of Mathematics Rose-Hulman Institute of Technology Terre Haute, IN 47803

WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

Embed Size (px)

DESCRIPTION

WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES. Jeffery J. Leader Department of Mathematics Rose-Hulman Institute of Technology Terre Haute, IN 47803. Gary W. Stuart Department of Life Sciences Indiana State University Terre Haute, IN 47809. - PowerPoint PPT Presentation

Citation preview

Page 1: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 1

WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF

PROTEIN SEQUENCES

Gary W. StuartDepartment of Life SciencesIndiana State UniversityTerre Haute, IN 47809

[email protected] [email protected]

Jeffery J. LeaderDepartment of MathematicsRose-Hulman Institute of TechnologyTerre Haute, IN 47803

Page 2: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 2

INTRODUCTION Whole Genome Sequences

Whole genome sequences are now available for 51 prokaryotic bacterial genomes, 5 eukaryotic genomes (yeast, roundworm, fruit fly, human, and A. thaliana), and 160 animal mitochondrial genomes.

Molecular Phylogeny Gene or protein sequence similarity is an accurate

measure of species similarity. Methods that involve a comparison of whole genomes

should improve estimates of phylogenetic relatedness. Comprehensive methods for calculating, archiving, and

accessing gene and species similarity estimates using whole genomes are needed.

Page 3: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 3

METHODS: Whole Genome Target Sets

Mitochondrial: 13 gene families in 64 species = 832 proteins Microbial: 84 COG’s in 44 species = 5,871 proteins Mixed: 19 whole genomes & organelles = 28,653 proteins

Peptide Frequency Matrix 160,000 overlapping 4-grams x 832 = 133 million elements 8,000 overlapping 3-grams x 5,871 = 46 million elements 160,000 overlapping 4-grams x 28,653 = 4.5 billion elements

SVD and Dimension Reduction (DR) SVD-DR using bls-1:SVDPACK (Berry)

Gene Trees and Species Trees Gene to species data by vector addition (below) Cosine-based distances obtained from SVD-DR vectors Gene/species trees via NJ: PHYLIP (Felsenstein)

Page 4: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 4

METHODS: 13 n-dimensional Vectors Define each Mitochondrial Genome 84+ n-dimensional Vectors Define each COG Subset Genome 102-3 n-dimensional Vectors Define each Complete Microbial Genome

Separate vectors: gene trees

Vector sums: species trees

Page 5: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 5

RESULTS: Optimal Dimension Reduction for Mitochondrial Dataset

832 proteins in 13 families placed into 16 groups @ k=73-74

Page 6: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 6

RESULTS: Gene Tree Example for Mitochodrial Dataset

832 proteins from 64 species 13 families formed 16 groups @ k=74 dimensions Only ND3, ND5, and CYTB failed to group exclusively

CYTB Dsem

ND5 Amis

COX3 (64)COX2 (64)ATP6 (64)ATP8 (64)

ND3 (31)

ND1 (64)ND4L (64)ND6 (64)ND2 (64)COX1 (64)CYTB (63)ND5 (29)ND5 (18)ND5 (16)ND4 (64)ND3 (33)

Page 7: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 7

RESULTS: Consensus Species Tree for Mitochondrial Dataset

Optimal trees @ k=64-80 dimensions

Page 8: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 8

RESULTS: Optimal Dimension Reduction for Microbial COG Dataset

5, 871 proteins from 44 species (including 2 fungi) 84 gene families placed into less than 1040 groups @ k>256

Page 9: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 9

RESULTS: Gene Tree Example for Microbial COG Dataset

84 COG families formed 997 groups @ k=259 Too big to show!

Consensus Species Tree for COG Dataset Optimal trees @ 257-280 dimensions ALL 24 TREES - IDENTICAL TOPOLOGY

Challenging Relationships Hyperthermophiles: Aaeo & Tmar

eubacteria with archaebacteria Archaebacterium: Hnrc

archaebacteria with eubacteria Spirochetes: Tpal & Bbur

widely separated Bbur (& Baps) like Rpro mitochondria like?

circles: archaebacteriatriangles: eubacteriasquares: eukaryotesblack: hyperthermophilesgreen: Halobacteriumblue: spirochetes

LlacSpyrBhalBsub

CpneCtra

Sssp

HinfPmul

Vcho

EcooEcol

NmenNmez

TpalXfas

HnrcMtubMlep

CcreMlotPaerDrad

Cjej

HpyjHpyl

Uure

MpneMgen

BapsBbur

Rpro

CalbScer

TaciTvolAperMtheTmarAaeoAfulMjanPabyPhor

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

24

48

24

24

24

2424

2424

24

24

24

24

24

2424

24

24

24

24

24

24

24

24

2424

2424

2424

24

24

24

24

24

24 24

2424

24

24

2424

Page 10: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 10

RESULTS: Consensus Species Tree for Total Genome Dataset

Optimal gene trees too big to analyze (for now?) For species trees, k = 210-220 large enough?

Challenging Relationships Chloroplast (Noli) not with Cyanobacteria (Sspp) Eubacterial thermophiles with archaea Spirochetes Tpal & Bbur widely separated Mitochondria (Recl) with Rpro, Bbur

Bbur

Rpro

Recl

Mgen

Mpne

Hpyl

Sspp

HinF

Cpne

Ctra

Noli

Mtub

Tpal

Mthe

Mjan

Phor

Aful

Tmar

Aaeo

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

11

4

7

1111

11

11

circles: archaebacteriatriangles: eubacteriablack: hyperthermophilesred: mitochondriagreen: chloroplastblue: spirochetes

Page 11: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 11

DISCUSSION: How Does It Work?

Singular Value Decomposition (SVD): Ar = Ur r VrT

Matrix “Ar” is the original matrix (peptides x proteins) of rank “r”

Matrix “Ur” is the “peptide” matrix (peptides x singular vectors)

Matrix “Vr” is the “protein matrix (proteins x singular vectors)

Matrix “r” is a diagonal matrix of singular values in descending order

Graphically, SVD represents a rotation of description space In the “old” spaces, either...

• peptides are defined by protein vectors (protein vectors = columns of “A”)• proteins are defined by peptide vectors (peptide vectors = rows of “A”)

In the “new” space,• peptides are defined by singular vectors (singular vectors = columns of “U”)• proteins are defined by singular vectors (singular vectors = columns of “V”)• singular values () and their vectors are ordered by “importance” (variance)

Dimension Reduction removes “noise” : Ak = Uk k VkT

• if smaller singular values set to zero, then k<r

• Ak is always the most accurate reconstruction of Ak for a given k

• How is the “correct” value of “k” determined? (where does “signal” become “noise”?)

Page 12: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 12

DISCUSSION: What are the Singular Vectors? What do they represent?

Singular Vectors are Orthonormal Basis Vectors for the “New” Space the sum of the squares of the elements of each vector sum to exactly “1”.

Singular Vectors of “U” Define Peptides by Presence in Conserved Motifs the individual elements of these vectors indicate contributions of each peptide a positive value for an element indicates a positive correlation with a given motif a negative value indicates a negative correlation with that motif singular vectors are therefore precisely defined by relative peptide contributions

Singular Vectors of “V” Define Proteins by Presence of Conserved Motifs the individual elements of these vectors indicate contributions of each protein a positive value indicates protein contains the motif a negative value indicates protein has nothing like this motif

Singular Vectors Represent and Define Conserved Motifs motifs are described as correlated peptides (copep motifs) in the “U” matrix copep motifs are NOT REQUIRED to have peptides in extended alignments

Page 13: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 13

DISCUSSION: SVD produces a new descriptive space ...

Matrix A: “old” space with peptide bases

'

copep 1

copep 2

copep k

peptide 1

peptide 2

peptide n

Matrices U & V: “new” space with copep bases

Page 14: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 14

ND3_Sscr:MNIMLTLLTNVTLASLLVLIAFWLPQLNAYSEKTSPYECGFDPMGSARLPFSMKFFLVAITFLLFDLEIALLLPLPWASQTNNLKTMLTMALFLLILLAASLAYEWTQKGLEWAE 2:LLVLIAFWLPQLN 8:FDPMGSA 5:VAITFLL 4:SLAYEW QGGLEWAE:1 LLVL 0.03683679 FDPM 0.04752085 VAIT 0.06136436 SLAY 0.04349018 LVLI 0.04179308 DPMG 0.04346968 AITF 0.06688452 LAYE 0.07069019 VLIA 0.03430444 PMGS 0.04377919 ITFL 0.06649574 AYEW 0.07069019 LIAF 0.04204521 MGSA 0.04145973 TFLL 0.06773809 IAFW 0.06443857 AFWL 0.07452322 FWLP 0.10143487 WLPQ 0.09049735 LPQL 0.05343578 PQLN 0.04795852

ND3_Pdol:MNLLIVMIISTALPIILMLLGFWLPNLNPDNEKVSPYECGFDPLGSARLPFSLKFFLVAILFLLFDLEIAILLPLPWALQYDTPTTAFLIALLILILLTLGLIYEWLQGGLEWAE 0:EKLSPYECGFDPLGSARLPFSMKFFLVAILFLLFDLEIALLLPLPWASQ 3:TLLTLGLIYEWTQ EKLS 0.05915245 TLLT 0.03798160 KLSP 0.06205812 LLTL 0.07589711 LSPY 0.06205812 LTLG 0.05564693 SPYE 0.12096941 TLGL 0.05957197 PYEC 0.13026607 LGLI 0.04020463 YECG 0.13224916 GLIY 0.04413720 ECGF 0.13224916 LIYE 0.04413720 CGFD 0.13080502 IYEW 0.04558135 GFDP 0.13080502 YEWT 0.06092579 FDPL 0.07362187 EWTQ 0.05599396 DPLG 0.06971443 PLGS 0.06871255 LGSA 0.06846775 GSAR 0.11228549 SARL 0.12589498 ARLP 0.12589498 RLPF 0.12456818 LPFS 0.12617765 PFSM 0.06863905 FSMK 0.06474277 SMKF 0.06460109 MKFF 0.06688452 KFFL 0.06847234 FFLV 0.10783106 FLVA 0.10640073 LVAI 0.10613140 VAIL 0.04388013 AILF 0.06119889 ILFL 0.06509992 LFLL 0.07745415 FLLF 0.12525181 LLFD 0.12614454 LFDL 0.12960829 FDLE 0.12960829 DLEI 0.12746527 LEIA 0.12925820 EIAL 0.12133975 IALL 0.13983325 ALLL 0.18434782 LLLP 0.13276736 LLPL 0.10135227 LPLP 0.12080467 PLPW 0.11729774 LPWA 0.08805169 PWAS 0.03825444 WASQ 0.03348659

DISCUSSION: An Example Copep Motif ...

Singular Vector 10 defines copep 10 as square-root

probabilities of tetrapeptides all possible peptides are included

in an exhaustive description only positive values above 0.03

shown here describes conserved elements

within the NADH Dehydrogenase 3 proteins

many peptides appear in aligned sub-motifs (but some do not)

alternative versions of the motif are represented by overlapping sub-motifs

copep motifs are IDEALIZED motifs that may not exist in nature.

Page 15: WHOLE GENOME PHYLOGENIES USING VECTOR REPRESENTATIONS OF PROTEIN SEQUENCES

7/15/01 15

Future Goals

Use a transformation on the input data. normalization of input protein vectors word entropy transformation or “df” transformation improves performance in text-retrieval applications

Are n-grams larger than 4 feasible? 3.2 million possible pentapeptides

Are multiple eukaryotic genomes feasible? millions of genes? networked computers parallel code for SVD (with M. Berry, Tennessee-Knoxville)