mosaic: exploring reticulate protein family evolution

UQ, COMBIO
AU, Brisbane
02-03-09
Maetschke/Kassahn


motivation

the problem
evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification

traditional methods
phylogenetic tree inference gets increasingly complex and is not suitable for non-tree-like relationships
phylogenetic networks are even more complex, and visualization is difficult

mosaic
fast method to analyze and visualize (phylogenetic) sequence relationships
applied to identify and study non-tree-like protein families
aim: perform whole-proteome scans for reticulate proteins


n-grams & dot plots

phylogenetics: alignment is the cornerstone
classical alignment may fail for reticulate proteins
"alignment-free" methods: split the sequence into overlapping subsequences of length n

example sequence: MSKRRMSVGQQTW...
4-grams: MSKR, SKRR, KRRM, RRMS, ...

[figure: n-gram dot plot; schematic of two sequences S1 and S2 whose segments A and B appear in swapped order (A B vs. B A)]
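The n-gram splitting and the dot plot can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `ngrams` and `dot_plot` are hypothetical, and a "dot" is assumed to mean that the n-gram starting at position i of the first sequence equals the n-gram starting at position j of the second.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in sequence order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return all (i, j) where the n-gram at i in seq1 equals the n-gram at j in seq2."""
    # Index the n-gram positions of seq2 so each n-gram of seq1 is looked up once.
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    return [(i, j)
            for i, gram in enumerate(ngrams(seq1, n))
            for j in positions.get(gram, [])]


print(ngrams("MSKRRMSVGQQTW")[:4])  # ['MSKR', 'SKRR', 'KRRM', 'RRMS']

# Two made-up segments in swapped order (A B vs. B A) give two off-diagonal runs.
a, b = "MSKRRM", "VGQQTW"
print(dot_plot(a + b, b + a))
```

With the swapped segments, the dots fall on two diagonals away from the main diagonal, which is the pattern the A B / B A schematic describes.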


some real n-gram dot plots

4-grams are "unique" for a sequence (we talk about '4' later...)

[figure panel: c=10, n=4]

>AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...

[figure panels: c=10, n=4; c=2, n=1]


another n-gram dot plot: nuclear receptors

DBD: DNA binding, two zinc finger motifs
LBD: ligand-binding domain
AF-1/AF-2: transcriptional activation domains

[figure: n-gram dot plot with the DBD and LBD regions labeled]


n-gram sequence similarity s

given two sequences and their n-gram sets $S_1$ and $S_2$:

$s = \frac{|S_1 \cap S_2|}{\min(|S_1|, |S_2|)}$

$|S_1 \cap S_2|$: number of shared n-grams
$S$ = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
$s \in [0...1]$
min in the denominator: local alignment
max in the denominator: global alignment

example: {AAG, AGQ, GQQ} $\cap$ {GQQ, QQQ} = {GQQ}

$s = \frac{1}{\min(3, 2)} = 0.5$
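The similarity can be computed directly from the two n-gram sets. The sketch below assumes the definition given on this slide; the names `ngram_set`, `ngram_similarity` and the `norm` argument are hypothetical and only serve the illustration.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, norm=min):
    """Shared n-grams divided by the smaller (norm=min) or larger (norm=max) set size."""
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0  # a sequence shorter than n has no n-grams
    return len(s1 & s2) / norm(len(s1), len(s2))


# The slide's example with 3-grams: {AAG, AGQ, GQQ} and {GQQ, QQQ} share {GQQ}.
print(ngram_similarity("AAGQQ", "GQQQ", n=3))            # 0.5
print(ngram_similarity("AAGQQ", "GQQQ", n=3, norm=max))  # 1/3
```

Passing `min` or `max` as the normalizer mirrors the local versus global variants named on the slide.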


n-gram similarity

fast: linear w.r.t. size of n-gram sets (classical alignment is quadratic w.r.t. sequence length)

easy to interpret (0.5 = half of the n-grams are shared)

no parameters (gap penalty, gap extension penalty, ...)

can deal with shuffling of conserved segments and other "strange" cases (Are they actually strange?)

better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method)


why 4 and not 42

Hoehl 2008: n = 3...5
correlation between n-gram sequence similarity and species divergence times
standard deviation of sequence similarities
maximum AUC when distinguishing related from randomly shuffled sequences

[figure: MR, r=0.93]

4


phylogenetic networks

different node and edge types
identification of reticulate events (e.g. recombination) is error-prone
computationally expensive
larger networks become messy

T-Rex: Makarenkov et al. 2001
NeighborNet/SplitsTree: Bryant et al. 2004, Huson et al. 1998
Newick: Cardona et al. 2008


larger networks - example

[figures: Huson et al. 2005; Bryant et al. 2004]


graph = ridiculogram

layout dependent
distorted distances
random initialization
local minima
slow

[figure: spring layout of the nuclear receptors, with groups GR, MR, PR and AR labeled]


mosaic plot

point size is similarity
no distortions
no random initialization
preserves full information
automatic clustering (spectral rearrangement)
no hard decision about the number of clusters


spectral clustering

Affinity matrix: $A = (a_{ij}),\quad a_{ij} = e^{-(1 - s_{ij})^2 / 2\sigma^2}$
$s_{ij}$: n-gram similarity between sequences i and j
$\sigma$: defines neighborhood radius

"Degree" matrix: $D_{ii} = \sum_k a_{ik}$

Laplacian matrix: $L = D - A$

eigenvector decomposition: $e, v = \mathrm{eig}(L)$
e: eigenvalues
v: eigenvectors

$v_2$: eigenvector for the 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated

from numpy import exp, diag
from numpy.linalg import eigh
# S: matrix of pairwise n-gram similarities; sig: kernel width (2σ² in the formula above)
A = exp(-(1-S)**2/sig)
D = diag(A.sum(axis=0))
L = D-A
e,v = eigh(L)
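The steps on this slide can be assembled into a runnable sketch that reorders a similarity matrix by its Fiedler vector. Several details are assumptions rather than the authors' code: the function names `fiedler_vector` and `spectral_order` are hypothetical, the kernel is written as exp(-(1-s)²/2σ²), σ defaults to the mean of the distance matrix 1 - S, and "rearrangement" is read as sorting sequences by their Fiedler vector component.

```python
import numpy as np


def fiedler_vector(S, sigma=None):
    """Eigenvector for the 2nd smallest eigenvalue of L = D - A."""
    S = np.asarray(S, dtype=float)
    dist = 1.0 - S
    if sigma is None:
        # Assumed default: mean distance (falls back to 1.0 if all distances are 0).
        sigma = dist.mean() or 1.0
    A = np.exp(-dist ** 2 / (2.0 * sigma ** 2))  # affinity matrix
    D = np.diag(A.sum(axis=0))                   # degree matrix
    L = D - A                                    # graph Laplacian
    e, v = np.linalg.eigh(L)                     # eigenvalues in ascending order
    return v[:, 1]


def spectral_order(S, sigma=None):
    """Order the sequences by their component in the Fiedler vector."""
    return np.argsort(fiedler_vector(S, sigma))


# Toy similarity matrix: sequences 0 and 2 are similar, as are 1 and 3.
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
order = spectral_order(S)
print(order)
print(S[np.ix_(order, order)])  # rows and columns rearranged into two blocks
```

`numpy.linalg.eigh` is used because L is symmetric; it returns eigenvalues in ascending order, so column 1 of `v` is the Fiedler vector. The sign of an eigenvector is arbitrary, so the order may come out reversed.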


spectral rearrangement


recursive spectral rearrangement
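The slides do not spell out the recursion, so the following is only one plausible reading: split the sequences by the sign of the Fiedler vector, then repeat the procedure on each half until the groups are small. The names `recursive_order` and `min_size`, the default `sigma=0.5`, and the stopping rule are all assumptions made for this sketch.

```python
import numpy as np


def fiedler_vector(S, sigma):
    """Fiedler vector of the Laplacian built from similarity matrix S."""
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))
    L = np.diag(A.sum(axis=0)) - A
    _, v = np.linalg.eigh(L)  # eigenvalues in ascending order
    return v[:, 1]


def recursive_order(S, sigma=0.5, min_size=3, idx=None):
    """Return sequence indices ordered by recursive spectral bisection."""
    S = np.asarray(S, dtype=float)
    if idx is None:
        idx = np.arange(len(S))
    # Assumed stopping rule: groups below min_size are left as they are.
    if len(idx) < max(min_size, 2):
        return idx.tolist()
    f = fiedler_vector(S[np.ix_(idx, idx)], sigma)
    left, right = idx[f < 0], idx[f >= 0]
    if len(left) == 0 or len(right) == 0:
        # No split possible: fall back to sorting by the Fiedler vector.
        return idx[np.argsort(f)].tolist()
    return (recursive_order(S, sigma, min_size, left)
            + recursive_order(S, sigma, min_size, right))


# Toy similarity matrix: sequences 0 and 2 are similar, as are 1 and 3.
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
print(recursive_order(S))
```

Each recursive call recomputes the Laplacian on the sub-matrix only, so later splits are driven by the local structure of a group rather than by the whole data set.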


spectral clustering

takes "global" properties into account
fast and scales well
no random initialization => single run
global minimum => single, unique solution
few parameters: L, σ (σ <= mean of distance matrix)
"better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem)
clustering is NP-hard and spectral clustering is "just another approximation"
recursive spectral clustering to improve cluster quality


mosaic - demo


the end

fast technique to visualize/analyze reticulate protein family evolution

matrix representation
spectral clustering
n-gram similarity
many other applications

Perl free!


questions

?


SCOP

SCOP: five families, randomly selected


Nuclear receptors

[figure labels: ligand-binding domain, N-terminal section, zinc-finger domain]


mosaic - examples


Full length sequence:

[figure labels: GR, MR, PR, AR]

MrBayes v3.1.2, $10^6$ generations, 4 chains, 240 CPU-hrs


Zinc finger domain

[figure labels: AR, GR, MR, PR]

MrBayes v3.1.2, $10^6$ generations, 4 chains, 9 CPU-hrs


Ligand-binding domain

[figure labels: PR, AR, MR, GR]

MrBayes v3.1.2, $10^6$ generations, 4 chains, 27 CPU-hrs


Upstream region

?

MrBayes v3.1.2, $10^6$ generations, 4 chains, 87 CPU-hrs


quality q

given two sequences and their n-gram dot plot:

$q = \frac{\max(\mathrm{diag}(S_1, S_2))}{\min(n_1, n_2)}$

diag = set of dot sums along diagonals
n = length of sequence
$q \in [0...1]$
min in the denominator: local alignment
max in the denominator: global alignment

example:

$q = \frac{\max\{4, 2, 1, 0\}}{\min(6, 8)} = \frac{4}{6} \approx 0.67$
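A short sketch of how q could be computed, following the definition on this slide: count the dots on every diagonal of the n-gram dot plot, take the largest count, and divide by the shorter (or longer) sequence length. The function name `quality` and the `norm` argument are hypothetical, and a dot is assumed to mark equal n-grams at positions i and j.

```python
from collections import Counter


def quality(seq1, seq2, n=4, norm=min):
    """Largest dot sum along any diagonal, divided by norm of the sequence lengths."""
    # Index the n-gram positions of seq2.
    positions = {}
    for j in range(len(seq2) - n + 1):
        positions.setdefault(seq2[j:j + n], []).append(j)
    # All dots (i, j) with the same offset j - i lie on the same diagonal.
    diag = Counter()
    for i in range(len(seq1) - n + 1):
        for j in positions.get(seq1[i:i + n], []):
            diag[j - i] += 1
    if not diag:
        return 0.0
    return max(diag.values()) / norm(len(seq1), len(seq2))


print(quality("ABCDEF", "ABCDEF", n=2))            # 5 dots on the main diagonal / 6
print(quality("MSKRRMVGQQTW", "VGQQTWMSKRRM"))     # 0.25
```

In the second call the two made-up segments are swapped, so the dots are spread over two diagonals of three dots each and q drops to 3/12, although most n-grams are shared.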


q over s


q-spectrum


n-gram dot plots