Mosaic exploring reticulate protein family evolution UQ, COMBIO AU, Brisbane 02-03-09...


mosaic: exploring reticulate protein family evolution

UQ, COMBIO AU, Brisbane, 02-03-09, Maetschke/Kassahn

motivation

the problem
evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification

traditional methods
phylogenetic tree inference gets increasingly complex and is not suitable for reticulate relationships
phylogenetic networks are even more complex, and visualization is difficult

mosaic
fast method to analyze and visualize (phylogenetic) sequence relationships
applied to identify and study non-tree-like protein families
aim: perform whole-proteome scans for reticulate proteins

n-grams & dot plots

phylogenetics: alignment is the cornerstone
classical alignment may fail for reticulate proteins
"alignment-free" methods: split the sequence into overlapping subsequences of length n

example sequence: MSKRRMSVGQQTW...
4-grams: MSKR, SKRR, KRRM, RRMS, ...

[figure: n-gram dot plot; axis labels M S K R R M Q Q V T Q, MSKRRM, KRRM; sequences S1 and S2 with segments A B and B A]
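The n-gram splitting and the dot plot described above can be sketched in a few lines of Python. This is only an illustration, not the code behind mosaic: the function names `ngrams` and `dot_plot` are made up for this example, and a "dot" is assumed to mean an exact match between two n-grams.

```python
def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, ordered by start position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def dot_plot(seq1, seq2, n=4):
    """Return the set of (i, j) pairs where the n-gram starting at position i
    of seq1 is identical to the n-gram starting at position j of seq2."""
    # Index the n-grams of seq2 by content so every lookup is a dict access.
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    return {(i, j)
            for i, gram in enumerate(ngrams(seq1, n))
            for j in positions.get(gram, [])}


# First four 4-grams of the example sequence from the slide.
print(ngrams("MSKRRMSVGQQTW")[:4])
# Dots for two short sequences that share one 4-gram.
print(dot_plot("MSKRRM", "KRRM"))
```

Storing only the matching positions (instead of a full matrix) keeps the sketch small; a plotting library could draw one point per pair.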

some real n-gram dot plots

4-grams are "unique" for a sequence (we talk about '4' later...)

[figure: n-gram dot plot, c=10, n=4]

>AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...

[figures: n-gram dot plots, c=10, n=4 and c=2, n=1]

another n-gram dot plot: nuclear receptors

DBD: DNA-binding domain, two zinc finger motifs
LBD: ligand-binding domain
AF-1/AF-2: transcriptional activation domains

[figure: n-gram dot plot with the DBD and LBD regions marked]

n-gram sequence similarity s

given two sequences and their n-gram sets S_1 and S_2:

$$ s = \frac{|S_1 \cap S_2|}{\min(|S_1|, |S_2|)}, \qquad s \in [0, 1] $$

|S_1 ∩ S_2|: number of shared n-grams
S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
normalization: min corresponds to local alignment, max (in place of min) to global alignment

example: {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}

$$ s = \frac{1}{\min(3, 2)} = 0.5 $$
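A minimal Python sketch of the similarity s, assuming n-grams are compared as plain sets. The names `ngram_set` and `ngram_similarity` and the `norm` argument are hypothetical; `norm=min` gives the local-alignment-like variant and `norm=max` the global one.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(seq1, seq2, n=4, norm=min):
    """Shared n-grams divided by the smaller (norm=min) or larger (norm=max)
    n-gram set size; returns 0.0 if a sequence is shorter than n."""
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    denom = norm(len(s1), len(s2))
    if denom == 0:
        return 0.0
    return len(s1 & s2) / denom


# The 3-gram example from the slide: {AAG, AGQ, GQQ} and {GQQ, QQQ}.
print(ngram_similarity("AAGQQ", "GQQQ", n=3))
```

Building the sets and intersecting them is linear in the number of n-grams, which is where the speed advantage over quadratic alignment comes from.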

n-gram similarity

fast: linear w.r.t. size of n-gram sets (classical alignment is quadratic w.r.t. sequence length)
easy to interpret (0.5 = half of the n-grams are shared)
no parameters (gap penalty, gap extension penalty, ...)
can deal with shuffling of conserved segments and other "strange" cases (are they actually strange?)
better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free methods can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method)

why 4 and not 42

Hoehl 2008: n = 3...5
correlation between n-gram sequence similarity and species divergence times
standard deviation of sequence similarities
maximum AUC when distinguishing related and randomly shuffled sequences

[figure labels: MR, r=0.93; 4]

phylogenetic networks

different node and edge types
identification of reticulate events (e.g. recombination) is error-prone
computationally expensive
larger networks become messy

T-Rex: Makarenkov et al. 2001
NeighborNet/SplitsTree: Bryant et al. 2004, Huson et al. 1998
Newick: Cardona et al. 2008

larger networks - example

[figures: Huson et al. 2005; Bryant et al. 2004]

graph = ridiculogram

layout dependent
distorted distances
random initialization
local minima
slow

[figure: nuclear receptors (GR, MR, PR, AR), spring layout]

mosaic plot

point size is similarity
no distortions
no random initialization
preserves full information
automatic clustering (spectral rearrangement)
no hard decision about number of clusters
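The matrix representation behind a mosaic plot can be illustrated with a pairwise similarity matrix. The sketch below is an assumption-laden stand-in, not the mosaic tool itself: `similarity_matrix` and `render` are hypothetical names, and text glyphs of increasing weight stand in for point size.

```python
def ngram_set(seq, n=4):
    """Return the set of overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def similarity(seq1, seq2, n=4):
    """Shared n-grams divided by the size of the smaller n-gram set."""
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    denom = min(len(s1), len(s2))
    return len(s1 & s2) / denom if denom else 0.0


def similarity_matrix(seqs, n=4):
    """Full pairwise matrix: every similarity is kept, none is distorted."""
    return [[similarity(a, b, n) for b in seqs] for a in seqs]


def render(matrix, glyphs=" .oO@"):
    """Text stand-in for a mosaic plot: heavier glyph = higher similarity."""
    top = len(glyphs) - 1
    return "\n".join(
        " ".join(glyphs[min(int(s * len(glyphs)), top)] for s in row)
        for row in matrix)


# Toy sequences invented for this example (not real proteins).
seqs = ["MSKRRMSVGQ", "MSKRRMAAGQ", "QQVTQWLLNP"]
print(render(similarity_matrix(seqs)))
```

Unlike a spring layout, nothing here depends on random initialization: the same input always yields the same picture.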

spectral clustering

Affinity matrix A (s_ij: n-gram similarity between sequences i and j; σ: defines the neighborhood radius):

$$ a_{ij} = e^{-(1 - s_{ij})^2 / 2\sigma^2} $$

"Degree" matrix D:

$$ d_{ii} = \sum_k a_{ik} $$

Laplacian matrix L:

$$ L = D - A $$

eigenvector decomposition (e: eigenvalues, v: eigenvectors):

$$ (e, v) = \mathrm{eig}(L) $$

v_2: eigenvector for the 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated

code (S: matrix of n-gram similarities; sig plays the role of 2σ^2):

    from numpy import exp, diag
    from numpy.linalg import eigh

    A = exp(-(1-S)**2/sig)    # affinity matrix
    D = diag(A.sum(axis=0))   # "degree" matrix
    L = D-A                   # Laplacian matrix
    e,v = eigh(L)             # eigenvalues e (ascending), eigenvectors v (columns)
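The step from the Fiedler vector to a rearranged matrix can be sketched as follows. This is a hedged illustration that assumes "spectral rearrangement" means sorting sequences by their Fiedler vector entries; the function name `fiedler_order`, the default `sigma=0.5` and the toy similarity matrix are invented for the example.

```python
import numpy as np


def fiedler_order(S, sigma=0.5):
    """Return (order, fiedler): the permutation that sorts the sequences by
    their entry in the Fiedler vector, and the Fiedler vector itself."""
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))  # affinity matrix
    D = np.diag(A.sum(axis=1))                        # "degree" matrix
    L = D - A                                         # Laplacian matrix
    e, v = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    fiedler = v[:, 1]          # eigenvector of the 2nd smallest eigenvalue
    return np.argsort(fiedler), fiedler


# Toy similarities: sequences 0 and 2 are alike, as are 1 and 3.
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
order, fiedler = fiedler_order(S)
rearranged = S[np.ix_(order, order)]  # rows and columns permuted together
print(order)
print(rearranged)
```

After the permutation, similar sequences sit next to each other, so clusters show up as blocks along the diagonal of the matrix.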

spectral rearrangement

recursive spectral rearrangement
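The slide gives no detail on the recursion, so the following is only one plausible reading: split the sequences by the sign of the Fiedler vector and rearrange each half again. The names `fiedler_vector` and `recursive_order`, and the parameters `sigma` and `min_size`, are assumptions made for this sketch.

```python
import numpy as np


def fiedler_vector(S, sigma):
    """Eigenvector of the 2nd smallest Laplacian eigenvalue for similarities S."""
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))
    L = np.diag(A.sum(axis=1)) - A
    _, v = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    return v[:, 1]


def recursive_order(S, idx=None, sigma=0.5, min_size=3):
    """Order sequences by recursively bisecting on the sign of the Fiedler vector."""
    S = np.asarray(S, dtype=float)
    if idx is None:
        idx = np.arange(S.shape[0])
    if len(idx) < min_size:
        return idx.tolist()
    f = fiedler_vector(S[np.ix_(idx, idx)], sigma)
    sort = np.argsort(f)
    idx, f = idx[sort], f[sort]
    left, right = idx[f < 0], idx[f >= 0]
    if len(left) == 0 or len(right) == 0:
        return idx.tolist()  # no split possible: keep the sorted order
    return (recursive_order(S, left, sigma, min_size)
            + recursive_order(S, right, sigma, min_size))


# Toy similarities: even-numbered and odd-numbered sequences form two clusters.
S = np.full((6, 6), 0.1)
for group in ([0, 2, 4], [1, 3, 5]):
    S[np.ix_(group, group)] = 0.9
np.fill_diagonal(S, 1.0)
order = recursive_order(S)
print(order)
```

Each call works on a strictly smaller index set, so the recursion always terminates.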

spectral clustering

takes "global" properties into account
fast and scales well
no random initialization => single run
global minimum => single, unique solution
few parameters: L, σ (σ <= mean of distance matrix)
"better" than k-means (works for non-spherical clusters) or single-linkage hierarchical clustering (no chaining problem)
clustering is NP-hard and spectral clustering is "just another approximation"
recursive spectral clustering to improve cluster quality

mosaic - demo

the end

fast technique to visualize/analyze reticulate protein family evolution
matrix representation
spectral clustering
n-gram similarity
many other applications
Perl
free!

questions?

SCOP: five families randomly selected

Nuclear receptors: ligand-binding domain, N-terminal section, zinc-finger domain

mosaic - examples

Full length sequence:
[figure labels: GR, MR, PR, AR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 240 CPU-hrs

Zinc finger domain
[figure labels: AR, GR, MR, PR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 9 CPU-hrs

Ligand-binding domain
[figure labels: PR, AR, MR, GR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 27 CPU-hrs

Upstream region
[figure label: ?]
MrBayes v3.1.2, 10^6 generations, 4 chains, 87 CPU-hrs

quality q

given two sequences and their n-gram dot plot:

$$ q = \frac{\max(\mathrm{diag}_{S_1, S_2})}{\min(n_1, n_2)}, \qquad q \in [0, 1] $$

diag = set of dot sums along diagonals
n = length of sequence
normalization: min corresponds to local alignment, max (in place of min) to global alignment

example:

$$ q = \frac{\max\{4, 2, 1, 0\}}{\min(6, 8)} = \frac{4}{6} \approx 0.67 $$
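The quality q can be sketched by counting dots per diagonal of the n-gram dot plot. This is an illustration under the definitions on the slide (largest diagonal dot sum divided by the shorter sequence length); the names `diagonal_sums` and `quality` and the `norm` argument are hypothetical.

```python
from collections import Counter


def ngrams(seq, n):
    """Return the overlapping n-grams of seq, ordered by start position."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def diagonal_sums(seq1, seq2, n=4):
    """Count dot-plot matches per diagonal; a diagonal is the offset j - i."""
    positions = {}
    for j, gram in enumerate(ngrams(seq2, n)):
        positions.setdefault(gram, []).append(j)
    sums = Counter()
    for i, gram in enumerate(ngrams(seq1, n)):
        for j in positions.get(gram, []):
            sums[j - i] += 1
    return sums


def quality(seq1, seq2, n=4, norm=min):
    """Largest diagonal dot sum divided by the shorter (norm=min) or
    longer (norm=max) sequence length; 0.0 if there are no dots."""
    sums = diagonal_sums(seq1, seq2, n)
    denom = norm(len(seq1), len(seq2))
    if not sums or denom == 0:
        return 0.0
    return max(sums.values()) / denom


# The shorter sequence occurs unchanged inside the longer one.
print(quality("MSKRRM", "QQMSKRRM", n=1))
print(quality("MSKRRM", "QQMSKRRM", n=1, norm=max))
```

A high q means most matches line up on a single diagonal, whereas shuffled segments spread their dots over several diagonals and lower q even when s stays high.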

q over s

q-spectrum

n-gram dot plots
