Machine Learning for Personalized Medicine
Jean-Philippe Vert
MascotNum 2014, ETH Zurich, April 24, 2014



Page 1:

Machine Learning for Personalized Medicine

Jean-Philippe Vert

MascotNum 2014, ETH Zurich, April 24, 2014

Page 2:

Complexity of life

1 body = 10^14 cells. 1 cell = 6 × 10^9 ACGT coding for 20,000 genes.

Page 3:

Sequencing revolution

Page 4:

A flood of omics data

Genome

Interactome

Mutations, structural variations

[Figure: genome-wide DNA copy-number profile (log-ratio vs. genomic position, chromosomes 1–22, X, Y), with points labeled breakpoint/normal and outlier/inlier, and status calls (trisomy, normal, monosomy, loss). The smoothing parameter is chosen to maximize agreement with breakpoint annotations, in order to learn breakpoints on other chromosomes.]

Epigenome

Phenome

Transcriptome

Publications

Page 5:

Cancer

Page 6:

A cancer cell

Page 7:

A cancer cell

Page 8:

A cancer cell

Page 9:

Opportunities

- What is your risk of developing a cancer? (prevention)
- After diagnosis and treatment, what is the risk of relapse? (prognosis)
- What specific treatment will cure your cancer? (personalized medicine)

Page 10:

Outline

1 Learning molecular classifiers with network information

2 Kernel bilinear regression for toxicogenomics


Page 12:

Breast cancer prognosis

Page 13:

Learning with regularization

Given a training set (x_i, y_i), i = 1, ..., n, with x_i ∈ R^p (typically n = 200, p = 20,000), we estimate a linear predictor

    f_β(x) = β^T x

by solving

    min_{β ∈ R^p} R(β) + λ Ω(β)

where:
- R(β) is a convex empirical risk, typically R(β) = (1/n) ∑_{i=1}^n ℓ(β^T x_i, y_i) for some loss function ℓ (squared error, logistic loss, hinge loss, ...);
- Ω(β) is a regularization term, typically ‖β‖² (ridge regression, SVM, ...) or ‖β‖_1 (lasso, ...).
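As a concrete anchor, here is a minimal scikit-learn sketch of the two classical choices of Ω on synthetic expression-like data; the dimensions, noise level and regularization strengths are illustrative assumptions, not the study's settings:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p = 200, 2000                      # scaled-down stand-in for p = 20,000
X = rng.standard_normal((n, p))       # surrogate expression matrix
beta_true = np.zeros(p)
beta_true[:10] = 1.0                  # only 10 informative genes
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)                 # Omega = ||beta||^2
lasso = Lasso(alpha=0.1, max_iter=5000).fit(X, y)  # Omega = ||beta||_1
print("nonzero coefficients:", np.sum(ridge.coef_ != 0), "(ridge) vs",
      np.sum(lasso.coef_ != 0), "(lasso)")
```

The ℓ1 penalty drives most coefficients exactly to zero, which is what makes the lasso a gene-selection method, while ridge keeps all p coefficients small but nonzero.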

Page 14:

Gene selection, molecular signature

The idea: we look for a limited set of genes that are sufficient for prediction. Selected genes should inform us about the underlying biology.

Page 15:

Lack of stability of signatures

[Scatter plot: stability vs. AUC (0.56–0.66) of gene signatures selected by Random, T-test, Entropy, Bhattacharyya, Wilcoxon, RFE, GFS, Lasso and E-Net, each in single-run, ensemble-mean, ensemble-exp and ensemble-ss variants.]

Haury et al. (2011)

Page 16:

Gene networks

[KEGG global gene network. Annotated modules: glycan biosynthesis; protein kinases; DNA and RNA polymerase subunits; glycolysis/gluconeogenesis; sulfur metabolism; porphyrin and chlorophyll metabolism; riboflavin metabolism; folate biosynthesis; biosynthesis of steroids, ergosterol metabolism; lysine biosynthesis; phenylalanine, tyrosine and tryptophan biosynthesis; purine metabolism; oxidative phosphorylation, TCA cycle; nitrogen, asparagine metabolism.]

Page 17:

Gene networks and expression data

Motivation: basic biological functions usually involve the coordinated action of several proteins:
- formation of protein complexes
- activation of metabolic, signalling or regulatory pathways

Many pathways and protein-protein interactions are already known.
Hypothesis: the weights of the classifier should be "coherent" with respect to this prior knowledge.

Page 18:

Graph based penalty

    f_β(x) = β^T x,    min_β R(f_β) + λ Ω(β)

Prior hypothesis: genes near each other on the graph should have similar weights.

An idea (Rapaport et al., 2007):

    Ω(β) = ∑_{i∼j} (β_i − β_j)²,

leading to

    min_{β ∈ R^p} R(f_β) + λ ∑_{i∼j} (β_i − β_j)².
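This penalty is the standard Laplacian quadratic form Ω(β) = β^T L β, with L the graph Laplacian defined on the next slide. A minimal numpy check of that identity, on the same 5-node example graph used below (0-indexed nodes; the test vector β is arbitrary):

```python
import numpy as np

edges = [(0, 2), (1, 2), (2, 3), (3, 4)]      # example graph, 0-indexed
p = 5
A = np.zeros((p, p))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian L = D - A

beta = np.array([1.0, -0.5, 0.2, 0.0, 0.3])
penalty_sum = sum((beta[i] - beta[j]) ** 2 for i, j in edges)
penalty_quad = beta @ L @ beta                 # same value, as a quadratic form
assert np.isclose(penalty_sum, penalty_quad)
```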


Page 20:

Graph Laplacian

Definition: the Laplacian of the graph is the matrix L = D − A, where D is the diagonal degree matrix and A the adjacency matrix.

Example (5 nodes, edges 1−3, 2−3, 3−4, 4−5):

    L = D − A =
    [  1   0  −1   0   0 ]
    [  0   1  −1   0   0 ]
    [ −1  −1   3  −1   0 ]
    [  0   0  −1   2  −1 ]
    [  0   0   0  −1   1 ]
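A few lines of numpy reproduce this Laplacian and its pseudo-inverse, which is the matrix L* shown on the "Example" slide below:

```python
import numpy as np

A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4)]:  # edges 1-3, 2-3, 3-4, 4-5, 0-indexed
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
print(L.astype(int))                   # diagonal (1, 1, 3, 2, 1), -1 on each edge
print(np.round(np.linalg.pinv(L), 2))  # the matrix L* of the Example slide
```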

Page 21:

Spectral penalty as a kernel

Theorem: the function f(x) = β̂^T x, where β̂ solves

    min_{β ∈ R^p} (1/n) ∑_{i=1}^n ℓ(β^T x_i, y_i) + λ ∑_{i∼j} (β_i − β_j)²,

is equal to g(x) = γ̂^T Φ(x), where γ̂ solves

    min_{γ ∈ R^p} (1/n) ∑_{i=1}^n ℓ(γ^T Φ(x_i), y_i) + λ γ^T γ,

and where

    Φ(x)^T Φ(x′) = x^T K_G x′

for K_G = L*, the pseudo-inverse of the graph Laplacian.
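One way to realize this feature map in practice is to take Φ as the symmetric square root of L*; a sketch, assuming the columns of the data matrix X are indexed by the graph nodes (the subtlety on the null space of L, where the penalty vanishes, is noted in the comment):

```python
import numpy as np
from scipy.linalg import sqrtm

def graph_feature_map(X, L):
    """Rows of X are profiles x^T; returns rows Phi(x)^T with
    Phi(x) = (L*)^(1/2) x, so that Phi(x)^T Phi(x') = x^T L* x'."""
    Phi = np.real(sqrtm(np.linalg.pinv(L)))   # symmetric square root of L*
    return X @ Phi                            # Phi symmetric => row-wise map

# Usage: fit an ordinary ridge model (penalty lam * ||gamma||^2) on
# graph_feature_map(X, L); on the range of L this reproduces the graph penalty.
```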

Page 22:

Example

For the same 5-node graph:

    L* =
    [  0.88  −0.12   0.08  −0.32  −0.52 ]
    [ −0.12   0.88   0.08  −0.32  −0.52 ]
    [  0.08   0.08   0.28  −0.12  −0.32 ]
    [ −0.32  −0.32  −0.12   0.48   0.28 ]
    [ −0.52  −0.52  −0.32   0.28   1.08 ]

Page 23:

Classifiers (Rapaport et al.)

[Fig. 4 network figure; module annotations as on the "Gene networks" slide above.]

Fig. 4. Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, with the activation of low-frequency eigenmodes being determined by microarray data. Positive coefficients are marked in red, negative coefficients in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the glycolysis/gluconeogenesis KEGG pathway. Some other parts of the network are annotated, including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase subunits.

5 DISCUSSION

Our algorithm groups predictor variables according to highly connected "modules" of the global gene network. We assume that the genes within a tightly connected network module are likely to contribute similarly to the prediction function because of the interactions between the genes. This motivates the filtering of the gene expression profile to remove the noisy high-frequency modes of the network.

Such grouping of variables is a very useful feature of the resulting classification function because the function becomes meaningful for interpreting and suggesting biological factors that cause the class separation. This allows classifications based on functions, pathways and network modules rather than on individual genes. This can lead to a more robust behaviour of the classifier in independent tests and to equal if not better classification results. Our results on the dataset we analysed show only a slight improvement, although this may be due to its limited size. Therefore we are currently extending our work to larger data sets.

An important remark to bear in mind when analyzing pictures such as Figs. 4 and 5 is that the colors represent the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has this property. If more samples were available, a better non-smooth classifier might be learned by the algorithm, but constraining the smoothness of the classifier is a way to reduce the complexity of the learning problem when a limited number of samples are available. This means in particular that the pictures provide virtually no information regarding the over-

Page 24:

Classifier — Spectral analysis of gene expression profiles using gene networks

Fig. 5. The glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The pathways are mutually exclusive in a cell, as clearly highlighted by our algorithm.

or under-expression of individual genes, which is the cost to pay to obtain instead an interpretation in terms of more global pathways. Constraining the classifier to rely on just a few genes would have a similar effect of reducing the complexity of the problem, but would lead to a more difficult interpretation in terms of pathways.

An advantage of our approach over other pathway-based clustering methods is that we consider the network modules that naturally appear from spectral analysis rather than a historically defined separation of the network into pathways. Thus, pathway cross-talk is taken into account, which is difficult to do using other approaches. It can however be noticed that the implicit decomposition into pathways that we obtain is biased by the very incomplete knowledge of the network, and that certain regions of the network are better understood, leading to a higher connection concentration.

Like most approaches aiming at comparing expression data with gene networks such as KEGG, the scope of this work is limited by two important constraints. First, the gene network we use is only a convenient but rough approximation to describe complex biochemical processes; second, the transcriptional analysis of a sample cannot give any information regarding post-transcriptional regulation and modifications. Nevertheless, we believe that our basic assumptions remain valid, in that we assume that the expression of the genes belonging to the same metabolic pathway module are coordinately regulated. Our interpretation of the results supports this assumption.

Another important caveat is that we simplify the network description as an undirected graph of interactions. Although this would seem to be relevant for simplifying the description of metabolic networks, real gene regulation networks are influenced by the direction, sign and importance of the interaction. Although the incorporation of weights into the Laplacian (equation 1) is straightforward and allows the extension of the approach to weighted undirected graphs, the incorporation of directions and signs to represent signalling or regulatory pathways requires more work but could lead to important advances for the interpretation of microarray data in cancer studies, for example.

ACKNOWLEDGMENTS

This work was supported by the grant ACI-IMPBIO-2004-47 of the French Ministry for Research and New Technologies.

Page 25:

Other penalties with kernels

Φ(x)>Φ(x ′) = x>KGx ′

with:KG = (c + L)−1 leads to

Ω(β) = cp∑

i=1

β2i +

i∼j

(βi − βj

)2.

The diffusion kernel:

KG = expM(−2tL) .

penalizes high frequencies of β in the Fourier domain.
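A minimal sketch of the diffusion kernel using scipy's matrix exponential (the helper name and default t are mine):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(L, t=1.0):
    """K_G = exp(-2 t L), the matrix exponential of the graph Laplacian."""
    return expm(-2.0 * t * L)

# An eigenvector of L with eigenvalue mu (a graph "frequency") is rescaled by
# exp(-2 t mu) in K_G, so high-frequency components of beta are strongly
# penalized in the induced norm.
```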

Page 26:

Other penalties without kernels

Gene selection + piecewise constant on the graph:

    Ω(β) = ∑_{i∼j} |β_i − β_j| + ∑_{i=1}^p |β_i|

Gene selection + smooth on the graph:

    Ω(β) = ∑_{i∼j} (β_i − β_j)² + ∑_{i=1}^p |β_i|
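As plain functions of β and an edge list, the two penalties read as follows (a sketch; function names are mine):

```python
import numpy as np

def fused_graph_penalty(beta, edges):
    """Gene selection + piecewise constant: sum_{i~j} |b_i - b_j| + ||b||_1."""
    return sum(abs(beta[i] - beta[j]) for i, j in edges) + np.abs(beta).sum()

def smooth_graph_penalty(beta, edges):
    """Gene selection + smooth: sum_{i~j} (b_i - b_j)^2 + ||b||_1."""
    return sum((beta[i] - beta[j]) ** 2 for i, j in edges) + np.abs(beta).sum()
```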

Page 27:

Graph lasso

Two solutions:

    Ω_intersection(β) = ∑_{i∼j} √(β_i² + β_j²),

    Ω_union(β) = sup { α^T β : α ∈ R^p, ∀ i∼j, √(α_i² + α_j²) ≤ 1 }.

Page 28:

Generalization: Group lasso with overlapping groups

    Ω^G_latent(w) := min_v ∑_{g∈G} ‖v_g‖_2   such that   w = ∑_{g∈G} v_g  and  supp(v_g) ⊆ g.

Properties:
- The resulting support is a union of groups in G.
- It is possible to select one variable without selecting all the groups containing it.
- Equivalent to the group lasso when there is no overlap.

A toy solver is sketched below.
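One standard way to compute this is the variable-duplication trick: give each group its own copy of the variables it covers, solve a plain group lasso on the duplicated design, and fold the copies back as w = ∑_g v_g. A toy proximal-gradient sketch (solver details, step size and defaults are mine, not from the talk):

```python
import numpy as np

def latent_group_lasso(X, y, groups, lam=0.1, n_iter=500):
    """Toy latent group lasso via column duplication + proximal gradient."""
    n, p = X.shape
    cols = [j for g in groups for j in g]
    X_dup = X[:, cols]                               # duplicated design matrix
    starts = np.cumsum([0] + [len(g) for g in groups])
    v = np.zeros(X_dup.shape[1])
    step = n / np.linalg.norm(X_dup, 2) ** 2         # 1 / Lipschitz constant
    for _ in range(n_iter):
        v -= step * X_dup.T @ (X_dup @ v - y) / n    # gradient step on the MSE
        for k in range(len(groups)):                 # prox: group soft-thresholding
            s, e = starts[k], starts[k + 1]
            nrm = np.linalg.norm(v[s:e])
            if nrm <= step * lam:
                v[s:e] = 0.0
            else:
                v[s:e] *= 1.0 - step * lam / nrm
    w = np.zeros(p)
    for k, g in enumerate(groups):                   # fold back: w = sum_g v_g
        w[list(g)] += v[starts[k]:starts[k + 1]]
    return w
```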

Page 29:

Theoretical results

Consistency in group support (Jacob et al., 2009): let w̄ be the true parameter vector. Assume that there exists a unique decomposition (v̄_g) such that w̄ = ∑_g v̄_g and Ω^G_latent(w̄) = ∑_g ‖v̄_g‖_2. Consider the regularized empirical risk minimization problem

    min_w L(w) + λ Ω^G_latent(w).

Then, under appropriate mutual incoherence conditions on X, as n → ∞, with very high probability, the optimal solution ŵ admits a unique decomposition (v̂_g)_{g∈G} such that

    { g ∈ G : v̂_g ≠ 0 } = { g ∈ G : v̄_g ≠ 0 }.


Page 31:

Experiments

Synthetic data with overlapping groups: 10 groups of 10 variables, with 2 variables of overlap between two successive groups: {1, ..., 10}, {9, ..., 18}, ..., {73, ..., 82}. The support is the union of the 4th and 5th groups. We learn from 100 training points.

[Figure: frequency of selection of each variable with the lasso (left) and with Ω^G_latent (middle); comparison of the RMSE of both methods (right).]

Page 32:

Lasso signature (accuracy 0.61)

Page 33:

Graph Lasso signature (accuracy 0.64)

Page 34:

Outline

1 Learning molecular classifiers with network information

2 Kernel bilinear regression for toxicogenomics

Page 35:

Pharmacogenomics / Toxicogenomics

Page 36:

DREAM8 Toxicogenetics challenge

Genotypes from the 1000 Genomes Project; RNA-Seq from the Geuvadis project.

Page 37:

Bilinear regression

Cell line X, chemical Y, toxicity Z. Bilinear regression model:

    Z = f(X, Y) + b(Y) + ε,

estimated by kernel ridge regression:

    min_{f∈H, b∈R^p} ∑_{i=1}^n ∑_{j=1}^p ( f(x_i, y_j) + b_j − z_ij )² + λ ‖f‖².

Page 38:

Solving in O(max(n, p)³)

2 The kernel bilinear regression model

Let X and Y denote abstract vector spaces used to represent, respectively, cell lines and chemicals. For example, if each cell line is characterized by a measure of d genetic markers, then we may take X = R^d to represent each cell line as a vector of markers; but to keep generality we will simply assume that X and Y are endowed with positive definite kernels, respectively K_X and K_Y. Given a set of n cell lines x_1, ..., x_n ∈ X and p chemicals y_1, ..., y_p ∈ Y, we assume that a quantitative measure of toxicity response z_ij ∈ R has been measured when cell line x_i is exposed to chemical y_j, for i = 1, ..., n and j = 1, ..., p. Our goal is to estimate, from this data, a function h : X × Y → R to predict the response h(x, y) if a cell line x is exposed to a chemical y.

We propose to model the response with a simple bilinear regression model of the form:

    Z = f(X, Y) + b(Y) + ε,    (1)

where f is a bilinear function, b is a chemical-specific bias term and ε is some Gaussian noise. We add the chemical-specific bias term to adjust for the large differences in absolute toxicity response values between chemicals, while the bilinear term f(X, Y) can capture some patterns of variation between cell lines shared by different chemicals. We will only focus on the problem of predicting the action of known and tested chemicals on new cell lines, meaning that we will not try to estimate b(Y) on new cell lines.

If x and y are finite-dimensional vectors, then the bilinear term f(x, y) has the simple form x^T M y for some matrix M, with Frobenius norm ‖M‖² = Tr(M^T M). The natural generalization of this bilinear model to possibly infinite-dimensional spaces X and Y is to consider a function f in the product reproducing kernel Hilbert space H associated to the product kernel K_X K_Y, with Hilbert-Schmidt norm ‖f‖². To estimate model (1), we solve a standard ridge regression problem:

    min_{f∈H, b∈R^p} ∑_{i=1}^n ∑_{j=1}^p ( f(x_i, y_j) + b_j − z_ij )² + λ ‖f‖²,    (2)

where λ is a regularization parameter to be optimized. As shown in the next theorem, (2) has an analytical solution. Note that 1_n refers to the n-dimensional vector of ones, Diag(u) for a vector u ∈ R^n refers to the n × n diagonal matrix whose diagonal is u, and A ⊙ B for two matrices of the same size refers to their Hadamard (or entrywise) product.

Theorem 1. Let Z ∈ R^{n×p} be the response matrix, and K_X ∈ R^{n×n} and K_Y ∈ R^{p×p} be the kernel Gram matrices of the n cell lines and p chemicals, with respective eigenvalue decompositions K_X = U_X D_X U_X^T and K_Y = U_Y D_Y U_Y^T. Let δ = U_X^T 1_n and S ∈ R^{n×p} be defined by S_ij = 1 / (λ + D_X^i D_Y^j), where D_X^i (resp. D_Y^j) denotes the i-th (resp. j-th) diagonal term of D_X (resp. D_Y). Then the solution (f̂, b̂) of (2) is given by

    b̂ = U_Y Diag(S^T δ^∘2)^{-1} ( S^T ⊙ (U_Y^T Z^T U_X) ) δ    (3)

(δ^∘2 denoting the entrywise square of δ) and

    ∀ (x, y) ∈ X × Y,  f̂(x, y) = ∑_{i=1}^n ∑_{j=1}^p α_ij K_X(x_i, x) K_Y(y_j, y),    (4)

where

    α = U_X ( S ⊙ (U_X^T (Z − 1_n b̂^T) U_Y) ) U_Y^T.    (5)


Page 39:

Kernel Trick

cell line descriptors

drug descriptors


Page 40:

Kernel Trick

cell line descriptors

drug descriptors

Kcell (kernelized)
Kdrug (kernelized)

Page 41:

Kernel Trick

cell line descriptors

drug descriptors

Kcell (kernelized)
Kdrug (kernelized)
kernel bilinear regression f

Page 42:

Kernel Trick

cell line descriptors

drug descriptors

Kcell (kernelized)
Kdrug (kernelized)
kernel bilinear regression f

Kernel choice?
- descriptors
- data integration
- missing data

Page 43:

Kernel choice

1. Kcell ⇒ 29 cell-line kernels tested ⇒ one kernel that integrates all information ⇒ deals with missing data
2. Kdrug ⇒ 48 drug kernels tested ⇒ multi-task kernels


Page 45:

Cell line data integration

- Covariates: linear kernel
- SNPs: 10 Gaussian kernels
- RNA-seq: 10 Gaussian kernels

Page 46:

Cell line data integration

- Covariates: linear kernel
- SNPs: 10 Gaussian kernels
- RNA-seq: 10 Gaussian kernels

⇒ Integrated kernel (a combination sketch follows)

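A minimal sketch of one way to build the integrated cell-line kernel: average the single-source Gram matrices after a rough normalization. The NaN-skipping rule for missing data is an assumption of this sketch, not necessarily the challenge submission's exact rule:

```python
import numpy as np

def integrate_kernels(kernels):
    """Entry-wise average of roughly normalized Gram matrices; NaN entries
    (a kernel unavailable for a pair of cell lines) are skipped in the mean."""
    normed = [K / np.nanmean(np.diag(K)) for K in kernels]
    return np.nanmean(np.stack(normed), axis=0)

# Usage: K_int = integrate_kernels([K_covariates] + K_snp_list + K_rnaseq_list)
```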

Page 47:

Multi-task drug kernels

1. Dirac
2. Multi-task
3. Feature-based
4. Empirical
5. Integrated

Dirac ⇒ independent regression for each drug

Page 48:

Multi-task drug kernels

1. Dirac
2. Multi-task
3. Feature-based
4. Empirical
5. Integrated

Multi-task ⇒ sharing information across drugs (see the sketch below)
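The two extremes are easy to write down; a sketch (the interpolation parameter a in the multi-task kernel is an assumption, one of several standard parameterizations):

```python
import numpy as np

def dirac_kernel(p):
    """K_task = I: one independent regression per drug."""
    return np.eye(p)

def multitask_kernel(p, a=0.5):
    """(1 - a) I + a 11^T: shares a common component across all drugs."""
    return (1.0 - a) * np.eye(p) + a * np.ones((p, p))
```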

Page 49:

Multi-task drug kernels

1. Dirac
2. Multi-task
3. Feature-based
4. Empirical
5. Integrated

Feature-based: linear kernel and 10 Gaussian kernels based on each set of features (a construction sketch follows the list):
- CDK (160 descriptors) and SIRMS (9272 descriptors)
- graph kernel for molecules (2D walk kernel)
- fingerprint of 2D substructures (881 descriptors)
- ability to bind human proteins (1554 descriptors)
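A sketch of the linear + multi-bandwidth Gaussian construction from a descriptor matrix; the bandwidth grid around the median pairwise distance is an assumption of this sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def descriptor_kernels(D, n_scales=10):
    """D: (p, d) drug descriptor matrix -> [linear] + n_scales Gaussian kernels."""
    kernels = [D @ D.T]                          # linear kernel
    d2 = squareform(pdist(D, "sqeuclidean"))     # squared pairwise distances
    s0 = np.median(d2[d2 > 0])                   # reference bandwidth
    for s in np.logspace(-2, 2, n_scales):       # bandwidth grid
        kernels.append(np.exp(-d2 / (s * s0)))
    return kernels
```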

Page 50:

Multi-task drug kernels

1. Dirac
2. Multi-task
3. Feature-based
4. Empirical
5. Integrated

Empirical: [heatmap of the empirical correlation between drug toxicity profiles, values roughly 0–0.8]. A construction sketch follows.
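A sketch of an empirical drug kernel built from the training toxicity matrix itself; clipping negative correlations at zero keeps similarities nonnegative but is an assumption of this sketch (it does not guarantee positive semi-definiteness; one could instead project onto the PSD cone):

```python
import numpy as np

def empirical_drug_kernel(Z):
    """Z: (n_cell_lines, p_drugs) training toxicities -> (p, p) similarity."""
    C = np.corrcoef(Z.T)              # correlation between drug response profiles
    return np.clip(C, 0.0, None)      # keep nonnegative similarities
```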

Page 51:

Multi-task drug kernels

1. Dirac
2. Multi-task
3. Feature-based
4. Empirical
5. Integrated

Integrated kernel: combine all information on drugs

Page 52:

29x48 kernel combinations: CV results

[Heatmap: cross-validated concordance index (CI, roughly 0.50–0.53) for all 29 cell-line kernels × 48 drug kernels. Cell-line kernels cover covariates (all, population, batch, sex), SNP and RNA-seq RBF kernels and their means, dirac, uniform, and the integrated kernel Kint; drug kernels cover multitask 1–11, CDK/SIRMS/predtarget RBF kernels and their means, chemcpp, substructure, empirical, and integrated.]

Page 53:

29x48 kernel combinations: CV results

[Same 29 × 48 CI heatmap, annotated: the integrated and covariates kernels stand out on the cell-line side.]

jeudi 7 novembre 13

Page 54:

29x48 kernel combinations: CV results

[Same heatmap, annotated: covariates kernel on cell lines, slightly multi-task on drugs.]

jeudi 7 novembre 13

Page 55:

Kernel on cell lines: CV results

[Bar chart: mean CI (0.50–0.54) per cell-line kernel. The covariates kernels and the integrated kernel Kint rank highest; dirac, uniform and the SNP kernels rank lowest. Callouts: integrated kernel; batch effect.]

Page 56:

Kernel on drugs: CV results

[Bar chart: mean CI (0.50–0.54) per chemical kernel, over the 48 drug kernels (chemcpp, substructure, multitask 1–11, CDK/SIRMS/predtarget RBF kernels and their means, empirical, integrated).]

Page 57:

Final Submission (ranked 2nd)

[Same bar chart of mean CI per chemical kernel as on the previous slide.]

Final submission: empirical kernel on drugs (the empirical-correlation heatmap) combined with the integrated kernel on cell lines.

Page 58:

Thanks
