
Integrating Tara Oceans datasets using unsupervised multiple kernel learning


Page 1: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Nathalie Villa-Vialaneix
Joint work with Jérôme Mariette
http://www.nathalievilla.org

Séminaire de Probabilité et Statistique
Laboratoire J.A. Dieudonné, Université de Nice


Page 2: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Outline

1 Metagenomic datasets and associated questions

2 A typical (and rich) case study: TARA Oceans datasets

3 A UMKL framework for integrating multiple metagenomic data

4 Application to TARA Oceans datasets


Page 4: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

What are metagenomic data?

Source: [Sommer et al., 2010]

abundance data: sparse n × p matrices of count data, with samples in rows and descriptors (species, OTUs, KEGG groups, k-mers, ...) in columns. Generally p ≫ n.

phylogenetic tree (evolution history between species, OTUs, ...): one tree with p leaves, built from the sequences collected in the n samples.

Source: Wikimedia Commons, Donovan.parks


Page 7: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

What are metagenomic data used for?

produce a profile of the diversity of a given sample ⇒ allows comparing diversity between various conditions

used in various fields: environmental science, microbiota studies, ...

Processed by computing a relevant dissimilarity between samples (the standard Euclidean distance is not relevant) and by using this dissimilarity in subsequent analyses.


Page 9: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

β-diversity data: dissimilarities between count data

Compositional dissimilarities ($n_{ig}$: count of species $g$ in sample $i$)

Jaccard: the fraction of species specific to either sample $i$ or $j$:

$$d_{jac} = \frac{\sum_g \left( \mathbb{I}_{\{n_{ig} > 0,\, n_{jg} = 0\}} + \mathbb{I}_{\{n_{jg} > 0,\, n_{ig} = 0\}} \right)}{\sum_g \mathbb{I}_{\{n_{ig} + n_{jg} > 0\}}}$$

Bray-Curtis: the fraction of the sample which is specific to either sample $i$ or $j$:

$$d_{BC} = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})}$$

Other dissimilarities are available in the R package phyloseq; most of them are not Euclidean.
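As a side illustration (a toy count matrix, not part of the talk), the two dissimilarities above can be computed directly from their definitions in R:

## toy counts: samples in rows, species in columns
counts <- matrix(c(10, 0, 3, 2,
                    0, 5, 1, 0,
                    8, 1, 0, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:3), paste0("sp", 1:4)))

jaccard <- function(ni, nj) {
  ## fraction of the observed species present in only one of the two samples
  sum((ni > 0 & nj == 0) | (nj > 0 & ni == 0)) / sum(ni + nj > 0)
}
bray_curtis <- function(ni, nj) {
  ## fraction of the total counts that is specific to either sample
  sum(abs(ni - nj)) / sum(ni + nj)
}

n <- nrow(counts)
d_jac <- d_bc <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in 1:n) for (j in 1:n) {
  d_jac[i, j] <- jaccard(counts[i, ], counts[j, ])
  d_bc[i, j]  <- bray_curtis(counts[i, ], counts[j, ])
}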

Page 10: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

β-diversity data: phylogenetic dissimilarities

Phylogenetic dissimilarities

For each branch $e$ of the tree, denote by $l_e$ its length and by $p_{ei}$ the fraction of counts in sample $i$ corresponding to species below branch $e$.

Unifrac: the fraction of the tree specific to either sample $i$ or sample $j$:

$$d_{UF} = \frac{\sum_e l_e \left( \mathbb{I}_{\{p_{ei} > 0,\, p_{ej} = 0\}} + \mathbb{I}_{\{p_{ej} > 0,\, p_{ei} = 0\}} \right)}{\sum_e l_e\, \mathbb{I}_{\{p_{ei} + p_{ej} > 0\}}}$$

Weighted Unifrac: the fraction of the diversity specific to sample $i$ or to sample $j$:

$$d_{wUF} = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}$$
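In the same spirit, a toy sketch of the two UniFrac variants (it assumes the branch lengths le and the per-sample branch fractions, e.g. rows of a matrix P with branches in columns, have already been extracted from the tree; that extraction step is not shown):

unifrac <- function(le, p_i, p_j) {
  ## fraction of the branch-length-weighted tree specific to one of the two samples
  sum(le * ((p_i > 0 & p_j == 0) | (p_j > 0 & p_i == 0))) / sum(le * (p_i + p_j > 0))
}
weighted_unifrac <- function(le, p_i, p_j) {
  sum(le * abs(p_i - p_j)) / sum(p_i + p_j)
}
## example: d <- weighted_unifrac(le, P["sample1", ], P["sample2", ])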



Page 15: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

TARA Oceans datasets

The 2009-2013 expedition

Co-directed by Étienne Bourgois and Éric Karsenti.

7,012 datasets collected from 35,000 samples of plankton and water (11,535 Gb of data).

Study of the plankton: bacteria, protists, metazoans and viruses, representing more than 90% of the biomass in the ocean.

Page 16: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

TARA Oceans datasets

Science (May 2015) - Studies on:

eukaryotic plankton diversity [de Vargas et al., 2015],

ocean viral communities [Brum et al., 2015],

global plankton interactome [Lima-Mendez et al., 2015],

global ocean microbiome [Sunagawa et al., 2015],

...

→ datasets of different types and from different sources, analyzed separately.

Page 17: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Background of this talk

Objectives

Until now: many papers using many methods, but no integrated analysis performed.

What do the datasets reveal if integrated in a single analysis?

Our purpose: develop a generic method to integrate phylogenetic, taxonomic and functional community composition together with environmental factors.

Page 18: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

TARA Oceans datasets that we used

[Sunagawa et al., 2015], [de Vargas et al., 2015], [Brum et al., 2015]

Datasets used

environmental dataset: 22 numeric features (temperature, salinity, ...).

bacteria phylogenomic tree: computed from ∼ 35,000 OTUs.

bacteria functional composition: ∼ 63,000 KEGG orthologous groups.

eukaryotic plankton composition, split into 4 size groups: pico (0.8-5 µm), nano (5-20 µm), micro (20-180 µm) and meso (180-2000 µm).

virus composition: ∼ 867 virus clusters based on shared gene content.


Page 23: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

TARA Oceans datasets that we used

Common samples

48 samples,

2 depth layers: surface (SRF) and deep chlorophyll maximum (DCM),

31 different sampling stations.


Page 25: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Kernel methods

Kernel viewed as the dot product in an implicit Hilbert space

$K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ s.t. $K(x_i, x_j) = K(x_j, x_i)$ and $\forall m \in \mathbb{N}$, $\forall x_1, \ldots, x_m \in \mathcal{X}$, $\forall \alpha_1, \ldots, \alpha_m \in \mathbb{R}$, $\sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j) \geq 0$.

⇒ [Aronszajn, 1950]

$\exists!\, (\mathcal{H}, \langle \cdot, \cdot \rangle)$ and $\phi: \mathcal{X} \to \mathcal{H}$ s.t. $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$


Page 27: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Exploratory analysis with kernels

A well known example: kernel PCA (K-PCA) [Schölkopf et al., 1998], a PCA performed in the feature space induced by the kernel K.

In practice:

K is centered: $K \leftarrow K - \frac{1}{N} \mathbb{I}_N K - \frac{1}{N} K \mathbb{I}_N + \frac{1}{N^2} \mathbb{I}_N K \mathbb{I}_N$, with $\mathbb{I}_N$ the $N \times N$ matrix of ones;

K-PCA is performed by the eigen-decomposition of the (centered) K.

If $(\alpha_k)_{k=1,\ldots,N} \in \mathbb{R}^N$ and $(\lambda_k)_{k=1,\ldots,N}$ are the eigenvectors and eigenvalues, the PC axes are

$$a_k = \sum_{i=1}^N \alpha_{ki}\, \phi(x_i)$$

and the $(a_k)_k$ are orthonormal in the feature space induced by the kernel:

$$\forall\, k, k', \quad \langle a_k, a_{k'} \rangle = \alpha_k^\top K \alpha_{k'} = \delta_{kk'} \quad \text{with } \delta_{kk'} = \begin{cases} 0 & \text{if } k \neq k' \\ 1 & \text{otherwise.} \end{cases}$$

Coordinates of the projections of the observations $(\phi(x_i))_i$:

$$\langle a_k, \phi(x_i) \rangle = \sum_{j=1}^N \alpha_{kj} K_{ji} = K_{i\cdot}\, \alpha_k = \lambda_k \alpha_{ki},$$

where $K_{i\cdot}$ is the $i$-th row of K.

No representation for the variables (no real variables...).

Other unsupervised kernel methods: kernel SOM [Olteanu and Villa-Vialaneix, 2015, Mariette et al., 2017]

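A minimal K-PCA sketch in R (toy data and a Gaussian kernel, both chosen here only for illustration): center the Gram matrix and take its eigen-decomposition, as described above.

set.seed(1)
X <- matrix(rnorm(20 * 2), ncol = 2)        # 20 toy observations
K <- exp(-as.matrix(dist(X))^2)             # a valid kernel (Gaussian, bandwidth 1)
N <- nrow(K)
ones <- matrix(1 / N, N, N)
Kc <- K - ones %*% K - K %*% ones + ones %*% K %*% ones   # double centering
eig <- eigen(Kc, symmetric = TRUE)
## coordinates of the observations on the first two kernel PCs (lambda_k * alpha_ki)
scores <- eig$vectors[, 1:2] %*% diag(eig$values[1:2])
plot(scores, xlab = "KPC 1", ylab = "KPC 2")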

Page 32: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Usefulness of K-PCA

Non-linear PCA

Source: By Petter Strandmark - Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=3936753

[Mariette et al., 2017] K-PCA for non-numeric datasets - here a qualitative time series: job trajectories after graduation, from the French survey “Generation 98” [Cottrell and Letrémy, 2005]

color is the mode of the trajectories


Page 34: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

From multiple dissimilarities to multiple kernels

1 several (non Euclidean) dissimilarities $D_1, \ldots, D_M$, transformed into similarities with [Lee and Verleysen, 2007]:

$$K_m(x_i, x_j) = -\frac{1}{2} \left[ D_m(x_i, x_j) - \frac{1}{N} \sum_{k=1}^N D_m(x_i, x_k) - \frac{1}{N} \sum_{k=1}^N D_m(x_j, x_k) + \frac{1}{N^2} \sum_{k,k'=1}^N D_m(x_k, x_{k'}) \right]$$

2 if the result is not positive, clipping or flipping (removing the negative part of the eigen-decomposition, or taking its opposite) produces a kernel [Chen et al., 2009].
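A sketch of this transformation in R (the input dissimilarity matrix D is assumed symmetric; the double centering above is exactly -1/2 J D J with J the centering matrix):

dissim_to_kernel <- function(D, fix = c("clip", "flip")) {
  fix <- match.arg(fix)
  N <- nrow(D)
  J <- diag(N) - matrix(1 / N, N, N)            # centering matrix
  K <- -0.5 * J %*% D %*% J                     # similarity by double centering
  e <- eigen(K, symmetric = TRUE)
  lambda <- if (fix == "clip") pmax(e$values, 0) else abs(e$values)
  e$vectors %*% diag(lambda) %*% t(e$vectors)   # positive semi-definite kernel
}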

Page 35: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

From multiple kernels to an integrated kernel

How to combine multiple kernels?

naive approach: $K^* = \frac{1}{M} \sum_m K_m$

supervised framework: $K^* = \sum_m \beta_m K_m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to minimize the prediction error [Gönen and Alpaydin, 2011]

unsupervised framework, but the input space is $\mathbb{R}^d$ [Zhuang et al., 2011]: $K^* = \sum_m \beta_m K_m$ with $\beta_m \geq 0$ and $\sum_m \beta_m = 1$, with $\beta_m$ chosen so as to

- minimize the distortion between all training data, $\sum_{ij} K^*(x_i, x_j) \|x_i - x_j\|^2$;

- AND minimize the approximation of the original data by the kernel embedding, $\sum_i \left\| x_i - \sum_j K^*(x_i, x_j)\, x_j \right\|^2$.

Our proposal: 2 UMKL frameworks which do not require the data to take values in $\mathbb{R}^d$.


Page 39: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

STATIS-like framework [L'Hermier des Plantes, 1976, Lavit et al., 1994]

Similarities between kernels:

$$C_{mm'} = \frac{\langle K_m, K_{m'} \rangle_F}{\|K_m\|_F \|K_{m'}\|_F} = \frac{\mathrm{Trace}(K_m K_{m'})}{\sqrt{\mathrm{Trace}((K_m)^2)\, \mathrm{Trace}((K_{m'})^2)}}.$$

($C_{mm'}$ is an extension of the RV-coefficient [Robert and Escoufier, 1976] to the kernel framework)

$$\text{maximize} \quad \sum_{m=1}^M \left\langle K^*(v), \frac{K_m}{\|K_m\|_F} \right\rangle_F = v^\top C v$$

for $K^*(v) = \sum_{m=1}^M v_m K_m$ and $v \in \mathbb{R}^M$ such that $\|v\|_2 = 1$.

Solution: the first eigenvector of $C$ ⇒ set $\beta = \frac{v}{\sum_{m=1}^M v_m}$ (consensual kernel).
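A small sketch of this consensual weighting in R (kernels is assumed to be a list of N x N kernel matrices computed on the same samples):

statis_weights <- function(kernels) {
  M <- length(kernels)
  C <- matrix(0, M, M)
  for (m in 1:M) for (mp in 1:M) {
    ## Frobenius inner products between normalized kernels
    C[m, mp] <- sum(kernels[[m]] * kernels[[mp]]) /
      (sqrt(sum(kernels[[m]]^2)) * sqrt(sum(kernels[[mp]]^2)))
  }
  v <- eigen(C, symmetric = TRUE)$vectors[, 1]
  v <- abs(v)    # the leading eigenvector of this entrywise non-negative matrix has constant sign
  v / sum(v)     # beta: weights summing to 1
}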


Page 42: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

A kernel preserving the original topology of the data I

From an idea similar to that of [Lin et al., 2010]: find a kernel such that the local geometry of the data in the feature space is similar to that of the original data.

Proxy of the local geometry:

$$K_m \longrightarrow \underbrace{G_k^m}_{k\text{-nearest neighbors graph}} \longrightarrow \underbrace{A_k^m}_{\text{adjacency matrix}}$$

$$\Rightarrow \quad W = \sum_m \mathbb{I}_{\{A_k^m > 0\}} \quad \text{or} \quad W = \sum_m A_k^m$$

Feature space geometry measured by

$$\Delta_i(\beta) = \left( \langle \phi^*_\beta(x_i), \phi^*_\beta(x_1) \rangle, \ldots, \langle \phi^*_\beta(x_i), \phi^*_\beta(x_N) \rangle \right) = \left( K^*_\beta(x_i, x_1), \ldots, K^*_\beta(x_i, x_N) \right)$$
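A sketch of the W matrix in R, under the assumption that the k-nearest-neighbor graph of each kernel is built from the kernel-induced squared distance d(i,j)^2 = K(i,i) + K(j,j) - 2 K(i,j) (kernels is again a list of kernel matrices):

knn_adjacency <- function(K, k = 5) {
  D2 <- outer(diag(K), diag(K), "+") - 2 * K   # squared distances in the feature space
  N <- nrow(K)
  A <- matrix(0, N, N)
  for (i in 1:N) {
    nn <- order(D2[i, ])[2:(k + 1)]            # k nearest neighbors, excluding i itself
    A[i, nn] <- 1
  }
  pmax(A, t(A))                                # symmetrized adjacency matrix
}
## W counts in how many kernels two samples are neighbors:
## W <- Reduce(`+`, lapply(kernels, knn_adjacency, k = 5))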


Page 45: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

A kernel preserving the original topology of the data II

Sparse version:

$$\text{minimize} \quad \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(\beta) - \Delta_j(\beta) \right\|^2$$

for $K^*_\beta = \sum_{m=1}^M \beta_m K_m$ and $\beta \in \mathbb{R}^M$ s.t. $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$

$$\Leftrightarrow \quad \text{minimize} \quad \sum_{m,m'=1}^M \beta_m \beta_{m'} S_{mm'} \quad \text{over } \beta \in \mathbb{R}^M \text{ such that } \beta_m \geq 0 \text{ and } \sum_{m=1}^M \beta_m = 1,$$

for $S_{mm'} = \sum_{i,j=1}^N W_{ij} \left( \Delta_i^m - \Delta_j^m \right)^\top \left( \Delta_i^{m'} - \Delta_j^{m'} \right)$ and $\Delta_i^m = \left( K_m(x_i, x_1), \ldots, K_m(x_i, x_N) \right)^\top$.

Non sparse version: the same criterion,

$$\text{minimize} \quad \sum_{i,j=1}^N W_{ij} \left\| \Delta_i(v) - \Delta_j(v) \right\|^2 \quad \Leftrightarrow \quad \text{minimize} \quad \sum_{m,m'=1}^M v_m v_{m'} S_{mm'},$$

for $K^*_v = \sum_{m=1}^M v_m K_m$ and $v \in \mathbb{R}^M$ s.t. $v_m \geq 0$ and $\|v\|_2 = 1$.
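A direct (unoptimized) sketch of the S matrix in R, following the definition above (kernels: list of N x N kernel matrices; W: the k-NN based weight matrix of the previous slide):

compute_S <- function(kernels, W) {
  M <- length(kernels)
  N <- nrow(W)
  S <- matrix(0, M, M)
  for (m in 1:M) for (mp in m:M) {
    s <- 0
    for (i in 1:N) for (j in 1:N) {
      if (W[i, j] > 0) {
        dm  <- kernels[[m]][i, ]  - kernels[[m]][j, ]    # Delta_i^m - Delta_j^m
        dmp <- kernels[[mp]][i, ] - kernels[[mp]][j, ]
        s <- s + W[i, j] * sum(dm * dmp)
      }
    }
    S[m, mp] <- S[mp, m] <- s
  }
  S
}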


Page 48: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Optimization issues

Sparse version: writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_1 = \sum_m \beta_m = 1$ ⇒ a standard QP problem with linear constraints (e.g., package quadprog in R).

Non sparse version: writes $\min_\beta \beta^\top S \beta$ s.t. $\beta \geq 0$ and $\|\beta\|_2 = 1$ ⇒ a QPQC problem (hard to solve).

Equivalent to the following problem: $\min_{\beta, B} \mathrm{Trace}(S_2 X)$ s.t. $\mathrm{Trace}(A X) = 1$, $\mathrm{Trace}(A_j X) \geq 0$ and $B = \beta \beta^\top$, with:

$$X = \begin{pmatrix} 1 & \beta^\top \\ \beta & B \end{pmatrix} \qquad A = \begin{pmatrix} 0 & 0_M^\top \\ 0_M & I_M \end{pmatrix} \qquad A_j = \begin{pmatrix} 0 & 1_j^\top \\ 1_j & 0_{MM} \end{pmatrix}$$

($S_2$ denotes the $(M+1) \times (M+1)$ block matrix with $S$ in its lower-right $M \times M$ block and zeros elsewhere, so that $\mathrm{Trace}(S_2 X) = \beta^\top S \beta$ when $B = \beta \beta^\top$.)

Relaxed into the same problem without the constraint $B = \beta \beta^\top$, only requiring $X$ to be positive semi-definite: semi-definite programming ⇒ efficient solvers exist.
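For the sparse version, the QP can be solved with the quadprog package mentioned above (S is assumed to be the M x M matrix of the previous slides; a small ridge keeps it numerically positive definite):

library(quadprog)
solve_sparse_weights <- function(S, eps = 1e-8) {
  M <- nrow(S)
  Dmat <- 2 * S + diag(eps, M)           # solve.QP minimizes 1/2 b' D b - d' b
  dvec <- rep(0, M)
  Amat <- cbind(rep(1, M), diag(M))      # constraints: sum(beta) = 1 (equality), beta >= 0
  bvec <- c(1, rep(0, M))
  solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
}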


Page 52: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

A proposal to improve interpretability of K-PCA in our framework

Issue: How to assess the importance of a given species in the K-PCA?

our datasets are either numeric (environmental) or are built from an n × p count matrix

⇒ for a given species, randomly permute its counts and re-do the analysis (kernel computation - with the same optimized weights - and K-PCA)

the influence of a given species in a given dataset on a given PC subspace is assessed by computing the Crone-Crosby distance between these two PCA subspaces [Crone and Crosby, 1995] (∼ Frobenius norm between the projectors)
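A rough sketch of this permutation scheme in R (compute_kernel() and kpca_axes() are hypothetical placeholders for the kernel computation and K-PCA steps above; kpca_axes() is assumed to return an N x ndim matrix whose columns span the retained PC subspace):

crone_crosby <- function(U1, U2) {
  ## distance between the subspaces spanned by the columns of U1 and U2
  ## (Frobenius norm between the two orthogonal projectors, up to a constant)
  P1 <- U1 %*% solve(crossprod(U1)) %*% t(U1)
  P2 <- U2 %*% solve(crossprod(U2)) %*% t(U2)
  norm(P1 - P2, type = "F") / sqrt(2)
}
species_importance <- function(counts, species, ndim = 2) {
  ref  <- kpca_axes(compute_kernel(counts), ndim)       # reference K-PCA subspace
  perm <- counts
  perm[, species] <- sample(perm[, species])            # permute one species' counts
  crone_crosby(ref, kpca_axes(compute_kernel(perm), ndim))
}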



Page 56: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

M TARA Oceans datasets $(x_i^m)_{i=1,\ldots,N,\ m=1,\ldots,M}$ measured on the same ocean samples $(1, \ldots, N)$, which take values in arbitrary spaces $(\mathcal{X}_m)_m$:

environmental dataset,

bacteria phylogenomic tree,

bacteria functional composition,

eukaryote pico-plankton composition,

...

virus composition.

Page 57: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

Environmental dataset: standard Euclidean distance, given by the kernel $K(x_i, x_j) = x_i^\top x_j$.

Page 58: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

Bacteria phylogenomic tree: the weighted Unifrac distance, given by

$$d_{wUF}(x_i, x_j) = \frac{\sum_e l_e |p_{ei} - p_{ej}|}{\sum_e (p_{ei} + p_{ej})}.$$

Page 59: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

All composition-based datasets (bacteria functional composition, eukaryote (pico, nano, micro, meso)-plankton composition and virus composition): the Bray-Curtis dissimilarity,

$$d_{BC}(x_i, x_j) = \frac{\sum_g |n_{ig} - n_{jg}|}{\sum_g (n_{ig} + n_{jg})},$$

with $n_{ig}$ the abundance of gene $g$, summarized at the KEGG orthologous group level, in sample $i$.

Page 60: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

Combination of the M kernels by a weighted sum:

$$K^* = \sum_{m=1}^M \beta_m K_m,$$

where $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$.
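A one-line sketch of this combination in R (beta is assumed to come from one of the UMKL criteria above, kernels being the corresponding list of kernel matrices):

combine_kernels <- function(kernels, beta) {
  Reduce(`+`, Map(function(K, b) b * K, kernels, beta))
}
## K_star <- combine_kernels(kernels, beta)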

Page 61: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Integrating ’omics data using kernels

Apply standard data mining methods (clustering, linear models, PCA, ...) in the feature space.

Page 62: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Correlation between kernels (STATIS)

Low correlations between the bacteria functional composition and the other datasets.

Strong correlation between environmental variables and small organisms (bacteria, eukaryote pico-plankton and viruses).


Page 65: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Influence of k (nb of neighbors) on (βm)m

k ≥ 5 provides stable results


Page 66: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

(βm)m values returned by graph-MKL

The dataset least correlated to the others, the bacteria functional composition, has the highest coefficient.

Three kernels have a weight equal to 0 (sparse version).


Page 69: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Proof of concept: using [Sunagawa et al., 2015]

Datasets

139 samples, 3 layers (SRF, DCM and MES)

kernels: phychem, pro-OTUs and pro-OGs


Page 75: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Proof of concept: using [Sunagawa et al., 2015]

Proteobacteria (clade SAR11 (Alphaproteobacteria) and SAR86) dominate the sampled areas of the ocean in terms of relative abundance and taxonomic richness.

Page 76: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

K-PCA on K∗

Page 77: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

K-PCA on K∗ - environmental dataset


Page 79: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Conclusion and perspectives

Summary

an integrative exploratory method

... particularly well suited for multiple metagenomic datasets

with enhanced interpretability

Perspectives

implement the SDP solution and test it

improve biological interpretation

soon-to-be-released R package

Page 80: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

Questions?


Page 81: Integrating Tara Oceans datasets using unsupervised multiple kernel learning

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404.

Brum, J., Ignacio-Espinoza, J., Roux, S., Doulcier, G., Acinas, S., Alberti, A., Chaffron, S., Cruaud, C., de Vargas, C., Gasol, J., Gorsky, G., Gregory, A., Guidi, L., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Poulos, B., Schwenck, S., Speich, S., Dimier, C., Kandels-Lewis, S., Picheral, M., Searson, S., Tara Oceans coordinators, Bork, P., Bowler, C., Sunagawa, S., Wincker, P., Karsenti, E., and Sullivan, M. (2015). Patterns and ecological drivers of ocean viral communities. Science, 348(6237).

Chen, Y., Garcia, E., Gupta, M., Rahimi, A., and Cazzanti, L. (2009). Similarity-based classification: concepts and algorithms. Journal of Machine Learning Research, 10:747–776.

Cottrell, M. and Letrémy, P. (2005). How to use the Kohonen algorithm to simultaneously analyse individuals in a survey. Neurocomputing, 63:193–207.

Crone, L. and Crosby, D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3):324–328.

de Vargas, C., Audic, S., Henry, N., Decelle, J., Mahé, P., Logares, R., Lara, E., Berney, C., Le Bescot, N., Probert, I., Carmichael, M., Poulain, J., Romac, S., Colin, S., Aury, J., Bittner, L., Chaffron, S., Dunthorn, M., Engelen, S., Flegontova, O., Guidi, L., Horák, A., Jaillon, O., Lima-Mendez, G., Lukeš, J., Malviya, S., Morard, R., Mulot, M., Scalco, E., Siano, R., Vincent, F., Zingone, A., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Acinas, S., Bork, P., Bowler, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Not, F., Ogata, H., Pesant, S., Raes, J., Sieracki, M. E., Speich, S., Stemmann, L., Sunagawa, S., Weissenbach, J., Wincker, P., and Karsenti, E. (2015). Eukaryotic plankton diversity in the sunlit ocean. Science, 348(6237).

Gönen, M. and Alpaydin, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18(1):97–119.

Lee, J. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer, New York; London.

L'Hermier des Plantes, H. (1976). Structuration des tableaux à trois indices de la statistique. PhD thesis (thèse de troisième cycle), Université de Montpellier.

Lima-Mendez, G., Faust, K., Henry, N., Decelle, J., Colin, S., Carcillo, F., Chaffron, S., Ignacio-Espinosa, J., Roux, S., Vincent, F., Bittner, L., Darzi, Y., Wang, B., Audic, S., Berline, L., Bontempi, G., Cabello, A., Coppola, L., Cornejo-Castillo, F., d'Oviedo, F., de Meester, L., Ferrera, I., Garet-Delmas, M., Guidi, L., Lara, E., Pesant, S., Royo-Llonch, M., Salazar, F., Sánchez, P., Sebastian, M., Souffreau, C., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Gorsky, G., Not, F., Ogata, H., Speich, S., Stemmann, L., Weissenbach, J., Wincker, P., Acinas, S., Sunagawa, S., Bork, P., Sullivan, M., Karsenti, E., Bowler, C., de Vargas, C., and Raes, J. (2015). Determinants of community structure in the global plankton interactome. Science, 348(6237).

Lin, Y., Liu, T., and Fuh, C.-S. (2010). Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1147–1160.

Mariette, J., Olteanu, M., and Villa-Vialaneix, N. (2017). Efficient interpretable variants of online SOM for large dissimilarity data. Neurocomputing, 225:31–48.

Olteanu, M. and Villa-Vialaneix, N. (2015). On-line relational and multiple relational SOM. Neurocomputing, 147:15–30.

Robert, P. and Escoufier, Y. (1976). A unifying tool for linear multivariate statistical methods: the RV-coefficient. Applied Statistics, 25(3):257–265.

Schölkopf, B., Smola, A., and Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.

Sommer, M., Church, G., and Dantas, G. (2010). A functional metagenomic approach for expanding the synthetic biology toolbox for biomass conversion. Molecular Systems Biology, 6(360).

Sunagawa, S., Coelho, L., Chaffron, S., Kultima, J., Labadie, K., Salazar, F., Djahanschiri, B., Zeller, G., Mende, D., Alberti, A., Cornejo-Castillo, F., Costea, P., Cruaud, C., d'Oviedo, F., Engelen, S., Ferrera, I., Gasol, J., Guidi, L., Hildebrand, F., Kokoszka, F., Lepoivre, C., Lima-Mendez, G., Poulain, J., Poulos, B., Royo-Llonch, M., Sarmento, H., Vieira-Silva, S., Dimier, C., Picheral, M., Searson, S., Kandels-Lewis, S., Tara Oceans coordinators, Bowler, C., de Vargas, C., Gorsky, G., Grimsley, N., Hingamp, P., Iudicone, D., Jaillon, O., Not, F., Ogata, H., Pesant, S., Speich, S., Stemmann, L., Sullivan, M., Weissenbach, J., Wincker, P., Karsenti, E., Raes, J., Acinas, S., and Bork, P. (2015). Structure and function of the global ocean microbiome. Science, 348(6237).

Zhuang, J., Wang, J., Hoi, S., and Lan, X. (2011). Unsupervised multiple kernel clustering. Journal of Machine Learning Research: Workshop and Conference Proceedings, 20:129–144.