
Semi-Supervised Learning and Data Exploration

Nataliya Sokolovska

Sorbonne University, Paris, France

Master 2 BIM, November 29, 2019

Outline

Patients Stratification and Methods of Personalized Medicine

An application: Obesity stratification based on metagenomics
    Weight Loss Prediction

Some (Fancy) Clustering Methods
    Some widely used Clustering Methods
    Spectral Clustering
    Biclustering
    Robust Clustering

Semi-Supervised Learning

Canonical Correlation: Correlation between Sets of Variables


An application: Obesity stratification based on metagenomics




What is Metagenomics?

I Metagenome
    I can be defined as the ensemble of the microbes from a given ecological niche

I Metagenomics
    I makes it possible to characterize the composition, properties, and dynamics of a microbiome by studying the metagenome

Obesity stratification based on metagenomics

MicroObese Study

MetaHIT Study

MicroObese Study

Obesity stratification based on metagenomics

I Gut microbial gene richness can influence the outcome of a dietary intervention

I A quantitative metagenomic analysis stratified patients into two groups: a low gene count (LGC) group and a high gene count (HGC) group

I The LGC individuals appeared to have increased blood triglycerides, higher insulin resistance, and low-grade inflammation; gene richness is therefore strongly associated with obesity-driven diseases

I The individuals from the LGC group seemed to have an increased obesity-related cardiometabolic risk compared with the patients from the HGC group

Stratification of Dutch individuals

I E. Le Chatelier et al., 2011 conducted a similar study with Dutch individuals and reached a similar conclusion: there is hope that a diet can be used to induce a permanent change of the gut microbiota, and treatment should be phenotype-specific.

I A particular diet is able to increase gene richness: an increase in gene count was observed in the LGC patients after a 6-week energy-restricted diet

PCA: example on real data

Weight Loss Prediction







The clustering problem

I Motivation: find patterns in a sea of data

I Input
    I A large number of data points
    I A measure of distance between any two points

I Output
    I Grouping (clustering) of the elements into K similarity clusters

I Clustering is useful for
    I Similarity/dissimilarity analysis
    I Dimensionality reduction






Some widely used Clustering Methods

I Hierarchical clustering

I K-means

I Probabilistic methods of clustering (mixtures of Gaussians, EM)






Spectral Clustering

I U. von Luxburg, “A tutorial on spectral clustering”, Statistics and Computing, 2007

I One of the most popular clustering algorithms

I It can be proved that, under very mild conditions, spectral clustering algorithms are statistically consistent. This means that if we assume that the data has been sampled randomly according to some probability distribution from some underlying space, and if we let the sample size increase to infinity, then the results of clustering converge (these results do not necessarily hold for unnormalized spectral clustering).

Graph notation and similarity graphs

If we do not have more information than similarities between data points, a nice way of representing the data is in the form of a similarity graph. The vertices represent the data points. Two vertices are connected if the similarity between the corresponding data points is positive (or larger than a certain threshold), and the edge is weighted by the similarity.

Graphs and Cluster Assumption

The problem of clustering: we want to find a partition of the graph such that the edges between different groups have a very low weight.

“Cluster assumption”: two points are likely to have the same class label if there is a path connecting them passing through regions of high density only. Or, the decision boundary should lie in regions of low density.

Graph notations

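I $G = (V, E)$ is an undirected graph

I The graph is weighted: each edge between two vertices $v_i$ and $v_j$ has a weight $w_{ij} > 0$

I The weighted adjacency matrix is $W = (w_{ij})_{i,j=1,\dots,n}$ ($w_{ij} = 0$ means that the vertices $v_i$ and $v_j$ are not connected)

I The graph is undirected, so $w_{ij} = w_{ji}$

I The degree of a vertex $v_i$ is defined as $d_i = \sum_{j=1}^{n} w_{ij}$

I The degree matrix $D$ is the diagonal matrix with the degrees $d_1, \dots, d_n$ on the diagonal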

Graph notations Cont’d

I A subset of vertices $A \subset V$

I Two ways of measuring the size of $A$
    I $|A|$: the number of vertices in $A$
    I $\mathrm{vol}(A) = \sum_{i \in A} d_i$: measures the size of $A$ by the weights of its edges

I A subset $A$ is connected if any two vertices in $A$ can be joined by a path such that all intermediate points also lie in $A$.

Different similarity graphs (used in Spectral Clustering)

There are several popular constructions to transform a given set of data points into a graph. Most of them lead to a sparse representation ⇒ computational advantages.

I The ε-neighborhood graph. We connect all points whose pairwise distances are smaller than ε. Usually considered as an unweighted graph.

I k-nearest neighbor graphs. We connect vertex $v_i$ with vertex $v_j$ if $v_j$ is among the k nearest neighbors of $v_i$.

I The fully connected graph. We connect all points with positive similarity with each other, and we weight the edges by $s_{ij}$. The graph should model the local neighborhood relationships. An example of a similarity function is the Gaussian similarity function $s(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$. The parameter σ controls the width of the neighborhoods.
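The three constructions can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the k-nearest-neighbor graph is symmetrized by connecting two vertices when either one is among the other's neighbors, which is only one of several possible choices.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Fully connected graph: s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def epsilon_graph(X, eps):
    """Unweighted epsilon-neighborhood graph without self-loops."""
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    W = (dists < eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W

def knn_graph(X, k, sigma=1.0):
    """k-nearest-neighbor graph weighted by the Gaussian similarity.

    Assumption: v_i and v_j are connected if either one is among the
    k nearest neighbors of the other (symmetrization by logical OR).
    """
    S = gaussian_similarity(X, sigma)
    np.fill_diagonal(S, -np.inf)              # a point is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]       # k most similar points per row
    mask = np.zeros_like(S, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    mask[rows, idx.ravel()] = True
    mask = mask | mask.T                      # make the graph undirected
    np.fill_diagonal(S, 0.0)
    return np.where(mask, S, 0.0)

# Toy data: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
W_full = gaussian_similarity(X, sigma=1.0)
W_eps = epsilon_graph(X, eps=1.5)
W_knn = knn_graph(X, k=1, sigma=1.0)
```

On this toy data the ε-graph and the 1-nearest-neighbor graph both keep only the two within-pair edges, while the fully connected graph keeps every edge with a very small weight across pairs.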

Graph Laplacians

I The main tools for spectral clustering are graph Laplacian matrices

I In the literature, there is no unique convention as to which matrix exactly is called the “graph Laplacian”

I The unnormalized graph Laplacian matrix is defined as

$$L = D - W$$

I The normalized (symmetric) graph Laplacian is defined as

$$L_{\mathrm{sym}} = D^{-1/2} (D - W) D^{-1/2}$$

Properties of L

I For every vector $f \in \mathbb{R}^n$ we have

$$f' L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$$

I $L$ is symmetric and positive semi-definite

I The smallest eigenvalue of $L$ is 0; the corresponding eigenvector is the constant one vector $\mathbb{1}$

I $L$ has $n$ non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
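These properties are easy to check numerically. The sketch below builds both Laplacians for a small hand-written weight matrix (an arbitrary example, not data from the lecture) and compares the quadratic form with the pairwise sum; the function name `graph_laplacians` is hypothetical.

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_sym, D) for a symmetric non-negative weight matrix W."""
    d = W.sum(axis=1)                       # degrees d_i = sum_j w_ij
    D = np.diag(d)
    L = D - W                               # unnormalized Laplacian
    d_inv_sqrt = 1.0 / np.sqrt(d)           # assumes every degree is positive
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    return L, L_sym, D

# Arbitrary small weighted graph on 4 vertices
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.2, 0.0],
              [0.5, 0.2, 0.0, 0.3],
              [0.0, 0.0, 0.3, 0.0]])
L, L_sym, D = graph_laplacians(W)

f = np.array([1.0, -2.0, 0.5, 3.0])
quad_form = f @ L @ f                                         # f' L f
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)   # (1/2) sum w_ij (f_i - f_j)^2
eigenvalues = np.linalg.eigvalsh(L)                           # ascending, real (L is symmetric)
L_times_one = L @ np.ones(4)                                  # should be the zero vector
```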

Unnormalized Spectral Clustering

I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct
    I Construct a similarity graph; $W$ is its weighted adjacency matrix
    I Compute the unnormalized Laplacian $L$
    I Compute the first $k$ eigenvectors $v_1, \dots, v_k$ of $L$ (those associated with the $k$ smallest eigenvalues)
    I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
    I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of $V$
    I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$

I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
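The steps above can be sketched in a few lines of NumPy. The k-means routine here is a deliberately minimal Lloyd iteration with a deterministic farthest-point initialization, chosen only to keep the example self-contained; it is an assumption of this sketch, and any k-means implementation could be substituted. The weight matrix is a toy example with two tightly connected triangles joined by one weak edge.

```python
import numpy as np

def kmeans(Y, k, n_iter=100):
    """Minimal Lloyd's k-means with farthest-point initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Y - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = np.sum((Y[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(d2, axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def unnormalized_spectral_clustering(W, k):
    """Cluster the vertices of the graph with weighted adjacency matrix W."""
    L = np.diag(W.sum(axis=1)) - W        # unnormalized Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues returned in ascending order
    V = eigvecs[:, :k]                    # first k eigenvectors as columns
    return kmeans(V, k)                   # rows of V are the points y_i

# Two triangles joined by a single weak edge (weight 0.01)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0.01, 0, 0],
              [0, 0, 0.01, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels = unnormalized_spectral_clustering(W, k=2)
```

With this weight matrix the two triangles end up in different clusters, because the second eigenvector takes values of opposite sign on the two sides of the weak edge.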

Normalized Spectral Clustering (Shi and Malik, 2000)


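I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct
    I Construct a similarity graph; $W$ is its weighted adjacency matrix
    I Compute the unnormalized Laplacian $L$
    I Compute the first $k$ eigenvectors $v_1, \dots, v_k$ of the generalized eigenproblem $Lv = \lambda D v$
    I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
    I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of $V$
    I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$

I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.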


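Normalized spectral clustering (Ng, Jordan, and Weiss, 2002)

I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct
    I Construct a similarity graph; $W$ is its weighted adjacency matrix
    I Compute the normalized Laplacian $L_{\mathrm{sym}}$
    I Compute the first $k$ eigenvectors $v_1, \dots, v_k$ of $L_{\mathrm{sym}}$
    I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
    I Form the matrix $U \in \mathbb{R}^{n \times k}$ from $V$ by normalizing the rows to norm 1, that is, $u_{ij} = v_{ij} / \left(\sum_{l=1}^{k} v_{il}^2\right)^{1/2}$
    I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of $U$
    I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$

I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.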
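The only parts that differ from the unnormalized algorithm are the use of the symmetric normalized Laplacian and the row normalization. The sketch below computes the row-normalized embedding only; the rows of the returned matrix would then be passed to any k-means routine. The function name `njw_embedding` and the toy weight matrix are assumptions made for illustration.

```python
import numpy as np

def njw_embedding(W, k):
    """Row-normalized spectral embedding built from L_sym."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)          # assumes no isolated vertices
    # L_sym = D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)     # ascending eigenvalues
    V = eigvecs[:, :k]                     # first k eigenvectors as columns
    # Normalize every row of V to unit Euclidean norm
    return V / np.linalg.norm(V, axis=1, keepdims=True)

# Two triangles joined by a single weak edge (weight 0.01)
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0.01, 0, 0],
              [0, 0, 0.01, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
U = njw_embedding(W, k=2)
same_cluster = float(U[0] @ U[1])    # close to 1: rows point in the same direction
cross_cluster = float(U[0] @ U[5])   # close to 0: rows are nearly orthogonal
```

After normalization, points from the same triangle map to nearly identical unit vectors and points from different triangles to nearly orthogonal ones, which is what makes the final k-means step easy.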







Biclustering

I Simultaneous clustering of both rows and columns of a data matrix

I Identifies groups of genes with similar/coherent expression patterns under a specific subset of conditions

Why biclustering and not just clustering?

Biclustering is the key technique to use when

I Only a small number of the genes participate in a cellular process of interest

I An interesting cellular process is active only in a subset of the conditions

I A single gene may participate in multiple pathways that may or may not be co-active under all conditions

Biclustering: motivation

Gan et al., Discovering biclusters in gene expression data based on high-dimensional linear geometries, BMC Bioinformatics 2008

An illustrative example where conventional clustering fails but biclustering works: (a) A data matrix, which appears random visually even after hierarchical clustering. (b) A hidden pattern embedded in the data would be uncovered if we permute the rows or columns appropriately.

The Cheng-Church Algorithm (2000)

The algorithm of Cheng and Church is

I a simple, greedy approach towards finding maximal-sized biclusters satisfying a certain condition

I The input is a matrix A = (aij)

I The rows represent genes

I The columns represent conditions

I The algorithm attempts to find a submatrix B, representing a bicluster.

I The quality of B as a bicluster is measured using the residue score.

The Cheng-Church Algorithm Cont’d

The residue score of an element $a_{ij}$ of the sub-matrix $A_{I,J}$ is

$$a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$

The mean squared residue score for the sub-matrix $A_{I,J}$ is then

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

I $I$ and $J$ are row and column subsets representing a sub-matrix

I $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$ (sub-matrix column $j$ average)

I $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$ (sub-matrix row $i$ average)

I $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}$ (the entire sub-matrix average)
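As a quick illustration of the score, the following sketch computes $H(I, J)$ for a row subset and a column subset of a small matrix. The matrix is made up for the example: its top-left 3 × 3 block is exactly additive (row effect plus column effect), so its score is zero, while the full matrix has a positive score. The function name is hypothetical.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """H(I, J) for the sub-matrix of A given by index lists rows (I) and cols (J)."""
    sub = A[np.ix_(rows, cols)]
    row_means = sub.mean(axis=1, keepdims=True)    # a_{iJ}
    col_means = sub.mean(axis=0, keepdims=True)    # a_{Ij}
    overall = sub.mean()                           # a_{IJ}
    residues = sub - row_means - col_means + overall
    return float(np.mean(residues ** 2))

# Rows = genes, columns = conditions (made-up values)
A = np.array([[1.0, 2.0, 3.0, 9.0],
              [2.0, 3.0, 4.0, 0.0],
              [4.0, 5.0, 6.0, 7.0],
              [8.0, 1.0, 5.0, 2.0]])
h_additive = mean_squared_residue(A, [0, 1, 2], [0, 1, 2])     # additive block
h_full = mean_squared_residue(A, [0, 1, 2, 3], [0, 1, 2, 3])   # whole matrix
```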


δ-biclusters

I The most natural goal would be to find a bicluster minimizing the mean squared residue score

I It is easy to see that the mean squared residue score is 0 iff the submatrix satisfies the assumption, i.e. all residues vanish: $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$ for every element

I However, minimizing the score alone would yield trivial (one gene and one condition) biclusters, and would in general prefer small biclusters

I Therefore, we define $A_{I,J}$ as a δ-bicluster if $H(I, J) \le \delta$, and try to find larger biclusters


Finding δ-biclusters

I Given A and δ, finding the largest δ-bicluster is NP-hard.

The Cheng-Church algorithm

I Employs a greedy heuristic for detecting a large bicluster

I It starts with a sub-matrix identical to the input matrix, and then proceeds with two phases
    I Iterative removal of rows/columns until $H(I, J) < \delta$
    I Iterative addition of rows/columns until no addition is possible without $H$ exceeding δ

I The remaining sub-matrix will be declared a bicluster

I If the remaining sub-matrix is empty, then no δ-bicluster is found

I The removal of a row/column is done by choosing (in every iteration) the row/column which has the maximum contribution to the score $H$ (in effect, the “worst” one)
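The removal phase can be sketched as follows. This covers only the first phase (one row or column removed per iteration); the addition phase and the masking step are left out. The stopping test uses $H \le \delta$, following the definition of a δ-bicluster, and the loop also stops once a single row or column is left, where the score is trivially zero. Function names and the toy matrix are assumptions of this sketch.

```python
import numpy as np

def squared_residues(sub):
    """Element-wise squared residues of a sub-matrix."""
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    return (sub - row_means - col_means + sub.mean()) ** 2

def deletion_phase(A, delta):
    """Greedily remove the worst row or column until H(I, J) <= delta."""
    rows = list(range(A.shape[0]))
    cols = list(range(A.shape[1]))
    while len(rows) > 1 and len(cols) > 1:
        sq = squared_residues(A[np.ix_(rows, cols)])
        if sq.mean() <= delta:               # current sub-matrix is a delta-bicluster
            break
        row_scores = sq.mean(axis=1)         # contribution of each row to H
        col_scores = sq.mean(axis=0)         # contribution of each column to H
        if row_scores.max() >= col_scores.max():
            rows.pop(int(np.argmax(row_scores)))
        else:
            cols.pop(int(np.argmax(col_scores)))
    return rows, cols

# A 4 x 4 additive block (a_ij = i + j) plus one incoherent row
A = np.add.outer(np.arange(4.0), np.arange(4.0))
A = np.vstack([A, [10.0, -10.0, 10.0, -10.0]])
rows, cols = deletion_phase(A, delta=0.1)
```

On this matrix the incoherent last row has by far the largest contribution to $H$, so it is removed first and the remaining 4 × 4 block is returned.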


Finding more than one bicluster

I Note that the algorithm is completely deterministic: consecutive runs of the algorithm on the same matrix will yield the same bicluster

I To find other biclusters, the complete algorithm repeats the process after masking the bicluster found

I Masking is performed by filling the positions of the bicluster with random values

I The new random values will probably not form any recognizable pattern


Shortcomings of the Cheng-Church algorithm

I The results are not assigned a statistical-significance value

I Since δ is constant, given a large enough initial matrix we are almost guaranteed to find a random bicluster of arbitrary size satisfying the condition

I The greedy nature of this algorithm clearly does not guarantee convergence to a globally optimal solution

I The masking technique would seriously reduce the chance of finding biclusters with any overlap (these overlaps may be a natural result of a gene having more than one function)






Partitioning around Medoids

PAM (Partitioning around Medoids) is a k-partitioning approach (Kaufman and Rousseeuw, 1990).

I The algorithm finds a representative object for each cluster, the medoid, which is the multidimensional version of the median

I Tries to minimize the total cost $\sum_{r} d(\hat{x}, x_r)$, the sum of the distances between each data point $x_r$ and its medoid $\hat{x}$

I PAM finds a local minimum of the objective function

PAM Cont’d

The PAM algorithm

1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point with the closest medoid (“closest” here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost

Repeat steps 2 to 4 until there is no change in the medoids.
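A compact sketch of these steps, working directly on a precomputed distance matrix, is given below. It is a plain illustration rather than an optimized implementation: in each sweep it evaluates every (medoid, non-medoid) swap and keeps the cheapest one, and it stops when no swap lowers the total cost. The function name, the `seed` parameter, and the one-dimensional toy data are assumptions of this sketch.

```python
import numpy as np

def pam(D, k, seed=0):
    """Partitioning around medoids on an (n, n) distance matrix D.

    Returns the list of medoid indices and the cluster label of each point.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))   # step 1: random medoids

    def total_cost(meds):
        # step 2: each point is assigned to its closest medoid
        return D[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    improved = True
    while improved:
        improved = False
        best_swap = None
        for pos in range(k):                 # step 3: try every swap (m, o)
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[pos] = o
                cost = total_cost(candidate)
                if cost < best - 1e-12:      # step 4: remember the cheapest configuration
                    best, best_swap = cost, candidate
        if best_swap is not None:
            medoids = best_swap
            improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Six points on a line forming two groups of three
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(pts[:, None] - pts[None, :])
medoids, labels = pam(D, k=2)
```

For this toy data the only configuration that no single swap can improve is the pair of middle points (indices 1 and 4), so the result does not depend on the random initialization.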

A hierarchy of clinical parameters of MicrObese data







I Traditionally: Unsupervised and supervised learning

I Semi-supervised learning: halfway between supervised and unsupervised learning

I Semi-supervised learning with constraints: “these points have (or do not have) the same target”

I A problem related to SSL was introduced by V. Vapnik: transductive learning, where predictions are made for the test points only

A Brief History of Semi-Supervised Learning

I Self-learning: the earliest idea of SSL
    I A supervised method is used repeatedly
    I It starts by training on the labeled data only; then labels unlabeled data, etc.

I Transductive inference
    I Vapnik’s principle: when trying to solve some problem, one should not solve a more difficult problem as an intermediate step
    I No general decision rule is inferred
    I E.g. a combinatorial optimization on the labels of the test points in order to maximize the likelihood of their model

I Mixture of Gaussians
    I The likelihood of the model is maximized using the labeled and unlabeled data with the help of an iterative algorithm such as Expectation-Maximization
    I Instead of a mixture of Gaussians, use a mixture of multinomial distributions

A Brief History of Semi-Supervised Learning Cont’d

I Theoretical analysis
    I Learning rates exist for SSL of a mixture of two Gaussians: the probability of error converges exponentially to the Bayes risk

I Text applications and natural language processing

When Can Semi-Supervised Learning Work?

In comparison with a supervised algorithm, can one hope to have a more accurate prediction by taking into account the unlabeled data?

I Prerequisite: the distribution of examples is relevant to the classification problem

I In a more mathematical formulation: the knowledge of p(x) has to carry information that is useful in the inference of p(y|x)

When Can Semi-Supervised Learning Work?

The four assumptions:

I Smoothness assumption: If two points x1 and x2 are close, then so should be the corresponding y1 and y2

I Cluster assumption: If points are in the same cluster, they are likely to be of the same class

I Low density separation: The decision boundary should lie in a low-density region

I Manifold assumption: The (high-dimensional) data lie (roughly) on a low-dimensional manifold

Classes of Semi-Supervised Learning Algorithms

I Generative models
    I A generative model models p(y, x), and any additional information on p(x) is useful
    I It can be seen as classification with additional information on the marginal density
    I It can be seen as clustering with additional information
    I Advantage: Knowledge of the structure can be incorporated

I Low-density separation: an SVM
    I The most common approach: a maximum margin algorithm such as SVM
    I The method of maximizing the margin for unlabeled as well as labeled points is called the transductive SVM
    I The corresponding problem is non-convex, and thus difficult to optimize

I Low-density separation: entropy minimization
    I Encourage the class-conditional probability p(y|x) to be close to 1 or to 0 at labeled and unlabeled points




I Graph-based methods
    I Data are represented by the nodes of a graph, the edges of which are labeled with the pairwise distances of the incident nodes
    I Most graph methods use the graph Laplacian
    I Many graph methods penalize nonsmoothness along the edges
    I Intrinsically transductive and inductive algorithms
    I Information propagation on the graph

I Change of Representation: two-step learning
    I Change representation: perform an unsupervised step on all data, and construct a new metric
    I Ignore the unlabeled data and perform supervised learning using the new representation



Hypothesis and Notations

Notations:

I Xi observation

I Yi label

I n the number of observation pairs

I π(x , y) the joint probability

I η(y |x) the conditional probability

I q(x) the marginal probability of observations

The hypothesis:

I The marginal probability q(x) is completely known

I X and Y are finite

Semi-Supervised Probabilistic Criterion

$\{X_i, Y_i\}_{i=1}^{n}$ are observations and their labels.

Let $g(y|x; \theta)$ be the conditional probability function, parameterized by θ. Then the standard conditional maximum likelihood estimator is defined by

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i | X_i; \theta),$$

where $\ell(y|x; \theta) = -\log g(y|x; \theta)$ denotes the negated conditional log-likelihood function.

The asymptotically optimal semi-supervised estimator $\hat{\theta}_n^s$ is defined by

$$\hat{\theta}_n^s = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \frac{q(X_i)}{\sum_{j=1}^{n} \mathbb{1}\{X_j = X_i\}} \ell(Y_i | X_i; \theta),$$

where $q(x)$ is the marginal probability of observations.
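The difference between the two criteria lies entirely in the per-observation weights: $1/n$ for the standard estimator, and $q(X_i)$ divided by the number of times the value $X_i$ occurs in the sample for the semi-supervised one. The sketch below computes these weights for a finite observation space, using only the standard library. The toy sample, the marginal `q`, and the function names are made up for illustration.

```python
import math
from collections import Counter

def semi_supervised_weights(xs, q):
    """Weight q(X_i) / #{j : X_j = X_i} for each observation X_i."""
    counts = Counter(xs)
    return [q[x] / counts[x] for x in xs]

def weighted_nll(weights, probs):
    """Weighted criterion: sum_i w_i * (-log g(Y_i | X_i; theta)).

    probs[i] stands for g(Y_i | X_i; theta) under some fixed theta.
    """
    return sum(w * -math.log(p) for w, p in zip(weights, probs))

# Toy sample over a finite observation space {"a", "b", "c"}
xs = ["a", "a", "b", "a", "c"]
q = {"a": 0.5, "b": 0.3, "c": 0.2}    # assumed known marginal q(x)

weights = semi_supervised_weights(xs, q)
uniform = [1.0 / len(xs)] * len(xs)   # weights of the standard estimator
probs = [0.9, 0.8, 0.6, 0.7, 0.5]     # hypothetical model probabilities
criterion_ss = weighted_nll(weights, probs)
criterion_ml = weighted_nll(uniform, probs)
```

In this sample the value "a" is observed three times, so each of its occurrences receives weight 0.5 / 3, while the single occurrences of "b" and "c" receive their full marginal mass.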

Problem of the Covariate Shift

Covariate Shift

Let us learn an estimator from $(X_1, Y_1), \dots, (X_n, Y_n)$, where the distribution of $X_i$ is defined by $q_0(x)$. How to adapt the estimator if the test data $X_i$ are distributed according to $q_1(x) \ne q_0(x)$?

I If $q_1$ is known, the weights of the semi-supervised estimator (with $q = q_1$) are asymptotically equivalent to $\frac{1}{n} \frac{q_1}{q_0}(X_i)$ and the algorithm converges to

$$\theta_1^\star = \arg\min_{\theta \in \Theta} E_{\pi_1}[\ell(Y|X; \theta)].$$

I The covariance matrix is smaller than that of the estimator weighted by an importance ratio

$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \frac{q_1}{q_0}(X_i) \, \ell(Y_i | X_i; \theta)$$

(which requires $q_0$ to be known).

Experiments with logistic regression

[Figure: two boxplot panels. x-axis: n ∈ {10, 50, 100, 500, 1000, 5000}; y-axis: logarithmic risk (× n), from 0 to 12.]

Boxplots of the scaled excess risk as a function of the number of observations in the presence of the covariate shift.
Left: Shimodaira criterion, $n \, (E_\pi[\ell(Y|X; \hat{\theta}_n)] - E_\pi[\ell(Y|X; \theta^\star)])$;
Right: semi-supervised estimator, $n \, (E_\pi[\ell(Y|X; \hat{\theta}_n^s)] - E_\pi[\ell(Y|X; \theta^\star)])$.

Applications to real problems

In realistic applications (binary text classification), we cannot assume that the true $q(x)$ is known.

We propose an approach based on clustering.

How to “estimate $q(x)$”? The set of unlabeled data is divided into k clusters, and in the expression of the weight

$$\frac{q(X_i)}{\sum_{j=1}^{n} \mathbb{1}\{X_j = X_i\}}$$

the numerator is replaced by the empirical frequency of the cluster which contains $X_i$; the denominator is replaced by the number of training points which are in the same cluster as $X_i$.
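The replacement rule can be written down directly once cluster assignments are available. The sketch below assumes the clustering has already been done by some algorithm (k-means, for instance) and that every point is represented only by its cluster id; the function name and the toy assignments are hypothetical.

```python
from collections import Counter

def cluster_based_weights(train_clusters, unlabeled_clusters):
    """Approximate weights when q(x) is unknown.

    train_clusters[i]  : cluster id of training point X_i
    unlabeled_clusters : cluster ids of all unlabeled points
    """
    cluster_freq = Counter(unlabeled_clusters)     # cluster sizes in the unlabeled set
    n_unlabeled = len(unlabeled_clusters)
    train_counts = Counter(train_clusters)         # training points per cluster
    return [
        (cluster_freq[c] / n_unlabeled) / train_counts[c]
        for c in train_clusters
    ]

# 10 unlabeled points: 6 in cluster 0 and 4 in cluster 1
unlabeled_clusters = [0] * 6 + [1] * 4
# 3 training points: one in cluster 0, two in cluster 1
train_clusters = [0, 1, 1]
weights = cluster_based_weights(train_clusters, unlabeled_clusters)
```

Here the single training point of the larger cluster gets weight 0.6, and the two training points of the smaller cluster share its frequency 0.4 equally.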






Canonical Correlation Analysis: Motivation

I Canonical correlation analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units

I CCA is most appropriate when a researcher desires to examine the relationship between two sets of variables

I The method was first introduced by Harold Hotelling in 1936

Canonical Correlation Analysis: How?

I X and Y are matrices of order n × p and n × q

I The columns correspond to variables and the rows correspond to experimental units (patients)


I Find two vectors $a$ and $b$ that maximize the correlation between the linear combinations

$$U = a_1 X^1 + a_2 X^2 + \dots + a_p X^p$$

$$V = b_1 Y^1 + b_2 Y^2 + \dots + b_q Y^q$$

where $X^j$ and $Y^j$ denote the j-th columns of $X$ and $Y$

I The problem consists in solving

$$\rho = \mathrm{cor}(U, V) = \max_{a, b} \mathrm{cor}(Xa, Yb)$$

Canonical correlations ρ are the positive square roots of the eigenvalues λ of $P_X P_Y$ ($\rho = \sqrt{\lambda}$), where

$$P_X = X (X^T X)^{-1} X^T$$

$$P_Y = Y (Y^T Y)^{-1} Y^T$$
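The eigenvalue characterization translates almost literally into NumPy. The sketch below assumes that the columns of $X$ and $Y$ are centered before the projections are formed (needed for the result to be a correlation) and that $X^T X$ and $Y^T Y$ are invertible; both are assumptions of this example. Forming the n × n projection matrices is fine for small n, but is not how one would compute CCA on large data.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations as square roots of the eigenvalues of P_X P_Y."""
    X = X - X.mean(axis=0)                          # center the columns (assumption)
    Y = Y - Y.mean(axis=0)
    P_X = X @ np.linalg.solve(X.T @ X, X.T)         # X (X^T X)^{-1} X^T
    P_Y = Y @ np.linalg.solve(Y.T @ Y, Y.T)         # Y (Y^T Y)^{-1} Y^T
    eigvals = np.linalg.eigvals(P_X @ P_Y).real     # real in theory; drop rounding noise
    eigvals = np.sort(np.clip(eigvals, 0.0, 1.0))[::-1]
    r = min(X.shape[1], Y.shape[1])                 # number of canonical correlations
    return np.sqrt(eigvals[:r])

# One variable per set: the canonical correlation is |Pearson correlation|
X1 = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y1 = np.array([[2.0], [1.0], [4.0], [3.0], [6.0]])
rho1 = canonical_correlations(X1, Y1)

# Y2 is an invertible linear transform of X2: all canonical correlations equal 1
X2 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0], [4.0, 2.0]])
Y2 = X2 @ np.array([[1.0, 1.0], [0.0, 1.0]])
rho2 = canonical_correlations(X2, Y2)
```

With a single variable in each set, the result reduces to the absolute value of the ordinary Pearson correlation, which gives a simple sanity check for the formula.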



How to Interpret the Results?

I Consider canonical correlation values
    I The canonical correlation coefficient is the Pearson correlation between the two synthetic variables on a given canonical function. Because of the scaling created by the standardized weights in the linear equations, this value cannot be negative and only ranges from 0 to 1.

I Consider coefficients
    I The results of canonical correlation are usually visualized through bar plots of the coefficients of the two sets of variables for the pairs of canonical variates showing significant correlation.

Example: 12 sets of features

1. PA, psychological, and three factor eating questionnaires

2. Body composition

3. Metabolic rate and blood pressure

4. Blood lipids

5. Glucose homeostasis and insulin sensitivity

6. Adiponekines

7. Kidney function

8. Fecal microbiota abundance, qPCR

9. Systemic inflammation and chemokines

10. Adipose tissue macrophage markers

11. Nutrient intake

12. Food intake

Canonical Correlation Values

[Figure: matrix plot of the canonical correlation values between the 12 sets of features, with a color scale from −1 to 1. Row and column labels: Psycho, Bodycomp, Metab_BloodPr, Blood_lip, Glucose_Ins, Adiponek, Kidney, Fecal_Micro, Chemokin, Adipose, Nutrien, Food.]

How to interpret the results?

I Structure coefficients are critical for deciding what variables are useful for the model

I Bar plots of the coefficients of the two sets of variables for the pairs of canonical variates showing significant correlation.

I Coefficients increase in importance when the observed variables in the model increase in their correlation with each other

Canonical Correlation Example: Blood lipids / Glucose homeostasis and insulin sensitivity

[Figure: helio plot for canonical variate 1, with one panel for the X variables and one for the Y variables. Variable labels: Chol_meta, TG_meta, HDL_meta, LDL_meta, TC_HDL, NHDL, AGL_meta, Gly_meta, Ins_meta, HOMA_B_meta, HOMA_S_meta, HOMA_IR_meta, Disse_meta, QUICKI_meta, revised_QUICKI_meta, IGR_meta, Mcauley_meta, FIRI_clin.]