Semi-Supervised Learning and Data Exploration
Nataliya Sokolovska
Sorbonne University, Paris, France
Master 2 BIM, November 29, 2019
Outline
Patients Stratification and Methods of Personalized Medicine
An application: Obesity stratification based on metagenomics
  Weight Loss Prediction
Some (Fancy) Clustering Methods
  Some widely used Clustering Methods
  Spectral Clustering
  Biclustering
  Robust Clustering
Semi-Supervised Learning
Canonical Correlation: Correlation between Sets of Variables
What is Metagenomics?
I Metagenome
  I can be defined as the ensemble of the genomes of the microbes from a given ecological niche
I Metagenomics
  I allows one to characterize the composition, properties, and dynamics of a microbiome by studying the metagenome
Obesity stratification based on metagenomics
MicroObese Study
MetaHIT Study
MicroObese Study
Obesity stratification based on metagenomics
I Gut microbial gene richness can influence the outcome of a dietary intervention
I A quantitative metagenomic analysis stratified patients into two groups: a low gene count (LGC) group and a high gene count (HGC) group
I The LGC individuals appeared to have increased blood triglycerides, higher insulin resistance, and low-grade inflammation; gene richness is therefore strongly associated with obesity-driven diseases.
I Individuals in the low gene count group seemed to have an increased risk of developing obesity-related cardiometabolic complications compared to patients in the high gene count group.
Stratification of Dutch individuals
I E. Le Chatelier et al., 2011 conducted a similar study with Dutch individuals and reached a similar conclusion: there is hope that a diet can be used to induce a permanent change of the gut microbiota, and that treatment should be phenotype-specific.
I A particular diet is able to increase gene richness: an increase in gene count was observed in the LGC patients after a 6-week energy-restricted diet
PCA: example on real data
Weight Loss Prediction
The clustering problem
I Motivation: find patterns in a sea of data
I Input
  I A large number of data points
  I A measure of distance between any two points
I Output
  I Grouping (clustering) of the elements into K similarity clusters
I Clustering is useful for
  I Similarity/dissimilarity analysis
  I Dimensionality reduction
Some widely used Clustering Methods
I Hierarchical clustering
I K-means
I Probabilistic methods of clustering (Mixtures of Gaussians, EM)
Spectral Clustering
I U. von Luxburg, "A tutorial on spectral clustering", Statistics and Computing, 2007
I One of the most popular clustering algorithms
I It can be proved that, under very mild conditions, spectral clustering algorithms are statistically consistent. This means that if we assume that the data have been sampled randomly according to some probability distribution from some underlying space, and if we let the sample size increase to infinity, then the results of clustering converge (these results do not necessarily hold for unnormalized spectral clustering).
Graph notation and similarity graphs
If we do not have more information than similarities between data points, a nice way of representing the data is in the form of a similarity graph. The vertices represent the data points. Two vertices are connected if the similarity between the corresponding data points is positive (or larger than a certain threshold), and the edge is weighted by the similarity.
Graphs and Cluster Assumption
The problem of clustering: we want to find a partition of the graph such that the edges between different groups have a very low weight.
"Cluster assumption": two points are likely to have the same class label if there is a path connecting them passing through regions of high density only. Or, the decision boundary should lie in regions of low density.
Graph notations
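I G = (V, E) is an undirected graph
I The graph is weighted: each edge between two vertices $v_i$ and $v_j$ has a weight $w_{ij} > 0$
I The weighted adjacency matrix is $W = (w_{ij})_{i,j=1,\dots,n}$ ($w_{ij} = 0$ means that the vertices $v_i$ and $v_j$ are not connected)
I The graph is undirected, so $w_{ij} = w_{ji}$
I The degree of a vertex $v_i$ is defined as $d_i = \sum_{j=1}^{n} w_{ij}$
I The degree matrix D is the diagonal matrix with the degrees $d_1, \dots, d_n$ on the diagonal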
Graph notations Cont’d
I A subset of vertices A
I Two ways of measuring the size of A
  I |A|: the number of vertices in A
  I $\mathrm{vol}(A) = \sum_{i \in A} d_i$: measures the size of A by the weights of its edges
I A subset A is connected if any two vertices in A can be joined by a path such that all intermediate points also lie in A.
Different similarity graphs (used in Spectral Clustering)
There are several popular constructions to transform a given set of data points into a graph. Most of them lead to a sparse representation ⇒ computational advantages.
I The ε-neighborhood graph. We connect all points whose pairwise distances are smaller than ε. Usually considered as an unweighted graph.
I k-nearest neighbor graphs. We connect vertex $v_i$ with vertex $v_j$ if $v_j$ is among the k nearest neighbors of $v_i$.
I The fully connected graph. We connect all points with positive similarity with each other, and we weight the edges by $s_{ij}$. The graph should model the local neighborhood relationships. An example of a similarity function is the Gaussian similarity function
$$s(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right).$$
The parameter σ controls the width of the neighborhoods.
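A minimal sketch of two of these constructions in plain Python. The function names and the toy points are illustrative assumptions. The k-nearest neighbor graph is symmetrized here (an edge is kept if either vertex is among the k nearest neighbors of the other), which is one common choice and not the only one.

```python
import math

def gaussian_similarity(x, y, sigma=1.0):
    """s(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def fully_connected_graph(points, sigma=1.0):
    """Weighted adjacency matrix with w_ij = s(x_i, x_j) and a zero diagonal."""
    n = len(points)
    return [[0.0 if i == j else gaussian_similarity(points[i], points[j], sigma)
             for j in range(n)] for i in range(n)]

def knn_graph(points, k, sigma=1.0):
    """Symmetrized k-nearest-neighbor graph, edges weighted by similarity."""
    n = len(points)
    S = fully_connected_graph(points, sigma)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The largest similarity corresponds to the smallest distance.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: S[i][j], reverse=True)[:k]
        for j in neighbors:
            W[i][j] = S[i][j]
            W[j][i] = S[i][j]  # ignore edge direction
    return W

# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
W = knn_graph(points, k=2, sigma=1.0)
```

With k = 2 the two groups end up in separate connected components, which is the sparse structure spectral clustering exploits.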
Graph Laplacians
I The main tools for spectral clustering are graph Laplacian matrices
I In the literature, there is no unique convention on which matrix exactly is called the "graph Laplacian"
I The unnormalized graph Laplacian matrix is defined as
$$L = D - W.$$
I The normalized (symmetric) Laplacian is defined as
$$L_{sym} = D^{-1/2} (D - W) D^{-1/2}.$$
Properties of L
I For every vector $f \in \mathbb{R}^n$ we have
$$f^\top L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$$
I L is symmetric and positive semi-definite
I The smallest eigenvalue of L is 0; the corresponding eigenvector is the constant one vector $\mathbb{1}$.
I L has n non-negative, real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
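The first and third properties can be checked numerically on a small graph. The sketch below uses a hypothetical 3-vertex weight matrix and plain Python lists; it is only an illustration of the identities, not a proof.

```python
def unnormalized_laplacian(W):
    """L = D - W, where D is the diagonal matrix of vertex degrees."""
    n = len(W)
    d = [sum(row) for row in W]
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(L, f):
    """f' L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_penalty(W, f):
    """(1/2) * sum_{i,j} w_ij (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                     for i in range(n) for j in range(n))

# Hypothetical symmetric weight matrix and test vector.
W = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 0.5],
     [1.0, 0.5, 0.0]]
f = [1.0, -2.0, 3.0]

L = unnormalized_laplacian(W)
lhs = quadratic_form(L, f)            # f' L f
rhs = pairwise_penalty(W, f)          # the weighted sum of squared differences
ones_image = [sum(row) for row in L]  # L applied to the constant one vector
```

Both sides equal 34.5 for this example, and every row of L sums to zero, so the constant vector is an eigenvector with eigenvalue 0.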
Unnormalized Spectral Clustering
I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number k of clusters to construct
  I Construct a similarity graph; W is its weighted adjacency matrix
  I Compute the unnormalized Laplacian L
  I Compute the first k eigenvectors $v_1, \dots, v_k$ of L
  I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
  I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of V
  I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$
I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.
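The steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: the weight matrix W is given directly (the similarity-graph construction is skipped), the k-means step is a plain Lloyd iteration written for this example, and the toy graph is hypothetical.

```python
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of Y."""
    rng = np.random.default_rng(seed)
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Y[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def unnormalized_spectral_clustering(W, k):
    D = np.diag(W.sum(axis=1))      # degree matrix
    L = D - W                       # unnormalized Laplacian
    # eigh returns the eigenvalues of a symmetric matrix in ascending order,
    # so the first k columns are the "first k eigenvectors".
    _, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, :k]
    return kmeans(V, k)             # rows of V are the points y_i

# Toy graph: two triangles joined by one weak edge (weight 0.01).
W = np.array([
    [0, 1, 1, 0,    0, 0],
    [1, 0, 1, 0,    0, 0],
    [1, 1, 0, 0.01, 0, 0],
    [0, 0, 0.01, 0, 1, 1],
    [0, 0, 0, 1,    0, 1],
    [0, 0, 0, 1,    1, 0],
], dtype=float)
labels = unnormalized_spectral_clustering(W, k=2)
```

On this graph the second eigenvector takes values of opposite sign on the two triangles, so k-means separates vertices 0, 1, 2 from vertices 3, 4, 5.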
Normalized Spectral Clustering (Shi and Malik, 2000)
I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number k of clusters to construct
  I Construct a similarity graph; W is its weighted adjacency matrix
  I Compute the unnormalized Laplacian L
  I Compute the first k eigenvectors $v_1, \dots, v_k$ of the generalized eigenproblem $Lv = \lambda Dv$
  I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
  I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of V
  I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$
I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.
Normalized spectral clustering (Ng, Jordan, and Weiss, 2002)
I Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number k of clusters to construct
  I Construct a similarity graph; W is its weighted adjacency matrix
  I Compute the normalized Laplacian $L_{sym}$
  I Compute the first k eigenvectors $v_1, \dots, v_k$ of $L_{sym}$
  I Let $V \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $v_1, \dots, v_k$ as columns
  I Form the matrix $U \in \mathbb{R}^{n \times k}$ from V by normalizing the rows to norm 1, that is, $u_{ij} = v_{ij} / \left(\sum_{k} v_{ik}^2\right)^{1/2}$
  I For $i = 1, \dots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of U
  I Cluster the points $(y_i)_{i=1,\dots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \dots, C_k$
I Output: Clusters $A_1, \dots, A_k$ with $A_i = \{j \mid y_j \in C_i\}$.
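The only step that distinguishes this variant from the previous ones, apart from the choice of Laplacian, is the row normalization. A small sketch of that step alone, with a hypothetical eigenvector matrix V (leaving zero rows unchanged is an assumption made here to avoid division by zero):

```python
import math

def normalize_rows(V):
    """u_ij = v_ij / sqrt(sum_k v_ik^2); zero rows are returned unchanged."""
    U = []
    for row in V:
        norm = math.sqrt(sum(v * v for v in row))
        U.append([v / norm for v in row] if norm > 0 else list(row))
    return U

# Hypothetical n x k matrix of eigenvectors (n = 3, k = 2).
V = [[3.0, 4.0],
     [0.5, 0.0],
     [-1.0, 1.0]]
U = normalize_rows(V)
row_norms = [math.sqrt(sum(u * u for u in row)) for row in U]
```

After this step every point y_i lies on the unit sphere in R^k, which is the space in which k-means is then run.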
Biclustering
I Simultaneous clustering of both rows and columns of a data matrix
I Identifies groups of genes with similar/coherent expression patterns under a specific subset of conditions
Why biclustering and not just clustering?
Biclustering is the key technique to use when
I Only a small number of the genes participate in a cellular process of interest
I An interesting cellular process is active only in a subset of the conditions
I A single gene may participate in multiple pathways that may or may not be co-active under all conditions
Biclustering: motivation
Gan et al., Discovering biclusters in gene expression data based on high-dimensional linear geometries, BMC Bioinformatics 2008
An illustrative example where conventional clustering fails but biclustering works: (a) A data matrix, which appears random visually even after hierarchical clustering. (b) A hidden pattern embedded in the data would be uncovered if we permute the rows or columns appropriately.
The Cheng-Church Algorithm (2000)
The algorithm of Cheng and Church is
I a simple, greedy approach towards finding maximal-sized biclusters satisfying a certain condition
I The input is a matrix A = (aij)
I The rows represent genes
I The columns represent conditions
I The algorithm attempts to find a submatrix B, representing a bicluster.
I The quality of B as a bicluster is measured using the residue score.
The Cheng-Church Algorithm Cont’d
The residue score of an element $a_{ij}$ of the sub-matrix $A_{IJ}$ is
$$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}.$$
The mean squared residue score for the sub-matrix $A_{IJ}$ is then
$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2.$$
I I and J are row and column subsets representing a sub-matrix
I $a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$ (sub-matrix column j average)
I $a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}$ (sub-matrix row i average)
I $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}$ (the entire sub-matrix average)
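The score H(I, J) can be computed directly from these definitions. The sketch below uses a hypothetical 3 x 3 matrix in which the first two rows differ only by a constant shift; the function name is an illustrative choice.

```python
def mean_squared_residue(A, rows, cols):
    """H(I, J) for the sub-matrix of A indexed by rows (I) and cols (J)."""
    n_r, n_c = len(rows), len(cols)
    row_mean = {i: sum(A[i][j] for j in cols) / n_c for i in rows}        # a_iJ
    col_mean = {j: sum(A[i][j] for i in rows) / n_r for j in cols}        # a_Ij
    total_mean = sum(A[i][j] for i in rows for j in cols) / (n_r * n_c)   # a_IJ
    return sum((A[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
               for i in rows for j in cols) / (n_r * n_c)

# Hypothetical expression matrix: row 1 is row 0 shifted by +1, row 2 is not coherent.
A = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [5.0, 6.0, 0.0]]

h_coherent = mean_squared_residue(A, rows=[0, 1], cols=[0, 1, 2])
h_full = mean_squared_residue(A, rows=[0, 1, 2], cols=[0, 1, 2])
```

The coherent sub-matrix (rows 0 and 1) has score 0, while including the incoherent row 2 gives a strictly positive score.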
The Cheng-Church Algorithm Cont’d
δ-biclusters
I The most natural goal would be to find a bicluster minimizing the mean squared residue score
I The mean squared residue score is 0 iff every residue vanishes, that is, iff $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$ for all entries of the submatrix
I However, minimizing the score alone would yield trivial (one gene and one condition) biclusters, and would in general prefer small biclusters
I Therefore, we define $A_{IJ}$ as a δ-bicluster if H(I, J) ≤ δ, and try to find larger biclusters
The Cheng-Church Algorithm Cont’d
Finding δ-biclusters
I Given A and δ, finding the largest δ-bicluster is NP-hard.
The Cheng-Church algorithm
I Employs a greedy heuristic for detecting a large bicluster
I It starts with a sub-matrix identical to the input matrix, and then proceeds with two phases
  I Iterative removal of rows/columns until H(I, J) ≤ δ
  I Iterative addition of rows/columns until no addition is possible without H exceeding δ
I The remaining sub-matrix will be declared a bicluster
I If the remaining sub-matrix is empty, then no δ-bicluster is found
I The removal of a row/column is done by choosing (in every iteration) the row/column which has the maximum contribution to the score H (in effect, the "worst" one)
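A sketch of the removal phase only, assuming the contribution of a row (column) is measured by the mean of its squared residues. The addition phase and the masking step are omitted, and the matrix is a hypothetical example, so this is an illustration of the greedy idea and not the complete algorithm.

```python
def residues(A, rows, cols):
    """Residue r_ij of every entry of the sub-matrix (rows, cols)."""
    n_r, n_c = len(rows), len(cols)
    row_mean = {i: sum(A[i][j] for j in cols) / n_c for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / n_r for j in cols}
    total = sum(A[i][j] for i in rows for j in cols) / (n_r * n_c)
    return {(i, j): A[i][j] - row_mean[i] - col_mean[j] + total
            for i in rows for j in cols}

def single_node_deletion(A, delta):
    """Remove the worst row or column until H(I, J) <= delta."""
    rows = list(range(len(A)))
    cols = list(range(len(A[0])))
    while rows and cols:
        r = residues(A, rows, cols)
        H = sum(v * v for v in r.values()) / (len(rows) * len(cols))
        if H <= delta:
            return rows, cols, H
        # Contribution of each row / column to the score H.
        row_score = {i: sum(r[i, j] ** 2 for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(r[i, j] ** 2 for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return [], [], None  # no delta-bicluster found

# Hypothetical matrix: rows 0-2 follow an additive pattern, row 3 does not.
A = [[1.0, 2.0, 3.0],
     [2.0, 3.0, 4.0],
     [4.0, 5.0, 6.0],
     [9.0, 0.0, 5.0]]
rows, cols, H = single_node_deletion(A, delta=0.01)
```

On this example the first iteration removes row 3, whose contribution to H is the largest, and the remaining 3 x 3 sub-matrix already satisfies the δ condition.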
The Cheng-Church Algorithm Cont’d
Finding more than one bicluster
I Note that the algorithm is completely deterministic: consecutive runs of the algorithm on the same matrix will yield the same bicluster
I To find other biclusters, the complete algorithm repeats the process after masking the bicluster found
I Masking is performed by filling the positions of the bicluster with random values
I The new random values will probably not form any recognizable pattern
The Cheng-Church Algorithm Cont’d
Shortcomings of the Cheng-Church algorithm
I The results are not assigned a statistical-significance value
I Since δ is constant, given a large enough initial matrix, we are almost guaranteed to find a random bicluster of arbitrary size satisfying the condition
I The greedy nature of this algorithm clearly does not guarantee convergence to a globally optimal solution
I The masking technique would seriously reduce the chance of finding biclusters with any overlap (these overlaps may be a natural result of a gene having more than one function)
Partitioning around Medoids
PAM (Partitioning around Medoids) is a k-partitioning approach (Kaufman and Rousseeuw, 1990).
I The algorithm finds the representative object of each cluster, the medoid, which is the multidimensional version of the median
I Tries to minimize the total cost
$$\sum_{r} d(\hat{x}, x_r),$$
where $\hat{x}$ denotes the medoid closest to the data point $x_r$
I PAM finds a local minimum of the objective function
PAM Cont’d
The PAM algorithm
1. Initialize: randomly select k of the n data points as the medoids
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance)
3. For each medoid m
     For each non-medoid data point o
       Swap m and o and compute the total cost of the configuration
4. Select the configuration with the lowest cost
Repeat steps 2 to 4 until there is no change in the medoids.
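The loop above can be sketched in plain Python as follows. The function names, the one-dimensional toy data, and the distance function are illustrative assumptions; the swap search is exhaustive, so the sketch is only suitable for small data sets.

```python
import random

def total_cost(points, medoids, dist):
    """Sum over all data points of the distance to their closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, dist, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)   # step 1: random initial medoids
    cost = total_cost(points, medoids, dist)      # step 2 is implicit in total_cost
    while True:
        best_medoids, best_cost = None, cost
        for pos in range(k):                      # step 3: every medoid m ...
            for o in range(len(points)):          # ... and every non-medoid o
                if o in medoids:
                    continue
                candidate = medoids[:pos] + [o] + medoids[pos + 1:]
                candidate_cost = total_cost(points, candidate, dist)
                if candidate_cost < best_cost:
                    best_medoids, best_cost = candidate, candidate_cost
        if best_medoids is None:                  # no swap improves: local minimum
            break
        medoids, cost = best_medoids, best_cost   # step 4: keep the cheapest swap
    # Label each point with the index of its closest medoid.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return sorted(medoids), labels, cost

def manhattan_1d(a, b):
    return abs(a - b)

# Hypothetical data: two groups on the real line.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, labels, cost = pam(points, k=2, dist=manhattan_1d)
```

For this data set the only configuration that no swap can improve uses the points 2.0 and 11.0 (indices 1 and 4) as medoids, with total cost 4.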
A hierarchy of clinical parameters of MicrObese data
Semi-Supervised Learning
I Traditionally: Unsupervised and supervised learning
I Semi-supervised learning: halfway between supervised and unsupervised learning
I Semi-supervised learning with constraints: "these points have (or do not have) the same target"
I A problem related to SSL was introduced by V. Vapnik: transductive learning, where predictions are made for the test points only
A Brief History of Semi-Supervised Learning
I Self-learning: the earliest idea of SSL
  I Repeatedly use a supervised method.
  I It starts by training on the labeled data only; it then labels unlabeled data, etc.
I Transductive inference
  I Vapnik's principle: when trying to solve some problem, one should not solve a more difficult problem as an intermediate step
  I No general decision rule is inferred
  I E.g. a combinatorial optimization on the labels of the test points in order to maximize the likelihood of their model
I Mixture of Gaussians
  I The likelihood of the model is maximized using the labeled and unlabeled data with the help of an iterative algorithm such as Expectation-Maximization
  I Instead of a mixture of Gaussians, use a mixture of multinomial distributions
A Brief History of Semi-Supervised Learning Cont’d
I Theoretical analysis
  I Learning rates exist for SSL of a mixture of two Gaussians: the probability of error converges exponentially to the Bayes risk
I Text applications and natural language processing
When Can Semi-Supervised Learning Work?
In comparison with a supervised algorithm, can one hope to have a more accurate prediction by taking into account the unlabeled data?
I Prerequisite: the distribution of examples is relevant to the classification problem
I In a more mathematical formulation: the knowledge of p(x) has to carry information that is useful in the inference of p(y|x)
When Can Semi-Supervised Learning Work?
The four assumptions:
I Smoothness assumption: if two points x1 and x2 are close, then so should be the corresponding outputs y1 and y2
I Cluster assumption: if points are in the same cluster, they are likely to be of the same class
I Low density separation: the decision boundary should lie in a low-density region
I Manifold assumption: the (high-dimensional) data lie (roughly) on a low-dimensional manifold
Classes of Semi-Supervised Learning Algorithm
I Generative models
  I A generative model models p(y, x), and any additional information on p(x) is useful
  I It can be seen as classification with additional information on the marginal density
  I It can be seen as clustering with additional information
  I Advantage: knowledge of the structure can be incorporated
I Low-density separation: an SVM
  I The most common approach is a maximum margin algorithm such as the SVM
  I The method of maximizing the margin for unlabeled as well as labeled points is called the transductive SVM
  I The corresponding problem is non-convex, and thus difficult to optimize
I Low-density separation: entropy minimization
  I Encourage the class-conditional probability p(y|x) to be close to 1 or to 0 at labeled and unlabeled points
Classes of Semi-Supervised Learning Algorithm
I Graph-based methods
  I Data are represented by the nodes of a graph, the edges of which are labeled with the pairwise distances of the incident nodes
  I Most graph methods use the graph Laplacian
  I Many graph methods penalize nonsmoothness along the edges
  I Intrinsically transductive and inductive algorithms
  I Information propagation on the graph
I Change of representation: two-step learning
  I Change representation: perform an unsupervised step on all data, and construct a new metric
  I Ignore the unlabeled data and perform supervised learning using the new representation
Hypothesis and Notations
Notations:
I Xi observation
I Yi label
I n the number of observation pairs
I π(x , y) the joint probability
I η(y |x) the conditional probability
I q(x) the marginal probability of observations
The hypothesis:
I The marginal probability q(x) is completely known
I X and Y are finite
Semi-Supervised Probabilistic Criterion
$\{X_i, Y_i\}_{i=1}^{n}$ are observations and their labels
Let $g(y|x; \theta)$ be the conditional probability function, parameterized by θ. Then the standard conditional maximum likelihood estimator is defined by
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i | X_i; \theta),$$
where $\ell(y|x; \theta) = -\log g(y|x; \theta)$ denotes the negated conditional log-likelihood function.
The asymptotically optimal semi-supervised estimator $\hat{\theta}_n^s$ is defined by
$$\hat{\theta}_n^s = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \frac{q(X_i)}{\sum_{j=1}^{n} \mathbb{1}\{X_j = X_i\}} \ell(Y_i | X_i; \theta),$$
where q(x) is the marginal probability of observations.
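The two criteria differ only in the weight given to each observation: 1/n in the supervised case, and q(X_i) divided by the number of occurrences of X_i in the semi-supervised case. A small sketch of these weights for a finite X, with a hypothetical marginal q and sample:

```python
from collections import Counter

def supervised_weights(xs):
    """Weight 1/n for every observation (standard maximum likelihood)."""
    n = len(xs)
    return [1.0 / n for _ in xs]

def semi_supervised_weights(xs, q):
    """Weight of observation i: q(X_i) / #{j : X_j = X_i}."""
    counts = Counter(xs)
    return [q[x] / counts[x] for x in xs]

# Hypothetical known marginal over a finite set and a small sample.
q = {"a": 0.5, "b": 0.3, "c": 0.2}
xs = ["a", "a", "b", "c", "a", "b"]

w_sup = supervised_weights(xs)
w_semi = semi_supervised_weights(xs, q)
```

The weights of all observations sharing the same value x add up to q(x), so the empirical frequency of x in the sample is replaced by its known probability. When every value of X is observed at least once, the weights sum to 1.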
Problem of the Covariate Shift
Covariate Shift
Let us learn an estimator from $(X_1, Y_1), \dots, (X_n, Y_n)$, where the distribution of $X_i$ is defined by $q_0(x)$. How to adapt the estimator if the test data $X_i$ are distributed according to $q_1(x) \neq q_0(x)$?
I If $q_1$ is known, the weights of the semi-supervised estimator ($q = q_1$) are asymptotically identical to
$$\frac{1}{n} \frac{q_1}{q_0}(X_i),$$
since $\sum_{j=1}^{n} \mathbb{1}\{X_j = X_i\} \approx n q_0(X_i)$ for large n, and the algorithm converges to
$$\theta_1^{\star} = \arg\min_{\theta \in \Theta} E_{\pi_1}[\ell(Y|X; \theta)].$$
I The covariance matrix is smaller than that of the estimator weighted by an importance ratio
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} \frac{q_1}{q_0}(X_i) \ell(Y_i | X_i; \theta)$$
(which requires knowledge of $q_0$).
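The asymptotic equivalence of the two weightings can be illustrated by simulation. In the sketch below, the marginals q0 and q1, the sample size, and the seed are hypothetical choices; q0 is used only to draw the sample and to compute the importance ratio for comparison.

```python
import random
from collections import Counter

# Hypothetical training and test marginals over a finite set.
q0 = {"a": 0.6, "b": 0.2, "c": 0.2}
q1 = {"a": 0.2, "b": 0.3, "c": 0.5}

rng = random.Random(0)
n = 20000
values = list(q0)
xs = rng.choices(values, weights=[q0[v] for v in values], k=n)
counts = Counter(xs)

# Semi-supervised weight: needs q1 and the sample counts, but not q0.
semi_supervised = {v: q1[v] / counts[v] for v in values}
# Importance weight (1/n) * q1/q0: needs q0.
importance = {v: q1[v] / (n * q0[v]) for v in values}

max_rel_gap = max(abs(semi_supervised[v] / importance[v] - 1.0) for v in values)
```

The ratio of the two weights is n q0(x) divided by the observed count of x, which tends to 1 as n grows; for a sample of this size the relative gap is expected to be on the order of a few percent.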
Experiments with logistic regression
[Figure: two panels of boxplots. x-axis: n (10, 50, 100, 500, 1000, 5000); y-axis: logarithmic risk (× n).]
Boxplots of the scaled excess risk as a function of the number of observations in the presence of the covariate shift.
Left: Shimodaira criterion, $n \left(E_{\pi}[\ell(Y|X; \hat{\theta}_n)] - E_{\pi}[\ell(Y|X; \theta^{\star})]\right)$;
Right: semi-supervised estimator, $n \left(E_{\pi}[\ell(Y|X; \hat{\theta}_n^s)] - E_{\pi}[\ell(Y|X; \theta^{\star})]\right)$.
Applications to real problems
In realistic applications (binary text classification), we cannot assume that the true q(x) is known.
We propose an approach based on clustering.
How to "estimate q(x)"? The set of unlabeled data is divided into k clusters, and in the expression of the weight
$$\frac{q(X_i)}{\sum_{j=1}^{n} \mathbb{1}\{X_j = X_i\}}$$
the numerator is replaced by the empirical frequency of the cluster which contains $X_i$; the denominator is replaced by the number of training points which are in the same cluster as $X_i$.
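A sketch of this replacement, assuming that cluster assignments are already available for both the training points and the unlabeled points (the clustering step itself is not shown, and the cluster ids below are hypothetical):

```python
from collections import Counter

def cluster_weights(train_clusters, unlabeled_clusters):
    """train_clusters[i] is the cluster id of training point X_i;
    unlabeled_clusters holds the cluster ids of the unlabeled points."""
    cluster_freq = Counter(unlabeled_clusters)   # numerator counts
    n_unlabeled = len(unlabeled_clusters)
    n_train_in = Counter(train_clusters)         # denominator counts
    return [(cluster_freq[c] / n_unlabeled) / n_train_in[c]
            for c in train_clusters]

# Hypothetical assignments for 4 training points and 10 unlabeled points.
train_clusters = [0, 0, 1, 2]
unlabeled_clusters = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
weights = cluster_weights(train_clusters, unlabeled_clusters)
```

Training points in a cluster that is frequent among the unlabeled data but rare among the training data receive a larger weight, here 0.4 for the single training point of cluster 2.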
Canonical Correlation Analysis: Motivation
I Canonical correlation analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units
I CCA is most appropriate when a researcher desires to examine the relationship between two sets of variables
I The method was first introduced by Harold Hotelling in 1936
Canonical Correlation Analysis: How?
I X and Y are matrices of order n × p and n × q
I The columns correspond to variables and the rows correspond to experimental units (patients)
Canonical Correlation Analysis: How?
I Find two vectors a and b that maximize the correlation between the linear combinations
$$U = Xa = a_1 X^1 + a_2 X^2 + \dots + a_p X^p$$
$$V = Yb = b_1 Y^1 + b_2 Y^2 + \dots + b_q Y^q$$
where $X^j$ and $Y^j$ denote the j-th columns of X and Y
I The problem consists in solving
$$\rho = \mathrm{cor}(U, V) = \max_{a, b} \mathrm{cor}(Xa, Yb)$$
Canonical correlations ρ are the positive square roots of the eigenvalues λ of $P_X P_Y$ ($\rho = \sqrt{\lambda}$), where
$$P_X = X (X^T X)^{-1} X^T$$
$$P_Y = Y (Y^T Y)^{-1} Y^T$$
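The eigenvalue characterization can be sketched with NumPy as follows. The columns are centered first, a pseudo-inverse is used in place of the inverse to tolerate collinear columns (an assumption made for this sketch), and the data are simulated. Forming the n × n projection matrices is only practical for small n.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Square roots of the leading eigenvalues of P_X P_Y, largest first."""
    X = X - X.mean(axis=0)          # center the columns
    Y = Y - Y.mean(axis=0)
    Px = X @ np.linalg.pinv(X.T @ X) @ X.T
    Py = Y @ np.linalg.pinv(Y.T @ Y) @ Y.T
    eigvals = np.linalg.eigvals(Px @ Py).real
    k = min(X.shape[1], Y.shape[1])
    top = np.sort(eigvals)[::-1][:k]
    # Clip tiny numerical excursions outside [0, 1] before the square root.
    return np.sqrt(np.clip(top, 0.0, 1.0))

# Simulated data: the first column of Y is a noisy copy of the first column of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = np.column_stack([X[:, 0] + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
rho = canonical_correlations(X, Y)

# With one variable on each side, CCA reduces to the absolute Pearson correlation.
rho_single = canonical_correlations(X[:, :1], Y[:, :1])[0]
pearson = abs(np.corrcoef(X[:, 0], Y[:, 0])[0, 1])
```

The first canonical correlation is close to 1 because one linear combination of the X columns nearly reproduces the first column of Y.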
How to Interpret the Results?
I Consider canonical correlation values
  I The canonical correlation coefficient is the Pearson correlation between the two synthetic variables on a given canonical function. Because of the scaling created by the standardized weights in the linear equations, this value cannot be negative and only ranges from 0 to 1.
I Consider coefficients
  I The results of canonical correlation are usually visualized through bar plots of the coefficients of the two sets of variables for the pairs of canonical variates showing significant correlation.
Example: 12 sets of features
1. PA, psychological, and three factor eating questionnaires
2. Body composition
3. Metabolic rate and blood pressure
4. Blood lipids
5. Glucose homeostasis and insulin sensitivity
6. Adiponekines
7. Kidney function
8. Fecal microbiota abundance, qPCR
9. Systemic inflammation and chemokines
10. Adipose tissue macrophage markers
11. Nutrient intake
12. Food intake
Canonical Correlation Values
[Figure: matrix of canonical correlation values between the 12 feature sets; color scale from −1 to 1. Row and column labels: Psycho, Bodycomp, Metab_BloodPr, Blood_lip, Glucose_Ins, Adiponek, Kidney, Fecal_Micro, Chemokin, Adipose, Nutrien, Food.]
How to interpret the results?
I Structure coefficients are critical for deciding what variables are useful for the model
I Bar plots of the coefficients of the two sets of variables for the pairs of canonical variates showing significant correlation.
I Coefficients increase in importance when the observed variables in the model increase in their correlation with each other
Canonical Correlation Example: Blood lipids / Glucose homeostasis and insulin sensitivity
[Figure: Helio plot, Canonical Variate 1, with two groups of variables (X Variables, Y Variables). Variable labels: Chol_meta, TG_meta, HDL_meta, LDL_meta, TC_HDL, NHDL, AGL_meta, Gly_meta, Ins_meta, HOMA_B_meta, HOMA_S_meta, HOMA_IR_meta, Disse_meta, QUICKI_meta, revised_QUICKI_meta, IGR_meta, Mcauley_meta, FIRI_clin.]