UVA CS 4501: Machine Learning
Lecture 21: Unsupervised Clustering (III)
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
Where are we? → Major sections of this course
• Regression (supervised)
• Classification (supervised)
• Feature selection
• Unsupervised models: Dimension Reduction (PCA); Clustering (K-means, GMM/EM, Hierarchical)
• Learning theory
• Graphical models
An unlabeled Dataset X
• A data matrix of n observations on p variables $x_1, x_2, \ldots, x_p$
• Data / points / instances / examples / samples / records: [rows]
• Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns]

Unsupervised learning = learning from raw (unlabeled, unannotated, etc.) data, as opposed to supervised learning, where a classification label for each example is given.
What is clustering?
• Find groups (clusters) of data points such that data points within a group are similar (or related) to one another and different from (or unrelated to) the data points in other groups.
• Inter-cluster distances are maximized; intra-cluster distances are minimized.
Roadmap: clustering
• Definition of "groupness"
• Definition of "similarity/distance"
• Representation for objects
• How many clusters?
• Clustering algorithms
  – Partitional algorithms
  – Hierarchical algorithms
• Formal foundation and convergence
Clustering Algorithms
• Partitional algorithms
  – Usually start with a random (partial) partitioning
  – Refine it iteratively
    • K-means clustering
    • Mixture-model based clustering
• Hierarchical algorithms
  – Bottom-up, agglomerative
  – Top-down, divisive
(2) Partitional Clustering
• Nonhierarchical
• Constructs a partition of n objects into a set of K clusters
• The user has to specify the desired number of clusters K.
Other Partitioning Methods
• Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002).
• Self-organizing maps (SOM): add an underlying "topology" (a neighboring structure on a lattice) that relates the cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).
• Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).
• Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning, and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).
E.g., SOM used for visualization: Islands of Music (Pampalk et al., KDD '03)
Partitional: Gaussian Mixture Model
• 1. Review of the Gaussian distribution
• 2. GMM for clustering: basic algorithm
• 3. GMM connecting to K-means
• 4. Problems of GMM and K-means
A Gaussian Mixture Model for Clustering
• Assume that the data are generated from a mixture of Gaussian distributions.
• Each Gaussian distribution has
  – a center (mean): $\mu_j$
  – a covariance: $\Sigma_j$
• For each data point, determine its membership $z_{ij}$: whether $x_i$ belongs to the $j$-th cluster.
Gaussian Distribution

$$X \sim N(\mu, \sigma^2)$$

i.e., a distribution parameterized by its mean and its variance (in the multivariate case, a mean vector and a covariance matrix).

Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm
Example: the Bivariate Normal distribution

$$p(\vec{x}) = f(x_1, x_2) = \frac{1}{2\pi\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})}$$

with

$$\vec{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
       = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

and

$$|\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}^2 = \sigma_1^2\sigma_2^2(1-\rho^2)$$
[Figure: scatter plots of data drawn from the bivariate Normal distribution]
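To make this density formula concrete, here is a minimal NumPy sketch (the function and variable names are illustrative, not from the lecture) that evaluates the bivariate Normal density directly and checks it against SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bivariate_normal_pdf(x, mu, sigma):
    """Evaluate the bivariate Normal density at a point x of shape (2,)."""
    p = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 1.0])
rho, s1, s2 = 0.5, 1.0, 2.0
sigma = np.array([[s1 ** 2, rho * s1 * s2],
                  [rho * s1 * s2, s2 ** 2]])

x = np.array([0.5, 0.5])
print(bivariate_normal_pdf(x, mu, sigma))     # direct formula
print(multivariate_normal(mu, sigma).pdf(x))  # SciPy agrees
```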
How to Estimate a Gaussian: MLE

• In the 1-D Gaussian case, the maximum-likelihood estimates simply set the mean and the variance to the sample mean and the sample variance:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$
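A quick numerical sanity check of these MLE formulas in NumPy (the data and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # true mu = 2, sigma = 3

mu_hat = x.sum() / len(x)                     # sample mean
var_hat = ((x - mu_hat) ** 2).sum() / len(x)  # MLE variance: divides by n, not n-1

print(mu_hat, np.sqrt(var_hat))  # close to 2.0 and 3.0
```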
The p-multivariate Normal distribution:

$$\langle X_1, X_2, \ldots, X_p \rangle \sim N(\vec{\mu}, \Sigma)$$
Partitional: Gaussian Mixture Model
• 1. Review of the Gaussian distribution
• 2. GMM for clustering: basic algorithm
• 3. GMM connecting to K-means
• 4. Problems of GMM and K-means
Application: Three Speaker Recognition Tasks
• Identification: whose voice is this?
• Verification / Authentication / Detection: is this Bob's voice?
• Segmentation and Clustering (Diarization): which segments are from the same speaker? Where are the speaker changes?

(Slide from Douglas Reynolds, MIT Lincoln Laboratory)
Application: GMMs for Speaker Recognition
• A Gaussian mixture model (GMM) represents a density as the weighted sum of multiple Gaussian distributions, $p(x \mid j)$.
• Each Gaussian state $j$ has a
  – mean $\mu_j$
  – covariance $\Sigma_j$
  – weight $w_j \equiv p(\mu = \mu_j)$
Recognition Systems: Gaussian Mixture Models

[Figure: a GMM density $p(x)$ over two dimensions (Dim 1, Dim 2), built from model components $j = 1, \ldots, K$ with parameters $\mu_j$, $\Sigma_j$, $w_j$.]
Learning a Gaussian Mixture
• Probability model:

$$\begin{aligned}
p(\vec{x} = \vec{x}_i) &= \sum_j p(\vec{x} = \vec{x}_i,\ \vec{\mu} = \vec{\mu}_j) && \text{(law of total probability)} \\
&= \sum_j p(\vec{\mu} = \vec{\mu}_j)\, p(\vec{x} = \vec{x}_i \mid \vec{\mu} = \vec{\mu}_j) && \text{(chain rule)} \\
&= \sum_j p(\vec{\mu} = \vec{\mu}_j)\, \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j)}
\end{aligned}$$

A Gaussian mixture model (GMM) represents the density as the weighted sum of multiple Gaussian distributions.
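A hedged sketch of this probability model in NumPy/SciPy; the component weights, means, and covariances below are made-up toy values, and the function name is illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum_j p(mu_j) * N(x; mu_j, Sigma_j): the mixture density above."""
    return sum(w * multivariate_normal(m, c).pdf(x)
               for w, m, c in zip(weights, means, covs))

# A toy 2-component mixture in 2-D.
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

x = np.array([1.0, 1.0])
print(gmm_density(x, weights, means, covs))

# The log-likelihood of a dataset (next slide) is then sum_i log p(x_i):
X = np.array([[0.1, -0.2], [2.9, 3.1], [1.0, 1.0]])
print(np.sum([np.log(gmm_density(xi, weights, means, covs)) for xi in X]))
```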
Max Log-Likelihood of the Observed Data Samples
• Log-likelihood of the data; apply MLE to find the optimal Gaussian parameters $\{p(\vec{\mu} = \vec{\mu}_j),\ j = 1 \ldots K\}$ and $\{\vec{\mu}_j, \Sigma_j,\ j = 1 \ldots K\}$:

$$\log p(x_1, x_2, x_3, \ldots, x_n) = \log \prod_{i=1..n} \sum_{j=1..K} p(\vec{\mu} = \vec{\mu}_j)\, \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j)}$$
Expectation-Maximization for Training GMM
• Start:
  – "Guess" the centroid and covariance for each of the K clusters
  – "Guess" the proportions of the clusters, e.g., uniform probability 1/K
• Loop:
  – For each point, revise its proportions of belonging to each of the K clusters
  – For each cluster, revise both the mean (centroid position) and the covariance (shape)
Another Gaussian Mixture Example

[Figures: EM on a 2-D example, at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations. At each iteration: for each point, its proportions of belonging to each of the K clusters are revised; for each cluster, its mean (centroid position), covariance (shape), and proportion in the mixture are revised.]
Recap: Gaussian Mixture Models

[Figure: a GMM density $p(x)$ over two dimensions (Dim 1, Dim 2), built from model components $j = 1, \ldots, K$ with parameters $\mu_j$, $\Sigma_j$, $w_j$.]
The Simplest GMM Assumption
• Each component generates data from a Gaussian with
  – mean $\mu_i$
  – a shared diagonal covariance matrix $\sigma^2 I$
[Figure: three such components with means $\mu_1, \mu_2, \mu_3$]
Another Simple GMM Assumption
• Each component generates data from a Gaussian with
  – mean $\mu_i$
  – a shared covariance matrix that is diagonal
[Figure: three such components with means $\mu_1, \mu_2, \mu_3$]
The General GMM Assumption
• Each component generates data from a Gaussian with
  – mean $\mu_i$
  – its own covariance matrix $\Sigma_i$
[Figure: three such components with means $\mu_1, \mu_2, \mu_3$]
Another Simple GMM Assumption
• Each component generates data from a Gaussian with
  – mean $\mu_i$
  – a cluster-specific diagonal covariance matrix $\sigma_j^2 I$
[Figure: three such components with means $\mu_1, \mu_2, \mu_3$]
A Bit More General GMM Assumption
• Each component generates data from a Gaussian with
  – mean $\mu_i$
  – a shared covariance matrix that is a full matrix
[Figure: three such components with means $\mu_1, \mu_2, \mu_3$]
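These covariance assumptions roughly correspond to the covariance_type option of scikit-learn's GaussianMixture: "spherical" (cluster-specific $\sigma_j^2 I$), "diag" (cluster-specific diagonal), "tied" (shared full matrix), and "full" (cluster-specific full matrix); the shared-diagonal variants above have no exact scikit-learn equivalent. A small sketch for comparing them on illustrative data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([4.0, 4.0], 0.5, size=(100, 2))])

for cov_type in ["spherical", "diag", "tied", "full"]:
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    print(cov_type, gmm.bic(X))  # lower BIC suggests a better trade-off
```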
Concrete Equations for Learning a Gaussian Mixture (assuming a known, shared covariance)
Assuming $\Sigma$ Known and Shared

$$p(\vec{x} = \vec{x}_i) = \sum_{\mu_j} p(\vec{x} = \vec{x}_i,\ \vec{\mu} = \vec{\mu}_j)
= \sum_j p(\vec{\mu} = \vec{\mu}_j)\, p(\vec{x} = \vec{x}_i \mid \vec{\mu} = \vec{\mu}_j)
= \sum_j p(\vec{\mu} = \vec{\mu}_j)\, \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma^{-1} (\vec{x}_i - \vec{\mu}_j)}$$
Learning a Gaussian Mixture (assuming a known, shared covariance)

E-step:

$$E[z_{ij}] = p(\vec{\mu} = \vec{\mu}_j \mid x = x_i)
= \frac{p(x = x_i \mid \mu = \mu_j)\, p(\mu = \mu_j)}{\sum_{s=1}^{k} p(x = x_i \mid \mu = \mu_s)\, p(\mu = \mu_s)}
= \frac{\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma^{-1} (\vec{x}_i - \vec{\mu}_j)}\, p(\mu = \mu_j)}{\sum_{s=1}^{k} \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_s)^T \Sigma^{-1} (\vec{x}_i - \vec{\mu}_s)}\, p(\mu = \mu_s)}$$
E-step (vs. the Assignment Step in K-means), assuming a known, shared covariance: the same responsibility computation as above, i.e., a soft assignment of each point to the K clusters in place of K-means' hard assignment.
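A minimal NumPy/SciPy sketch of this E-step, assuming the shared-covariance case above (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, cov):
    """Responsibilities E[z_ij] = p(mu_j | x_i) under a shared covariance."""
    n, K = X.shape[0], len(means)
    resp = np.empty((n, K))
    for j in range(K):
        # Numerator: p(x_i | mu_j) * p(mu_j), computed for all i at once.
        resp[:, j] = weights[j] * multivariate_normal(means[j], cov).pdf(X)
    # Denominator: sum over clusters s = 1..K, normalizing each row.
    resp /= resp.sum(axis=1, keepdims=True)
    return resp
```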
Learning a Gaussian Mixture (assuming a known, shared covariance)

M-step:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{n} E[z_{ij}]\, x_i}{\sum_{i=1}^{n} E[z_{ij}]}, \qquad
p(\mu = \mu_j) \leftarrow \frac{1}{n} \sum_{i=1}^{n} E[z_{ij}]$$

The covariances $\Sigma_j$ ($j$: 1 to K) can also be derived in the M-step under a full setting.
M-step (vs. the Centroid Step in K-means): the same updates as above; the covariances $\Sigma_j$ ($j$: 1 to K) will also be derived in the M-step under a full setting.
M-step for Estimating the Unknown Covariance Matrix (more general; details in the EM-Extra lecture):

$$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} E[z_{ij}]^{(t)}\, (x_i - \mu_j^{(t+1)})(x_i - \mu_j^{(t+1)})^T}{\sum_{i=1}^{n} E[z_{ij}]^{(t)}}$$
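A matching M-step sketch that updates the proportions, means, and (full, per-cluster) covariances from the responsibilities (all names are illustrative):

```python
import numpy as np

def m_step(X, resp):
    """Update weights, means, and full per-cluster covariances from E[z_ij]."""
    n, p = X.shape
    nk = resp.sum(axis=0)               # sum_i E[z_ij], per cluster
    weights = nk / n                    # p(mu = mu_j)
    means = (resp.T @ X) / nk[:, None]  # responsibility-weighted means
    covs = []
    for j in range(len(nk)):
        diff = X - means[j]             # (x_i - mu_j) for all i
        covs.append((resp[:, j, None] * diff).T @ diff / nk[j])
    return weights, means, covs
```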
Recap: Expectation-Maximization for Training GMM
• Start:
  – "Guess" the centroid and covariance for each of the K clusters
  – "Guess" the proportions of the clusters, e.g., uniform probability 1/K
• Loop:
  – For each point, revise its proportions of belonging to each of the K clusters (E-step)
  – For each cluster, revise both the mean (centroid position) and the covariance (shape) (M-step)
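Putting these steps together, a minimal, self-contained EM loop for a GMM with full per-cluster covariances might look like the following sketch (all names illustrative; the small ridge term added to each covariance is a numerical-stability detail the slides do not discuss):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a GMM: random start, then alternate E- and M-steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Start: "guess" centroids (random points), covariances, proportions 1/K.
    means = X[rng.choice(n, size=K, replace=False)].copy()
    covs = [np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)]
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: each point's proportions of belonging to the K clusters.
        resp = np.column_stack([w * multivariate_normal(m, c).pdf(X)
                                for w, m, c in zip(weights, means, covs)])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: revise each cluster's proportion, mean, and covariance.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        covs = [((resp[:, j, None] * (X - means[j])).T @ (X - means[j]))
                / nk[j] + 1e-6 * np.eye(p) for j in range(K)]
    return weights, means, covs, resp
```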
Partitional: Gaussian Mixture Model
• 1. Review of the Gaussian distribution
• 2. GMM for clustering: basic algorithm
• 3. GMM connecting to K-means
• 4. Problems of GMM and K-means
Recap: K-means Iterative Learning

$$\underset{\{\vec{C}_j,\ m_{i,j}\}}{\arg\min}\ \sum_{j=1}^{K} \sum_{i=1}^{n} m_{i,j}\, (\vec{x}_i - \vec{C}_j)^2$$

Memberships $\{m_{i,j}\}$ and centers $\{\vec{C}_j\}$ are correlated.

E-step: given centers $\{\vec{C}_j\}$,

$$m_{i,j} = \begin{cases} 1 & j = \arg\min_k\, (\vec{x}_i - \vec{C}_k)^2 \\ 0 & \text{otherwise} \end{cases}$$

M-step: given memberships $\{m_{i,j}\}$,

$$\vec{C}_j = \frac{\sum_{i=1}^{n} m_{i,j}\, \vec{x}_i}{\sum_{i=1}^{n} m_{i,j}}$$
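The corresponding K-means loop, written to mirror the E-step / M-step structure above (a hedged sketch; initializing centers by sampling K data points is one common choice):

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Hard-assignment counterpart of EM for GMM."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # E-step: hard-assign each point to its nearest center (m_ij in {0, 1}).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points.
        for j in range(K):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```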
Compare: K-means
• The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.
• In the K-means "E-step" we do a hard assignment; in the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:

$$\underset{\{\vec{C}_j,\ m_{i,j}\}}{\arg\min}\ \sum_{j=1}^{K} \sum_{i=1}^{n} m_{i,j}\, (\vec{x}_i - \vec{C}_j)^2$$

versus the GMM log-likelihood (shared covariance):

$$\log \prod_{i=1}^{n} p(x = x_i) = \sum_i \log \left[ \sum_{\mu_j} p(\mu = \mu_j)\, \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma^{-1} (\vec{x}_i - \vec{\mu}_j)} \right]$$

• K-means only detects spherical clusters.
• GMM can adjust itself to elliptic-shaped clusters.
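To see this difference concretely, one can compare scikit-learn's KMeans and GaussianMixture on deliberately elongated clusters (a hedged sketch; the dataset and seeds are illustrative, and the GMM typically recovers the elliptic clusters much better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

# Two elongated (elliptic) clusters that a spherical bias tends to split badly.
rng = np.random.default_rng(0)
stretch = np.array([[3.0, 0.0], [0.0, 0.3]])
A = rng.normal(size=(200, 2)) @ stretch
B = rng.normal(size=(200, 2)) @ stretch + np.array([0.0, 2.0])
X = np.vstack([A, B])
y = np.repeat([0, 1], 200)  # true cluster labels

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)

print("k-means ARI:", adjusted_rand_score(y, km.labels_))
print("GMM ARI:    ", adjusted_rand_score(y, gm.predict(X)))
```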
(3) GMM Clustering

Task:                 Clustering
Representation:       Mixture of Gaussians
Score Function:       Likelihood
Search/Optimization:  EM algorithm
Models, Parameters:   Each point's soft membership, plus a mean / covariance per cluster
With cluster-specific covariances, the score function becomes

$$\log \prod_{i=1}^{n} p(x = x_i) = \sum_i \log \left[ \sum_{\mu_j} p(\mu = \mu_j)\, \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}_i - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x}_i - \vec{\mu}_j)} \right]$$
Partitional: Gaussian Mixture Model
• 1. Review of the Gaussian distribution
• 2. GMM for clustering: basic algorithm
• 3. GMM connecting to K-means
• 4. Problems of GMM and K-means
Unsupervised Learning: Not as Hard as It Looks
• Sometimes easy
• Sometimes impossible
• And sometimes in between
Problems (I)
• Both k-means and mixture models need to compute the centers of clusters and an explicit distance measurement.
  – Given a strange distance measurement, the centers of clusters can be hard to compute. E.g., the $L_\infty$ (Chebyshev) distance:

$$\|\vec{x} - \vec{x}'\|_\infty = \max\left( |x_1 - x_1'|,\ |x_2 - x_2'|,\ \ldots,\ |x_p - x_p'| \right)$$

[Figure: points $x$, $y$, $z$ with $\|x - y\|_\infty = \|x - z\|_\infty$]
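A tiny sketch of this distance (illustrative values), to make the definition concrete:

```python
import numpy as np

def chebyshev(x, y):
    """L-infinity distance: the max over coordinate-wise absolute differences."""
    return np.max(np.abs(x - y))

x = np.array([1.0, 5.0, 2.0])
y = np.array([4.0, 4.0, 0.0])
print(chebyshev(x, y))  # 3.0 = max(|1-4|, |5-4|, |2-0|)
```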
Problems (II)
• Both k-means and mixture models look for compact clustering structures.
  – In some cases, connected clustering structures are more desirable.
  – Graph-based clustering, e.g., MinCut, spectral clustering.
E.g., image segmentation through MinCut.
References
• Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2, No. 1. New York: Springer, 2009.
• Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides.
• Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides.
• Clustering slides from Prof. Rong Jin @ MSU.