k-means clustering
Hongning Wang, CS@UVa
Today’s lecture
• k-means clustering
– A typical partitional clustering algorithm
– Convergence property
• Expectation Maximization algorithm
– Gaussian mixture model
Partitional clustering algorithms
• Partition instances into exactly k non-overlapping clusters
– Flat structure clustering
– Users need to specify the cluster size k
– Task: identify the partition of k clusters that optimizes the chosen partition criterion
Partitional clustering algorithms
• Partition instances into exactly k non-overlapping clusters
– Typical criterion: minimize the intra-cluster distance and/or maximize the inter-cluster distance, e.g., $\min \sum_{j=1}^{k}\sum_{x_i \in C_j} d(x_i, c_j)$
– Optimal solution: enumerate every possible partition of size k and return the one optimizing the criterion
Unfortunately, finding the optimal partition is NP-hard! Instead of enumerating partitions, we approximate the solution by optimizing the inter-/intra-cluster distance criterion in an alternative, iterative way.
k-means algorithm
Input: cluster size k, instances $\{x_i\}_{i=1}^{N}$, distance metric $d(\cdot,\cdot)$
Output: cluster membership assignments $\{z_i\}_{i=1}^{N}$
1. Initialize the k cluster centroids (randomly, if no domain knowledge is available)
2. Repeat until no instance changes its cluster membership:
– Decide the cluster membership of instances by assigning them to the nearest cluster centroid (this minimizes the intra-cluster distance)
– Update the k cluster centroids based on the assigned cluster membership (this maximizes the inter-cluster distance)
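To make the procedure concrete, here is a minimal NumPy sketch of the algorithm above, using Euclidean distance; the function and variable names are illustrative and not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (N, V) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k centroids with k randomly chosen instances
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: put each instance in the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Step 2: stop when no instance changes its cluster membership
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its assigned instances
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids

For high-dimensional sparse text vectors one would replace the dense broadcasting above with sparse matrix operations, but the control flow stays the same.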
k-means illustration
[Figure: several iterations of the assignment and centroid-update steps on a 2-D example; each assignment step partitions the space into a Voronoi diagram around the current centroids.]
Complexity analysis
• Decide cluster membership: compute the distance from every instance to every centroid, i.e., O(kN) distance computations per iteration
• Compute cluster centroids: one pass over all instances, i.e., O(N) vector additions per iteration
• Assume k-means stops after t iterations: O(kNt) distance computations overall
• Don't forget the complexity of each distance computation, e.g., O(V) for Euclidean distance over V-dimensional vectors
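As a rough worked example (all numbers are hypothetical): clustering N = 100,000 documents represented as V = 10,000-dimensional vectors into k = 50 clusters, with the algorithm stopping after t = 20 iterations, costs on the order of k·N·V·t = 50 × 100,000 × 10,000 × 20 ≈ 10¹² basic operations for the assignment steps alone, which is why efficient (sparse) distance computation matters for text data.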
Convergence property
• Why does k-means stop?
– Answer: it is a special version of the Expectation Maximization (EM) algorithm, and EM is guaranteed to converge
– However, it is only guaranteed to converge to a local optimum, since k-means (EM) is a greedy algorithm
Probabilistic interpretation of clustering
• The density model of $x$ is multi-modal
• Each mode represents a sub-population
– E.g., a unimodal Gaussian for each group
Mixture model:
$$p(x) = \sum_{z} p(x \mid z)\,p(z)$$
where each $p(x \mid z)$ is a unimodal distribution (e.g., the components $p(x \mid z=1)$, $p(x \mid z=2)$, $p(x \mid z=3)$) and $p(z)$ is the mixing proportion.
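As a small numerical illustration of the mixture density above, the sketch below evaluates $p(x) = \sum_z p(x \mid z)\,p(z)$ for a three-component one-dimensional Gaussian mixture; all parameter values are made up for illustration.

import numpy as np
from scipy.stats import norm

# Hypothetical mixing proportions p(z) and component parameters for p(x|z)
p_z = np.array([0.5, 0.3, 0.2])        # mixing proportions, sum to 1
means = np.array([-2.0, 0.0, 3.0])     # one mean per unimodal component
stds = np.array([0.5, 1.0, 0.8])       # one standard deviation per component

def mixture_density(x):
    # p(x) = sum_z p(x|z) p(z), each p(x|z) a unimodal Gaussian
    return float(np.sum(p_z * norm.pdf(x, loc=means, scale=stds)))

print(mixture_density(0.0))            # density of the mixture at x = 0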
Probabilistic interpretation of clustering
• If $z$ is known for every $x$
– Estimating $p(z)$ and $p(x \mid z)$ is easy
• Maximum likelihood estimation
• This is Naïve Bayes
Probabilistic interpretation of clustering
• But $z$ is unknown for all instances
– Estimating $p(z)$ and $p(x \mid z)$ is generally hard
– Appeal to the Expectation Maximization algorithm
Mixture model: $p(x) = \sum_{z} p(x \mid z)\,p(z)$, where now both the unimodal components $p(x \mid z=1)$, $p(x \mid z=2)$, $p(x \mid z=3)$ and the mixing proportions $p(z)$ are unknown; estimating them is usually a constrained optimization problem.
Introduction to EM
• Parameter estimation
– All data is observable
• Maximum likelihood estimator
• Optimize the analytic form of the log-likelihood $\log p(X \mid \theta)$
– Missing/unobservable data (e.g., cluster membership)
• Data: X (observed) + Z (hidden)
• Likelihood: $\log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta)$; this sum over Z is intractable in most cases
• Approximate it!
Background knowledge
• Jensen's inequality
– For any convex function $f$ and positive weights $\lambda_i$ with $\sum_i \lambda_i = 1$:
$$f\Big(\sum_i \lambda_i x_i\Big) \;\le\; \sum_i \lambda_i\, f(x_i)$$
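As a quick numeric check of the inequality, take the convex function $f(x) = x^2$, weights $\lambda = (\tfrac{1}{2}, \tfrac{1}{2})$, and points $x = (1, 3)$:
$$f\Big(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 3\Big) = f(2) = 4 \;\le\; \tfrac{1}{2}f(1) + \tfrac{1}{2}f(3) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 9 = 5$$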
Expectation Maximization
• Maximize the data likelihood function by pushing up its lower bound
$$\log p(X \mid \theta) \;\ge\; \sum_{Z} q(Z)\,\log p(X, Z \mid \theta) \;-\; \sum_{Z} q(Z)\,\log q(Z)$$
by Jensen's inequality for the concave $\log$ function, $f(E[x]) \ge E[f(x)]$.
The lower bound is easier to compute and has many good properties! The components we need to tune when optimizing it are the proposal distribution $q(Z)$ for the hidden variables $Z$, and the model parameters $\theta$.
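Spelling out where the bound comes from (a standard derivation, included here for completeness): for any proposal distribution $q(Z)$,
$$\log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta) = \log \sum_{Z} q(Z)\,\frac{p(X, Z \mid \theta)}{q(Z)} \;\ge\; \sum_{Z} q(Z)\,\log \frac{p(X, Z \mid \theta)}{q(Z)}$$
where the inequality is Jensen's inequality applied to the concave $\log$ function; expanding the fraction gives exactly the two terms of the lower bound above.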
Intuitive understanding of EM
[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound, plotted as functions of $\theta$; the lower bound is easier to optimize, and improving it is guaranteed to improve the data likelihood.]
Expectation Maximization (cont)
• Optimize the lower bound w.r.t. $q(Z)$
Rewriting the lower bound using $p(X, Z \mid \theta) = p(Z \mid X, \theta)\,p(X \mid \theta)$:
$$= \sum_{Z} q(Z)\big[\log p(Z \mid X, \theta) + \log p(X \mid \theta)\big] - \sum_{Z} q(Z)\log q(Z)$$
$$= \sum_{Z} q(Z)\log \frac{p(Z \mid X, \theta)}{q(Z)} + \log p(X \mid \theta)$$
$$= -\,KL\big(q(Z)\,\|\,p(Z \mid X, \theta)\big) + \log p(X \mid \theta)$$
i.e., the negative KL-divergence between $q(Z)$ and $p(Z \mid X, \theta)$, plus a term that is constant with respect to $q(Z)$.
Expectation Maximization (cont)
• Optimize the lower bound w.r.t. $q(Z)$
– The KL-divergence is non-negative, and equals zero iff $q(Z) = p(Z \mid X, \theta)$
– A step further: when $q(Z) = p(Z \mid X, \theta)$, the bound equals $\log p(X \mid \theta)$, i.e., the lower bound is tight!
– Other choices of $q(Z)$ cannot lead to this tight bound, but might reduce computational complexity
– Note: the calculation of $q(Z)$ is based on the current $\theta$
Expectation Maximization (cont)
• Optimize the lower bound w.r.t. $q(Z)$
– Optimal solution: $q(Z) = p(Z \mid X, \theta^t)$, the posterior distribution of $Z$ given the current model $\theta^t$
– In k-means this corresponds to assigning each instance to its closest cluster centroid:
$$z_i = \arg\min_{k} d(c_k, x_i)$$
Expectation Maximization (cont)
• Optimize the lower bound w.r.t. $\theta$
– With $q(Z) = p(Z \mid X, \theta^t)$ fixed, the entropy term $-\sum_Z q(Z)\log q(Z)$ is constant w.r.t. $\theta$, so
$$\theta^{t+1} = \arg\max_{\theta}\; E_{Z \mid X, \theta^t}\big[\log p(X, Z \mid \theta)\big]$$
the expectation of the complete data likelihood
– In k-means, we do not compute this expectation but use the most probable configuration of $Z$, and then update the centroids accordingly
Expectation Maximization
• EM tries to iteratively maximize the likelihood
– "Complete" likelihood: $L_c(\theta) = \log p(X, Z \mid \theta)$
– Starting from an initial guess $\theta^{(0)}$:
1. E-step: compute the expectation of the complete likelihood
2. M-step: compute $\theta^{(t+1)}$ by maximizing the Q-function
$$Q(\theta; \theta^t) = E_{Z \mid X, \theta^t}\big[L_c(\theta)\big] = \sum_{Z} p(Z \mid X, \theta^t)\,\log p(X, Z \mid \theta)$$
$$\theta^{t+1} = \arg\max_{\theta}\, Q(\theta; \theta^t) \qquad \text{(the key step!)}$$
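As a concrete instance of this E-step/M-step recipe, here is a compact sketch of EM for a Gaussian mixture with per-component spherical covariance. It is an illustrative sketch (names such as em_gmm and resp are mine, not the lecture's), and a real implementation would work in log space for numerical stability.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a spherical-covariance Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, V = X.shape
    # Initial guess theta^(0): random means, unit variances, uniform mixing weights
    means = X[rng.choice(N, size=k, replace=False)].astype(float)
    variances = np.ones(k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities resp[i, j] = p(z_i = j | x_i, theta^t)
        resp = np.zeros((N, k))
        for j in range(k):
            resp[:, j] = weights[j] * multivariate_normal.pdf(
                X, mean=means[j], cov=variances[j] * np.eye(V))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the Q-function, i.e., re-estimate theta^(t+1)
        Nj = resp.sum(axis=0)                  # effective cluster sizes
        weights = Nj / N                       # new mixing proportions p(z)
        means = (resp.T @ X) / Nj[:, None]     # new component means
        for j in range(k):
            diff = X - means[j]
            variances[j] = (resp[:, j] * (diff ** 2).sum(axis=1)).sum() / (Nj[j] * V)
    return weights, means, variances, resp

Replacing the soft responsibilities with a hard arg-max assignment, and fixing the variances and mixing weights, recovers exactly the k-means updates.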
Intuitive understanding of EM
[Figure: the data likelihood $p(X \mid \theta)$ and its lower bound (the Q-function); starting from the current guess $\theta^t$, the E-step computes the lower bound and the M-step maximizes it to obtain the next guess $\theta^{t+1}$.]
In k-means:
• E-step: identify the cluster membership $z_i = \arg\min_k d(c_k, x_i)$
• M-step: update each centroid $c_k$ to the mean of the instances assigned to it
Convergence guarantee
• Proof of EM's convergence:
$$\log p(X \mid \theta) = \log p(Z, X \mid \theta) - \log p(Z \mid X, \theta)$$
Taking the expectation with respect to $p(Z \mid X, \theta^t)$ on both sides:
$$\log p(X \mid \theta) = \sum_{Z} p(Z \mid X, \theta^t)\log p(Z, X \mid \theta) - \sum_{Z} p(Z \mid X, \theta^t)\log p(Z \mid X, \theta) = Q(\theta; \theta^t) + H(\theta; \theta^t)$$
where $H(\theta; \theta^t) = -\sum_{Z} p(Z \mid X, \theta^t)\log p(Z \mid X, \theta)$ is a cross-entropy. Then the change of the log data likelihood between EM iterations is:
$$\log p(X \mid \theta) - \log p(X \mid \theta^t) = Q(\theta; \theta^t) + H(\theta; \theta^t) - Q(\theta^t; \theta^t) - H(\theta^t; \theta^t)$$
By Jensen's inequality, $H(\theta; \theta^t) \ge H(\theta^t; \theta^t)$, which means
$$\log p(X \mid \theta) - \log p(X \mid \theta^t) \ge Q(\theta; \theta^t) - Q(\theta^t; \theta^t) \ge 0$$
where the last inequality is guaranteed by the M-step.
What is not guaranteed
• A global optimum is not guaranteed!
– Likelihood: $\log p(X \mid \theta)$ is non-convex in most cases
– EM boils down to a greedy algorithm
• Alternative ascent
• Generalized EM
– E-step: compute $q(Z)$
– M-step: choose a $\theta$ that improves $Q(\theta; \theta^t)$
k-means vs. Gaussian Mixture
• If we use Euclidean distance in k-means
– We have explicitly assumed $p(x \mid z)$ is Gaussian
– Gaussian Mixture Model (GMM): a multinomial $p(z)$ over components, each with
$$p(x \mid z) = \frac{1}{\sqrt{(2\pi)^{V}\,|\Sigma_z|}}\; e^{-\frac{1}{2}(x-\mu_z)^{T}\Sigma_z^{-1}(x-\mu_z)}$$
where $V$ is the dimensionality of $x$
– In k-means, we assume equal (identity) covariance across clusters, so we don't need to estimate it:
$$p(x \mid z) \propto \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2}(x-\mu_z)^{T}(x-\mu_z)}$$
– We also do not consider cluster size (the mixing proportion $p(z)$) in k-means
k-means vs. Gaussian Mixture
• Soft vs. hard posterior assignment
– GMM: each instance receives a posterior distribution over clusters (soft assignment)
– k-means: each instance is assigned to exactly one cluster (hard assignment)
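Continuing with the hypothetical em_gmm sketch and data matrix X from the earlier examples, the difference looks like this in code:

# Soft assignment (GMM): each instance keeps a full posterior over clusters
weights, means, variances, resp = em_gmm(X, k=3)
print(resp[0])            # e.g. something like [0.72, 0.25, 0.03] (illustrative values)

# Hard assignment (k-means style): commit to the single most probable cluster
hard_labels = resp.argmax(axis=1)
print(hard_labels[0])     # e.g. 0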
k-means in practice
• Extremely fast and scalable
– One of the most popularly used clustering methods
• Listed among the top 10 data mining algorithms (ICDM 2006)
– Can be easily parallelized
• Map-Reduce implementation (see the sketch below)
– Mapper: assign each instance to its closest centroid
– Reducer: update each centroid based on the cluster membership
– Sensitive to initialization
• Prone to local optima
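A minimal sketch of the map-reduce decomposition mentioned above, written as plain Python stand-ins for a real MapReduce framework (mapper and reducer are illustrative names):

import numpy as np

def mapper(x, centroids):
    # Emit (id of the closest centroid, (instance, count 1)) for one instance x
    j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    return j, (x, 1)

def reducer(j, values):
    # Sum the instances assigned to centroid j and emit the updated centroid
    total = sum(x for x, _ in values)
    count = sum(c for _, c in values)
    return j, total / count

One k-means iteration corresponds to one map-reduce pass; a driver program broadcasts the updated centroids and repeats until the assignments stop changing.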
Better initialization: k-means++
1. Choose the first cluster center uniformly at random
2. Repeat until all k centers have been found:
– For each instance $x$, compute $D(x)$, its distance to the nearest center chosen so far
– Choose a new cluster center with probability proportional to $D(x)^2$, so that new centers tend to be far away from the existing centers
3. Run k-means with the selected centers as the initialization
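A short sketch of the seeding step, under the same illustrative conventions as the earlier snippets (kmeans_pp_init is a made-up name):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centers out."""
    rng = np.random.default_rng(seed)
    # 1. First center chosen uniformly at random among the instances
    centers = [X[rng.integers(len(X))]]
    # 2. Remaining centers: sample with probability proportional to D(x)^2,
    #    the squared distance to the nearest center chosen so far
    while len(centers) < k:
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    # 3. Run ordinary k-means with these centers as the initialization
    return np.array(centers)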
How to determine k
• Vary k to optimize the clustering criterion
– Internal vs. external validation
– Cross validation
• Abrupt change in the objective function
• Model selection criteria that penalize too many clusters
– AIC, BIC
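One way to put the AIC/BIC idea into practice is via scikit-learn's GaussianMixture, which exposes both criteria; the candidate range for k and the data matrix X below are placeholders:

from sklearn.mixture import GaussianMixture

# Fit a mixture model for each candidate k and keep the one with the lowest BIC,
# trading data likelihood against a penalty for additional clusters
scores = {}
for k in range(2, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores[k] = gmm.bic(X)
best_k = min(scores, key=scores.get)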
What you should know
• k-means algorithm
– An alternative greedy algorithm
– Convergence guarantee
• EM algorithm
– Hard clustering vs. soft clustering
• k-means vs. GMM
Today’s reading
• Introduction to Information Retrieval
– Chapter 16: Flat clustering
• 16.4 k-means
• 16.5 Model-based clustering