Data Mining
Cluster Analysis
Lecture Notes 10
Probabilistic Model-Based Clustering

Source: http://slidepdf.com/reader/full/cluster-anlysis (8/20/2019)



Probabilistic Model-Based Clustering 

Clustering methods discussed so far
 –  Every data object is assigned to exactly one cluster
Some applications may call for fuzzy or soft cluster assignment
 –  Ex.: an e-game could belong to both entertainment and software
Methods:
 –  fuzzy clusters
 –  probabilistic model-based clusters


Fuzzy Clustering

Given a set of objects X = {x1, …, xn}, a fuzzy set S is a subset of X that allows each object in X to have a membership degree between 0 and 1.
A fuzzy set S can be modeled as a function
 F_S : X → [0, 1]
Example: the popularity of cameras is defined as a fuzzy mapping
 pop(o) = 1,        if 1000 or more units of o are sold
 pop(o) = i/1000,   if i (i < 1000) units of o are sold

 Camera   Sales (units)
 A        50
 B        1320
 C        860
 D        270

 –  Then, A(0.05), B(1), C(0.86), D(0.27)
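As a quick illustration, the fuzzy membership function above can be written down directly; the function and sales figures are those from the slide:

```python
# The fuzzy "popularity" membership function from the slide:
# pop(o) = 1 if 1000 or more units are sold, i/1000 if i (< 1000) units are sold.
def pop(units_sold):
    return 1.0 if units_sold >= 1000 else units_sold / 1000

sales = {"A": 50, "B": 1320, "C": 860, "D": 270}
degrees = {camera: pop(units) for camera, units in sales.items()}
# degrees == {"A": 0.05, "B": 1.0, "C": 0.86, "D": 0.27}
```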


Fuzzy Clustering

k fuzzy clusters C1, …, Ck, represented as a partition matrix M = [w_ij] (w_ij is the membership degree of object o_i in fuzzy cluster C_j)
 –  for each object o_i and cluster C_j, 0 ≤ w_ij ≤ 1 (fuzzy set)
 –  for each object o_i, Σ_{j=1..k} w_ij = 1, i.e., equal participation in the clustering
 –  for each cluster C_j, 0 < Σ_{i=1..n} w_ij < n, which ensures there is no empty cluster
For an object o_i, the SSE is
 SSE(o_i) = Σ_{j=1..k} w_ij^p dist(o_i, c_j)^2,
 –  where the parameter p controls the influence of the degrees of membership
For a cluster C_j, the SSE is
 SSE(C_j) = Σ_{i=1..n} w_ij^p dist(o_i, c_j)^2.
To measure how well a clustering fits the data:
 SSE(C) = Σ_{i=1..n} Σ_{j=1..k} w_ij^p dist(o_i, c_j)^2.
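The overall SSE(C) above can be computed with a short helper; this is a minimal sketch for 1-d objects, not code from the slides:

```python
def fuzzy_sse(objects, centers, W, p=2):
    """SSE(C) = sum_i sum_j w_ij^p * dist(o_i, c_j)^2 (1-d Euclidean distance)."""
    return sum(
        (W[i][j] ** p) * (o - c) ** 2
        for i, o in enumerate(objects)
        for j, c in enumerate(centers)
    )

# With a crisp (0/1) partition matrix this reduces to the ordinary k-means SSE:
objects = [1.0, 5.0]
centers = [1.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0]]
# fuzzy_sse(objects, centers, W) == 1.0
```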


Fuzzy Clustering - Example

Example: suppose the store has six reviews
Let the cluster features be
 –  C1: {digital camera, lens}
 –  C2: {computer}
For review R_i and cluster C_j, w_ij is defined as
 w_ij = |R_i ∩ C_j| / |R_i ∩ (C1 ∪ C2)|
      = |R_i ∩ C_j| / |R_i ∩ {digital camera, lens, computer}|.

 Review ID   Keywords
 R1          digital camera, lens
 R2          digital camera
 R3          lens
 R4          digital camera, lens, computer
 R5          computer, CPU
 R6          computer, computer game

The partition matrix:
 M = [ 1    0
       1    0
       1    0
       2/3  1/3
       0    1
       0    1 ].

Here review R4 belongs to clusters C1 and C2 with membership degrees 2/3 and 1/3, respectively.
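The partition matrix can be reproduced from the keyword sets in the table; a small sketch:

```python
from fractions import Fraction

clusters = [{"digital camera", "lens"},   # C1
            {"computer"}]                 # C2
reviews = {
    "R1": {"digital camera", "lens"},
    "R2": {"digital camera"},
    "R3": {"lens"},
    "R4": {"digital camera", "lens", "computer"},
    "R5": {"computer", "CPU"},
    "R6": {"computer", "computer game"},
}
features = set().union(*clusters)         # C1 ∪ C2

def w(review, cluster):
    # w_ij = |R_i ∩ C_j| / |R_i ∩ (C1 ∪ C2)|
    return Fraction(len(review & cluster), len(review & features))

M = {rid: [w(kws, c) for c in clusters] for rid, kws in reviews.items()}
# M["R4"] == [Fraction(2, 3), Fraction(1, 3)]
```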


Model-Based Clustering

A set C of k probabilistic clusters C1, …, Ck with probability density functions f1, …, fk, respectively, and their probabilities ω1, …, ωk.
The probability of an object o being generated by cluster C_j is
 P(o|C_j) = ω_j f_j(o).
The probability of o being generated by the set of clusters C is
 P(o|C) = Σ_{j=1..k} ω_j f_j(o).
Since objects are assumed to be generated independently, for a data set D = {o1, …, on} we have
 P(D|C) = Π_{i=1..n} P(o_i|C) = Π_{i=1..n} Σ_{j=1..k} ω_j f_j(o_i).


Model-Based Clustering

Task: find a set C of k probabilistic clusters such that P(D|C) is maximized.
However, maximizing P(D|C) is often intractable, since the probability density function of a cluster can take an arbitrarily complicated form.
To make the computation feasible (as a compromise), assume that the probability density functions are parameterized distributions.
With O = {o1, …, on} (the n observed objects), Θ = {θ1, …, θk} (the parameters of the k distributions), and P_j(o_i|θ_j) the probability that o_i is generated from the j-th distribution with parameter θ_j, we have
 P(o_i|Θ) = Σ_{j=1..k} ω_j P_j(o_i|θ_j),
 P(O|Θ) = Π_{i=1..n} Σ_{j=1..k} ω_j P_j(o_i|θ_j).


Univariate Gaussian Mixture Model

Assume the probability density function of each cluster follows a 1-d Gaussian distribution, and suppose there are k clusters.
The probability density function of each cluster is centered at μ_j with standard deviation σ_j; with θ_j = (μ_j, σ_j), we have
 P(o_i|θ_j) = (1 / (√(2π) σ_j)) e^(−(o_i − μ_j)² / (2σ_j²)).
Assuming that each cluster has the same probability, that is, ω1 = ω2 = ··· = ωk = 1/k, we have
 P(o_i|Θ) = (1/k) Σ_{j=1..k} (1 / (√(2π) σ_j)) e^(−(o_i − μ_j)² / (2σ_j²)),
 P(O|Θ) = Π_{i=1..n} (1/k) Σ_{j=1..k} (1 / (√(2π) σ_j)) e^(−(o_i − μ_j)² / (2σ_j²)).
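These densities can be evaluated directly; a minimal sketch with two illustrative clusters (the parameter and data values below are assumptions, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """P(o_i | theta_j) for a 1-d Gaussian."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_pdf(x, params):
    """P(o_i | Theta) with equal cluster probabilities omega_j = 1/k."""
    return sum(gaussian_pdf(x, mu, s) for mu, s in params) / len(params)

def log_likelihood(data, params):
    """log P(O | Theta) = sum_i log P(o_i | Theta)."""
    return sum(math.log(mixture_pdf(x, params)) for x in data)

params = [(0.0, 1.0), (5.0, 1.0)]   # k = 2 illustrative clusters
data = [-0.2, 0.1, 4.8, 5.3]
ll = log_likelihood(data, params)
```

Working in log space avoids the numeric underflow that the raw product Π P(o_i|Θ) would cause on larger data sets.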


The EM (Expectation Maximization) Algorithm

The k-means algorithm has two steps at each iteration:
 –  Expectation step (E-step): given the current cluster centers, each object is assigned to the cluster whose center is closest to it: an object is expected to belong to the closest cluster
 –  Maximization step (M-step): given the cluster assignment, for each cluster, the algorithm adjusts the center so that the sum of the distances from the objects assigned to this cluster to the new center is minimized
The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
 –  The E-step assigns objects to clusters according to the current fuzzy clustering or parameters of probabilistic clusters
 –  The M-step finds the new clustering or parameters that minimize the SSE or maximize the expected likelihood


Example: Fuzzy Clustering Using the EM Algorithm

Initially, let c1 = a and c2 = b.
1st E-step: assign each object o to c1 and c2 with membership weights computed from its distances to the current centers.
1st M-step: recalculate the centroids according to the partition matrix, minimizing the sum of squared error (SSE).
Iterate until the cluster centers converge or the change is small enough.


Computing Mixture Models with EM

Given n objects O = {o1, …, on}, we want to mine a set of parameters Θ = {θ1, …, θk} such that P(O|Θ) is maximized, where θ_j = (μ_j, σ_j) are the mean and standard deviation of the j-th univariate Gaussian distribution.
We initially assign random values to the parameters θ_j, then iteratively conduct the E- and M-steps until convergence or until the change is sufficiently small.
At the E-step, for each object o_i, calculate the probability that o_i belongs to each distribution:
 P(θ_j|o_i, Θ) = P(o_i|θ_j) / Σ_{l=1..k} P(o_i|θ_l).
At the M-step, adjust the parameters θ_j = (μ_j, σ_j) so that the expected likelihood P(O|Θ) is maximized:
 μ_j = Σ_{i=1..n} o_i P(θ_j|o_i, Θ) / Σ_{i=1..n} P(θ_j|o_i, Θ),
 σ_j = √( Σ_{i=1..n} P(θ_j|o_i, Θ) (o_i − μ_j)² / Σ_{i=1..n} P(θ_j|o_i, Θ) ).
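Putting the E- and M-steps together gives a compact EM loop. This is a sketch under the slide's equal-probability assumption (ω_j = 1/k); to keep it deterministic it spreads the initial means over the data range instead of choosing them at random:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gmm(data, k, iters=50):
    xs = sorted(data)
    # initialize the means spread over the data range, all sigmas at 1
    mus = [xs[(j * (len(xs) - 1)) // max(k - 1, 1)] for j in range(k)]
    sigmas = [1.0] * k
    for _ in range(iters):
        # E-step: P(theta_j | o_i, Theta) = P(o_i|theta_j) / sum_l P(o_i|theta_l)
        resp = []
        for x in data:
            p = [gaussian_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: weighted mean and weighted standard deviation per cluster
        for j in range(k):
            wsum = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / wsum
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / wsum
            sigmas[j] = max(math.sqrt(var), 1e-6)   # guard against collapse
    return mus, sigmas

mus, sigmas = em_gmm([0.9, 1.0, 1.1, 1.2, 9.8, 10.0, 10.1, 10.3], k=2)
# mus converge near the two group means (about 1.05 and 10.05)
```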


Advantages and Disadvantages of Mixture Models

Strengths
 –  Mixture models are more general than partitioning and fuzzy clustering
 –  Clusters can be characterized by a small number of parameters
 –  The results may satisfy the statistical assumptions of the generative models
Weaknesses
 –  Converge to a local optimum (overcome: run multiple times with random initialization)
 –  Computationally expensive if the number of distributions is large, or the data set contains very few observed data points
 –  Need large data sets
 –  Hard to estimate the number of clusters


Notes about slides

The following slides are adapted from the "Introduction to Data Mining" book. They describe EM in a simpler way.
There may be some notation differences from the previous slides.


Model-based clustering

In order to understand our data, we will assume that there is a generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data.
 –  Models of different complexity can be defined, but we will assume that our model is a distribution from which data points are sampled
 –  Example: the data is the height of all people in Greece
In most cases, a single distribution is not good enough to describe all data points: different parts of the data follow different distributions
 –  Example: the data is the height of all people in Greece and China
 –  We need a mixture model
 –  Different distributions correspond to different clusters in the data


Gaussian Distribution

Example: the data is the height of all people in Greece
 –  Experience has shown that this data follows a Gaussian (Normal) distribution
 –  Reminder: the Normal distribution is
  P(x) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
 –  μ = mean, σ = standard deviation


Gaussian Model

What is a model?
 –  A Gaussian distribution is fully defined by the mean μ and the standard deviation σ
 –  We define our model as the pair of parameters θ = (μ, σ)
This is a general principle: a model is defined as a vector of parameters θ
Fitting the model: we want to find the normal distribution that best fits our data
 –  Find the best values for μ and σ
 –  But what does best fit mean?


Maximum Likelihood Estimation (MLE)

Suppose that we have a vector X = (x1, …, xn) of values
And we want to fit a Gaussian N(μ, σ) model to the data
Probability of observing a point x_i:
 P(x_i) = (1 / (√(2π) σ)) e^(−(x_i − μ)² / (2σ²))
Probability of observing all points (assuming independence):
 P(X) = Π_{i=1..n} P(x_i)
We want to find the parameters θ = (μ, σ) that maximize the probability P(X|θ)

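For a single Gaussian, the maximizing parameters have a closed form: the sample mean and the (1/n) sample standard deviation. A minimal sketch:

```python
import math

def gaussian_mle(xs):
    """MLE for a single Gaussian: sample mean and (1/n) standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

mu, sigma = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mu == 5.0, sigma == 2.0
```

Note that the MLE divides by n, not n − 1; it is a biased (but consistent) estimate of the variance.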

Maximum Likelihood Estimation (MLE)

Note: these are also the most likely parameters given the data
 P(θ|X) = P(X|θ) P(θ) / P(X)
If we have no prior information about θ, or X, then maximizing P(X|θ) is the same as maximizing P(θ|X)


A Gaussian Distribution (figure slide)


Mixture of Gaussians

Suppose that you have the heights of people from Greece and China. In this case the data is the result of a mixture of two Gaussians
 –  One for Greek people, and one for Chinese people
 –  Identifying, for each value, which Gaussian is most likely to have generated it will give us a clustering


Mixture model

A value x_i is generated according to the following process:
 –  First select the nationality
   •  With probability π_G select Greece, with probability π_C select China (π_G + π_C = 1)
 –  Given the nationality, generate the point from the corresponding Gaussian
   •  P(x_i|θ_G) ~ N(μ_G, σ_G) if Greece
   •  P(x_i|θ_C) ~ N(μ_C, σ_C) if China

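The generative process above can be simulated directly; the parameter values below are illustrative assumptions, not values from the slides:

```python
import random

def sample_height(rng, pi_g=0.5, theta_g=(177.0, 7.0), theta_c=(170.0, 6.0)):
    """First pick the nationality, then draw from that Gaussian."""
    if rng.random() < pi_g:
        return "Greece", rng.gauss(*theta_g)
    return "China", rng.gauss(*theta_c)

rng = random.Random(42)
points = [sample_height(rng) for _ in range(1000)]
```

Data sampled this way is exactly what the EM algorithm later tries to "invert": recover π, μ, σ from the heights alone, without seeing the nationality labels.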

Mixture Model

Our model has the following parameters:
 θ = (π_G, π_C, μ_G, μ_C, σ_G, σ_C)
 –  mixture probabilities: π_G, π_C
 –  distribution parameters: μ_G, μ_C, σ_G, σ_C
For a value x_i, we have
 P(x_i|θ) = π_G P(x_i|θ_G) + π_C P(x_i|θ_C)
For all values X = (x1, …, xn),
 P(X|θ) = Π_{i=1..n} P(x_i|θ)
We want to estimate the parameters that maximize the likelihood of the data


Mixture Models

Once we have the parameters θ = (π_G, π_C, μ_G, μ_C, σ_G, σ_C), we can estimate the membership probabilities P(G|x_i) and P(C|x_i) for each point x_i:
 –  This is the probability that point x_i belongs to the Greek or the Chinese population (cluster)
 P(G|x_i) = P(x_i|G) P(G) / (P(x_i|G) P(G) + P(x_i|C) P(C))
          = π_G P(x_i|θ_G) / (π_G P(x_i|θ_G) + π_C P(x_i|θ_C))

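The membership probability above is Bayes' rule applied to the two Gaussians; a sketch with illustrative parameter values:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def p_greek(x, pi_g, theta_g, theta_c):
    """P(G|x) = pi_G P(x|theta_G) / (pi_G P(x|theta_G) + pi_C P(x|theta_C))."""
    num = pi_g * gaussian_pdf(x, *theta_g)
    den = num + (1 - pi_g) * gaussian_pdf(x, *theta_c)
    return num / den

# Illustrative parameters: a point at mu_G gets a higher Greek membership
p = p_greek(177.0, 0.5, (177.0, 7.0), (170.0, 6.0))
```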

EM (Expectation Maximization) Algorithm

Initialize the values of the parameters in θ to some random values
Repeat until convergence
 –  E-step: given the parameters θ, estimate the membership probabilities P(G|x_i) and P(C|x_i)
 –  M-step: compute the parameter values that (in expectation) maximize the data likelihood

MLE estimates (with n the number of points):
 π_G = (1/n) Σ_{i=1..n} P(G|x_i)   (fraction of the population in G; similarly for C)
 μ_G = Σ_{i=1..n} P(G|x_i) x_i / Σ_{i=1..n} P(G|x_i)
 σ_G² = Σ_{i=1..n} P(G|x_i) (x_i − μ_G)² / Σ_{i=1..n} P(G|x_i)
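The whole loop, with the mixture weights also re-estimated, can be sketched as follows; the data and initial values are illustrative assumptions:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(xs, mu_g, mu_c, sigma=5.0, pi_g=0.5, iters=50):
    s_g = s_c = sigma
    n = len(xs)
    for _ in range(iters):
        # E-step: membership probabilities P(G|x_i)
        resp = []
        for x in xs:
            a = pi_g * gaussian_pdf(x, mu_g, s_g)
            b = (1 - pi_g) * gaussian_pdf(x, mu_c, s_c)
            resp.append(a / (a + b))
        # M-step: MLE updates for the weight, means, and standard deviations
        n_g = sum(resp)
        pi_g = n_g / n
        mu_g = sum(r * x for r, x in zip(resp, xs)) / n_g
        mu_c = sum((1 - r) * x for r, x in zip(resp, xs)) / (n - n_g)
        s_g = max(math.sqrt(sum(r * (x - mu_g) ** 2 for r, x in zip(resp, xs)) / n_g), 1e-3)
        s_c = max(math.sqrt(sum((1 - r) * (x - mu_c) ** 2 for r, x in zip(resp, xs)) / (n - n_g)), 1e-3)
    return pi_g, (mu_g, s_g), (mu_c, s_c)

heights = [160.0, 162.0, 165.0, 168.0, 178.0, 180.0, 182.0, 185.0]
pi_g, (mu_g, s_g), (mu_c, s_c) = em_two_gaussians(heights, mu_g=180.0, mu_c=165.0)
# mu_g and mu_c converge near the two group means (about 181.25 and 163.75)
```

Unlike the earlier equal-weight sketch, this version also updates π_G at each M-step, matching the MLE estimates on this slide.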


Relationship to K-means

E-step: assignment of points to clusters
 –  K-means: hard assignment; EM: soft assignment
M-step: computation of centroids
 –  K-means assumes a common fixed variance (spherical clusters)
 –  EM can change the variance for different clusters or different dimensions (ellipsoid clusters)
If the variance is fixed, then both minimize the same error function


EM Clustering (figure slides)



Mixture Model and K-means Clustering (figure slide)