Introduction to Clustering
Alka Arora, Sr. Scientist
Indian Agricultural Statistics Research Institute, Library Avenue, Pusa, New Delhi. [email protected], [email protected]
What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
  – high intra-class similarity
  – low inter-class similarity
• Finding the class labels and the number of classes directly from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
Clustering is subjective: the same set of people could equally well be grouped as school employees vs. the Simpson's family, or as males vs. females.
How do we measure similarity?
Peter → Piter (substitution: i for e) → Pioter (insertion: o) → Piotr (deletion: e)
Edit Distance Example
It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it.
The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
How similar are the names “Peter” and “Piotr”? Assume the following cost function:
Substitution: 1 unit
Insertion: 1 unit
Deletion: 1 unit
D(Peter,Piotr) is 3
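As an illustration, here is a minimal Python sketch of this edit-distance computation using the unit costs above (the function name edit_distance and the dynamic-programming layout are illustrative choices, not part of the original slides):

```python
def edit_distance(q, c):
    """Minimum number of substitutions, insertions and deletions
    needed to transform string q into string c (unit cost each)."""
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete remaining characters of q
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all characters of c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1   # substitution cost (0 if equal)
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # 3
```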
Similarity / Dissimilarity
• Similarity between objects i and j is denoted by s_ij. We can measure this quantity in several ways, depending on the scale of measurement (or data type) that we have.
• Dissimilarity (d_ij) is the opposite of similarity. There are many types of distance and similarity measures, and each has its own characteristics.
• The relationship between dissimilarity and similarity is given by d_ij = 1 − s_ij.
Distance for Binary Variables
• Binary variables take only two possible values, such as Yes and No, Agree and Disagree, True and False, Success and Failure, 0 and 1, Absent and Present, or Positive and Negative.
• Representing the two values as positive and negative, the similarity or dissimilarity (distance) of two objects can be measured in terms of the number of occurrences (frequency) of positive and negative values in each object.
Example
• Let:
  p = number of variables that are positive for both objects
  q = number of variables that are positive for the i-th object and negative for the j-th object
  r = number of variables that are negative for the i-th object and positive for the j-th object
  s = number of variables that are negative for both objects
  t = p + q + r + s = total number of variables

Feature of Fruit   Sphere shape   Sweet   Sour   Crunchy
Apple              Yes            Yes     Yes    Yes
Banana             No             Yes     No     No

• For Apple and Banana we have p = 1, q = 3, r = 0 and s = 0. Thus t = p + q + r + s = 4.
Distance Measures for Binary Variables
• Simple matching distance = (q + r) / t = 3/4
• Jaccard's distance = (q + r) / (p + q + r) = 3/4
• Hamming distance: the number of positions at which the two objects differ, i.e. q + r = 3
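As a small check, here is a Python sketch (the helper name binary_distances is illustrative) that computes p, q, r, s and the three distances for the Apple/Banana rows above:

```python
def binary_distances(obj_i, obj_j):
    """obj_i, obj_j: lists of 0/1 values for the same binary variables."""
    p = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)  # both positive
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)  # i positive, j negative
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)  # i negative, j positive
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 0)  # both negative
    t = p + q + r + s
    simple_matching = (q + r) / t        # mismatches over all variables
    jaccard = (q + r) / (p + q + r)      # ignores joint absences (s)
    hamming = q + r                      # number of differing positions
    return simple_matching, jaccard, hamming

apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(binary_distances(apple, banana))   # (0.75, 0.75, 3)
```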
Distance for Quantitative Variables
Euclidean distance examines the root of squared differences between the coordinates of a pair of objects: d(A, B) = sqrt(Σ_i (a_i − b_i)²).
City block (Manhattan) distance examines the sum of absolute differences between the coordinates of a pair of objects: d(A, B) = Σ_i |a_i − b_i|.
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
For these vectors, the Euclidean distance is sqrt(49 + 9 + 1 + 36) = sqrt(95) ≈ 9.75 and the city block distance is 7 + 3 + 1 + 6 = 17.
Distance for Quantitative Variables
Chebyshev distance is also called maximum value distance. It examines the absolute magnitude of the differences between the coordinates of a pair of objects and takes the largest: d(A, B) = max_i |a_i − b_i|. This distance can be used for both ordinal and quantitative variables.
Minkowski distance is the generalized metric distance: d(A, B) = (Σ_i |a_i − b_i|^λ)^(1/λ). When λ = 1 it becomes the city block distance, and when λ = 2 it becomes the Euclidean distance. Chebyshev distance is a special case of Minkowski distance with λ → ∞ (taking the limit).
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
For these vectors, the Chebyshev distance is max(7, 3, 1, 6) = 7.
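A minimal NumPy sketch computing these distances for the vectors A and B above (the helper name minkowski is illustrative):

```python
import numpy as np

A = np.array([0, 3, 4, 5], dtype=float)
B = np.array([7, 6, 3, -1], dtype=float)

def minkowski(x, y, lam):
    # Generalised metric: lam=1 gives city block, lam=2 gives Euclidean.
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

euclidean = np.sqrt(np.sum((A - B) ** 2))   # sqrt(95) ~ 9.75
manhattan = np.sum(np.abs(A - B))           # 17
chebyshev = np.max(np.abs(A - B))           # 7

print(euclidean, manhattan, chebyshev)
print(minkowski(A, B, 1), minkowski(A, B, 2))   # matches city block and Euclidean
```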
Distance for Quantitative Variables
• Angular separation represents the cosine of the angle between two vectors: s(A, B) = Σ_i a_i b_i / sqrt(Σ_i a_i² · Σ_i b_i²). It measures similarity rather than distance or dissimilarity, so a higher value of angular separation indicates that the two objects are more similar. Like the cosine, its value lies in [−1, 1]. It is closely related to the correlation coefficient (next slide).
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
Distance for Quantitative Variables
• The correlation coefficient is the angular separation computed after standardizing, i.e. after centering each vector on its mean value. Its value lies between −1 and +1. It measures similarity rather than distance or dissimilarity.
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
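Continuing the same NumPy sketch, the angular separation (cosine) and the correlation coefficient for A and B (variable names are illustrative):

```python
import numpy as np

A = np.array([0, 3, 4, 5], dtype=float)
B = np.array([7, 6, 3, -1], dtype=float)

# Angular separation (cosine): dot product divided by the product of the norms.
cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))

# Correlation coefficient: the same quantity after centring each vector on its mean.
Ac, Bc = A - A.mean(), B - B.mean()
correlation = Ac @ Bc / (np.linalg.norm(Ac) * np.linalg.norm(Bc))

print(cosine, correlation)   # both lie in [-1, 1]
```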
Two Types of Clustering
• Partitional algorithms: construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.
A Useful Tool for Summarizing Similarity Measurements
The dendrogram is the tool used to summarize similarity measurements in hierarchical clustering.
[Figure: anatomy of a dendrogram, with the root, internal nodes, internal branches, terminal branches and leaves labelled.]
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
Business & Economy
B2B Finance Shopping Jobs
Aerospace Agriculture… Banking Bonds… Animals Apparel Career Workspace
Note that hierarchies are commonly used to organize information, for example in a web portal.
Yahoo’s hierarchy is manually created, data mining focus on automatic creation of hierarchies.
We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately)
Outlier detection
One potential use of a dendrogram is to detect outliers: a single isolated branch is suggestive of a data point that is very different from all the others.
(How-to) Hierarchical Clustering
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n − 2)!]
Number of Leaves   Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
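As a quick check of the formula and the table, a small Python sketch (the helper name num_dendrograms is illustrative):

```python
from math import factorial

def num_dendrograms(n):
    """Number of distinct dendrograms (rooted binary trees) with n labelled leaves."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425
```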
Since we cannot test all possible trees, we have to resort to a heuristic search over the space of possible trees. We could do this in one of two ways:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Process flow of Agglomerative Hierarchical Clustering
• Convert object features to a distance matrix.
• Set each object as a cluster (thus, if we have 6 objects, we will have 6 clusters at the beginning).
• Iterate until the number of clusters is 1:
  – Merge the two closest clusters.
  – Update the distance matrix.
Linkage criteria determine the distance between an object and a cluster, or between two clusters:
• Single linkage (nearest neighbor): In this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
• Ward's linkage: In this method, we try to minimize the variance of the merged clusters.
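For illustration, here is a small sketch using SciPy's hierarchical clustering, assuming SciPy is available (the coordinates for objects C and E are made-up values, since only A, B, D and F are given in the example below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Objects A-F with two features (X1, X2); C and E are illustrative values.
X = np.array([[1.0, 1.0],    # A
              [1.5, 1.5],    # B
              [5.0, 5.0],    # C (assumed)
              [3.0, 4.0],    # D
              [4.5, 5.0],    # E (assumed)
              [3.0, 3.5]])   # F

D = pdist(X, metric="euclidean")   # pairwise distances (condensed form)

# Agglomerative clustering under the four linkage criteria described above.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(D, method=method)                     # merge history = dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```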
Example
• Consider 6 objects (with name A, B, C, D, E and F)
• Two measured features (X1 and X2).
• Distance between object A = (1, 1) and B = (1.5, 1.5): sqrt(0.5² + 0.5²) ≈ 0.71
• Distance between D = (3, 4) and F = (3, 3.5): sqrt(0² + 0.5²) = 0.5
• D and F are the closest pair, so they are merged first into cluster (D, F).
Example cont..
• Distance between ungrouped clusters will not change from the original distance matrix.
• Now the problem is how to calculate the distance between the newly grouped cluster (D, F) and the other clusters.
  – Using single linkage, we take the minimum distance between the original objects of the two clusters.
  – Using the input distance matrix, the distance between cluster (D, F) and cluster A is computed as d((D, F), A) = min(d(D, A), d(F, A)).
Example cont..
• Distance between cluster C and cluster (D, F)
• Distance between cluster (D, F) and cluster (A, B)
Example cont..
• Distance between cluster ((D, F), E) and cluster (A, B)
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
• Interpretation of the results is (very) subjective.
Partitional Clustering
• Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters K.
Algorithm: k-means
The algorithm randomly selects k objects as the initial cluster means (centers) and works towards optimizing the squared-error criterion function.
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
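A minimal NumPy sketch of steps 1-5 (the helper name kmeans and the random initialisation are illustrative choices, not the only possible ones):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-5 above. X: (n, d) array, k: number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 2
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest cluster centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # step 5: no membership changed
        labels = new_labels
        # Step 4: re-estimate each centre as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Example run on the four "medicine" objects used later in these notes.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
print(kmeans(X, k=2))
```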
The squared-error criterion being minimized is E = Σ_{i=1..k} Σ_{x ∈ C_i} |x − m_i|², where m_i is the mean of cluster C_i.
K-means Clustering: Steps 1-5 (Algorithm: k-means, Distance Metric: Euclidean Distance)
[Figure: a sequence of five scatter plots (axes: expression in condition 1 vs. expression in condition 2, both ranging 0-5) showing the three cluster centres k1, k2, k3: initial placement of the centres, assignment of each point to its nearest centre, re-estimation of the centres, and repetition until the memberships no longer change.]
Example
Object X Y
Medicine A 1 1
Medicine B 2 1
Medicine C 4 3
Medicine D 5 4
K=2, based on the two features X, Y
1. Initial value of centroids : Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinate of the centroids, then c1=(1,1) and c2=(2,1)
2. Objects-Centroids distances: For the distance between each cluster centroid and each object we use the Euclidean distance.
• The distance from medicine C = (4, 3) to the first centroid c1 = (1, 1) is sqrt((4−1)² + (3−1)²) = sqrt(13) ≈ 3.61.
• The distance from medicine C = (4, 3) to the second centroid c2 = (2, 1) is sqrt((4−2)² + (3−1)²) = sqrt(8) ≈ 2.83.
• The distance matrix at iteration 0 (rows: c1, c2; columns: A, B, C, D) is
  D0 = [ 0     1     3.61   5.00
         1     0     2.83   4.24 ]
• Each column in the distance matrix symbolizes the object.
• The first row of the distance matrix corresponds to the distance of each object to the first centroid.
• Second row is the distance of each object to the second centroid.
3. Objects clustering: Each object is assigned to the group with the minimum distance. Thus, medicine A is assigned to group 1; medicines B, C and D to group 2.
4. Iteration-1, determine centroids : Again compute the new centroid of each group based on these new memberships.
Group 1 only has one member thus the centroid remains c1=(1,1).
Group 2 has three members, so its new centroid is the average of their coordinates: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3) ≈ (3.67, 2.67).
5. Iteration-1, Objects-Centroids distances: Compute the distance of all objects to the new centroids. Similar to step 2, a new distance matrix is obtained.
Group Matrix
• An element of the group matrix is 1 if and only if the corresponding object is assigned to that group, and 0 otherwise.
Example
6. Iteration-1, Objects clustering: Similar to step 3, we assign each object to the group with the minimum distance. Medicine B is moved to group 1.
7. Iteration-2, determine centroids: Step 4 is repeated to calculate the new centroids: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
8. Iteration-2, Objects-Centroids distances: Repeat step 2 to compute the new distance matrix.
9. Iteration-2, Objects clustering: Again, each object is assigned to the group with the minimum distance.
Comparing the results, G2 = G1 reveals that the objects do not change groups anymore, so the iteration stops.
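As a cross-check, and assuming scikit-learn is available, this worked example can be reproduced in a few lines (initializing the centroids at A and B as in step 1):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # medicines A, B, C, D
init = np.array([[1, 1], [2, 1]], dtype=float)                # initial centroids c1, c2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # A and B in one group, C and D in the other
print(km.cluster_centers_)   # approximately [[1.5, 1.0], [4.5, 3.5]]
```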
Comments on the K-Means Method
• Strengths
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
  – Scales well to large datasets.
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance.
  – Unable to handle noisy data and outliers.
Probability-Based Clustering
• The probability-based clustering approach is founded on the so-called finite mixture model.
• A mixture is a set of k probability distributions, each of which governs the attribute-value distribution of one cluster.
• We seek the most likely clusters given the data.
• An instance belongs to a particular cluster with a certain probability.
Using the Mixture Model
• Probability that instance x belongs to cluster A:
  Pr[A | x] = Pr[x | A] · Pr[A] / Pr[x] = f(x; μ_A, σ_A) · Pr[A] / Pr[x]
  with the normal density
  f(x; μ, σ) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²))
• Probability of an instance given the clusters:
  Pr[x | the clusters] = Σ_i Pr[x | cluster_i] · Pr[cluster_i]
Learning the Clusters
• Assume we know there are k clusters.
• Learn the clusters: determine their parameters, i.e. the mean and standard deviation of each cluster.
• Performance criterion: the probability of the training data given the clusters.
• The EM algorithm finds the maximum of the likelihood.
EM Algorithm
• EM = Expectation Maximization: it generalizes k-means to a probabilistic setting.
• Iterative procedure:
  – E ("expectation") step: calculate the cluster probability for each instance.
  – M ("maximization") step: estimate the distribution parameters from the cluster probabilities.
• Store the cluster probabilities as instance weights.
• Stop when the improvement is negligible.
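A minimal one-dimensional sketch of these two steps for a mixture of two Gaussians (the function names and the initialisation are illustrative assumptions; a full implementation would also guard against degenerate variances):

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    # Normal density f(x; mu, sigma).
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_two_gaussians(x, n_iter=20):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Initial guesses for the five parameters: mu1, sigma1, mu2, sigma2, P(C1).
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    p_c1 = 0.5
    for _ in range(n_iter):
        # E step: cluster probability (weight) of each instance for cluster 1.
        a = p_c1 * norm_pdf(x, mu1, s1)
        b = (1 - p_c1) * norm_pdf(x, mu2, s2)
        w = a / (a + b)
        # M step: re-estimate the parameters from the instance weights.
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        s1 = np.sqrt(np.sum(w * (x - mu1) ** 2) / np.sum(w)) + 1e-6
        s2 = np.sqrt(np.sum((1 - w) * (x - mu2) ** 2) / np.sum(1 - w)) + 1e-6
        p_c1 = np.mean(w)
    return mu1, s1, mu2, s2, p_c1

print(em_two_gaussians([-4, -3, -1, 3, 5]))   # data from the worked example below
```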
A 2-Cluster Example of the Finite Mixture Model
• In this example, it is assumed that there are two clusters and the attribute value distributions in both clusters are normal distributions.
[Figure: two overlapping normal densities, N(μ1, σ1²) and N(μ2, σ2²).]
Operation of the EM Algorithm
• The EM algorithm is used to figure out the parameters of the finite mixture model.
• In this example, we need to figure out the following 5 parameters: μ1, σ1, μ2, σ2, P(C1).
• Then, the probability that sample s_i belongs to cluster C1 is computed as
  p_i = Prob[C1 | x_i] / (Prob[C1 | x_i] + Prob[C2 | x_i]),
  where x_i is the attribute value of sample s_i, and
  Prob[C1 | x_i] = Prob[x_i | C1] · Prob[C1] / Prob[x_i], with Prob[x_i | C1] = 1/(√(2π) σ1) · exp(−(x_i − μ1)² / (2σ1²)),
  Prob[C2 | x_i] = Prob[x_i | C2] · Prob[C2] / Prob[x_i], with Prob[x_i | C2] = 1/(√(2π) σ2) · exp(−(x_i − μ2)² / (2σ2²)).
• The new estimated values of the parameters are then computed from the weights p_i as follows (the sums run over all n samples):
  μ1 = Σ_i p_i x_i / Σ_i p_i          σ1² = Σ_i p_i (x_i − μ1)² / Σ_i p_i
  μ2 = Σ_i (1 − p_i) x_i / Σ_i (1 − p_i)   σ2² = Σ_i (1 − p_i) (x_i − μ2)² / Σ_i (1 − p_i)
  P(C1) = Σ_i p_i / n
• The process is repeated until the clustering results converge.
• Generally, we attempt to maximize the following likelihood function:
  L = Π_i [ P(C1) · P(x_i | C1) + P(C2) · P(x_i | C2) ]
• Once we have figured out the approximate parameter values, we assign sample s_i to cluster C1 if
  p_i = Prob[C1 | x_i] / (Prob[C1 | x_i] + Prob[C2 | x_i]) ≥ 0.5,
  with Prob[C1 | x_i] and Prob[C2 | x_i] computed as above.
• Otherwise, s_i is assigned to C2.
Example
Consider a one-dimensional clustering problem in which the given data are x1 = −4, x2 = −3, x3 = −1, x4 = 3, x5 = 5, drawn from two probability distributions.
The initial means of the first Gaussian (first cluster) and the second Gaussian (second cluster) are taken to be 0 and 2.
The probability density function of each Gaussian (with standard deviation σ = 2) is
  f(x, μ) = 1/√(8π) · exp(−(x − μ)² / 8)
Example
• First, the probability density of each instance under each cluster is computed using the probability density function:
  f(−4, 0) = 1/√(8π) · exp(−(−4 − 0)²/8) = 0.0269     f(−4, 2) = 1/√(8π) · exp(−(−4 − 2)²/8) = 0.0022
  f(−3, 0) = 0.0646     f(−3, 2) = 0.00874
  f(−1, 0) = 0.176      f(−1, 2) = 0.0646
  f(3, 0)  = 0.0646     f(3, 2)  = 0.176
  f(5, 0)  = 0.00874    f(5, 2)  = 0.0646
Example
• E Step:
  h11 = f(x1, μ1) / (f(x1, μ1) + f(x1, μ2)) = 0.0269 / (0.0269 + 0.0022) = 0.924     h12 = 0.0756
  h21 = 0.8808     h22 = 0.1191
  h31 = 0.7315     h32 = 0.2685
  h41 = 0.2685     h42 = 0.7315
  h51 = 0.1191     h52 = 0.8808
• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −1.94     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.39
2nd Iteration
• E Step:
  f(−4, −1.94) = 1/√(8π) · exp(−(−4 − (−1.94))²/8) = 0.11733     f(−4, 3.39) = 1/√(8π) · exp(−(−4 − 3.39)²/8) = 0.00022
  f(−3, −1.94) = 0.17330     f(−3, 3.39) = 0.00121
  f(−1, −1.94) = 0.17858     f(−1, 3.39) = 0.01793
  f(3, −1.94)  = 0.00944     f(3, 3.39)  = 0.19568
  f(5, −1.94)  = 0.00048     f(5, 3.39)  = 0.14424

  h11 = 0.11733 / (0.11733 + 0.00022) = 0.99813     h12 = 0.00022 / (0.11733 + 0.00022) = 0.00187
  h21 = 0.17330 / (0.17330 + 0.00121) = 0.99307     h22 = 0.00693
  h31 = 0.17858 / (0.17858 + 0.01793) = 0.90876     h32 = 0.09124
  h41 = 0.00944 / (0.00944 + 0.19568) = 0.04602     h42 = 0.95398
  h51 = 0.00048 / (0.00048 + 0.14424) = 0.00332     h52 = 0.99668

• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −2.62     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.77
3rd Iteration
• E Step:
  f(−4, −2.62) = 0.15718     f(−4, 3.77) = 0.00011
  f(−3, −2.62) = 0.19586     f(−3, 3.77) = 0.00065
  f(−1, −2.62) = 0.14366     f(−1, 3.77) = 0.01160
  f(3, −2.62)  = 0.00385     f(3, 3.77)  = 0.18519
  f(5, −2.62)  = 0.00014     f(5, 3.77)  = 0.16507

  h11 = 0.99930     h12 = 0.00070
  h21 = 0.99669     h22 = 0.00331
  h31 = 0.92529     h32 = 0.07471
  h41 = 0.02037     h42 = 0.97963
  h51 = 0.00085     h52 = 0.99915

• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −2.67     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.81
EM Clustering in Weka
• The Expectation-Maximization (EM) algorithm is part of the Weka clustering package.
• The EM clustering scheme generates probabilistic descriptions of the clusters in terms of the mean and standard deviation for the numeric attributes, and value counts (incremented by 1 and modified with a small value to avoid zero probabilities) for the nominal ones.
• EM can select the number of clusters automatically by maximizing the logarithm of the likelihood of future data, estimated using ten-fold cross-validation.
• Cross-validation steps:
  (i) The number of clusters is set to 1.
  (ii) The training set is split randomly into 10 folds.
  (iii) EM is performed 10 times using 10-fold cross-validation.
  (iv) The log-likelihood is averaged over all 10 results.
  (v) If the log-likelihood has increased, the number of clusters is increased by 1 and the procedure continues at step (ii).
Limitation of the Finite Mixture Model and the EM Algorithm
• The finite mixture model and the EM algorithm generally assume that the attributes are independent.
• Approaches have been proposed for handling correlated attributes. However, these approaches are subject to further limitations.
Attribute Information of Soybean Disease Dataset
v1 date: april=0, may=1, june=2, july=3, august=4, september=5, october=6
v2 plant-stand: normal=0, lt-normal=1
v3 precip: lt-norm=0, norm=1, gt-norm=2
v4 temp: lt-norm=0, norm=1, gt-norm=2
v5 hail: yes=0, no=1
v6 crop-hist: diff-lst-year=0, same-lst-yr=1, same-lst-two-yrs=2, same-lst-sev-yrs=3
v7 area-damaged: scattered=0, low-areas=1, upper-areas=2, whole-field=3
v8 severity: pot-severe=1, severe=2
v9 seed-tmt: none=0, fungicide=1
v10 germination: '90-100%'=0, '80-89%'=1, 'lt-80%'=2
v12 leaves: norm=0, abnorm=1
v20 lodging: yes=0, no=1
v21 stem-cankers: absent=0, below-soil=1, above-soil=2, above-sec-nde=3
v22 canker-lesion: dna=0, brown=1, dk-brown-blk=2, tan=3
v23 fruiting-bodies: absent=0, present=1
v24 external decay: absent=0, firm-and-dry=1
v25 mycelium: absent=0, present=1
v26 int-discolor: none=0, black=2
v27 sclerotia: absent=0, present=1
v28 fruit-pods: norm=0, dna=3
v35 roots: norm=0, rotted=1