Introduction to Clustering
Alka Arora, Sr. Scientist
Indian Agricultural Statistics Research Institute, Library Avenue, Pusa, New Delhi. [email protected], [email protected]
What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
• Organizing data into classes such that there is
  – high intra-class similarity
  – low inter-class similarity
• Finding the class labels and the number of classes directly from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
Clustering is subjective: the same set of people could equally well be grouped as school employees vs. the Simpson's family, or as males vs. females.
How do we measure similarity?
Peter → Piter (substitution: i for e) → Pioter (insertion: o) → Piotr (deletion: e)
Edit Distance Example
It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it.
The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
How similar are the names “Peter” and “Piotr”? Assume the following cost function:
Substitution: 1 unit
Insertion: 1 unit
Deletion: 1 unit
D(Peter,Piotr) is 3
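As an illustration, here is a minimal Python sketch of this edit-distance computation using the unit costs above (the function name edit_distance and the dynamic-programming layout are illustrative choices, not part of the original slides):

```python
def edit_distance(q, c):
    """Minimum number of substitutions, insertions and deletions
    needed to transform string q into string c (unit cost each)."""
    m, n = len(q), len(c)
    # dp[i][j] = cheapest cost of transforming q[:i] into c[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                  # delete remaining characters of q
    for j in range(n + 1):
        dp[0][j] = j                                  # insert all characters of c
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == c[j - 1] else 1   # substitution cost (0 if equal)
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # 3
```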
Similarity / Dissimilarity
• Similarity between objects i and j is denoted by s_ij. We can measure this quantity in several ways, depending on the scale of measurement (or data type) that we have.
• Dissimilarity (d_ij) is the opposite of similarity. There are many types of distance and similarity measures, and each has its own characteristics.
• The relationship between dissimilarity and similarity is given by d_ij = 1 − s_ij.
Distance for Binary Variables
• Binary variables take only two possible values, such as Yes and No, Agree and Disagree, True and False, Success and Failure, 0 and 1, Absent and Present, or Positive and Negative.
• Representing the two values as positive and negative, the similarity or dissimilarity (distance) of two objects can be measured in terms of the number of occurrences (frequency) of positive and negative values in each object.
Example
• Let:
  p = number of variables that are positive for both objects
  q = number of variables that are positive for the i-th object and negative for the j-th object
  r = number of variables that are negative for the i-th object and positive for the j-th object
  s = number of variables that are negative for both objects
  t = p + q + r + s = total number of variables

Feature of Fruit   Sphere shape   Sweet   Sour   Crunchy
Apple              Yes            Yes     Yes    Yes
Banana             No             Yes     No     No

• For Apple and Banana we have p = 1, q = 3, r = 0 and s = 0. Thus t = p + q + r + s = 4.
Distance Measures for Binary Variables
• Simple matching distance = (q + r) / t = 3/4
• Jaccard's distance = (q + r) / (p + q + r) = 3/4
• Hamming distance: the number of positions at which the two objects differ, i.e. q + r = 3
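As a small check, here is a Python sketch (the helper name binary_distances is illustrative) that computes p, q, r, s and the three distances for the Apple/Banana rows above:

```python
def binary_distances(obj_i, obj_j):
    """obj_i, obj_j: lists of 0/1 values for the same binary variables."""
    p = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 1)  # both positive
    q = sum(1 for a, b in zip(obj_i, obj_j) if a == 1 and b == 0)  # i positive, j negative
    r = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 1)  # i negative, j positive
    s = sum(1 for a, b in zip(obj_i, obj_j) if a == 0 and b == 0)  # both negative
    t = p + q + r + s
    simple_matching = (q + r) / t        # mismatches over all variables
    jaccard = (q + r) / (p + q + r)      # ignores joint absences (s)
    hamming = q + r                      # number of differing positions
    return simple_matching, jaccard, hamming

apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(binary_distances(apple, banana))   # (0.75, 0.75, 3)
```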
Distance for Quantitative Variables
Euclidean distance examines the root of squared differences between the coordinates of a pair of objects: d(A, B) = sqrt(Σ_i (a_i − b_i)²).
City block (Manhattan) distance examines the sum of absolute differences between the coordinates of a pair of objects: d(A, B) = Σ_i |a_i − b_i|.
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
For these vectors, the Euclidean distance is sqrt(49 + 9 + 1 + 36) = sqrt(95) ≈ 9.75 and the city block distance is 7 + 3 + 1 + 6 = 17.
Distance for Quantitative Variables
Chebyshev distance is also called maximum value distance. It examines the absolute magnitude of the differences between the coordinates of a pair of objects and takes the largest: d(A, B) = max_i |a_i − b_i|. This distance can be used for both ordinal and quantitative variables.
Minkowski distance is the generalized metric distance: d(A, B) = (Σ_i |a_i − b_i|^λ)^(1/λ). When λ = 1 it becomes the city block distance, and when λ = 2 it becomes the Euclidean distance. Chebyshev distance is a special case of Minkowski distance with λ → ∞ (taking the limit).
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
For these vectors, the Chebyshev distance is max(7, 3, 1, 6) = 7.
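A minimal NumPy sketch computing these distances for the vectors A and B above (the helper name minkowski is illustrative):

```python
import numpy as np

A = np.array([0, 3, 4, 5], dtype=float)
B = np.array([7, 6, 3, -1], dtype=float)

def minkowski(x, y, lam):
    # Generalised metric: lam=1 gives city block, lam=2 gives Euclidean.
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

euclidean = np.sqrt(np.sum((A - B) ** 2))   # sqrt(95) ~ 9.75
manhattan = np.sum(np.abs(A - B))           # 17
chebyshev = np.max(np.abs(A - B))           # 7

print(euclidean, manhattan, chebyshev)
print(minkowski(A, B, 1), minkowski(A, B, 2))   # matches city block and Euclidean
```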
Distance for Quantitative Variables
• Angular separation represents the cosine of the angle between two vectors: s(A, B) = Σ_i a_i b_i / sqrt(Σ_i a_i² · Σ_i b_i²). It measures similarity rather than distance or dissimilarity, so a higher value of angular separation indicates that the two objects are more similar. Like the cosine, its value lies in [−1, 1]. It is closely related to the correlation coefficient (next slide).
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
Distance for Quantitative Variables
• The correlation coefficient is the angular separation computed after standardizing, i.e. after centering each vector on its mean value. Its value lies between −1 and +1. It measures similarity rather than distance or dissimilarity.
A = (0, 3, 4, 5)
B = (7, 6, 3, -1)
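Continuing the same NumPy sketch, the angular separation (cosine) and the correlation coefficient for A and B (variable names are illustrative):

```python
import numpy as np

A = np.array([0, 3, 4, 5], dtype=float)
B = np.array([7, 6, 3, -1], dtype=float)

# Angular separation (cosine): dot product divided by the product of the norms.
cosine = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))

# Correlation coefficient: the same quantity after centring each vector on its mean.
Ac, Bc = A - A.mean(), B - B.mean()
correlation = Ac @ Bc / (np.linalg.norm(Ac) * np.linalg.norm(Bc))

print(cosine, correlation)   # both lie in [-1, 1]
```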
Two Types of Clustering
• Partitional algorithms: construct various partitions and then evaluate them by some criterion.
• Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.
A Useful Tool for Summarizing Similarity Measurements
The dendrogram is the tool used to summarize similarity measurements in hierarchical clustering.
[Figure: anatomy of a dendrogram, with the root, internal nodes, internal branches, terminal branches and leaves labelled.]
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
Business & Economy
B2B Finance Shopping Jobs
Aerospace Agriculture… Banking Bonds… Animals Apparel Career Workspace
Note that hierarchies are commonly used to organize information, for example in a web portal.
Yahoo’s hierarchy is manually created, data mining focus on automatic creation of hierarchies.
We can look at the dendrogram to determine the “correct” number of clusters. In this case, the two highly separated subtrees are highly suggestive of two clusters. (Things are rarely this clear cut, unfortunately)
Outlier detection
One potential use of a dendrogram is to detect outliers: a single isolated branch is suggestive of a data point that is very different from all the others.
(How-to) Hierarchical Clustering
The number of dendrograms with n leaves = (2n − 3)! / [2^(n−2) (n − 2)!]
Number of Leaves   Number of Possible Dendrograms
2                  1
3                  3
4                  15
5                  105
…                  …
10                 34,459,425
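As a quick check of the formula and the table, a small Python sketch (the helper name num_dendrograms is illustrative):

```python
from math import factorial

def num_dendrograms(n):
    """Number of distinct dendrograms (rooted binary trees) with n labelled leaves."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))   # 1, 3, 15, 105, 34459425
```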
Since we cannot test all possible trees, we have to resort to a heuristic search over the space of possible trees. We could do this in one of two ways:
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
Top-Down (divisive): Starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
Process flow of Agglomerative Hierarchical Clustering
• Convert object features to a distance matrix.
• Set each object as a cluster (thus, if we have 6 objects, we will have 6 clusters at the beginning).
• Iterate until the number of clusters is 1:
  – Merge the two closest clusters.
  – Update the distance matrix.
Linkage criteria determine the distance between an object and a cluster, or between two clusters:
• Single linkage (nearest neighbor): In this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
• Complete linkage (furthest neighbor): In this method, the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
• Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
• Ward's linkage: In this method, we try to minimize the variance of the merged clusters.
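For illustration, here is a small sketch using SciPy's hierarchical clustering, assuming SciPy is available (the coordinates for objects C and E are made-up values, since only A, B, D and F are given in the example below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Objects A-F with two features (X1, X2); C and E are illustrative values.
X = np.array([[1.0, 1.0],    # A
              [1.5, 1.5],    # B
              [5.0, 5.0],    # C (assumed)
              [3.0, 4.0],    # D
              [4.5, 5.0],    # E (assumed)
              [3.0, 3.5]])   # F

D = pdist(X, metric="euclidean")   # pairwise distances (condensed form)

# Agglomerative clustering under the four linkage criteria described above.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(D, method=method)                     # merge history = dendrogram
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```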
Example
• Consider 6 objects (with name A, B, C, D, E and F)
• Two measured features (X1 and X2).
• Distance between object A = (1, 1) and B = (1.5, 1.5): sqrt(0.5² + 0.5²) ≈ 0.71
• Distance between D = (3, 4) and F = (3, 3.5): sqrt(0² + 0.5²) = 0.5
• D and F are the closest pair, so they are merged first into cluster (D, F).
Example cont..
• Distance between ungrouped clusters will not change from the original distance matrix.
• Now the problem is how to calculate the distance between the newly grouped cluster (D, F) and the other clusters.
  – Using single linkage, we take the minimum distance between the original objects of the two clusters.
  – Using the input distance matrix, the distance between cluster (D, F) and cluster A is computed as d((D, F), A) = min(d(D, A), d(F, A)).
Example cont..
• Distance between cluster C and cluster (D, F)
• Distance between cluster (D, F) and cluster (A, B)
Example cont..
• Distance between cluster ((D, F), E) and cluster (A, B)
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical nature maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
• Interpretation of the results is (very) subjective.
Partitional Clustering
• Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters K.
Algorithm: k-means
The algorithm randomly selects k objects as the initial cluster means (centers) and works towards optimizing the squared-error criterion function.
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
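A minimal NumPy sketch of steps 1-5 (the helper name kmeans and the random initialisation are illustrative choices, not the only possible ones):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-5 above. X: (n, d) array, k: number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 2
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest cluster centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # step 5: no membership changed
        labels = new_labels
        # Step 4: re-estimate each centre as the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Example run on the four "medicine" objects used later in these notes.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
print(kmeans(X, k=2))
```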
The squared-error criterion being minimized is E = Σ_{i=1..k} Σ_{x ∈ C_i} |x − m_i|², where m_i is the mean of cluster C_i.
K-means Clustering: Steps 1-5 (Algorithm: k-means, Distance Metric: Euclidean Distance)
[Figure: a sequence of five scatter plots (axes: expression in condition 1 vs. expression in condition 2, both ranging 0-5) showing the three cluster centres k1, k2, k3: initial placement of the centres, assignment of each point to its nearest centre, re-estimation of the centres, and repetition until the memberships no longer change.]
Example
Object X Y
Medicine A 1 1
Medicine B 2 1
Medicine C 4 3
Medicine D 5 4
K=2, based on the two features X, Y
1. Initial value of centroids : Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinate of the centroids, then c1=(1,1) and c2=(2,1)
2. Objects-Centroids distances: For the distance between each cluster centroid and each object we use the Euclidean distance.
• The distance from medicine C = (4, 3) to the first centroid c1 = (1, 1) is sqrt((4−1)² + (3−1)²) = sqrt(13) ≈ 3.61.
• The distance from medicine C = (4, 3) to the second centroid c2 = (2, 1) is sqrt((4−2)² + (3−1)²) = sqrt(8) ≈ 2.83.
• The distance matrix at iteration 0 (rows: c1, c2; columns: A, B, C, D) is
  D0 = [ 0     1     3.61   5.00
         1     0     2.83   4.24 ]
• Each column in the distance matrix symbolizes the object.
• The first row of the distance matrix corresponds to the distance of each object to the first centroid.
• Second row is the distance of each object to the second centroid.
3. Objects clustering: Each object is assigned to the group with the minimum distance. Thus, medicine A is assigned to group 1; medicines B, C and D to group 2.
4. Iteration-1, determine centroids : Again compute the new centroid of each group based on these new memberships.
Group 1 only has one member thus the centroid remains c1=(1,1).
Group 2 has three members, so its new centroid is the average of their coordinates: c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3) ≈ (3.67, 2.67).
5. Iteration-1, Objects-Centroids distances: Compute the distance of all objects to the new centroids. Similar to step 2, a new distance matrix is obtained.
Group Matrix
• An element of the group matrix is 1 if and only if the corresponding object is assigned to that group, and 0 otherwise.
Example
6. Iteration-1, Objects clustering: Similar to step 3, we assign each object to the group with the minimum distance. Medicine B is moved to group 1.
7. Iteration-2, determine centroids: Step 4 is repeated to calculate the new centroids: c1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
8. Iteration-2, Objects-Centroids distances: Repeat step 2 to compute the new distance matrix.
9. Iteration-2, Objects clustering: Again, each object is assigned to the group with the minimum distance.
Comparing the results, G2 = G1 reveals that the objects do not change groups anymore, so the iteration stops.
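As a cross-check, and assuming scikit-learn is available, this worked example can be reproduced in a few lines (initializing the centroids at A and B as in step 1):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # medicines A, B, C, D
init = np.array([[1, 1], [2, 1]], dtype=float)                # initial centroids c1, c2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # A and B in one group, C and D in the other
print(km.cluster_centers_)   # approximately [[1.5, 1.0], [4.5, 3.5]]
```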
Comments on the K-Means Method
• Strengths
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
  – Scales well to large datasets.
• Weaknesses
  – Applicable only when a mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance.
  – Unable to handle noisy data and outliers.
Probability-Based Clustering
• The probability-based clustering approach is founded on the so-called finite mixture model.
• A mixture is a set of k probability distributions, each of which governs the attribute-value distribution of one cluster.
• We seek the most likely clusters given the data.
• An instance belongs to a particular cluster with a certain probability.
Using the Mixture Model
• Probability that instance x belongs to cluster A:
  Pr[A | x] = Pr[x | A] · Pr[A] / Pr[x] = f(x; μ_A, σ_A) · Pr[A] / Pr[x]
  with the normal density
  f(x; μ, σ) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²))
• Probability of an instance given the clusters:
  Pr[x | the clusters] = Σ_i Pr[x | cluster_i] · Pr[cluster_i]
Learning the Clusters
• Assume we know there are k clusters.
• Learn the clusters: determine their parameters, i.e. the mean and standard deviation of each cluster.
• Performance criterion: the probability of the training data given the clusters.
• The EM algorithm finds the maximum of the likelihood.
EM Algorithm
• EM = Expectation Maximization: it generalizes k-means to a probabilistic setting.
• Iterative procedure:
  – E ("expectation") step: calculate the cluster probability for each instance.
  – M ("maximization") step: estimate the distribution parameters from the cluster probabilities.
• Store the cluster probabilities as instance weights.
• Stop when the improvement is negligible.
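A minimal one-dimensional sketch of these two steps for a mixture of two Gaussians (the function names and the initialisation are illustrative assumptions; a full implementation would also guard against degenerate variances):

```python
import numpy as np

def norm_pdf(x, mu, sigma):
    # Normal density f(x; mu, sigma).
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_two_gaussians(x, n_iter=20):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    # Initial guesses for the five parameters: mu1, sigma1, mu2, sigma2, P(C1).
    mu1, mu2 = x.min(), x.max()
    s1 = s2 = x.std()
    p_c1 = 0.5
    for _ in range(n_iter):
        # E step: cluster probability (weight) of each instance for cluster 1.
        a = p_c1 * norm_pdf(x, mu1, s1)
        b = (1 - p_c1) * norm_pdf(x, mu2, s2)
        w = a / (a + b)
        # M step: re-estimate the parameters from the instance weights.
        mu1 = np.sum(w * x) / np.sum(w)
        mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
        s1 = np.sqrt(np.sum(w * (x - mu1) ** 2) / np.sum(w)) + 1e-6
        s2 = np.sqrt(np.sum((1 - w) * (x - mu2) ** 2) / np.sum(1 - w)) + 1e-6
        p_c1 = np.mean(w)
    return mu1, s1, mu2, s2, p_c1

print(em_two_gaussians([-4, -3, -1, 3, 5]))   # data from the worked example below
```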
A 2-Cluster Example of the Finite Mixture Model
• In this example, it is assumed that there are two clusters and the attribute value distributions in both clusters are normal distributions.
[Figure: two overlapping normal densities, N(μ1, σ1²) and N(μ2, σ2²).]
Operation of the EM Algorithm
• The EM algorithm is used to figure out the parameters of the finite mixture model.
• In this example, we need to figure out the following 5 parameters: μ1, σ1, μ2, σ2, P(C1).
• Then, the probability that sample s_i belongs to cluster C1 is computed as
  p_i = Prob[C1 | x_i] / (Prob[C1 | x_i] + Prob[C2 | x_i]),
  where x_i is the attribute value of sample s_i, and
  Prob[C1 | x_i] = Prob[x_i | C1] · Prob[C1] / Prob[x_i], with Prob[x_i | C1] = 1/(√(2π) σ1) · exp(−(x_i − μ1)² / (2σ1²)),
  Prob[C2 | x_i] = Prob[x_i | C2] · Prob[C2] / Prob[x_i], with Prob[x_i | C2] = 1/(√(2π) σ2) · exp(−(x_i − μ2)² / (2σ2²)).
• The new estimated values of the parameters are then computed from the weights p_i as follows (the sums run over all n samples):
  μ1 = Σ_i p_i x_i / Σ_i p_i          σ1² = Σ_i p_i (x_i − μ1)² / Σ_i p_i
  μ2 = Σ_i (1 − p_i) x_i / Σ_i (1 − p_i)   σ2² = Σ_i (1 − p_i) (x_i − μ2)² / Σ_i (1 − p_i)
  P(C1) = Σ_i p_i / n
• The process is repeated until the clustering results converge.
• Generally, we attempt to maximize the following likelihood function:
  L = Π_i [ P(C1) · P(x_i | C1) + P(C2) · P(x_i | C2) ]
• Once we have figured out the approximate parameter values, we assign sample s_i to cluster C1 if
  p_i = Prob[C1 | x_i] / (Prob[C1 | x_i] + Prob[C2 | x_i]) ≥ 0.5,
  with Prob[C1 | x_i] and Prob[C2 | x_i] computed as above.
• Otherwise, s_i is assigned to C2.
Example
Consider a one-dimensional clustering problem in which the given data are x1 = −4, x2 = −3, x3 = −1, x4 = 3, x5 = 5, drawn from two probability distributions.
The initial means of the first Gaussian (first cluster) and the second Gaussian (second cluster) are taken to be 0 and 2.
The probability density function of each Gaussian (with standard deviation σ = 2) is
  f(x, μ) = 1/√(8π) · exp(−(x − μ)² / 8)
Example
• First, the probability density of each instance under each cluster is computed using the probability density function:
  f(−4, 0) = 1/√(8π) · exp(−(−4 − 0)²/8) = 0.0269     f(−4, 2) = 1/√(8π) · exp(−(−4 − 2)²/8) = 0.0022
  f(−3, 0) = 0.0646     f(−3, 2) = 0.00874
  f(−1, 0) = 0.176      f(−1, 2) = 0.0646
  f(3, 0)  = 0.0646     f(3, 2)  = 0.176
  f(5, 0)  = 0.00874    f(5, 2)  = 0.0646
Example
• E Step:
  h11 = f(x1, μ1) / (f(x1, μ1) + f(x1, μ2)) = 0.0269 / (0.0269 + 0.0022) = 0.924     h12 = 0.0756
  h21 = 0.8808     h22 = 0.1191
  h31 = 0.7315     h32 = 0.2685
  h41 = 0.2685     h42 = 0.7315
  h51 = 0.1191     h52 = 0.8808
• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −1.94     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.39
2nd Iteration
• E Step:
  f(−4, −1.94) = 1/√(8π) · exp(−(−4 − (−1.94))²/8) = 0.11733     f(−4, 3.39) = 1/√(8π) · exp(−(−4 − 3.39)²/8) = 0.00022
  f(−3, −1.94) = 0.17330     f(−3, 3.39) = 0.00121
  f(−1, −1.94) = 0.17858     f(−1, 3.39) = 0.01793
  f(3, −1.94)  = 0.00944     f(3, 3.39)  = 0.19568
  f(5, −1.94)  = 0.00048     f(5, 3.39)  = 0.14424

  h11 = 0.11733 / (0.11733 + 0.00022) = 0.99813     h12 = 0.00022 / (0.11733 + 0.00022) = 0.00187
  h21 = 0.17330 / (0.17330 + 0.00121) = 0.99307     h22 = 0.00693
  h31 = 0.17858 / (0.17858 + 0.01793) = 0.90876     h32 = 0.09124
  h41 = 0.00944 / (0.00944 + 0.19568) = 0.04602     h42 = 0.95398
  h51 = 0.00048 / (0.00048 + 0.14424) = 0.00332     h52 = 0.99668

• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −2.62     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.77
3rd Iteration
• E Step:
  f(−4, −2.62) = 0.15718     f(−4, 3.77) = 0.00011
  f(−3, −2.62) = 0.19586     f(−3, 3.77) = 0.00065
  f(−1, −2.62) = 0.14366     f(−1, 3.77) = 0.01160
  f(3, −2.62)  = 0.00385     f(3, 3.77)  = 0.18519
  f(5, −2.62)  = 0.00014     f(5, 3.77)  = 0.16507

  h11 = 0.99930     h12 = 0.00070
  h21 = 0.99669     h22 = 0.00331
  h31 = 0.92529     h32 = 0.07471
  h41 = 0.02037     h42 = 0.97963
  h51 = 0.00085     h52 = 0.99915

• M Step (calculate the means again):
  μ1 = Σ_i h_i1 x_i / Σ_i h_i1 = −2.67     μ2 = Σ_i h_i2 x_i / Σ_i h_i2 = 3.81
EM Clustering in Weka
• The Expectation-Maximization (EM) algorithm is part of the Weka clustering package.
• The EM clustering scheme generates probabilistic descriptions of the clusters in terms of the mean and standard deviation for the numeric attributes, and value counts (incremented by 1 and modified with a small value to avoid zero probabilities) for the nominal ones.
• EM can select the number of clusters automatically by maximizing the logarithm of the likelihood of future data, estimated using ten-fold cross-validation.
• Cross-validation steps:
  (i) The number of clusters is set to 1.
  (ii) The training set is split randomly into 10 folds.
  (iii) EM is performed 10 times using 10-fold cross-validation.
  (iv) The log-likelihood is averaged over all 10 results.
  (v) If the log-likelihood has increased, the number of clusters is increased by 1 and the procedure continues at step (ii).
Limitation of the Finite Mixture Model and the EM Algorithm
• The finite mixture model and the EM algorithm generally assume that the attributes are independent.
• Approaches have been proposed for handling correlated attributes. However, these approaches are subject to further limitations.
Attribute Information of Soybean Disease Dataset
v1 date: april=0, may=1, june=2, july=3, august=4, september=5, october=6
v2 plant-stand: normal=0, lt-normal=1
v3 precip: lt-norm=0, norm=1, gt-norm=2
v4 temp: lt-norm=0, norm=1, gt-norm=2
v5 hail: yes=0, no=1
v6 crop-hist: diff-lst-year=0, same-lst-yr=1, same-lst-two-yrs=2, same-lst-sev-yrs=3
v7 area-damaged: scattered=0, low-areas=1, upper-areas=2, whole-field=3
v8 severity: pot-severe=1, severe=2
v9 seed-tmt: none=0, fungicide=1
v10 germination: '90-100%'=0, '80-89%'=1, 'lt-80%'=2
v12 leaves: norm=0, abnorm=1
v20 lodging: yes=0, no=1
v21 stem-cankers: absent=0, below-soil=1, above-soil=2, above-sec-nde=3
v22 canker-lesion: dna=0, brown=1, dk-brown-blk=2, tan=3
v23 fruiting-bodies: absent=0, present=1
v24 external decay: absent=0, firm-and-dry=1
v25 mycelium: absent=0, present=1
v26 int-discolor: none=0, black=2
v27 sclerotia: absent=0, present=1
v28 fruit-pods: norm=0, dna=3
v35 roots: norm=0, rotted=1