
Page 1

AMCS/CS 340: Data Mining

Clustering

Xiangliang Zhang

King Abdullah University of Science and Technology

Page 2

Grouping fruits


Grouping apple with apple, orange with orange and banana with banana

Page 3

Give pictures to a computer


Page 4

Change pictures to data


Page 5

Change pictures to data


x1, x2, x3, x4, x5, x6, x7, x8, x9, …, xn


Page 6

Use clustering methods


Input: x1 x2 x3 x4 x5 x6 x7 x8 x9 …

→ Clustering Method →

Output (clustering indicator): 1 2 1 2 3 2 3 1 3 …


Page 7

Correct?


Input: x1 x2 x3 x4 x5 x6 x7 x8 x9 …

→ Clustering Method →

Output (clustering indicator): 1 2 1 2 3 2 3 1 3 …


Page 8

Cluster Analysis

1. What is Cluster Analysis?

2. Partitioning Methods

3. Hierarchical Methods

4. Density-Based Methods

5. Grid-Based Methods

6. Model-Based Methods

7. Clustering High-Dimensional Data

8. How to decide the number of clusters?


Page 9

What is Cluster Analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups:

• Inter-cluster distances are maximized

• Intra-cluster distances are minimized

Unsupervised learning: no predefined classes


Page 10

Applications of Cluster Analysis

Understanding
• As a stand-alone tool to get insight into the data distribution
• As a preprocessing step for other algorithms
• E.g., group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Summarization
• Reduce the size of large data sets
• Image segmentation/compression
• Preserve privacy (e.g., in medical data)


Page 11

Clustering

A clustering is a set of clusters

Notion of a Cluster can be Ambiguous

[Figure: a set of data points, and clusterings of the same points into two, four, and six clusters]


Page 12

Quality: What is Good Clustering?

• A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

• The quality of a clustering result depends on both the implementation of a method and the similarity measure used by this method.

• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Page 13

What kinds of similarity measures?

Quality: What is Good Clustering?

• A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

• The quality of a clustering result depends on both the implementation of a method and the similarity measure used by this method.

• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

What kinds of methods?


Page 14

Similarity measure

The similarity measure depends on the characteristics of the input data:

• Attribute type: binary, categorical, continuous

• Sparseness

• Dimensionality

• Type of proximity


[Figure: types of proximity, center-based vs. density-based]


Page 15

Data Structures

Data matrix: n instances, p attributes (features)

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

Distance matrix (dissimilarity matrix):

$$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$

• Minkowski distance:

$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$

If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$

If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$

• Cosine measure

• Correlation coefficient
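To make these measures concrete, here is a minimal NumPy sketch (the function names are mine, not from the slides):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two attribute vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)          # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)          # q = 2

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 4.0])
print(manhattan(x, y), euclidean(x, y), cosine_similarity(x, y))
```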


Page 16

Types of Clustering methods

Partitioning clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Typical methods: k-means, k-medoids, CLARANS

[Figure: a partitioning clustering]


Page 17

Types of Clustering methods

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON

[Figure: a hierarchical clustering of points p1-p4 and the corresponding dendrogram]


Page 18

Types of Clustering methods

Density-based clustering: based on connectivity and density functions. A cluster is a dense region of points, separated by low-density regions from other regions of high density.
Typical methods: DBSCAN, OPTICS, DenClue

[Figure: 6 density-based clusters]


Page 19

Types of Clustering methods

Grid-based clustering: based on a multiple-level granularity structure.
Typical methods: STING, WaveCluster, CLIQUE


Page 20

Types of Clustering methods

Model-based clustering: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model.
Typical methods: EM, SOM, COBWEB


Page 21

Cluster Analysis

1. What is Cluster Analysis?

2. Partitioning Methods

3. Hierarchical Methods

4. Density-Based Methods

5. Grid-Based Methods

6. Model-Based Methods

7. Clustering High-Dimensional Data

8. How to decide the number of clusters?


Page 22

Partitioning Algorithms: Basic Concept

Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

$$\min \sum_{m=1}^{k} \sum_{x_i \in C_m} (x_i - \mu_m)^2, \qquad \mu_m = \frac{1}{|C_m|} \sum_{x_i \in C_m} x_i$$

where μm is the averaged center of cluster Cm.

[Figure: a partitioning clustering with k = 3: clusters C1, C2, C3 with centers μi]

NP-hard when k is a part of the input (even for 2-dim)*. Given a k, finding a partition of k clusters that optimizes the sum of squared distances takes $O(n^{dk})$.#

Heuristic methods: k-means (also called Lloyd's method [Llo82])

* Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. (2009). "The Planar k-Means Problem is NP-Hard". Lecture Notes in Computer Science 5431: 274-285.
# Inaba; Katoh; Imai (1994). "Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering". Proceedings of the 10th ACM Symposium on Computational Geometry.

Page 23

Partitioning Algorithms: Basic Concept

Partitioning clustering method: construct a partition of a dataset D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

$$\min \sum_{m=1}^{k} \sum_{x_i \in C_m} (x_i - x_m)^2, \qquad x_m \in \{x_i \mid x_i \in C_m\}$$

where xm is an actual center (medoid) of cluster Cm.

[Figure: a partitioning clustering with k = 3: clusters C1, C2, C3 with medoids marked O]

Global optimum: exhaustively enumerate all partitions.

Heuristic methods: k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87)

Page 24

Partitioning Algorithms

• k-means Algorithm

Issue of initial centroids, clustering evaluation

Limitations of k-means

• k-medoids


Page 25

k-means clustering

• Number of clusters, K, must be specified

• Each cluster is associated with an averaged point (centroid)

• Each point is assigned to the cluster with the closest centroid

• The basic algorithm is very simple
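As a rough illustration of how simple the basic algorithm is, here is a minimal NumPy sketch of Lloyd-style k-means with random initialization (all names are mine, not the course's code; closeness is Euclidean here):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means (Lloyd's algorithm) on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid to the mean of the points assigned to it
        new_centroids = centroids.copy()
        for m in range(k):
            members = X[labels == m]
            if len(members) > 0:        # keep the old centroid if a cluster empties
                new_centroids[m] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids
```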


Page 26

k-means clustering

• Example:

[Figure: k-means on 2-D points with K = 2, shown over five iterations]

1. Arbitrarily choose K objects as the initial cluster centers
2. Assign each object to the most similar center
3. Update the cluster means
4. Reassign
5. Update the cluster means again; repeat until no reassignment occurs


Page 27

K-means Clustering – Details

• Initial centroids are often chosen randomly. Clusters produced vary from one run to another.

• The centroid is (typically) the mean of the points in the cluster.

• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.

• K-means will converge for common similarity measures mentioned above.

• Most of the convergence happens in the first few iterations. Often the stopping condition is changed to 'until relatively few points change clusters'.

• Complexity is O(n · K · t · d), where n = number of points, K = number of clusters, t = number of iterations, d = number of attributes.

Page 28

Partitioning Algorithms

• k-means Algorithm

Issue of initial centroids, clustering evaluation

Limitations of k-means

• k-medoids


Page 29

Importance of Choosing Initial Centroid

Clusters produced vary from one run to another.

[Figure: original points and two k-means clusterings produced by different initial centroids (Run 1 vs. Run 2, shown at iteration 5)]

Page 30

Evaluating K-means Clustering

The most common measure is the Sum of Squared Error (SSE):

$$SSE(C, k) = \sum_{m=1}^{k} \sum_{x_i \in C_m} (x_i - \mu_m)^2$$

For each point, the error is the distance to the nearest centroid (the error of representing each point by its nearest centroid).

Given two clustering results, we can choose the one with the smaller error.

The SSE of the optimal clustering result decreases as K, the number of clusters, increases. A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.

Note: if $SSE(C_1, k_1) \le SSE(C_2, k_2)$ and $k_1 < k_2$, then C1 is better than C2.
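A direct NumPy translation of this measure, following the conventions of the k-means sketch above (illustrative only):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared Euclidean distance of each point
    to the centroid of its assigned cluster."""
    errors = X - centroids[labels]     # per-point error vectors
    return float(np.sum(errors ** 2))
```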


Page 31

Importance of Choosing Initial Centroid

Clusters produced vary from one run to another.

[Figure: original points and two k-means clusterings from different initial centroids (Run 1 vs. Run 2, iteration 5)]

SSE(Run 1) < SSE(Run 2)

Page 32

Solutions to Initial Centroids Problem

• Multiple runs: select the one with the smallest SSE (see the sketch after this list)

• Sample and use hierarchical clustering to determine initial centroids
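A sketch of the multiple-runs strategy, reusing the hypothetical kmeans and sse helpers from the earlier sketches:

```python
def best_of_runs(X, k, n_runs=10):
    """Run k-means n_runs times from different random initializations
    and keep the result with the smallest SSE."""
    best_err, best = float("inf"), None
    for seed in range(n_runs):
        labels, centroids = kmeans(X, k, seed=seed)
        err = sse(X, labels, centroids)
        if err < best_err:
            best_err, best = err, (labels, centroids)
    return best_err, best
```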


Page 33

Comments on the K-Means Method

Strength: relatively efficient: O(nktd), where n is # objects, k is # clusters, t is # iterations, and d is # dimensions. Normally, k, t, d ≪ n.

Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))

Comment: often terminates at a local optimum.

Weakness:
• Applicable only when the mean is defined; what about categorical data?
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with differing sizes, differing densities, or non-convex shapes

Page 34

Partitioning Algorithms

• k-means Algorithm

Issue of initial centroids, clustering evaluation

Limitations of k-means

• k-medoids


Page 35

Limitations of K-means: Differing Sizes

[Figure: original points vs. the k-means result (3 clusters)]


Page 36

Limitations of K-means: Differing Density

[Figure: original points vs. the k-means result (3 clusters)]


Page 37

Limitations of K-means: Non-convex Shapes

[Figure: original points vs. the k-means result (2 clusters)]


Page 38

Overcoming K-means Limitations

[Figure: original points vs. k-means with many clusters]

One solution is to use many clusters: find parts of clusters, then put them together.


Page 39

[Figure: original points vs. k-means with many clusters]

Overcoming K-means Limitations

One solution is to use many clusters: find parts of clusters, then put them together.


Page 40

[Figure: original points vs. k-means with many clusters]

One solution is to use many clusters: find parts of clusters, then put them together.

Overcoming K-means Limitations


Page 41

K-medoids, instead of K-means

• The k-means algorithm is sensitive to outliers! An object with an extremely large value may substantially distort the distribution of the data.

• The mean cannot be computed in some cases, e.g., when only similarities among objects are available.

• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster.
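A minimal sketch of selecting a medoid when only pairwise distances are available (it assumes a precomputed distance matrix D; the helper name is mine):

```python
import numpy as np

def medoid(D, members):
    """Most centrally located object among `members` (row/column indices
    into a precomputed pairwise distance matrix D); no mean is needed."""
    members = np.asarray(members)
    within = D[np.ix_(members, members)]          # distances inside the cluster
    return members[within.sum(axis=1).argmin()]   # minimizes total distance to the rest
```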

[Figure: 2-D example points illustrating the mean vs. the medoid as the cluster representative]


Page 42

Partitioning Algorithms

• k-means Algorithm

Issue of initial centroids, clustering evaluation

Limitations of k-means

• k-medoids: PAM, CLARA, CLARANS

Page 43

The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters

• k-medoids uses the same strategy as k-means

• PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.

PAM works effectively for small data sets, but does not scale well for large data sets.

• CLARA (Kaufmann & Rousseeuw, 1990)

• CLARANS (Ng & Han, 1994): Randomized sampling


Page 44

A Typical K-Medoids Algorithm (PAM)

K = 2. Total cost({m_i}) = 20.

1. Arbitrarily choose k objects as initial medoids (m_i, i = 1..k)
2. Assign each remaining object to the nearest medoid
3. Randomly select a non-medoid object O_j
4. Compute the total cost of each possible new set of medoids (O_j, {m_i}, i ≠ t); here Total cost = 18
5. Swap m_t and O_j if cost({O_j, {m_i, i ≠ t}}) is the smallest and cost({O_j, {m_i, i ≠ t}}) < cost({m_i})
6. Loop until no change

[Figure: four snapshots of PAM iterations on 2-D points]

Page 45

PAM (Partitioning Around Medoids) (1987)

PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1987) uses real objects to represent the clusters:

1. Select k representative objects (medoids) arbitrarily

2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih = total_cost(replace i by h) − total_cost(no replace)

3. If min(TC_ih) < 0, i is replaced by h; then assign each non-selected object to the most similar representative object

4. Repeat steps 2-3 until there is no change


O(k(n−k)²) for each iteration, where n is # of data points and k is # of clusters
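A compact, illustrative sketch of this swap loop on a precomputed distance matrix (not the original PAM code; cost here is the total distance of objects to their nearest medoid):

```python
import numpy as np

def total_cost(D, medoids):
    """Total distance of every object to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """Naive Partitioning Around Medoids on a pairwise distance matrix D."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))   # step 1: arbitrary medoids
    improved = True
    while improved:                                        # step 4: repeat until no change
        improved = False
        for t in range(k):                                 # selected object i = medoids[t]
            for h in range(n):                             # each non-selected object h
                if h in medoids:
                    continue
                candidate = medoids[:t] + [h] + medoids[t + 1:]
                # steps 2-3: swap if TC_ih = cost(candidate) - cost(current) < 0
                if total_cost(D, candidate) < total_cost(D, medoids):
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)                  # assign to nearest medoid
    return medoids, labels
```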


Page 46

CLARA (Clustering Large Applications) (1990)

CLARA (Clustering LARge Applications, Kaufmann and Rousseeuw)

Sampling-based method: it draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output (minimizing cost/SSE).

Strength: deals with larger data sets than PAM.

Weakness: efficiency depends on the sample size. A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased.
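A sketch of CLARA's sampling loop, reusing the hypothetical pam and total_cost helpers from the previous sketch (the sample size and number of samples are illustrative):

```python
import numpy as np

def clara(D, k, n_samples=5, sample_size=40, seed=0):
    """Apply PAM to several random samples; keep the medoid set that is
    cheapest on the WHOLE data set."""
    rng = np.random.default_rng(seed)
    n = len(D)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sub_medoids, _ = pam(D[np.ix_(idx, idx)], k)       # cluster the sample only
        medoids = [int(idx[m]) for m in sub_medoids]       # map back to original indices
        cost = total_cost(D, medoids)                      # evaluate on all n objects
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```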


Page 47

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search, Ng and Han’94)

The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids

• PAM: checks every neighbor

• CLARA: examines fewer neighbors, searches in subgraphs built from samples

• CLARANS: searches the whole graph but draws sample of neighbors dynamically


[Figure: the search graph; each node is a set of k medoids]

Each node: k medoids, which correspond to a clustering.

$$\#\text{nodes} = \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

Two nodes are connected as neighbors if their medoid sets differ by only one item, so each node has k(n − k) neighbors.
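An illustrative sketch of the CLARANS search over this graph, reusing the hypothetical total_cost helper (numlocal and maxneighbor follow the paper's parameter names; the code itself is my own approximation):

```python
import numpy as np

def clarans(D, k, numlocal=2, maxneighbor=100, seed=0):
    """Randomized search over the graph whose nodes are k-medoid sets."""
    rng = np.random.default_rng(seed)
    n = len(D)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):                     # restart from a random node
        current = list(rng.choice(n, size=k, replace=False))
        cost = total_cost(D, current)
        tried = 0
        while tried < maxneighbor:                # sample neighbors dynamically
            t = int(rng.integers(k))              # replace one medoid ...
            h = int(rng.integers(n))              # ... by one random non-medoid
            if h in current:
                continue
            neighbor = current[:t] + [h] + current[t + 1:]
            neighbor_cost = total_cost(D, neighbor)
            if neighbor_cost < cost:              # move to a cheaper neighbor, reset counter
                current, cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost
```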


Page 48

CLARANS (“Randomized” CLARA) (1994)

• CLARANS: searches the whole graph but draws sample of neighbors dynamically

• It is more efficient and scalable than both PAM and CLARA

• Focusing techniques and spatial access structures may further improve its performance (Ester et al.’95)


[Figure: the search graph of medoid sets]


Page 49

What you should know

• What is clustering?

• What is partitioning clustering method?

• How does k-means work?

• The limitation of k-means

• How does k-medoids work?

• How to solve the scalability problem of k-medoids?
