
Page 1

Iterative Optimization and Simplification of Hierarchical Clusterings

Doug Fisher, Department of Computer Science, Vanderbilt University

Journal of Artificial Intelligence Research, 4 (1996) 147-179

Presented by: Biyu Liang ('06), Paul Haake ('07)

Page 2

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 3

Introduction

Overview of the method:
• Construct an initial clustering inexpensively
• Iteratively optimize the clustering using some control strategy
• Simplify the clustering

Goals:
• Find high-quality clusterings without overfitting
• Good CPU efficiency

Page 4

Introduction (continued)

Properties of any clustering algorithm:
• objective function: evaluates the quality of a particular clustering on a set of data.
• control strategy: specifies how the algorithm searches the space of all possible clusterings, given some objective function.

In this paper, the authors compare different control strategies using the same objective function.
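To make the separation concrete, here is a minimal Python sketch of the two pluggable pieces. The names (Record, ObjectiveFn, ControlStrategy, cluster) are illustrative, not anything defined in the paper:

```python
from typing import Callable, Dict, List

# A record maps attribute names to (categorical) values;
# a clustering is a list of clusters, each a list of records.
Record = Dict[str, str]
Clustering = List[List[Record]]

# Objective function: scores how good a particular clustering is.
ObjectiveFn = Callable[[Clustering], float]

# Control strategy: given the data and an objective, searches the space
# of possible clusterings and returns the best one it finds.
ControlStrategy = Callable[[List[Record], ObjectiveFn], Clustering]

def cluster(data: List[Record],
            objective: ObjectiveFn,
            control: ControlStrategy) -> Clustering:
    """A clustering algorithm = a control strategy driven by an objective."""
    return control(data, objective)
```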

Page 5

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Experiments
• Simplification of Hierarchical Clustering
• Conclusion

Page 6

Hierarchical Sorting

Greedy algorithm to quickly build an initial rough clustering.

All three control strategies (discussed later) begin with the clustering generated by hierarchical sorting, and then improve it by rearranging its records.

Page 7

Hierarchical Sorting

Category utility of a cluster C_k:

CU(C_k) = P(C_k) \sum_i \sum_j [ P(A_i = V_{ij} | C_k)^2 - P(A_i = V_{ij})^2 ]

• Clusters whose data records have similar attribute values have a higher CU score.
• Objective function: the "partition utility" (PU), the average CU value over all K clusters of the partition, PU = (1/K) \sum_k CU(C_k). (See the sketch below.)
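As a reference for the formulas above, here is a minimal Python sketch of CU and PU for records with categorical attributes. The names (Record, category_utility, partition_utility) are ours, and probabilities are estimated by simple relative frequencies:

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]   # attribute name -> categorical value

def _value_probs(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Relative-frequency estimates of P(A_i = V_ij) within a set of records."""
    counts: Dict[str, Counter] = {}
    for r in records:
        for attr, val in r.items():
            counts.setdefault(attr, Counter())[val] += 1
    n = len(records)
    return {attr: {v: c / n for v, c in cnt.items()} for attr, cnt in counts.items()}

def category_utility(cluster: List[Record], data: List[Record]) -> float:
    """CU(C_k) = P(C_k) * sum_i sum_j [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2]."""
    p_ck = len(cluster) / len(data)
    within = _value_probs(cluster)
    overall = _value_probs(data)
    total = 0.0
    for attr, probs in overall.items():
        for val, p in probs.items():
            p_cond = within.get(attr, {}).get(val, 0.0)
            total += p_cond ** 2 - p ** 2
    return p_ck * total

def partition_utility(clusters: List[List[Record]]) -> float:
    """PU = average CU over the clusters of a (first-level) partition."""
    data = [r for c in clusters for r in c]
    return sum(category_utility(c, data) for c in clusters) / len(clusters)
```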

Page 8

Hierarchical Sorting

• Start with an empty clustering and add each data record one at a time.
• For each record being added, there are two choices:
   – Place the record in some existing cluster in the hierarchy
   – Place the record in a new cluster
• Select the option that yields the highest quality score (PU). (See the sketch below.)
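Below is a minimal sketch of this greedy placement, shown for a flat (first-level) partition rather than the full hierarchy the paper sorts into; the objective argument is meant to be something like partition_utility from the earlier sketch, and greedy_sort is an illustrative name:

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]
ObjectiveFn = Callable[[Clustering], float]   # e.g. partition_utility above

def greedy_sort(data: List[Record], objective: ObjectiveFn) -> Clustering:
    """Add records one at a time, placing each wherever the objective is highest."""
    clustering: Clustering = []
    for record in data:
        best_score, best_choice = float("-inf"), None
        # Choice 1: place the record in each existing cluster in turn.
        for i in range(len(clustering)):
            candidate = [list(c) for c in clustering]
            candidate[i].append(record)
            score = objective(candidate)
            if score > best_score:
                best_score, best_choice = score, candidate
        # Choice 2: place the record in a brand-new cluster.
        candidate = [list(c) for c in clustering] + [[record]]
        score = objective(candidate)
        if score > best_score:
            best_score, best_choice = score, candidate
        clustering = best_choice
    return clustering
```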

Page 9

Page 10

Page 11

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 12

Iterative Optimization Methods

Important note: the primary goal of clustering in this paper is to obtain a single-level partitioning of optimal quality. Hierarchical clustering is used only as an intermediate means toward that end. To evaluate the quality of a solution, the authors therefore apply the objective function only to the first-level partition.

Page 13

Iterative Optimization Methods

• Reorder-resort (CLUSTER/2): very similar to k-means
• Iterative redistribution of single observations: reassign each record to a better cluster
• Iterative hierarchical redistribution: reassign each record or subtree of records to a better cluster

Page 14

Reorder-resort (k-means)

• k random seeds are selected, and k clusters are grown around these attractors.
• The centroids of the clusters are picked as the new seeds.
• The process iterates until there is no further improvement in the quality of the generated clustering. (See the sketch below.)
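The sketch below illustrates the k-means-like loop for categorical records, using Hamming distance and per-attribute modal "centroids" (a k-modes-style analogy chosen for concreteness). It is not CLUSTER/2's exact procedure, and it stops when the seeds stabilize rather than when the PU score stops improving; reorder_resort, hamming, and mode_centroid are our names:

```python
import random
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

def hamming(a: Record, b: Record) -> int:
    """Number of attributes on which two records disagree."""
    return sum(1 for attr in a if a[attr] != b.get(attr))

def mode_centroid(cluster: List[Record]) -> Record:
    """Per-attribute most frequent value: a 'centroid' for categorical data."""
    return {attr: Counter(r[attr] for r in cluster).most_common(1)[0][0]
            for attr in cluster[0]}

def reorder_resort(data: List[Record], k: int, max_iters: int = 100) -> List[List[Record]]:
    seeds = random.sample(data, k)                       # k random seed records
    clusters: List[List[Record]] = []
    for _ in range(max_iters):
        # Grow k clusters around the current seeds ("attractors").
        clusters = [[] for _ in range(k)]
        for r in data:
            nearest = min(range(k), key=lambda i: hamming(r, seeds[i]))
            clusters[nearest].append(r)
        # Pick the centroids of the clusters as the new seeds.
        new_seeds = [mode_centroid(c) if c else seeds[i] for i, c in enumerate(clusters)]
        if new_seeds == seeds:                           # stable seeds -> stop
            break
        seeds = new_seeds
    return clusters
```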

Page 15

Reorder-resort (k-means), continued

• Ordering the data so that consecutive observations are dissimilar leads to good clusterings.
• Extract a "dissimilarity" ordering from the hierarchical sorting: consecutive records will tend to be dissimilar. (See the sketch below.)
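One plausible way to extract such an ordering from the top level of the sorted hierarchy is to interleave records round-robin across the top-level clusters, so that consecutive records come from different clusters. The paper's exact extraction may differ; dissimilarity_order is an illustrative name:

```python
from itertools import chain, zip_longest
from typing import Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]

def dissimilarity_order(clustering: Clustering) -> List[Record]:
    """Interleave records round-robin across clusters, so consecutive records
    tend to come from different (and hence dissimilar) clusters."""
    rounds = zip_longest(*clustering)          # one record per cluster per round
    return [r for r in chain.from_iterable(rounds) if r is not None]
```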

Page 16

Iterative Redistribution of Single Observations

Repeat until the clustering doesn't change:
• For every record, remove it from the clustering and re-sort it, beginning at the root. (See the sketch below.)
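A sketch of this loop, again shown on a flat partition for brevity (the paper re-sorts each observation through the hierarchy, beginning at the root). The objective argument is meant to be something like partition_utility from the earlier sketch, and only strictly improving moves are accepted so that the loop terminates:

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]
ObjectiveFn = Callable[[Clustering], float]   # e.g. partition_utility from the CU/PU sketch

def redistribute_single(clustering: Clustering, objective: ObjectiveFn) -> Clustering:
    """Repeatedly remove each record and re-place it wherever the objective is
    highest; stop when a full pass over the records changes nothing."""
    while True:
        changed = False
        for record in [r for c in clustering for r in c]:
            # Remove the record (dropping any cluster this empties) ...
            without = [[r for r in c if r is not record] for c in clustering]
            without = [c for c in without if c]
            # ... and re-sort it: every existing cluster plus a new singleton cluster.
            candidates = []
            for i in range(len(without)):
                cand = [list(c) for c in without]
                cand[i].append(record)
                candidates.append(cand)
            candidates.append([list(c) for c in without] + [[record]])
            best = max(candidates, key=objective)
            if objective(best) > objective(clustering):
                clustering = best
                changed = True
        if not changed:
            return clustering
```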

Page 17

Iterative Hierarchical Redistribution

Problem: The last control strategy resorts only one record at a time.

Solution: Resort entire subtrees of records at a time.

Page 18

Iterative Hierarchical Redistribution

Hierarchical-Redistribute-Recurse(SiblingSet):
    Repeat until two consecutive clusterings have the same set of siblings:
        For each sibling in SiblingSet:
            Remove the sibling (with its subtree) from the hierarchy and re-sort it
    SiblingSet ← remaining siblings
    For each sibling S in SiblingSet:
        call Hierarchical-Redistribute-Recurse(S.children)

Main loop:
    Repeat until the clustering converges:
        Clustering ← Hierarchical-Redistribute-Recurse(Clustering.root.children)

Page 19

Page 20

Main findings from the experiments

• Hierarchical redistribution achieves the highest mean PU scores in most cases.
• Reordering and re-clustering comes closest to hierarchical redistribution's performance in all cases.
• Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods.

Page 21

Outline

• Introduction
• Generating Initial Hierarchical Clustering
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 22

Simplifying Hierarchical Clustering

Higher levels of the hierarchy are meaningful, but lower levels are subject to overfitting.

Solution: post-process the hierarchy with validation and pruning.

Page 23

Validation

Strategy: Find internal nodes that are most predictive on unseen data (a testing set).

What does “predictive” mean in this case? When a data record is classified into a cluster, we want to know how accurately that cluster, in turn, can predict the data record's attribute values.

In a high-quality clustering, we expect that an unseen data record, classified into some cluster, will have attribute values similar to the attribute values of other data records in the cluster.

Page 24

Validation

For each variable A_i:
   For each data record in the test set:
   • Classify the record down the cluster hierarchy, beginning at the root and ignoring the value of A_i.
   • At each node along the path, compare the record's A_i value to the node's expected A_i value; keep a counter of correct predictions for each variable at each node.

(See the sketch below.)
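A sketch of this counting pass, under two simplifying assumptions of ours: each node stores plain per-attribute value counts, and a record descends to the child whose counts best match its remaining attribute values (the paper classifies by category utility). Node, predict, match_score, and validate are illustrative names:

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

Record = Dict[str, str]

class Node:
    """A cluster node in the hierarchy: per-attribute value counts plus children."""
    def __init__(self, records: List[Record], children: Optional[List["Node"]] = None):
        self.children = children or []
        self.counts: Dict[str, Counter] = defaultdict(Counter)
        for r in records:
            for attr, val in r.items():
                self.counts[attr][val] += 1
        # correct[attr] = how many test records' attr values this node predicted correctly
        self.correct: Counter = Counter()

    def predict(self, attr: str) -> Optional[str]:
        """The node's expected (most frequent) value for an attribute."""
        most = self.counts[attr].most_common(1)
        return most[0][0] if most else None

    def match_score(self, record: Record, ignore: str) -> int:
        """How strongly the record's other attribute values match this node."""
        return sum(self.counts[a][v] for a, v in record.items() if a != ignore)

def validate(root: Node, test_set: List[Record]) -> None:
    """For each variable, classify each test record while hiding that variable,
    and count correct predictions of it at every node along the path."""
    for attr in {a for r in test_set for a in r}:
        for record in test_set:
            node = root
            while True:
                if attr in record and node.predict(attr) == record[attr]:
                    node.correct[attr] += 1
                if not node.children:
                    break
                node = max(node.children, key=lambda c: c.match_score(record, attr))
```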

Page 25

Validation

After processing all variables, for each variable, identify a “frontier” in the hierarchy such that the number of correct predictions of that variable is maximized.

If a node lies below the frontier of every variable, then it is pruned.
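A sketch of the frontier search and pruning, assuming each node already carries the per-variable correct-prediction counts from the previous sketch. The recursion keeps the frontier at a node whenever predicting the variable there is at least as good as the combined best frontiers of its children; Node, frontier_keep, and prune are our names:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Node:
    """Minimal view of a cluster node for pruning: its children plus, per
    variable, how many test records it predicted correctly (previous sketch)."""
    correct: Dict[str, int] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

def frontier_keep(node: Node, attr: str) -> Tuple[int, Set[int]]:
    """Return (max correct predictions of attr in this subtree,
    ids of the nodes at or above the frontier that achieves it)."""
    own = node.correct.get(attr, 0)
    if not node.children:
        return own, {id(node)}
    child_results = [frontier_keep(c, attr) for c in node.children]
    below_score = sum(score for score, _ in child_results)
    if own >= below_score:
        # Predicting attr at this node is at least as good: the frontier stops here.
        return own, {id(node)}
    kept = {id(node)}
    for _, child_kept in child_results:
        kept |= child_kept
    return below_score, kept

def prune(root: Node, variables: List[str]) -> None:
    """Remove every node that lies below the frontier of *every* variable."""
    kept: Set[int] = set()
    for attr in variables:
        kept |= frontier_keep(root, attr)[1]
    def trim(node: Node) -> None:
        node.children = [c for c in node.children if id(c) in kept]
        for c in node.children:
            trim(c)
    trim(root)
```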

Page 26

Page 27

Validation

The authors' experiments show that their validation method substantially reduces clustering size without diminishing predictive accuracy.

Page 28

Concluding Remarks

There are three phases in searching the space of hierarchical clusterings:
• Inexpensive generation of an initial clustering
• Iterative optimization of the clustering
• Post-processing simplification of the generated clustering

Experiments found that the new method, hierarchical redistribution optimization, beats the other iterative optimization methods in most cases.

Page 29

Final Exam Question #1

The main idea in this paper is to construct clusterings which satisfy two conditions.

Name the conditions:
• Consistently produces high-quality clusterings
• Computationally inexpensive

Name the two steps used to satisfy the conditions:
• Generate a tentative clustering inexpensively, using hierarchical sorting
• Iteratively optimize that initial clustering

Page 30

Final Exam Question #2

Describe the three iterative methods for clustering optimization:

• Seed selection, reordering, and re-clustering (p. 14-15)
• Iterative redistribution of single observations (p. 16)
• Iterative hierarchical redistribution (p. 17-19)

Page 31

Final Exam Question #3

The cluster is better when the relative CU score is a) big, b) small, c) equal to 0

Which sorting method is better? a) random sorting, b) similarity sorting

Page 32

Thanks! Questions?