
Page 1

Iterative Optimization and Simplification of Hierarchical Clusterings

Doug Fisher, Department of Computer Science, Vanderbilt University

Journal of Artificial Intelligence Research, 4 (1996) 147-179

Presented by: Biyu Liang ('06), Paul Haake ('07)

Page 2

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 3

Introduction

Overview of the method:
• Construct an initial clustering inexpensively
• Iteratively optimize the clustering using some control strategy
• Simplify the clustering

Goals:
• Find high-quality clusterings without overfitting
• Good CPU efficiency

Page 4

Introduction (continued)

Properties of any clustering algorithm:
• objective function: evaluates the quality of a particular clustering on a set of data.
• control strategy: specifies how the algorithm searches the space of all possible clusterings, given some objective function.

In this paper, the authors compare different control strategies using the same objective function.
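To make the separation concrete, here is a minimal Python sketch of the two pluggable pieces. The names (Record, ObjectiveFn, ControlStrategy, cluster) are illustrative, not anything defined in the paper:

```python
from typing import Callable, Dict, List

# A record maps attribute names to (categorical) values;
# a clustering is a list of clusters, each a list of records.
Record = Dict[str, str]
Clustering = List[List[Record]]

# Objective function: scores how good a particular clustering is.
ObjectiveFn = Callable[[Clustering], float]

# Control strategy: given the data and an objective, searches the space
# of possible clusterings and returns the best one it finds.
ControlStrategy = Callable[[List[Record], ObjectiveFn], Clustering]

def cluster(data: List[Record],
            objective: ObjectiveFn,
            control: ControlStrategy) -> Clustering:
    """A clustering algorithm = a control strategy driven by an objective."""
    return control(data, objective)
```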

Page 5

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Experiments
• Simplification of Hierarchical Clustering
• Conclusion

Page 6

Hierarchical Sorting

Greedy algorithm to quickly build an initial rough clustering.

All three control strategies (discussed later) begin with the clustering generated by hierarchical sorting, and then improve it by rearranging its records.

Page 7

Hierarchical Sorting

Category utility of a cluster C_k:

CU(C_k) = P(C_k) \sum_i \sum_j [ P(A_i = V_{ij} | C_k)^2 - P(A_i = V_{ij})^2 ]

• Clusters whose data records have similar attribute values have a higher CU score.
• Objective function: the "partition utility" (PU), the average CU value over all K clusters of the partition, PU = (1/K) \sum_k CU(C_k). (See the sketch below.)
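As a reference for the formulas above, here is a minimal Python sketch of CU and PU for records with categorical attributes. The names (Record, category_utility, partition_utility) are ours, and probabilities are estimated by simple relative frequencies:

```python
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]   # attribute name -> categorical value

def _value_probs(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Relative-frequency estimates of P(A_i = V_ij) within a set of records."""
    counts: Dict[str, Counter] = {}
    for r in records:
        for attr, val in r.items():
            counts.setdefault(attr, Counter())[val] += 1
    n = len(records)
    return {attr: {v: c / n for v, c in cnt.items()} for attr, cnt in counts.items()}

def category_utility(cluster: List[Record], data: List[Record]) -> float:
    """CU(C_k) = P(C_k) * sum_i sum_j [P(A_i=V_ij | C_k)^2 - P(A_i=V_ij)^2]."""
    p_ck = len(cluster) / len(data)
    within = _value_probs(cluster)
    overall = _value_probs(data)
    total = 0.0
    for attr, probs in overall.items():
        for val, p in probs.items():
            p_cond = within.get(attr, {}).get(val, 0.0)
            total += p_cond ** 2 - p ** 2
    return p_ck * total

def partition_utility(clusters: List[List[Record]]) -> float:
    """PU = average CU over the clusters of a (first-level) partition."""
    data = [r for c in clusters for r in c]
    return sum(category_utility(c, data) for c in clusters) / len(clusters)
```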

Page 8

Hierarchical Sorting

• Start with an empty clustering and add each data record one at a time.
• For each record being added, there are two choices:
   – Place the record in some existing cluster in the hierarchy
   – Place the record in a new cluster
• Select the option that yields the highest quality score (PU). (See the sketch below.)
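Below is a minimal sketch of this greedy placement, shown for a flat (first-level) partition rather than the full hierarchy the paper sorts into; the objective argument is meant to be something like partition_utility from the earlier sketch, and greedy_sort is an illustrative name:

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]
ObjectiveFn = Callable[[Clustering], float]   # e.g. partition_utility above

def greedy_sort(data: List[Record], objective: ObjectiveFn) -> Clustering:
    """Add records one at a time, placing each wherever the objective is highest."""
    clustering: Clustering = []
    for record in data:
        best_score, best_choice = float("-inf"), None
        # Choice 1: place the record in each existing cluster in turn.
        for i in range(len(clustering)):
            candidate = [list(c) for c in clustering]
            candidate[i].append(record)
            score = objective(candidate)
            if score > best_score:
                best_score, best_choice = score, candidate
        # Choice 2: place the record in a brand-new cluster.
        candidate = [list(c) for c in clustering] + [[record]]
        score = objective(candidate)
        if score > best_score:
            best_score, best_choice = score, candidate
        clustering = best_choice
    return clustering
```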

Page 9

Page 10

Page 11

Outline

• Introduction
• Fast but Rough Clustering: Hierarchical Sorting
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 12

Iterative Optimization Methods

Important note: the primary goal of clustering in this paper is to obtain a single-level partitioning of optimal quality. Hierarchical clustering is used only as an intermediate means toward that end. To evaluate the quality of a solution, the authors therefore apply the objective function only to the first-level partition.

Page 13

Iterative Optimization Methods

• Reorder-resort (CLUSTER/2): very similar to k-means
• Iterative redistribution of single observations: reassign each record to a better cluster
• Iterative hierarchical redistribution: reassign each record or subtree of records to a better cluster

Page 14

Reorder-resort (k-means)

• k random seeds are selected, and k clusters are grown around these attractors.
• The centroids of the clusters are picked as the new seeds.
• The process iterates until there is no further improvement in the quality of the generated clustering. (See the sketch below.)
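The sketch below illustrates the k-means-like loop for categorical records, using Hamming distance and per-attribute modal "centroids" (a k-modes-style analogy chosen for concreteness). It is not CLUSTER/2's exact procedure, and it stops when the seeds stabilize rather than when the PU score stops improving; reorder_resort, hamming, and mode_centroid are our names:

```python
import random
from collections import Counter
from typing import Dict, List

Record = Dict[str, str]

def hamming(a: Record, b: Record) -> int:
    """Number of attributes on which two records disagree."""
    return sum(1 for attr in a if a[attr] != b.get(attr))

def mode_centroid(cluster: List[Record]) -> Record:
    """Per-attribute most frequent value: a 'centroid' for categorical data."""
    return {attr: Counter(r[attr] for r in cluster).most_common(1)[0][0]
            for attr in cluster[0]}

def reorder_resort(data: List[Record], k: int, max_iters: int = 100) -> List[List[Record]]:
    seeds = random.sample(data, k)                       # k random seed records
    clusters: List[List[Record]] = []
    for _ in range(max_iters):
        # Grow k clusters around the current seeds ("attractors").
        clusters = [[] for _ in range(k)]
        for r in data:
            nearest = min(range(k), key=lambda i: hamming(r, seeds[i]))
            clusters[nearest].append(r)
        # Pick the centroids of the clusters as the new seeds.
        new_seeds = [mode_centroid(c) if c else seeds[i] for i, c in enumerate(clusters)]
        if new_seeds == seeds:                           # stable seeds -> stop
            break
        seeds = new_seeds
    return clusters
```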

Page 15

Reorder-resort (k-means), continued

• Ordering the data so that consecutive observations are dissimilar leads to good clusterings.
• Extract a "dissimilarity" ordering from the hierarchical sorting: consecutive records will tend to be dissimilar. (See the sketch below.)
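One plausible way to extract such an ordering from the top level of the sorted hierarchy is to interleave records round-robin across the top-level clusters, so that consecutive records come from different clusters. The paper's exact extraction may differ; dissimilarity_order is an illustrative name:

```python
from itertools import chain, zip_longest
from typing import Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]

def dissimilarity_order(clustering: Clustering) -> List[Record]:
    """Interleave records round-robin across clusters, so consecutive records
    tend to come from different (and hence dissimilar) clusters."""
    rounds = zip_longest(*clustering)          # one record per cluster per round
    return [r for r in chain.from_iterable(rounds) if r is not None]
```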

Page 16

Iterative Redistribution of Single Observations

Repeat until the clustering doesn't change:
• For every record, remove it from the clustering and re-sort it, beginning at the root. (See the sketch below.)
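A sketch of this loop, again shown on a flat partition for brevity (the paper re-sorts each observation through the hierarchy, beginning at the root). The objective argument is meant to be something like partition_utility from the earlier sketch, and only strictly improving moves are accepted so that the loop terminates:

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Clustering = List[List[Record]]
ObjectiveFn = Callable[[Clustering], float]   # e.g. partition_utility from the CU/PU sketch

def redistribute_single(clustering: Clustering, objective: ObjectiveFn) -> Clustering:
    """Repeatedly remove each record and re-place it wherever the objective is
    highest; stop when a full pass over the records changes nothing."""
    while True:
        changed = False
        for record in [r for c in clustering for r in c]:
            # Remove the record (dropping any cluster this empties) ...
            without = [[r for r in c if r is not record] for c in clustering]
            without = [c for c in without if c]
            # ... and re-sort it: every existing cluster plus a new singleton cluster.
            candidates = []
            for i in range(len(without)):
                cand = [list(c) for c in without]
                cand[i].append(record)
                candidates.append(cand)
            candidates.append([list(c) for c in without] + [[record]])
            best = max(candidates, key=objective)
            if objective(best) > objective(clustering):
                clustering = best
                changed = True
        if not changed:
            return clustering
```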

Page 17

Iterative Hierarchical Redistribution

Problem: The last control strategy resorts only one record at a time.

Solution: Resort entire subtrees of records at a time.

Page 18

Iterative Hierarchical Redistribution

Hierarchical-Redistribute-Recurse(SiblingSet):
    Repeat until two consecutive clusterings have the same set of siblings:
        For each sibling in SiblingSet:
            Remove the sibling (with its subtree) from the hierarchy and re-sort it
    SiblingSet ← remaining siblings
    For each sibling S in SiblingSet:
        call Hierarchical-Redistribute-Recurse(S.children)

Main loop:
    Repeat until the clustering converges:
        Clustering ← Hierarchical-Redistribute-Recurse(Clustering.root.children)

Page 19

Page 20

Main findings from the experiments

• Hierarchical redistribution achieves the highest mean PU scores in most cases.
• Reordering and re-clustering comes closest to hierarchical redistribution's performance in all cases.
• Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods.

Page 21

Outline

• Introduction
• Generating Initial Hierarchical Clustering
• Iterative Optimization Methods and Comparison
• Simplification of Hierarchical Clustering
• Conclusion

Page 22

Simplifying Hierarchical Clustering

Higher levels of the hierarchy are meaningful, but lower levels are subject to overfitting.

Solution: post-process the hierarchy with validation and pruning.

Page 23

Validation

Strategy: Find internal nodes that are most predictive on unseen data (a testing set).

What does “predictive” mean in this case? When a data record is classified into a cluster, we want to know how accurately that cluster, in turn, can predict the data record's attribute values.

In a high-quality clustering, we expect that an unseen data record, classified into some cluster, will have attribute values similar to the attribute values of other data records in the cluster.

Page 24

Validation

For each variable A_i:
   For each data record in the test set:
   • Classify the record down the cluster hierarchy, beginning at the root and ignoring the value of A_i.
   • At each node along the path, compare the record's A_i value to the node's expected A_i value; keep a counter of correct predictions for each variable at each node.

(See the sketch below.)
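A sketch of this counting pass, under two simplifying assumptions of ours: each node stores plain per-attribute value counts, and a record descends to the child whose counts best match its remaining attribute values (the paper classifies by category utility). Node, predict, match_score, and validate are illustrative names:

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional

Record = Dict[str, str]

class Node:
    """A cluster node in the hierarchy: per-attribute value counts plus children."""
    def __init__(self, records: List[Record], children: Optional[List["Node"]] = None):
        self.children = children or []
        self.counts: Dict[str, Counter] = defaultdict(Counter)
        for r in records:
            for attr, val in r.items():
                self.counts[attr][val] += 1
        # correct[attr] = how many test records' attr values this node predicted correctly
        self.correct: Counter = Counter()

    def predict(self, attr: str) -> Optional[str]:
        """The node's expected (most frequent) value for an attribute."""
        most = self.counts[attr].most_common(1)
        return most[0][0] if most else None

    def match_score(self, record: Record, ignore: str) -> int:
        """How strongly the record's other attribute values match this node."""
        return sum(self.counts[a][v] for a, v in record.items() if a != ignore)

def validate(root: Node, test_set: List[Record]) -> None:
    """For each variable, classify each test record while hiding that variable,
    and count correct predictions of it at every node along the path."""
    for attr in {a for r in test_set for a in r}:
        for record in test_set:
            node = root
            while True:
                if attr in record and node.predict(attr) == record[attr]:
                    node.correct[attr] += 1
                if not node.children:
                    break
                node = max(node.children, key=lambda c: c.match_score(record, attr))
```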

Page 25

Validation

After processing all variables, for each variable, identify a “frontier” in the hierarchy such that the number of correct predictions of that variable is maximized.

If a node lies below the frontier of every variable, then it is pruned.
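A sketch of the frontier search and pruning, assuming each node already carries the per-variable correct-prediction counts from the previous sketch. The recursion keeps the frontier at a node whenever predicting the variable there is at least as good as the combined best frontiers of its children; Node, frontier_keep, and prune are our names:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Node:
    """Minimal view of a cluster node for pruning: its children plus, per
    variable, how many test records it predicted correctly (previous sketch)."""
    correct: Dict[str, int] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

def frontier_keep(node: Node, attr: str) -> Tuple[int, Set[int]]:
    """Return (max correct predictions of attr in this subtree,
    ids of the nodes at or above the frontier that achieves it)."""
    own = node.correct.get(attr, 0)
    if not node.children:
        return own, {id(node)}
    child_results = [frontier_keep(c, attr) for c in node.children]
    below_score = sum(score for score, _ in child_results)
    if own >= below_score:
        # Predicting attr at this node is at least as good: the frontier stops here.
        return own, {id(node)}
    kept = {id(node)}
    for _, child_kept in child_results:
        kept |= child_kept
    return below_score, kept

def prune(root: Node, variables: List[str]) -> None:
    """Remove every node that lies below the frontier of *every* variable."""
    kept: Set[int] = set()
    for attr in variables:
        kept |= frontier_keep(root, attr)[1]
    def trim(node: Node) -> None:
        node.children = [c for c in node.children if id(c) in kept]
        for c in node.children:
            trim(c)
    trim(root)
```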

Page 26

Page 27

Validation

The authors' experiments show that their validation method substantially reduces clustering size without diminishing predictive accuracy.

Page 28

Concluding Remarks

There are three phases in searching the space of hierarchical clusterings:
• Inexpensive generation of an initial clustering
• Iterative optimization of the clustering
• Post-processing simplification of the generated clustering

Experiments found that the new method, hierarchical redistribution optimization, beats the other iterative optimization methods in most cases.

Page 29

Final Exam Question #1

The main idea in this paper is to construct clusterings which satisfy two conditions.

Name the conditions:
• Consistently produces high-quality clusterings
• Computationally inexpensive

Name the two steps used to satisfy the conditions:
• Generate a tentative clustering inexpensively, using hierarchical sorting
• Iteratively optimize that initial clustering

Page 30

Final Exam Question #2

Describe the three iterative methods for clustering optimization:

• Seed selection, reordering, and re-clustering (p. 14-15)
• Iterative redistribution of single observations (p. 16)
• Iterative hierarchical redistribution (p. 17-19)

Page 31

Final Exam Question #3

The cluster is better when the relative CU score is a) big, b) small, c) equal to 0

Which sorting method is better? a) random sorting, b) similarity sorting

Page 32

Thanks! Questions?