Clustering lecture 07/12/2015

Page 1: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Clustering lecture 07/12/2015

Page 2: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

The clustering task

• Cluster observations into groups so that the observations belonging in the same group are similar, whereas observations in different groups are different

• Basic questions:
  – What does “similar” mean?
  – What is a good partition of the objects? i.e., how is the quality of a solution measured?
  – How to find a good partition of the observations?

Page 3: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cluster Organization

• For a “small” number of documents, simple/flat clustering is acceptable
• Search a smaller set of clusters for relevance
• If a cluster is relevant, the documents in the cluster are also relevant
• Problem: looking for broader or more specific documents
• Hierarchical clustering has a tree-like structure

Page 4: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

4

Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset (simple/flat partitioning)
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

Flat algorithms
  – Usually start with a random (partial) partitioning and refine it iteratively
  – K-means clustering
  – Model-based clustering
Hierarchical algorithms
  – Bottom-up/agglomerative
  – (Top-down/divisive)

Page 5: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

5

Partitional Clustering

Original Points A Partitional Clustering

Page 6: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

6

Hierarchical Clustering

[Figure: points p1–p4 shown as nested hierarchical clusters and as the corresponding dendrogram]

Page 7: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cluster Parameters
• A minimum and maximum size of clusters
  – Large cluster size → one cluster attracting many documents → multi-topic themes
• A matching threshold value for including documents in a cluster
  – Minimum degree of similarity
  – Affects the number of clusters
  – High threshold → fewer documents can join a cluster → larger number of clusters
• The degree of overlap between clusters
  – Some documents deal with more than one topic
  – Low degree of overlap → greater separation of clusters

• A maximum number of clusters

Page 8: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

8

Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities

Page 9: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Aim of Clustering again?
• Partitioning data into classes with high intra-class similarity and low inter-class similarity
• Is it well-defined?

Page 10: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

What is Similarity?

• Clearly, a subjective or problem-dependent measure

Page 11: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Similar Are Clusters?
• Ex1: Two clusters or one cluster?

Page 12: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Similar Are Clusters?
• Ex2: A cluster or outliers?

Page 13: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity?

• Most cluster methods
  – use a matrix of similarity computations
  – compute similarities between documents

Page 14: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Matrix

         tiny   little  small   medium  large   huge
tiny     1.0    0.8     0.7     0.5     0.2     0.0
little          1.0     0.9     0.7     0.3     0.1
small                   1.0     0.7     0.3     0.2
medium                          1.0     0.5     0.3
large                                   1.0     0.8
huge                                            1.0

– Diagonal must be 1.0
– Monotonicity property must hold
– No linearity (value interpolation) assumed
– Qualitative transitive property must hold

Page 15: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Observations to cluster• Real-value attributes/variables

– e.g., salary, height

• Binary attributes

– e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes

– e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes

– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

• Variables of mixed types

– multiple attributes with various types

Page 16: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures

• For Ordinal Values
  – E.g., "small," "medium," "large," "X-large"
  – Convert to numerical values, assuming constant spacing, on a normalized [0,1] scale where max(v)=1, min(v)=0, and the others interpolate
  – E.g., "small"=0, "medium"=0.33, etc.
  – Then, use numerical similarity measures
  – Or, use a similarity matrix

Page 17: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures (cont.)

• For Nominal Values
  – E.g., "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  – Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
  – Use an underlying semantic property: e.g., Sim(Boston, LA) = dist(Boston, LA)^-1, or Sim(Boston, LA) = (|size(Boston) - size(LA)|) / Max(size(cities))
  – Or, use a similarity matrix

Page 18: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Defining Measures of Similarity for text documents

• Semantic similarity -- but this is hard.
• Statistical similarity: BOW (bag of words)
  – Number of words in common
    • count
  – Weighted words in common
    • words get additional weight based on inverse document frequency: two documents are more similar if they share an infrequent word
  – With the document as a vector:
    • Euclidean Distance
    • Cosine Similarity

Page 19: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Vector Space Model
• Each document, d, is considered to be a vector, d, in the term-space (the set of document “words”).
• Normally, very common words are stripped out completely and different forms of a word are reduced to one canonical form.
• In its simplest form, each document is represented by the (TF) vector d = (tf1, tf2, .., tfn), where tfi is the frequency of the ith term in the document.
• Weights may also be based on a term's inverse document frequency (IDF) in the document collection.
  – This discounts frequent words with little discriminating power.
• Finally, in order to account for documents of different lengths, each document vector is normalized so that it is of unit length.
• Given a set, S, of documents and their corresponding vector representations, we define the centroid vector c to be the vector obtained by averaging the weights of the various terms present in the documents of S.

Page 20: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Binary term-document incidence matrix
Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2

Page 21: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

• Term-document count matrices
• Consider the number of occurrences of a term in a document:
  – Each document is a count vector in ℕ^|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

Sec. 6.2

Page 22: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Bag of words model

• The vector representation doesn’t consider the ordering of words in a document
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
• This is called the bag of words model.
• How can we “recover” positional information?

Page 23: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing similarity of two documents. But how?

• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  – But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

Page 24: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Log-frequency weighting
• The log frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Finding the similarity of two documents d1 and d2: sum over the terms t appearing in both d1 and d2:

  score(d1, d2) = Σ_{t ∈ d1 ∩ d2} (1 + log10 tf_{t,d1}) (1 + log10 tf_{t,d2})

• The score is 0 if the two documents have no common terms.

Sec. 6.2
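A minimal Python sketch of the log-frequency weighting and the pairwise score above, assuming the product form of the overlap score reconstructed here; the function and variable names are illustrative only.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(doc1_tokens, doc2_tokens):
    """Sum, over the terms shared by both documents, the product of their log-tf weights."""
    tf1, tf2 = Counter(doc1_tokens), Counter(doc2_tokens)
    shared = tf1.keys() & tf2.keys()
    return sum(log_tf_weight(tf1[t]) * log_tf_weight(tf2[t]) for t in shared)

# The score is 0 when the documents share no terms.
print(overlap_score("john is quicker than mary".split(),
                    "mary is quicker than john".split()))
```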

Page 25: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document frequency

• Rare terms are more informative than frequent terms– Recall stop words

• Consider a term that is rare in the collection (e.g., arachnocentric)

• Two documents containing this term are very likely to be relevant

• → We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1

Page 26: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a term that is frequent in the collection (e.g., high, increase, line)
• Two documents containing such a term are more likely to be relevant than documents that don’t
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

Sec. 6.2.1

Page 27: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

idf weight

• df_t is the document frequency of t: the number of documents that contain t
  – df_t is an inverse measure of the informativeness of t
  – df_t ≤ N
• We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

  – We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.

It will turn out that the base of the log is immaterial.

Sec. 6.2.1

Page 28: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10(N / df_t). There is one idf value for each term t in a collection.

Sec. 6.2.1
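A small sketch that reproduces the idf column above; the `df` dictionary is just the table's document-frequency values, not a real collection.

```python
import math

N = 1_000_000  # documents in the collection
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df_t in df.items():
    # idf_t = log10(N / df_t); yields 6, 4, 3, 2, 1, 0 as in the table above
    print(f"{term:10s} idf = {math.log10(N / df_t):g}")
```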

Page 29: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Collection vs. Document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

• Example:

• Which word is a better search term (and should get a higher weight)?

Word Collection frequency Document frequency

insurance 10440 3997

try 10422 8760

Sec. 6.2.1

Page 30: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

• Best known weighting scheme in information retrieval– Note: the “-” in tf-idf is a hyphen, not a minus sign!– Alternative names: tf.idf, tf x idf

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec. 6.2.2
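A minimal sketch of the tf-idf weight defined above; the example numbers are illustrative, not taken from the lecture.

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 if the term does not occur."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a document and in 1,000 of 1,000,000 documents:
print(tf_idf(tf=10, df=1_000, N=1_000_000))  # (1 + 1) * 3 = 6.0
```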

Page 31: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0

Caesar 8.59 2.54 0 1.51 0.25 0

Calpurnia 0 1.54 0 0 0 0

Cleopatra 2.85 0 0 0 0 0

mercy 1.51 0 1.9 0.12 5.25 0.88

worser 1.37 0 0.11 4.15 0.25 1.95

• Binary → count → weight matrix
• Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3

Page 32: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Documents as vectors

• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

• These are very sparse vectors - most entries are zero.

Sec. 6.3

Page 33: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Formalizing vector space proximity (similarity)

• First cut: distance between two points
  – (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . .
• . . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Page 34: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Euclidean Distance
• Euclidean distance: the differences between two vectors, summed across each feature
• Differences are squared to give more weight to larger differences

  dist(x_i, x_j) = sqrt((x_i1 - x_j1)² + (x_i2 - x_j2)² + .. + (x_in - x_jn)²)

Page 35: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3

Page 36: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures (Set-based)

Assuming that D1 and D2 are the sets of terms associated with two documents:

Simple matching (coordination level match):  |D1 ∩ D2|

Dice’s Coefficient:     2 |D1 ∩ D2| / (|D1| + |D2|)

Jaccard’s Coefficient:  |D1 ∩ D2| / |D1 ∪ D2|

Cosine Coefficient:     |D1 ∩ D2| / (|D1|^(1/2) |D2|^(1/2))

Overlap Coefficient:    |D1 ∩ D2| / min(|D1|, |D2|)
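A sketch of the five set-based coefficients for two term sets, assuming plain Python sets; the example documents are made up.

```python
def set_similarities(d1: set, d2: set) -> dict:
    """Simple matching, Dice, Jaccard, cosine and overlap coefficients for two term sets."""
    inter = len(d1 & d2)
    return {
        "simple_matching": inter,
        "dice":    2 * inter / (len(d1) + len(d2)),
        "jaccard": inter / len(d1 | d2),
        "cosine":  inter / ((len(d1) * len(d2)) ** 0.5),
        "overlap": inter / min(len(d1), len(d2)),
    }

print(set_similarities({"cluster", "document", "term"},
                       {"cluster", "term", "weight", "vector"}))
```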

Page 37: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

EXAMPLES

Let the weights for the index terms assigned to two documents i and j be as follows:

Doc_i = 3, 2, 1, 0, 0, 0, 1, 1
Doc_j = 1, 1, 1, 0, 0, 1, 0, 0

Dice Coefficient
= 2 [(3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0)] / [(3+2+1+0+0+0+1+1)+(1+1+1+0+0+1+0+0)]
= 12/12 = 1

Jaccard Coefficient
= [(3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0)] / [(3+2+1+0+0+0+1+1)+(1+1+1+0+0+1+0+0) - ((3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0))]
= 6/(12-6) = 1

Page 38: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  – Cosine similarity
  – Euclidean distance
  – Kernel functions

Page 39: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Measuring Similarity of Documents

• The similarity between two documents must be measured

• There are a number of possible measures for computing the similarity between documents

• The most common one is the cosine measure, which is defined as – cosine( d1, d2 ) = (d1 • d2) / |d1| |d2|,

• where • indicates the vector dot product and |d| is the length of vector d.

Page 40: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cosine similarity measurement

•Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them.

•The result of the Cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle is of any other value.

•As the angle between the vectors shortens, the cosine of the angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.

For two-dimensional vectors A = (x1, y1) and B = (x2, y2):

  sim(A, B) = cosine(A, B) = (A · B) / (|A| |B|) = (x1*x2 + y1*y2) / (sqrt(x1² + y1²) * sqrt(x2² + y2²))
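A short sketch contrasting cosine similarity with Euclidean distance, echoing the earlier "why distance is a bad idea" slide; the vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1, 2, 0]
long_doc  = [10, 20, 0]   # same term distribution, ten times longer
print(cosine_similarity(short_doc, long_doc))   # 1.0 -- identical direction
print(euclidean_distance(short_doc, long_doc))  # large, despite very similar content
```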

Page 41: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Clustering Methods

• Many methods to compute clusters
• The general problem is NP-complete
• Each solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible

• Each trial may produce a different cluster organization

Page 42: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Stable Clustering

• Results should be independent of the initial order of documents

• Clusters should not be substantially different when new documents are added to the collection

• Results from consecutive runs should not differ significantly

Page 43: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K-Means

• Heuristic with complexity O(n log n)
  – Matrix-based algorithms are O(n²)
• Begins with an initial set of clusters. Use one of these methods:
  – Pick the cluster centroids randomly
  – Use matrix-based similarity on a small subset
  – Use a density test to pick cluster centers from sample data
    • D_i is a cluster center if at least n other documents have similarity to it greater than a threshold
    • A set of documents that are sufficiently dissimilar must exist in the collection

Page 44: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K means Algorithm

1. Select k documents from the collection to form k initial singleton clusters

2. Repeat until termination conditions are satisfied
   i. For every document d, find the cluster i whose centroid is most similar, and assign d to cluster i.
   ii. For every cluster i, recompute the centroid based on the current member documents.
   iii. Check for termination: minimal or no changes in the assignment of documents to clusters.

3. Return a list of clusters
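A minimal sketch of the algorithm above. It uses squared Euclidean distance and random seeds for brevity; the lecture's variants (cosine similarity, density-based seeding) would slot in where `dist2` and `random.sample` are used.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, max_iter=100):
    """docs: list of equal-length numeric vectors. Returns cluster assignments."""
    centroids = random.sample(docs, k)                     # 1. k initial singleton clusters
    assign = [0] * len(docs)
    for _ in range(max_iter):                              # 2. repeat until termination
        new = [min(range(k), key=lambda c: dist2(d, centroids[c])) for d in docs]  # 2.i
        if new == assign:                                  # 2.iii no changes -> stop
            break
        assign = new
        for c in range(k):                                 # 2.ii recompute centroids
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign                                          # 3. return the clusters

print(kmeans([[1, 1], [1, 2], [8, 8], [9, 8]], k=2))
```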

Page 45: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K Means Example (K=2)

[Figure: 2D points; the steps alternate "Pick seeds" → "Reassign clusters" → "Compute centroids" → "Reassign clusters" → "Compute centroids" → "Reassign clusters" → Converged!]

Page 46: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Termination conditions

• Several possibilities, e.g.,– A fixed number of iterations.– Doc partition unchanged.– Centroid positions don’t change.

Page 47: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Time Complexity

• Computing distance between doc and cluster is O(m) where m is the dimensionality of the vectors.

• Reassigning clusters: O(Kn) distance computations, or O(Knm).

• Computing centroids: Each doc gets added once to some centroid: O(nm).

• Assume these two steps are each done once for I iterations: O(IKnm).

Page 48: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  – Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  – Try out multiple starting points
  – Initialize with the results of another method.

Example showing sensitivity to seeds: in the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

Page 49: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Many Clusters?

• Number of clusters K is given
  – Partition n docs into a predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  – Given data, partition into an “appropriate” number of subsets.
  – E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.
• Can usually take an algorithm for one flavor and convert to the other.

Page 50: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K not specified in advance

• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  – application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance

• Given a clustering, define the Benefit for a doc to be some inverse distance to its centroid

• Define the Total Benefit to be the sum of the individual doc Benefits.

Page 52: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Penalize lots of clusters

• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be: Total Benefit - Total Cost.
• Find the clustering of highest value, over all choices of K.
  – Total benefit increases with increasing K. But we can stop when it doesn’t increase by “much”. The Cost term enforces this.

Page 53: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing

• Avoids local optima by randomly searching
  – Downhill move
    • A new solution with a higher (better) value than the previous solution
  – Uphill move
    • A worse solution is accepted to avoid local minima
    • The frequency decreases during the “life cycle”

• Analogy for crystal formation

Page 54: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing Algorithm

1. Get an initial set of clusters and set the temperature to T
2. Repeat until the temperature is reduced to the minimum
   – Run a loop x times

• Find a new set of clusters by altering the membership of some documents

• Compare the difference between the values of the new and old set of clusters. If there is an improvement, accept the new set of clusters, otherwise accept the new set of clusters with probability p.

– Reduce the temperature based on cooling schedule

3. Return the final set of clusters
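A minimal sketch of the annealing loop above. The cluster-value function, cooling schedule, and acceptance probability p = exp(Δ/T) are common choices, not specified by the lecture; all names are illustrative.

```python
import math
import random

def anneal(docs, k, sim, T=1.0, T_min=0.01, cooling=0.9, moves_per_temp=50):
    assign = [random.randrange(k) for _ in docs]           # 1. initial set of clusters
    def value(a):
        # total similarity between documents sharing a cluster
        return sum(sim(docs[i], docs[j]) for i in range(len(docs))
                   for j in range(len(docs)) if i != j and a[i] == a[j])
    current = value(assign)
    while T > T_min:                                       # 2. until temperature reaches minimum
        for _ in range(moves_per_temp):                    #    run a loop x times
            i = random.randrange(len(docs))
            candidate = assign[:]
            candidate[i] = random.randrange(k)             #    alter membership of a document
            delta = value(candidate) - current
            # accept improvements; accept worse solutions with probability p = exp(delta / T)
            if delta > 0 or random.random() < math.exp(delta / T):
                assign, current = candidate, current + delta
        T *= cooling                                       #    reduce temperature (cooling schedule)
    return assign                                          # 3. final set of clusters

docs = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(anneal(docs, k=2, sim=lambda a, b: -sum((x - y) ** 2 for x, y in zip(a, b))))
```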


Page 55: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing

• Simple to implement
• Solutions are reasonably good and avoid local minima
• Successful in other optimization tasks
• The initial set is very important
• Adjusting the size of clusters is difficult

Page 56: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Algorithm
• Use a population of solutions
  1. Arrange the set of documents in a circle such that documents that are similar to one another are located close to each other
  2. Find key documents from the circle and build clusters from a neighborhood of these documents
• Each arrangement of documents is a solution (chromosome)
• Fitness (r: size of the similarity neighborhood, n: number of documents)

Page 57: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Algorithm

• Pick two parent solutions x and y from the set of all solutions with preference for solutions with higher fitness score.

• Use crossover operation to combine x and y to generate a new solution z.

• Periodically mutate a solution by randomly exchanging two documents in a solution.

Page 58: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Operators

Conflict crossover:
  Parent 1: 5 7 8 3 2 6 1 9 10 4
  Parent 2: 3 8 2 6 7 5 9 1 4 10
  Child:    a new permutation combining the orderings of the two parents

Mutation:
  Before: 5 7 8 3 2 6 1 9 10 4
  After:  5 9 8 3 2 6 1 7 10 4   (documents 7 and 9 exchanged)
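A sketch of the two operators on document orderings. The lecture's "conflict crossover" is not fully specified here, so a standard order crossover stands in for it; the mutation matches the figure's swap of documents 7 and 9.

```python
import random

def order_crossover(p1, p2):
    """Keep a random slice of parent 1, fill the rest in parent 2's order
    (an illustrative stand-in for the lecture's conflict crossover)."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    remaining = [x for x in p2 if x not in child]
    for i, gene in enumerate(child):
        if gene is None:
            child[i] = remaining.pop(0)
    return child

def mutate(perm):
    """Randomly exchange two documents, e.g. 5 7 8 3 2 6 1 9 10 4 -> 5 9 8 3 2 6 1 7 10 4."""
    i, j = random.sample(range(len(perm)), 2)
    perm = perm[:]
    perm[i], perm[j] = perm[j], perm[i]
    return perm

parent1 = [5, 7, 8, 3, 2, 6, 1, 9, 10, 4]
parent2 = [3, 8, 2, 6, 7, 5, 9, 1, 4, 10]
print(order_crossover(parent1, parent2))
print(mutate(parent1))
```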

Page 59: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  – Start with each document being a single cluster.
  – Eventually all documents belong to the same cluster.
• Divisive (top-down):
  – Start with all documents belonging to the same cluster.
  – Eventually each node forms a cluster on its own.
  – Could be a recursive application of k-means-like algorithms
• Does not require the number of clusters k in advance
• Needs a termination/readout condition

Page 60: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

Page 61: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

• Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Dendrogram: Hierarchical Clustering

Page 62: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering (HAC)

• Starts with each doc in a separate cluster
  – then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

How do we measure the distance between clusters?

Page 63: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Closest pair of clusters

Many variants to defining the closest pair of clusters:
• Single-link
  – Distance of the “closest” points (single-link)
• Complete-link
  – Distance of the “furthest” points
• Centroid
  – Distance of the centroids (centers of gravity)
• (Average-link)
  – Average distance between pairs of elements

Page 64: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Linking Methods

Clique

Star

String

Page 65: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters?

• Intercluster similarity
  – Single-link (string)
  – Complete-link (clique)
  – Group average link (star)
• Single-link clustering
  – Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
  – The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items.
  – Each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster.

Page 66: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters? (Continued)

• Complete-link Clustering
  – Each document has a similarity to all other documents in the same class that exceeds the threshold value.
  – The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
  – Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.

Page 67: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters? (Continued)

• Group-average link clustering
  – a compromise between the extremes of single-link and complete-link systems
  – each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster

Page 68: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Group Average Link Clustering

• Group average link clustering
  – uses the average values of the pairwise links within a cluster to determine similarity
  – all objects contribute to intercluster similarity
  – results in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters

Page 69: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Comparison

• The Behavior of Single-Link Clusters
  – The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect.
  – Each element is usually attached to only one other member of the same cluster at each similarity level.
  – It is sufficient to remember the list of previously clustered single items.

Page 70: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Comparison

• The Behavior of Complete-Link Clusters
  – The complete-link process produces a much larger number of small, tightly linked groupings.
  – Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
  – It is necessary to remember the list of all item pairs previously considered in the clustering process.
• Comparison
  – The complete-link clustering system may be better adapted to retrieval than the single-link clusters.
  – A complete-link cluster generation is more expensive to perform than a comparable single-link process.

Page 71: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Single Link Agglomerative Clustering

• Use the maximum similarity of pairs:

  sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

  sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))

Page 72: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Single Link Example

Page 73: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:

  sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)

• Makes “tighter,” spherical clusters that are typically preferable.
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

  sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
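A tiny sketch of the update rules above for the similarity of a merged cluster to a third cluster; the numbers are illustrative.

```python
def merged_similarity(sim_ik, sim_jk, linkage):
    """Similarity of (c_i U c_j) to c_k under single-link or complete-link."""
    if linkage == "single":     # keep the most similar pair
        return max(sim_ik, sim_jk)
    if linkage == "complete":   # keep the least similar pair
        return min(sim_ik, sim_jk)
    raise ValueError(linkage)

# sim(c_i, c_k) = 0.7 and sim(c_j, c_k) = 0.2
print(merged_similarity(0.7, 0.2, "single"))    # 0.7
print(merged_similarity(0.7, 0.2, "complete"))  # 0.2
```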

Page 74: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Complete Link Example

Page 75: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Key notion: cluster representative

• We want a notion of a representative point in a cluster

• The representative should be some sort of “typical” or central point in the cluster, e.g.,
  – the point inducing the smallest radii to docs in the cluster
  – the smallest squared distances, etc.
  – the point that is the “average” of all docs in the cluster

• Centroid or center of gravity

Page 76: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Centroid-based Similarity

• Always maintain the average of the vectors in each cluster:

  s(c_j) = (1 / |c_j|) Σ_{x ∈ c_j} x

• Compute the similarity of clusters by:

  sim(c_i, c_j) = sim(s(c_i), s(c_j))

• For non-vector data, we can’t always make a centroid

Page 77: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(mn²).
• In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• Maintaining a heap of distances allows this to be done in O(mn² log n).

Page 78: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering Methods

• Generic Agglomerative Procedure (Salton '89): results in nested clusters via iterations
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of the n documents into a class of its own
  3. Merge the two most similar clusters into one;
     - replace the two clusters by the new cluster
     - recompute intercluster similarity scores w.r.t. the new cluster
  4. Repeat the above step until there are only k clusters left (note k could = 1).
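A minimal sketch of the generic agglomerative procedure, using negative squared Euclidean distance as the document-document similarity and a pluggable linkage (max = single link, min = complete link); it recomputes cluster similarities naively rather than maintaining a heap.

```python
def hac(docs, sim, k=1, linkage=max):
    """Merge the two most similar clusters until only k clusters remain."""
    pair_sim = [[sim(a, b) for b in docs] for a in docs]   # 1. all pairwise similarities
    clusters = [[i] for i in range(len(docs))]             # 2. each document in its own class
    while len(clusters) > k:                               # 4. repeat until k clusters remain
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(pair_sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best                                      # 3. merge the most similar pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

docs = [[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]]
neg_dist = lambda a, b: -sum((x - y) ** 2 for x, y in zip(a, b))
print(hac(docs, neg_dist, k=2))   # e.g. [[0, 1], [2, 3, 4]] with single link
```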

Page 79: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

79

Starting Situation
• Start with clusters of individual points and a proximity matrix

[Figure: individual points p1–p12 and the initial proximity matrix over p1, p2, p3, p4, p5, …]

Page 80: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

80

Intermediate Situation
• After some merging steps, we have some clusters

[Figure: clusters C1–C5 over points p1–p12, with the proximity matrix now defined between clusters]

Page 81: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

81

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged]

Page 82: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

82

After Merging
• The question is “How do we update the proximity matrix?”

[Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries for the merged cluster C2 ∪ C5 are marked “?”]

Page 83: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

83

How to Define Inter-Cluster Similarity

[Figure: points p1–p5, their proximity matrix, and the similarity between two clusters]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function (Ward’s Method uses squared error)


Page 88: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

88

Cluster Similarity: MIN or Single Link

• The similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – Determined by one pair of points, i.e., by one link in the proximity graph.

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 89: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

89

Hierarchical Clustering: MIN

Nested Clusters and Dendrogram

[Figure: single-link (MIN) nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.2, with leaf order 3, 6, 2, 5, 4, 1]

Page 90: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

90

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Page 91: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

91

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Page 92: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

92

Cluster Similarity: MAX or Complete Linkage

• The similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – Determined by all pairs of points in the two clusters

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 93: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

93

Hierarchical Clustering: MAX

Nested Clusters and Dendrogram

[Figure: complete-link (MAX) nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.4, with leaf order 3, 6, 4, 1, 2, 5]

Page 94: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

94

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Page 95: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

95

Limitations of MAX

Original Points Two Clusters

•Tends to break large clusters

•Biased towards globular clusters

Page 96: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

96

Cluster Similarity: Group Average
• The proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

  proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )

• Need to use average connectivity for scalability, since total proximity favors large clusters

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 97: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

97

Hierarchical Clustering: Group Average

Nested Clusters and Dendrogram

[Figure: group-average nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.25, with leaf order 3, 6, 4, 1, 2, 5]

Page 98: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

98

Hierarchical Clustering: Group Average

• Compromise between Single and Complete Link

• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters

Page 99: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

99

Cluster Similarity: Ward’s Method
• The similarity of two clusters is based on the increase in squared error when the two clusters are merged
  – Similar to group average if the distance between points is the distance squared

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of K-means– Can be used to initialize K-means

Page 100: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

100

Hierarchical Clustering: Comparison

[Figure: nested-cluster results on points 1–6 for MIN, MAX, Group Average, and Ward’s Method]

Page 101: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

101

Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone

• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different-sized clusters and convex shapes
  – Breaking large clusters

Page 102: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Group Agglomerative Clustering

[Figure: agglomerative grouping of points 1–9]

Page 103: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,