Clustering lecture 07/12/2015

Page 1: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Clustering lecture 07/12/2015

Page 2: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

The clustering task

• Cluster observations into groups so that the observations belonging in the same group are similar, whereas observations in different groups are different

• Basic questions:
  – What does “similar” mean?
  – What is a good partition of the objects? i.e., how is the quality of a solution measured?
  – How to find a good partition of the observations?

Page 3: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cluster Organization

• For a “small” number of documents, simple/flat clustering is acceptable
• Search a smaller set of clusters for relevance
• If a cluster is relevant, the documents in the cluster are also relevant
• Problem: looking for broader or more specific documents
• Hierarchical clustering has a tree-like structure

Page 4: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

4

Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset (simple/flat partitioning)
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree

Flat algorithms
  – Usually start with a random (partial) partitioning and refine it iteratively
  – K-means clustering
  – Model-based clustering
Hierarchical algorithms
  – Bottom-up/agglomerative
  – (Top-down/divisive)

Page 5: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

5

Partitional Clustering

Original Points A Partitional Clustering

Page 6: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

6

Hierarchical Clustering

[Figure: points p1–p4 shown as nested hierarchical clusters and as the corresponding dendrogram]

Page 7: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cluster Parameters
• A minimum and maximum size of clusters
  – Large cluster size → one cluster attracting many documents → multi-topic themes
• A matching threshold value for including documents in a cluster
  – Minimum degree of similarity
  – Affects the number of clusters
  – High threshold → fewer documents can join a cluster → larger number of clusters
• The degree of overlap between clusters
  – Some documents deal with more than one topic
  – Low degree of overlap → greater separation of clusters

• A maximum number of clusters

Page 8: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

8

Other Distinctions Between Sets of Clusters

• Exclusive versus non-exclusive
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  – Weights must sum to 1
  – Probabilistic clustering has similar characteristics
• Partial versus complete
  – In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
  – Clusters of widely different sizes, shapes, and densities

Page 9: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Aim of Clustering again?
• Partitioning data into classes with high intra-class similarity and low inter-class similarity
• Is it well-defined?

Page 10: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

What is Similarity?

• Clearly, a subjective or problem-dependent measure

Page 11: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Similar Are Clusters?
• Ex1: Two clusters or one cluster?

Page 12: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Similar Are Clusters?
• Ex2: A cluster or outliers?

Page 13: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity?

• Most cluster methods
  – use a matrix of similarity computations
  – compute similarities between documents

Page 14: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Matrix

         tiny   little  small   medium  large   huge
tiny     1.0    0.8     0.7     0.5     0.2     0.0
little          1.0     0.9     0.7     0.3     0.1
small                   1.0     0.7     0.3     0.2
medium                          1.0     0.5     0.3
large                                   1.0     0.8
huge                                            1.0

– Diagonal must be 1.0
– Monotonicity property must hold
– No linearity (value interpolation) assumed
– Qualitative transitive property must hold

Page 15: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Observations to cluster• Real-value attributes/variables

– e.g., salary, height

• Binary attributes

– e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes

– e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes

– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

• Variables of mixed types

– multiple attributes with various types

Page 16: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures

• For Ordinal Values
  – E.g., "small," "medium," "large," "X-large"
  – Convert to numerical values, assuming constant spacing, on a normalized [0,1] scale where max(v)=1, min(v)=0, and the others interpolate
  – E.g., "small"=0, "medium"=0.33, etc.
  – Then, use numerical similarity measures
  – Or, use a similarity matrix

Page 17: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures (cont.)

• For Nominal Values
  – E.g., "Boston", "LA", "Pittsburgh", or "male", "female", or "diffuse", "globular", "spiral", "pinwheel"
  – Binary rule: if d_{i,k} = d_{j,k}, then sim = 1, else 0
  – Use an underlying semantic property: e.g., Sim(Boston, LA) = dist(Boston, LA)^-1, or Sim(Boston, LA) = (|size(Boston) - size(LA)|) / Max(size(cities))
  – Or, use a similarity matrix

Page 18: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Defining Measures of Similarity for text documents

• Semantic similarity -- but this is hard.
• Statistical similarity: BOW (bag of words)
  – Number of words in common
    • count
  – Weighted words in common
    • words get additional weight based on inverse document frequency: two documents are more similar if they share an infrequent word
  – With the document as a vector:
    • Euclidean Distance
    • Cosine Similarity

Page 19: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Vector Space Model
• Each document, d, is considered to be a vector, d, in the term-space (the set of document “words”).
• Normally, very common words are stripped out completely and different forms of a word are reduced to one canonical form.
• In its simplest form, each document is represented by the (TF) vector d = (tf1, tf2, .., tfn), where tfi is the frequency of the ith term in the document.
• Weights may also be based on a term's inverse document frequency (IDF) in the document collection.
  – This discounts frequent words with little discriminating power.
• Finally, in order to account for documents of different lengths, each document vector is normalized so that it is of unit length.
• Given a set, S, of documents and their corresponding vector representations, we define the centroid vector c to be the vector obtained by averaging the weights of the various terms present in the documents of S.

Page 20: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 1 1 0 0 0 1

Brutus 1 1 0 1 0 0

Caesar 1 1 0 1 1 1

Calpurnia 0 1 0 0 0 0

Cleopatra 1 0 0 0 0 0

mercy 1 0 1 1 1 1

worser 1 0 1 1 1 0

Binary term-document incidence matrix
Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2

Page 21: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

• Term-document count matrices
• Consider the number of occurrences of a term in a document:
  – Each document is a count vector in ℕ^|V|: a column below

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 157 73 0 0 0 0

Brutus 4 157 0 1 0 0

Caesar 232 227 0 2 1 1

Calpurnia 0 10 0 0 0 0

Cleopatra 57 0 0 0 0 0

mercy 2 0 3 5 5 1

worser 2 0 1 1 1 0

Sec. 6.2

Page 22: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Bag of words model

• The vector representation doesn’t consider the ordering of words in a document
• “John is quicker than Mary” and “Mary is quicker than John” have the same vectors
• This is called the bag of words model.
• How can we “recover” positional information?

Page 23: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Term frequency tf

• The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.

• We want to use tf when computing similarity of two documents. But how?

• Raw term frequency is not what we want:
  – A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
  – But not 10 times more relevant.

• Relevance does not increase proportionally with term frequency.

Page 24: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Log-frequency weighting
• The log frequency weight of term t in d is

  w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, and 0 otherwise

• 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
• Finding the similarity of two documents d1 and d2: sum over the terms t appearing in both d1 and d2:

  score(d1, d2) = Σ_{t ∈ d1 ∩ d2} (1 + log10 tf_{t,d1}) (1 + log10 tf_{t,d2})

• The score is 0 if the two documents have no common terms.

Sec. 6.2
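A minimal Python sketch of the log-frequency weighting and the pairwise score above, assuming the product form of the overlap score reconstructed here; the function and variable names are illustrative only.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w_{t,d} = 1 + log10(tf_{t,d}) if tf_{t,d} > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def overlap_score(doc1_tokens, doc2_tokens):
    """Sum, over the terms shared by both documents, the product of their log-tf weights."""
    tf1, tf2 = Counter(doc1_tokens), Counter(doc2_tokens)
    shared = tf1.keys() & tf2.keys()
    return sum(log_tf_weight(tf1[t]) * log_tf_weight(tf2[t]) for t in shared)

# The score is 0 when the documents share no terms.
print(overlap_score("john is quicker than mary".split(),
                    "mary is quicker than john".split()))
```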

Page 25: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document frequency

• Rare terms are more informative than frequent terms– Recall stop words

• Consider a term that is rare in the collection (e.g., arachnocentric)

• Two documents containing this term are very likely to be relevant

• → We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1

Page 26: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document frequency, continued

• Frequent terms are less informative than rare terms
• Consider a term that is frequent in the collection (e.g., high, increase, line)
• Two documents containing such a term are more likely to be relevant than documents that don’t
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.

Sec. 6.2.1

Page 27: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

idf weight

• df_t is the document frequency of t: the number of documents that contain t
  – df_t is an inverse measure of the informativeness of t
  – df_t ≤ N
• We define the idf (inverse document frequency) of t by

  idf_t = log10(N / df_t)

  – We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.

It will turn out that the base of the log is immaterial.

Sec. 6.2.1

Page 28: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

idf example, suppose N = 1 million

term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0

idf_t = log10(N / df_t). There is one idf value for each term t in a collection.

Sec. 6.2.1
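A small sketch that reproduces the idf column above; the `df` dictionary is just the table's document-frequency values, not a real collection.

```python
import math

N = 1_000_000  # documents in the collection
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, df_t in df.items():
    # idf_t = log10(N / df_t); yields 6, 4, 3, 2, 1, 0 as in the table above
    print(f"{term:10s} idf = {math.log10(N / df_t):g}")
```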

Page 29: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Collection vs. Document frequency
• The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

• Example:

• Which word is a better search term (and should get a higher weight)?

Word Collection frequency Document frequency

insurance 10440 3997

try 10422 8760

Sec. 6.2.1

Page 30: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

tf-idf weighting

• The tf-idf weight of a term is the product of its tf weight and its idf weight.

• Best known weighting scheme in information retrieval– Note: the “-” in tf-idf is a hyphen, not a minus sign!– Alternative names: tf.idf, tf x idf

• Increases with the number of occurrences within a document

• Increases with the rarity of the term in the collection

  w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Sec. 6.2.2
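A minimal sketch of the tf-idf weight defined above; the example numbers are illustrative, not taken from the lecture.

```python
import math

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t); 0 if the term does not occur."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

# A term occurring 10 times in a document and in 1,000 of 1,000,000 documents:
print(tf_idf(tf=10, df=1_000, N=1_000_000))  # (1 + 1) * 3 = 6.0
```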

Page 31: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Example: Vector space Representation of Documents

Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth

Antony 5.25 3.18 0 0 0 0.35

Brutus 1.21 6.1 0 1 0 0

Caesar 8.59 2.54 0 1.51 0.25 0

Calpurnia 0 1.54 0 0 0 0

Cleopatra 2.85 0 0 0 0 0

mercy 1.51 0 1.9 0.12 5.25 0.88

worser 1.37 0 0.11 4.15 0.25 1.95

• Binary → count → weight matrix
• Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|

Sec. 6.3

Page 32: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Documents as vectors

• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine

• These are very sparse vectors - most entries are zero.

Sec. 6.3

Page 33: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Formalizing vector space proximity (similarity)

• First cut: distance between two points
  – (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . .
• . . . because Euclidean distance is large for vectors of different lengths.

Sec. 6.3

Page 34: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Euclidean Distance
• Euclidean distance: the differences between two vectors, summed across each feature
• Differences are squared to give more weight to larger differences

  dist(x_i, x_j) = sqrt((x_i1 - x_j1)² + (x_i2 - x_j2)² + .. + (x_in - x_jn)²)

Page 35: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Sec. 6.3

Page 36: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Similarity Measures (Set-based)

Assuming that D1 and D2 are the sets of terms associated with two documents:

Simple matching (coordination level match):  |D1 ∩ D2|

Dice’s Coefficient:     2 |D1 ∩ D2| / (|D1| + |D2|)

Jaccard’s Coefficient:  |D1 ∩ D2| / |D1 ∪ D2|

Cosine Coefficient:     |D1 ∩ D2| / (|D1|^(1/2) |D2|^(1/2))

Overlap Coefficient:    |D1 ∩ D2| / min(|D1|, |D2|)
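A sketch of the five set-based coefficients for two term sets, assuming plain Python sets; the example documents are made up.

```python
def set_similarities(d1: set, d2: set) -> dict:
    """Simple matching, Dice, Jaccard, cosine and overlap coefficients for two term sets."""
    inter = len(d1 & d2)
    return {
        "simple_matching": inter,
        "dice":    2 * inter / (len(d1) + len(d2)),
        "jaccard": inter / len(d1 | d2),
        "cosine":  inter / ((len(d1) * len(d2)) ** 0.5),
        "overlap": inter / min(len(d1), len(d2)),
    }

print(set_similarities({"cluster", "document", "term"},
                       {"cluster", "term", "weight", "vector"}))
```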

Page 37: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

EXAMPLES

Let the weights for the index terms assigned to two documents i and j be as follows:

Doc_i = 3, 2, 1, 0, 0, 0, 1, 1
Doc_j = 1, 1, 1, 0, 0, 1, 0, 0

Dice Coefficient
= 2 [(3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0)] / [(3+2+1+0+0+0+1+1)+(1+1+1+0+0+1+0+0)]
= 12/12 = 1

Jaccard Coefficient
= [(3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0)] / [(3+2+1+0+0+0+1+1)+(1+1+1+0+0+1+0+0) - ((3*1)+(2*1)+(1*1)+(0*0)+(0*0)+(0*1)+(1*0)+(1*0))]
= 6/(12-6) = 1

Page 38: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Document Clustering Techniques
• Similarity or Distance Measure: Alternative Choices
  – Cosine similarity
  – Euclidean distance
  – Kernel functions

Page 39: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Measuring Similarity of Documents

• The similarity between two documents must be measured

• There are a number of possible measures for computing the similarity between documents

• The most common one is the cosine measure, which is defined as – cosine( d1, d2 ) = (d1 • d2) / |d1| |d2|,

• where • indicates the vector dot product and |d| is the length of vector d.

Page 40: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Cosine similarity measurement

•Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them.

•The result of the Cosine function is equal to 1 when the angle is 0, and it is less than 1 when the angle is of any other value.

•As the angle between the vectors shortens, the cosine of the angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.

For two-dimensional vectors A = (x1, y1) and B = (x2, y2):

  sim(A, B) = cosine(A, B) = (A · B) / (|A| |B|) = (x1*x2 + y1*y2) / (sqrt(x1² + y1²) * sqrt(x2² + y2²))
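A short sketch contrasting cosine similarity with Euclidean distance, echoing the earlier "why distance is a bad idea" slide; the vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [1, 2, 0]
long_doc  = [10, 20, 0]   # same term distribution, ten times longer
print(cosine_similarity(short_doc, long_doc))   # 1.0 -- identical direction
print(euclidean_distance(short_doc, long_doc))  # large, despite very similar content
```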

Page 41: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Clustering Methods

• Many methods to compute clusters
• The general problem is NP-complete
• Each solution can be evaluated quickly, but exhaustive evaluation of all solutions is not feasible

• Each trial may produce a different cluster organization

Page 42: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Stable Clustering

• Results should be independent of the initial order of documents

• Clusters should not be substantially different when new documents are added to the collection

• Results from consecutive runs should not differ significantly

Page 43: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K-Means

• Heuristic with complexity O(n log n)
  – Matrix-based algorithms are O(n²)
• Begins with an initial set of clusters. Use one of these methods:
  – Pick the cluster centroids randomly
  – Use matrix-based similarity on a small subset
  – Use a density test to pick cluster centers from sample data
    • D_i is a cluster center if at least n other documents have similarity to it greater than a threshold
    • A set of documents that are sufficiently dissimilar must exist in the collection

Page 44: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K means Algorithm

1. Select k documents from the collection to form k initial singleton clusters

2. Repeat until termination conditions are satisfied
   i. For every document d, find the cluster i whose centroid is most similar, and assign d to cluster i.
   ii. For every cluster i, recompute the centroid based on the current member documents.
   iii. Check for termination: minimal or no changes in the assignment of documents to clusters.

3. Return a list of clusters
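A minimal sketch of the algorithm above. It uses squared Euclidean distance and random seeds for brevity; the lecture's variants (cosine similarity, density-based seeding) would slot in where `dist2` and `random.sample` are used.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(docs, k, max_iter=100):
    """docs: list of equal-length numeric vectors. Returns cluster assignments."""
    centroids = random.sample(docs, k)                     # 1. k initial singleton clusters
    assign = [0] * len(docs)
    for _ in range(max_iter):                              # 2. repeat until termination
        new = [min(range(k), key=lambda c: dist2(d, centroids[c])) for d in docs]  # 2.i
        if new == assign:                                  # 2.iii no changes -> stop
            break
        assign = new
        for c in range(k):                                 # 2.ii recompute centroids
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign                                          # 3. return the clusters

print(kmeans([[1, 1], [1, 2], [8, 8], [9, 8]], k=2))
```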

Page 45: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K Means Example (K=2)

[Figure: 2D points; the steps alternate "Pick seeds" → "Reassign clusters" → "Compute centroids" → "Reassign clusters" → "Compute centroids" → "Reassign clusters" → Converged!]

Page 46: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Termination conditions

• Several possibilities, e.g.,– A fixed number of iterations.– Doc partition unchanged.– Centroid positions don’t change.

Page 47: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Time Complexity

• Computing distance between doc and cluster is O(m) where m is the dimensionality of the vectors.

• Reassigning clusters: O(Kn) distance computations, or O(Knm).

• Computing centroids: Each doc gets added once to some centroid: O(nm).

• Assume these two steps are each done once for I iterations: O(IKnm).

Page 48: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Seed Choice
• Results can vary based on random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  – Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
  – Try out multiple starting points
  – Initialize with the results of another method.

Example showing sensitivity to seeds: in the figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.

Page 49: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How Many Clusters?

• Number of clusters K is given
  – Partition n docs into a predetermined number of clusters
• Finding the “right” number of clusters is part of the problem
  – Given data, partition into an “appropriate” number of subsets.
  – E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.
• Can usually take an algorithm for one flavor and convert to the other.

Page 50: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

K not specified in advance

• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  – application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

K not specified in advance

• Given a clustering, define the Benefit for a doc to be some inverse distance to its centroid

• Define the Total Benefit to be the sum of the individual doc Benefits.

Page 52: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Penalize lots of clusters

• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total Cost is KC.
• Define the Value of a clustering to be: Total Benefit - Total Cost.
• Find the clustering of highest value, over all choices of K.
  – Total benefit increases with increasing K. But we can stop when it doesn’t increase by “much”. The Cost term enforces this.

Page 53: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing

• Avoids local optima by randomly searching
  – Downhill move
    • A new solution with a higher (better) value than the previous solution
  – Uphill move
    • A worse solution is accepted to avoid local minima
    • The frequency decreases during the “life cycle”

• Analogy for crystal formation

Page 54: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing Algorithm

1. Get an initial set of clusters and set the temperature to T
2. Repeat until the temperature is reduced to the minimum
   – Run a loop x times

• Find a new set of clusters by altering the membership of some documents

• Compare the difference between the values of the new and old set of clusters. If there is an improvement, accept the new set of clusters, otherwise accept the new set of clusters with probability p.

– Reduce the temperature based on cooling schedule

3. Return the final set of clusters
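A minimal sketch of the annealing loop above. The cluster-value function, cooling schedule, and acceptance probability p = exp(Δ/T) are common choices, not specified by the lecture; all names are illustrative.

```python
import math
import random

def anneal(docs, k, sim, T=1.0, T_min=0.01, cooling=0.9, moves_per_temp=50):
    assign = [random.randrange(k) for _ in docs]           # 1. initial set of clusters
    def value(a):
        # total similarity between documents sharing a cluster
        return sum(sim(docs[i], docs[j]) for i in range(len(docs))
                   for j in range(len(docs)) if i != j and a[i] == a[j])
    current = value(assign)
    while T > T_min:                                       # 2. until temperature reaches minimum
        for _ in range(moves_per_temp):                    #    run a loop x times
            i = random.randrange(len(docs))
            candidate = assign[:]
            candidate[i] = random.randrange(k)             #    alter membership of a document
            delta = value(candidate) - current
            # accept improvements; accept worse solutions with probability p = exp(delta / T)
            if delta > 0 or random.random() < math.exp(delta / T):
                assign, current = candidate, current + delta
        T *= cooling                                       #    reduce temperature (cooling schedule)
    return assign                                          # 3. final set of clusters

docs = [[0, 0], [0, 1], [5, 5], [5, 6]]
print(anneal(docs, k=2, sim=lambda a, b: -sum((x - y) ** 2 for x, y in zip(a, b))))
```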


Page 55: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Simulated Annealing

• Simple to implement
• Solutions are reasonably good and avoid local minima
• Successful in other optimization tasks
• The initial set is very important
• Adjusting the size of clusters is difficult

Page 56: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Algorithm
• Use a population of solutions
  1. Arrange the set of documents in a circle such that documents that are similar to one another are located close to each other
  2. Find key documents from the circle and build clusters from a neighborhood of these documents
• Each arrangement of documents is a solution (chromosome)
• Fitness (r: size of the similarity neighborhood, n: number of documents)

Page 57: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Algorithm

• Pick two parent solutions x and y from the set of all solutions with preference for solutions with higher fitness score.

• Use crossover operation to combine x and y to generate a new solution z.

• Periodically mutate a solution by randomly exchanging two documents in a solution.

Page 58: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Genetic Operators

Conflict crossover:
  Parent 1: 5 7 8 3 2 6 1 9 10 4
  Parent 2: 3 8 2 6 7 5 9 1 4 10
  Child:    a new permutation combining the orderings of the two parents

Mutation:
  Before: 5 7 8 3 2 6 1 9 10 4
  After:  5 9 8 3 2 6 1 7 10 4   (documents 7 and 9 exchanged)
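A sketch of the two operators on document orderings. The lecture's "conflict crossover" is not fully specified here, so a standard order crossover stands in for it; the mutation matches the figure's swap of documents 7 and 9.

```python
import random

def order_crossover(p1, p2):
    """Keep a random slice of parent 1, fill the rest in parent 2's order
    (an illustrative stand-in for the lecture's conflict crossover)."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    remaining = [x for x in p2 if x not in child]
    for i, gene in enumerate(child):
        if gene is None:
            child[i] = remaining.pop(0)
    return child

def mutate(perm):
    """Randomly exchange two documents, e.g. 5 7 8 3 2 6 1 9 10 4 -> 5 9 8 3 2 6 1 7 10 4."""
    i, j = random.sample(range(len(perm)), 2)
    perm = perm[:]
    perm[i], perm[j] = perm[j], perm[i]
    return perm

parent1 = [5, 7, 8, 3, 2, 6, 1, 9, 10, 4]
parent2 = [3, 8, 2, 6, 7, 5, 9, 1, 4, 10]
print(order_crossover(parent1, parent2))
print(mutate(parent1))
```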

Page 59: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Clustering Algorithms
• Agglomerative (bottom-up):
  – Start with each document being a single cluster.
  – Eventually all documents belong to the same cluster.
• Divisive (top-down):
  – Start with all documents belonging to the same cluster.
  – Eventually each node forms a cluster on its own.
  – Could be a recursive application of k-means-like algorithms
• Does not require the number of clusters k in advance
• Needs a termination/readout condition

Page 60: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

Page 61: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

• Clustering obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Dendrogram: Hierarchical Clustering

Page 62: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering (HAC)

• Starts with each doc in a separate cluster
  – then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

How do we measure the distance between clusters?

Page 63: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Closest pair of clusters

Many variants to defining the closest pair of clusters:
• Single-link
  – Distance of the “closest” points (single-link)
• Complete-link
  – Distance of the “furthest” points
• Centroid
  – Distance of the centroids (centers of gravity)
• (Average-link)
  – Average distance between pairs of elements

Page 64: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Linking Methods

Clique

Star

String

Page 65: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters?

• Intercluster similarity
  – Single-link (string)
  – Complete-link (clique)
  – Group average link (star)
• Single-link clustering
  – Each document must have a similarity exceeding a stated threshold value with at least one other document in the same class.
  – The similarity between a pair of clusters is taken to be the similarity between the most similar pair of items.
  – Each cluster member will be more similar to at least one member in that same cluster than to any member of another cluster.

Page 66: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters? (Continued)

• Complete-link Clustering
  – Each document has a similarity to all other documents in the same class that exceeds the threshold value.
  – The similarity between the least similar pair of items from the two clusters is used as the cluster similarity.
  – Each cluster member is more similar to the most dissimilar member of that cluster than to the most dissimilar member of any other cluster.

Page 67: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

How to Combine Clusters? (Continued)

• Group-average link clustering
  – a compromise between the extremes of single-link and complete-link systems
  – each cluster member has a greater average similarity to the remaining members of that cluster than it does to all members of any other cluster

Page 68: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Group Average Link Clustering

• Group average link clustering
  – uses the average values of the pairwise links within a cluster to determine similarity
  – all objects contribute to intercluster similarity
  – results in a structure intermediate between the loosely bound single-link clusters and the tightly bound complete-link clusters

Page 69: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Comparison

• The Behavior of Single-Link Clusters
  – The single-link process tends to produce a small number of large clusters that are characterized by a chaining effect.
  – Each element is usually attached to only one other member of the same cluster at each similarity level.
  – It is sufficient to remember the list of previously clustered single items.

Page 70: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Comparison

• The Behavior of Complete-Link Clusters
  – The complete-link process produces a much larger number of small, tightly linked groupings.
  – Each item in a complete-link cluster is guaranteed to resemble all other items in that cluster at the stated similarity level.
  – It is necessary to remember the list of all item pairs previously considered in the clustering process.
• Comparison
  – The complete-link clustering system may be better adapted to retrieval than the single-link clusters.
  – A complete-link cluster generation is more expensive to perform than a comparable single-link process.

Page 71: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Single Link Agglomerative Clustering

• Use the maximum similarity of pairs:

  sim(c_i, c_j) = max_{x ∈ c_i, y ∈ c_j} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

  sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))

Page 72: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Single Link Example

Page 73: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs:

  sim(c_i, c_j) = min_{x ∈ c_i, y ∈ c_j} sim(x, y)

• Makes “tighter,” spherical clusters that are typically preferable.
• After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:

  sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
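A tiny sketch of the update rules above for the similarity of a merged cluster to a third cluster; the numbers are illustrative.

```python
def merged_similarity(sim_ik, sim_jk, linkage):
    """Similarity of (c_i U c_j) to c_k under single-link or complete-link."""
    if linkage == "single":     # keep the most similar pair
        return max(sim_ik, sim_jk)
    if linkage == "complete":   # keep the least similar pair
        return min(sim_ik, sim_jk)
    raise ValueError(linkage)

# sim(c_i, c_k) = 0.7 and sim(c_j, c_k) = 0.2
print(merged_similarity(0.7, 0.2, "single"))    # 0.7
print(merged_similarity(0.7, 0.2, "complete"))  # 0.2
```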

Page 74: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Complete Link Example

Page 75: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Key notion: cluster representative

• We want a notion of a representative point in a cluster

• The representative should be some sort of “typical” or central point in the cluster, e.g.,
  – the point inducing the smallest radii to docs in the cluster
  – the smallest squared distances, etc.
  – the point that is the “average” of all docs in the cluster

• Centroid or center of gravity

Page 76: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Centroid-based Similarity

• Always maintain the average of the vectors in each cluster:

  s(c_j) = (1 / |c_j|) Σ_{x ∈ c_j} x

• Compute the similarity of clusters by:

  sim(c_i, c_j) = sim(s(c_i), s(c_j))

• For non-vector data, we can’t always make a centroid

Page 77: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(mn²).
• In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• Maintaining a heap of distances allows this to be done in O(mn² log n).

Page 78: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Hierarchical Agglomerative Clustering Methods

• Generic Agglomerative Procedure (Salton '89): results in nested clusters via iterations
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of the n documents into a class of its own
  3. Merge the two most similar clusters into one;
     - replace the two clusters by the new cluster
     - recompute intercluster similarity scores w.r.t. the new cluster
  4. Repeat the above step until there are only k clusters left (note k could = 1).
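A minimal sketch of the generic agglomerative procedure, using negative squared Euclidean distance as the document-document similarity and a pluggable linkage (max = single link, min = complete link); it recomputes cluster similarities naively rather than maintaining a heap.

```python
def hac(docs, sim, k=1, linkage=max):
    """Merge the two most similar clusters until only k clusters remain."""
    pair_sim = [[sim(a, b) for b in docs] for a in docs]   # 1. all pairwise similarities
    clusters = [[i] for i in range(len(docs))]             # 2. each document in its own class
    while len(clusters) > k:                               # 4. repeat until k clusters remain
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(pair_sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best                                      # 3. merge the most similar pair
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

docs = [[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]]
neg_dist = lambda a, b: -sum((x - y) ** 2 for x, y in zip(a, b))
print(hac(docs, neg_dist, k=2))   # e.g. [[0, 1], [2, 3, 4]] with single link
```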

Page 79: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

79

Starting Situation
• Start with clusters of individual points and a proximity matrix

[Figure: individual points p1–p12 and the initial proximity matrix over p1, p2, p3, p4, p5, …]

Page 80: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

80

Intermediate Situation
• After some merging steps, we have some clusters

[Figure: clusters C1–C5 over points p1–p12, with the proximity matrix now defined between clusters]

Page 81: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

81

Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged]

Page 82: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

82

After Merging
• The question is “How do we update the proximity matrix?”

[Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries for the merged cluster C2 ∪ C5 are marked “?”]

Page 83: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

83

How to Define Inter-Cluster Similarity

[Figure: points p1–p5, their proximity matrix, and the similarity between two clusters]

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function (Ward’s Method uses squared error)


Page 88: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

88

Cluster Similarity: MIN or Single Link

• The similarity of two clusters is based on the two most similar (closest) points in the different clusters
  – Determined by one pair of points, i.e., by one link in the proximity graph.

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 89: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

89

Hierarchical Clustering: MIN

Nested Clusters and Dendrogram

[Figure: single-link (MIN) nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.2, with leaf order 3, 6, 2, 5, 4, 1]

Page 90: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

90

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Page 91: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

91

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Page 92: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

92

Cluster Similarity: MAX or Complete Linkage

• The similarity of two clusters is based on the two least similar (most distant) points in the different clusters
  – Determined by all pairs of points in the two clusters

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 93: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

93

Hierarchical Clustering: MAX

Nested Clusters and Dendrogram

[Figure: complete-link (MAX) nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.4, with leaf order 3, 6, 4, 1, 2, 5]

Page 94: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

94

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Page 95: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

95

Limitations of MAX

Original Points Two Clusters

•Tends to break large clusters

•Biased towards globular clusters

Page 96: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

96

Cluster Similarity: Group Average
• The proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

  proximity(Cluster_i, Cluster_j) = ( Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) ) / ( |Cluster_i| × |Cluster_j| )

• Need to use average connectivity for scalability, since total proximity favors large clusters

       I1    I2    I3    I4    I5
I1    1.00  0.90  0.10  0.65  0.20
I2    0.90  1.00  0.70  0.60  0.50
I3    0.10  0.70  1.00  0.40  0.30
I4    0.65  0.60  0.40  1.00  0.80
I5    0.20  0.50  0.30  0.80  1.00

Page 97: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

97

Hierarchical Clustering: Group Average

Nested Clusters and Dendrogram

[Figure: group-average nested clusters for points 1–6; the dendrogram merges at heights between 0.05 and 0.25, with leaf order 3, 6, 4, 1, 2, 5]

Page 98: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

98

Hierarchical Clustering: Group Average

• Compromise between Single and Complete Link

• Strengths
  – Less susceptible to noise and outliers
• Limitations
  – Biased towards globular clusters

Page 99: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

99

Cluster Similarity: Ward’s Method
• The similarity of two clusters is based on the increase in squared error when the two clusters are merged
  – Similar to group average if the distance between points is the distance squared

• Less susceptible to noise and outliers

• Biased towards globular clusters

• Hierarchical analogue of K-means– Can be used to initialize K-means

Page 100: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

100

Hierarchical Clustering: Comparison

[Figure: nested-cluster results on points 1–6 for MIN, MAX, Group Average, and Ward’s Method]

Page 101: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

101

Hierarchical Clustering: Problems and Limitations

• Once a decision is made to combine two clusters, it cannot be undone

• No objective function is directly minimized
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers
  – Difficulty handling different-sized clusters and convex shapes
  – Breaking large clusters

Page 102: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,

Group Agglomerative Clustering

[Figure: agglomerative grouping of points 1–9]

Page 103: Clustering lecture 07/12/2015. The clustering task Cluster observations into groups so that the observations belonging in the same group are similar,