
Page 1: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

807 - TEXT ANALYTICS

Massimo Poesio

Lecture 2: Text (Document) clustering

Page 2: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Reminder: Clustering

• Or, discovering similarities between objects
  – Individuals, documents, …
• Applications:
  – Recommender systems
  – Document organization

Page 3: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Recommending: restaurants

• We have a list of all Wivenhoe restaurants
  – with ratings for some, as provided by some Uni Essex students / staff
• Which restaurant(s) should I recommend to you?

Page 4: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Input

Page 5: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Algorithm 0

• Recommend to you the most popular restaurants
  – say, # positive votes minus # negative votes
• Ignores your culinary preferences
  – and the judgements of those with similar preferences
• How can we exploit the wisdom of “like-minded” people?
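A minimal sketch of this popularity baseline (Algorithm 0), with made-up toy data — the (person, restaurant, vote) layout is purely hypothetical, not the slides' actual matrix:

```python
from collections import defaultdict

# Hypothetical toy ratings: (person, restaurant, vote), vote = +1 like / -1 dislike.
ratings = [
    ("Alice", "Black Buoy", +1), ("Bob", "Black Buoy", +1),
    ("Cindy", "Black Buoy", -1), ("Estie", "Black Buoy", +1),
    ("Dave", "Bakehouse", +1),
]

def most_popular(ratings):
    """Algorithm 0: score each restaurant by # positive votes minus # negative votes."""
    score = defaultdict(int)
    for _person, restaurant, vote in ratings:
        score[restaurant] += vote
    return max(score, key=score.get)

print(most_popular(ratings))   # 'Black Buoy' (score +2 vs. +1 for Bakehouse)
```

Note that nothing about the recipient's own tastes enters the score, which is exactly the weakness the slide points out.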

Page 6: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Another look at the input - a matrix

Page 7: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Now that we have a matrix

View all other entries as zeros for now.

Page 8: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

PREFERENCE-DEFINED DATA SPACE

(Scatter plot: each person plotted as a point in the space defined by their restaurant preferences.)

Page 9: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Similarity between two people

• Similarity between their preference vectors.
• Inner products are a good start.
• Dave has similarity 3 with Estie
  – but −2 with Cindy.
• Perhaps recommend Black Buoy to Dave
  – and Bakehouse to Bob, etc.
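A minimal sketch of the inner-product similarity between preference vectors. The vectors below are invented so that the numbers mirror the slide's example (3 for Dave–Estie, −2 for Dave–Cindy); they are not the actual ratings matrix:

```python
import numpy as np

# Hypothetical preference vectors over the same five restaurants:
# +1 = liked, -1 = disliked, 0 = no rating (treated as zero, as on the earlier slide).
dave  = np.array([+1, +1, +1, -1,  0])
estie = np.array([+1, +1, +1,  0,  0])
cindy = np.array([-1,  0, -1,  0, +1])

print(int(dave @ estie))   # 3  -> similar tastes
print(int(dave @ cindy))   # -2 -> opposite tastes
```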

Page 10: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Algorithm 1.1

• You give me your preferences and I need to give you a recommendation.

• I find the person “most similar” to you in my database and recommend something he likes.

• Aspects to consider:
  – No attempt to discern cuisines, etc.
  – What if you’ve been to all the restaurants he has?
  – Do you want to rely on one person’s opinions?

Page 11: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Algorithm 1.k

• You give me your preferences and I need to give you a recommendation.

• I find the k people “most similar” to you in my database and recommend what’s most popular amongst them.

• Issues:
  – A priori unclear what k should be
  – Risks being influenced by “unlike minds”
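A minimal sketch of Algorithm 1.k under stated assumptions: preferences are +1/−1/0 vectors, similarity is the plain inner product, and the data layout and function name are my own, not the lecture's:

```python
import numpy as np

def recommend_knn(prefs, you, k=3):
    """Find the k people most similar to `you` (by inner product) and return the
    index of the restaurant most popular amongst them that you have not rated."""
    sims = {person: float(np.dot(vec, you)) for person, vec in prefs.items()}
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    votes = np.sum([prefs[p] for p in neighbours], axis=0).astype(float)
    votes[you != 0] = -np.inf          # never recommend somewhere you already rated
    return int(np.argmax(votes))

# Example with made-up data: 4 people, 5 restaurants.
prefs = {
    "Alice": np.array([+1, -1,  0, +1,  0]),
    "Bob":   np.array([+1,  0, +1,  0, -1]),
    "Cindy": np.array([-1, +1,  0, -1,  0]),
    "Estie": np.array([+1, -1, +1, +1,  0]),
}
print(recommend_knn(prefs, you=np.array([+1, -1, 0, 0, 0]), k=2))   # -> 3
```

The choice of k is exactly the issue flagged above: too small and one "unlike mind" dominates, too large and the neighbourhood stops being like-minded.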

Page 12: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Slightly more sophisticated attempt

• Group similar users together into clusters
• You give your preferences and seek a recommendation, then
  – Find the “nearest cluster” (what’s this?)
  – Recommend the restaurants most popular in this cluster
• Features:
  – avoids data sparsity issues
  – still no attempt to discern why you’re recommended what you’re recommended
  – how do you cluster?

Page 13: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

CLUSTERS

• Can cluster:
  – Dave, Estie
  – Alice, Bob
  – …
  – Fred
  – Cindy

Page 14: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How do you cluster?

• Must keep similar people together in a cluster
• Separate dissimilar people
• Factors:
  – Need a notion of similarity/distance
  – Vector space? Normalization?
  – How many clusters?
    • Fixed a priori?
    • Completely data driven?
  – Avoid “trivial” clusters – too large or small

Page 15: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Looking beyond

Clustering people for restaurant recommendations leads to:
• Clustering other things (documents, web pages)
• Other approaches to recommendation (e.g., Amazon.com)
• General unsupervised machine learning

Page 16: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Text clustering

• Search results clustering
• Document clustering

Page 17: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Navigating search results

• Given the results of a search (say jaguar), partition into groups of related docs
  – sense disambiguation
• Approach followed by the Uni Essex site
  – Kruschwitz / al Bakouri / Lungley
• Other examples: IBM InfoSphere Data Explorer
  – (was: vivisimo.com)

Page 18: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Results list clustering example

• Cluster 1:
  – Jaguar Motor Cars’ home page
  – Mike’s XJS resource page
  – Vermont Jaguar owners’ club
• Cluster 2:
  – Big cats
  – My summer safari trip
  – Pictures of jaguars, leopards and lions
• Cluster 3:
  – Jacksonville Jaguars’ Home Page
  – AFC South Football Teams

Page 19: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Search results clustering: example

Page 20: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Why cluster documents?

• For improving recall in search applications
• For speeding up vector space retrieval
• Corpus analysis/navigation
  – Sense disambiguation in search results

Page 21: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Improving search recall

• Cluster hypothesis – Documents with similar text are related
• Ergo, to improve search recall:
  – Cluster docs in corpus a priori
  – When a query matches a doc D, also return other docs in the cluster containing D
• Hope: docs containing automobile returned on a query for car because
  – clustering grouped together docs containing car with those containing automobile.

Why might this happen?

Page 22: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Speeding up vector space retrieval

• In vector space retrieval, must find nearest doc vectors to query vector
• This would entail finding the similarity of the query to every doc – slow!
• By clustering docs in corpus a priori
  – find nearest docs in cluster(s) close to query
  – inexact but avoids exhaustive similarity computation

Exercise: Make up a simple example with points on a line in 2 clusters where this inexactness shows up.

Page 23: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Corpus analysis/navigation

• Given a corpus, partition it into groups of related docs
  – Recursively, can induce a tree of topics
  – Allows user to browse through corpus to home in on information
  – Crucial need: meaningful labels for topic nodes.

Page 24: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

CLUSTERING DOCUMENTS IN A (VERY) LARGE COLLECTION: GOOGLE NEWS

Page 25: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

CLUSTERING DOCUMENTS IN A VERY LARGE COLLECTION: JRC’S NEWS EXPLORER

Page 26: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

What makes docs “related”?

• Ideal: semantic similarity.
• Practical: statistical similarity
  – We will use cosine similarity.
  – Docs as vectors.
  – For many algorithms, easier to think in terms of a distance (rather than similarity) between docs.
  – We will describe algorithms in terms of cosine similarity.

Page 27: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

DOCUMENTS AS BAGS OF WORDS

DOCUMENT:
  broad tech stock rally may signal trend - traders.
  technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.

INDEX:
  broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, trend

Page 28: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Doc as vector

• Each doc j is a vector of tf-idf values, one component for each term.
• Can normalize to unit length.
• So we have a vector space
  – terms are axes – aka features
  – n docs live in this space
  – even with stemming, may have 10000+ dimensions
  – do we really want to use all terms?

Page 29: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE

    tfidf_{i,k} = tf_{i,k} · log( N / df_i )

where tf_{i,k} is the frequency of term i in document k, N is the total number of documents, and df_i is the number of documents containing term i.
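A minimal sketch of the weighting scheme above, on two made-up "documents" (the data and function name are hypothetical, just to show the formula in action):

```python
import math
from collections import Counter

# Two toy documents as bags of words.
docs = [
    "broad tech stock rally".split(),
    "tech stocks rallied broadly".split(),
]

N = len(docs)                                            # total number of documents
df = Counter(term for doc in docs for term in set(doc))  # df_i: # docs containing term i

def tfidf(term, doc):
    """tfidf_{i,k} = tf_{i,k} * log(N / df_i), as in the formula above."""
    return doc.count(term) * math.log(N / df[term])

print(tfidf("broad", docs[0]))   # only in doc 0 -> positive weight (log 2)
print(tfidf("tech", docs[0]))    # in every doc  -> weight 0
```

Terms that occur in every document get weight 0, which is the point of the idf factor.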

Page 30: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Intuition

Postulate: Documents that are “close together” in vector space talk about the same things.

(Diagram: documents D1–D4 plotted as points on term axes t1, t2, t3.)

Page 31: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Cosine similarity

• Aka normalized inner product.

Cosine similarity of D_j, D_k:

    sim(D_j, D_k) = (D_j · D_k) / (|D_j| |D_k|) = Σ_{i=1}^{m} w_{i,j} · w_{i,k}

(the last equality holds when the doc vectors are normalized to unit length).
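A minimal sketch of the computation, assuming plain NumPy vectors (the function name is mine):

```python
import numpy as np

def cosine_sim(d_j, d_k):
    """Cosine similarity: inner product of the two vectors after length normalisation."""
    d_j = d_j / np.linalg.norm(d_j)
    d_k = d_k / np.linalg.norm(d_k)
    return float(d_j @ d_k)

print(cosine_sim(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))  # ~1.0 (same direction)
print(cosine_sim(np.array([1.0, 0.0, 0.0]), np.array([0.0, 3.0, 0.0])))  # 0.0 (no shared terms)
```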

Page 32: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Two flavors of clustering

• Given n docs and a positive integer k, partition docs into k (disjoint) subsets.
• Given docs, partition into an “appropriate” number of subsets.
  – E.g., for query results – ideal value of k not known up front – though UI may impose limits.
• Can usually take an algorithm for one flavor and convert to the other.

Page 33: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Thought experiment

• Consider clustering a large set of computer science documents– what do you expect to see in the vector space?

Page 34: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Thought experiment

• Consider clustering a large set of computer science documents– what do you expect to see in the vector space?

(Diagram: blobs of documents labelled NLP, Graphics, AI, Theory, Arch.)

Page 35: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Decision boundaries

• Could we use these blobs to infer the subject of a new document?

(Diagram: the same blobs – NLP, Graphics, AI, Theory, Arch. – now with decision boundaries between them.)

Page 36: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Deciding what a new doc is about

• Check which region the new doc falls into
  – can output “softer” decisions as well.

(Diagram: the new doc lands in the AI region, so it is labelled “= AI”.)

Page 37: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Setup

• Given “training” docs for each category
  – Theory, AI, NLP, etc.
• Cast them into a decision space
  – generally a vector space with each doc viewed as a bag of words
• Build a classifier that will classify new docs
  – Essentially, partition the decision space
• Given a new doc, figure out which partition it falls into

Page 38: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Supervised vs. unsupervised learning

• This setup is called supervised learning in the terminology of Machine Learning
• In the domain of text, various names
  – Text classification, text categorization
  – Document classification/categorization
  – “Automatic” categorization
  – Routing, filtering …
• In contrast, the earlier setting of clustering is called unsupervised learning
  – Presumes no availability of training samples
  – Clusters output may not be thematically unified.

Page 39: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

“Which is better?”

• Depends
  – on your setting
  – on your application
• Can use in combination
  – Analyze a corpus using clustering
  – Hand-tweak the clusters and label them
  – Use clusters as training input for classification
  – Subsequent docs get classified
• Computationally, the methods are quite different

Page 40: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

What more can these methods do?

• Assigning a category label to a document is one way of adding structure to it.
• Can add others, e.g., extract from the doc
  – people
  – places
  – dates
  – organizations …
• This process is known as information extraction
  – can also be addressed using supervised learning.

Page 41: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Clustering: a bit of terminology

• REPRESENTATIVE
• CENTROID
• OUTLIER

Page 42: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Key notion: cluster representative

• In the algorithms to follow, will generally need a notion of a representative point in a cluster
• Representative should be some sort of “typical” or central point in the cluster, e.g.,
  – point inducing smallest radii to docs in cluster
  – smallest squared distances, etc.
  – point that is the “average” of all docs in the cluster
• Need not be a document

Page 43: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Key notion: cluster centroid

• Centroid of a cluster = component-wise average of the vectors in the cluster – is a vector.
  – Need not be a doc.
• Centroid of (1,2,3); (4,5,6); (7,2,6) is (4,3,5).
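The slide's arithmetic, reproduced as a one-line NumPy check:

```python
import numpy as np

docs = np.array([[1, 2, 3], [4, 5, 6], [7, 2, 6]], dtype=float)
centroid = docs.mean(axis=0)    # component-wise average
print(centroid)                 # [4. 3. 5.] -- a vector, not necessarily any of the docs
```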

Page 44: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

(Outliers in centroid computation)

• Can ignore outliers when computing the centroid.
• What is an outlier?
  – Lots of statistical definitions, e.g.,
  – moment of point to centroid > M × some cluster moment, say M = 10.

(Diagram: a cluster with its centroid and a distant outlier point.)

Page 45: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Clustering algorithms

• Partitional vs. hierarchical
• Agglomerative
• K-means

Page 46: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Partitional Clustering

(Figure: the original points, and a partitional clustering of them.)

Page 47: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Hierarchical Clustering

(Figure: points p1–p4 grouped by a traditional and a non-traditional hierarchical clustering, with the corresponding traditional and non-traditional dendrograms.)

Page 48: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Agglomerative clustering

• Given target number of clusters k.
• Initially, each doc viewed as a cluster
  – start with n clusters;
• Repeat:
  – while there are > k clusters, find the “closest pair” of clusters and merge them.

Page 49: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

“Closest pair” of clusters

• Many variants to defining the closest pair of clusters
  – Clusters whose centroids are the most cosine-similar
  – … whose “closest” points are the most cosine-similar
  – … whose “furthest” points are the most cosine-similar

Page 50: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Example: n=6, k=3, closest pair of centroids

(Figure: docs d1–d6 merged pairwise; shows the centroid after the first step and the centroid after the second step.)

Page 51: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Issues

• Have to support finding closest pairs continually
  – compare all pairs?
• Potentially n³ cosine similarity computations
• To avoid this: use approximations.
  – “points” are changing as centroids change.
• Changes at each step are not localized
  – on a large corpus, memory management an issue
  – sometimes addressed by clustering a sample.

Why? What’s this?

Page 52: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Exercise

• Consider agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations – how many will your scheme use?

Page 53: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

“Using approximations”

• In the standard algorithm, must find the closest pair of centroids at each step
• Approximation: instead, find a nearly closest pair
  – use some data structure that makes this approximation easier to maintain
  – simplistic example: maintain the closest pair based on distances in a projection onto a random line

Page 54: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

A partitional clustering algorithm: K-MEANS

• Given k – the number of clusters desired.
• Each cluster associated with a centroid.
• Each point assigned to the cluster with the closest centroid.
• Iterate.

Page 55: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Basic iteration

• At the start of the iteration, we have k centroids.
• Each doc is assigned to the nearest centroid.
• All docs assigned to the same centroid are averaged to compute a new centroid;
  – thus we have k new centroids.

Page 56: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Iteration example

(Figure: docs with the current centroids.)

Page 57: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Iteration example

(Figure: docs with the new centroids.)

Page 58: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

K-means Clustering: the full algorithm
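The algorithm figure from this slide does not survive in the text. As a stand-in, here is a minimal sketch of the iteration just described – a hand-rolled version, not the lecture's own code; it uses Euclidean distance, whereas the cosine variant discussed in the lecture would normalise rows and centroids to unit length. Function and parameter names are my own:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n_docs, n_terms) matrix."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving: converged
            break
        centroids = new_centroids
    return labels, centroids
```

Usage would look like `labels, centroids = kmeans(doc_matrix, k=5)`, where `doc_matrix` holds one (tf-idf) doc vector per row.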

Page 59: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

K-means Clustering – Details

• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to ‘until relatively few points change clusters’
• Complexity is O(n * K * I * d)
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Page 60: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Effect of initial choice of centroids:Two different K-means Clusterings

(Figure: the same set of original points clustered twice by k-means with different initial centroids – one run yields the optimal clustering, the other a sub-optimal clustering.)

Page 61: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Importance of Choosing Initial Centroids

(Figure: six snapshots, iterations 1–6, of k-means on the same data, showing how the centroids move as the clustering converges.)

Page 62: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Termination conditions

• Several possibilities, e.g.,
  – A fixed number of iterations.
  – Doc partition unchanged.
  – Centroid positions don’t change.

Does this mean that the docs in a cluster are unchanged?

Page 63: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Convergence

• Why should the k-means algorithm ever reach a fixed point?
  – A state in which clusters don’t change.
• k-means is a special case of a general procedure known as the EM algorithm.
  – Under reasonable conditions, known to converge.
  – Number of iterations could be large.

Page 64: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Exercise

• Consider running 2-means clustering on a corpus, each doc of which is from one of two different languages. What are the two clusters we would expect to see?

• Is agglomerative clustering likely to produce different results?

Page 65: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

k not specified in advance

• Say, the results of a query.
• Solve an optimization problem: penalize having lots of clusters
  – application dependent, e.g., compressed summary of search results list.
• Tradeoff between having more clusters (better focus within each cluster) and having too many clusters

Page 66: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

k not specified in advance

• Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid
• Define the Total Benefit to be the sum of the individual doc Benefits.

Why is there always a clustering of Total Benefit n?

Page 67: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Penalize lots of clusters

• For each cluster, we have a Cost C.
• Thus for a clustering with k clusters, the Total Cost is kC.
• Define the Value of a clustering to be Total Benefit − Total Cost.
• Find the clustering of highest Value, over all choices of k (see the sketch below).
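A minimal sketch of this Value criterion. It assumes the doc vectors are already unit-normalised (so inner products are cosine similarities), and `cluster_fn` stands for any clustering routine returning labels and centroids – for instance the k-means sketch earlier; all names are hypothetical:

```python
import numpy as np

def total_benefit(X, labels, centroids):
    """Sum over docs of the cosine similarity to the doc's own centroid
    (rows of X assumed unit length)."""
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return float(np.sum(np.einsum('ij,ij->i', X, c[labels])))

def best_k(X, cost_per_cluster, k_values, cluster_fn):
    """Pick the k whose clustering maximises Value = Total Benefit - k * C."""
    value = {}
    for k in k_values:
        labels, centroids = cluster_fn(X, k)
        value[k] = total_benefit(X, labels, centroids) - k * cost_per_cluster
    return max(value, key=value.get)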

Page 68: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Evaluating K-means Clusters

• The most common measure is the Sum of Squared Error (SSE)
  – For each point, the error is the distance to its nearest cluster centroid
  – To get SSE, we square these errors and sum them:

      SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(m_i, x)²

  – x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    • can show that m_i corresponds to the center (mean) of the cluster
  – Given two clusterings, we can choose the one with the smallest error
  – One easy way to reduce SSE is to increase K, the number of clusters
    • A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
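The SSE formula above, as a short NumPy sketch (assuming `labels` and `centroids` come from a clustering run such as the earlier k-means sketch; the function name is mine):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    diffs = X - centroids[labels]     # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))
```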

Page 69: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Limitations of K-means

• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.

Page 70: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Limitations of K-means: Differing Sizes

(Figure: original points vs. the k-means result with 3 clusters.)

Page 71: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Limitations of K-means: Differing Density

(Figure: original points vs. the k-means result with 3 clusters.)

Page 72: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Limitations of K-means: Non-globular Shapes

(Figure: original points vs. the k-means result with 2 clusters.)

Page 73: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Hierarchical clustering

• As clusters agglomerate, docs likely to fall into a hierarchy of “topics” or concepts.

(Figure: docs d1–d5; d1 and d2 merge into {d1,d2}, d4 and d5 into {d4,d5}, then d3 joins to form {d3,d4,d5}.)

Page 74: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

(Figure: six points and the dendrogram recording the order in which they were merged, with merge heights on the vertical axis.)

Page 75: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Strengths of Hierarchical Clustering

• Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful taxonomies
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Page 76: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Hierarchical Clustering

• Two main types of hierarchical clustering
  – Agglomerative:
    • Start with the points as individual clusters
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  – Divisive:
    • Start with one, all-inclusive cluster
    • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
  – Merge or split one cluster at a time

Page 77: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering


Hierarchical Agglomerative Clustering (HAC)

• The more popular hierarchical clustering technique
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

Page 78: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Agglomerative Clustering Algorithm

• Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms (see the sketch below)
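Not the lecture's own code, but a minimal sketch of steps 1–6 using SciPy's implementation (assuming SciPy is available); the toy data are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((8, 5))      # 8 toy "docs" with 5 features

# linkage() computes the full merge history (steps 1-6 above).
# method= chooses how inter-cluster proximity is defined (see the next slides):
# 'single' = MIN, 'complete' = MAX, 'average' = group average, 'ward' = Ward's method.
Z = linkage(X, method='average', metric='cosine')

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(labels)
```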

Page 79: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Starting Situation

• Start with clusters of individual points and a proximity matrix

(Figure: points p1–p5, each in its own cluster, with the initial point-by-point proximity matrix.)

Page 80: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Intermediate Situation

• After some merging steps, we have some clusters

(Figure: clusters C1–C5, with the proximity matrix now indexed by clusters rather than points.)

Page 81: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Intermediate Situation

• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

(Figure: the same clusters C1–C5, with C2 and C5 about to be merged.)

Page 82: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

After Merging

• The question is “How do we update the proximity matrix?”

(Figure: clusters C1, C2 ∪ C5, C3, C4; the rows and columns of the proximity matrix involving C2 ∪ C5 are marked “?”.)

Page 83: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How to Define Inter-Cluster Similarity

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error

(Figure: two groups of points p1–p5 with “Similarity?” marked between them, alongside the proximity matrix.)

Page 84: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How to Define Inter-Cluster Similarity

(The same figure and list of inter-cluster similarity definitions, repeated.)

Page 85: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How to Define Inter-Cluster Similarity

(The same figure and list of inter-cluster similarity definitions, repeated.)

Page 86: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How to Define Inter-Cluster Similarity

(The same figure and list of inter-cluster similarity definitions, repeated.)

Page 87: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Text clustering issues and applications

Page 88: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

List of issues/applications covered

• Term vs. document space clustering
• Multi-lingual docs
• Feature selection
• Speeding up scoring
• Building navigation structures
  – “Automatic taxonomy induction”
• Labeling

Page 89: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Term vs. document space

• Thus far, we clustered docs based on their similarities in term space
• For some applications, e.g., topic analysis for inducing navigation structures, can “dualize”:
  – use docs as axes
  – represent (some) terms as vectors
  – proximity based on co-occurrence of terms in docs
  – now clustering terms, not docs

Page 90: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Term vs. document space

• If terms are carefully chosen (say nouns)
  – fixed number of pairs for distance computation
    • independent of corpus size
  – clusters have clean descriptions in terms of noun phrase co-occurrence – easier labeling?
  – left with the problem of binding docs to these clusters

Page 91: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Multi-lingual docs

• E.g., News Explorer, Canadian government docs.
• Every doc in English and an equivalent in French.
  – Must cluster by concepts rather than language
• Simplest: pad docs in one language with dictionary equivalents in the other
  – thus each doc has a representation in both languages
• Axes are terms in both languages

Page 92: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Feature selection

• Which terms to use as axes for the vector space?
• Huge body of (ongoing) research
• IDF is a form of feature selection
  – can exaggerate noise, e.g., mis-spellings
• Pseudo-linguistic heuristics, e.g.,
  – drop stop-words
  – stemming/lemmatization
  – use only nouns/noun phrases
• Good clustering should “figure out” some of these

Page 93: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Clustering to speed up scoring

• From CS276a, recall sampling and pre-grouping
  – Wanted to find, given a query Q, the nearest docs in the corpus
  – Wanted to avoid computing the cosine similarity of Q to each of the n docs in the corpus.

Page 94: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Sampling and pre-grouping

• First run a clustering phase
  – pick a representative leader for each cluster.
• Process a query as follows (sketch below):
  – Given query Q, find its nearest leader L.
  – Seek the nearest docs from L’s followers only
    • avoid cosine similarity to all docs.
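A minimal sketch of the leader/follower scheme under stated assumptions: `labels` and `centroids` come from a prior clustering run (e.g., the k-means sketch earlier), all vectors are unit-normalised so inner products are cosine similarities, each cluster is non-empty, and "leader = doc closest to its centroid" is one plausible choice, not necessarily the lecture's. All names are hypothetical:

```python
import numpy as np

def build_leaders(doc_vecs, labels, centroids):
    """Pre-grouping: the 'leader' of each cluster is the doc closest to the
    cluster centroid; every cluster member is one of its 'followers'."""
    leaders, followers = {}, {}
    for j in range(len(centroids)):
        members = np.where(labels == j)[0]
        sims = doc_vecs[members] @ centroids[j]
        leaders[j] = members[int(np.argmax(sims))]
        followers[j] = members
    return leaders, followers

def query(q, doc_vecs, leaders, followers):
    """Score the query only against the nearest leader's followers,
    instead of against every doc in the corpus."""
    best = max(leaders, key=lambda j: float(q @ doc_vecs[leaders[j]]))
    cands = followers[best]
    return int(cands[int(np.argmax(doc_vecs[cands] @ q))])
```

This is exactly the inexact-but-fast trade-off from the earlier "speeding up vector space retrieval" slide.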

Page 95: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Navigation structure

• Given a corpus, agglomerate into a hierarchy
• Throw away the lower layers so you don’t have n leaf topics each holding a single doc
  – Many principled methods for this pruning, such as MDL.

(Figure: docs d1–d5 agglomerated into {d1,d2}, {d4,d5} and {d3,d4,d5}, with the lowest layers pruned away.)

Page 96: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Navigation structure

• Can also induce the hierarchy top-down – e.g., use k-means, then recurse on the clusters.
  – Need to figure out what k should be at each point.
• Topics induced by clustering need human ratification
  – can override mechanical pruning.
• Need to address issues like partitioning at the top level by language.

Page 97: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Major issue - labeling

• After a clustering algorithm finds clusters – how can they be useful to the end user?
• Need a pithy label for each cluster
  – In search results, say “Football” or “Car” in the jaguar example.
  – In topic trees, need navigational cues.
• Often done by hand, a posteriori.

Page 98: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

How to Label Clusters

• Show titles of typical documents
  – Titles are easy to scan
  – Authors create them for quick scanning!
  – But you can only show a few titles, which may not fully represent the cluster
• Show words/phrases prominent in the cluster
  – More likely to fully represent the cluster
  – Use distinguishing words/phrases
  – But harder to scan

Page 99: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Labeling

• Common heuristic – list the 5–10 most frequent terms in the centroid vector (see the sketch below).
  – Drop stop-words; stem.
• Differential labeling by frequent terms
  – Within the cluster “Computers”, child clusters all have the word computer as a frequent term.
  – Discriminant analysis of sub-tree centroids.
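A minimal sketch of the first heuristic, assuming a centroid over a term vocabulary; the tiny stop-word set and the function name are my own placeholders:

```python
import numpy as np

def label_cluster(centroid, vocabulary, stop_words=frozenset({"the", "of", "and"}), n=5):
    """Label a cluster with the n highest-weighted terms in its centroid vector,
    skipping stop-words (a crude version of the heuristic on the slide)."""
    order = np.argsort(centroid)[::-1]                  # term indices by descending weight
    terms = [vocabulary[i] for i in order if vocabulary[i] not in stop_words]
    return terms[:n]
```

The "differential" variant would instead compare a cluster's centroid against its siblings' and keep the terms that discriminate it.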

Page 100: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

The biggest issue in clustering?

• How do you compare two alternatives?
• Computation (time/space) is only one metric of performance
• How do you look at the “goodness” of the clustering produced by a method?

Page 101: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

READINGS

• Ingersoll / Morton / Farris – chapter 6

Page 102: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Resources

• Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (1992)
  – Cutting / Karger / Pedersen / Tukey
  – http://citeseer.nj.nec.com/cutting92scattergather.html
• Data Clustering: A Review (1999)
  – Jain / Murty / Flynn
  – http://citeseer.nj.nec.com/jain99data.html

Page 103: 807 - TEXT ANALYTICS Massimo Poesio Lecture 2: Text (Document) clustering

Acknowledgments

• Many of these slides were borrowed from
  – Chris Manning’s Stanford module
  – Tan, Steinbach and Kumar, Intro to Data Mining
  – Ray Mooney’s Austin module