Text clustering

Text Clustering (Part 2)

What is clustering?

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Partition unlabeled examples into disjoint subsets (clusters), such that:
• Examples within a cluster are very similar (intra-cluster distances are minimized)
• Examples in different clusters are very different (inter-cluster distances are maximized)

Discover new categories in an unsupervised manner (no sample category labels provided).

• Grouping of text documents into meaningful clusters in an unsupervised manner.

• Cluster Hypothesis : Relevant documents tend to be more similar to each other than to non-relevant ones.

Document Clustering

[Figure: documents grouped into topical clusters labeled Government, Science, and Arts]

Motivation
• Automatically group related documents based on their contents
• No predetermined training sets or taxonomies
• Generate a taxonomy at runtime

Clustering Process
• Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
• Hierarchical clustering: compute similarities, applying clustering algorithms.
• Model-based clustering (neural network approach): clusters are represented by “exemplars” (e.g., SOM).

Document Clustering

Clustering documents by the similarity of their content improves search effectiveness in terms of:

1. Improving Search Recall

When a query matches a document its whole cluster can be returned.

2. Improving Search Precision

Grouping the documents into a much smaller number of groups of related documents

3. Scatter/Gather

Enhance the efficiency of human browsing of a document collection when a specific search query cannot be formulated.

4. Query-Specific Clustering

The most related documents will appear in the small tight clusters, nested inside bigger clusters containing less similar documents.

Overview

Standard (Text) Clustering Methods
• Bisecting k-means
• Agglomerative Hierarchical Clustering

Specialised Text Clustering Methods
• Suffix Tree Clustering
• Frequent-Termset-Based Clustering

Joint Cluster Analysis
• Attribute data: text content
• Relationship data: hyperlinks


Bisecting k-means [Steinbach, Karypis & Kumar 2000]

K-means

Bisecting k-means
• Partition the database into 2 clusters
• Repeat: partition the largest cluster into 2 clusters
• Until k clusters have been discovered
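The bisecting loop above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the slides: the points are distinct 1-dimensional numbers, k is at most the number of points, and the helper names `two_means` and `bisecting_kmeans` are hypothetical.

```python
import random

def two_means(points, rng, iters=20):
    """Split distinct 1-D points into two clusters with basic k-means (k = 2)."""
    c1, c2 = rng.sample(points, 2)          # two distinct seeds
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if abs(p - c1) <= abs(p - c2)]
        right = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not left or not right:           # defensive guard against an empty side
            break
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
    return left, right

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)    # "partition the largest cluster"
        clusters.remove(largest)
        clusters.extend(two_means(largest, rng))
    return clusters

clusters = bisecting_kmeans([1, 2, 3, 10, 11, 12, 20, 21, 22], k=3)
```

Because each bisection is an ordinary k-means run, the result depends on the random seeds; the sequence of splits also yields a hierarchy, of which the returned list is one flat cut.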

[Figure: three scatter plots, each with both axes running from 0 to 10]

Bisecting k-means

Two types of clusterings
• Hierarchical clustering
• Flat clustering: any cut of this hierarchy

Distance Function
• Cosine measure (similarity measure)

$$s(g(c_1), g(c_2)) = \frac{\langle g(c_1), g(c_2) \rangle}{\lVert g(c_1) \rVert \cdot \lVert g(c_2) \rVert}$$

Agglomerative and Hierarchical Clustering

1. Form initial clusters, each consisting of a single object, and compute the distance between each pair of clusters.

2. Merge the two clusters having minimum distance.

3. Calculate the distance between the new cluster and all other clusters.

4. If there is only one cluster containing all objects, stop; otherwise go to step 2.

Representation of a Cluster C

$$rep(C) = \Big( \sum_{d_i \in C} n(d_i, t) \Big)_{t \in V}$$

Experimental Comparison [Steinbach, Karypis & Kumar 2000]

Clustering Quality
• Measured as entropy on a prelabeled test data set
• Using several text and web data sets
• Bisecting k-means outperforms k-means.
• Bisecting k-means outperforms agglomerative hierarchical clustering.

Efficiency
• Bisecting k-means is much more efficient than agglomerative hierarchical clustering: O(n) vs. O(n²)

Suffix Tree Clustering [Zamir & Etzioni 1998]

Forming Clusters
• Not by similar feature vectors
• But by common terms

Strengths of Suffix Tree Clustering (STC)
• Efficiency: runtime O(n) for n text documents
• Overlapping clusters

Method
1. Identification of “basic clusters”
2. Combination of basic clusters

Identification of Basic Clusters

• Basic Cluster: set of documents sharing one specific phrase

• Phrase: multi-word term

• Efficient identification of basic clusters using a suffix-tree

[Figure: suffix tree after insertion of (1) “cat ate cheese”, with leaves labeled 1,1; 1,2; 1,3]

[Figure: suffix tree after insertion of (2) “mouse ate cheese too”, with leaves labeled 1,1; 1,2; 2,2; 1,3; 2,3; 2,1; 2,4]

Combination of Basic Clusters

• Basic clusters are highly overlapping
• Merge basic clusters having too much overlap
• Basic clusters graph: nodes represent basic clusters; there is an edge between A and B iff |A ∩ B| / |A| > 0.5 and |A ∩ B| / |B| > 0.5
• Composite cluster: a connected component of the basic clusters graph
• Drawbacks of this approach:
  – Distant members of the same component need not be similar
  – No evaluation on standard test data
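The merge step can be illustrated with a short Python sketch. It assumes each basic cluster is given as a non-empty set of document ids keyed by its phrase; the function name `merge_basic_clusters` and the toy data are hypothetical and are not taken from the Grouper system.

```python
def merge_basic_clusters(basic_clusters, threshold=0.5):
    """basic_clusters: dict mapping phrase -> non-empty set of document ids.
    Returns the composite clusters (one document set per connected component)."""
    phrases = list(basic_clusters)
    # Build the basic-clusters graph: edge iff overlap exceeds the threshold both ways.
    adjacent = {p: set() for p in phrases}
    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            docs_a, docs_b = basic_clusters[a], basic_clusters[b]
            common = len(docs_a & docs_b)
            if common / len(docs_a) > threshold and common / len(docs_b) > threshold:
                adjacent[a].add(b)
                adjacent[b].add(a)
    # Collect connected components with a depth-first search.
    seen, composites = set(), []
    for start in phrases:
        if start in seen:
            continue
        seen.add(start)
        stack, docs = [start], set()
        while stack:
            current = stack.pop()
            docs |= basic_clusters[current]
            for neighbour in adjacent[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        composites.append(docs)
    return composites

composites = merge_basic_clusters({
    "cat ate": {1, 2, 3},
    "ate cheese": {1, 2, 3, 4},
    "mouse": {5, 6},
})
```

Since a composite cluster is a whole connected component, two basic clusters can end up together through a chain of intermediate overlaps, which is the drawback noted above.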


Example from the Grouper System


Frequent-Term-Based Clustering [Beil, Ester & Xu 2002]

• Frequent term set: description of a cluster
• Set of documents containing all terms of the frequent term set: cluster
• Clustering: subset of the set of all frequent term sets covering the DB with a low mutual overlap

Frequent term sets and the documents they cover:

{}: {D1, . . ., D16}
{sun}: {D1, D2, D4, D5, D6, D8, D9, D10, D11, D13, D15}
{fun}: {D1, D3, D4, D6, D7, D8, D10, D11, D14, D15, D16}
{beach}: {D2, D7, D8, D9, D10, D12, D13, D14, D15}
{sun, fun}: {D1, D4, D6, D8, D10, D11, D15}
{sun, beach}: {D2, D8, D9, D10, D11, D15}
{fun, surf}: . . .
{beach, surf}: . . .
{sun, fun, surf}: {D1, D6, D10, D11}
{sun, beach, fun}: {D8, D10, D11, D15}

Method

• Task: efficient calculation of the overlap of a given cluster (description) F_i with the union of the other cluster descriptions
• F: set of all frequent term sets in D
• f_j: the number of all frequent term sets supported by document D_j

$$f_j = |\{ F_i \in F \mid F_i \subseteq D_j \}|$$

• Standard overlap of a cluster C_i:

$$SO(C_i) = \frac{\sum_{D_j \in C_i} (f_j - 1)}{|C_i|}$$
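As a worked illustration of the standard overlap, the sketch below computes f_j and SO for a toy collection. The data layout (documents as sets of terms, term sets as frozensets) and the name `standard_overlap` are assumptions made for this example.

```python
def standard_overlap(term_set, frequent_term_sets, documents):
    """SO of the cluster described by term_set.
    documents: dict doc_id -> set of terms; the cluster must be non-empty."""
    # The cluster is the set of documents containing every term of term_set.
    cluster = [d for d, terms in documents.items() if term_set <= terms]

    def f(doc_id):
        # f_j: number of frequent term sets supported by document D_j
        return sum(1 for F in frequent_term_sets if F <= documents[doc_id])

    return sum(f(d) - 1 for d in cluster) / len(cluster)

documents = {"D1": {"sun", "fun"}, "D2": {"sun"}, "D3": {"sun", "beach"}}
frequent = [frozenset({"sun"}), frozenset({"fun"}), frozenset({"beach"})]
so_sun = standard_overlap(frozenset({"sun"}), frequent, documents)  # (1 + 0 + 1) / 3
so_fun = standard_overlap(frozenset({"fun"}), frequent, documents)  # 1 / 1
```

A value of 0 means that every document in the cluster supports no other frequent term set, so the cluster does not overlap with any other candidate.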


Algorithm FTC

FTC(database D, float minsup)
  SelectedTermSets := {};
  n := |D|;
  RemainingTermSets := DetermineFrequentTermsets(D, minsup);
  while |cov(SelectedTermSets)| ≠ n do
    for each set in RemainingTermSets do
      Calculate overlap for set;
    BestCandidate := element of RemainingTermSets with minimum overlap;
    SelectedTermSets := SelectedTermSets ∪ {BestCandidate};
    RemainingTermSets := RemainingTermSets - {BestCandidate};
    Remove all documents in cov(BestCandidate) from D and from the coverage of all of the RemainingTermSets;
  return SelectedTermSets;
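A runnable approximation of the FTC pseudocode is sketched below. Several details are assumptions rather than part of the original algorithm description: frequent term sets are found by naive enumeration up to `max_size` terms (not by an Apriori-style miner), ties in overlap are broken alphabetically, and the loop also stops when no candidate term sets remain so that it cannot run forever on documents that no frequent term set covers.

```python
from itertools import combinations

def frequent_term_sets(docs, minsup, max_size=2):
    """Naively enumerate term sets supported by at least minsup * |docs| documents.
    docs: dict doc_id -> set of terms. Returns dict term_set -> covered doc ids."""
    vocab = sorted(set().union(*docs.values()))
    min_count = minsup * len(docs)
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            cov = {d for d, terms in docs.items() if set(combo) <= terms}
            if cov and len(cov) >= min_count:
                result[frozenset(combo)] = cov
    return result

def ftc(docs, minsup, max_size=2):
    """Greedy selection of frequent term sets with minimum standard overlap."""
    remaining = frequent_term_sets(docs, minsup, max_size)
    selected, uncovered = [], set(docs)
    while uncovered and remaining:
        # f[d]: number of remaining term sets that still cover document d
        f = {d: sum(1 for cov in remaining.values() if d in cov) for d in uncovered}

        def overlap(cov):
            return sum(f[d] - 1 for d in cov) / len(cov)

        best = min(remaining, key=lambda ts: (overlap(remaining[ts]), sorted(ts)))
        best_cov = remaining.pop(best)
        selected.append((best, best_cov))
        uncovered -= best_cov
        # Remove the covered documents from the coverage of the remaining sets.
        remaining = {ts: cov - best_cov for ts, cov in remaining.items() if cov - best_cov}
    return selected

docs = {1: {"sun", "fun"}, 2: {"sun", "fun"}, 3: {"beach"}, 4: {"beach", "surf"}}
clustering = ftc(docs, minsup=0.5)
```

Each selected term set doubles as a human-readable description of its cluster, which is the main appeal of the frequent-term-based approach.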


Joint Cluster Analysis [Ester, Ge, Gao et al 2006]

Attribute data: intrinsic properties of entities

Relationship data: extrinsic properties of entities

Existing clustering algorithms use either attribute or relationship data

Often: attribute and relationship data are somewhat related, but contain complementary information

Joint cluster analysis of attribute and relationship data

[Figure: informative graph; edges = relationships, 2D location = attributes]

The Connected k-Center Problem

Given:
• Dataset D as informative graph
• k: number of clusters

Connectivity constraint: clusters need to be connected graph components

Objective function: maximum (attribute) distance (radius) between nodes and cluster center

Task: find k centers (nodes) and a corresponding partitioning of D into k clusters such that
• each cluster satisfies the connectivity constraint and
• the objective function is minimized

The Hardness of the Connected k-Center Problem

• Given k centers and a radius threshold r, finding the optimal assignment for the remaining nodes is NP-hard
• Because of articulation nodes (e.g., a): the assignment of a implies the assignment of b

[Figure: clusters C1 and C2 with articulation node a and node b; assignment of a to C1 minimizes the radius]

Example: CS Publications

1073: Kempe, Dobra, Gehrke: “Gossip-based computation of aggregate information”, FOCS'03
1483: Fagin, Kleinberg, Raghavan: “Query strategies for priced information”, STOC'02

True cluster labels based on conference

Application for Web Pages

• Attribute data

text content

• Relationship data

hyperlinks

• Connectivity constraint

clusters must be connected

• New challenges

clusters can be overlapping

not all web pages must belong to a cluster (noise)

. . .

The General Clustering Problem

A clustering task may include the following components

(Jain, Murty, and Flynn 1999):

• Problem representation, including feature extraction, selection, or both,

• Definition of proximity measure suitable to the domain,

• Actual clustering of objects,

• Data abstraction, and

• Evaluation.


The Vector-Space Model

• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.

• These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|

• Each term, i, in a document or query, j, is given a real-valued weight, wij.

• Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)

• A new document is assigned to the most likely category based on vector similarity.

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the three-dimensional space spanned by T1, T2, and T3]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?

Document Collection

• A collection of n documents can be represented in the vector space model by a term-document matrix.

• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn

Term Weights: Term Frequency

• More frequent terms in a document are more important, i.e. more indicative of the topic.

fij = frequency of term i in document j

• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:

tfij = fij / maxi{fij}

Term Weights: Inverse Document Frequency

• Terms that appear in many different documents are less indicative of overall topic.

df i = document frequency of term i = number of documents containing term i

idfi = inverse document frequency of term i = log2 (N / df i)

(N: total number of documents)

• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.

TF-IDF Weighting

• A typical combined term importance indicator is tf-idf weighting:

wij = tfij · idfi = tfij · log2 (N / dfi)

• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

• Many other ways of determining term weights have been proposed.

• Experimentally, tf-idf has been found to work well.

Computing TF-IDF -- An Example

• Given a document containing terms with given frequencies:

A(3), B(2), C(1)

• Assume collection contains 10,000 documents and

document frequencies of these terms are:

A(50), B(1300), C(250)

Then:

A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6

B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0

C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
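The arithmetic of this example can be checked with a few lines of Python. The function name `tf_idf` and the dictionary layout are illustrative assumptions; the weighting itself follows the max-normalized tf and log2 idf defined above.

```python
import math

def tf_idf(term_counts, doc_freqs, n_docs):
    """tf is normalized by the most frequent term; idf = log2(N / df)."""
    max_count = max(term_counts.values())
    return {
        term: (count / max_count) * math.log2(n_docs / doc_freqs[term])
        for term, count in term_counts.items()
    }

weights = tf_idf(
    term_counts={"A": 3, "B": 2, "C": 1},
    doc_freqs={"A": 50, "B": 1300, "C": 250},
    n_docs=10_000,
)
rounded = {term: round(w, 1) for term, w in weights.items()}
```

Rounded to one decimal place, the weights agree with the values on the slide: 7.6 for A, 2.0 for B, and 1.8 for C.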


Query Vector

• Query vector is typically treated as a document and also tf-idf

weighted.

• Alternative is for the user to supply weights for the given query

terms.


Similarity Measure

• A similarity measure is a function that computes the degree of similarity

between two vectors.

• Using a similarity measure between the query and each document:

– It is possible to rank the retrieved documents in the order of presumed

relevance.

– It is possible to enforce a certain threshold so that the size of the

retrieved set can be controlled.


Similarity Measure - Inner Product

• Similarity between vectors for the document dj and query q can be computed as the vector inner product (a.k.a. dot product):

$$sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} w_{iq}$$

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).

• For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Properties of Inner Product

• The inner product is unbounded.

• Favors long documents with a large number of unique terms.

• Measures how many terms matched but not how many terms are not

matched.


Inner Product -- Examples

Binary:
• Vocabulary: retrieval, database, architecture, computer, text, management, information
• D = 1, 1, 1, 0, 1, 1, 0
• Q = 1, 0, 1, 0, 0, 1, 1
• sim(D, Q) = 3
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
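Both examples can be reproduced with a one-line inner product. The sketch assumes the vectors are plain Python lists of equal length; `inner_product` is just an illustrative name.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(w_d * w_q for w_d, w_q in zip(d, q))

binary_sim = inner_product([1, 1, 1, 0, 1, 1, 0], [1, 0, 1, 0, 0, 1, 1])
sim_d1 = inner_product([2, 3, 5], [0, 0, 2])
sim_d2 = inner_product([3, 7, 1], [0, 0, 2])
```

For the binary vectors the result is the number of shared terms (3); for the weighted vectors it gives 10 and 2, as computed above.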

Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.

• Inner product normalized by the vector lengths.

$$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the space spanned by t1, t2, and t3, with the angles between Q and each document]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
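The two cosine values can be verified with a small Python function. Returning 0.0 for an all-zero vector is a convention chosen for this sketch, not something stated on the slide.

```python
import math

def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

cos_d1 = cos_sim([2, 3, 5], [0, 0, 2])
cos_d2 = cos_sim([3, 7, 1], [0, 0, 2])
```

Rounded to two decimals these are 0.81 and 0.13, a ratio of roughly 6 to 1, compared with 5 to 1 for the unnormalized inner product.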

Naïve Implementation

• Convert all documents in collection D to tf-idf weighted vectors, dj, for

keyword vocabulary V.

• Convert query to a tf-idf-weighted vector q.

• For each dj in D do

Compute score sj = cosSim(dj, q)

• Sort documents by decreasing score.

• Present top ranked documents to the user.

Time complexity: O(|V|·|D|). Bad for large V and D!

|V| = 10,000; |D| = 100,000; |V|·|D| = 1,000,000,000
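The naïve procedure amounts to scoring every document against the query and sorting. The sketch below assumes the tf-idf vectors have already been computed and are stored as dense lists in a dictionary; the names `rank` and `cos_sim` are illustrative.

```python
import math

def cos_sim(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def rank(docs, query, top_n=None):
    """Score every document (O(|V|) each), then sort by decreasing score."""
    scored = sorted(
        ((cos_sim(vec, query), name) for name, vec in docs.items()),
        reverse=True,
    )
    return scored[:top_n]

ranking = rank({"D1": [2, 3, 5], "D2": [3, 7, 1]}, query=[0, 0, 2])
```

Every document vector is touched in full, which is where the O(|V|·|D|) cost comes from; the sort adds a further O(|D| log |D|) term.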


Comments on Vector Space Models

• Simple, mathematically based approach.

• Considers both local (tf) and global (idf) word occurrence frequencies.

• Provides partial matching and ranked results.

• Tends to work quite well in practice despite obvious weaknesses.

• Allows efficient implementation for large document collections.


Problems with Vector Space Model

• Missing semantic information (e.g. word sense).

• Missing syntactic information (e.g. phrase structure, word order, proximity

information).

• Assumption of term independence (e.g. ignores synonymy).

• Lacks the control of a Boolean model (e.g., requiring a term to appear in a

document).

–Given a two-term query “A B”, may prefer a document containing A

frequently but not B, over a document that contains both A and B, but

both less frequently.


Clustering Algorithms

Problem variant

• A flat (or partitional) clustering produces a single partition of a set of

objects into disjoint groups.

• A hierarchical clustering results in a nested series of partitions.

Distance-based Clustering Algorithms

• Agglomerative and Hierarchical Clustering Algorithms

• Distance-based Partitioning Algorithms (k-Means)

• A Hybrid Approach: The Scatter-Gather Method (Buckshot,

Fractionization)


Hard/Soft Clustering

Hard Clustering
• Every object belongs to exactly one cluster.

Soft Clustering
• The membership is fuzzy: objects may belong to several clusters with a fractional degree of membership in each.


The most commonly used algorithms are

The K-means (hard, flat, shuffling),

The EM-based mixture resolving (soft, flat, probabilistic), and

The HAC (hierarchical, agglomerative).



Document hierarchical clustering

• Bottom-up, agglomerative

• Top-down, divisive

Document partitioning (flat clustering)

• K-means

• Probabilistic clustering using the Naïve Bayes or Gaussian mixture

model, etc.

Document clustering based on graph model


K-Means Algorithm

Partitions a collection of vectors {x1, x2, . . ., xn} into a set of clusters {C1, C2, . . ., Ck}. The algorithm needs k cluster seeds for initialization and proceeds as follows:

Initialization:
• k seeds, either given or selected randomly, form the core of k clusters.
• Every other vector is assigned to the cluster of the closest seed.

Iteration:
• The centroids Mi of the current clusters are computed:

$$M_i = |C_i|^{-1} \sum_{x \in C_i} x$$

• Each vector is reassigned to the cluster with the closest centroid.

Stopping condition:
• At convergence, when no more changes occur.

The K-means algorithm maximizes the clustering quality function Q:

$$Q(C_1, C_2, \ldots, C_k) = \sum_{i=1}^{k} \sum_{x \in C_i} Sim(x - M_i)$$

K-Means

Works when we know k, the number of clusters.

Idea:
• Randomly pick k points as the “centroids” of the k clusters
• Loop:
  – For each point, add it to the cluster with the nearest centroid
  – Recompute the cluster centroids
  – Repeat loop (until no change)

Iterative improvement of the objective function: the sum of the squared distances from each point to the centroid of its cluster.

Finding the global optimum is NP-hard.

The k-means algorithm is guaranteed to converge to a local optimum.

K-means Example

For simplicity, 1-dimensional objects and k=2.
• Numerical difference is used as the distance

Objects: 1, 2, 5, 6, 7

K-means:

Randomly select 5 and 6 as centroids;

=> Two clusters {1,2,5} and {6,7}; meanC1=8/3, meanC2=6.5

=> {1,2}, {5,6,7}; meanC1=1.5, meanC2=6

=> no change.

Aggregate dissimilarity (sum of the squared distances of each point from its cluster center, i.e. the intra-cluster distance)

= |1-1.5|² + |2-1.5|² + |5-6|² + |6-6|² + |7-6|²
= 0.5² + 0.5² + 1² + 0² + 1² = 2.5
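The trace above can be replayed with a minimal 1-dimensional k-means in Python. The function name `kmeans_1d` is illustrative, and keeping the old centroid when a cluster becomes empty is a choice made for this sketch only.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """k-means on numbers, using absolute difference as the distance."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:   # no change: converged
            break
        centroids = new_centroids
    # Aggregate dissimilarity: sum of squared distances to the cluster centers.
    sse = sum((p - centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    return clusters, centroids, sse

clusters, centroids, sse = kmeans_1d([1, 2, 5, 6, 7], centroids=[5, 6])
```

Starting from the seeds 5 and 6, the run passes through {1,2,5} and {6,7} and settles on {1,2} and {5,6,7} with centroids 1.5 and 6 and an aggregate dissimilarity of 2.5.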


K Means Example (K=2)

[Figure: sequence of scatter plots showing the steps: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]

Time Complexity

• Assume computing distance between two instances is O(m), where m is the dimensionality of the vectors.

• Reassigning clusters: O(kn) distance computations, or O(knm).

• Computing centroids: each instance vector gets added once to some centroid: O(nm).

• Assume these two steps are each done once for I iterations: O(Iknm).

• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.

Problems with K-means

Need to know k in advance
• Could try out several k?
  – Cluster tightness increases with increasing K.
  – Look for a kink in the tightness vs. K curve

Tends to go to local minima that are sensitive to the starting centroids
• Try out multiple starting points

Disjoint and exhaustive
• Doesn’t have a notion of “outliers”
  – The outlier problem can be handled by K-medoid or neighborhood-based algorithms

Assumes clusters are spherical in vector space
• Sensitive to coordinate changes, weighting etc.

EM-based Probabilistic Clustering Algorithm

Mixture-resolving Algorithms: Underlying Assumption
• The objects to be clustered are drawn from k distributions, and
• The goal is to identify the parameters of each that would allow the calculation of the probability P(Ci | x) of the given object’s belonging to the cluster Ci

The Expectation Maximization (EM)
• A general-purpose framework for estimating the parameters of distributions in the presence of hidden variables in observable data.
• Probabilistic method for soft clustering.
• Direct method that assumes k clusters {c1, c2, …, ck}; a soft version of k-means.
• For text, typically assume a naïve-Bayes category model.
  – Parameters = {P(ci), P(wj | ci): i ∈ {1, …, k}, j ∈ {1, …, |V|}}

EM Algorithm

Initialization:

The initial parameters of k distributions are selected either randomly or externally.

Iteration:

E-Step: Compute P(Ci | x) for all objects x by using the current parameters of the distributions. Re-label all objects according to the computed probabilities.

M-Step: Re-estimate the parameters of the distributions to maximize the likelihood of the objects, assuming their current labeling.

Stopping condition:

At convergence - when the change in log-likelihood after each iteration becomes small.
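The E-step and M-step can be made concrete with a small soft-clustering example. To keep it short, the sketch swaps the naïve-Bayes text model for a mixture of 1-dimensional Gaussians and runs a fixed number of iterations instead of monitoring the log-likelihood; both choices, and the names `em_1d` and `gaussian`, are assumptions of this example.

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_1d(xs, initial_means, iters=50, min_var=1e-3):
    """EM for a mixture of 1-D Gaussians; returns the means and soft labels."""
    means = list(initial_means)
    k = len(means)
    priors = [1.0 / k] * k
    variances = [1.0] * k
    resp = []
    for _ in range(iters):
        # E-step: P(C_i | x) for every object x under the current parameters.
        resp = []
        for x in xs:
            joint = [priors[i] * gaussian(x, means[i], variances[i]) for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the soft labels.
        for i in range(k):
            weight = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / weight
            spread = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / weight
            variances[i] = max(min_var, spread)   # floor keeps the density finite
            priors[i] = weight / len(xs)
    return means, resp

means, resp = em_1d([1, 2, 5, 6, 7], initial_means=[1.0, 7.0])
hard_labels = [max(range(2), key=lambda i: r[i]) for r in resp]
```

On these five points the component means end up close to 1.5 and 6, matching the k-means result, but each object keeps a fractional membership in both clusters.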

EM

[Figure: five-panel illustration of EM on unlabeled examples]
• Initialize: assign random probabilistic labels to the unlabeled data.
• Initialize: give the soft-labeled training data to a probabilistic learner.
• Initialize: produce a probabilistic classifier.
• E step: relabel the unlabeled data using the trained classifier.
• M step: retrain the classifier on the relabeled data.

Continue EM iterations until probabilistic labels on unlabeled data converge.

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

HAC Algorithm

Initialization:

Every object is put into a separate cluster.

Iteration:

Find the pair of most similar clusters and merge them.

Stopping condition:

When everything is merged into a single cluster.


Cluster Similarity

Assume a similarity function that determines the similarity of two instances:

sim(x,y).

• Cosine similarity of document vectors.

How to compute similarity of two clusters each possibly containing multiple

instances?

• Single Link: Similarity of two most similar members.

• Complete Link: Similarity of two least similar members.

• Group Average: Average similarity between members.

Clustering functions - Merging criteria

Extending the distance measure from samples to sets of samples

[Figure: three panels illustrating the merging criteria: similarity of the 2 most similar members; similarity of the 2 least similar members; average similarity between members]

Single Link Agglomerative Clustering

Use maximum similarity of pairs:

$$sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y)$$

Can result in “straggly” (long and thin) clusters due to chaining effect.
• Appropriate in some domains, such as clustering islands.


Single Link Example

Complete Link Agglomerative Clustering

Use minimum similarity of pairs:

$$sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y)$$

Makes more “tight,” spherical clusters that are typically preferable.

Complete Link Example


Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).

• In each of the subsequent n−2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.

• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.

Computing Cluster Similarity

After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:

• Single Link:

$$sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k))$$

• Complete Link:

$$sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k))$$

Group Average Agglomerative Clustering

• Use average similarity across all pairs within the merged cluster to measure the similarity of two clusters.

• Compromise between single and complete link.

• Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters (to encourage tighter final clusters).

$$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j| \, (|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} sim(x, y)$$
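The three merging criteria can be compared side by side in a deliberately naive HAC sketch. It recomputes every cluster similarity from scratch in each iteration, so it does not achieve the O(n²) bound discussed earlier; the function name `hac`, the `target` parameter, and the use of negative distance as the similarity are assumptions made for this example.

```python
def hac(items, sim, linkage="single", target=1):
    """Naive agglomerative clustering; stops when `target` clusters remain.
    linkage: "single", "complete", or "average" (group average)."""
    clusters = [[x] for x in items]
    merges = []

    def cluster_sim(a, b):
        if linkage == "average":
            merged = a + b
            n = len(merged)
            # Average over all ordered pairs of distinct members of the merged cluster.
            total = sum(
                sim(x, y)
                for i, x in enumerate(merged)
                for j, y in enumerate(merged)
                if i != j
            )
            return total / (n * (n - 1))
        pair_sims = [sim(x, y) for x in a for y in b]
        return max(pair_sims) if linkage == "single" else min(pair_sims)

    while len(clusters) > target:
        # Find the most similar pair of clusters and merge it.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

def similarity(x, y):
    return -abs(x - y)   # closer numbers are more similar

flat, history = hac([1, 2, 5, 6, 7], similarity, linkage="average", target=2)
```

The list of merges records the binary tree built by the algorithm; stopping at `target` clusters corresponds to cutting that hierarchy to obtain a flat clustering.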

Other Clustering Algorithms

Minimal Spanning Tree (MST)
• Single-link clusters are sub-graphs of the MST, which are also the connected components
• Complete-link clusters are the maximal complete sub-graphs

The Nearest Neighbor Clustering
• Assigns each object to the cluster of its nearest labeled neighbor object, provided the similarity to that neighbor is sufficiently high.
• The process continues until all objects are labeled.

The Buckshot Algorithm
• Combines HAC and K-Means clustering.
• First randomly take a sample of instances of size √n
• Run group-average HAC on this sample, which takes only O(n) time.
• Use the results of HAC as initial seeds for K-means.
• Overall algorithm is O(n) and avoids problems of bad seed selection.

Word and Phrase-based Clustering

• Text documents come from a high-dimensional domain; important clusters of words may be found and used to find clusters of documents.

Example:

In a corpus containing d terms and n documents, view the term-document matrix as an n × d matrix, in which the (i, j)th entry is the frequency of the jth term in the ith document.

• Clustering the rows of this matrix is clustering documents; clustering the columns is clustering words.

• The most general technique for simultaneous word and document clustering is referred to as co-clustering.

• The problem of word clustering is related to dimensionality reduction, whereas document clustering is related to traditional clustering.

• A probabilistic framework that determines word clusters and document clusters simultaneously is referred to as topic modeling.

Approaches

• Clustering with Frequent Word Patterns

• Leveraging Word Clusters for Document Clusters

• Co-clustering Words and Documents

• Clustering with Frequent Phrases


Clustering with Frequent Word Patterns

• A frequent item set in the context of text data is referred to as a frequent term set.

• The clustering is defined on the basis of the overlaps between the supporting documents of the different frequent term sets.

Idea:

Use the low-dimensional frequent term sets as cluster candidates.

Flat Clustering:

The documents covered by the selected frequent term set are removed from the database, and the overlap in the next iteration is computed with respect to the remaining documents.

Hierarchical Clustering:

Each level of the clustering is applied to a set of term sets containing a fixed number k of terms.

Leveraging Word Clusters for Document Clusters

A two phase clustering procedure, to perform document clustering:

First Phase
• Determine word-clusters from the documents in such a way that most of the mutual information between words and documents is preserved when representing the documents in terms of word clusters rather than words.

Second Phase
• Use the condensed representation of the documents in terms of word-clusters in order to perform the final document clustering.
• Specifically, replace the word occurrences in documents with word-cluster occurrences in order to perform the document clustering.

Advantage of two-phase procedure
• Significant reduction in the noise in the representation.

Co-clustering Words and Documents

• Co-clustering is defined as a pair of maps from rows to row-cluster indices and columns to column-cluster indices.

• These maps are determined simultaneously by the algorithm in order to optimize the corresponding cluster representations.

• Matrix Factorization Approach
  – Discovers word clusters and document clusters simultaneously.
  – Transform knowledge from word space to document space in the context of document clustering techniques.

• Co-clustering is a form of local feature selection, in which the features selected are specific to each cluster.

• Methods for document co-clustering
  – Co-clustering with graph partitioning
  – Information-Theoretic Co-clustering

Clustering with Frequent Phrases

Phrase cluster
• A phrase cluster is a phrase that is shared by at least two documents, and the group of documents that contain the phrase.
• Each document is treated as a string of words, rather than characters.
• It uses an indexing method in order to organize the phrases in the document collection, and then uses this organization to create the clusters.

Steps to create the clusters

Step 1:
• Perform the cleaning of the strings representing the documents.
• A light stemming algorithm is used by deleting word prefixes and suffixes and reducing plural to singular.
• Sentence boundaries are marked and non-word tokens are stripped.

Step 2:
• Identification of base clusters.
• These are defined by the frequent phrases in the collection, which are represented in the form of a suffix tree.

Step 3:
• An important characteristic of the base clusters created by the suffix tree is that they do not define a strict partitioning and have overlaps with one another.
• The algorithm merges the clusters based on the similarity of their underlying document sets.

Let P and Q be the document sets corresponding to two clusters. The base similarity BS(P,Q) is defined as follows:

$$BS(P, Q) = \left\lfloor \frac{|P \cap Q|}{\max\{|P|, |Q|\}} + 0.5 \right\rfloor$$
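Read this way, the base similarity is a 0/1 indicator: it equals 1 exactly when the shared documents make up at least half of the larger set. A minimal Python check, assuming P and Q are non-empty sets of document ids and using the illustrative name `base_similarity`:

```python
import math

def base_similarity(p, q):
    """floor(|P ∩ Q| / max(|P|, |Q|) + 0.5) for non-empty document sets."""
    return math.floor(len(p & q) / max(len(p), len(q)) + 0.5)

high_overlap = base_similarity({1, 2, 3}, {2, 3, 4})        # 2/3 + 0.5 rounds down to 1
low_overlap = base_similarity({1, 2, 3, 4}, {4, 5, 6, 7})   # 1/4 + 0.5 rounds down to 0
```

Base clusters with similarity 1 are candidates for merging, while pairs with similarity 0 are kept apart.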

Benefits of Word Clustering

Useful Semantic word clustering

• Automatically generates a “Thesaurus”

Higher classification accuracy

Smaller classification models

• size reductions as dramatic as 50,000 → 50

Semi-supervised clustering: problem definition

Input:
• A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
• A small amount of domain knowledge

Output:
• A partitioning of the objects into k clusters (possibly with some discarded as outliers)

Objective:
• Maximum intra-cluster similarity
• Minimum inter-cluster similarity
• High consistency between the partitioning and the domain knowledge

Why semi-supervised clustering?

Why not clustering?
• The clusters produced may not be the ones required.
• Sometimes there are multiple possible groupings.

Why not classification?
• Sometimes there are insufficient labeled data.

Potential applications
• Bioinformatics (gene and protein clustering)
• Document hierarchy construction
• News/email categorization
• Image categorization

Semi-Supervised Clustering

Domain knowledge
• Partial label information is given
• Apply some constraints (must-links and cannot-links)

Approaches
• Search-based Semi-Supervised Clustering
  – Alter the clustering algorithm using the constraints
• Similarity-based Semi-Supervised Clustering
  – Alter the similarity measure based on the constraints
• Combination of both

Search-Based Semi-Supervised Clustering

Alter the clustering algorithm that searches for a good partitioning by:

• Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].

• Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].

• Using the labeled data to initialize clusters in an iterative refinement algorithm (k-Means) [Basu:ICML02].

Similarity-based semi-supervised clustering

• Alter the similarity measure based on the constraints

• Two types of constraints: Must-links and Cannot-links

• Clustering algorithm: Hierarchical clustering


Transfer Learning

Transfer Learning (TL): the ability of a system to recognize and apply knowledge and skills learned in previous tasks to novel tasks (in new domains).

It is motivated by human learning. People can often transfer knowledge learnt previously to novel situations:
• Chess → Checkers
• Mathematics → Computer Science
• Table Tennis → Tennis

TL addresses the fundamental problem of different distributions between the training and testing data.

Traditional ML vs. TL

[Figure: traditional ML in multiple domains (separate training items and test items per domain) versus transfer of learning across domains]

Humans can learn in many domains. Humans can also transfer from one domain to other domains.

[Figure: learning process of traditional ML (each learning system is trained on its own training items) versus learning process of transfer learning (knowledge from one learning system is passed to another)]

Why Transfer Learning?

• In some domains, labeled data are in short supply.

• In some domains, the calibration effort is very expensive.

• In some domains, the learning process is time consuming


Transfer learning techniques may help!

How to extract knowledge learnt from related domains to help learning in a target domain with a few labeled data?

How to extract knowledge learnt from related domains to speed up learning in a target domain?

Settings of Transfer Learning

Transfer learning settings        Labeled data in      Labeled data in      Tasks
                                  a source domain      a target domain
Inductive Transfer Learning       ×                    √                    Classification, Regression, …
                                  √                    √
Transductive Transfer Learning    √                    ×                    Classification, Regression, …
Unsupervised Transfer Learning    ×                    ×                    Clustering, …

Approaches to Transfer Learning

Transfer learning approaches and their descriptions:

• Instance-transfer: re-weight some labeled data in a source domain for use in the target domain
• Feature-representation-transfer: find a “good” feature representation that reduces the difference between a source and a target domain or minimizes the error of models
• Model-transfer: discover shared parameters or priors of models between a source domain and a target domain
• Relational-knowledge-transfer: build a mapping of relational knowledge between a source domain and a target domain

Approaches to Transfer Learning

                                   Inductive    Transductive    Unsupervised
                                   Transfer     Transfer        Transfer
                                   Learning     Learning        Learning
Instance-transfer                  √            √
Feature-representation-transfer    √            √               √
Model-transfer                     √
Relational-knowledge-transfer      √

Unsupervised Transfer Learning

Feature-representation-transfer Approaches

Self-taught Clustering (STC) [Dai et al. ICML-08]

Input: A lot of unlabeled data in a source domain and a small amount of unlabeled data in a target domain.

Goal: Clustering the target domain data.

Assumption: The source domain and target domain data share some common features, which can help clustering in the target domain.

Main Idea: To extend the information-theoretic co-clustering algorithm [Dhillon et al. KDD-03] for transfer learning.