48
SFU, CMPT 741, Fall 2009, Martin Ester 1 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification [Han & Kamber 2006, Sections 10.4 and 10.5] References to research papers at the end of this chapter Mining Text and Web Data

SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

  • View
    215

  • Download
    1

Embed Size (px)

Citation preview

Page 1: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 1

Contents of this Chapter

Introduction

Data Preprocessing

Text and Web Clustering

Text and Web Classification

[Han & Kamber 2006, Sections 10.4 and 10.5]

References to research papers at the end of this chapter

Mining Text and Web Data

Page 2: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 2

The Web and Web Search

Web Server Crawler

ClusteringClassification Indexer

Storage Server

Inverted IndexTopic Hierarchy

SearchQuery

Business

Root

News Science

Computers AutomobilesPlants Animals

The jaguar, a cat, can run at speeds reaching 50 mph

The jaguar has a 4 liter engine

engine jaguar cat

jaguar

Repository

Documents in repository

Page 3: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 3

Web Search Engines

Keyword Search

““data mining”data mining”

519,000 results519,000 results

Page 4: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 4

Web Search Engines

Boolean search• care AND NOT old

Phrases and proximity• “new care”• loss NEAR/5 care• <SENTENCE>

Indexing

Inverted Lists

How to rank the result lists?

My0 care1 is loss of carewith old care done

Your care is gain ofcare with new care won

D1

D2

care D1: 1, 5, 8

D2: 1, 5, 8

new D2: 7

old D1: 7

loss D1: 3

Page 5: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 5

Ranking Web Pages

Page Rank [Brin & Page 98]

• Idea: the more high ranked pages link to a web page, the higher its rank.

• Interpretation by random walk: PageRank is the probability that a “random surfer” visits a page

» Parameter p is probability that the surfer gets bored and starts on a new random page.» (1-p) is the probability that the random surfer follows a link on current page.

• PageRanks correspond to principal eigenvector of the normalized link matrix.• Can be calculated using an efficient iterative algorithm.

vu uOutDegree

uPageRankppvPageRank

)(

)()1()(

Page 6: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 6

Ranking Web Pages

Hyperlink-Induced Topic Search (HITS) [Kleinberg 98]

Definitions• Authorities: highly-referenced pages on a topic.

• Hubs: pages that “point” to authorities.

• A good authority is pointed to by many good hubs; a good hub points to many good authorities.

Hubs

Authorities

Page 7: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 7

Ranking Web Pages

HITS

Method• Collect seed set of pages S (e.g., returned by search engine).

• Expand seed set to contain pages that point to or are pointed to by pages inseed set.

• Initialize all hub/authority weights to 1.

• Iteratively update hub weight h(p) and authority weight a(p) for each page:

• Stop, when hub/authority weights converge.

qppq

qaphqhpa )()()()(

Page 8: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 8

Ranking Web Pages

Comparison

• PageRanks computed initially for web pages independent of search query.

• PageRank is used by web search engines such as Google.

• HITS: Hub and authority weights computed for different root sets in the context of a particular search query.

• HITS is applicable, e.g., for ranking the crawl front of a focused web crawler.

• Can use the relevance of a web page for the given topic to weight the edges of the web (sub) graph.

Page 9: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 9

Directory Services

Browsing

““data mining”data mining”

~ 200 results~ 200 results

Page 10: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 10

Directory Services

Topic Hierarchy• Provides a hierarchical classification of documents / web pages (e.g., Yahoo!)

• Topic hierarchy can be browsed interactively to find relevant web pages.

• Keyword searches performed in the context of a topic: return only a subset of

web pages related to the topic.

Recreation Science Business News

Yahoo home page

SportsTravel Companies Finance Jobs

Page 11: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 11

Mining Text and Web Data

Shortcomings of the Current Web Search Methods

Low precision• Thousands of irrelevant documents returned by web search engine

99% of information of no interest to 99% of people.

Low recall• In particular, for directory services (due to manual acquisition).

• Even largest crawlers cover less than 50% of all web pages.

Low quality• Many results are out-dated, broken links etc.

Page 12: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 12

Mining Text and Web Data

Data Mining Tasks

• Classification of text documents / web pages

To insert into topic hierarchy, as feedback for focused web crawlers, . . .

• Clustering of text documents / web pages

To create topic hierarchy, as front-end for web search engines . . .

• Resource discovery

Discovery of relevant web pages using a focused web crawler,

in particular ranking of the links at the crawl front

Page 13: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 13

Mining Text and Web Data

Challenges• Ambiguity of natural language

synonyms, homonyms, . . .

• Different natural languages

English, Chinese, . . .• Very high dimensionality of the feature spaces

thousands of relevant terms• Multiple datatypes

text, images, videos, . . .

• Web is extremely dynamic

> 1 million pages added each day

• Web is a very large distributed database

huge number of objects, very high access costs

Page 14: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 14

Text Representation

Preprocessing

• Remove HTML tags, punctuation etc.

• Define terms (single-word / multi-word terms)

• Remove stopwords

• Perform stemming

• Count term frequencies

• Some words are more important than others

smooth the frequencies,

e.g. weight by inverse document frequency

Page 15: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 15

Transformation

• Different definitions of inverse document frequency

n(d,t): number of occurrences of term t in document d

• Select “significant” subset of all occurring terms

• Vocabulary V, term ti, document d represented as

Vector Space Model

Most n’s are zeroes for a single document

Vti itdndrep ),()(

),(max

),(,

),(

),(,

),(

),(

tdn

tdn

tdn

tdn

tdn

tdn

ttd

Text Representation

Page 16: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 16

Similarity Function

)()(

)(),(),(

21

2121

drepdrep

drepdrepddsimilarity

productinner ,

documentdata

miningsimilar

dissimilar

Cosine Similarity

Text Representation

Page 17: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 17

Semantic Similarity

Where can I fix my scooter?

A great garage to repair your 2-wheeler is …

A scooter is a 2-wheeler. Fix and repair are synonyms.

Two basic approaches for semantic similarity

• Hand-made thesaurus (e.g., WordNet)

• Co-occurrence and associations … car …

… auto …

… auto …car… car … auto… auto …car

… car … auto… auto …car… car … auto

car auto

Text Representation

Page 18: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 18

Text and Web Clustering

Overview

Standard (Text) Clustering Methods Bisecting k-means Agglomerative Hierarchical Clustering

Specialised Text Clustering Methods Suffix Tree Clustering Frequent-Termset-Based Clustering

Joint Cluster Analysis• Attribute data: text content• Relationship data: hyperlinks

Page 19: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 19

Text and Web Clustering

Bisecting k-means [Steinbach, Karypis & Kumar 2000]

K-means

Bisecting k-means• Partition the database into 2 clusters• Repeat: partition the largest cluster into 2 clusters . . .• Until k clusters have been discovered

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

Page 20: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 20

Text and Web Clustering

Bisecting k-means

Two types of clusterings• Hierarchical clustering• Flat clustering: any cut of this hierarchy

Distance Function• Cosine measure (similarity measure)

))(())((

))(()),((),(

cgcg

cgcgs

Page 21: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 21

Text and Web Clustering

Agglomerative Hierarchical Clustering

1. Form initial clusters consisting of a singleton object, and computethe distance between each pair of clusters.

2. Merge the two clusters having minimum distance.

3. Calculate the distance between the new cluster and all other clusters.

4. If there is only one cluster containing all objects: Stop, otherwise go to step 2.

Representation of a Cluster C

Vt

iCd

i

tdnCrep

),()(

Page 22: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 22

Text and Web Clustering

Experimental Comparison [Steinbach, Karypis & Kumar 2000]

Clustering Quality• Measured as entropy on a prelabeled test data set• Using several text and web data sets• Bisecting k-means outperforms k-means.• Bisecting k-means outperforms agglomerative hierarchical clustering.

Efficiency• Bisecting k-means is much more efficient than agglomerative

hierarchical clustering.• O(n) vs. O(n2)

Page 23: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 23

Text and Web Clustering

Suffix Tree Clustering [Zamir & Etzioni 1998]

Forming Clusters Not by similar feature vectors But by common terms

Strengths of Suffix Tree Clustering (STC)

Efficiency: runtime O(n) for n text documents

Overlapping clusters

Method

1. Identification of „basic clusters“

2. Combination of basic clusters

Page 24: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 24

Text and Web Clustering

Identification of Basic Clusters • Basic Cluster: set of documents sharing one specific phrase

• Phrase: multi-word term

• Efficient identification of basic clusters using a suffix-tree

Insertion of (1) „cat ate cheese“cat ate cheese

cheese ate cheese

1, 1 1, 2 1, 3

Insertion of (2) „mouse ate cheese too“

cat ate cheese cheese

ate cheese

too too

mouse ate cheese too

too

1, 1 1, 2

2, 2

1, 3

2, 3

2, 1

2, 4

Page 25: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 25

Text and Web Clustering

Combination of Basic Clusters

• Basic clusters are highly overlapping

• Merge basic clusters having too much overlap

• Basic clusters graph: nodes represent basic clusters

Edge between A and B iff |A B| / |A| > 0,5 and |A B| / |B| > 0,5• Composite cluster:

a component of the basic clusters graph

• Drawback of this approach:

Distant members of the same component need not be similar

No evaluation on standard test data

Page 26: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 26

Text and Web Clustering

Example from the Grouper System

Page 27: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 27

Text and Web Clustering

Frequent-Term-Based Clustering [Beil, Ester & Xu 2002]

• Frequent term set:

description of a cluster

• Set of documents containing

all terms of the frequent term set: cluster

• Clustering: subset of set of all frequent term sets covering the DB with a low mutual overlap

{} {D1, . . ., D16}

{sun} {fun} {beach}

{D1, D2, D4, D5, {D1, D3, D4, D6, {D2, D7, D8,

D6, D8, D9, D10, D7, D8, D10, D11, D9, D10, D12,

D11, D13, D15} D14, D15, D16} D13,D14, D15}

{sun, fun} {sun, beach} {fun, surf} {beach, surf}

{D1, D4, D6, D8, {D2, D8, D9, . . . . . .

D10, D11, D15} D10, D11, D15}

{sun, fun, surf} {sun, beach, fun}

{D1, D6, D10, D11} {D8, D10, D11, D15}

Page 28: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 28

Text and Web Clustering

Method

• Task: efficient calculation of the overlap of a given cluster (description) Fi with the union of the other cluster

(description)s

• F: set of all frequent term sets in D

• f j : the number of all frequent term sets supported by document Dj

• Standard overlap of a cluster Ci:

|}|{| jiij DFFFf

||

)1(

)(i

CiDjj

i C

f

CSO

Page 29: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 29

Text and Web Clustering

Algorithm FTCFTC(database D, float minsup)

SelectedTermSets:= {};

n:= |D|;

RemainingTermSets:= DetermineFrequentTermsets(D, minsup);

while |cov(SelectedTermSets)| ≠ n do

for each set in RemainingTermSets do

Calculate overlap for set;

BestCandidate:=element of Remaining TermSets with minimum overlap;

SelectedTermSets:= SelectedTermSets {BestCandidate};

RemainingTermSets:= RemainingTermSets - {BestCandidate};

Remove all documents in cov(BestCandidate) from D and from the

coverage of all of the RemainingTermSets;

return SelectedTermSets;

Page 30: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 30

Text and Web Clustering

Experimental Evaluation

Classic Dataset

Page 31: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 31

Text and Web Clustering

Experimental Evaluation

Reuters Dataset

FTC achieves comparable clustering quality more efficientlyBetter cluster description

Page 32: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 32

Text and Web Clustering

Joint Cluster Analysis [Ester, Ge, Gao et al 2006]

Attribute data: intrinsic properties of entities

Relationship data: extrinsic properties of entities

Existing clustering algorithms use either attribute or relationship data

Often: attribute and relationship data somewhat related, but containcomplementary information

Joint cluster analysis of attribute and relationship data

Edges = relationships2D location = attributes

Informative Graph

Page 33: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 33

Text and Web Clustering

The Connected k-Center Problem

Given

• Dataset D as informative graph

• k - # of clusters

Connectivity constraint

Clusters need to be connected graph components

Objective functionmaximum (attribute) distance (radius) between nodes and cluster center

TaskFind k centers (nodes) and a corresponding partitioning of D into k clusters such that

• each clusters satisfies the connectivity constraint and

• the objective function is minimized

Page 34: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 34

Text and Web Clustering

The Hardness of the Connected k-Center Problem

Given k centers and a radius threshold r, finding the optimal assignment for the remaining nodes is NP-hard

Because of articulation nodes (e.g., a)

Assignment of a impliesthe assignment of b

Assignment of a to C1

minimizes the radiusa

b

C1 C2

Page 35: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 35

Text and Web Clustering

Example: CS Publications

1073: Kempe, Dobra, Gehrke: “Gossip-based computation of aggregate information”, FOCS'031483: Fagin, Kleinberg, Raghavan: “Query strategies for priced information”, STOC'02

True cluster labels based on conference

Page 36: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 36

Text and Web Clustering

Application for Web Pages

• Attribute datatext content

• Relationship datahyperlinks

• Connectivity constraintclusters must be connected

• New challenges clusters can be overlappingnot all web pages must belong to a cluster (noise). . .

Page 37: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 37

Text and Web Classification

Overview

Standard Text Classification Methods

• Naive Bayes Classifier

• Bag of Words Model

Advanced Methods

• Use EM to exploit non-labeled training documents

• Exploit hyperlinks

Page 38: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 38

Text and Web Classification

Naive Bayes Classifier

• Decision rule of the Bayes Classifier:

• Assumptions of the Naive Bayes Classifier:

– d = (d1, . . ., dd)

– the attribute values di are independent from each other

• Decision rule of the Naive Bayes Classifier :

)()|( jjCc

cPcdPargmaxj

d

ijij

Cc

cdPcPargmaxj 1

)|()(

Page 39: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 39

Text and Web Classification

Bag of Words Model

• Allows to estimate the P(dj| c) from the training documents

• Neglects position (order) of terms in document (bag!)

• Model for generating a document d of class c consisting of n terms

• conditional probability of observing term ti,

given that the document belongs to class c

)|( ji cdP

Vtcdk

cdi

ji

kj

j

tdn

tdn

cdP

,

),(

),(

)|(:approachNaive

Page 40: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 40

Text and Web Classification

Bag of Words Model

Problem

• Term ti appears in no training document of class cj

• ti appears in a document d to be classified

• Document d also contains terms which strongly indicate class cj,

but: P(dj| c) = 0 and P(d| c) = 0

Solution

• Smoothing of the relative frequencies in the training documents

jcd

itdn 0),(

||),(

1),(

)|(

,

Vtdn

tdn

cdP

Vtcdk

cdi

ji

kj

j

Page 41: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 41

Text and Web Classification

Experimental Evaluation [Craven et al. 1998]

Training data

• 4127 web pages of CS departments

• Classes: department, faculty, staff, student, course, . . ., other

Major Results

• Classification accuracy of 70% to 80 % for most classes

• Only 9% accuracy for class staff, but 80% correct in superclass person

• Low classification accuracy for class other

Web pages are harder to classify than „professional“ text documents

Page 42: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 42

Text and Web Classification

Exploiting Unlabeled Documents [Nigam et al. 1999]

• Labeling is laborious, but many unlabeled documents available.

• Exploit unlabeled documents to improve classification accuracy:

Classifier

Labeled Docs

Unlabeled Docs

Initial training

re-train (Maximization)

Assign weighted class labels (Estimation)

Page 43: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 43

Text and Web Classification

Exploiting Unlabeled Documents

• Let training documents d belong to classes in a graded manner Pr(c|d).

• Initially labeled documents have 0/1 membership.

• Expectation Maximization (EM)

repeat

calculate class model parameters P(di|c);

determine membership probabilities P(c|d);

until classification accuracy does no longer improve.

Page 44: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 44

Text and Web Classification

Exploiting Unlabeled Documents

Small labeled set large accuracy boost

Page 45: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 45

Text and Web Classification

Exploiting Hyperlinks [Chakrabarti, Dom & Indyk 1998]

• Logical “documents” are often fragmented into several web pages.

• Neighboring web pages often belong to the same class.

• Web documents are extremely diverse.

• Standard text classifiers perform poorly on web pages.

• For classification of web pages, use

text of the page itself

text and class labels of neighboring pages.

Page 46: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 46

Text and Web Classification

Exploiting Hyperlinks

• c = class, t = text, N = neighbors

• Text-only model: P(t|c)

• Using neighbors’ text to judge my topic: P(t, t(N) | c)

• Using neighbors’ class label to judge my topic: P(t, c(N) | c)

• Most of the neighbor’s classlabels are unknown.

• Method: iterative relaxation labeling.

?

Page 47: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 47

Text and Web Classification

05

10152025303540

0 50 100

%Neighborhood known%

Err

orText Class Text+Class

Experimental Evaluation

• Using neighbors’ text increases classification error.

• Even when tagging non-local texts.

• Using neighbors’ class labels reduces classification error.

• Using text and class of neighbors yields the lowest classification error.

Page 48: SFU, CMPT 741, Fall 2009, Martin Ester 302 Contents of this Chapter Introduction Data Preprocessing Text and Web Clustering Text and Web Classification

SFU, CMPT 741, Fall 2009, Martin Ester 48

ReferencesBeil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", Proc. KDD ’02, 2002.

Brin S., Page L.: ”The anatomy of a large-scale hypertextual Web search engine”, Proc. WWW7, 1998.

Chakrabarti S., Dom B., Indyk P.: “Enhanced hypertext categorization using hyperlinks”, Proc. ACM SIGMOD 1998.

Craven M., DiPasquo D., Freitag D., McCallum A., Mitchell T., Nigam K., Slattery S.: “Learning to Extract Symbolic Knowledge from the World Wide Web”, Proc. AAAI 1998. Deerwater S., Dumais S. T., Furnas G. W., Landauer T. K., Harshman R.: “Indexing by latent semantic analysis”, Journal of the Society for Information Science, 41(6), 1990.

Ester M., Ge R., Gao B. J., Hu Z., Ben-Moshe B.: “Joint Cluster Analysis of Attribute Data and Relationship Data: the Connected k-Center Problem”, Proc. SDM 2006.

Kleinberg J.: “Authoritative sources in a hyperlinked environment”, Proc. SODA 1998.

Nigam K., McCallum A., Thrun S., Mitchell T.: “Text Classification from Labeled and Unlabeled Documents using EM”, Machine Learning, 39(2/3), pp. 103-134, 2000.

Steinbach M., Karypis G., Kumar V.: “A comparison of document clustering techniques”, Proc. KDD Workshop on Text Mining, 2000.

Zamir O., Etzioni O.: “Web Document Clustering: A Feasibility Demonstration”, Proc. SIGIR 1998.