
Hypertext Categorization

Rayid Ghani

IR Seminar - 10/3/00

“Standard” Approach

Apply traditional text learning algorithms
In many cases, the goal is not to classify hypertext but to test the algorithms
Is this actually the right approach?

Results?

Mixed results
Positive results in most cases, BUT the goal was to test the algorithms
Negative in a few (e.g. Chakrabarti), BUT the goal was to motivate their own algorithm

How is hypertext different?

Link information
Diverse authorship
Short text - topic not obvious from the text
Structure / position within the web graph
Author-supplied features (meta-tags)
Bold, italics, headings, etc.

How to use those extra features?

Specific approaches to classify hypertext

Chakrabarti et al. SIGMOD 98
Oh et al. SIGIR 00
Slattery & Mitchell ICML 00

Goal is not classification but retrieval:
Bharat & Henzinger SIGIR 98
Croft & Turtle 93

Chakrabarti et al. SIGMOD 98

Use the page and linkage information
Add words from the “neighbors” and treat them as belonging to the page itself
Decrease in performance (not surprising): link information is very noisy
Use topic information from neighbors instead
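To make the contrast concrete, here is a minimal naive-Bayes-style sketch of scoring a page using its own words plus the class labels of its neighbors, rather than the neighbors' raw text. The function name, the dict-based probability tables, and the smoothing constant are illustrative assumptions, not the paper's implementation.

import math
from collections import Counter

def classify_with_neighbor_classes(page_words, neighbor_classes,
                                   prior, p_term, p_link, classes):
    # prior[c]      -- P(c)
    # p_term[c][t]  -- smoothed P(t | c)
    # p_link[c][c'] -- P(neighbor has class c' | page has class c)
    counts = Counter(page_words)
    scores = {}
    for c in classes:
        s = math.log(prior[c])
        for t, n in counts.items():                  # text evidence from the page itself
            s += n * math.log(p_term[c].get(t, 1e-9))
        for nc in neighbor_classes:                  # link evidence: neighbor classes,
            s += math.log(p_link[c].get(nc, 1e-9))   # not the neighbors' raw words
        scores[c] = s
    return max(scores, key=scores.get)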

Data Sets

IBM Patent Database: 12 classes (630 train, 300 test for each class)
Yahoo: 13 classes, 20000 docs (for expts involving hypertext, only 900 documents were used) (?)

Experiments

Using text from neighbors: Local+Neighbor_Text, Local+Neighbor_Text_Tagged
Assume neighbors are pre-classified:
Text – 36%
Link – 34%
Prefix – 22.1% (words in class hierarchy used)
Text+Prefix – 21%

Oh et al. SIGIR 2000

The relationship between the class of a web page and its neighbors in the training set is not consistent/useful (?)
Instead, use the class and neighbor info of the page being classified (use regularities in the test set)

Classify test instance d by:

Classification

c^*(d) = \arg\max_c P(c \mid T_d, G_d)
       = \arg\max_c \left[ P(c \mid T_d)\, P(c \mid G_d) \right]
       = \arg\max_c \left[ P(c) \prod_{i=1}^{|T_d|} P(t_i \mid c)^{N(d,\,t_i)} \prod_{c' \in \mathrm{Neighbor}(c)} P(c' \mid c) \right]

where T_d is the set of terms in d, G_d its link (neighbor) information, N(d, t_i) the frequency of term t_i in d, and Neighbor(c) = { c(l) : l in L_d }, the classes of the pages reached through d's links L_d.

Algorithm

For each test document d, generate a set A of “trustable” neighbors

For all terms ti in d, adjust the term weight using the term weights from A

For each doc a in A, assign a max confidence value if its class is known; otherwise assign a class probabilistically and give it partial confidence weight

Classify d using the equation given earlier.
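A rough sketch of these steps, under simplifying assumptions: each neighbor is a dict carrying its terms, a class label ("cls"), and a confidence ("conf", 1.0 if the class is known, a predicted probability otherwise). The threshold, the term-reinforcement scheme, and all names are illustrative; the paper's actual adjustment formulas differ.

import math
from collections import Counter

def classify_with_test_neighbors(doc_terms, neighbors, prior, p_term, p_link,
                                 classes, trust_threshold=0.5):
    # prior[c], p_term[c][t], p_link[c][c'] -- smoothed probability tables
    # 1. keep only the "trustable" neighbors A
    A = [n for n in neighbors if n["conf"] >= trust_threshold]

    # 2. reinforce weights of terms in d that also appear in trusted neighbors
    weights = Counter(doc_terms)
    shared = set(t for n in A for t in n["terms"])
    for t in weights:
        if t in shared:
            weights[t] += 1        # crude reinforcement; the paper uses a finer scheme

    # 3. classify d: text evidence plus confidence-weighted neighbor-class evidence
    scores = {}
    for c in classes:
        s = math.log(prior[c])
        for t, w in weights.items():
            s += w * math.log(p_term[c].get(t, 1e-9))
        for n in A:
            s += n["conf"] * math.log(p_link[c].get(n["cls"], 1e-9))
        scores[c] = s
    return max(scores, key=scores.get)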

Experiments

Reuters used to assess the algorithm on a dataset without hyperlinks, varying only the size of the training set and the number of features (?)
Results not directly comparable, but the numbers are similar to reported results

Articles from an encyclopedia – 76 classes, 20836 documents

Results

Terms+Classes > Only Classes > Only Terms > No use of inlinks

Other issues

Link discrimination
Knowledge of neighbor classes
Use of links in training set
Inclusion of new terms from neighbors

Comparison

                                  Chakrabarti   Oh et al.   Improvement
Links in training set             Y             N           5%
Link discrimination               N             Y           6.7%
Knowledge of neighbor class       Y             Y           6.6% / 1.9%
Iteration                         Y             N           1.5%
Using new terms from neighbors    Y             N           31.4%

Slattery & Mitchell ICML 00

Given a problem setting in which the test set contains structural regularities, how can we find and use them?

Hubs and Authorities - Kleinberg (1998)

“.. a good hub is a page that points to many good authorities;

a good authority is a page pointed to by many good hubs.”

[Diagram: hubs pointing to authorities]

Hubs and Authorities - Kleinberg (1998)

“Hubs and authorities exhibit what could be called a mutually reinforcing relationship”

Iterative relaxation:

\mathrm{Authority}(p) = \sum_{q:\, q \to p} \mathrm{Hub}(q)

\mathrm{Hub}(p) = \sum_{q:\, p \to q} \mathrm{Authority}(q)

[Diagram: hubs pointing to authorities]
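A compact sketch of this iterative relaxation, with L2 normalisation after each pass so the scores stay bounded; the normalisation choice is an implementation detail assumed here, not something specified on the slide.

def hits(links, n_iter=50):
    # links maps each page to the set of pages it links to
    pages = set(links) | {q for qs in links.values() for q in qs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0   # keep scores bounded
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return hub, auth

# e.g. hub, auth = hits({"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()})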

The Plan

Take an existing learning algorithm
Extend it to exploit structural regularities in the test set
Use Hubs and Authorities as inspiration

FOIL - Quinlan & Cameron-Jones (1993)

Learns relational rules like:
  target_page(A) :- has_research(A), link(A,B), has_publications(B).

For each test example:
  Pick the matching rule with the best training set performance p
  Predict positive with confidence p
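A tiny sketch of this prediction step, representing each learned rule as a pair of a boolean matcher over a page (the relational facts it needs are captured inside the matcher) and its training-set precision; that representation is an assumption made for illustration.

def foil_predict(page, rules):
    # rules: list of (matches, precision) pairs, where matches(page) -> bool
    # and precision is that rule's performance on the training set
    confidences = [p for matches, p in rules if matches(page)]
    return max(confidences) if confidences else 0.0   # confidence of a positive prediction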

FOIL-Hubs Representation

Add two rules to a learned rule set:
  target_page(A) :- link(B,A), target_hub(B).
  target_hub(A) :- link(A,B), target_page(B).

Talk about confidence rather than truth, e.g. target_page(page15) = 0.75

Evaluate by summing over instantiations:

\mathrm{target\_page}(\mathrm{page15}) = \sum_{B:\, \mathrm{link}(B,\, \mathrm{page15})} \mathrm{target\_hub}(B)

FOIL-Hubs Algorithm

1. Apply learned FOIL rules: learned(A)
2. Iterate:
   1. Evaluate target_hub(A)
   2. Evaluate target_page(A)
   3. Set target_page(A) = s * target_page(A) + learned(A)
3. Report target_page(A)

FOIL-Hubs Algorithm

[Diagram: learned FOIL rules foil(A), target_hub(A), target_page(A)]

1. Apply learned FOIL rules to test set

2. Initialise target_page(A) confidence from foil(A)

3. Evaluate target_hub(A)

4. Evaluate target_page(A)

5. target_page(A) = s * target_page(A) + foil(A)
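A sketch of this loop under stated assumptions: foil[a] holds the rule-based confidences from step 2, links[a] is the set of pages a links to, and s is an assumed scaling constant (the slides give neither its value nor any normalisation details).

def foil_hubs(pages, links, foil, s=0.2, n_iter=20):
    target_page = dict(foil)                           # initialise from foil(A)
    for _ in range(n_iter):
        # evaluate target_hub(A): sum of target_page over pages A links to
        target_hub = {a: sum(target_page.get(b, 0.0) for b in links.get(a, ()))
                      for a in pages}
        # evaluate target_page(A): sum of target_hub over pages linking to A
        incoming = {a: sum(target_hub[b] for b in pages if a in links.get(b, ()))
                    for a in pages}
        # combine with the rule-based confidence
        target_page = {a: s * incoming[a] + foil.get(a, 0.0) for a in pages}
    return target_page                                 # report target_page(A)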

Data Set

4127 pages from Computer Science departments of four universities:
  Cornell University
  University of Texas at Austin
  University of Washington
  University of Wisconsin

Hand labeled into:
  Student  558 Web pages
  Course   243 Web pages
  Faculty  153 Web pages

Experiment

Three binary classification tasks

1. Student Home Page

2. Course Home Page

3. Faculty Home Page

Leave-two-universities-out cross-validation

Student Home Page

[Precision-recall curves: FOIL-Hubs vs. FOIL]

Course Home Page

[Precision-recall curves: FOIL-Hubs vs. FOIL]

More Detailed Results

Partition the test data into:
  Examples covered by some learned FOIL rule
  Examples covered by no learned FOIL rule

Student – FOIL covered

[Precision-recall curves: FOIL-Hubs vs. FOIL]

Student – FOIL uncovered

[Precision-recall curves: FOIL-Hubs vs. FOIL]

Course – FOIL covered

[Precision-recall curves: FOIL-Hubs vs. FOIL]

Course – FOIL uncovered

[Precision-recall curves: FOIL-Hubs vs. FOIL]

Recap

We’ve searched for regularities of the form

  student_page(A) :- link(Web->KB members page, A)

in the test set. We consider this an instance of a regularity schema:

  student_page(A) :- link(<page constant>, A)

Conclusions

Test set regularities can be used to improve classification performance

FOIL-Hubs used such regularities to outperform FOIL on three Web page classification problems

We can potentially search for other regularity schemas using FOIL

Other work

Using the structure of HTML to improve retrieval. Michal Cutler, Yungming Shih, Weiyi Meng. USENIX 1997
Uses tf-idf with different weights for text in different HTML tags
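A small sketch of that idea; the tag weights below are made-up illustrative values, not the ones tuned in the Cutler et al. paper.

import math
from collections import defaultdict

# Illustrative tag weights -- assumed values, not those from the paper
TAG_WEIGHTS = {"title": 8.0, "h1": 4.0, "h2": 3.0, "b": 2.0, "i": 1.5, "body": 1.0}

def weighted_tf(tagged_terms):
    # tagged_terms: list of (term, tag) pairs from one document;
    # each occurrence counts with the weight of the HTML tag it appears in
    tf = defaultdict(float)
    for term, tag in tagged_terms:
        tf[term] += TAG_WEIGHTS.get(tag, 1.0)
    return tf

def tfidf(tf, df, n_docs):
    # standard tf-idf on top of the tag-weighted term frequencies
    return {t: w * math.log(n_docs / (1 + df.get(t, 0))) for t, w in tf.items()}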