Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti, IIT Bombay; Kunal Punera, IIT Bombay; Mallela Subramanyam, UT Austin


First-generation focused crawling

Crawl regions of the Web pertaining to specific topic c*, avoiding irrelevant topics

Guess the relevance of an unseen node v from the relevance of u (where u links to v, u → v), as evaluated by a topic classifier

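[Figure: baseline focused crawler. A baseline learner builds class models, consisting of term stats, from the Dmoz topic taxonomy. Seed URLs initialize the frontier, a priority queue of URLs. The crawler picks the best URL, fetches page u, and submits the page for classification. If Pr(c*|u) is large enough, then all outlinks v of u are enqueued with priority Pr(c*|u). Fetched pages are stored in the crawl database.]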
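The baseline crawl loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `fetch_outlinks`, `relevance`, the `threshold` cutoff for "large enough", and the toy link graph are all hypothetical stand-ins.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def baseline_crawl(
    seeds: Iterable[str],
    fetch_outlinks: Callable[[str], List[str]],  # hypothetical: outlinks of a fetched page
    relevance: Callable[[str], float],           # hypothetical stand-in for Pr(c*|u)
    threshold: float = 0.5,                      # assumed cutoff for "large enough"
    max_pages: int = 100,
) -> List[Tuple[str, float]]:
    """Return (url, relevance) pairs in the order pages were fetched."""
    seeds = list(seeds)
    # heapq is a min-heap, so priorities are stored negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[Tuple[str, float]] = []
    while frontier and len(visited) < max_pages:
        _, u = heapq.heappop(frontier)
        p = relevance(u)
        visited.append((u, p))
        if p >= threshold:
            # Every outlink v of u inherits the same priority Pr(c*|u).
            for v in fetch_outlinks(u):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-p, v))
    return visited


# Toy link graph and relevance scores, invented for illustration only.
TOY_WEB = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
TOY_RELEVANCE = {"seed": 0.9, "a": 0.8, "b": 0.2, "c": 0.7, "d": 0.1}
```

In the toy graph, page `b` is fetched even though it is irrelevant, because it inherits the seed's priority along with `a`. This is the "leap of faith" that the rest of the talk tries to improve on.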

Baseline crawling results

20 topics from http://dmoz.org

Half to two-thirds of pages fetched are irrelevant

“Every click on a link is a leap of faith”

Humans leap better than focused crawlers

Adequate clues in text + DOM to leap better

[Figure: baseline crawl results per topic. Axes: relevance probability (0 to 1) and #pages (10 to 100000). Topics: AI, Astronomy, Basketball, Cancer, Chess, Composers, FlyFishing, FolkDance, Horses, IceHockey, Kayaking, Linux, Meteorology, Soups, Tobacco. Topic source: http://dmoz.org/]

How to seek out distant goals?

Manually collect paths (context graphs) leading to relevant goals

Use a classifier to predict link-distance to goal from page text

Prioritize link expansion by estimated distance to relevant goals

No discrimination among different links on a page

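[Figure: context graph. Pages at link distances 1, 2, 3 from a goal; the crawler's classifier assigns fetched pages to distance classes 1, 2, 3, 4.]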

The apprentice+critic strategy

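[Figure: apprentice+critic architecture. The baseline learner (critic) builds class models, consisting of term stats, from the Dmoz topic taxonomy. The crawler picks the best URL from the frontier (a priority queue of URLs) and submits each newly fetched page u for classification; if Pr(c*|u) is large enough, its outlinks are enqueued. Each fetched link (u,v) becomes an instance for the apprentice, carrying Pr(c*|v) for the target and Pr(c|u) for all classes c for the source. The apprentice learner builds its own class models from positive and negative instances in the crawl database through online training. The crawler submits (u,v) to the apprentice, which assigns a more accurate priority to node v: a good page u may link to a good or a bad page v.]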
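As a rough sketch of how training instances for the apprentice could be harvested from the crawl, the snippet below emits one instance per link (u,v) whose target has already been fetched and judged by the critic. The `crawl_links` dictionary and the `critic` callable are hypothetical stand-ins for the crawl database and the baseline classifier. The Pr(c|u) source-topic features from the slide are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

# One apprentice training instance: the link (u, v) and its label Pr(c*|v).
Instance = Tuple[Tuple[str, str], float]


def harvest_instances(
    crawl_links: Dict[str, List[str]],  # hypothetical crawl database: fetched page -> outlinks
    critic: Callable[[str], float],     # hypothetical stand-in for the baseline classifier
) -> List[Instance]:
    """Emit an instance for every link whose source and target were both fetched."""
    instances: List[Instance] = []
    for u, outlinks in crawl_links.items():
        for v in outlinks:
            if v in crawl_links:  # v was fetched, so the critic can label it
                instances.append(((u, v), critic(v)))
    return instances
```

Links whose target has not been fetched yet produce no instance; those are exactly the links the trained apprentice is later asked to score.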

Design rationale

Baseline classifier specifies what to crawl
  Could be a user-defined black box
  Usually depends on large amounts of training data, relatively slow to train (hours)

Apprentice classifier learns how to locate pages approved by the baseline classifier
  Uses local regularities in HTML, site structure
  Less training data, faster training (minutes)

Guards against high fan-out (~10)

No need to manually collect paths to goals

Apprentice feature design

HREF source page u represented by DOM tree

Leaf nodes marked with offsets with respect to the HREF

Many usable representations for term t at offset d
  A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
  ⟨t,p,d⟩, where p is the path length from t to d through the least common ancestor (LCA)

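[Figure: DOM tree of an example page. A ul element with li children; the leaves are TEXT nodes, some nested under font, em, and tt elements, and one li contains the a HREF. Leaf offsets relative to the HREF run @-2, @-1, @0, @0, @1, @2, @3. The example term “download” and the LCA are marked.]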
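One way to produce ⟨t,d⟩ features of this kind is sketched below using Python's standard `html.parser`. It numbers text leaves in document order and measures each leaf's offset from the text inside the target link. The class and function names are hypothetical, and the tokenization (lowercased whitespace split) is an assumption; the slides do not specify either.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _LeafCollector(HTMLParser):
    """Collect text leaves in document order, flagging those inside the target link."""

    def __init__(self, target_href: str):
        super().__init__()
        self.target_href = target_href
        self.in_target = False
        self.leaves: List[Tuple[str, bool]] = []  # (text, inside target <a>)

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((text, self.in_target))


def offset_features(html: str, target_href: str) -> List[Tuple[str, int]]:
    """Return (term, d) pairs, where d is the leaf offset relative to the target HREF."""
    collector = _LeafCollector(target_href)
    collector.feed(html)
    collector.close()
    leaves = collector.leaves
    inside = [i for i, (_, hit) in enumerate(leaves) if hit]
    if not inside:  # link not found, or it has no text leaves (e.g. an image link)
        return []
    first, last = inside[0], inside[-1]
    features: List[Tuple[str, int]] = []
    for i, (text, _) in enumerate(leaves):
        # Leaves inside the link get offset 0; others count outward from it.
        d = i - first if i < first else (i - last if i > last else 0)
        for term in text.lower().split():
            features.append((term, d))
    return features
```

For a list such as `<ul><li>free <tt>download</tt></li><li><em>mirror</em></li><li><a href="x.html">get <font>it</font></a></li><li>next</li></ul>`, the term "download" lands at offset -2 relative to the link to `x.html`, and both text leaves inside the link get offset 0.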

Offsets of good ⟨t,d⟩ features

Plot information gain at each d, averaged over terms t

Max at d=0, falls off on both sides, but…

Clear saw-tooth pattern for most topics. Why?
  <li><a…><li><a…>…
  <a…><br><a…><br>…

Topic-independent authoring idioms, best handled by apprentice

[Figure: topic AI. Information gain (4.00E-05 to 1.00E-04) versus offset d (-8 to 8), with curves for d_max = 8, 5, 4, 3.]

Apprentice learner design

Instance ⟨(u,v), Pr(c*|v)⟩ represented by
  ⟨t,d⟩ features for offsets d up to some d_max
  HREF source topics {⟨c, Pr(c|u)⟩ for all c}
  Cocitation: w1, w2 siblings, w1 good ⇒ w2 good

Learning algorithm
  Want to map features to a score in [0,1]
  Discretize [0,1] into ranges, each a label γ
  Class label γ has an associated value q_γ
  Use a naïve Bayes classifier to find Pr(γ|(u,v))
  Score = Σ_γ q_γ Pr(γ|(u,v))
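The scoring scheme on this slide can be illustrated with a small multinomial naive Bayes over discretized relevance labels. This is a sketch under stated assumptions, not the authors' learner: the class name `BinnedNaiveBayes`, the equal-width bins, the use of each bin's midpoint as its associated value q, and the Laplace smoothing are all choices made here for illustration.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


class BinnedNaiveBayes:
    """Naive Bayes over labels obtained by discretizing [0,1] into equal-width bins."""

    def __init__(self, num_bins: int = 2):
        self.num_bins = num_bins
        # Assumption: the value q associated with each label is the bin midpoint.
        self.q = [(b + 0.5) / num_bins for b in range(num_bins)]
        self.bin_counts = [0] * num_bins
        self.feature_counts = [Counter() for _ in range(num_bins)]
        self.vocab = set()

    def _bin(self, relevance: float) -> int:
        return min(int(relevance * self.num_bins), self.num_bins - 1)

    def train(self, features: Sequence[Hashable], relevance: float) -> None:
        """Add one instance: features of the link (u,v) and its label Pr(c*|v)."""
        b = self._bin(relevance)
        self.bin_counts[b] += 1
        self.feature_counts[b].update(features)
        self.vocab.update(features)

    def posterior(self, features: Sequence[Hashable]) -> List[float]:
        """Return the posterior probability of each label given the features."""
        total = sum(self.bin_counts)
        log_scores = []
        for b in range(self.num_bins):
            # Laplace-smoothed prior and multinomial likelihood, in log space.
            log_p = math.log((self.bin_counts[b] + 1) / (total + self.num_bins))
            denom = sum(self.feature_counts[b].values()) + len(self.vocab) + 1
            for f in features:
                log_p += math.log((self.feature_counts[b][f] + 1) / denom)
            log_scores.append(log_p)
        top = max(log_scores)  # subtract the max before exponentiating for stability
        weights = [math.exp(s - top) for s in log_scores]
        norm = sum(weights)
        return [w / norm for w in weights]

    def score(self, features: Sequence[Hashable]) -> float:
        """Expected relevance: sum over labels of q times the label posterior."""
        return sum(q * p for q, p in zip(self.q, self.posterior(features)))
```

Because the score is a posterior-weighted average of the bin values, it always falls in [0,1] and can be used directly as a frontier priority.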

Apprentice accuracy

Better elimination of useless outlinks with increasing d_max

Good accuracy with d_max < 6

Using DOM offset info improves accuracy

Small accuracy gains from
  LCA distance in ⟨t,p,d⟩
  Source page topics
  Cocitation features

[Figure: topic AI. %Accuracy (65 to 90) versus d_max (0 to 8), with curves for negative, positive, and average.]

[Figure: topic AI. %Accuracy (76 to 86) versus d_max (0 to 8), comparing text features with offset features.]

Offline apprentice trials

Run baseline, train apprentice

Start new crawler at the same URLs

Let it fetch any page it schedules

(Recall) Limit it to pages visited by the baseline crawler

Baseline loss > recall loss > apprentice loss

Small URL overlap

Online guidance from apprentice

Run baseline

Train apprentice

Re-evaluate frontier
  Apprentice not as optimistic as baseline
  Many URLs downgraded

Continue crawling with apprentice guidance
  Immediate reduction in loss rate

[Figure: Classical Composers. Cumulative loss (900 to 1700) versus #pages fetched (2000 to 8000), with the point where the apprentice is trained marked.]

[Figure: Folk Dancing. Frequency (0 to 12000) of the estimated relevance of outlinks, in bins 0, 0-.2, .2-.4, .4-.6, .6-.8, .8-1, for baseline and apprentice.]
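The "re-evaluate frontier" step can be pictured as re-scoring every queued link with the trained apprentice and rebuilding the priority queue. The sketch below assumes each frontier entry remembers its source page u, and that `apprentice_score` is a hypothetical callable returning a score in [0,1]; neither detail is specified in the slides.

```python
import heapq
from typing import Callable, List, Tuple

# (negated priority, source page u, target URL v); negated because heapq is a min-heap.
FrontierEntry = Tuple[float, str, str]


def reevaluate_frontier(
    frontier: List[FrontierEntry],
    apprentice_score: Callable[[str, str], float],  # hypothetical trained apprentice
) -> List[FrontierEntry]:
    """Replace every baseline priority with the apprentice's score for (u, v)."""
    rescored = [(-apprentice_score(u, v), u, v) for _, u, v in frontier]
    heapq.heapify(rescored)
    return rescored


def count_downgraded(old: List[FrontierEntry], new: List[FrontierEntry]) -> int:
    """Count links whose priority dropped after re-evaluation."""
    old_priority = {(u, v): -neg for neg, u, v in old}
    return sum(1 for neg, u, v in new if -neg < old_priority[(u, v)])
```

Under the baseline, all outlinks of a page share one priority; after re-scoring, links from the same page can be ranked differently, which is what lets the crawler skip the less promising ones.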

Summary

New generation of focused crawler

Discriminates between links, learning online

Apprentice easy and fast to train online
  Accurate with small d_max, around 4 to 6

DOM-derived features better than text

Effective division of labor (‘what’ vs. ‘how’)

Loss rate reduced by 30 to 90%
  Apprentice better at guessing relevance of unvisited nodes than baseline crawler
  Benefits visible after 100 to 1000 page fetches

Ongoing work

Extending to larger radius and deeper DOM + site structure

Public domain C++ software

Crawler
  • Asynchronous DNS, simple callback model
  • Can saturate dedicated 4 Mbps with a Pentium II

HTML cleaner
  • Simple, customizable, table-driven patch logic
  • Robust to bad HTML, no crash or memory leak

HTML to DOM converter
  • Extensible DOM node class
