Accelerated Focused Crawling Through Online Relevance Feedback
Soumen Chakrabarti, IIT Bombay
Kunal Punera, IIT Bombay
Mallela Subramanyam, UT Austin
First-generation focused crawling
Crawl regions of the Web pertaining to a specific topic c*, avoiding irrelevant topics
Guess relevance of unseen node v based on the relevance of u (u→v), evaluated by a topic classifier
[Figure: baseline focused crawler. Seed URLs initialize the frontier, a priority queue of URLs. The crawler picks the best URL, stores the newly fetched page u in the crawl database, and submits it for classification to the baseline learner, whose class models (term stats) come from the Dmoz topic taxonomy. If Pr(c*|u) is large enough, all outlinks v of u are enqueued with priority Pr(c*|u).]
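The loop in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `fetch`, `relevance`, and `outlinks` are hypothetical callables standing in for the page fetcher, the topic classifier Pr(c*|u), and link extraction, and the `threshold` and `budget` values are arbitrary.

```python
import heapq


def baseline_crawl(seeds, fetch, relevance, outlinks, threshold=0.5, budget=100):
    """Sketch of a first-generation focused crawler.

    fetch(url) -> page, relevance(page) -> Pr(c*|u) in [0, 1], and
    outlinks(page) -> iterable of URLs are hypothetical callables.
    """
    # heapq is a min-heap, so priorities are stored negated ("pick best").
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawl_db = []  # (url, relevance) pairs in fetch order

    while frontier and len(crawl_db) < budget:
        _, u = heapq.heappop(frontier)
        page = fetch(u)
        p = relevance(page)
        crawl_db.append((u, p))
        if p >= threshold:
            # Every outlink v of u gets the same priority Pr(c*|u).
            for v in outlinks(page):
                if v not in seen:
                    seen.add(v)
                    heapq.heappush(frontier, (-p, v))
    return crawl_db
```

Because all outlinks of u share one priority, the sketch cannot tell a promising link from a useless one on the same page.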
Baseline crawling results
20 topics from http://dmoz.org
Half to two-thirds of pages fetched are irrelevant
“Every click on a link is a leap of faith”
Humans leap better than focused crawlers
Adequate clues in text + DOM to leap better
[Figure: baseline crawl results, one curve per topic (AI, Astronomy, Basketball, Cancer, Chess, Composers, FlyFishing, FolkDance, Horses, IceHockey, Kayaking, Linux, Meteorology, Soups, Tobacco). Axes: #pages (ticks 10 to 100000) and relevance probability (0 to 1).]
How to seek out distant goals?
Manually collect paths (context graphs) leading to relevant goals
Use a classifier to predict link-distance to goal from page text
Prioritize link expansion by estimated distance to relevant goals
No discrimination among different links on a page
[Figure: context graph. Layers 1, 2, 3 around the Goal page; the Crawler uses a Classifier over layers 1, 2, 3, 4.]
The apprentice+critic strategy
[Figure: apprentice+critic architecture. The baseline learner (critic) keeps its class models of term stats from the Dmoz topic taxonomy; the crawler picks the best URL from the frontier priority queue, stores the newly fetched page u in the crawl database, and submits it for classification. If Pr(c*|u) is large enough, the crawler also submits (u,v) to the apprentice learner. An instance (u,v) for the apprentice carries Pr(c|u) for all classes c and is labelled with Pr(c*|v). The apprentice's class models (+/-) are trained online, and the apprentice assigns a more accurate priority (good/bad) to node v.]
Design rationale
Baseline classifier specifies what to crawl
• Could be a user-defined black box
• Usually depends on large amounts of training data, relatively slow to train (hours)
Apprentice classifier learns how to locate pages approved by the baseline classifier
• Uses local regularities in HTML, site structure
• Less training data, faster training (minutes)
• Guards against high fan-out (~10)
• No need to manually collect paths to goals
Apprentice feature design
HREF source page u represented by DOM tree
Leaf nodes marked with offsets wrt the HREF
Many usable representations for term at offset d
• A ⟨t,d⟩ tuple, e.g., ⟨“download”, -2⟩
• ⟨t,p,d⟩ where p is path length from t to d through LCA
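[Figure: DOM tree of the HREF source page, with tags ul, li, a (the HREF), font, em, tt and TEXT leaves. Leaves are labelled with offsets @-2, @-1, @0, @0, @1, @2, @3 relative to the HREF; the term “download” and the LCA are marked.]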
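The ⟨t,d⟩ representation can be illustrated with a short sketch. It rests on simplifying assumptions that are not from the slides: every non-empty text node counts as one leaf, terms are lowercased whitespace-separated tokens, and the target link is identified by an exact `href` match. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class LeafCollector(HTMLParser):
    """Collect text leaves in document order, flagging those inside the target <a>."""

    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.in_target = False
        self.leaves = []  # (text, inside_target_anchor) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_href:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append((text, self.in_target))


def offset_features(html, target_href, d_max=5):
    """Return the set of (term, offset) pairs for leaves within d_max of the HREF."""
    parser = LeafCollector(target_href)
    parser.feed(html)
    parser.close()
    anchor_idx = [i for i, (_, inside) in enumerate(parser.leaves) if inside]
    if not anchor_idx:
        return set()  # link not found, or it has no text leaf
    first, last = anchor_idx[0], anchor_idx[-1]
    features = set()
    for i, (text, _) in enumerate(parser.leaves):
        # Leaves inside the anchor get offset 0; others count leaves to its edge.
        d = i - first if i < first else (i - last if i > last else 0)
        if abs(d) <= d_max:
            for term in text.lower().split():
                features.add((term, d))
    return features
```

The ⟨t,p,d⟩ variant would additionally need the tree path through the LCA, which this flat leaf list does not keep.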
Offsets of good ⟨t,d⟩ features
Plot information gain at each d, averaged over terms t
Max at d=0, falls off on both sides, but…
Clear saw-tooth pattern for most topics. Why?
• <li><a…><li><a…>…
• <a…><br><a…><br>…
Topic-independent authoring idioms, best handled by apprentice
[Figure: AI topic. Information gain (4.00E-05 to 1.00E-04) against offset d (-8 to 8), one curve each for d_max = 8, 5, 4, 3.]
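A minimal sketch of the quantity being plotted, assuming each training instance is a pair of a (term, offset) feature set and a boolean good/bad label. The binary labelling and the function names are assumptions made for illustration; the slides do not spell out the exact computation.

```python
import math
from collections import defaultdict


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(instances, feature):
    """Gain of one binary feature; instances are (feature_set, is_good) pairs."""
    def class_counts(labels):
        good = sum(labels)
        return [good, len(labels) - good]

    with_f = [good for feats, good in instances if feature in feats]
    without_f = [good for feats, good in instances if feature not in feats]
    n = len(instances)
    prior = entropy(class_counts([good for _, good in instances]))
    conditional = sum(len(part) / n * entropy(class_counts(part))
                      for part in (with_f, without_f))
    return prior - conditional


def mean_gain_by_offset(instances):
    """Average information gain over terms t, separately for each offset d."""
    gains = defaultdict(list)
    all_features = set().union(*(feats for feats, _ in instances))
    for t, d in all_features:
        gains[d].append(information_gain(instances, (t, d)))
    return {d: sum(g) / len(g) for d, g in gains.items()}
```

Plotting the returned dictionary against d gives a curve of the kind the slide describes.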
Apprentice learner design
Instance ⟨(u,v), Pr(c*|v)⟩ represented by
• ⟨t,d⟩ features for d up to some d_max
• HREF source topics {⟨c, Pr(c|u)⟩ ∀c}
• Cocitation: w1, w2 siblings, w1 good ⇒ w2 good
Learning algorithm
• Want to map features to a score in [0,1]
• Discretize [0,1] into ranges, each a label β
• Class label β has an associated value q_β
• Use a naïve Bayes classifier to find Pr(β|(u,v))
• Score = Σ_β q_β Pr(β|(u,v))
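The scoring rule, a sum over labels of each label's value times its posterior probability, can be sketched with a small online naïve Bayes learner. Everything specific here is an assumption for illustration: the `Apprentice` class name, five equal-width bins, bin midpoints as the label values, and Laplace smoothing.

```python
import math
from collections import Counter, defaultdict


class Apprentice:
    """Sketch: naive Bayes over relevance bins, scored by expected bin value."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins
        # q[b] is the value attached to bin b (its midpoint, an assumption).
        self.q = [(b + 0.5) / n_bins for b in range(n_bins)]
        self.bin_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def _bin(self, relevance):
        return min(int(relevance * self.n_bins), self.n_bins - 1)

    def train(self, features, relevance):
        """Online update from one instance: features of (u,v) and Pr(c*|v)."""
        b = self._bin(relevance)
        self.bin_counts[b] += 1
        for f in features:
            self.feature_counts[b][f] += 1
            self.vocab.add(f)

    def posterior(self, features):
        """Pr(bin | features) with Laplace smoothing, computed in log space."""
        total = sum(self.bin_counts.values())
        log_p = []
        for b in range(self.n_bins):
            lp = math.log((self.bin_counts[b] + 1) / (total + self.n_bins))
            # The extra 1 reserves probability mass for unseen features.
            denom = sum(self.feature_counts[b].values()) + len(self.vocab) + 1
            for f in features:
                lp += math.log((self.feature_counts[b][f] + 1) / denom)
            log_p.append(lp)
        m = max(log_p)
        weights = [math.exp(lp - m) for lp in log_p]
        z = sum(weights)
        return [w / z for w in weights]

    def score(self, features):
        """Expected relevance: sum over bins of q[b] * Pr(b | features)."""
        return sum(q * p for q, p in zip(self.q, self.posterior(features)))
```

Training only increments counts, which is what makes updating the learner during a crawl cheap.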
Apprentice accuracy
Better elimination of useless outlinks with increasing d_max
Good accuracy with d_max < 6
Using DOM offset info improves accuracy
Small accuracy gains from
• LCA distance in ⟨t,p,d⟩
• Source page topics
• Cocitation features
[Figure: AI topic, two plots of % accuracy against d_max (0 to 8). One shows negative, positive, and average accuracy (65 to 90%); the other compares text features with offset features (76 to 86%).]
Offline apprentice trials
Run baseline, train apprentice
Start new crawler at the same URLs
• Let it fetch any page it schedules
• (Recall) limit it to pages visited by the baseline crawler
Baseline loss > recall loss > apprentice loss
Small URL overlap
Online guidance from apprentice
Run baseline
Train apprentice
Re-evaluate frontier
• Apprentice not as optimistic as baseline
• Many URLs downgraded
Continue crawling with apprentice guidance
• Immediate reduction in loss rate
[Figure: Classical Composers. Cumulative loss (900 to 1700) against #pages fetched (2000 to 8000), with the point “Train apprentice” marked.]
[Figure: Folk Dancing. Frequency (0 to 12000) of outlinks by estimated relevance (0, 0-.2, .2-.4, .4-.6, .6-.8, .8-1), baseline compared with apprentice.]
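The frontier re-evaluation step might look like the sketch below. It assumes the frontier is a `heapq` list of `(-priority, url)` entries and that the features of the link leading to each URL were saved at enqueue time; `link_features` and `apprentice_score` are hypothetical names, not part of the original system.

```python
import heapq


def reevaluate_frontier(frontier, link_features, apprentice_score):
    """Rebuild the priority queue with apprentice scores.

    frontier: list of (-priority, url) heap entries.
    link_features: dict mapping url -> features of the link (u,v) that led to it.
    apprentice_score: callable mapping features -> score in [0, 1].
    """
    rescored = []
    for neg_priority, url in frontier:
        feats = link_features.get(url)
        # Keep the baseline priority when no features were recorded for the URL.
        priority = apprentice_score(feats) if feats is not None else -neg_priority
        rescored.append((-priority, url))
    heapq.heapify(rescored)
    return rescored
```

After this step the crawl continues as before, except that newly found outlinks are enqueued with the apprentice's score rather than Pr(c*|u).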
Summary
New generation of focused crawler
• Discriminates between links, learning online
Apprentice easy and fast to train online
• Accurate with small d_max, around 4 to 6
• DOM-derived features better than text
Effective division of labor (‘what’ vs. ‘how’)
Loss rate reduced by 30 to 90%
• Apprentice better at guessing relevance of unvisited nodes than baseline crawler
• Benefits visible after 100 to 1000 page fetches
Ongoing work
Extending to larger radius and deeper DOM + site structure
Public domain C++ software
Crawler
• Asynchronous DNS, simple callback model
• Can saturate dedicated 4Mbps with a Pentium2
HTML cleaner
• Simple, customizable, table-driven patch logic
• Robust to bad HTML, no crash or memory leak
HTML to DOM converter
• Extensible DOM node class