Keywords Selection Problem in Hidden Web Crawling
Ka Cheung Sia, Richard
March 15 2004
Agenda
What is Hidden Web?
How to crawl the Hidden Web?
Problem formalization
Searching for “best” keyword
  Greedy
  Tree searching
  Pruning
Experiments & results
Conclusion
What is Hidden Web?
Hidden
  Unreachable by following hyperlinks
  Dynamically generated
  Accessible only through a search interface
Informative
Examples
  http://citeseer.ist.psu.edu/ – CS research papers
  http://www.pubmed.org – medical research papers
  http://catalog.loc.gov – Library of Congress
What is Hidden Web? Search interface
http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
What is Hidden Web? Result
What is Hidden Web? Document
How to crawl the Hidden Web
http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
[Diagram labels: Figure out a keyword (our task); Query; Hidden Web; Result]
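The crawl step above amounts to turning a chosen keyword into a search URL. A minimal sketch, assuming the three parameters visible in the example URL (`q`, `submit`, `cs`) are all the interface needs; the helper name `build_query_url` is hypothetical:

```python
from urllib.parse import urlencode

# Search endpoint taken from the example URL on the slide.
BASE_URL = "http://citeseer.ist.psu.edu/cis"

def build_query_url(keywords):
    """Return the search URL for a list of query words (hypothetical helper)."""
    params = {
        "q": " ".join(keywords),       # spaces are encoded as '+'
        "submit": "Search Documents",
        "cs": "1",
    }
    return BASE_URL + "?" + urlencode(params)

url = build_query_url(["heuristic", "search"])
print(url)
```

The crawler would fetch this URL, download the result documents, and then decide on the next keyword.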
Problem formalization
Set-cover
  Vertices – documents
  Hyper-edges – query words
Goal
  Maximize the number of unique documents retrieved with the minimum number of query words
Problem formalization
P(qi)
  portion of unique documents retrieved by issuing query word qi (portion of documents containing “qi”)
P(qi v qj)
  portion of unique documents retrieved by issuing query words qi and qj (portion of documents containing qi or qj)
P(qi | qj)
  portion of documents containing qi in the set of documents retrieved by issuing query word qj
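The three quantities above can be computed directly when the whole collection is available. A small sketch on a made-up inverted index (the words, document ids, and function names are illustrative assumptions, not data from the experiments):

```python
# Toy inverted index: query word -> ids of documents containing it (made-up data).
INDEX = {
    "linux":  {1, 2, 3, 4},
    "debian": {1, 2},
    "redhat": {3, 5},
}
N_DOCS = 6  # documents 1..6; document 6 contains none of the words

def retrieved_by(*words):
    """Documents returned when every word in `words` is issued as a query."""
    return set().union(*(INDEX[w] for w in words))

def p(word):
    """P(qi): portion of the collection containing `word`."""
    return len(INDEX[word]) / N_DOCS

def p_or(*words):
    """P(qi v qj v ...): portion of the collection containing any of the words."""
    return len(retrieved_by(*words)) / N_DOCS

def p_given(word, *issued):
    """P(qi | issued): portion of the already retrieved documents containing `word`."""
    got = retrieved_by(*issued)
    return len(INDEX[word] & got) / len(got)

print(p("linux"), p_or("debian", "redhat"), p_given("linux", "debian", "redhat"))
```

A real crawler can only evaluate `p_given`, since it sees just the documents it has already downloaded.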
Problem formalization
What is the next “best” query word?
P((q1 v … v qi-1) v qi)
  = P(q1 v … v qi-1) + P(qi) – P((q1 v … v qi-1) ^ qi)
  = P(q1 v … v qi-1) + P(qi) – P(q1 v … v qi-1) P(qi | q1 v … v qi-1)
P(q1 v … v qi-1) – known
P(qi | q1 v … v qi-1) – known
P(qi) – unknown
Approximate P(qi) using P(qi | q1 v … v qi-1)
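The expansion above can be checked numerically on a toy collection. The sets below are made up for illustration; `base` stands for the documents retrieved by q1 v … v qi-1 and `q_i` for the documents containing the candidate word:

```python
docs = set(range(10))        # whole collection (toy data)
base = {0, 1, 2, 3, 4}       # retrieved by q1 v ... v qi-1
q_i = {3, 4, 5, 6, 7}        # documents containing the candidate qi

n = len(docs)
p_base = len(base) / n                          # known to the crawler
p_qi_given_base = len(q_i & base) / len(base)   # known to the crawler
p_qi = len(q_i) / n                             # unknown in practice

lhs = len(base | q_i) / n
rhs = p_base + p_qi - p_base * p_qi_given_base

print(lhs, rhs)                 # both sides agree up to rounding
print(p_qi, p_qi_given_base)    # true value vs. the crawler's estimate
```

In this toy case the estimate P(qi | base) = 0.4 differs from the true P(qi) = 0.5, which shows that the approximation is only as good as the retrieved sample.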
Search for best query word
Greedy: choose the most frequently occurring word so far to be the next query
  Choose qi with maximum P(qi | q1 v … v qi-1)
For the set-cover problem, greedy is proven to come within a logarithmic factor of the optimal solution
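The greedy rule can be sketched as follows. This is an illustration on made-up documents, assuming a seed word is given and that `search` stands in for the site's query interface; it is not the implementation used in the experiments:

```python
from collections import Counter

# Toy collection: document id -> words it contains (made-up data).
DOCS = {
    1: {"linux", "debian"},
    2: {"linux", "redhat", "kernel"},
    3: {"linux", "kernel", "driver"},
    4: {"kernel", "driver"},
    5: {"driver", "usb"},
}

def search(word):
    """Stand-in for the search interface: ids of documents containing `word`."""
    return {d for d, words in DOCS.items() if word in words}

def greedy_crawl(seed, max_queries):
    issued, retrieved = [seed], search(seed)
    while len(issued) < max_queries:
        # Count words in the documents retrieved so far, skipping issued ones.
        counts = Counter(w for d in retrieved for w in DOCS[d] if w not in issued)
        if not counts:
            break
        # Most frequent word so far; ties are broken alphabetically.
        best = min(counts, key=lambda w: (-counts[w], w))
        issued.append(best)
        retrieved |= search(best)
    return issued, retrieved

issued, retrieved = greedy_crawl("linux", 3)
print(issued, sorted(retrieved))
```

Maximizing the count within the retrieved documents is the same as maximizing P(qi | q1 v … v qi-1), since the denominator is identical for every candidate.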
Search for best query word
Can we do better?
Intuition
  Correlation of keywords
  E.g. linux – debian, redhat, suse, knoppix, fedora, etc.
  We might save the query word “linux”!
Search for best query word
Whole document collection
Already retrieved documents
Documents retrieved by qi
Documents retrieved by qj
Documents retrieved by qk
Search for best query word
linux
debian
redhat
f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlap between “redhat, linux”, “debian, linux”, and “redhat, debian”
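A short sketch of f(x) as described: the sum of the result sizes minus the pairwise overlaps. The result sets are made up, and the function subtracts only pairwise overlaps, exactly as the slide states:

```python
from itertools import combinations

# Made-up result sets for the three queries.
linux = {1, 2, 3, 4}
debian = {1, 2, 5}
redhat = {3, 6}

def f(result_sets):
    """Sum of result sizes minus every pairwise overlap."""
    total = sum(len(s) for s in result_sets)
    pairwise = sum(len(a & b) for a, b in combinations(result_sets, 2))
    return total - pairwise

score = f([linux, debian, redhat])
exact = len(linux | debian | redhat)
print(score, exact)
```

Here no document is returned by all three queries, so the score equals the exact number of unique documents. When a document appears in all three result sets, subtracting only pairwise overlaps undercounts by the size of the triple overlap.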
Search for best query word
The search tree is huge (branching factor)
  We look ahead only at the 10 most frequent keywords
  We only search up to depth 6
Pruning
Search for best query word
DFBnB (depth-first branch and bound)
  Prune any sub-tree where the sum of documents retrieved, assuming no overlap between keywords, is less than the current best solution
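The pruning rule can be sketched as a small depth-first branch and bound. This is a simplified illustration, not the experimental code: it assumes a fixed, made-up inverted index and searches over keyword sets rather than ordered sequences; the names `dfbnb` and `visit` are hypothetical:

```python
def dfbnb(index, depth):
    """Return (best_count, best_words, nodes_examined) for keyword sets of size <= depth."""
    # Expand the most frequent keywords first so a good solution is found early.
    order = sorted(index, key=lambda w: -len(index[w]))
    best_count, best_words, nodes = 0, (), 0

    def visit(start, chosen, covered):
        nonlocal best_count, best_words, nodes
        nodes += 1
        if len(covered) > best_count:
            best_count, best_words = len(covered), chosen
        remaining = depth - len(chosen)
        if remaining == 0:
            return
        # Optimistic bound: the largest remaining keywords add all their
        # documents, i.e. no overlap with each other or with `covered`.
        bound = len(covered) + sum(len(index[w]) for w in order[start:start + remaining])
        if bound <= best_count:
            return  # this sub-tree cannot beat the current best: prune it
        for i in range(start, len(order)):
            visit(i + 1, chosen + (order[i],), covered | index[order[i]])

    visit(0, (), set())
    return best_count, best_words, nodes

# Made-up inverted index: keyword -> ids of documents containing it.
INDEX = {
    "linux": {1, 2, 3, 4},
    "debian": {1, 2},
    "redhat": {3, 4},
    "usb": {5},
    "driver": {5, 6},
}
best_count, best_words, nodes = dfbnb(INDEX, depth=2)
print(best_count, best_words, nodes)
```

Because keywords are sorted by frequency, the first few entries of `order[start:]` are the largest still available, so the bound never underestimates what a sub-tree can reach.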
Experiment
Document collection: ~100K front pages of randomly selected websites
Query interface: an inverted index (a program that returns documents containing the given query word)
Methods
  Greedy
  DFS search (look ahead for 10 words, up to depth 6)
  DFS search with pruning (DFBnB)
Results: Does searching help?
provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77
provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77
Results: Does searching help?
Results: How much does pruning save?
Without pruning – 187300 nodes are examined
  187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
With pruning – 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
DFBnB examines roughly 30 times fewer nodes
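The node counts above can be reproduced with a few lines of arithmetic. The sketch takes the look-ahead of 10 keywords, the depth of 6, and the 5558-node average from the slides; the variable names are illustrative:

```python
from math import perm

LOOKAHEAD, DEPTH = 10, 6

# Level d of the tree has 10 * 9 * ... * (10 - d + 1) nodes, i.e. perm(10, d).
nodes_without_pruning = sum(perm(LOOKAHEAD, d) for d in range(1, DEPTH + 1))
nodes_with_pruning = 5558  # average reported on the slide

ratio = nodes_without_pruning / nodes_with_pruning
print(nodes_without_pruning, round(ratio, 1))
```

The ratio comes out between 33 and 34, consistent with the "roughly 30 times" figure.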
Conclusion
Searching helps little “in this problem”
DFBnB is “really effective” in pruning the search tree
End
More results: prior information helps
Results: Search & Greedy
Results: Search with pruning & Greedy
Search for best query word
base = q1 v … v qi
P(base v qi+1 v qi+2)
  = P(base v qi+1) + P(qi+2) – P((base v qi+1) ^ qi+2)
P((base v qi+1) ^ qi+2)
  = P((base ^ qi+2) v (qi+1 ^ qi+2))
  = P(base ^ qi+2) + P(qi+1 ^ qi+2) – P(base ^ qi+1 ^ qi+2)
2 words overlapping
3 words overlapping
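The two-step look-ahead expansion can be verified on made-up sets. In this sketch `base` is the set of documents retrieved by q1 v … v qi, and `q1`, `q2` stand for the documents containing qi+1 and qi+2; all data and names are illustrative:

```python
N = 12                           # size of the toy collection
base = {0, 1, 2, 3, 4, 5}        # retrieved by q1 v ... v qi
q1 = {4, 5, 6, 7}                # documents containing qi+1
q2 = {5, 7, 8, 9}                # documents containing qi+2

def P(docs):
    """Portion of the collection covered by `docs`."""
    return len(docs) / N

# P((base v qi+1) ^ qi+2) expanded into pairwise and triple overlaps.
overlap = P(base & q2) + P(q1 & q2) - P(base & q1 & q2)

lhs = P(base | q1 | q2)
rhs = P(base | q1) + P(q2) - overlap
print(lhs, rhs)
```

Both sides give 10/12 up to rounding, so the pairwise and triple overlap terms account for every document counted more than once.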