(C) 2003, The University of Michigan 1
Information Retrieval
Handout #7
March 24, 2003
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M&F 11-12
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
Schedule
• Readings for 03/31:
– Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
– Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
– Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002
Schedule
• March 24
– The link-content hypothesis
– XML retrieval
• March 31
– Information extraction
– Language reuse
• April 7
– Language modeling for IR
– The Lemur system
Schedule
• HW3 assigned 03/24
• HW3 due 04/07
• Final projects due 04/11
• Final project presentations 04/14
• Final exam 04/21: 2-3 essay questions, 2-3 problems
The link-content hypothesis
Kleinberg and Lawrence, The structure of the Web - Science 294 1849-1850
Web structure
Web structure
• 16-20 links on average
• The fraction of pages with n in-links is approximately $n^{-\gamma}$ for $\gamma \approx 2.1$
• Kleinberg/Lawrence: 100,000 coherent communities (e.g., people concerned with oil spills off the coast of Japan)
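To get a feel for the in-link power law, the sketch below compares the fraction of pages with n1 in-links to the fraction with n2 in-links, assuming the fraction is proportional to n raised to the power -2.1. The function name `inlink_fraction_ratio` is hypothetical and used only for illustration.

```python
def inlink_fraction_ratio(n1: int, n2: int, exponent: float = 2.1) -> float:
    """Ratio fraction(n1) / fraction(n2), assuming fraction(n) is
    proportional to n ** -exponent (exponent 2.1 is taken from the slide)."""
    return (n1 / n2) ** -exponent


# Pages with 10 in-links versus pages with 20 in-links.
ratio = inlink_fraction_ratio(10, 20)
print(f"Pages with 10 in-links are about {ratio:.2f}x as common as pages with 20")
```

Under this assumption, doubling the number of in-links cuts the fraction of pages by a factor of 2 ** 2.1, regardless of the starting value of n.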
Topical locality [Davison 00]
• Most web pages are linked to others with related content - this helps users navigate the Web.
• Presence of topical locality - important for building focused crawlers.
• Traditionally search engines only indexed titles and/or the first few lines of each document. Now, they index all links.
• “More evil than Satan himself”
Experimental design
• Local crawl of 100,000 pages
• Starts from HotBot and AltaVista
• Biased towards English-language pages
• Retrieve one outgoing link from each page.
TFIDF cosine similarity
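$$\mathrm{TFIDF}(w, P) = \frac{\mathrm{TF}(w, P) \cdot \mathrm{IDF}(w)}{\sqrt{\sum_{\text{all } i} \left(\mathrm{TF}(w_i, P) \cdot \mathrm{IDF}(w_i)\right)^2}}$$

$$\mathrm{IDF}(w) = \log \frac{n + 1}{\mathrm{DF}(w)}$$

$$\mathrm{TF}(w, P) = \log(\mathrm{count}(w, P) + 1)$$

$$\mathrm{TFIDF\_Cos}(Q, P) = \frac{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q) \cdot \mathrm{TFIDF}(w, P)}{\sqrt{\sum_{\text{all } w} \mathrm{TFIDF}(w, Q)^2 \cdot \sum_{\text{all } w} \mathrm{TFIDF}(w, P)^2}}$$

Here n is the number of documents, DF(w) is the number of documents containing w, and count(w, P) is the number of times w occurs in P.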
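A minimal sketch of TFIDF cosine similarity, assuming TF(w, P) = log(count of w in P + 1), IDF(w) = log((n + 1) / DF(w)), and length-normalized TFIDF weights. The helper names and the three toy documents are illustrative assumptions, not part of the original experiments. The log base does not matter here because it cancels in the normalization.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Length-normalized TFIDF weights for one tokenized document."""
    raw = {}
    for word, count in Counter(tokens).items():
        tf = math.log(count + 1)
        idf = math.log((n_docs + 1) / doc_freq[word])
        raw[word] = tf * idf
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm > 0 else raw


def tfidf_cosine(q_vec, p_vec):
    """Cosine of the angle between two sparse TFIDF vectors."""
    dot = sum(v * p_vec.get(w, 0.0) for w, v in q_vec.items())
    norm_q = math.sqrt(sum(v * v for v in q_vec.values()))
    norm_p = math.sqrt(sum(v * v for v in p_vec.values()))
    if norm_q == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_q * norm_p)


# Toy collection (assumed data): two Web-related pages and one unrelated page.
docs = [
    ["web", "search", "engine"],
    ["web", "crawler", "links"],
    ["cooking", "recipes"],
]
doc_freq = Counter(w for d in docs for w in set(d))
vecs = [tfidf_vector(d, doc_freq, len(docs)) for d in docs]

print(tfidf_cosine(vecs[0], vecs[1]))  # related pages share "web"
print(tfidf_cosine(vecs[0], vecs[2]))  # no shared terms
```

Pages that share no terms get similarity 0, and a page compared with itself gets similarity 1 (up to rounding).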
Other metrics
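$$\mathrm{Fract}(w, Q) = \frac{\#\mathrm{times}(w, Q)}{\#\mathrm{terms}(Q)}$$

Query term probability

$$\mathrm{Prob}(Q, P) = \sum_{\text{all } w} \begin{cases} \mathrm{Fract}(w, Q), & \text{if } w \in P \\ 0, & \text{otherwise} \end{cases}$$

Query-document overlap

$$\mathrm{Overlap}(Q, P) = \sum_{\text{all } w} \min(\mathrm{Fract}(w, P), \mathrm{Fract}(w, Q))$$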
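The two metrics can be sketched as follows, assuming Fract(w, Q) is the share of Q's term occurrences that are w, query term probability sums Fract(w, Q) over the words of Q that also occur in P, and overlap sums the smaller of the two fractions. The function names and the toy token lists are assumptions for illustration.

```python
from collections import Counter


def fract(tokens):
    """Map each word to (# times it appears) / (# terms in the text)."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).items()}


def query_term_probability(q_tokens, p_tokens):
    """Sum of Fract(w, Q) over words w of Q that also occur in P."""
    p_words = set(p_tokens)
    return sum(f for w, f in fract(q_tokens).items() if w in p_words)


def overlap(q_tokens, p_tokens):
    """Sum over all words of min(Fract(w, P), Fract(w, Q))."""
    fq, fp = fract(q_tokens), fract(p_tokens)
    return sum(min(f, fp.get(w, 0.0)) for w, f in fq.items())


# Toy texts (assumed data).
Q = ["web", "search", "web", "crawler"]
P = ["web", "pages", "search", "index"]

print(query_term_probability(Q, P))
print(overlap(Q, P))
```

Query term probability is asymmetric in Q and P, while overlap gives the same value if the two texts are swapped.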
Experimental results
• 100,000 URLs but only 89,891 retrievable
• An additional 111,107 URLs: two children per initial page
• www.geocities.com (561), www.webring.com (419), www.amazon.com (303), etc.
• 18% top-level pages
• 50% .com, 27% .edu
Textual similarity
• TFIDF similarity
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
Structure and content [Menczer 01]
• Cluster hypothesis (van Rijsbergen 79)
• Link-cluster conjecture (Menczer) - preservation of semantics across links
Experimental design
• Open directory project (dmoz.org)
• 896,233 URLs from 97,614 topics
• 150,000 URLs from 47,174 topics
• 10,000 from each of the 15 top-level branches
Measures of similarity
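• Cosine
• Link similarity

$$\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}$$

• Semantic similarity

$$\sigma_s(c_1, c_2) = \frac{2 \log \Pr[\mathrm{lca}(c_1, c_2)]}{\log \Pr[c_1] + \log \Pr[c_2]}$$

(Figure: topic tree showing classes c1 and c2 and their lowest common ancestor, lca.)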
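A minimal sketch of the link and semantic similarity measures, assuming link similarity is the overlap of two pages' URL sets divided by their union, and semantic similarity is 2 log Pr[lca(c1, c2)] / (log Pr[c1] + log Pr[c2]) over topic probabilities. The function names, the URL sets, and the probabilities below are made-up illustrations.

```python
import math


def link_similarity(urls_p1: set, urls_p2: set) -> float:
    """Size of the intersection of two URL sets over the size of their union."""
    union = urls_p1 | urls_p2
    if not union:
        return 0.0
    return len(urls_p1 & urls_p2) / len(union)


def semantic_similarity(pr_lca: float, pr_c1: float, pr_c2: float) -> float:
    """Tree-based similarity from topic probabilities, each in (0, 1]."""
    denom = math.log(pr_c1) + math.log(pr_c2)
    if denom == 0:
        # Both topics have probability 1, i.e. both are the root topic.
        return 1.0
    return 2 * math.log(pr_lca) / denom


# Assumed example values, not taken from the experiments.
print(link_similarity({"a", "b", "c"}, {"b", "c", "d"}))
print(semantic_similarity(pr_lca=0.1, pr_c1=0.01, pr_c2=0.02))
```

When c1 and c2 are the same topic, the lowest common ancestor is that topic and the semantic similarity is 1. The more general the common ancestor, the closer Pr[lca] is to 1 and the closer the similarity is to 0.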
Correlations between similarities
• Over $3.84 \times 10^9$ pairs
• Highest for News and Home ( > 0.2)
• Lowest for Arts and Games ( < 0.05)
Fit
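$$\sigma_c(\delta) \approx \sigma_\infty + (1 - \sigma_\infty)\, e^{-\alpha_1 \delta^{\alpha_2}}$$

$\alpha_1 = 1.8$, $\alpha_2 = 0.6$, $\sigma_\infty = 0.03$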
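To see how quickly such a fit decays, the sketch below evaluates it under the assumption that it has the form sigma_inf + (1 - sigma_inf) * exp(-alpha1 * delta ** alpha2), with the parameter values 1.8, 0.6, and 0.03 from the slide. The function name `fitted_similarity` and the sample values of delta are illustrative assumptions.

```python
import math


def fitted_similarity(delta: float, sigma_inf: float = 0.03,
                      alpha1: float = 1.8, alpha2: float = 0.6) -> float:
    """Assumed exponential-decay fit; delta must be non-negative."""
    return sigma_inf + (1 - sigma_inf) * math.exp(-alpha1 * delta ** alpha2)


for delta in (0, 1, 2, 5, 10):
    print(delta, round(fitted_similarity(delta), 4))
```

Under this form the value starts at 1 for delta = 0 and levels off at the background value sigma_inf as delta grows.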
Document closures for Q&A
(Figure: P L P document closure diagram; labels: spain, capital, Madrid.)
Document closures for IR
(Figure: P L P document closure diagram; labels: Physics, Physics Department, University of Michigan, Michigan.)
The perltree experiments
• 23.6% of the Excite log (2.5 M queries)
– 60% have both words in WordNet
– 27% have one word in WordNet
– 13% have no words in WordNet
• 200 queries from the log
• 200 random queries
Two-word queries
• jimi SAT, seats david, caesar poker, cruise yellow, science Tishara, trim yankee, witnesses naked, swaybar cheats, rides Precious, drugs university, Clock engines, metal choreography, anthony swinging, psychoanalysis webdesign, pic lens
• toys online, speech therapy, Malcolm McDowell, cellular accessories, migrant farmworkers, witch tv, davis instruments, Adult Games, chichen itza, freighter Cruises, used motorcycles, feng shui, revolucion mexicana, zeebrugee belgium, electronic greetings
Query analysis
• Words:
– Familiarity
– Ambiguity
– IDF
• Queries:
– GoogleSize
– SemDist
– DistribSim
Query analysis
            Fam1  Fam2  Amb1  Amb2  IDF1  IDF2  Gsize    SemD  DistS
Excite (E)  1.42  1.89  1.70  2.36  4.00  4.74  670,000  0.39  0.06
Random (R)  1.54  1.61  2.06  2.29  4.40  4.55  329,000  0.29  0.02
Link-based language models
• WT2g corpus
• 247,491 pages
• 3,118,248 links
• 948,036 unique words
A - number of documents that contain the word and are in the collection
total documents in the collection = 246379
B - number of all the outgoing links
C - number of outgoing links that are in the collection
D - number of outgoing links that are not in the collection
E - number of outgoing links that are in the collection and contain the word
F - number of outgoing links that are in the collection and do not contain the word
#   familiarity  polysemy  old idf  word  new idf   A       B       C       D       E       F      p=A/total  p'=E/C    p'/p     df recovered from idf
3   n/a          n/a       0.79     the   0.78968   213800  871953  160935  711018  146913  14022  0.867769   0.912872  1.05198  212839
4   n/a          n/a       0.8      of    0.80112   211132  889967  161670  728297  144685  16985  0.85694    0.89494   1.04434  210183
5   n/a          n/a       0.81     to    0.809069  209307  867918  157538  710380  142058  15480  0.849533   0.901738  1.06145  208367
7   n/a          n/a       0.83     and   0.83064   204471  855610  157059  698551  138221  18838  0.829904   0.880058  1.06043  203552
8   3            4         0.86     a     0.862427  197638  832433  149540  682893  126968  22572  0.802171   0.849057  1.05845  196750
9   2            2         0.86     in    0.875135  194999  821104  145786  675318  123851  21935  0.791459   0.84954   1.07338  194123
11  n/a          n/a       0.91     for   0.911742  187675  804872  144882  659990  119299  25583  0.761733   0.823422  1.08098  186832
1   n/a          n/a       0.55     on    0.952471  179979  809814  141024  668790  112477  28547  0.730497   0.797573  1.09182  179170
12  n/a          n/a       1.01     is    1.009982  169847  692395  122096  570299  94133   27963  0.689373   0.770975  1.11837  169084
13  n/a          n/a       1.05     by    1.049723  163302  733707  122178  611529  93417   28761  0.662808   0.764598  1.15357  162568
14  n/a          n/a       1.12     with  1.122566  152169  687696  113973  573723  79774   34199  0.617622   0.699938  1.13328  151485
15  n/a          n/a       1.15     thi   1.151407  148043  676925  106199  570726  73402   32797  0.600875   0.691174  1.15028  147378
16  1            1         1.15     ar    1.16283   146450  636927  109257  527670  74169   35088  0.594409   0.678849  1.14206  145792
19  n/a          n/a       1.18     from  1.177686  144412  678667  114326  564341  78700   35626  0.586138   0.688382  1.17444  143763
10  1            1         0.88     be    1.185611  143340  632562  100816  531746  68522   32294  0.581787   0.679674  1.16825  142696
20  2            2         1.19     at    1.196502  141884  667875  108091  559784  73134   34957  0.575877   0.676597  1.1749   141247
21  n/a          n/a       1.2      or    1.208855  140256  659173  103814  555359  68726   35088  0.569269   0.662011  1.16291  139626
22  n/a          n/a       1.21     that  1.213056  139708  607946  95101   512845  63810   31291  0.567045   0.670971  1.18328  139080
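The derived columns can be recomputed from the counts. The sketch below does this for the row for "the", using p = A / total, p' = E / C, and the ratio p' / p as defined in the column headers. The function name `link_ratio` is a hypothetical label for illustration.

```python
TOTAL_DOCS = 246379  # total documents in the collection, from the legend


def link_ratio(a: int, c: int, e: int, total: int = TOTAL_DOCS):
    """Return (p, p', p'/p) for one word.

    a: documents in the collection that contain the word
    c: outgoing links that are in the collection
    e: outgoing links in the collection that contain the word
    """
    p = a / total
    p_linked = e / c
    return p, p_linked, p_linked / p


# Counts for the word "the", copied from the table row.
p, p_linked, ratio = link_ratio(a=213800, c=160935, e=146913)
print(f"p = {p:.6f}, p' = {p_linked:.6f}, p'/p = {ratio:.5f}")
```

A ratio above 1 means the word is more likely to appear in a page reached by an in-collection link from a page containing it than in a page drawn from the whole collection.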
Procedure
• Given a query q1 q2
– Get top 50 hits from AltaVista (A)
– Extract links that contain q1 or q2
– Get pages that are linked (B)
– Extract links from A ∪ B that point to A ∪ B
– Index A ∪ B using glimpse
– Compute link fertility
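The first steps of the procedure can be sketched on a toy in-memory link graph. This assumes that "links that contain q1 or q2" means links whose anchor text contains a query word. The page identifiers, the `links` structure, and the function name `expand_hits` are hypothetical. The search-engine call, the glimpse indexing, and the link fertility computation are not modeled.

```python
def expand_hits(query_terms, top_hits, links):
    """Build A, B, and the links internal to A ∪ B.

    links maps a page to a list of (anchor_text, target_page) pairs.
    """
    a = set(top_hits)
    b = set()
    for page in a:
        for anchor, target in links.get(page, []):
            words = anchor.lower().split()
            if any(term in words for term in query_terms):
                b.add(target)
    b -= a  # keep only pages that were not already among the top hits
    closure = a | b
    internal = sorted(
        (src, dst)
        for src in closure
        for _, dst in links.get(src, [])
        if dst in closure
    )
    return a, b, internal


# Toy link graph (assumed data).
links = {
    "a1": [("Spain travel guide", "b1"), ("contact", "x1")],
    "a2": [("capital cities", "b2"), ("Spain news", "a1")],
    "b1": [("home", "a1")],
    "b2": [("weather", "x2")],
}
a, b, internal = expand_hits(["spain", "capital"], ["a1", "a2"], links)
print(sorted(b))
print(internal)
```

In a real run, A would come from the search engine's top 50 hits and the pages in B would have to be fetched before their links could be extracted.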
Results
• New links pointing to pages that were not in the AltaVista top 50
– E = +11.7%, R = +8.9%
• Improvements higher for
– rarer words
– lower distributional similarity
– lower semantic distance
Topic distillation [Chakrabarti et al. 01]
• Topic drift
• Returning snippets rather than full documents
• Clique attacks (www.411fun.com, www.411fashion.com, www.411loans.com)