Sigir’99
Basic Architectures: Search
[Architecture diagram. Component labels: Web, Spider, Index, SE (three instances), Log, Browser. Annotations: 800M pages?, 20M queries/day, 24x7, Spam, Freshness, Quality results.]
Query Language
Augmented vector space: relevance-scored results
Tf, idf weighting
Boolean constraints: +, -
Phrases: “”
Fields: e.g. title:
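A minimal sketch of how tf-idf relevance scoring and +/- constraints could fit together. The toy corpus, the function names, and the specific weighting (raw term frequency times log(N/df)) are assumptions for illustration; the slides do not say which tf-idf variant is used.

```python
import math
from collections import Counter

# Hypothetical toy corpus; ids and texts are illustrative only.
DOCS = {
    "d1": "web search engines index web pages",
    "d2": "information retrieval on the web",
    "d3": "cooking recipes and kitchen tips",
}

TERM_COUNTS = {doc_id: Counter(text.lower().split()) for doc_id, text in DOCS.items()}


def idf(term):
    """Inverse document frequency log(N / df); 0.0 if the term is unseen."""
    df = sum(1 for counts in TERM_COUNTS.values() if term in counts)
    return math.log(len(DOCS) / df) if df else 0.0


def score(doc_id, terms):
    """Sum of tf * idf over the query terms."""
    counts = TERM_COUNTS[doc_id]
    return sum(counts[t] * idf(t) for t in terms)


def search(optional, required=(), excluded=()):
    """Rank documents; `required` models +term, `excluded` models -term."""
    results = []
    for doc_id, counts in TERM_COUNTS.items():
        if any(t not in counts for t in required):
            continue
        if any(t in counts for t in excluded):
            continue
        s = score(doc_id, list(optional) + list(required))
        if s > 0:
            results.append((doc_id, s))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

For example, `search(["web", "search"])` ranks `d1` ahead of `d2`, while `search(["web"], excluded=["search"])` drops `d1` entirely.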
Does Word Order Matter?
Try “information retrieval” versus “retrieval information”
Do you get the same results?
The query parser
Interprets query syntax: +, -, “” (rarely used)
Generates a query from free text (critical for precision)
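A small sketch of a parser for the syntax listed above (+, -, quoted phrases, field prefixes). The clause dictionary layout and the name `parse_query` are hypothetical, and straight double quotes are assumed. Phrase terms keep their order, which is one way word order can affect results.

```python
import re

# Optional +/- sign, optional field prefix, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-])?(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query string into clauses with an operator, field, and terms."""
    clauses = []
    for sign, field, phrase, word in TOKEN.findall(query):
        text = phrase if phrase else word
        clauses.append({
            "op": {"+": "must", "-": "must_not"}.get(sign, "should"),
            "field": field or None,
            "terms": text.lower().split(),
            "phrase": bool(phrase),
        })
    return clauses
```

Free text with no operators simply yields a list of `should` clauses, which a ranking stage could then tighten (for instance by preferring documents where the terms appear close together).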
Precision Enhancement
Phrase induction: all terms, the closer the better
URL and title matching
Site clustering: group URLs from the same site
Quality-based reranking
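Site clustering can be sketched as grouping a ranked result list by host name. Treating the URL's host as the "site" is an assumption here, and `cluster_by_site` is a hypothetical name; a production engine might use a coarser or finer notion of site.

```python
from urllib.parse import urlsplit


def cluster_by_site(ranked_urls):
    """Group URLs by host, keeping sites in order of their best-ranked URL."""
    clusters = {}
    for url in ranked_urls:
        host = (urlsplit(url).hostname or "").lower()
        clusters.setdefault(host, []).append(url)
    return list(clusters.items())
```

Because dictionaries preserve insertion order, each site appears at the position of its highest-ranked page, with its other pages folded in underneath.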
Link Analysis
Authors vote via links: pages with more inlinks are higher quality
Not all links are equal: links from higher-quality sites are better
Links in context are better
Resistant to spam: only cross-site links considered
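A minimal sketch of counting only cross-site links as votes. Equating "site" with host name and counting each linking site at most once per target are both assumptions made for the example, not details given in the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Host name of a URL, lowercased (assumed to identify the site)."""
    return (urlsplit(url).hostname or "").lower()


def cross_site_inlinks(links):
    """links: iterable of (source_url, target_url) pairs.

    Returns target_url -> number of distinct other sites linking to it.
    Links within the same site are ignored.
    """
    voters = defaultdict(set)
    for src, dst in links:
        if site_of(src) != site_of(dst):
            voters[dst].add(site_of(src))
    return {dst: len(sites) for dst, sites in voters.items()}
```

Under these assumptions, a site cannot raise its own pages by linking to itself, and many links from one external site count as a single vote.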
Page Rank (Page’98)
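Limiting distribution of a random walk
Jump to a random page with probability $\epsilon$; follow a link with probability $1 - \epsilon$
Probability of landing at a page D: $P(D) = \epsilon / T + (1 - \epsilon) \sum_{C \to D} P(C) / L(C)$
Sum over pages C leading to D
T = total number of pages; L(C) = number of links on page C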
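The random walk can be sketched as a power iteration. Here `eps` is the random-jump probability, and its value of 0.15, the iteration count, and the uniform treatment of pages without out-links are assumptions for illustration rather than anything stated in the slides.

```python
def pagerank(out_links, eps=0.15, iterations=50):
    """out_links: page -> list of pages it links to. Returns page -> probability."""
    pages = sorted(set(out_links) | {d for targets in out_links.values() for d in targets})
    total = len(pages)
    rank = {p: 1.0 / total for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share eps / T.
        new_rank = {p: eps / total for p in pages}
        for c in pages:
            targets = out_links.get(c, [])
            if targets:
                # Page c passes (1 - eps) * P(c) / L(c) along each of its links.
                share = (1 - eps) * rank[c] / len(targets)
                for d in targets:
                    new_rank[d] += share
            else:
                # Assumption: a page with no out-links spreads its mass uniformly.
                for d in pages:
                    new_rank[d] += (1 - eps) * rank[c] / total
        rank = new_rank
    return rank
```

The returned values sum to 1, so they can be read directly as the probability of the walk being at each page.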
HITS (Kleinberg’98)
Hubs: pages that point to many good pages
Authorities: pages pointed to by many good pages
Operates over a vicinity graph: pages relevant to a query
Refined by the IBM Clever group: further contextualization
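A compact sketch of the hub and authority iteration over a small link graph. The edge list stands in for the vicinity graph of a query; the fixed iteration count and L2 normalization are common choices assumed here, and the Clever refinements are not modeled.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, iterations=20):
    """edges: list of (source, target) links. Returns (hub, authority) score dicts."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing at it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = _normalize(auth)
        # Hub: sum of authority scores of the pages it points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = _normalize(new_hub)
    return hub, auth
```

On a graph where `h1` links to both `a1` and `a2` while `h2` and `h3` link only to `a1`, this gives `a1` the top authority score and `h1` the top hub score.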
Hyperlink Vector Voting (Li’97)
Index documents by in-link anchor texts: follow links backward
Can be both precision and recall enhancing (the “evil empire”)
How to combine with standard ranking? Relative weight is a tuning issue
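A rough sketch of indexing pages by the anchor text of links that point at them, then blending that signal with a content score. The function names, the example URLs, and the linear blend with `anchor_weight` are assumptions; the slides only say the relative weight is a tuning issue.

```python
from collections import defaultdict


def build_anchor_index(links):
    """links: iterable of (source_url, anchor_text, target_url).

    Returns word -> {target_url: number of in-link anchors containing the word}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _src, anchor, target in links:
        for word in anchor.lower().split():
            index[word][target] += 1
    return index


def anchor_score(index, query, url):
    """Count anchor-text votes for `url` over the words of `query`."""
    return sum(index[w].get(url, 0) for w in query.lower().split() if w in index)


def combined_score(content_score, anchor_votes, anchor_weight=0.5):
    """Linear blend; anchor_weight is the tunable relative weight (assumed form)."""
    return (1 - anchor_weight) * content_score + anchor_weight * anchor_votes
```

A page can score for query words that never occur in its own text, which is where the recall gain comes from.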
Evaluation
No industry-standard benchmark: evaluations are qualitative
Excessive claims abound
Press is not discerning
Shifting target: indices change daily
Cross-engine comparison elusive
Complexity Analysis
Search is both CPU and I/O intensive
I/O to access postings (random access)
CPU to compute scores
Caching strategies are very effective: the term cache has a 40% hit rate
Expensive queries are long and loaded with rare terms
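A term cache can be sketched as a small in-memory map in front of the postings store, so repeated query terms skip the disk read. The LRU eviction policy, the class name `TermCache`, and the `fetch_postings` callback are assumptions; the slides report only the hit rate.

```python
from collections import OrderedDict


class TermCache:
    """LRU cache of postings lists keyed by term."""

    def __init__(self, fetch_postings, capacity):
        self._fetch = fetch_postings  # called on a miss, e.g. a disk read
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, term):
        if term in self._cache:
            self._cache.move_to_end(term)  # mark as most recently used
            self.hits += 1
            return self._cache[term]
        self.misses += 1
        postings = self._fetch(term)
        self._cache[term] = postings
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used term
        return postings

    def hit_rate(self):
        lookups = self.hits + self.misses
        return self.hits / lookups if lookups else 0.0
```

Every hit saves one random-access read of a postings list, which is why even a moderate hit rate matters for I/O-bound search.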
Complexity Analysis
CPU costs asymptotically constant, due to term truncation
I/O cost can be kept to one I/O per term, again due to truncation
Implies the bigger the better: no advantage to distributed search
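One reading of "term truncation", assumed here rather than confirmed by the slides, is that each term's postings list is capped at a fixed number of best-weighted entries. Under that assumption, the work per query depends on the cap and the number of query terms, not on index size. The names and the cap value are hypothetical.

```python
import heapq


def truncate_postings(postings, limit):
    """postings: term -> list of (doc_id, weight). Keep the `limit` heaviest entries."""
    return {
        term: heapq.nlargest(limit, plist, key=lambda entry: entry[1])
        for term, plist in postings.items()
    }


def score_query(index, terms):
    """Accumulate weights per document; also report how many entries were scanned."""
    scores = {}
    scanned = 0
    for term in terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
            scanned += 1
    return scores, scanned
```

With a cap of 100, a two-term query scans at most 200 entries whether the untruncated lists hold a thousand documents or a hundred million.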
The Economics of Big Indices
Very large indices require distributed search: easy scalability; maintenance
Practical hardware limitations
Implies Cost = Size * Throughput, since each half of a big index requires the same hardware to sustain the same throughput
Worse: queries needing a big index are hard to monetize
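The Cost = Size * Throughput relation can be illustrated with a back-of-envelope machine count, assuming the index is split into partitions that each see every query and each partition is replicated to carry the load. The per-machine capacities below are hypothetical parameters, not figures from the slides.

```python
import math


def machines_needed(index_pages, queries_per_sec, pages_per_machine, qps_per_machine):
    """Partitions (driven by index size) times replicas (driven by throughput)."""
    partitions = math.ceil(index_pages / pages_per_machine)
    replicas = math.ceil(queries_per_sec / qps_per_machine)
    return partitions * replicas
```

Doubling either the index size or the query rate doubles the machine count, and doubling both quadruples it.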
How to Have your Cake...
Layered search
Small, high-quality engine for common queries: low cost per query; high revenue per query
Large, low-throughput engine for rare queries: high cost per query, low revenue per query
Average query costs can be kept low while still offering comprehensiveness
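A minimal sketch of layered search: try the small engine first and fall back to the large one only when it returns too little. The fallback rule (fewer than `min_results` hits) and the cost formula are assumptions for illustration; the slides do not describe how queries are routed.

```python
def layered_search(query, small_engine, large_engine, min_results=10):
    """Return (results, layer). Engines are callables mapping a query to a result list."""
    results = small_engine(query)
    if len(results) >= min_results:
        return results, "small"
    return large_engine(query), "large"


def average_cost(small_fraction, small_cost, large_cost):
    """Expected cost per query when every query hits the small layer first
    and only the remaining fraction also goes to the large layer."""
    return small_cost + (1 - small_fraction) * large_cost
```

If 90% of queries are satisfied by a small layer costing 1 unit and the large layer costs 10 units, the average is about 2 units per query rather than 10.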
Novel Search Engines
Ask Jeeves: question answering
Directory for the Hidden Web
Direct Hit: direct popularity
Click stream mining