Information retrieval on the Web: Tools & algorithmic issues

Andrei Broder & Monika Henzinger, Compaq Systems Research Center

November 1998


What is this talk about?

• Topic:
  – Algorithms for retrieving information on the Web
• Non-Topics:
  – Algorithmic issues in classic information retrieval (IR), e.g. stemming
  – String algorithms, e.g. approximate matching
  – Other algorithmic issues related to the Web:
    • networking & routing
    • cacheing
    • security
    • e-commerce


Preliminaries

• Each Web page has a unique URL, e.g.
  http://theory.stanford.edu/~focs98/tutorials.html
  (parts of a URL: access protocol, host name = domain name, page name)
• (Hyper)link = pointer from one page to another; loads the second page if clicked on
• In this talk: document = (Web) page


Example of a query: princess diana

[Screenshots of results from three engines, annotated as: relevant and high quality; relevant but low quality; not relevant (index pollution).]


Outline

• Classic IR vs. Web IR
• Some IR tools specific to the Web
  – For each type: examples, algorithmic issues, open problems
  – Details on: ranking, duplicate elimination, measuring the Web
• Conclusions


Classic IR

• Input: document collection
• Goal: retrieve documents or text with information content that is relevant to the user's information need
• Two aspects:
  1. Processing the collection
  2. Processing queries (searching)
• Reference texts: SW'97, BF'92


Determining query results

"model" = strategy for determining which documents to return
• Logical model: string matches plus AND, OR, NOT
• Vector model:
  – Documents and query represented as vectors of terms
  – Vector entry i = weight of term i = function of frequencies within the document and within the collection
  – Similarity of document & query = cosine of the angle between their vectors
  – Query result: documents ordered by similarity
• Other models used in IR but not discussed here:
  – Probabilistic model
  – Cognitive model
  – …


IR on the Web

• Input: the publicly accessible Web
• Goal: retrieve high quality pages that are relevant to the user's need
  – Static pages (files: text, audio, …)
  – Dynamically generated on request: mostly data base access
• Two aspects:
  1. Processing and representing the collection
     • Gathering the static pages
     • "Learning" about the dynamic pages
  2. Processing queries (searching)


What's different about the Web? Pages

• Bulk: 350 M pages (7/98), growing at 20 M/month
• Lack of stability: estimated change rate of 1%/day to 1%/week
• Heterogeneity
  – Type of documents: text, pictures, audio, scripts, …
  – Quality: from dreck to FOCS papers …
  – Language: 100+
• Duplication
  – Syntactic: 30% (near) duplicates
  – Semantic: ??
• Non-running text: many home pages, bookmarks, ...
• High linkage: ≥ 8 links/page on average


Typical home page: non-running text


The big challenge

Meet the user needs given the heterogeneity of Web pages.


What's different about the Web? Users

• Make poor queries
  – Short (2.35 terms avg)
  – Imprecise terms
  – Sub-optimal syntax (80% of queries without operators)
  – Low effort
• Wide variance in
  – Needs
  – Expectations
  – Knowledge
  – Bandwidth
• Specific behavior
  – 85% look over one result screen only
  – 78% of queries are not modified
  – Follow links
  – See various user studies in CHI, Hypertext, SIGIR, etc.


The bigger challenge

Meet the user needs given the heterogeneity of Web pages and the poorly made queries.


Why don't the users get what they want?

Example:
  User need: I need to get rid of mice in the basement
  User request (verbalized): What's the best way to trap mice alive?
  Query to IR system: mouse trap
  Results: computer supplies, software, etc.

Translation problems at each step: polysemy, synonymy.


AltaVista output: mouse trap


AltaVista output: mice trap


The bright side: Web advantages vs. classic IR

Collection/tools:
• Redundancy
• Hyperlinks
• Statistics
  – Easy to gather
  – Large sample sizes
• Interactivity (make the users explain what they want)

User:
• Many tools available
• Personalization
• Timeliness
• Interactivity (refine the query if needed)


Quantifying the quality of results

• How to evaluate different strategies?
• How to compare different search engines?


Classic evaluation of IR systems

We start from a human-made relevance judgement for each (query, page) pair and compute:

• Precision: % of returned pages that are relevant.
• Recall: % of relevant pages that are returned.
• Precision at 10: % of top 10 pages that are relevant ("ranking quality").
• …

[Venn diagram: all pages, relevant pages, returned pages.]
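
As a small illustration (not from the slides), these measures can be computed directly from a ranked result list and a set of relevance judgements:

```python
def precision(returned, relevant):
    """Fraction of returned pages that are relevant."""
    return sum(1 for p in returned if p in relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of relevant pages that are returned."""
    return sum(1 for p in returned if p in relevant) / len(relevant)

def precision_at_k(returned, relevant, k=10):
    """Fraction of the top-k returned pages that are relevant ("ranking quality")."""
    top = returned[:k]
    return sum(1 for p in top if p in relevant) / len(top)

returned = ["d3", "d7", "d1", "d9"]      # ranked result list for one query
relevant = {"d1", "d2", "d3"}            # human relevance judgements for that query
print(precision(returned, relevant), recall(returned, relevant), precision_at_k(returned, relevant, k=3))
```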


Evaluation in the Web context

• Quality of pages varies widely
• Relevance is not enough
• We need both relevance and high quality = value of the page.


Web IR tools

• General-purpose search engines:
  – Direct: AltaVista, Excite, Infoseek, Lycos, HotBot, …
  – Indirect (meta-search): MetaCrawler, DogPile, AskJeeves, …
• Hierarchical directories: Yahoo!, all portals. (Database mostly built by hand.)
• Specialized search engines:
  – Home page finder: Ahoy
  – Shopping robots: Jango, Junglee, …
  – Applet finders


Web IR tools (cont...)

• Search-by-example: Netscape's "What's related", Excite's "More like this", …
• Collaborative filtering: Firefly, GAB, …
• Notification systems: clipping services
• Meta-information:
  – Web statistics: size, number of domains, geography, …
  – Usage statistics
  – Query log statistics
  – …


General purpose search engines

• Search engines' components:
  – Spider = crawler -- collects the documents
  – Indexer -- processes and represents the data
  – Search interface -- answers queries
• Example: AltaVista as of 7/98
  – Collects ~170,000,000 pages.
  – Indexes ~125,000,000 pages = 700 GB of text.
  – Roughly 35% of the entire indexable Web.


Algorithmic issues related to search engines

• Collecting documents
  – Priority
  – Load balancing
    • Internal
    • External
  – Trap avoidance
  – ...
• Processing and representing the data
  – Query-independent ranking
  – Graph representation
  – Index building
  – Duplicate elimination
  – Categorization
  – …
• Processing queries
  – Query-dependent ranking
  – Duplicate elimination
  – Query refinement
  – Clustering
  – ...


Ranking

• Goal: order the answers to a query in decreasing order of value
  – Query-independent: assign an intrinsic value to a document, regardless of the actual query
  – Query-dependent: value is determined only with respect to a particular query
  – Mixed: combination of both valuations
• Examples
  – Query-independent: length, vocabulary, publication data, number of citations (indegree), etc.
  – Query-dependent: cosine measure


Some ranking criteria

• Content-based techniques (variants of the term vector model or probabilistic model)
• Ad-hoc factors (anti-porn heuristics, publication/location data, ...)
• Human annotations
• Connectivity-based techniques
  – Query-independent (PageRank [PBMW'98, BP'98], indegree [CK'97], …)
  – Query-dependent (HITS [K'98], …)


Connectivity analysis

• Idea: mine the hyperlink information of the Web
• Assumptions:
  – Links often connect related pages
  – A link between pages is a recommendation
• Classic IR work (citations = links), a.k.a. "bibliometrics" [K'63, G'72, S'73, …]
• Many Web-related papers build on this idea [PPR'96, AMM'97, S'97, CK'97, K'98, BP'98, …]


Graph representation for the Web

• Node for each page u
• Directed edge (u,v) if page u contains a hyperlink to page v.


Query-independent connectivity-based ranking: PageRank [BP'98]

• Consider a random Web surfer:
  – Jumps to a random page with probability ε
  – With probability 1-ε, follows a random hyperlink of the current page
• The transition probability matrix is

  ε × U + (1-ε) × A

  where U is the uniform distribution and A is the adjacency matrix (normalized)
• Query-independent rank = stationary probability for this Markov chain.
• Used in GOOGLE! http://google.stanford.edu/
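
A minimal power-iteration sketch of this random-surfer chain (my own illustration, not code from the tutorial); the value of ε and the handling of pages without out-links are assumptions:

```python
def pagerank(out_links, eps=0.15, iters=50):
    """Stationary distribution of the chain  eps*U + (1-eps)*A  over pages 0..n-1."""
    n = len(out_links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [eps / n] * n                          # jump to a random page with probability eps
        for u, targets in enumerate(out_links):
            if targets:                              # follow a random out-link with probability 1-eps
                share = (1 - eps) * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:                                    # page with no out-links: spread its mass uniformly (assumption)
                for v in range(n):
                    new[v] += (1 - eps) * rank[u] / n
        rank = new
    return rank

# toy 4-page graph: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
print(pagerank([[1, 2], [2], [0], [2]]))
```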


Output from Google! princess diana


Output from Google!: jobs


Query-dependent ranking: the neighborhood graph

• Subgraph associated to each query
• Query results = start set (Result1 … Resultn), extended with the back set (pages b1 … bm pointing to a result) and the forward set (pages f1 … fs that the results point to)
• An edge for each hyperlink, but no edges within the same host


HITS [K’98]

• Goal: given a query, find:
  – Good sources of content (authorities)
  – Good sources of links (hubs)


Intuition

• Authority comes from in-edges. Being a good hub comes from out-edges.
• Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.


HITS details

Repeat until HUB and AUTH converge:
  HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui)
  AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)
  Normalize HUB and AUTH

[Diagram: node v with out-edges to u1 … uk and in-edges from w1 … wk.]
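
A compact sketch of this iteration (illustrative only, not the tutorial's code); the graph is given as an out-adjacency list over the neighborhood graph, and both score vectors are normalized to unit Euclidean length:

```python
import math

def hits(out_links, iters=50):
    """Hub and authority scores for nodes 0..n-1 of the neighborhood graph."""
    n = len(out_links)
    hub, auth = [1.0] * n, [1.0] * n
    for _ in range(iters):
        # AUTH[v] := sum of HUB[w] over all w with an edge (w, v)
        new_auth = [0.0] * n
        for w, targets in enumerate(out_links):
            for v in targets:
                new_auth[v] += hub[w]
        # HUB[v] := sum of AUTH[u] over all u with an edge (v, u)
        new_hub = [sum(new_auth[u] for u in out_links[v]) for v in range(n)]
        # normalize both score vectors
        na = math.sqrt(sum(x * x for x in new_auth)) or 1.0
        nh = math.sqrt(sum(x * x for x in new_hub)) or 1.0
        auth = [x / na for x in new_auth]
        hub = [x / nh for x in new_hub]
    return hub, auth

# nodes 0 and 1 point to 2 and 3, so they come out as hubs; 2 and 3 as authorities
hub, auth = hits([[2, 3], [2, 3], [], []])
print(hub, auth)
```

The modified algorithms discussed a few slides later would simply multiply each contribution inside the two update loops by an edge weight and a topic score.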


Output from HITS: jobs (start set from AltaVista)


Problems & solutions

• Some edges are "wrong" -- not a recommendation:
  – multiple edges from the same author
  – automatically generated
  – spam, etc.
  Solution: weight edges to limit influence
  [Diagram: three parallel edges from Author A to Author B, each weighted 1/3.]
• Topic drift
  – Query: jaguar AND cars
    Result: pages about cars in general
  Solution: analyze content and assign topic scores to nodes


Modified HITS algorithms [CDRRGK'98, BH'98, CDGKRRT'98]

Repeat until HUB and AUTH converge:
  HUB[v] := Σ AUTH[ui] · TopicScore[ui] · weight[v,ui] for all ui with Edge(v, ui)
  AUTH[v] := Σ HUB[wi] · TopicScore[wi] · weight[wi,v] for all wi with Edge(wi, v)
  Normalize HUB and AUTH


User study [BH’98]

Valuable pages within the top 10 answers (averaged over 28 topics):

[Bar chart: number of valuable pages within the top 10, for authorities and hubs, comparing the original algorithm, edge weighting, and edge weighting + content analysis.]


Open problems

• Compare the performance of query-dependent and query-independent connectivity analysis
• Exploit the order of links on the page (see e.g. [CDGKRRT'98])
• Both Google! and HITS compute the principal eigenvector. What about non-principal eigenvectors? ([K'98])
• Derive other graphs from the hyperlink structure …


More on graph representation

• Graphs derived from the hyperlink structure of the Web:
  – Node = page
  – Edge (u,v) iff pages u and v are related in a specific way (directed or not)
• Examples of edges:
  – iff u has a hyperlink to v
  – iff there exists a page w pointing to both u and v
  – iff u is often retrieved within x seconds after v
  – …


Graph representation usage

• Ranking algorithms
  – PageRank
  – HITS
  – …
• Market research
  – Lycos "LinkAlert"
• Categorization of Web pages
  – [CDI'98]
• Visualization/Navigation
  – Mapuccino [MJSUZB'97]
  – Microsoft's Site Mapping tool
  – …
• Structured Web query tools
  – WebSQL [AMM'97]
  – WebQuery [CK'97]


Example: SRC Connectivity Server [BBHKV'98]

• Directed edges = hyperlinks
• Goal: support two basic operations for all URLs collected by AltaVista
  – InEdges(URL u, int k): return k URLs pointing to u
  – OutEdges(URL u, int k): return k URLs that u points to
• Difficulties:
  – Memory usage (~180 M nodes, 1 B edges)
  – Preprocessing time (days …)
  – Query time (~0.0001 s/result URL)


URL database

Sorted list of URLs, stored with delta encoding: each entry keeps a node ID, the size of the prefix shared with the previous URL, and the remaining suffix (e.g. for www.foobar.com/, www.foobar.com/gandalf.htm, www.foograb.com/, the second entry is stored as shared prefix size 15 plus "gandalf.htm").

• The sorted list of URLs is 8.7 GB (≈ 48 bytes/URL)
• Delta encoding reduces it to 3.8 GB (≈ 21 bytes/URL)
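
A small sketch of this prefix/delta encoding for a sorted URL list (my own illustration; it does not reproduce the Connectivity Server's exact on-disk layout or node-ID assignment):

```python
def delta_encode(sorted_urls):
    """Encode each URL as (shared-prefix length with the previous URL, remaining suffix)."""
    encoded, prev = [], ""
    for url in sorted_urls:
        shared = 0
        while shared < min(len(prev), len(url)) and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def delta_decode(encoded):
    """Recover the original sorted URL list."""
    urls, prev = [], ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls

urls = ["www.foobar.com/", "www.foobar.com/gandalf.htm", "www.foograb.com/"]
enc = delta_encode(urls)
print(enc)                       # e.g. (15, 'gandalf.htm') for the second URL
assert delta_decode(enc) == urls
```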


Graph data structure

[Diagram: a Node Table in which each entry holds a pointer to the node's URL in the URL Database, a pointer into the Inlist Table, and a pointer into the Outlist Table.]
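
A toy in-memory version of this layout (illustrative only): a node table keyed by URL, with per-node in-lists and out-lists supporting the InEdges/OutEdges operations described above.

```python
class ConnectivityGraph:
    """Toy node table with per-node in-lists and out-lists."""
    def __init__(self, urls, edges):
        self.urls = urls                               # node id -> URL
        self.ids = {u: i for i, u in enumerate(urls)}  # URL -> node id
        self.out = [[] for _ in urls]                  # outlist table
        self.inl = [[] for _ in urls]                  # inlist table
        for u, v in edges:                             # edge = (source URL, target URL)
            self.out[self.ids[u]].append(self.ids[v])
            self.inl[self.ids[v]].append(self.ids[u])

    def in_edges(self, url, k):
        """Return up to k URLs pointing to url."""
        return [self.urls[i] for i in self.inl[self.ids[url]][:k]]

    def out_edges(self, url, k):
        """Return up to k URLs that url points to."""
        return [self.urls[i] for i in self.out[self.ids[url]][:k]]

g = ConnectivityGraph(["a.com/", "b.com/", "c.com/"],
                      [("a.com/", "b.com/"), ("c.com/", "b.com/"), ("b.com/", "a.com/")])
print(g.in_edges("b.com/", 10), g.out_edges("b.com/", 10))
```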


Open theory problems

• Graph compression: how much compression is possible without a significant run-time penalty?
  – Efficient algorithms to find frequently repeated small structures (e.g. wheels, K2,2)
  – Asymptotic complexity?
• External memory graph algorithms: how to assign the graph representation to pages so as to reduce paging? (See [NGV'96, AAMVV'98].)
• Stringology: less space for the URL database? Faster algorithms for URL-to-node translation?
• Dynamic data structures: how to make updates efficient at the same space cost?


Algorithmic issues related to search engines

• Collecting documents
  – Priority
  – Load balancing
    • Internal
    • External
  – Trap avoidance
  – ...
• Processing and representing the data
  ✓ Query-independent ranking
  ✓ Graph representation
  – Index building
  – Duplicate elimination
  – Categorization
  – …
• Processing queries
  ✓ Query-dependent ranking
  – Duplicate elimination
  – Query refinement
  – Clustering
  – ...


AltaVista performance

• Indexes about 0.8 TB of text
• Every word is indexed! (No stop words.)
• About 37 million queries on weekdays
• Mean response time = 0.6 sec (elapsed Palo Alto time)
• Median response time, as observed in NJ = 0.9 sec [LG'98b]


Why is AltaVista so fast?

• Conceptually simple algorithms
• Good algorithms engineering
• Good software engineering
• Lots of fast hardware, lots of RAM


AltaVista search index

• View all documents as concatenated into one document (special token to separate the original documents)
• Basic data structure: flat inverted file [see e.g. BF'92]
  – Store, for each word, a delta-encoded list of all locations where the word occurs
  – Order words lexicographically and compress common word prefixes
• ⇒ Size of the full-text index ≈ 30% of the input size
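
A toy version of such a flat inverted file (my own sketch; the real index uses far more compact encodings): documents are concatenated into one token stream with a separator token, and each word maps to a delta-encoded list of its locations.

```python
from collections import defaultdict

END = "\x00"   # special token separating the original documents

def build_index(docs):
    """Concatenate docs and store, for each word, a delta-encoded list of locations."""
    stream = []
    for doc in docs:
        stream.extend(doc + [END])
    index = defaultdict(list)
    last = defaultdict(int)
    for loc, word in enumerate(stream):
        index[word].append(loc - last[word])   # store the gap since the previous occurrence
        last[word] = loc
    return index

def locations(index, word):
    """Decode the delta-encoded list back into absolute locations."""
    locs, cur = [], 0
    for gap in index.get(word, []):
        cur += gap
        locs.append(cur)
    return locs

docs = [["a", "rose", "is", "a", "rose"], ["mouse", "trap"]]
idx = build_index(docs)
print(locations(idx, "rose"))   # absolute word positions in the concatenated stream
```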


Operations

• Extended Boolean model:
  – matches: exact, prefix, phrase
  – operators: AND, OR, AND NOT, NEAR
• To support this, only one abstract data type is needed:
  Index Stream Reader (ISR) = the ordered list of locations that match a given query. For a given ISR:
  • loc(): return current location
  • next(): advance to next location
  • seek(int k): advance to first location ≥ k


Translating a query into an ISR

Each query is decomposed into these elementary operations:

  CREATE: WORD → ISR ……………… Create a list
  OR: ISR × ISR → ISR …………………. Merge lists
  AND NOT: ISR × ISR → ISR ………… Intersect lists
  AND: ISR × ISR → ISR ……………….. "
  NEAR: ISR × ISR → ISR ……………… "
  PHRASE: ISR × ISR × … × ISR → ISR  "

The final resulting ISR is translated back into a list of documents and ranked.
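
A sketch of the ISR abstraction and of an AND-style intersection implemented by leapfrogging with seek() (illustrative only; a real AND intersects at document granularity, and operators such as NEAR and PHRASE are omitted):

```python
class ISR:
    """Index Stream Reader over a sorted list of locations."""
    def __init__(self, locations):
        self.locations = sorted(locations)
        self.i = 0

    def loc(self):
        return self.locations[self.i] if self.i < len(self.locations) else None

    def next(self):
        self.i += 1
        return self.loc()

    def seek(self, k):
        """Advance to the first location >= k."""
        while self.loc() is not None and self.loc() < k:
            self.i += 1
        return self.loc()

def and_isr(a, b):
    """Toy AND: locations present in both streams (a real AND works per document)."""
    out = []
    while a.loc() is not None and b.loc() is not None:
        if a.loc() == b.loc():
            out.append(a.loc()); a.next(); b.next()
        elif a.loc() < b.loc():
            a.seek(b.loc())
        else:
            b.seek(a.loc())
    return out

print(and_isr(ISR([3, 7, 12, 30]), ISR([7, 9, 30, 41])))   # -> [7, 30]
```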


Hardware

The search happens on
• ~20 64-bit machines
• Typical configuration:
  – 10 CPUs each (625 MHz)
  – 12 GB RAM
  – 300 GB RAID


Reasons for duplicate filtering

• Proliferation of almost-but-not-quite-equal documents on the Web:
  – Legitimate: mirrors, local copies, updates, etc.
  – Malicious: spammers, spider traps, dynamic URLs
  – Mistaken: spider errors
• Costs:
  – RAM and disks
  – Unhappy users
• Approximately 30% of the pages on the Web are (near) duplicates. [BGMZ'97, SG'98]


Basic mechanism needed

• Must filter both duplicate and near-duplicate documents
• Computing pair-wise edit distance would take forever
• Preferably store only a short sketch for each document.


The basics of a solution

[B’97],[BGMZ’97],[B’98]

1. Reduce the problem to a set intersection problem

2. Estimate intersections by sampling minima.


Shingling

• Shingle = fixed-size sequence of w contiguous words

[Diagram: document D = "a rose is a rose is a rose" → shingling → set of shingles → fingerprinting → set of 64-bit fingerprints.]


Defining resemblance

Given the shingle sets S1 and S2 of documents D1 and D2:

  resemblance(D1, D2) = |S1 ∩ S2| / |S1 ∪ S2|
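
A small sketch of shingling and resemblance (illustrative; a real implementation fingerprints each shingle to 64 bits rather than keeping the text, and the shingle size w is a parameter):

```python
def shingles(text, w=4):
    """Set of shingles = all sequences of w contiguous words."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(d1, d2, w=4):
    """|S1 intersect S2| / |S1 union S2| over the two shingle sets."""
    s1, s2 = shingles(d1, w), shingles(d2, w)
    return len(s1 & s2) / len(s1 | s2)

# the two "a rose is a rose ..." documents from the shingling slide have identical shingle sets
a = "a rose is a rose is a rose"
b = "a rose is a rose is a rose is a rose is a rose"
print(resemblance(a, b))   # -> 1.0
```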


Sampling minima

• Apply a random permutation σ to the set [0..2^64]
• Crucial fact: let α = min(σ(S1)) and β = min(σ(S2)). Then

  Pr(α = β) = |S1 ∩ S2| / |S1 ∪ S2|


Implementation

• Choose a set of t random permutations of U
• For each document keep a sketch S(D) consisting of the t minima = samples
• Estimate the resemblance of A and B by counting common samples
• The permutations should be from a min-wise independent family of permutations. See [BCFM'97] for the theory of mwi permutations.
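
A sketch of this scheme in which the t random permutations are approximated by t salted hash functions (a loose stand-in for truly min-wise independent permutations; the parameter values and document sets are illustrative):

```python
import random

def make_sketch(shingle_set, t=84, seed=0):
    """For each of t salted hash functions, keep the minimum hash value over the set (= one sample)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(t)]
    return [min(hash((salt, s)) & 0xFFFFFFFFFFFFFFFF for s in shingle_set) for salt in salts]

def estimated_resemblance(sketch1, sketch2):
    """Fraction of positions where the two sketches keep the same minimum."""
    return sum(a == b for a, b in zip(sketch1, sketch2)) / len(sketch1)

s1 = {"a rose is a", "rose is a rose", "is a rose is"}
s2 = {"a rose is a", "rose is a rose", "a thorn among roses"}
print(estimated_resemblance(make_sketch(s1), make_sketch(s2)))   # true resemblance is 2/4 = 0.5
```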


If we need only high resemblance

• Divide the sketch into k groups of s samples (t = k · s)
• Fingerprint each group ⇒ feature
• Two documents are fungible if they have more than r common features.
• Want: fungibility ⇔ resemblance above a fixed threshold ρ

[Diagram: sketch 1 and sketch 2, each divided into groups of samples.]


Real implementation

• ρ = 90%. In a 1000-word page with shingle length = 8 this corresponds to
  • deleting a paragraph of about 50-60 words, or
  • changing 5-6 random words.
• Sketch size t = 84, divided into k = 6 groups of s = 14 samples
• 8-byte fingerprints → we store only 6 × 8 = 48 bytes/document
• Threshold r = 2
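
A sketch of the grouping step under these parameters (k = 6 groups of s = 14 samples, threshold r = 2); fingerprinting a group is approximated here by hashing the tuple of its samples, and treating the threshold as inclusive ("at least r") is an assumption:

```python
def features(sketch, k=6, s=14):
    """Split the sketch into k groups of s samples and fingerprint each group."""
    assert len(sketch) == k * s
    return [hash(tuple(sketch[i * s:(i + 1) * s])) for i in range(k)]

def fungible(sketch1, sketch2, k=6, s=14, r=2):
    """Deem two documents fungible when at least r of their k features coincide (assumed inclusive threshold)."""
    common = sum(f1 == f2 for f1, f2 in zip(features(sketch1, k, s), features(sketch2, k, s)))
    return common >= r

sketch = list(range(84))                           # stand-in for a sketch of 84 min-wise samples
print(fungible(sketch, sketch))                    # identical sketches -> True
print(fungible(sketch, [x + 1 for x in sketch]))   # completely different sketches -> False
```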


Probability that two documents are deemed fungible

Two documents with resemblance ρ:

• Using the full sketch of t = s·k samples (each sample matches with probability ρ):

  P = Σ_{i = ρ·t .. s·k} C(s·k, i) · ρ^i · (1-ρ)^(s·k - i)

• Using features (each of the k features matches with probability ρ^s):

  P = Σ_{i = r .. k} C(k, i) · (ρ^s)^i · (1-ρ^s)^(k-i)
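
A quick numeric check of the two expressions with the parameters from the previous slide (t = 84, k = 6, s = 14, r = 2); rounding the full-sketch summation limit up to ⌈ρ·t⌉ is an assumption of this sketch:

```python
from math import comb, ceil

def p_full_sketch(rho, t=84):
    """P[at least rho*t of the t samples match], each sample matching with probability rho."""
    return sum(comb(t, i) * rho**i * (1 - rho)**(t - i) for i in range(ceil(rho * t), t + 1))

def p_features(rho, k=6, s=14, r=2):
    """P[at least r of the k features match]; a feature matches with probability rho**s."""
    q = rho**s
    return sum(comb(k, i) * q**i * (1 - q)**(k - i) for i in range(r, k + 1))

for rho in (0.80, 0.90, 0.95):
    print(rho, round(p_full_sketch(rho), 4), round(p_features(rho), 4))
```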


Features vs. full sketch

[Plot: probability that two pages are deemed fungible as a function of resemblance, using the full sketch vs. using features.]


Related work

• Similarity detection [M'94, BDG'95, SG'95, H'96, BGMZ'97]
• Mathematics of resemblance [B'97, B'98]
• Min-wise independent permutations [BCFM'97, BCM'98]
• Duplication on the Web [SG'98]
• Efficient ways of finding data base elements that share many attributes [FSGMU'98]


Open problems

• Best way of grouping samples for a given threshold and/or for multiple thresholds?
• Efficient ways to find in a data base pairs of records that share many attributes. Best approach?
• Min-wise independent permutations -- lots of open questions.
• Connections with the Hamming distance nearest neighbor problem. (See [IM'98].)
• Other applications possible (images, sounds, ...) -- need a translation into a set intersection problem.


Algorithmic issues related to search engines

• Collecting documents
  – Priority
  – Load balancing
    • Internal
    • External
  – Trap avoidance
  – ...
• Processing and representing the data
  ✓ Query-independent ranking
  ✓ Graph representation
  ✓ Index building
  ✓ Duplicate elimination
  – Categorization
  – …
• Processing queries
  ✓ Query-dependent ranking
  ✓ Duplicate elimination
  – Query refinement
  – Clustering
  – ...


Adding pages to the index

Crawling process:

[Diagram: a queue of links to explore (URLs of expired pages from the index are also added); repeatedly get the link at the top of the queue, fetch the page, index the page and parse its links, and add the newly found URLs to the queue.]


Queuing discipline

• Standard graph exploration:
  – Random
  – BFS
  – DFS (+ depth limits)
• Goal: get the "best" pages for a given index size
  – Priority based on query-independent ranking
  – Visit the page that has the highest potential PageRank [CGP'98]
• Goal: keep the index fresh
  – Priority based on rate of change [CLW'97]
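
A toy crawler loop illustrating such a priority-based queuing discipline (my own sketch; fetch_page, parse_links, and the priority function are placeholders -- in practice the priority could be estimated PageRank or rate of change):

```python
import heapq

def crawl(seed_urls, priority, fetch_page, parse_links, max_pages=100):
    """Explore the link graph, always expanding the queued URL with the highest priority."""
    heap = [(-priority(u), u) for u in seed_urls]     # max-heap via negated priorities
    heapq.heapify(heap)
    seen = set(seed_urls)
    index = {}
    while heap and len(index) < max_pages:
        _, url = heapq.heappop(heap)                  # get the link at the "top" of the queue
        page = fetch_page(url)                        # fetch the page (placeholder)
        index[url] = page                             # index the page
        for link in parse_links(page):                # parse its links
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-priority(link), link))
    return index

# toy graph standing in for the Web; the priority function is a trivial placeholder
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
crawled = crawl(["a"], priority=len, fetch_page=lambda u: web[u], parse_links=lambda page: page)
print(list(crawled))
```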


Load balancing

• Internal -- can not handle too much retrieved data simultaneously, but
  – Response time is unpredictable
  – Size of answers is unpredictable
  – There are additional system constraints (# threads, # open connections, etc.)
• External
  – Should not overload any server or connection
  – A well-connected crawler can saturate the entire outside bandwidth of some small countries
  – Any queuing discipline must be acceptable to the community


Web IR Tools

✓ General-purpose search engines
• Hierarchical directories
  – Automatic categorization
• Specialized search engines (dealing with heterogeneous data sources)
  – Shopping robots
  – Home page finder [SLE'97]
  – Applet finders
  – …
• Search-by-example
• Collaborative filtering
• Notification systems
• Meta-information


Dealing with heterogeneous sources

Given information sources with various capabilities, query all of them and combine the output.

• Modern life problem:
  – Hot issue in current data base research
  – Red hot issue in the industry
• Examples
  – Inter-business e-commerce, e.g. www.industry.net
  – Meta search engines
  – Shopping robots
• Issues
  – Determining relevant sources -- the "identification" problem
  – Merging the results -- the "fusion" problem


Standards and tools

• Some standards:
  – STARTS → Stanford Protocol Proposal for [distributed] Internet Retrieval and Search
  – EDI → Electronic Data Interchange
  – XML → Extensible Markup Language
    • http://www.w3.org/TR/1998/REC-xml-19980210
    • http://www.oasis-open.org/cover/sgml-xml.html
• Some tools:
  – WebL → scripting language for automating tasks on the Web
    • http://www.research.digital.com/SRC/WebL
  – WebSQL → SQL-like query language [M'96, AMM'97]


Example: a shopping robot

• Input: a product description in some form
  Find: merchants for that product on the Web
• Jango [DEW'97]
  – Preprocessing: store vendor URLs in a database; learn for each vendor:
    • the URL of the search form,
    • how to fill in the search form, and
    • how the answer is returned
  – Request processing: fill out the form at every vendor and test whether the result is a success
  – The range of products is predetermined


Jango input example


Jango output


Direct query to K&L Wine


What price decadence?


Open problems

• Going beyond the lowest common capability
• Learning problem: automatic understanding of new interfaces
• Efficient ways to determine which DB is relevant to a particular query
• "Partial knowledge indexing": the indexer has limited access to the full DB


Web IR Tools

✓ General-purpose search engines
✓ Hierarchical directories
✓ Specialized search engines (dealing with heterogeneous data sources)
• Search-by-example
• Collaborative filtering
• Notification systems
• Meta-information


Search-by-example

• Given: a set S of URLs
• Find: URLs of similar pages
• Algorithms:
  – Connectivity-based: build a graph consisting of the neighborhood of S and run HITS [K'98]
  – Usage-based: related pages are pages visited frequently after S [Netscape "What's related"]
  – Query refinement [Excite's "More like this"]


Netscape example


Web IR Tools

✓ General-purpose search engines
✓ Hierarchical directories
✓ Specialized search engines
✓ Search-by-example
• Collaborative filtering
• Notification systems
• Meta-information


Collaborative filtering

[Diagram: explicit preferences and other user input are collected; an analysis phase builds a model from the collected input; a prediction phase applies the model to produce suggestions.]


Lots of projects

• Collaborative filtering was seldom used in classic IR; big revival on the Web. Projects:
  – PHOAKS -- ATT Labs → Web page recommendations based on Usenet postings
  – Firefly -- MIT → share profiles without infringing privacy
  – GAB -- Bellcore → Web browsing
  – Grouplens -- U. Minnesota → Usenet newsgroups
  – EachToEach -- Compaq SRC → rating movies
  – …
  See http://sims.berkeley.edu/resources/collab/


Why do we care?

• The ranking schemes that we discussed are also a form of collaborative ranking!
  – Connectivity = people vote with their links
  – Usage = people vote with their clicks
• These schemes are used only for global model building. Can they be combined with per-user data? Ideas:
  – Consider the graph induced by the user's bookmarks.
  – Profile the user -- see www.globalbrain.net
  – Deal with privacy concerns!


Web IR Tools

✓ General-purpose search engines
✓ Hierarchical directories
✓ Specialized search engines
✓ Search-by-example
✓ Collaborative filtering
• Meta-information
  – Web statistics: size, growth, ...
  – Usage statistics
  – Query log statistics


Size of the Web

• How big is the Web?
• How fast does the Web grow?
• What is the proportion of pages in French, XML, Postscript, etc.?
• How do various search engines compare?

Arachnometry: the art and science of measuring the World Wide Web, in particular the coverage of various search engines.


Comparing Search Engine Coverage

• Naïve approaches
  – Get a list of URLs from each search engine and compare
    • Not practical or reliable.
  – Result set size comparison
    • Reported sizes are approximate.
  – …
• Better approach
  – Statistical sampling


URL sampling

• Ideal strategy: generate a random URL and check for containment in each index.
• Random URLs are hard to generate:
  – Random walk methods:
    • The graph is directed
    • The stationary distribution is non-uniform
    • Must prove rapid mixing
  – Pages in caches, query logs [LG'98a], etc.:
    • Correlated to the interests of a particular group of users
• A simple way: collect all pages on the Web and pick one at random.


Sampling via queries [BB’98]

• Search engines have the best crawlers -- why not exploit them?
• Method:
  – Sample from each engine in turn
  – Estimate the relative sizes of two search engines
  – Compute absolute sizes from a reference point


Estimate relative sizes

Two steps: (i) selecting, (ii) checking
• Select pages randomly from A (resp. B)
• Check if the page is contained in B (resp. A)

  |A ∩ B| ≈ (1/2) · |A|
  |A ∩ B| ≈ (1/6) · |B|
  ∴ |B| ≈ 3 · |A|
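
The arithmetic of the overlap estimate, as a tiny sketch (my own illustration; sample_from and contains stand in for the selecting and checking procedures, and the toy engine contents are made up):

```python
import random

def relative_size(sample_from, contains, n=1000):
    """Estimate |B| / |A| as (fraction of A's sample in B) / (fraction of B's sample in A)."""
    a_in_b = sum(contains("B", p) for p in sample_from("A", n)) / n   # ~ |A ∩ B| / |A|
    b_in_a = sum(contains("A", p) for p in sample_from("B", n)) / n   # ~ |A ∩ B| / |B|
    return a_in_b / b_in_a

# toy engines: A indexes pages 0..99, B indexes pages 50..349, so |B| = 3 * |A|
engines = {"A": set(range(100)), "B": set(range(50, 350))}
estimate = relative_size(lambda e, n: random.choices(list(engines[e]), k=n),
                         lambda e, p: p in engines[e])
print(estimate)   # should come out close to 3
```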


Selecting a random page

• Generate a random query
  – Build a lexicon of words that occur on the Web
  – Combine random words from the lexicon to form queries
• Get the first 100 query results from engine A
• Select a random page out of the set
• The distribution is biased -- the conjecture is that

  Σ_{D ∈ A∩B} p(D) / Σ_{D ∈ A} p(D) ≈ |A ∩ B| / |A|

  where p(D) is the probability that D is picked by this scheme


Checking if an engine has a page

• Create a "unique query" for the page:
  – Use 8 rare words.
  – E.g., for the Systems Research Center home page: [screenshot of the query]


Results of the BB’98 study

Size in millions of pages (as of July '98):

  Lycos             21
  Northern Light    75
  Infoseek          37
  Excite            45
  HotBot           100
  AltaVista        125
  Union            240
  Static Web       350

Status as of July '98:
• Web size: 350 million pages
• Growth: 25 M pages/month
• AltaVista is the biggest: 35%
• The six largest engines cover: 2/3
• Small overlap: 3 M pages

Crawling strategy is very different! Exclusive listings in millions of pages (July '98):

  Infoseek           8
  Excite            13
  HotBot            45
  AltaVista         50
  Northern Light    20
  Lycos              5


Open problems

• Random page generation via random walks
• Better "sampling via queries" methods
• Cryptography-based approach: we want random pages from each engine but no cheating! (The page should be chosen u.a.r. from the actual index.)
  – Each search engine can commit to the set of pages it has without revealing it
  – Need to ensure that this set is the same as the set actually indexed
  – Need an efficient oblivious protocol to obtain a random page from a search engine
  – See [NP'98] for a possible solution


How often do people view a page?

• Problems:
  – Web caches interfere with click counting
  – Cheating pays (advertisers pay by the click)
• Solutions:
  – Naïve: force caches to re-fetch for every click
    • Lots of traffic, annoyed Web users
  – Extend HTML with counters [ML'97]
    • Requires compliance; down caches falsify results
  – Use sampling [P'97]
    • Force refetches on random days
    • Force refetches for random users and IP addresses
  – Cryptographic audit bureaus [NP'98a]
• Commercial providers: 100hot, Media Matrix, Relevant Knowledge, …


Query Log Statistics [SHMM’98]

request = a new query or a new result screen of an old query
session = a series of requests by one user close together in time

• Analyzed ~1 B AltaVista requests consisting of:
  – ~840,000,000 non-empty requests
  – ~575,000,000 non-empty queries
  – ~153,000,000 unique non-empty queries
  – ~285,000,000 user sessions
• Average number of terms: 2.35 (max 393)
• Average number of operators: 0.41
• 78% of sessions -- only 1 query asked
• 64% of queries occur only once


Lots of things we didn't even touch...

• Clustering = group similar items (documents or queries) together ↔ unsupervised learning
• Categorization = assign items to predefined categories ↔ supervised learning
• Classic IR issues that are not substantially different in the Web context:
  – Latent semantic indexing -- associate "concepts" to queries and documents and match on concepts
  – Summarization: abstract the most important parts of the text content (see [TS'98] for the Web context)
• Many others …


Final conclusions

• We talked mostly about IR methods and tools that
  – take advantage of the Web particularities
  – mitigate some of the difficulties
• Web IR offers plenty of interesting problems …
  … but not on a silver platter
• Almost every area of algorithms research is relevant: data structures, random walks, learning, stringology, cryptography, compression, graph algorithms, external memory algorithms, on-line scheduling, load balancing, etc.
• Great need for algorithms engineering


Thanks!

• Krishna Bharat
• Moses Charikar
• Edith Cohen
• Jeff Dean
• Uri Feige
• Puneet Kumar
• Hannes Marais
• Michael Mitzenmacher
• Prabhakar Raghavan
• Lyle Ramshaw