Challenges in Distributed Information Retrieval, by Ricardo Baeza-Yates (1,2). Joint work with C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3). Affiliations: 1. Yahoo! Research Barcelona – Catalunya, Spain; 2. Yahoo! Research Latin America – Santiago, Chile; 3. ISTI-CNR – Pisa, Italy.

Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)


Presentation done by Ricardo Baeza-Yates


Page 1: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Challenges in Distributed Information Retrieval

Ricardo Baeza-Yates (1,2)

Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)

1. Yahoo! Research Barcelona – Catalunya, Spain
2. Yahoo! Research Latin America – Santiago, Chile
3. ISTI-CNR – Pisa, Italy

Page 2: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

1 Crawling
2 Indexing
3 Query Processing
4 Caching

Page 3: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Main Modules and Issues

Crawling
  Partition: URL assignment
  Dependability: Re-crawl
  Communication (sync.): URL exchanges
  External factors: Web growth, content change, network topology, bandwidth, DNS, QoS of servers

Indexing
  Partition: Document partition, term partition
  Dependability: Re-index
  Communication (sync.): Partial indexing, updating, merging
  External factors: Web growth, content change, global statistics

Querying
  Partition: Query routing, collection selection, load balancing
  Dependability: Replication, caching
  Communication (sync.): Rank aggregation, personalization
  External factors: Changing user needs, user base growth, DNS

Page 5: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Crawling

In theory it is simple: fetch, parse, fetch, parse, ...

In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network)

Page 7: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Issues

How to partition the crawling task?
What to do when one agent fails?
How to communicate among agents?
How to deal with external factors?

Page 11: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Partitioning

Host-based partitioning exploits locality of links
Balance improves if large and small hosts are treated differently
Performance improves if geographic location is considered
Consistent hashing
  Allows agents to be added to and removed from the pool [Boldi et al., 2004]
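The combination of host-based partitioning and consistent hashing can be illustrated with a minimal sketch. It is not the scheme of [Boldi et al., 2004]; the class name `CrawlerRing`, the SHA-1 ring hash, and the default of 64 virtual points per agent are all assumptions made for illustration. The key is the host name, so every URL of a host goes to the same agent, and removing an agent only moves the hosts that agent owned.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _ring_hash(key: str) -> int:
    """Map a string to a 64-bit position on the hash ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class CrawlerRing:
    """Hypothetical consistent-hash ring assigning hosts to crawling agents."""

    def __init__(self, replicas: int = 64):
        self.replicas = replicas  # virtual points per agent (assumed value)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> agent name

    def add_agent(self, agent: str) -> None:
        for i in range(self.replicas):
            point = _ring_hash(f"{agent}#{i}")
            if point not in self._owner:
                bisect.insort(self._points, point)
            self._owner[point] = agent

    def remove_agent(self, agent: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def agent_for(self, url: str) -> str:
        if not self._points:
            raise ValueError("no agents in the pool")
        # Host-based partitioning: hash the host, not the full URL.
        host = urlsplit(url).hostname or url
        i = bisect.bisect_right(self._points, _ring_hash(host)) % len(self._points)
        return self._owner[self._points[i]]
```

Because only the departing agent's ring points disappear, hosts owned by the surviving agents keep their assignment, which limits the URL exchanges needed when the pool changes.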

Page 13: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Communication

Host-based partitioning reduces communication
Highly-linked URLs should be cached
Communication with the server can be improved if the server cooperates

Page 14: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

External factors

DNS can be a bottleneck
Varying quality of implementation of HTTP
Varying quality of HTML coding
Varying quality of service in general
Spam

Page 16: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

What is Indexing?

Indexing in databases and IR is the process of building an index over a collection of documents
Inverted indexes are typically used in IR
  Lexicon: contains the distinct terms appearing in the collection’s documents
  Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
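The lexicon and posting lists can be made concrete with a small sketch. The function name `build_inverted_index`, the whitespace tokenizer, and the choice to store word positions in each posting are assumptions for illustration, not details from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build {term: [(doc_id, [positions...]), ...]} from {doc_id: text}.

    The dictionary keys form the lexicon; each value is a posting list,
    ordered by doc_id, describing where the term occurs in each document.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            postings[term].append((doc_id, pos_list))
    return dict(postings)


docs = {1: "web search engine", 2: "web crawling"}
index = build_inverted_index(docs)
lexicon = sorted(index)  # distinct terms of the collection
```

Here `index["web"]` holds one posting per document containing the term, which is the structure that the partitioning strategies later in the deck split across machines.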

Page 20: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Index and Distributed Indexing

[Figure: the T × D (terms × documents) matrix and its two partitionings. Document partition: the documents are split into sub-collections D1, D2, ..., Dm. Term partition: the terms are split into subsets T1, T2, ..., Tn.]

Page 21: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Document Partitioning

Split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)

pros:
  higher throughput
  new documents are easily added to existing indexes
  load balanced

cons:
  high number of disk operations
  high volume of data read from disk

Page 29: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Term Partitioning

Split the terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)

pros:
  reduced number of disk accesses
  reduced volume of exchanged data

cons:
  requires the entire index to be built before slicing it into partitions
  not scalable with large collections
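The two ways of slicing the T × D matrix can be compared on a toy index. This is a sketch under assumptions: the index is a dictionary mapping each term to a list of `(doc_id, term_frequency)` pairs, documents are assigned by `doc_id % num_parts`, and terms are assigned round-robin in sorted order. The function names are hypothetical.

```python
def document_partition(index, num_parts):
    """Each partition holds every term, restricted to its own documents."""
    parts = [dict() for _ in range(num_parts)]
    for term, postings in index.items():
        for doc_id, tf in postings:
            parts[doc_id % num_parts].setdefault(term, []).append((doc_id, tf))
    return parts


def term_partition(index, num_parts):
    """Each partition holds the complete posting lists of a subset of terms."""
    parts = [dict() for _ in range(num_parts)]
    for i, term in enumerate(sorted(index)):
        parts[i % num_parts][term] = list(index[term])
    return parts


def nodes_holding(parts, term):
    """Partitions that must be contacted to answer a one-term query."""
    return [i for i, part in enumerate(parts) if term in part]


index = {
    "web": [(0, 2), (1, 1), (2, 1)],
    "crawl": [(1, 3)],
    "index": [(0, 1), (2, 4)],
}
doc_parts = document_partition(index, 2)
term_parts = term_partition(index, 2)
```

In this toy case a query for "web" touches both document partitions but only one term partition, which matches the trade-off listed above: fewer accesses under term partitioning, at the price of needing the full index before it can be split.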

Page 36: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Partitioning Goals

Partitioning is the first design issue to be faced in distributed indexing
A distributed index should allow for efficient query routing and resolution
Reducing the number of nodes queried is desirable too

Page 39: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Partitioning Techniques

random partitioning
  documents are assigned uniformly at random to the partitions
topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004])
  documents are first clustered, and then each partition is composed of one or more clusters
usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006])
  clustering is induced by the way users interact with the index

Page 45: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Load Balancing Issues

In document partitioned indexes not adopting collection selection strategies, load is almost balanced among all the query processors
In term partitioned indexes (even the new pipelined schema [Webber et al., 2006]) load balancing is an issue
In federated document partitioned systems where collection selection is applied, balancing the load is still an unexplored issue

[Figure: two bar charts, "Document-distributed" and "Pipelined"; x-axis: processors 1 to 8; y-axis: load percentage, 0 to 100.]

Figure 3: Average per-processor busy load for k = 8 and TB/01, for document-distributed processing and pipelined processing. The dashed line in each graph is the average busy load over the eight processors.

nodes become starved for work as queries with a term on node two queue up for processing.

The document-partitioned system has a much higher average busy load than the pipelined one, 95.3% compared to 58.0%. On the one hand, this is to the credit of document-distribution, in that it demonstrates that it is better able to make use of all system resources, whereas pipelining leaves the system underutilized. On the other hand, the fact that in this configuration pipelining is able to achieve roughly 75% of the throughput of document-distribution using only 60% of the resources is encouraging, and confirms the model’s underlying potential.

Figure 3 summarized system load for the 10,000-query run as a whole; it is also instructive to examine the load over shorter intervals. Figure 4 shows the busy load for document-distribution (top) and pipelining (bottom) when measured every 100 queries, with the eight lines in each graph showing the aggregate load of this and lower numbered processors, meaning that the eighth of the lines shows the overall total as a system load out of the available 8.0 total resource. Note how the document distributed approach is consistent in its performance over all time intervals, and the total system utilization remains in a band between 7.1 and 7.8 out of 8.0 (ignoring the trail-off as the system is finishing up).

The contrast with pipelining (the bottom graph in Figure 4) is stark. The total system utilization varies between 3.1 and 5.9 out of 8.0, and is volatile, in that nodes are not all busy or all quiet at the same time. The only constant is that node two is busy all the time, acting as a throttle on the performance of the system.

The reason for the uneven system load lies in the different ways the document-partitioned and pipelined systems split the collection. The two chief determinants of system load in a text query evaluation engine are the number of terms to be processed, and the length of each term’s inverted lists. Document partitioning divides up this load evenly – each node processes every term in the

Page 46: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Exploiting Usage Information

Query logs contain features that are critical for optimizing efficiency of different parts of search engines
  query distribution
  query arrival time
  clickthrough information
  ...

Page 51: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Usage Information in Term Partitioned Systems

Frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors
  bin-packing approach [Moffat et al., 2006]
  data mining approach [Lucchese et al., 2007]
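To show what a bin-packing view of term assignment can look like, here is a generic greedy sketch: heaviest term first, always placed on the currently lightest node. It is not the algorithm of [Moffat et al., 2006]; the function name `assign_terms` and the idea of estimating a term's load as query frequency times posting-list length are assumptions for illustration.

```python
import heapq


def assign_terms(term_load, num_nodes):
    """Greedy bin packing of terms onto query processors.

    term_load maps each term to an estimated load, for example
    (query frequency) x (posting list length); this estimate is an assumption.
    Returns {term: node_id}.
    """
    heap = [(0.0, node) for node in range(num_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    # Heaviest terms first; ties broken alphabetically for determinism.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        node_load, node = heapq.heappop(heap)
        assignment[term] = node
        heapq.heappush(heap, (node_load + load, node))
    return assignment


loads = {"the": 90.0, "web": 40.0, "search": 30.0, "crawl": 20.0}
placement = assign_terms(loads, 2)
```

With these toy loads the single heavy term sits alone on one node and the three lighter terms share the other, so both nodes carry an estimated load of 90.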

Page 54: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Usage Information in Document Partitioned Systems

Random partitioning does not ensure load balancing [Badue et al., 2006]
Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
Usage-based mapping can be adopted to partition sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]

Page 57: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Challenges in Distributed Indexing

In document partitioned systems, partitioning strategies are needed that enhance the effectiveness of collection selection
In both systems it is a challenge to find effective load balancing strategies

Page 59: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Query processing

System components
  Clients submitting queries
  Sites consisting of servers
  Servers are commodity computers
Query processing
  System receives a query
  Query routing: forwarding query to appropriate sites
  Merging results
Challenges
  Determine appropriate sites on the fly
  WAN communication is costly
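The merging step can be sketched in a few lines. The sketch assumes each contacted site returns a list of `(score, doc_id)` pairs and that scores are directly comparable across sites, which in practice depends on shared global statistics; the function name `merge_results` is hypothetical.

```python
import heapq
from itertools import chain


def merge_results(per_site_results, k):
    """Merge per-site [(score, doc_id), ...] lists into a global top-k.

    Assumes scores from different sites are on the same scale.
    """
    return heapq.nlargest(k, chain.from_iterable(per_site_results))


site_a = [(0.9, "a1"), (0.4, "a2")]
site_b = [(0.7, "b1"), (0.6, "b2")]
top3 = merge_results([site_a, site_b], 3)
```

Only the top results of each site need to cross the wide-area network, which is one reason to keep the number of contacted sites small.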

Page 60: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Challenges in more detail

Large-scale systems
  Large amount of data
  Large data structures
  Large number of clients and servers
Partitioning of data structures
  Necessary due to very large data structures
  Parallel processing
  e.g. document collection split by topic, language, region
Replication of data structures
  For availability, throughput, and response time
  Conflict with resource utilization

Page 61: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Framework for Distributed Query Processing

[Figure: a client connected through a WAN to Site A (Region X), Site B (Region Y) and Site C (Region Z); labels 1, 2 and 3 mark elements of the diagram.]

Query processor matches documents to the received queries
Coordinator receives queries and routes them to appropriate sites
Cache stores results from previous queries

Page 62: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Currently...

Multiple sites
Sites are full replicas of each other
Simple query routing: Dynamic DNS

Given the previous framework, there is an opportunity to
  Use storage resources more efficiently
  Use more sophisticated query routing mechanisms
  Apply effective partition strategies (e.g., language-based strategies)

Page 63: Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Challenges inDistributed IR

RicardoBaeza-Yates

Crawling

Indexing

QueryProcessing

Caching

Partitioning

Goals

Achieve cost-effective scalability

Reduce response times

Potential solutions

Partition of large data structures by topic, language, etc.

Effective query routing first to local sites, then to global sites

Incremental presentation of results to alleviate network latencies
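The last two ideas can be sketched together: query the local site first, contact remote sites only when the local answer is too short, and hand each batch of results to the caller as soon as it arrives. The function name, the `min_results` threshold, and the representation of sites as plain callables are assumptions made for illustration.

```python
def route_incrementally(query, local_site, remote_sites, min_results=2):
    """Yield (site_name, new_results) batches: local first, then remote sites.

    local_site:   callable mapping a query to a list of document ids
    remote_sites: dict of site name -> callable with the same signature
    Remote sites are contacted only while fewer than min_results documents
    have been found, which keeps the number of WAN round trips low.
    """
    seen = []
    batch = list(local_site(query))
    seen.extend(batch)
    yield "local", batch  # the caller can display this batch immediately
    for name, search in remote_sites.items():
        if len(seen) >= min_results:
            return
        batch = [doc for doc in search(query) if doc not in seen]
        seen.extend(batch)
        yield name, batch
```

Because the function is a generator, a front end can render the local batch while the remote requests are still in flight.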

Dependability

Goals

Availability of query processors

Consistency of replicated query data (can be weak)

Consistency of user state: e.g., personalization, user preferences

Potential solutions

More network resources: multi-homed sites

Replication: within and across sites

Consistency: techniques for weak consistency (replicas eventually converge)

Caching: improve availability when query processors are unavailable
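The caching point can be illustrated with a small wrapper that serves the last known results when no query processor can be reached. This is a sketch under assumptions: `ProcessorUnavailable`, `FallbackCache`, and the callable backend are hypothetical names, and results served this way may be stale, in line with the weak consistency mentioned above.

```python
class ProcessorUnavailable(Exception):
    """Raised by the (hypothetical) backend when no query processor is reachable."""


class FallbackCache:
    def __init__(self, backend):
        self.backend = backend  # callable: query -> list of results
        self.stored = {}        # query -> last results seen for that query

    def answer(self, query):
        """Return (results, is_fresh)."""
        try:
            results = self.backend(query)
        except ProcessorUnavailable:
            if query in self.stored:
                # Possibly stale, but the user still gets an answer.
                return self.stored[query], False
            raise  # never seen before: nothing to fall back on
        self.stored[query] = results
        return results, True
```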

Dependability

Achieving availability is not straightforward

BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]

Partitions are quite frequent

[Figure: average number of sites (y-axis, 0 to 12) by monthly availability (x-axis, bins: < 97, < 98, < 99, < 99.8, < 100).]

Communication

Message latency

Communication is costly in wide-area networks

Latency is not negligible

Server capacity drops as the latency to process a query increases

Potential solutions

Reduce as much as possible the number of sites contacted to process a query

Most queries processed by sites that are close according to network distance

Caching query results or postings [Baeza-Yates et al., 2007]

Caching query answers:

44% of queries are singletons (appear only once)

88% of the unique queries are singletons

Infinite cache would achieve 56% hit-ratio

Caching postings of terms:

4% of terms are singletons

73% of the unique terms (the vocabulary) are singletons

Infinite cache would achieve 96% hit-ratio

Note: All statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
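Statistics of this kind can be computed from any stream of queries or query terms. The sketch below is illustrative only: the function names are assumptions and the input is whatever list the caller supplies, not the yahoo.co.uk log. In the simulation, the first occurrence of every distinct item counts as a miss, including items that are later repeated.

```python
from collections import Counter


def singleton_stats(stream):
    """Share of singletons in the whole stream and among its distinct items."""
    counts = Counter(stream)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "singleton_share_of_volume": singletons / len(stream),
        "singleton_share_of_unique": singletons / len(counts),
    }


def infinite_cache_hit_ratio(stream):
    """Hit ratio of a cache that never evicts: only first occurrences miss."""
    seen = set()
    hits = 0
    for item in stream:
        if item in seen:
            hits += 1
        else:
            seen.add(item)
    return hits / len(stream)
```

Passing a list of whole queries gives the answer-cache view; passing the list of individual query terms gives the postings-cache view.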

Static or dynamic caching of postings

Static caching of postings (Qtf)

Cache terms with the highest query log frequency fq(t)

However, there is a tradeoff between fq(t) and fd(t)

Terms with high query log frequency fq(t) are good for the cache

Terms with high document frequency fd(t) occupy too much space

Static caching of postings as a knapsack problem (QtfDf)

Cache posting lists of terms with the highest ratio fq(t)/fd(t)
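The two selection rules can be compared with a greedy sketch. It assumes that the space taken by a posting list is proportional to fd(t) and that cache capacity is counted in postings; the function name and the skip-if-too-large behaviour are illustrative choices, not taken from the talk.

```python
def select_static_cache(term_stats, capacity, by_ratio=True):
    """Greedily pick terms whose posting lists fit into `capacity` postings.

    term_stats: dict term -> (fq, fd), i.e. query log and document frequency
    by_ratio:   True ranks by fq/fd (QtfDf), False ranks by fq alone (Qtf)
    """
    def score(item):
        _, (fq, fd) = item
        return fq / fd if by_ratio else fq

    chosen, used = [], 0
    for term, (fq, fd) in sorted(term_stats.items(), key=score, reverse=True):
        if used + fd <= capacity:  # the list of this term costs fd postings
            chosen.append(term)
            used += fd
    return chosen


# Hypothetical statistics: "the" is queried most often but has a long list.
stats = {"the": (100, 90), "ir": (60, 30), "cache": (50, 30), "web": (45, 30)}
qtf = select_static_cache(stats, capacity=100, by_ratio=False)
qtfdf = select_static_cache(stats, capacity=100, by_ratio=True)
```

With these toy numbers, ranking by fq alone spends almost the whole budget on one long list, while ranking by the ratio fits three shorter lists that together cover more query occurrences.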

Analysis of static caching

Trade-offs between caching postings and answers

Caching postings results in more hits

Caching answers is faster

A comparison must take time and space parameters into account

Problem: Given a fixed amount of memory and the average response times for a system, how much should be allocated to caching answers and how much to caching postings?

Analysis of static caching

Scenario 1: Centralized retrieval system, complete/partial query evaluation, un/compressed postings

Postings cache can answer more queries than answers cache

Allocate most of the available memory to caching postings

Scenario 2: WAN distributed system, complete/partial query evaluation, un/compressed postings

Network time dominates

Allocate most of the available memory to caching answers
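A toy cost model shows why the two scenarios pull in opposite directions. Everything in it is an assumption made for illustration: the shape of the hit-ratio curves, their ceilings (0.5 for answers, 0.9 for postings), the time constants, and the premise that an answer-cache hit avoids the network entirely. It is not the model used in the cited analysis.

```python
def hit_ratio(memory, ceiling, half=0.1):
    """Hypothetical concave curve: approaches `ceiling` as memory grows."""
    return ceiling * memory / (memory + half)


def expected_time(x, t_net, t_answer=1.0, t_mem=10.0, t_disk=100.0):
    """Average response time when a fraction x of memory caches answers
    and the remaining 1 - x caches postings (total memory normalized to 1)."""
    h_answers = hit_ratio(x, ceiling=0.5)
    h_postings = hit_ratio(1.0 - x, ceiling=0.9)
    # An answer miss pays the network cost, then evaluates the query from
    # cached postings (t_mem) or from disk (t_disk).
    evaluate = t_net + h_postings * t_mem + (1.0 - h_postings) * t_disk
    return h_answers * t_answer + (1.0 - h_answers) * evaluate


def best_answers_fraction(t_net, steps=10):
    """Grid search for the split that minimizes the expected response time."""
    candidates = (i / steps for i in range(steps + 1))
    return min(candidates, key=lambda x: expected_time(x, t_net))
```

Under these assumed numbers, `best_answers_fraction(0.0)` (no network cost, as in the centralized case) gives most of the memory to postings, while `best_answers_fraction(1000.0)` (network time dominating) gives most of it to answers.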

Query Dynamics

Slowly changing query dynamics make static caching viable

References

Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006).

Analyzing imbalance among homogeneous index servers in a web search system.

Information Processing & Management.

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007).

The impact of caching on search engines.

In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.

Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004).

Ubicrawler: a scalable fully distributed web crawler.

Software, Practice and Experience, 34(8):711–726.

Junqueira, F. and Marzullo, K. (2005).

Coterie availability in sites.

In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.

Larkey, L. S., Connell, M. E., and Callan, J. (2000).

Collection selection and results merging with topically organized U.S. patents and TREC data.

In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.

Liu, X. and Croft, W. B. (2004).

Cluster-based retrieval using language models.

In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.

Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007).

Mining query logs to optimize index partitioning in parallel web search engines.

To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).

Moffat, A., Webber, W., and Zobel, J. (2006).

Load balancing for term-distributed parallel retrieval.

In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.

Puppin, D., Silvestri, F., and Laforenza, D. (2006).

Query-driven document partitioning and collection selection.

In InfoScale ’06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.

Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006).

A pipelined architecture for distributed text query evaluation.

Information Retrieval. Published online October 5, 2006.