Presentation done by Ricardo Baeza-Yates
Challenges in Distributed IR
Ricardo Baeza-Yates
Crawling
Indexing
Query Processing
Caching
Challenges in Distributed Information Retrieval
Ricardo Baeza-Yates (1,2)
Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)
1. Yahoo! Research Barcelona, Catalunya, Spain
2. Yahoo! Research Latin America, Santiago, Chile
3. ISTI-CNR, Pisa, Italy
1 Crawling
2 Indexing
3 Query Processing
4 Caching
Main Modules and Issues
Module | Partition | Dependability | Communication (sync.) | External factors
Crawling | URL assignment | Re-crawl | URL exchanges | Web growth, content change, network topology, bandwidth, DNS, QoS of servers
Indexing | Doc. partition, term partition | Re-index | Partial indexing, updating, merging | Web growth, content change, global statistics
Querying | Query routing, collection selection, load balancing | Replication, caching | Rank aggregation, personalization | Changing user needs, user base growth, DNS
Crawling
In theory it is simple: fetch, parse, fetch, parse, ...
In practice it is difficult: it means using other people's resources (web servers' CPU and network)
Issues
How to partition the crawling task?
What to do when one agent fails?
How to communicate among agents?
How to deal with external factors?
Partitioning
Host-based partitioning exploits locality of links
Balance improves if large and small hosts are treated differently
Performance improves if geographic location is considered
Consistent hashing
Allows agents to be added to and removed from the pool [Boldi et al., 2004]
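A minimal sketch of host-based assignment with consistent hashing, to make the last point concrete. It is an illustration only, not the code of [Boldi et al., 2004]; the class name ConsistentHashRing, the helper stable_hash, the agent names and the number of virtual points per agent (replicas=100) are all assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(key):
    """64-bit hash that is stable across runs (built-in hash() is salted)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Hypothetical helper: maps host names to crawling agents."""

    def __init__(self, agents, replicas=100):
        self.replicas = replicas  # virtual points per agent (assumed value)
        self._ring = []           # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        # Each agent owns several points on the ring to smooth the balance.
        for i in range(self.replicas):
            bisect.insort(self._ring, (stable_hash(f"{agent}#{i}"), agent))

    def remove(self, agent):
        # Dropping an agent only frees the points that agent owned.
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        # All URLs of one host go to the same agent (host-based partitioning):
        # pick the first ring point at or after the host's hash, wrapping around.
        if not self._ring:
            raise ValueError("no agents in the pool")
        idx = bisect.bisect_left(self._ring, (stable_hash(host),))
        return self._ring[idx % len(self._ring)][1]
```

With this scheme, adding an agent moves only the hosts that now fall on the new agent's points, and removing it again restores the previous assignment; the other agents keep their hosts.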
Communication
Host-based partitioning reduces communication
Highly-linked URLs should be cached
Communication with the server can be improved if the server cooperates
External factors
DNS can be a bottleneck
Varying quality of implementation of HTTP
Varying quality of HTML coding
Varying quality of service in general
SPAM
What is Indexing?
Indexing in databases and IR is the process of building an index over a collection of documents
Inverted indexes are the index structure typically used in IR
Lexicon: contains the distinct terms appearing in the collection's documents
Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
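The two components can be illustrated with a small sketch. It assumes whitespace tokenization and lowercasing, and stores term positions as the "description of occurrences"; the function name build_inverted_index is hypothetical, and real systems typically store compressed postings with richer payloads.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns (lexicon, postings)."""
    postings = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    lexicon = sorted(postings)    # the distinct terms of the collection
    return lexicon, dict(postings)
```

For example, for the two documents {1: "web search engine", 2: "web crawler"}, the lexicon holds four terms and the posting list of "web" records position 0 in both documents.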
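Index and Distributed Indexing
[Figure: the T × D (term by document) matrix shown twice: under "Document Partition" it is split into D1, D2, ..., Dm; under "Term Partition" it is split into T1, T2, ..., Tn]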
Document Partitioning
split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
pros:
higher throughput
new documents are easily added to existing indexes
load balanced
cons:
high number of disk operations
high volume of data read from disk
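A small sketch of document partitioning, under stated assumptions: documents are assigned round-robin (any assignment rule could be substituted), queries are conjunctive, and the function names index_partition, document_partition and search are hypothetical. It shows why every partition has to be contacted for every query.

```python
from collections import defaultdict


def index_partition(docs):
    """Build a term -> set(doc_id) index for one sub-collection."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def document_partition(docs, n_parts):
    """Assign documents round-robin (an assumption) and index each part."""
    parts = [{} for _ in range(n_parts)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        parts[i % n_parts][doc_id] = text
    return [index_partition(part) for part in parts]


def search(indexes, query):
    """Broadcast a conjunctive query to every partition and merge the hits."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set()
    for index in indexes:  # every partition is contacted for every query
        hits |= set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)
```

Adding a document only touches the index of the partition it is assigned to, which is what makes incremental updates easy in this layout.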
Term Partitioning
split the terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)
pros:
reduced number of disk accesses
reduced volume of exchanged data
cons:
requires the entire index to be built before slicing it into partitions
not scalable with large collections
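A sketch of term partitioning for comparison. It assumes the complete index is built first and that terms are assigned to nodes by a CRC32 hash (an arbitrary choice for the example); the names build_index, owner, term_partition and search are hypothetical. The point it illustrates is that a query only contacts the nodes that own its terms.

```python
import zlib
from collections import defaultdict


def build_index(docs):
    """Build the complete term -> set(doc_id) index before slicing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def owner(term, n_nodes):
    """Node responsible for a term (hash assignment is an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % n_nodes


def term_partition(index, n_nodes):
    """Slice the lexicon, and each term's posting list, among the nodes."""
    nodes = [{} for _ in range(n_nodes)]
    for term, postings in index.items():
        nodes[owner(term, n_nodes)][term] = postings
    return nodes


def search(nodes, query):
    """Fetch each term's list from its owner; return (hits, contacted nodes)."""
    terms = query.lower().split()
    if not terms:
        return [], set()
    contacted = {owner(t, len(nodes)) for t in terms}
    lists = [nodes[owner(t, len(nodes))].get(t, set()) for t in terms]
    return sorted(set.intersection(*lists)), contacted
```

A single-term query touches exactly one node here, whereas the posting lists of a multi-term query may have to be shipped between nodes before they can be intersected.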
Partitioning Goals
partitioning is the first design issue to be faced in distributed indexing
a distributed index should allow for efficient query routing and resolution
reducing the number of nodes queried is desirable too
Partitioning Techniques
random partitioning
documents are assigned uniformly at random (u.a.r.) to the various partitions
topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004])
documents are first clustered and then each partition is composed of one (or more) cluster(s)
usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006])
clustering is induced by the way users interact with the index
Load Balancing Issues
In document partitioned indexes not adopting collection selection strategies, load is almost balanced among all the query processors
In term partitioned indexes (even the new pipelined schema [Webber et al., 2006]) load balancing is an issue
In federated document partitioned systems where collection selection is applied, balancing the load is still an unexplored issue.
[Figure: two bar charts, "Document-distributed" and "Pipelined"; x-axis: processors 1 to 8; y-axis: load percentage, 0.0 to 100.0]
Figure 3: Average per-processor busy load for k = 8 and TB/01, for document-distributed processing and pipelined processing. The dashed line in each graph is the average busy load over the eight processors.
[...] nodes become starved for work as queries with a term on node two queue up for processing.
The document-partitioned system has a much higher average busy load than the pipelined one, 95.3% compared to 58.0%. On the one hand, this is to the credit of document-distribution, in that it demonstrates that it is better able to make use of all system resources, whereas pipelining leaves the system underutilized. On the other hand, the fact that in this configuration pipelining is able to achieve roughly 75% of the throughput of document-distribution using only 60% of the resources is encouraging, and confirms the model's underlying potential.
Figure 3 summarized system load for the 10,000-query run as a whole; it is also instructive to examine the load over shorter intervals. Figure 4 shows the busy load for document-distribution (top) and pipelining (bottom) when measured every 100 queries, with the eight lines in each graph showing the aggregate load of this and lower numbered processors, meaning that the eighth of the lines shows the overall total as a system load out of the available 8.0 total resource. Note how the document distributed approach is consistent in its performance over all time intervals, and the total system utilization remains in a band between 7.1 and 7.8 out of 8.0 (ignoring the trail-off as the system is finishing up).
The contrast with pipelining (the bottom graph in Figure 4) is stark. The total system utilization varies between 3.1 and 5.9 out of 8.0, and is volatile, in that nodes are not all busy or all quiet at the same time. The only constant is that node two is busy all the time, acting as a throttle on the performance of the system.
The reason for the uneven system load lies in the different ways the document-partitioned and pipelined systems split the collection. The two chief determinants of system load in a text query evaluation engine are the number of terms to be processed, and the length of each term's inverted lists. Document partitioning divides up this load evenly: each node processes every term in the [...]
Exploiting Usage Information
Query logs contain features that are critical for optimizing efficiency of different parts of search engines
query distribution
query arrival time
clickthrough information
...
Usage Information in Term Partitioned Systems
frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors
bin-packing approach [Moffat et al., 2006]
data mining approach [Lucchese et al., 2007]
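To give a feel for the bin-packing view, the sketch below uses a greedy longest-processing-time heuristic: terms are sorted by estimated load and each one goes to the currently least loaded processor. This is a generic heuristic chosen for illustration, not the exact procedure of [Moffat et al., 2006]; the function name balance_terms is hypothetical, and the per-term load (for instance query frequency times posting list length) is an assumed estimate supplied by the caller.

```python
import heapq


def balance_terms(term_load, n_procs):
    """term_load: dict term -> estimated load. Returns (assignment, loads)."""
    heap = [(0.0, proc) for proc in range(n_procs)]  # (total load, processor)
    heapq.heapify(heap)
    assignment = {}
    # Heaviest terms first; ties broken alphabetically for determinism.
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, proc = heapq.heappop(heap)  # least loaded processor so far
        assignment[term] = proc
        heapq.heappush(heap, (total + load, proc))
    loads = [0.0] * n_procs
    for total, proc in heap:
        loads[proc] = total
    return assignment, loads
```

The heuristic keeps the gap between the most and least loaded processor no larger than the heaviest single term, which also shows its limit: one very popular term with a long list can still dominate a node.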
Usage Information in Document Partitioned Systems
random partitioning does not ensure load balancing [Badue et al., 2006]
broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
usage-based mapping can be adopted to build sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]
Challenges in Distributed Indexing
in document partitioned systems, partitioning strategies are needed that improve the effectiveness of collection selection
in both kinds of system it is a challenge to find effective load balancing strategies
Query processing
System components
Clients submitting queries
Sites consisting of servers
Servers are commodity computers
Query processing
System receives a query
Query routing: forwarding query to appropriate sites
Merging results
Challenges
Determine appropriate sites on the fly
WAN communication is costly
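The result-merging step can be sketched as a k-way merge of the ranked lists returned by the contacted sites. The sketch assumes that each site returns (score, doc_id) pairs sorted by descending score and that scores are comparable across sites (which in practice depends on shared global statistics); the function name merge_results is hypothetical.

```python
import heapq
from itertools import islice


def merge_results(site_results, k):
    """Merge per-site ranked lists into the global top-k.

    site_results: list of lists of (score, doc_id), each sorted by
    descending score; scores are assumed comparable across sites.
    """
    merged = heapq.merge(*site_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))
```

Because the merge is lazy, only as many entries as needed for the top k are consumed from each site's list.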
Challenges in more detail
Large-scale systems
Large amount of data
Large data structures
Large number of clients and servers
Partitioning of data structures
Necessary due to very large data structures
Parallel processing
e.g. document collection split by topic, language, region
Replication of data structures
For availability, throughput, and response time
Conflict with resource utilization
Framework for Distributed Query Processing
[Figure: a client connected through a WAN to Site A (Region X), Site B (Region Y) and Site C (Region Z); the diagram carries the markers 1, 2 and 3]
Query processor matches documents to the received queries
Coordinator receives queries and routes them to appropriate sites
Cache stores results from previous queries
Currently...
Multiple sites
Sites are full replicas of each other
Simple query routing: Dynamic DNS
According to the previous framework, there is an opportunity to
Use storage resources more efficiently
Apply more sophisticated query routing mechanisms
Adopt effective partitioning strategies (e.g., language-based strategies)
Partitioning
Goals
Achieve cost-effective scalability
Reduce response times
Potential solutions
Partition of large data structures by topic, language, etc.
Effective query routing first to local sites, then to global sites
Incremental presentation of results to alleviate network latencies
Dependability
Goals
Availability of query processors
Consistency of replicated query data (can be weak)
Consistency of user state: e.g., personalization, user preferences
Potential solutions
More network resources: multi-homed sites
Replication: within and across sites
Consistency: techniques for weak consistency (replicas eventually converge)
Caching: improve availability when query processors are unavailable
Dependability
Achieving availability is not straightforward
BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
Partitions are quite frequent
[Figure: bar chart of the average number of sites (y-axis, 0 to 12) per monthly availability bin (x-axis bins labelled "< 97", "< 98", "< 99", "< 99.8", "< 100")]
Communication
Message latency
Communication is costly in wide-area networks
Latency is not negligible
Server capacity is reduced as the latency to process a query increases
Potential solutions
Reduce as much as possible the number of sites contacted to process a query
Process most queries at sites that are close in terms of network distance
Caching query results or postings [Baeza-Yates et al., 2007]
Caching query answers:
44% of queries are singletons (appear only once)
88% of the unique queries are singletons
Infinite cache would achieve 56% hit-ratio
Caching postings of terms:
4% of terms are singletons
73% of the unique terms (the vocabulary) are singletons
Infinite cache would achieve 96% hit-ratio
Note: All statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
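The relation between singletons and the best achievable hit ratio can be checked on any stream with the sketch below. It uses a toy stream, not the yahoo.co.uk log, and the function names singleton_stats and infinite_cache_hit_ratio are hypothetical. A singleton can never be a hit, so the share of non-singleton items bounds the hit ratio from above; the simulation also counts the first occurrence of a repeated item as a compulsory miss, so the ratio it reports can be lower than that bound.

```python
from collections import Counter


def singleton_stats(stream):
    """Return (singleton share of the stream, singleton share of unique items)."""
    counts = Counter(stream)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(stream), singletons / len(counts)


def infinite_cache_hit_ratio(stream):
    """Hit ratio of a cache that never evicts; first occurrences are misses."""
    seen = set()
    hits = 0
    for item in stream:
        if item in seen:
            hits += 1
        else:
            seen.add(item)
    return hits / len(stream)
```

The same two functions apply to a stream of query terms instead of whole queries, which is the posting-list case on this slide.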
Static or dynamic caching of postings
Static caching of postings (Qtf)
Cache terms with the highest query log frequency fq(t)
However, there is a tradeoff between fq(t) and fd(t)
Terms with high query log frequency fq(t) are good for the cache
Terms with high document frequency fd(t) occupy too much space
Static caching of postings as a knapsack problem (QtfDf)
Cache posting lists of terms with the highest ratio fq(t) / fd(t)
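The ratio-based selection can be sketched as a greedy knapsack heuristic. The sketch assumes that the space taken by a posting list is proportional to fd(t) and that the cache capacity is expressed in postings; the function name select_terms_qtfdf and the toy frequencies in the usage note are made up for illustration.

```python
def select_terms_qtfdf(fq, fd, capacity):
    """Pick terms to cache in decreasing fq(t) / fd(t) order.

    fq: term -> query log frequency; fd: term -> document frequency,
    used here as the assumed size of the posting list.
    capacity: cache size, in postings.
    """
    ranked = sorted(fq, key=lambda t: (-fq[t] / fd[t], t))
    cached, used = [], 0
    for term in ranked:
        if used + fd[term] <= capacity:  # skip lists that no longer fit
            cached.append(term)
            used += fd[term]
    return cached, used
```

With fq = {"the": 1000, "yahoo": 400, "barcelona": 50, "ir": 30}, fd = {"the": 10000, "yahoo": 200, "barcelona": 100, "ir": 20} and a capacity of 300 postings, the sketch caches "yahoo" and "ir". Selecting by fq(t) alone would start with "the", whose list does not fit in the cache at all.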
Analysis of static caching
Trade-offs between caching postings and answers
Caching postings results in more hits
Caching answers is faster
Comparing the two requires considering time and space parameters
Problem: Given a fixed amount of memory and the average response times for a system, how much to allocate for caching answers and how much for caching postings?
Analysis of static caching
Scenario 1: Centralized retrieval system, complete/partial query evaluation, un/compressed postings
Postings cache can answer more queries than answers cache
Most of the available memory goes to caching postings
Scenario 2: WAN distributed system, complete/partial query evaluation, un/compressed postings
Network time dominates
Most of the available memory goes to caching answers
Query Dynamics
Slowly changing query dynamics make static caching viable
Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006).
Analyzing imbalance among homogeneous index servers in a web search system.
Information Processing & Management.
Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007).
The impact of caching on search engines.
In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004).
UbiCrawler: a scalable fully distributed web crawler.
Software, Practice and Experience, 34(8):711–726.
Junqueira, F. and Marzullo, K. (2005).
Coterie availability in sites.
In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.
Larkey, L. S., Connell, M. E., and Callan, J. (2000).
Collection selection and results merging with topically organized U.S. patents and TREC data.
In CIKM '00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.
Liu, X. and Croft, W. B. (2004).
Cluster-based retrieval using language models.
In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.
Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007).
Mining query logs to optimize index partitioning in parallel web search engines.
To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).
Moffat, A., Webber, W., and Zobel, J. (2006).
Load balancing for term-distributed parallel retrieval.
In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.
Puppin, D., Silvestri, F., and Laforenza, D. (2006).
Query-driven document partitioning and collection selection.
In InfoScale '06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006).
A pipelined architecture for distributed text query evaluation.
Information Retrieval.
Published online October 5, 2006.