Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)

Challenges in Distributed IR

Ricardo Baeza-Yates

Crawling

Indexing

Query Processing

Caching

Challenges in Distributed Information Retrieval

Ricardo Baeza-Yates (1, 2)

Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)

1. Yahoo! Research Barcelona – Catalunya, Spain

2. Yahoo! Research Latin America – Santiago, Chile

3. ISTI-CNR – Pisa, Italy



1 Crawling

2 Indexing

3 Query Processing

4 Caching



Main Modules and Issues

Module | Partition | Dependability | Communication (sync.) | External factors

Crawling | URL assignment | Re-crawl | URL exchanges | Web growth, content change, network topology, bandwidth, DNS, QoS of servers

Indexing | Doc. partition, term partition | Re-index | Partial indexing, updating, merging | Web growth, content change, global statistics

Querying | Query routing, collection selection, load balancing | Replication, caching | Rank aggregation, personalization | Changing user needs, user base growth, DNS



Crawling

In theory it is simple: fetch, parse, fetch, parse, . . .

In practice it is difficult: it implies using other people’s resources (web servers’ CPU and network)





Issues

How to partition the crawling task?

What to do when one agent fails?

How to communicate among agents?

How to deal with external factors?



Partitioning

Host-based partitioning exploits locality of links

Balance improves if large/small hosts are treated differently

Performance improves if geographic location is considered

Consistent hashing

Allows agents to be added to and removed from the pool [Boldi et al., 2004]
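The combination of host-based partitioning and consistent hashing can be sketched as follows. This is a minimal illustration, not the scheme of [Boldi et al., 2004]: the class name `ConsistentHashRing`, the use of MD5, and the number of virtual points per agent are all assumptions made for the example. Each host name is hashed onto a ring and assigned to the agent owning the next point, so removing an agent only moves the hosts that agent owned.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (stable across runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Assigns hosts to crawling agents (hypothetical helper for illustration)."""

    def __init__(self, replicas: int = 64):
        self.replicas = replicas  # virtual points per agent, to smooth the balance
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> agent name

    def add_agent(self, agent: str) -> None:
        for i in range(self.replicas):
            p = _hash(f"{agent}#{i}")
            self._owner[p] = agent
            bisect.insort(self._points, p)

    def remove_agent(self, agent: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != agent]
        self._owner = {p: a for p, a in self._owner.items() if a != agent}

    def agent_for(self, host: str) -> str:
        """Return the agent owning the first ring point after the host's hash."""
        if not self._points:
            raise LookupError("no agents in the pool")
        i = bisect.bisect(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[i]]


ring = ConsistentHashRing()
for name in ("agent-a", "agent-b", "agent-c"):
    ring.add_agent(name)
hosts = [f"host{i}.example.org" for i in range(200)]
before = {h: ring.agent_for(h) for h in hosts}
ring.remove_agent("agent-b")
after = {h: ring.agent_for(h) for h in hosts}
```

Because whole hosts are assigned, all URLs of one host stay with one agent, which is what lets the partitioning exploit link locality.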





Communication

Host-based partitioning reduces communication

Highly-linked URLs should be cached

Communication with the server can be improved if the server cooperates



External factors

DNS can be a bottleneck

Varying quality of implementation of HTTP

Varying quality of HTML coding

Varying quality of service in general

SPAM



What Is Indexing?

Indexing, in databases and IR, is the process of building an index over a collection of documents

Inverted indexes are the typical index structure in IR

Lexicon: contains the distinct terms appearing in the collection’s documents

Posting lists: describe the occurrences of each term within the documents that contain it
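The two components can be illustrated with a small sketch. The function name `build_inverted_index`, the whitespace tokenization, and the choice to store term positions in each posting are assumptions made for the example; real systems store compressed postings and use richer tokenizers.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Return (lexicon, postings) for a {doc_id: text} mapping.

    postings[term] is a list of (doc_id, [positions]) pairs sorted by doc_id.
    """
    postings = defaultdict(dict)
    for doc_id in sorted(docs):
        # Naive tokenization: lowercase and split on whitespace.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    lexicon = sorted(postings)
    return lexicon, {t: sorted(postings[t].items()) for t in lexicon}


docs = {
    1: "web search engines crawl the web",
    2: "search engines index documents",
}
lexicon, postings = build_inverted_index(docs)
print(lexicon)
print(postings["web"])
print(postings["search"])
```

Here the lexicon is the sorted list of distinct terms, and each posting list records in which documents, and at which positions, a term occurs.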



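Index and Distributed Indexing

[Figure: the T × D (term × document) matrix. Document partition: the matrix is split into D1, D2, ..., Dm. Term partition: the matrix is split into T1, T2, ..., Tn.]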



Document Partitioning

split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)

pros:

higher throughput

new documents are easily added to existing indexes

load balanced

cons:

high number of disk operations

high volume of data read from disk



Term Partitioning

split the terms of the lexicon (and the corresponding inverted lists) among search systems (corresponding to horizontally slicing the T × D matrix)

pros:

reduced number of disk accesses

reduced volume of exchanged data

cons:

requires the entire index to be built before slicing it into partitions

not scalable with large collections
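The two ways of slicing the T × D matrix can be contrasted with a short sketch. The function names and the round-robin assignment rules are assumptions chosen for brevity, and the toy `postings` dictionary (term to list of `(doc_id, positions)` pairs) is made up for the example.

```python
def document_partition(postings, m):
    """Give each of m nodes every term, restricted to that node's documents."""
    parts = [{} for _ in range(m)]
    for term, plist in postings.items():
        for doc_id, positions in plist:
            # Round-robin by document id: an illustrative assignment rule only.
            parts[doc_id % m].setdefault(term, []).append((doc_id, positions))
    return parts


def term_partition(postings, n):
    """Give each of n nodes the complete posting lists of a subset of terms."""
    parts = [{} for _ in range(n)]
    for i, term in enumerate(sorted(postings)):
        parts[i % n][term] = postings[term]
    return parts


def nodes_for_query(term_parts, query_terms):
    """Nodes a term-partitioned system must contact for a query."""
    return sorted({i for i, part in enumerate(term_parts)
                   for t in query_terms if t in part})


postings = {
    "crawl": [(1, [3])],
    "engines": [(1, [2]), (2, [1])],
    "index": [(2, [2])],
    "search": [(1, [1]), (2, [0])],
    "web": [(1, [0, 5])],
}
doc_parts = document_partition(postings, 2)
term_parts = term_partition(postings, 2)
print(nodes_for_query(term_parts, ["search", "engines"]))
```

With document partitioning every node holds part of each posting list, so a query is sent to all nodes; with term partitioning only the nodes holding the query terms are contacted, but those nodes read the full lists.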



Partitioning Goals

partitioning is the first design issue to be faced in distributed indexing

a distributed index should allow for efficient query routing and resolution

reducing the number of nodes queried is desirable too



Partitioning Techniques

random partitioning

documents are assigned uniformly at random to the partitions

topical organization using clustering (e.g. k-means [Larkey et al., 2000, Liu and Croft, 2004])

documents are first clustered, and then each partition is made up of one or more clusters

usage-induced partitioning (e.g. Query-Vector Document Model [Puppin et al., 2006])

clustering is induced by the way users interact with the index



Load Balancing Issues

In document partitioned indexes not adopting collection selection strategies, load is almost balanced among all the query processors

In term partitioned indexes (even the new pipelined schema [Webber et al., 2006]) load balancing is an issue

In federated document partitioned systems where collection selection is applied, balancing the load is still an unexplored issue.

[Figure 3: two bar charts, "Document-distributed" and "Pipelined"; x-axis: processors 1 to 8; y-axis: load percentage, 0.0 to 100.0.]

Figure 3: Average per-processor busy load for k = 8 and TB/01, for document-distributed processing and pipelined processing. The dashed line in each graph is the average busy load over the eight processors.

nodes become starved for work as queries with a term on node two queue up for processing.

The document-partitioned system has a much higher average busy load than the partitioned one, 95.3% compared to 58.0%. On the one hand, this is to the credit of document-distribution, in that it demonstrates that it is better able to make use of all system resources, whereas pipelining leaves the system underutilized. On the other hand, the fact that in this configuration pipelining is able to achieve roughly 75% of the throughput of document-distribution using only 60% of the resources is encouraging, and confirms the model’s underlying potential.

Figure 3 summarized system load for the 10,000-query run as a whole; it is also instructive to examine the load over shorter intervals. Figure 4 shows the busy load for document-distribution (top) and pipelining (bottom) when measured every 100 queries, with the eight lines in each graph showing the aggregate load of this and lower numbered processors, meaning that the eighth of the lines shows the overall total as a system load out of the available 8.0 total resource. Note how the document distributed approach is consistent in its performance over all time intervals, and the total system utilization remains in a band between 7.1 and 7.8 out of 8.0 (ignoring the trail-off as the system is finishing up).

The contrast with pipelining (the bottom graph in Figure 4) is stark. The total system utilization varies between 3.1 and 5.9 out of 8.0, and is volatile, in that nodes are not all busy or all quiet at the same time. The only constant is that node two is busy all the time, acting as a throttle on the performance of the system.

The reason for the uneven system load lies in the different ways the document-partitioned and pipelined systems split the collection. The two chief determinants of system load in a text query evaluation engine are the number of terms to be processed, and the length of each term’s inverted lists. Document partitioning divides up this load evenly – each node processes every term in the



Exploiting Usage Information

Query logs contain features that are critical for optimizing efficiency of different parts of search engines

query distribution

query arrival time

clickthrough information

...



Usage Information in Term Partitioned Systems

frequency of query terms can be exploited to partition a collection with the aim of balancing the load of query processors

bin-packing approach [Moffat et al., 2006]

data mining approach [Lucchese et al., 2007]
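The bin-packing idea can be sketched with a simple greedy heuristic. This is not the algorithm of [Moffat et al., 2006]: it is the generic "largest item first, least-loaded bin" rule, and both the function name `balance_terms` and the per-term load estimates (for instance query frequency times posting-list length) are assumptions made for illustration.

```python
import heapq


def balance_terms(term_load, n_nodes):
    """Assign terms to nodes, heaviest first, always to the least-loaded node.

    term_load maps each term to an estimated processing load
    (hypothetical numbers; any non-negative estimate works).
    """
    heap = [(0.0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for term, load in sorted(term_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, node = heapq.heappop(heap)
        assignment[term] = node
        heapq.heappush(heap, (total + load, node))
    loads = [0.0] * n_nodes
    for term, node in assignment.items():
        loads[node] += term_load[term]
    return assignment, loads


term_load = {"the": 50, "web": 30, "search": 25, "crawl": 10, "index": 5}
assignment, loads = balance_terms(term_load, 2)
print(assignment)
print(loads)
```

In this toy input both nodes end up with the same estimated load; with skewed real query logs a single very frequent term can still dominate one node, which is why load balancing remains an issue for term-partitioned systems.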



Usage Information in Document Partitioned Systems

random partitioning does not ensure load balancing [Badue et al., 2006]

broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents

usage-based mapping can be adopted to partition the collection into sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]



Challenges in Distributed Indexing

in document partitioned systems, partitioning strategies are needed that improve the effectiveness of collection selection

in both systems, finding effective load balancing strategies is a challenge



Query processing

System components

Clients submitting queries

Sites consisting of servers

Servers are commodity computers

Query processing

System receives a query

Query routing: forwarding query to appropriate sites

Merging results

Challenges

Determine appropriate sites on the fly

WAN communication is costly
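The routing and merging steps can be sketched as follows. The function names, the per-site term summaries used for routing, and the assumption that scores returned by different sites are directly comparable are all hypothetical simplifications; comparable scoring across sites is itself one of the open issues.

```python
import heapq
from itertools import islice


def route(query_terms, site_terms):
    """Pick the sites whose (hypothetical) term summary overlaps the query."""
    wanted = set(query_terms)
    return [site for site, vocab in site_terms.items() if vocab & wanted]


def merge_results(per_site_results, k):
    """Merge ranked lists of (score, doc_id), each sorted by descending score."""
    merged = heapq.merge(*per_site_results, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))


site_terms = {
    "A": {"barcelona", "hotel"},
    "B": {"santiago", "hotel"},
    "C": {"pisa"},
}
print(route(["pisa"], site_terms))

site_a = [(0.9, "a1"), (0.4, "a2")]
site_b = [(0.7, "b1"), (0.6, "b2")]
print(merge_results([site_a, site_b], 3))
```

Contacting only the selected sites keeps the number of WAN round trips low, and the merge reads just enough of each ranked list to produce the top k results.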



Challenges in more detail

Large-scale systems

Large amount of data

Large data structures

Large number of clients and servers

Partitioning of data structures

Necessary due to very large data structures

Parallel processing

e.g. document collection split by topic, language, region

Replication of data structures

For availability, throughput, and response time

Conflict with resource utilization



Framework for Distributed Query Processing

[Figure: a client connected over a WAN to three sites: Site A (Region X), Site B (Region Y) and Site C (Region Z). The labels 1, 2 and 3 appear in the figure.]

Query processor matches documents to the received queries

Coordinator receives queries and routes them to appropriate sites

Cache stores results from previous queries



Currently...

Multiple sites

Sites are full replicas of each other

Simple query routing: Dynamic DNS

The previous framework opens opportunities to:

Use storage resources more efficiently

Apply more sophisticated query routing mechanisms

Adopt effective partitioning strategies (e.g., language-based strategies)



Partitioning

Goals

Achieve cost-effective scalability

Reduce response times

Potential solutions

Partition of large data structures by topic, language, etc.

Effective query routing first to local sites, then to global sites

Incremental presentation of results to alleviate network latencies



Dependability

Goals

Availability of query processors

Consistency of replicated query data (can be weak)

Consistency of user state: e.g., personalization, user preferences

Potential solutions

More network resources: multi-homed sites

Replication: within and across sites

Consistency: techniques for weak consistency (replicas eventually converge)

Caching: improve availability when query processors are unavailable



Dependability

Achieving availability is not straightforward

BIRN system studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]

Partitions are quite frequent

[Figure: bar chart; x-axis: monthly availability (< 97, < 98, < 99, < 99.8, < 100); y-axis: average number of sites (0 to 12).]



Communication

Message latency

Communication is costly in wide-area networks

Latency is not negligible

Server capacity drops as the latency to process a query increases

Potential solutions

Reduce as much as possible the number of sites contacted to process a query

Most queries processed by sites that are close according to network distance



Caching query results or postings [Baeza-Yates et al., 2007]

Caching query answers:

44% of queries are singletons (appear only once)

88% of the unique queries are singletons

Infinite cache would achieve 56% hit-ratio

Caching postings of terms:

4% of terms are singletons

73% of the unique terms (the vocabulary) are singletons

Infinite cache would achieve 96% hit-ratio

Note: All statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
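The singleton statistics above can be computed from any query log with a few lines. The sketch below uses a made-up toy log and a hypothetical function name `singleton_stats`; it reports the share of the query volume that can never be a cache hit (singletons) and its complement, which is how the 56% figure relates to the 44% one.

```python
from collections import Counter


def singleton_stats(query_log):
    """Summarize how much of a query stream could ever be served from a cache."""
    counts = Counter(query_log)
    total = len(query_log)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        # Share of all submitted queries that appear only once.
        "singleton_share_of_volume": singletons / total,
        # Share of distinct queries that appear only once.
        "singleton_share_of_unique": singletons / len(counts),
        # Share of the volume made of queries that repeat at least once.
        "repeated_share_of_volume": 1 - singletons / total,
    }


toy_log = ["a", "b", "a", "c", "a", "d", "b", "e"]
stats = singleton_stats(toy_log)
print(stats)
```

The same function applied to the stream of query terms, rather than whole queries, gives the corresponding figures for caching postings.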



Static or dynamic caching of postings

Static caching of postings (Qtf)

Cache terms with the highest query log frequency fq(t)

However, there is a tradeoff between fq(t) and fd(t)

Terms with high query log frequency fq(t) are good for the cache

Terms with high document frequency fd(t) occupy too much space

Static caching of postings as a knapsack problem (QtfDf)

Cache posting lists of terms with the highest ratio fq(t)/fd(t)
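The QtfDf selection rule can be sketched as a greedy knapsack heuristic. The function name `select_static_cache`, the toy frequencies, and the choice to measure each posting list's size as fd(t) postings are assumptions made for the example; the source only states that terms are ranked by fq(t)/fd(t).

```python
def select_static_cache(fq, fd, capacity):
    """Fill a cache of `capacity` postings by decreasing fq(t)/fd(t).

    fq: term -> query log frequency; fd: term -> document frequency,
    used here as the size of the term's posting list (an assumption).
    """
    chosen, used = [], 0
    for t in sorted(fq, key=lambda t: (-fq[t] / fd[t], t)):
        if used + fd[t] <= capacity:  # skip lists that no longer fit
            chosen.append(t)
            used += fd[t]
    return chosen


fq = {"the": 1000, "web": 400, "crawler": 90, "ubicrawler": 30}
fd = {"the": 10000, "web": 2000, "crawler": 100, "ubicrawler": 10}
print(select_static_cache(fq, fd, capacity=2500))
```

In this toy input the most frequent query term is left out because its posting list alone exceeds the cache, which is the trade-off between fq(t) and fd(t) described above.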



Analysis of static caching

Trade-offs between caching postings and answers

Caching postings results in more hits

Caching answers is faster

Comparing the two requires considering time and space parameters

Problem: Given a fixed amount of memory and the average response times for a system, how much to allocate for caching answers and how much for caching postings?



Analysis of static caching

Scenario 1: Centralized retrieval system, complete/partial query evaluation, un/compressed postings

Postings cache can answer more queries than answers cache

Allocate most of the available memory to caching postings

Scenario 2: WAN distributed system, complete/partial query evaluation, un/compressed postings

Network time dominates

Allocate most of the available memory to caching answers

Query Dynamics

Slowly changing query dynamics make static caching viable



Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006).

Analyzing imbalance among homogeneous index servers in a web search system.

Information Processing & Management.

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007).

The impact of caching on search engines.

In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.

Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004).

Ubicrawler: a scalable fully distributed web crawler.

Software, Practice and Experience, 34(8):711–726.



Junqueira, F. and Marzullo, K. (2005).

Coterie availability in sites.

In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3–17, Krakow, Poland. Springer Verlag.

Larkey, L. S., Connell, M. E., and Callan, J. (2000).

Collection selection and results merging with topically organized U.S. patents and TREC data.

In CIKM ’00: Proceedings of the ninth international conference on Information and knowledge management, pages 282–289, New York, NY, USA. ACM Press.

Liu, X. and Croft, W. B. (2004).

Cluster-based retrieval using language models.

In SIGIR ’04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186–193, New York, NY, USA. ACM Press.



Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007).

Mining query logs to optimize index partitioning in parallel web search engines.

To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).

Moffat, A., Webber, W., and Zobel, J. (2006).

Load balancing for term-distributed parallel retrieval.

In SIGIR ’06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348–355, New York, NY, USA. ACM Press.

Puppin, D., Silvestri, F., and Laforenza, D. (2006).

Query-driven document partitioning and collection selection.

In InfoScale ’06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.



Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006).

A pipelined architecture for distributed text query evaluation.

Information Retrieval.

Published online October 5, 2006.
