A Search Engine Architecture Based on Collection Selection Diego Puppin University of Pisa, Italy Supervisors: D. Laforenza, M. Vanneschi

A Search Engine Architecture Based on Collection Selection

Diego Puppin

University of Pisa, Italy

Supervisors: D. Laforenza, M. Vanneschi

Introduction

September 2007 University of Pisa

Diego Puppin, A Search Engine Architecture Based on Collection Selection

Motivations

- The Web is getting bigger and bigger, and users are more and more picky!
- Precise results are needed very fast
- The index is growing, due to added pages and advanced indexing
- Big IR problems for the Web, books, and multimedia search engines

Motivations (2)

- There is a need for new solutions, able to give high-quality results with reduced computing load
- Parallel computing looks like the most natural choice to help algorithms face this growth rate [Baeza-Yates et al., 2007a]
- Billions of pages and data available (several TB): the index is still very big (about 5X the collection size)
- New approaches to partitioning are key to the next phase

Parallel (Distributed) IRSs

Term vs Doc partitioning

Term vs Doc partitioning

- Term partitioning: reduced computing load
  - Only the servers holding relevant terms are contacted
  - Problems of load balancing
  - Heavier communication patterns
- Document partitioning: better balancing, but all documents are scanned
- How can we reduce the load with document partitioning?
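
The contrast between the two schemes can be made concrete with a toy inverted index. The sketch below is only an illustration under stated assumptions: the four-document collection, the round-robin assignment of terms and documents, and all function names are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    0: "web search engine",
    1: "parallel search",
    2: "web cache",
    3: "parallel index",
}


def term_partition(docs, n_servers):
    """Each server holds the complete posting lists of a subset of the terms."""
    servers = [defaultdict(list) for _ in range(n_servers)]
    vocabulary = sorted({t for text in docs.values() for t in text.split()})
    # Round-robin assignment of terms to servers (an arbitrary choice).
    owner = {term: i % n_servers for i, term in enumerate(vocabulary)}
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):
            servers[owner[term]][term].append(doc_id)
    return servers, owner


def doc_partition(docs, n_servers):
    """Each server holds a full local index of a subset of the documents."""
    servers = [defaultdict(list) for _ in range(n_servers)]
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):
            servers[doc_id % n_servers][term].append(doc_id)
    return servers


def servers_contacted_term_part(query, owner):
    """Only the servers owning at least one query term are contacted."""
    return sorted({owner[t] for t in query.split() if t in owner})


def servers_contacted_doc_part(n_servers):
    """Without collection selection, every server scans its local index."""
    return list(range(n_servers))


term_servers, term_owner = term_partition(DOCS, 2)
doc_servers = doc_partition(DOCS, 2)
```

With this toy data, the query "cache index" touches a single term-partitioned server, while the document-partitioned system contacts both servers. Collection selection aims to remove that second cost.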

Main contributions

1. Query-vector document model
   - More efficient for partitioning and selection (co-clustering and PCAP)
2. Load-driven routing
   - Better exploits the available load
   - Based on the effective load of the system
3. Incremental caching
   - Improves throughput AND quality

Acknowledgments

- Fabrizio Silvestri
- Raffaele Perego
- Ricardo Baeza-Yates
- Abdur Chowdhury, Ophir Frieder, Gerhard Weikum, and the various reviewers…

Other contributions

- More compact collection representation
  - 1/5 of CORI, and outperforming it
- A way to select documents (50%) to move out of the index
  - The documents in the supplemental index contribute only 3% of the top results
- A simple way to update the index in a document-partitioned system
- Extended simulation
  - 6 M documents, 800k test queries, real computing costs, several configurations tested

Reviewers’ Request: Frieder

- More detailed discussion of the co-clustering algorithm
- Improved cost scheme
- Experiments to be extended in the future

Reviewers’ Requests: Weikum

- Improved description of pipelined term-partitioned IR system
- Improved description of co-clustering
- Better definition of shingles
- New realistic cost model
- Deeper discussion of cache and silent documents

How to Improve Partitions

Partitioning Strategy

[Figure: a partitioning strategy splits the document collection into partitions p1, p2, …, pp.]

- Random
- Content-based (e.g. K-Means, link-based clustering)
- Usage-based

The QV Model

[Figure: co-clustering of the query-document matrix, in two panels with queries on one axis and documents on the other. Legend: "Document j is returned in answer to query i"; "Document j is not relevant to query i"; query cluster; document cluster.]

Each document cluster corresponds to a different partition. In this case three partitions are generated.

For each query cluster, a vocabulary is built out of all the different query terms of the queries in the cluster.
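
As a rough illustration of the input to co-clustering, the sketch below builds a boolean query-document matrix from a training log. It assumes the log is available as a mapping from each query to the documents a reference engine returned for it. The function name, the data layout, and the `never_returned` list are hypothetical.

```python
def build_qv_matrix(query_results, all_docs):
    """Build a boolean query-document matrix from training results.

    query_results: dict mapping each training query to the list of document
    ids returned for it by a reference search engine (assumed input format).
    Returns (queries, docs, matrix, never_returned), where matrix[i][j] is 1
    if docs[j] was returned in answer to queries[i], and 0 otherwise.
    """
    queries = sorted(query_results)
    returned = {d for results in query_results.values() for d in results}
    docs = sorted(returned)
    column = {d: j for j, d in enumerate(docs)}
    matrix = [[0] * len(docs) for _ in queries]
    for i, query in enumerate(queries):
        for d in query_results[query]:
            matrix[i][column[d]] = 1
    # Documents no training query ever recalled have an all-zero column,
    # so they are reported separately instead of being clustered.
    never_returned = sorted(set(all_docs) - returned)
    return queries, docs, matrix, never_returned


log = {"pisa tower": ["d1", "d3"], "search engine": ["d2", "d3"]}
queries, docs, matrix, never_returned = build_qv_matrix(
    log, ["d1", "d2", "d3", "d4"]
)
```

Each column of the matrix is the query-vector representation of one document, so documents recalled by similar queries end up with similar vectors.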

Theoretical Model of Co-clustering

- The algorithm we use [Dhillon et al., 2003] finds the clustering that minimizes the loss of information between the original matrix and the clustered matrix (given the number of row and column clusters)
- Efficient implementation, very robust solution
- Stable with respect to test period, number of clusters, training set used, and matrix model (scores, boolean, repeated)
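
In information-theoretic co-clustering, the loss of information is the drop in mutual information between rows and columns once they are merged into clusters, I(X;Y) - I(X̂;Ŷ). The sketch below only evaluates that objective for a given pair of cluster assignments. It is not the optimization algorithm of [Dhillon et al., 2003], and the function names are hypothetical.

```python
import math


def normalize(matrix):
    """Turn a non-negative matrix into a joint probability distribution."""
    total = sum(sum(row) for row in matrix)
    return [[value / total for value in row] for row in matrix]


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    row_tot = [sum(row) for row in joint]
    col_tot = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (row_tot[i] * col_tot[j]))
    return mi


def clustered_joint(joint, row_cluster, col_cluster, k, l):
    """Sum the probability mass falling in each (row cluster, col cluster)."""
    out = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            out[row_cluster[i]][col_cluster[j]] += p
    return out


def information_loss(matrix, row_cluster, col_cluster, k, l):
    """I(X;Y) - I(X^;Y^): the quantity that co-clustering tries to minimize."""
    joint = normalize(matrix)
    merged = clustered_joint(joint, row_cluster, col_cluster, k, l)
    return mutual_information(joint) - mutual_information(merged)


# Block-diagonal toy matrix: queries 0-1 recall documents 0-1, and so on.
M = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
good = information_loss(M, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = information_loss(M, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
```

On this toy matrix the clustering that follows the two blocks loses no information, while the one that mixes the blocks loses the whole bit.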

QV for Collection Selection

[Figure labels: query, query clusters, document clusters.]

Partitions are ranked according to their relevance to the query.

We called this strategy PCAP.
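
The slide does not give the PCAP scoring formula, so the following is only one plausible reading. The query is first scored against the vocabulary of each query cluster, and those scores are then propagated to the document clusters through a weight matrix linking query clusters to document clusters. The term-frequency overlap score, the `weights` matrix, and all names are assumptions made for illustration.

```python
from collections import Counter


def query_cluster_scores(query, vocabularies):
    """Score each query cluster by term-frequency overlap with the query.

    vocabularies: one Counter of terms per query cluster. The overlap score
    is a stand-in for whatever text-ranking function is actually used.
    """
    terms = query.lower().split()
    return [sum(vocab[t] for t in terms) for vocab in vocabularies]


def rank_partitions(query, vocabularies, weights):
    """Rank document clusters (partitions) for a query.

    weights[i][j] is an assumed measure of how strongly query cluster i is
    associated with document cluster j, e.g. taken from the co-clustered
    matrix.
    """
    qc = query_cluster_scores(query, vocabularies)
    n_parts = len(weights[0])
    scores = [
        sum(qc[i] * weights[i][j] for i in range(len(qc)))
        for j in range(n_parts)
    ]
    ranking = sorted(range(n_parts), key=lambda j: (-scores[j], j))
    return ranking, scores


vocabularies = [
    Counter({"football": 3, "score": 1}),
    Counter({"recipe": 2, "pasta": 2}),
]
weights = [[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]]
ranking, scores = rank_partitions("pasta recipe", vocabularies, weights)
```

The broker would then poll the partitions in the order given by `ranking`, stopping after the first k.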

PCAP collection selection

Experimental Settings

- Experiments were carried out using:
  - WBR99: 5,939,061 documents; 22 GB of uncompressed text. Snapshot of the Brazilian Web (domain .br) back in 1999.
  - A query log from todobr.com covering the period Jan-Oct 2003.
  - Zettair as the IR core
- Training: 190,000 queries. Test: 800,000 queries.
- We created 16 + 1 document clusters and 128 query clusters.
- Model tested on the successive week (the fourth week).
- Metrics used:
  - Intersection: percentage of relevant results returned using only k servers out of 16+1 (from [Puppin et al., 2006]).
  - Competitive similarity: percentage of relevance score obtained using only k servers out of 16+1 (adapted from [Chierichetti et al., 2007]).
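
A minimal sketch of the two metrics follows. It assumes that the relevant results for a query are the top N results returned when all servers are polled, and that each server returns (document id, score) pairs with globally comparable scores. Function names and data are hypothetical.

```python
def merge_top(server_results, selected, n):
    """Merge the results of the selected servers and keep the top n."""
    merged = [hit for s in selected for hit in server_results[s]]
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))[:n]


def intersection_at_n(reference, partial, n):
    """Number of the top-n reference results also present in the partial top n.

    Dividing by n gives the percentage form.
    """
    ref_docs = {doc for doc, _ in reference[:n]}
    got_docs = {doc for doc, _ in partial[:n]}
    return len(ref_docs & got_docs)


def competitive_similarity_at_n(reference, partial, n):
    """Fraction of the reference top-n relevance score that was retrieved."""
    ref_score = sum(score for _, score in reference[:n])
    got_score = sum(score for _, score in partial[:n])
    return got_score / ref_score if ref_score else 0.0


server_results = {
    0: [("a", 5.0), ("d", 2.0)],
    1: [("b", 4.0)],
    2: [("c", 3.0)],
}
reference = merge_top(server_results, [0, 1, 2], 2)
one_server = merge_top(server_results, [0], 2)
```

Competitive similarity is more forgiving than intersection: a missed result that is replaced by one with a nearly equal score costs almost nothing.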

Quality Metrics

Very Effective Partitioning and Selection

Columns give the number of servers polled (out of 16+1).

CORI on Random Partitioning

Intersection at      1      2      4      8     16     17
 5                0.30   0.57   1.27   2.62   4.60   5.00
10                0.59   1.16   2.55   5.00   9.30  10.00
20                1.20   2.49   5.04   9.77  18.71  20.00

CORI on QV Partitioning

Intersection at      1      2      4      8     16     17
 5                1.55   2.29   3.01   3.83   4.89   5.00
10                3.05   4.48   5.92   7.62   9.77  10.00
20                5.97   8.77  11.61  15.10  19.54  20.00

PCAP on QV Partitioning

Intersection at      1      2      4      8     16     17
 5                1.73   2.26   2.89   3.76   4.84   5.00
10                3.47   4.51   5.75   7.50   9.66  10.00
20                6.92   9.02  11.47  14.98  19.29  20.00

With random partitioning, CORI performs really badly: its intersection is almost equal to (relevant results) / (number of clusters), e.g. 5/17 ≈ 0.29 ≈ 0.3.

With one server polled:

- CORI on QV performs about 5.2 times better than CORI on random.
- PCAP on QV performs about 5.8 times better than CORI on random.
- PCAP on QV performs about 1.1 times better than CORI on QV.
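
The quoted ratios can be checked against the intersection-at-5 values for a single polled server reported in the tables. The snippet below is just that arithmetic. The variable names are ours.

```python
N_SERVERS = 17

# Intersection at 5 with one server polled, as reported in the tables.
at_5_one_server = {
    "CORI on random": 0.30,
    "CORI on QV": 1.55,
    "PCAP on QV": 1.73,
}

# Expected value when the single polled server is chosen blindly.
random_baseline = 5 / N_SERVERS

cori_qv_vs_random = at_5_one_server["CORI on QV"] / at_5_one_server["CORI on random"]
pcap_qv_vs_random = at_5_one_server["PCAP on QV"] / at_5_one_server["CORI on random"]
pcap_vs_cori_on_qv = at_5_one_server["PCAP on QV"] / at_5_one_server["CORI on QV"]

print(round(random_baseline, 2))      # 0.29
print(round(cori_qv_vs_random, 1))    # 5.2
print(round(pcap_qv_vs_random, 1))    # 5.8
print(round(pcap_vs_cori_on_qv, 1))   # 1.1
```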

Strengths

- Popular queries drive the distribution
- Low-dimensional space to represent documents
- More efficient collection representation
- QV may be built while answering queries

Weaknesses

- Dependent on the training set
  - Actually… NOT!
- Cannot manage new query terms
  - Very small fraction, CORI does not help
  - Incremental caching can help
- Collection selection depends on the assignment
  - But addition does not break performance

Issues with Load Distribution

Load Balancing

[Figure: Peak Load on Each IR Core. x-axis: Core ID (1 to 16); y-axis: Peak Load (0 to 300).]

Still the maximum load is ~ 25% of the maximum capacity available at each IR Core

Load is measured as the maximum number of queries answered by each IR core within a sliding query window of 1000 queries.
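
The load measure described above can be computed in a single pass over the query stream. The sketch below assumes the stream is given as one collection of core ids per query, namely the cores that answered it. The function name and the input layout are hypothetical.

```python
from collections import deque


def peak_load(assignments, n_cores, window=1000):
    """Peak number of queries each core answered within any sliding window.

    assignments: iterable with one entry per query, each entry being the
    ids of the cores that answered that query (assumed input format).
    """
    counts = [0] * n_cores   # queries answered in the current window
    peaks = [0] * n_cores
    recent = deque()         # core sets of the queries inside the window
    for cores in assignments:
        cores = tuple(cores)
        recent.append(cores)
        for c in cores:
            counts[c] += 1
        if len(recent) > window:
            for c in recent.popleft():
                counts[c] -= 1
        # A count can only grow for cores hit by the current query.
        for c in cores:
            peaks[c] = max(peaks[c], counts[c])
    return peaks


peaks = peak_load([[0], [0, 1], [1], [0]], n_cores=2, window=2)
```

With a window of 1000 queries, a core answering every query would reach a peak of 1000, which gives a natural upper bound for the measured values.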

Load Balancing Strategies

- Load-driven basic <L>
  - Servers are ranked according to their relevance, using a collection selection function. The first gets priority 1, then linearly down to 1/17.
  - Every server i has to answer if: L(i) < p(i) * L
- Load-driven boost <L,T>
  - Priority is 1 for the first T servers, then linearly down to 1/(17-T)
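
A sketch of the two priority schemes and of the answering rule L(i) < p(i) * L follows. The exact linear interpolation is an assumption. In particular, for the boost variant the first server after the top T is given priority (17-T)/(17-T) = 1, which is one way to end at 1/(17-T) in equal steps. All function names are hypothetical.

```python
def basic_priorities(n_servers):
    """Priority 1 for the top-ranked server, decreasing linearly to 1/n."""
    return [(n_servers - rank) / n_servers for rank in range(n_servers)]


def boost_priorities(n_servers, t):
    """Priority 1 for the first t servers, then linearly down to 1/(n - t)."""
    rest = n_servers - t
    return [1.0] * t + [(rest - rank) / rest for rank in range(rest)]


def servers_to_poll(ranking, loads, priorities, load_threshold):
    """Apply the rule L(i) < p(i) * L to every server.

    ranking: server ids ordered from most to least relevant for the query.
    loads: current load of each server, indexed by server id.
    priorities: priority for each rank position.
    """
    return [
        server
        for rank, server in enumerate(ranking)
        if loads[server] < priorities[rank] * load_threshold
    ]


polled = servers_to_poll(
    ranking=[2, 0, 3, 1],
    loads=[10, 0, 6, 3],
    priorities=basic_priorities(4),
    load_threshold=8,
)
```

In the example, server 0 is ranked second but is skipped because its load (10) exceeds its allowance (0.75 * 8), while the lightly loaded low-ranked servers still answer.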

Experimental Settings (2)

- The broker models the load in the cores as the number of queries served out of the last W queries
- Assumption: cost = 1, for each query and collection
  - We will change this
- We count the number of relevant results we can get by polling the servers, up to the chosen load threshold

Load Balancing Results

[Figure: Peak Load on Each IR Core. x-axis: Core ID (1 to 16); y-axis: Peak Load (0 to 300). Series: FIXED 4, BASIC 24.7, BOOST 4 24.7.]

Intersection (# of relevant results retrieved)

Top N    FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
 5          3.10          3.40             3.55
10          6.00          6.80             7.00
20         12.20         13.60            14.00

Competitive Similarity (% of rank score retrieved)

Top N    FIXED 4   BASIC<24.7>   BOOST<4, 24.7>
 5          0.88          0.91             0.92
10          0.87          0.90             0.90
20          0.85          0.89             0.90

Caching and Collection Selection

Interaction with a Cache

- Result caching is commonly used in WSEs [Baeza-Yates et al., 2007a; Baeza-Yates et al., 2007b].
- Caching has the effect of reshaping the power law underlying the query distribution [Baeza-Yates et al., 2007a].
- We designed a novel caching strategy (i.e. Incremental Caching) integrated with collection selection

Incremental Caching

[Figure: an incremental cache placed in front of IR Cores 1 to 4. The same query Q recurs in the query stream; the figure marks the servers polled (X) and the results returned.]

An incremental cache is effective both at load reduction, and at improving result quality.
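
One way to read the idea is that a cache entry stores both the results and the set of servers already polled for that query. When the query recurs, only the servers not yet polled are contacted, and their results are merged into the entry. The sketch below follows that reading. The class, its interface, and the `poll` callback are hypothetical, and eviction is left out.

```python
class IncrementalCache:
    """Result cache that remembers which servers contributed to each entry."""

    def __init__(self, results_per_query=10):
        self.results_per_query = results_per_query
        self.entries = {}  # query -> (doc_id -> score, set of polled servers)

    def answer(self, query, servers, poll):
        """Answer a query, polling only servers that have not contributed yet.

        servers: the servers the broker is willing to poll for this request.
        poll(server, query) -> list of (doc_id, score) pairs.
        """
        results, polled = self.entries.get(query, ({}, set()))
        for server in servers:
            if server in polled:
                continue  # its results are already merged in the entry
            for doc_id, score in poll(server, query):
                results[doc_id] = max(score, results.get(doc_id, score))
            polled.add(server)
        ranked = sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked = ranked[: self.results_per_query]
        self.entries[query] = (dict(ranked), polled)
        return ranked, set(polled)
```

Each repetition of a query therefore costs at most the servers it has not reached before, and the cached result list can only improve as more servers contribute.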

Incremental Caching Results

Intersection (like P@N - # of relevant results retrieved)

Top N    BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
 5              3.40             3.55          4.00
10              6.80             7.00          7.80
20             13.60            14.00         15.60

Competitive Similarity (% of rank score retrieved)

Top N    BASIC<24.7>   BOOST<4, 24.7>   INCREMENTAL
 5              0.91             0.92          0.94
10              0.90             0.90          0.93
20              0.89             0.90          0.93

[Figure: Peak Load on Each IR Core. x-axis: Core ID (1 to 16); y-axis: Peak Load (0 to 300). Series: FIXED 4, BASIC 24.7, BOOST 4 24.7, BOOST 4 24.7 + INC.]

Refined Cost Model and Prioritization

Collection Prioritization

- We move the load control from the broker to the cores
- The broker broadcasts the query, and sends info about the relative rank of each core (the priority)
- Each core serves the query if L(i) < p(i) * L
  - L(i) = sum of the computing cost (timing) of served queries
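
The core-side decision can be sketched as below. The slide does not say over which period L(i) is summed, so the sketch assumes a window of the last W broadcast queries, with skipped queries contributing zero cost. The class name, the window, and the `run_query` callback (which returns the measured cost) are hypothetical.

```python
from collections import deque


class Core:
    """IR core that decides locally whether to serve a broadcast query."""

    def __init__(self, window=1000):
        # Cost of each of the last `window` broadcast queries (0.0 if skipped).
        self.costs = deque(maxlen=window)

    def load(self):
        """L(i): total computing cost of the queries served in the window."""
        return sum(self.costs)

    def handle(self, priority, load_threshold, run_query):
        """Serve the query if L(i) < p(i) * L, otherwise skip it.

        run_query() executes the query on the local index and returns its
        measured cost, e.g. the elapsed time.
        """
        if self.load() < priority * load_threshold:
            self.costs.append(run_query())
            return True
        self.costs.append(0.0)
        return False


core = Core(window=3)
decisions = [
    core.handle(priority, 5.0, lambda: 2.0)
    for priority in (1.0, 1.0, 1.0, 1.0, 0.5, 0.5)
]
```

Because each core compares its own measured load with the priority it receives, the broker no longer needs to keep a model of the load of every core.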

Extended Tests

- We actually partitioned the documents onto different servers
- We indexed locally, and we measured the timing of each query
- The actual timing is used to compute the load and drive the system
- Load cap is the AVERAGE load
  - The peak can vary heavily!

…the bill, please!

Conclusions

- We presented an architecture for a distributed search engine, based on collection selection
- The load-driven strategy and the incremental caching can retrieve very high quality results, with reduced load
- Verified with an extensive simulation

Impact and Benefits

- If a given precision is expected, we can use FEWER servers
- With a given number of servers, we get HIGHER precision
- Confirmed with different metrics
- Smaller load for the IR system, with more focus on top results
- Nice trade-off of cost vs. quality

Impact and Benefits (2)

- Load-driven routing can be used
  - to absorb query peaks
  - to offer higher/lower quality results to selected users
- Consistent ranking due to local indexing
- Incremental caching can be used to reduce the negative effects of selection

Furthermore

- Caching posting lists is very effective on local indices
- Simple way to add new documents
- Incremental caching could help with impact-ordered posting lists
- Caching could be based on line value (query frequency, number of polled servers)

Future Work

- Comparison with other results in clustering (k-means, link-based, P2P, LSI, SVD)
- Test on a large-scale, real-world search engine
- Real-world implementation at Google
- TOIS paper to wrap up

References

[Puppin et al., 2006] Diego Puppin, Fabrizio Silvestri, Domenico Laforenza. “Query-Driven Document Partitioning and Collection Selection”. Invited Paper. Proceedings of INFOSCALE ‘06.

[Puppin & Silvestri, 2006] Diego Puppin, Fabrizio Silvestri. “The Query-Vector Document Model”. Proceedings of CIKM ‘06.

[Puppin et al., 2007] Diego Puppin, Ricardo Baeza-Yates, Raffaele Perego, Fabrizio Silvestri. “Incremental Caching for Collection Selection Architectures”. Proceedings of INFOSCALE ‘07.

References

[Baeza-Yates et al., 2007a] Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras, Fabrizio Silvestri. “Challenges in Distributed Information Retrieval”. Invited Paper. Proceedings of ICDE 2007.

[Chierichetti et al., 2007] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal. “Finding Near Neighbors Through Cluster Pruning”. Proceedings of PODS 2007.

[Baeza-Yates et al., 2007b] Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vassilis Plachouras, Fabrizio Silvestri. “The Impact of Caching on Search Engines”. Proceedings of SIGIR 2007.

References

[Dhillon et al., 2003] I. S. Dhillon, S. Mallela, D. S. Modha. “Information-Theoretic Co-Clustering”. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003).

Backup Slides

Adding Documents

Adding Documents

- It is important to assign new documents to the fittest clusters
  - New versions, new pages, etc.
- The new documents will be found along with the previously assigned documents
- Hopefully the collection selection will find them with similar docs

A Modest Proposal

- The body of the new document is used as a query for the PCAP selection
- The body is compared to the query clusters
- We find a similarity between the document body and each query cluster
- We use PCAP to rank the document collections

Implementation

- The first 1000 bytes of the (stripped) document body are used
- The new doc is assigned to the doc. cluster with the top PCAP score
- New docs are locally indexed
- No need to re-train / re-assign
- New docs have consistent score and ranking
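
The assignment step can be sketched as follows, under several assumptions. "Stripped" is taken to mean that markup is removed, here with a crude regular expression. The PCAP score is approximated as term overlap with each query-cluster vocabulary, weighted by a hypothetical query-cluster to document-cluster matrix `weights`. None of the names below come from the slides.

```python
import re
from collections import Counter


def body_as_query(raw_body, max_bytes=1000):
    """Strip markup and return the terms in the first max_bytes of the body."""
    text = re.sub(r"<[^>]+>", " ", raw_body)
    prefix = text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
    return re.findall(r"\w+", prefix.lower())


def assign_document(raw_body, vocabularies, weights):
    """Return the index of the document cluster with the top score.

    vocabularies: one Counter of query terms per query cluster.
    weights[i][j]: assumed association between query cluster i and
    document cluster j.
    """
    terms = body_as_query(raw_body)
    qc = [sum(vocab[t] for t in terms) for vocab in vocabularies]
    scores = [
        sum(qc[i] * weights[i][j] for i in range(len(qc)))
        for j in range(len(weights[0]))
    ]
    # Highest score wins; ties go to the lowest cluster index.
    return max(range(len(scores)), key=lambda j: (scores[j], -j))


vocabularies = [
    Counter({"football": 3, "score": 1}),
    Counter({"recipe": 2, "pasta": 2}),
]
weights = [[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]]
page = "<html><body>Fresh pasta recipe with tomato</body></html>"
cluster = assign_document(page, vocabularies, weights)
```

The new page lands in the cluster that the same selection function would pick for queries about its content, so no re-training of the co-clustering is needed.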

Test Configurations
