MINERVA Infinity: A Scalable Efficient Peer-to-Peer
Search Engine
Middleware 2005 Grenoble, France
Sebastian MichelMax-Planck-Institut für Informatik
Saarbrücken, [email protected]
Peter TriantafillouUniversity of Patras
Rio, [email protected]
Gerhard WeikumMax-Planck-Institut für Informatik
Saarbrücken, [email protected]
MINERVA Infinity: A Scalable Efficient P2P Search Engine 2
Vision
• P2P approach best suitable– large number of peers– exploit mostly idle resources– intellectual input of user community
• Today: Web Search is dominated by centralized engines (“to google”)
- censorship?- single point of attack/abuse- coverage of the web?
• Ultimate goal: “Distributed Google” to break information monopolies
MINERVA Infinity: A Scalable Efficient P2P Search Engine 3
Challenges
• large scale networks – 100,000 to 10,000,000 users
• large collections> 10^10 documents– 1,000,000 terms
• high dynamics
MINERVA Infinity: A Scalable Efficient P2P Search Engine 4
Questions
• Network Organization– structured?– hierarchical?– unstructured?
• Data Placement– move data around?– data remains at the owner?
• Scalability?
• Query Routing/Execution– Routing indexes?– Message flooding?
MINERVA Infinity: A Scalable Efficient P2P Search Engine 5
Overview
• Motivation (Vision/Challenges/Questions)• Introduction to IR and P2P Systems• P2P- IR• Minerva Infinity• Network Organization• Data Placement• Query Processing• Data Replication• Experiments• Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine 6
Information Retrieval Basics
Document Terms
5 x
7 x
4 x
# of terms(term frequency)
MINERVA Infinity: A Scalable Efficient P2P Search Engine 7
Information Retrieval Basics (2)
index lists with(DocId: tf*idf)sorted by Score
B+ tree on terms
Query Execution: Usually using some kind of threshold algorithm*: - sequential scans over the index lists (round-robin) - (random accesses to fetch missing scores) - aggregate scores - stop when the threshold is reached
Top-k Query Processing: find k documents with the highest total score
e.g. Fagin’s algorithmTA or a variant without random accesses
d17: 0.3d44: 0.4
...d52: 0.1
d53: 0.8d55: 0.6 d12: 0.5
d14: 0.4
...
d28: 0.1
d51: 0.6
d52: 0.3
d28: 0.7
...
d17: 0.1
d44: 0.2
d11: 0.6
MINERVA Infinity: A Scalable Efficient P2P Search Engine 8
P2P Systems
• Peer: – “one that is of equal standing with another”
(source: Merriam-Webster Online Dictionary )
• Benefits:– no single point of failure– resource/data sharing
• Problems/Challenges:– authority/trust/incentives– high dynamics– …
• Applications:– File Sharing– IP Telephony– Web Search– Digital Libraries
MINERVA Infinity: A Scalable Efficient P2P Search Engine 9
Structured P2P Systems based on Distributed Hash Tables (DHTs)
• “structured” P2P networks• provide one simple method:
lookup:key->peer
• CAN [SIGCOMM 2001]• CHORD [SIGCOMM 2001]• Pastry [Middleware 2001]• P-Grid [CoopIS 2001]
robustness to load skew,failures, dynamics
MINERVA Infinity: A Scalable Efficient P2P Search Engine 10
p1
p8
p14
p21
p32p38
p42
p48
p51
p56
Chord
• Peers and keys are mapped to the same cyclic ID space using a hash function
• Key k (e.g., hash(file name))is assigned to the node withkey p (e.g., hash(IP address))such that k p and there isno node p‘ with k p‘ and p‘<p
k10
k24
k30k38
k54
MINERVA Infinity: A Scalable Efficient P2P Search Engine 11
Chord (2)
• Using finger tables to speed up lookup process
• Store pointers to few distant peers
• Lookup in O(log n) steps
Chord Ring
p1
p8
p56
p51
p42
p38 p32p21
p14
k54
Lookup(54)
MINERVA Infinity: A Scalable Efficient P2P Search Engine 12
Overview
• Motivation (Vision/Challenges/Questions)• Introduction to IR and P2P Systems• P2P- IR• Minerva Infinity• Network Organization• Data Placement• Query Processing• Data Replication• Experiments• Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine 13
P2P - IR
• Share documents (e.g. Web pages) in an efficient and scalable way
• Ranked retrieval– simple DHT is insufficient
MINERVA Infinity: A Scalable Efficient P2P Search Engine 14
Possible Approaches
• Each peer is responsible for storing the COMPLETE index list for a subset of terms.
p1
p8
p14
p21
p32p38
p42
p48
p51
p56Query Routing: DHT lookupsQuery Execution: Distributed Top-k
[TPUT ’04, KLEE ‘05]
capacity overload of peers withhighly frequent / popular terms(data load AND query load)
MINERVA Infinity: A Scalable Efficient P2P Search Engine 15
Possible Approaches (2)
• Each peer has its own local index (e.g., created by web crawls)
Query Routing: 1. DHT lookups2. Retrieve Metadata3. Find most promising peers
Query Execution: - Send the complete Query and merge the incoming results
capacity overload of peers with - highly frequent terms - high-quality collections
a: P1 P6 P4
b: P5 P3 P1 P6 ...
Distributed DirectoryTerm List of Peers
P1
P5
P6 P4
P2
P3
MINERVA Infinity: A Scalable Efficient P2P Search Engine 16
Overview
• Motivation (Vision/Challenges/Questions)• Introduction to IR and P2P Systems• P2P- IR• Minerva Infinity• Network Organization• Data Placement• Query Processing• Data Replication• Experiments• Conclusion
MINERVA Infinity: A Scalable Efficient P2P Search Engine 17
Minerva Infinity
• Idea:– assign (term, docId, score)
triplets to the peers • order preserving
• load balancing
– hash(score)+
hash(term) as offset– guarantee 100% recall
high
low
high
low
high
low
MINERVA Infinity: A Scalable Efficient P2P Search Engine 18
Hash Function
• Requirements:– Load balancing (to avoid overloading peers)– Order preserving (to make the QP work)
• One without the other is trivial ...– Load balancing: apply a pseudo random hash function– Order preserving:
• Both together is challenging …
S-Smin----------------- * NSmax - Smin
MINERVA Infinity: A Scalable Efficient P2P Search Engine 19
Hash Function (2)
• Assume an exponential score distribution• Place the first half of the data to the first peer• The next quarter to the next peer• and so on …
…
1
0
MINERVA Infinity: A Scalable Efficient P2P Search Engine 20
Term Index Networks (TINs)
• Reduce # of hops during QP by reducing the number of peers that maintain the index list for a particular term
Only a small subset of peers is used to store an index list.
GlobalNetwork
B
AC
2
7
12
15
162024
37
41
45
62
12
24
62
2
41
24
7
16
45
MINERVA Infinity: A Scalable Efficient P2P Search Engine 21
How to Create/Find a TIN
• Use u Beacon-Peers to bootstrap
the TIN for term T
Global Network
p = 1/uFor i=0 to i<n‘ do
id = hash(t, i*p)if (i>0) use hash(t,(i-1)*p)
as a gateway to the TINelse node with id creates the TIN
End for
T
Beacon nodes act as gateways to the TIN
MINERVA Infinity: A Scalable Efficient P2P Search Engine 22
Publish Data / Join a TIN
• Peer with id = hash(t, score) not in the TIN for term t
• Randomly select a beacon node(Beacon nodes act as gateways to the TIN)
• Call the join method
• Store the item (docId, t, score)
MINERVA Infinity: A Scalable Efficient P2P Search Engine 23
Query Processing
Data Peers Coordinator
1 1
2-keyword Query
Alternative: Collect data and send in one batch.
MINERVA Infinity: A Scalable Efficient P2P Search Engine 24
QP with Moving Coordinator
Data Peers Coordinator
1 1 1
3-keyword Query
MINERVA Infinity: A Scalable Efficient P2P Search Engine 25
Data Replication
• Vertical: Replicate data inside a TIN via a ‘reverse’ communication.
• Horizontal: Replicate complete TINs
123
1 2 31 2 31 2 3
B34
150
B
B20
864
B C12
62
28
C12
62
A49 5
19
A
A41 7
16
A41 7
16B
24
245
B24
245
C7
55
22
CB46
1157
B
C11
64
24
C24
A1
16
A
31
MINERVA Infinity: A Scalable Efficient P2P Search Engine 26
Experiments
Test bed:10,000 peers
Benchmarks:• GOV: TREC .GOV collection + 50 TREC-2003 Web
queries, e.g. juvenile delinquency
• XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention
• SCALABILITY: One query executed multiple times ……….
MINERVA Infinity: A Scalable Efficient P2P Search Engine 27
Experiments: Metrics
Metrics• Network traffic (in KB)
• Query response time (in s)- network cost (150ms RTT,
800Kb/s data transfer rate)
- local I/O cost (8ms rotation latency
+ 8MB/s transfer delay)
- processing cost
• Number of Hops
MINERVA Infinity: A Scalable Efficient P2P Search Engine 28
Scalability Experiment
• Measure time for a different query loads.
– identical queries– inserted into a queue
100
1000
10000
100000
1000000
10000000
1 10 100 1000 10000
Query Load: Queue Size
To
tal
Exe
cuti
on
Tim
e in
Sec
on
ds
Minerva Infinity
no parallelprocessing
MINERVA Infinity: A Scalable Efficient P2P Search Engine 29
Experiments: Results
GOV
0
200
400
600
800
1000
1200
2 3 4
Number of Query Terms
To
tal T
ime
in S
eco
nd
s
GOV
0.00
10000.00
20000.00
30000.00
40000.00
50000.00
60000.00
2 3 4
Number of Query Terms
To
tal B
and
wid
th in
KB
MINERVA Infinity: A Scalable Efficient P2P Search Engine 30
Conclusion
• Novel architecture for P2P web search.
• High level of distribution both in data and processing.
• Novel algorithms to create the networks, place data, and execute queries.
• Support of two different data replication strategies.
MINERVA Infinity: A Scalable Efficient P2P Search Engine 31
Future Work
• Support of different score distributions
• Adapt TIN sizes to the actual load
• Different top-k query processing algorithms
MINERVA Infinity: A Scalable Efficient P2P Search Engine 32
Thank you for your attention