Structured overlays: Self-organization and Scalability @ SASO 2009 1
Structured Overlays: Self-organization and Scalability
by
Anwitaman Datta – Nanyang Technological University, Singapore – [email protected]
Ali Ghodsi – UC Berkeley, USA
The P2P paradigm: a brief introduction
Part I
Outline
• The P2P paradigm: history and philosophy
• P2P in the realm of distributed systems
  – Concepts: decentralization, self-organization, overlays
• The resource location problem at large
  – Structured overlay networks
  – Unstructured overlay networks
P2P is more than just Pirate-to-Pirate file-sharing & distributing illegal copies!
The P2P paradigm
Sharing resources in large-scale networks
[Figure: resources shared (knowledge, bandwidth, storage, processing, content) by networks of Homo sapiens]
The P2P paradigm: Application Perspective
• Centralized solutions are undesirable or unattainable
• Exploit resources at the edge
  – No dedicated infrastructure/servers
  – Peers act as both clients and servers ("servents")
• Autonomous participants
  – Large scale
  – Dynamic system and workload
  – Sources of unpredictability, e.g., correlated failures
• Lack of global control or knowledge
  – Rely on self-organization
The P2P paradigm: Systems Perspective
• So where does P2P fit in the realm of distributed systems?
A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent.
[http://foldoc.org/index.cgi?distributed+system]
– In its loosest sense, a distributed system is any system with several nodes and a network between them
P2P is just distributed systems
Acknowledgement: The following discussion on how p2p paradigm fits in the realm of distributed systems is inspired by J. Kangasharju’s take on the same issue.
• The definition (representing the traditional view of distributed systems) implies a managed and controlled entity that acts as a single, logical system
  – Often it also relies on dedicated infrastructure
• In contrast, P2P is decentralized and is neither controlled nor managed. P2P uses individually unreliable, autonomous participants and generally relies on self-organization.
  – Still, ideally, the system should provide some overall reliability guarantees
P2P in the realm of distributed systems
• Grid
  – "Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations." (Ian Foster)
• Note that a Grid is generally centralized
P2P in the realm of distributed systems
• Ad-hoc networks
  – "A wireless ad hoc network is a decentralized wireless network. The network is ad hoc because each node is willing to forward data for other nodes, and so the determination of which nodes forward data is made dynamically based on the network connectivity. This is in contrast to wired networks, in which routers perform the task of routing. It is also in contrast to managed (infrastructure) wireless networks, in which a special node known as an access point manages communication among the other nodes."
  – Can be seen as a "kind of" peer-to-peer network
    • Though often very different research communities are involved, and the focus of problems and functionality is also very different
P2P in the realm of distributed systems
Self-organization
• Self-organizing systems are common in nature
  – Physics, biology, ecology, economics, sociology, cybernetics
  – Microscopic (local) interactions
  – Limited information, individual decisions
• Distribution of control => decentralization
  – Symmetry in roles: peer-to-peer
• Emergence of macroscopic (global) properties
  – Resilience
    • Fault tolerance as well as recovery
  – Adaptivity
Resource discovery in the large - Structured overlay basics
Part II
Structured overlays/Distributed hash tables
what it is
What’s a Distributed Hash Table?
• An ordinary hash table … which is distributed

  Key        Value
  Anwitaman  Singapore
  Ali        Berkeley
  Alberto    Trento
  Kurt       Kassel
  Ozalp      Bologna
  Randy      Berkeley

• Every node provides a lookup operation
  – Given a key: return the associated value
• Nodes keep routing pointers
  – If an item is not found locally, route to another node
Why’s that interesting?
• Characteristic properties
  – Self-management in the presence of joins/leaves/failures
    • Routing information
    • Data items
  – Scalability
    • The number of nodes can be huge
    • The number of items can be huge
short interlude
applications
Name-based communication pattern
• Map node names to locations
  – Can store all kinds of contact information
    • Mediator peers for NAT hole punching
    • Profile information
• Used this way by:
  – Host Identity Protocol (HIP)
  – P2P Session Initiation Protocol (P2PSIP)
  – Wuala
  – Internet Indirection Infrastructure (i3)
  Key      Value
  anwita   130.237.32.51
  ali      193.10.64.99
  alberto  18.7.22.83
  ozalp    128.178.50.12
  …        …

[Figure: the table partitioned across nodes A-D]
Global File System
• Similar to a DFS (e.g., NFS, AFS)
  – But files/metadata are stored in the directory
  – E.g., Wuala, WheelFS, …
• What is new?
  – Application logic is self-managed
    • Add/remove servers on the fly
    • Automatic failure handling
    • Automatic load-balancing
  – No manual configuration for these operations
  Key       Value
  /home/... 130.237.32.51
  /usr/…    193.10.64.99
  /boot/…   18.7.22.83
  /etc/…    128.178.50.12
  …         …

[Figure: the table partitioned across nodes A-D]
P2P Proxy
• A distributed web proxy/cache
  – Every node in the LAN runs a DHT client
• Browsing for a page:
  – Check the DHT
    • If the page exists locally, download it from that peer
  – Otherwise, fetch and cache it
• Seamlessly add/remove workstations
  – No central servers
• Example:
  – Squirrel

  Key      Value
  www.s... 130.237.32.51
  www2…    193.10.64.99
  www3…    18.7.22.83
  cs.edu   128.178.50.12
  …        …

[Figure: the table partitioned across nodes A-D]
P2P Web Servers
• Distributed web server
  – Pages stored in the directory
• What is new?
  – Application logic is self-managed
    • Automatically load-balances
    • Add/remove servers on the fly
    • Automatically handles failures
• Example:
  – CoralCDN

  Key      Value
  www.s... 130.237.32.51
  www2     193.10.64.99
  www3     18.7.22.83
  cs.edu   128.178.50.12
  …        …

[Figure: the table partitioned across nodes A-D]
Access Layers for DHTs
• A relational view of the DHT (PIER)
  – Use SQL to fetch data
  – Standard operations (projection, selection, equi-join)

    select name, salary
    from emp, sal
    where emp.id = sal.f_id

• Approximate matching (CUBIT)
  – Get the k items with keys most similar to a given key

    get("arwitanam", 1) returns ("anwita" : "130")

[Figure: two DHT key-value tables (URL to address; name to salary) partitioned across nodes A-D]
towards DHT construction
consistent hashing
Hash tables
• Ordinary hash tables
  – put(key, value)
    • Store <key,value> in bucket hash(key) mod 7
  – get(key)
    • Fetch <key,v> such that <key,v> is in bucket hash(key) mod 7

Buckets: 0 1 2 3 4 5 6
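The put/get pair above can be sketched as a tiny Python hash table with 7 buckets (the bucket count and the key-value pairs are illustrative):

```python
# Minimal sketch of an ordinary hash table: <key, value> pairs live in
# bucket hash(key) mod NUM_BUCKETS.
NUM_BUCKETS = 7
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    b = buckets[hash(key) % NUM_BUCKETS]
    for i, (k, _) in enumerate(b):
        if k == key:              # overwrite an existing key
            b[i] = (key, value)
            return
    b.append((key, value))

def get(key):
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

put("Anwitaman", "Singapore")
put("Ali", "Berkeley")
```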
DHT by mimicking hash tables
• Let each bucket be a server
  – n servers means n buckets
• Problem
  – How do we remove or add buckets?
  – A single bucket change requires re-shuffling a large fraction of the items
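A quick sketch of the re-shuffling problem, using the key itself as a stand-in for its hash: going from n to n+1 buckets changes the bucket of almost every key.

```python
# With bucket = hash(key) mod n, adding one bucket moves nearly all keys.
n = 7
keys = range(10_000)
moved = sum(1 for k in keys if k % n != k % (n + 1))
fraction = moved / len(keys)     # roughly 7/8 of all keys change bucket
```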
Consistent Hashing Idea
• Logical name space, called the identifier space, consisting of identifiers {0, 1, 2, …, N-1}
• The identifier space is a logical ring modulo N
• Every node picks a random identifier
• Example:
  – Space N=16: {0, …, 15}
  – Five nodes a, b, c, d, e
    • a picks 6
    • b picks 5
    • c picks 0
    • d picks 5
    • e picks 2
[Figure: identifier ring 0-15 with the chosen node positions marked]
Definition of Successor
• The successor of an identifier is the first node met going in the clockwise direction, starting at the identifier
• Example
  – succ(12) = 14
  – succ(15) = 2
  – succ(6) = 6
[Figure: identifier ring 0-15; successors are found clockwise]
Where to store items?
• Use a globally known hash function H
• Each item <key, value> gets the identifier H(key)
• Store the item at the successor of H(key)
  – Terminology: that node is responsible for the item
• Example
  – H("Anwitaman") = 12
  – H("Ali") = 2
  – H("Alberto") = 9
  – H("Ozalp") = 14
[Figure: ring with each item stored at the successor of its hashed key]

  Key        Value
  Anwitaman  Singapore
  Ali        Berkeley
  Alberto    Trento
  Kurt       Kassel
  Ozalp      Bologna
Consistent hashing: summary
• Scalable
  – Each node stores on average D/n items (for D items and n nodes in total)
  – Only about D/n items are reshuffled on every join/leave/failure
• Everybody knows everybody
  – Akamai works this way
  – Amazon Dynamo too
• Load balancing
  – W.h.p. an O(log n) imbalance
  – Eliminate the imbalance by having each server "simulate" log(n) random virtual buckets
towards dht construction
reducing neighbors
Structured overlays: Self-organization and Scalability @ SASO 2009 30
PP
2
Where to point (Chord)?
• Each node points to its successor
  – The successor of a node p is succ(p+1)
  – Known as the node's succ pointer
• Each node points to its predecessor
  – The first node met in the anti-clockwise direction, starting at p-1
  – Known as the node's pred pointer
• Example
  – 0's successor is succ(1) = 2
  – 2's successor is succ(3) = 5
  – 5's successor is succ(6) = 6
  – 6's successor is succ(7) = 11
  – 11's successor is succ(12) = 0
DHT Lookup
• To look up a key k
  – Calculate H(k)
  – Follow succ pointers until the node responsible for k is found
• Example: look up "Alberto" at node 2
  – H("Alberto") = 9
  – Traverse nodes 2, 5, 6, 11 (BINGO)
  – Return "Trento" to the initiator
Key Value
Anwitaman Singapore
Ali Berkeley
Alberto Trento
Kurt Kassel
Ozalp Bologna
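The lookup above can be traced directly (a sketch; the node set {0, 2, 5, 6, 11} and H("Alberto") = 9 are taken from the example):

```python
# Linear DHT lookup: follow succ pointers until the responsible node.
N = 16
nodes = sorted([0, 2, 5, 6, 11])
succ_of = {p: nodes[(i + 1) % len(nodes)] for i, p in enumerate(nodes)}
pred_of = {p: nodes[i - 1] for i, p in enumerate(nodes)}

def responsible(node, key):
    """node owns key iff key lies in (pred(node), node] on the ring."""
    lo = pred_of[node]
    return 0 < (key - lo) % N <= (node - lo) % N

def lookup(start, key):
    node, path = start, [start]
    while not responsible(node, key):
        node = succ_of[node]
        path.append(node)
    return node, path
```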
towards dht construction
handling joins/leaves/failures
Dealing with failures
• Each node keeps a successor-list
  – Pointers to the f closest successors
    • succ(p+1)
    • succ(succ(p+1)+1)
    • succ(succ(succ(p+1)+1)+1)
    • ...
• Rule: if the successor fails
  – Replace it with the closest alive successor
• Rule: if the predecessor fails
  – Set pred to nil
• Set f = log(n)
  – With failure probability 0.5, w.h.p. not all nodes in the list fail: (1/2)^log(n) = 1/n
Handling Dynamism
• Periodic stabilization used to make pointers eventually correct
– Try pointing succ to closest alive successor
– Try pointing pred to closest alive predecessor
Periodically at node p:
  1. set v := succ.pred
  2. if v ≠ nil and v is in (p, succ]
  3.   set succ := v
  4. send a notify(p) to succ

When receiving notify(q) at node p:
  1. if pred = nil or q is in (pred, p]
  2.   set pred := q
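The two pseudocode rules above can be turned into a runnable sketch. The node ids 0, 5, 10 and the joining node 7 are illustrative: the new node sets only its succ pointer, and a few rounds of stabilize/notify repair all succ and pred pointers.

```python
# Chord-style periodic stabilization on a ring of size N = 16.
N = 16

class Node:
    def __init__(self, ident):
        self.id, self.succ, self.pred = ident, None, None

def between(x, lo, hi):
    """True iff x lies in the ring interval (lo, hi] modulo N."""
    return 0 < (x - lo) % N <= (hi - lo) % N

def stabilize(p):
    v = p.succ.pred
    if v is not None and between(v.id, p.id, p.succ.id):
        p.succ = v
    notify(p.succ, p)

def notify(p, q):                 # node p receives notify(q)
    if p.pred is None or between(q.id, p.pred.id, p.id):
        p.pred = q

a, b, c = Node(0), Node(5), Node(10)
a.succ, b.succ, c.succ = b, c, a
a.pred, b.pred, c.pred = c, a, b

d = Node(7)                       # joining node: succ := lookup(7) = node 10
d.succ = c
for _ in range(4):                # a few periodic rounds
    for n in (a, b, c, d):
        stabilize(n)
```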
Structured overlays: Self-organization and Scalability @ SASO 2009 35
PP
2
Handling joins
• When a new node n joins
  – Find n's successor with lookup(n)
  – Set n's succ to that successor
  – Stabilization fixes the rest
Structured overlays: Self-organization and Scalability @ SASO 2009 36
PP
2
Handling leaves
• When n leaves– Just dissappear (like failure)
• When pred detected failed– Set pred to nil
• When succ detected failed– Set succ to closest alive in successor list
Speeding up lookups with fingers
• If only the succ(p+1) pointer is used
  – Worst-case lookup time is n hops, for n nodes
• Improving lookup time (binary search):
  – Point to succ(p+1)
  – Point to succ(p+2)
  – Point to succ(p+4)
  – Point to succ(p+8)
  – …
  – Point to succ(p+2^((log N)-1))
• The distance to the destination is always halved: log hops
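A sketch of finger-based routing on the running example (N = 16, nodes {0, 2, 5, 6, 11}), using a greedy "farthest finger that does not overshoot the key" rule; the exact forwarding rule varies slightly between Chord variants.

```python
import bisect

N = 16
nodes = sorted([0, 2, 5, 6, 11])

def succ(x):
    i = bisect.bisect_left(nodes, x % N)
    return nodes[i % len(nodes)]

def fingers(p):                    # succ(p + 2^i) for i = 0 .. log2(N)-1
    return [succ(p + 2**i) for i in range(4)]

def between(x, lo, hi):            # x in (lo, hi] mod N
    return 0 < (x - lo) % N <= (hi - lo) % N

def lookup(p, key):
    path, target = [p], succ(key)  # target is responsible for key
    while p != target:
        cand = [f for f in fingers(p) if between(f, p, key)]
        nxt = max(cand, key=lambda f: (f - p) % N) if cand else succ(p + 1)
        path.append(nxt)
        p = nxt
    return path
```

With fingers, the lookup of key 9 starting at node 2 takes 2 hops (2, 6, 11) instead of the 3 hops of the purely successor-based traversal.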
Handling dynamism of fingers and successor-list
• Node p periodically:
  – Updates fingers
    • Looks up p+2^1, p+2^2, p+2^3, …, p+2^((log N)-1)
  – Updates the successor-list
    • slist := trunc(succ · succ.slist)
Chord: Summary
• Lookup hops: logarithmic in n
  – Fast routing/lookup, like in a dictionary
• Routing table size: logarithmic in n
  – Few nodes to ping
Reliable Routing
• Iterative lookup
  – Generally slow (handling NATs, firewalls)
  – Reliability easy to achieve
    • The initiator is in full control
• Recursive lookup
  – Generally fast (uses established links)
  – Several ways to achieve reliability
    • End-to-end timeouts
    • Any-node timeouts
      – Difficult to determine the timeout value
• Transitive lookup
  – Reliability: end-to-end timeouts
Replication of items
• Successor-list replication (Chord, Pastry)
  – Idea: replicate on nodes
    • If node p is responsible for a set of items K
    • Replicate K on p's immediate successors
• Symmetric replication (DKS)
  – Idea: replicate identifiers
    • Items with keys 0, 16, 32, 48 are equivalent
    • Whoever is responsible for 0 also stores 16, 32, 48
    • Whoever is responsible for 16 also stores 0, 32, 48
    • …
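Symmetric replication's equivalence classes can be computed directly. A sketch with identifier space N = 64 and replication factor f = 4, which reproduces the class {0, 16, 32, 48} above:

```python
# Symmetric replication: key k is equivalent to k + i*(N/f) (mod N),
# for i = 0 .. f-1, so every key belongs to a class of f keys.
N, f = 64, 4

def replica_keys(k):
    return sorted((k + i * N // f) % N for i in range(f))
```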
towards proximity awareness
Plaxton mesh (PRR), Pastry/Tapestry
Plaxton Mesh [PRR]
• Identifiers represented in radix/base k
  – We use k=16, hexadecimal radix
  – Ring size N is a large power of k, e.g., 16^40
Plaxton Mesh (2)
• Additional routing table on top of the ring
• Routing table construction, by example
  – Node 3a7f keeps the following routing table (the Kleene star * denotes a wildcard):

    Row 1: 0* 1* 2* self 4* 5* 6* 7* 8* 9* a* b* c* d* e* f*
    Row 2: 30* 31* 32* 33* 34* 35* 36* 37* 38* 39* self 3b* 3c* 3d* 3e* 3f*
    Row 3: 3a0* 3a1* 3a2* 3a3* 3a4* 3a5* 3a6* self 3a8* 3a9* 3aa* 3ab* 3ac* 3ad* 3ae* 3af*
    Row 4: 3a70* 3a71* 3a72* 3a73* 3a74* 3a75* 3a76* 3a77* 3a78* 3a79* 3a7a* 3a7b* 3a7c* 3a7d* 3a7e* self

  – Flexibility to choose proximate neighbors
• Invariant: the candidates for any entry in row i (all nodes matching that prefix) are interchangeable
Plaxton Routing
• To route from 1234 to abcd:
  1. 1234 uses rt row 1: jump to a*, e.g., a999
  2. a999 uses rt row 2: jump to ab*, e.g., ab11
  3. ab11 uses rt row 3: jump to abc*, e.g., abc0
  4. abc0 uses rt row 4: jump to abcd
• Routing terminates in log(N) hops
  – In practice log(n), where N is the identifier-space size and n is the number of nodes
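The four prefix-correcting hops can be simulated directly. A sketch over 4-digit hex identifiers; the intermediate node ids (a999, ab11, abc0) are the slide's illustrative choices:

```python
# Plaxton-style prefix routing: each hop reaches some node matching at
# least one more digit of the target identifier.
def prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def route(source, target, nodes):
    path, cur = [source], source
    while cur != target:
        p = prefix_len(cur, target)
        # any node matching a longer prefix will do; pick the first found
        cur = next(n for n in nodes if prefix_len(n, target) > p)
        path.append(cur)
    return path

nodes = ["1234", "a999", "ab11", "abc0", "abcd"]
```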
Pastry
• Leaf set
  – Successor-list in both directions
  – Periodically gossiped to all leafs, O(n²) [Bamboo]
• Plaxton mesh on top of the ring
  – Failures in the routing table
    • Get a replacement from any node on the same row
• Routing
  1. Route directly to the responsible node if it is in the leaf set, otherwise
  2. Route to a (prefix-)closer node, otherwise
  3. Route on the ring
Routing Table Initialization
• How does a new node initialize its routing table (RT)?
  – The new node looks up its own id
  – At step i, it copies row i of the node reached
    • Good if latencies are symmetric
  – Example: assume new node abcd knows 1234
    1. 1234 uses rt row 1: jump to a*, e.g., a999
    2. a999 uses rt row 2: jump to ab*, e.g., ab11
    3. ab11 uses rt row 3: jump to abc*, e.g., abc0
    4. abc0 uses rt row 4: jump to abcd
constant number of neighbors
De Bruijn graphs, Koorde
Even less routing info…
• How much routing state is necessary?
• Moore bound from graph theory
  – Assume each node has k neighbors
  – How many nodes (at most) are reachable in d hops?
    • 0 hops: 1
    • 1 hop: 1 + k
    • 2 hops: 1 + k + k(k-1)
    • 3 hops: 1 + k + k(k-1) + k(k-1)²
    • d hops: 1 + k·[(k-1)^0 + … + (k-1)^(d-1)] = 1 + k((k-1)^d - 1)/(k-2)
Moore bound
• Given k pointers per node
  – In d hops at most n nodes are reachable
  – n ≤ 1 + k((k-1)^d - 1)/(k-2)
  – Solving for d as a function of n:
    • d ≥ log_(k-1)[(n(k-2)+2)/k] ≈ log_k n
• In DHTs, each node has k = log(n) neighbors
  – d ≈ log_(log n) n = log n / log(log n)
• So, optimally, for n nodes with log(n) pointers we should reach everyone in log(n)/log(log n) hops
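The bound can be evaluated numerically (a sketch; min_hops just searches for the smallest d that satisfies the Moore bound, so it is the best possible diameter, not what any particular DHT achieves):

```python
# Moore bound: with k neighbors per node (k > 2), at most
# 1 + k((k-1)^d - 1)/(k-2) nodes are reachable within d hops.
def reachable(k, d):
    # integer division is exact here, since (k-1)^d ≡ 1 (mod k-2)
    return 1 + k * ((k - 1)**d - 1) // (k - 2)

def min_hops(k, n):
    d = 0
    while reachable(k, d) < n:
        d += 1
    return d
```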
Optimal Graphs
• De Bruijn graphs achieve these bounds
• Example, k=2:
  – Consider each node's identifier in binary
  – Each node i should know 2 neighbors:
    • 2i (mod N)
    • 2i+1 (mod N)
• Example:
  – Node 011011 knows 110110 and 110111
Routing in De Bruijn Graphs
• Example
  – k=2, n=2³=8
• Routing
  – Main idea: each hop shifts one digit of the key into the identifier (left-to-right)
  – E.g., node 110 wants to find 011:
    • 110 jumps to 100
    • 100 jumps to 001
    • 001 jumps to 011
Routing in De Bruijn Graphs
• Lookup algorithm at node m
  – Initially kshift = k (the key to look up)
  – All shift operations (<<) are modulo N
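Both the neighbor rule and shift-based routing can be checked against the slide's examples (a sketch for k = 2 over b-bit identifiers, N = 2^b):

```python
# De Bruijn graph for k = 2: node i links to 2i mod N and 2i+1 mod N,
# and routing shifts the key's bits into the identifier one hop at a time.
def neighbors(i, b):
    N = 1 << b
    return ((2 * i) % N, (2 * i + 1) % N)

def route(start, key, b):
    N, path, cur = 1 << b, [start], start
    for shift in range(b - 1, -1, -1):   # key bits, left to right
        bit = (key >> shift) & 1
        cur = ((cur << 1) | bit) % N
        path.append(cur)
    return path
```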
Making a DHT of De Bruijn Graphs
• With d=2 pointers we get log(N) hops, where N is the identifier-space size (2^160)
  – How do we achieve log(n), where n = number of nodes?
• Main idea
  – Route on an imaginary De Bruijn graph over all 2^160 identifiers
    • Invariant: go to the predecessor of the imaginary node
  – Store a pointer, called d, to the predecessor of 2i
Koorde DHT
• Algorithm at node m
  – i is the imaginary node
    • Initially i = m.successor
  – Initially kshift = k
2-hop Lemma
• The number of hops is w.h.p. at most 3·log(N)
  – i.e., we need about 2 successor traversals per De Bruijn hop
• When at m = predecessor(i)
  – Jump to 2m
  – Traverse successors to reach predecessor(2i)
    • That is a (2i-2m)/N fraction of the space, holding n(2i-2m)/N nodes
    • On average, i-m = N/n
    • So n(2N/n)/N = 2 nodes are traversed
QED
Koorde works! (2)
• Still O(log N); how do we get O(log n)?
• Use the flexibility in the i parameter
  – i can be set to any identifier in the range (m, m.succ]
  – Set the low-order bits of i to maximize the number of already-matching final digits
O(log n) hops Koorde theorem
• Distance between m and i: on average N/n
  – W.h.p. the distance is more than N/n²
  – Number of low-order bits free in that range:
    • log(N) - 2·log(n) bits can be set arbitrarily
• Need to route a total of log(N) bits
  – log(N) - 2·log(n) are already done
  – 2·log(n) bits remain to be shifted
QED
architecture of structured overlays
a formal view of DHTs
General architecture for DHTs
• Metric space S with distance function d
  – d(x,y) ≥ 0
  – d(x,x) = 0
  – d(x,y) = 0 => x = y
  – d(x,z) ≤ d(x,y) + d(y,z)
  – d(x,y) = d(y,x) (not always required)
• E.g.:
  – d(x,y) = y - x (mod N)   (Chord)
  – d(x,y) = x xor y   (Kademlia)
  – d(x,y) = sqrt((x1-y1)² + … + (xd-yd)²)   (CAN)
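The three example metrics can be written out directly (a sketch; the ring size N = 16 and the sample points are illustrative). Note how the Chord distance is directional while the XOR metric is symmetric:

```python
import math

N = 16

def d_chord(x, y):        # directed ring distance (not symmetric)
    return (y - x) % N

def d_kademlia(x, y):     # XOR metric (symmetric, unidirectional)
    return x ^ y

def d_can(x, y):          # Euclidean distance between coordinate tuples
    return math.sqrt(sum((a - b)**2 for a, b in zip(x, y)))
```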
Graph Embedding
• Embed a virtual graph for routing
  – Powers of 2 (Chord)
  – Plaxton mesh (Pastry/Tapestry)
  – Hypercube
  – De Bruijn (Koorde, DH)
  – Butterfly (Viceroy)
• A node is responsible for many virtual identifiers
  – E.g., a Chord node is responsible for all virtual ids between its predecessor's id and its own id
P-Grid (EPFL)

[Figure: abstracting a tree vs. the actual connectivity graph; a binary trie over prefixes 0/1, 00/01, …, with leaves 000-111 held by peers A-H]

• Structural replication
  – Multiple peers responsible for the same key-space
  – Multiple routes resolving the same prefix
Query forwarding in P-Grid
• Query at A for 010: A forwards it to D

[Figure: the trie with leaves 000-111 held by peers A-H]
Query forwarding in P-Grid
• Query at A for 010: D forwards it to C, who has the answer!

[Figure: the trie with leaves 000-111 held by peers A-H]
Node joining in P-Grid

[Figure: new node y wants to join the network and contacts z; y and z negotiate to repartition the key-space, splitting prefix 0 into 00 and 01 (alternatively, they could have decided to be replicas)]

• Multiple peers can also decide to be replicas of the same partition
  – Structural replication (a.k.a. zone overloading)
  – A different kind of replication than in Chord
P-Grid's self-referential directory and overlay maintenance
• The overlay itself serves as the directory (logical ID <-> IP address)
  – Routing is based on logical addresses (and cached IPs)
  – In case of failure (if the local cache does not work), look up the IP address through the overlay
• Churn: membership dynamics (peers leave and re-join), with peers rejoining under dynamic IP addresses
• You may want to reconnect with the same guy
  – Social/trust networks …
  – Storage systems … (returning with content)
[Figure: P-Grid trie (prefixes 0, 1, 00, 01, 10, 11, 000, …, 101) with each peer's routing table and replica list; the legend distinguishes up-to-date cache entries, stale cache entries, peers presently online, and peers presently offline. This toy example uses a 4-bit representation of an ID as the corresponding key: information about peer 4 is stored under key 0100 at peers 5 and 14.]

[Aberer04]
[Figure: the same trie and routing tables, used to trace a query]

query(01*) @ 7
… query(0101) @ 7 (for stale entry 5; cycle, abort)
… query(1110) @ 7 (for stale entry 14; forward to 12 or 13)
… query(1110) @ 12 (is offline)
… query(1110) @ 13 (for stale entry 2)
… … query(0010) @ 13 (forward to 5)
… … query(0010) @ 5 (forward to 7)
… … query(0010) @ 7 (forward to 9)
… … query(0010) @ 9 (new entry for 2 found!)
… query(1110) @ 2 (new entry for 14 found!)
query(01*) @ 14 (finally)
Self-healing recursive queries
• Encountering unusable routes triggers queries recursively
• Recursive queries heal the network
  – A family of overlay maintenance schemes that is more efficient and adaptive than proactive approaches
  – Two extremes of this family: Correction-on-Use and Correction-on-Failure
• The system operates at a dynamic equilibrium
Dynamic equilibrium under churn
• At steady state, the effects of churn and self-healing cancel out
  – Churn => ID-to-IP changes (unusable routing entries)
  – Healing => makes routes usable again
• Steady state: the probability distribution of the number of stale references does not change
• We can obtain the repair cost and routing performance (latency / message cost) corresponding to this steady state

[Figure: Markov chain over states "0 stale refs", "1 stale ref", "2 stale refs", …, "r stale refs"; ID changes move the chain toward more staleness, repairs move it back. Here r is the number of references (redundancy) per routing level per peer.]
Dynamic equilibrium under churn
[Figure: contour map of cost/resilience trade-offs]
Dynamic equilibrium under churn
Dynamic equilibrium under churn
• Comparison of maintenance mechanisms based on the degree of laziness
• Breakdown of the lazy mechanism
  – Analogous to Bamboo's empirical experience of positive feedback!
Taxonomy of route maintenance mechanisms (circa 2004)
[Figure: the taxonomy, spanning reactive to proactive strategies]
Prevention is better than cure
• Predictive and proactive strategies for routing table maintenance
  – Kademlia
    • Like P-Grid, but uses the XOR metric for routing
  – Accordion
    • Like Chord, but exploits properties of algebraic small-world networks
Kademlia
• Note the similarity of its topology with P-Grid's, but Kademlia uses a different (XOR) routing mechanism
[Maymounkov02]
XOR routing
Reducing the effect of churn
• Empirical observation from a Gnutella trace: the probability of remaining online for another hour (y-axis) increases with uptime (x-axis, in minutes)
• Least-recently-seen eviction policy for the "k-bucket"
  – But never evicts live nodes
Accordion [Li05]
• Proactive route maintenance, based on the small-world distribution [Kleinberg00], which is flexible (only long-distance routes need it):

  Pr[neighbor is x away in ID space] ∝ 1/x

• Guarantees poly-log(n) lookup hops
• Allows smooth expansion of the routing table
Proactive route maintenance
• Main idea: evict stale entries efficiently
  – Delete proactively, before a lookup times out
  – Pinging uses bandwidth inefficiently
  – Instead, predict each entry's Pr(alive)
  – Delete entries with Pr(alive) < threshold
Choosing the best deletion threshold
• Analytic results: delete entries with Pr(alive) < x

[Figure: average lookup hops + timeouts (y-axis) as a function of the deletion threshold x (x-axis); deleting too aggressively and deleting too lazily both hurt, and the best threshold lies in between]
Predicting routing entry liveness
• U: the entry's known uptime; A: its age, the time since it was last contacted

  Timeline:  joined  |-- U --|  last contacted  |-- A --|  now

• With Pareto session times:

  Pr(alive) = Pr(lifetime > U+A | lifetime > U) = U / (U+A)

• Delete an entry if U/(U+A) < threshold
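A sketch of the eviction rule, assuming the simple Pareto form Pr(alive) = U/(U+A) given above (the sample entries and the 0.5 threshold are illustrative, not Accordion's actual parameters):

```python
# Predict liveness of routing-table entries and evict the unlikely ones.
def pr_alive(uptime, age):
    # Pareto session-time assumption: Pr(alive) = U / (U + A)
    return uptime / (uptime + age)

def evict(entries, threshold=0.5):
    """Keep only entries whose predicted liveness meets the threshold."""
    return [e for e in entries if pr_alive(e["uptime"], e["age"]) >= threshold]

table = [
    {"id": "a", "uptime": 3600, "age": 600},   # long-lived, seen recently
    {"id": "b", "uptime": 60, "age": 600},     # short-lived, stale
]
kept = evict(table)
```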
Evaluation: performance/cost tradeoff

[Figure: performance (average lookup latency, msec, y-axis) vs. cost (bandwidth budget, bytes/node/sec, x-axis)]
Comparing with parameterized DHTs

[Figure: average lookup latency (msec, y-axis) vs. average bandwidth consumed (bytes/node/sec, x-axis)]
[Figure: average lookup latency (msec, y-axis) vs. average bandwidth consumed (bytes/node/sec, x-axis)]
• The convex hull outlines the best tradeoffs
Lowest latency for varying churn (fixed budget, variable churn)

[Figure: average lookup latency (msec, y-axis) vs. median node session time (hours, x-axis)]

• Accordion has the lowest latency at low churn
• Accordion's latency increases only slightly at high churn
Accordion stays within budget (fixed budget, variable churn)

[Figure: average bandwidth (bytes/node/sec, y-axis) vs. median node session time (hours, x-axis)]

• Other protocols' bandwidth increases with churn
Conclusions
• Reactive strategies
  – Redundancy can be exploited
    • To determine the degree of laziness
    • Trade-off between cost and resilience
  – May lead to catastrophic failures under high churn (particularly for a lazy reactive strategy)
    • E.g., because of positive feedback
• Proactive strategies
  – Reduce the chance of catastrophic failure
    • At the cost of continuous bandwidth usage, sometimes unnecessarily
• Most maintenance strategies ignore the fact that persistent IDs may be useful
  – E.g., they do not look into the storage maintenance costs that need to be carried out as collateral
Bootstrapping structured overlays
Part III
Issues:
• Properties of the resulting overlay
  – Load-balance, proximity, …
• Bootstrapping mechanisms
  – Sequential, parallelized (some implicit centralization)
  – Decentralized
• Cost and overheads
  – Construction cost & latency, …
Bootstrapping structured overlays
In the beginning, there was …
Trivia:
The term "bootstrapping" alludes to a German legend about Baron Münchhausen, who claimed to have been able to lift himself out of a swamp by pulling himself up by his own hair. In later versions of the legend, he used his own boot straps to pull himself out of the sea which gave rise to the term bootstrapping. The term is believed to have entered computer jargon during the early 1950s by way of Heinlein's short story By His Bootstraps first published in 1941. (from Wikipedia)
Bootstrapping
Load-balancing in DHTs
Load balancing in peer-to-peer (P2P) systems is a mechanism to spread various kinds of load, such as storage, access, and message forwarding, among participating peers in order to achieve fair or optimal utilization of contributed resources such as storage and bandwidth.
Bootstrapping overlays
While bootstrapping an overlay network, we need to ensure good load-balancing characteristics.
– In a system with N homogeneous nodes
– The load is optimally balanced when each node carries around 1/N of the total load
Load-balancing in DHTs
A First step: DHT
Use uniform hashing
The basic idea: Generate keys for each object to be stored by applying uniform (consistent) hashing (e.g. SHA-1)
• The keys are then uniformly distributed over the key-space
Assign peers to parts of the key-space by applying the same hashing to, say, each peer's IP address*
• Peers are then distributed uniformly over the key-space
This was expected to achieve load-balance
* Hashing was also expected to provide security in the original design of Chord, etc.
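The basic idea can be sketched in a few lines of Python. This is a toy illustration in the spirit of Chord, not any particular implementation; the 16 hypothetical peer IP addresses, the 32-bit key-space, and the successor assignment rule are all assumptions for the example:

```python
import hashlib
from bisect import bisect_left

def h(value: str, bits: int = 32) -> int:
    """Uniform hashing via SHA-1, truncated to a 32-bit key-space."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** bits)

# Peers are placed on the ring by hashing their (hypothetical) IP addresses.
peers = sorted(h(f"10.0.0.{i}") for i in range(16))

def responsible_peer(key: str) -> int:
    """A key is assigned to its successor on the ring (Chord-style rule)."""
    k = h(key)
    idx = bisect_left(peers, k)
    return peers[idx % len(peers)]  # wrap around past the largest peer ID

node = responsible_peer("some-document.txt")
```

Because the hash output is (approximately) uniform, both keys and peers spread evenly over the key-space; the next slides show why that alone is not enough for load-balance.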
• Analysis of distribution of data
• Example– Parameters
• 4,096 nodes• 500,000 documents
– Optimum: ~122 documents per node
Optimal distribution of documents across nodes
Load-balancing in DHTs
[Rieche06]
• Number of nodes storing no document– Parameters
• 4,096 nodes
• 100,000 to 1,000,000 documents
– Some nodes w/o any load
Load-balancing in DHTs
Something’s wrong! What? Why??
Balls into bins analogy
• n number of intervals (bins)– Intervals of equal size
• m number of items (balls)• sequentially choose a bin randomly for
each ball– A bin is hit with probability p = 1/n
• The number of balls in a bin is then given by the binomial distribution
  – Binomial distribution: P(load_b = i) = C(m, i) · (1/n)^i · (1 − 1/n)^(m−i)
  – Standard deviation: σ_b = √( m · (1/n) · (1 − 1/n) )
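The imbalance these formulas predict is easy to reproduce in a short simulation; the sketch below uses the example's numbers (4,096 bins, 500,000 balls), with the random seed as an arbitrary choice:

```python
import random
from math import sqrt

random.seed(1)
n, m = 4096, 500_000                 # bins (nodes) and balls (documents)
load = [0] * n
for _ in range(m):
    load[random.randrange(n)] += 1   # each ball hits a bin with probability 1/n

mean = m / n                              # ~122 documents per node (the optimum)
std = sqrt(m * (1 / n) * (1 - 1 / n))     # binomial standard deviation, ~11
print(f"mean={mean:.1f} std={std:.1f} max={max(load)} min={min(load)}")
```

With 4,096 bins, the most and least loaded bins land several standard deviations away from the mean, which is exactly the imbalance seen in the document-distribution plots.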
A quick example (balls into bins)
Using mathematica
In[61]:= Array[a, 50]; Do[a[i] = 0, {i, 1, 50}];
         Do[x = Ceiling[50 Random[]]; a[x] = a[x] + 1, {j, 1, 1000}];
         Histogram[Array[a, 50], FrequencyData -> True]

[Histogram: loads of the 50 bins after 1,000 balls, scattered widely around the expected value of 1000/50 = 20]
Load-balancing in DHTs: Virtual Servers
• Each node is responsible for several intervals– "Virtual server"
• Example– Chord
[Figure: Chord ring with nodes A, B, C, each responsible for several intervals]
Increase the effective “n” by having many virtual peers for the same physical computer
σ_b = √( m · (1/n) · (1 − 1/n) )   [Rao03]
• Each node is responsible for several intervals– log (n) virtual servers
• Load balancing– Different possibilities to change servers
• One-to-one• One-to-many• Many-to-many
– Copy of an interval is like removing and inserting a node in a DHT
Virtual Server
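A toy experiment suggests how multiple virtual servers per node even out the owned key-space. The node count, ring size, and the successor-interval ownership rule below are illustrative assumptions, not the protocol itself:

```python
import random

random.seed(7)
RING = 2 ** 32
n, v = 64, 6               # 64 physical nodes; log2(64) = 6 virtual servers each

def owned_share(ids_per_node):
    """Fraction of the ring each physical node owns (successor intervals)."""
    points = sorted((vid, node) for node, vids in ids_per_node.items() for vid in vids)
    share = {node: 0.0 for node in ids_per_node}
    closing = [(points[0][0] + RING, None)]          # wrap the ring around
    for (vid, node), (nxt, _) in zip(points, points[1:] + closing):
        share[node] += (nxt - vid) / RING
    return share

plain = {i: [random.randrange(RING)] for i in range(n)}           # 1 ID per node
virt = {i: [random.randrange(RING) for _ in range(v)] for i in range(n)}

p = list(owned_share(plain).values())
q = list(owned_share(virt).values())
print(max(p), max(q))
```

Each node's share becomes a sum of several small intervals instead of one large one, so the variance of the shares shrinks noticeably.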
[Figure: heavy (H) and light (L) nodes on the ring]
Load stealing/shedding
• One-to-One– Light node picks a random ID– Contacts the node x responsible for it– Accepts load if x is heavy
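A minimal sketch of the one-to-one scheme. The numeric loads, the heaviness threshold, and the "pick a random node" stand-in for a random-ID lookup are assumptions for illustration:

```python
import random

random.seed(3)
# Toy model: each node has a numeric load; 'heavy' means above THRESHOLD.
loads = {node: random.randint(0, 20) for node in range(32)}
THRESHOLD = 14
total = sum(loads.values())

def one_to_one_step(light):
    """A light node probes the node responsible for a random ID and
    takes over excess load if that node turns out to be heavy."""
    if loads[light] >= THRESHOLD:
        return                              # no longer light: nothing to do
    x = random.choice(list(loads))          # stands in for looking up a random ID
    if loads[x] > THRESHOLD:                # x is heavy: shed its excess
        move = min(loads[x] - THRESHOLD, THRESHOLD - loads[light])
        loads[x] -= move
        loads[light] += move

for node in list(loads):
    one_to_one_step(node)
```

Note the scheme only moves load when a random probe happens to hit a heavy node, which is why the one-to-many and many-to-many variants add directories.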
[Figure: light nodes L1–L5 report to directories D1, D2; heavy nodes H1–H3 query the directories]
• One-to-Many– Light nodes report their load information to directories– Heavy node H gets this information by contacting a directory– H contacts the light node which can accept the excess load
Load stealing/shedding
[Figure: heavy nodes H1–H3 and light nodes L1–L5 rendezvous via directories D1, D2]
• Many-to-Many
  – Many heavy and light nodes rendezvous at each step
  – Directories periodically compute the transfer schedule and report it back to the nodes, which then do the actual transfer
Load stealing/shedding
• Advantages– Easy shifting of load
• Whole Virtual Servers are shifted
– Can be extended for heterogeneous environments*
  • More virtual servers for a resource-rich node
• Disadvantages– Increased administrative and message overheads
• Maintenance of all Finger-Tables
– Much load is shifted– Much more overlay traffic
Load-balancing in DHTs: Virtual Servers
* [Godfrey05]
• Idea– One hash function for all nodes
• h0
– Multiple hash functions for data• h1, h2, h3, …hd
• Two options– Data is stored at one node– Data is stored at one node &
other nodes store a pointer
Load-balancing in DHTs: Power of 2 choices
[Byers03]
• Inserting Data– Results of all hash functions are calculated
• h1(x), h2(x), h3(x), …hd(x)
– Data is stored on the retrieved node with the lowest load
– Alternative: other nodes store a pointer
Load-balancing in DHTs: Power of 2 choices
• Retrieving
  – Without pointers
    • Results of all hash functions are calculated
    • Request all of the possible nodes in parallel
    • One node will answer
  – With pointers
    • Request only one of the possible nodes
    • That node can forward the request directly to the final node
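The insert path described above can be sketched as follows. The salted-SHA-1 construction for the d hash functions and the 64-node key-space are assumptions for illustration, not part of [Byers03]:

```python
import hashlib

NODES = list(range(64))              # hypothetical node IDs covering the key-space
load = {node: 0 for node in NODES}

def h(item: str, salt: int) -> int:
    """The d hash functions h1..hd, derived here by salting SHA-1."""
    return int(hashlib.sha1(f"{salt}:{item}".encode()).hexdigest(), 16) % len(NODES)

def insert(item: str, d: int = 2):
    """Compute h1(x)..hd(x) and store the item on the least-loaded candidate."""
    candidates = [h(item, salt) for salt in range(d)]
    target = min(candidates, key=lambda node: load[node])
    load[target] += 1
    return target, candidates

for i in range(1000):
    insert(f"doc-{i}")
```

Retrieval without pointers would query all d candidate nodes in parallel; with pointers, any candidate can forward to the one actually storing the item.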
Load-balancing in DHTs: Power of 2 choices
• Advantages
  – Simple
  – Generic randomized algorithm
• Disadvantages (with the specific realization)
  – Message overhead when inserting data
  – With pointers
    • Additional administration of pointers: more load, more adverse effects of churn
  – Without pointers
    • Message overhead at every search
Load-balancing in DHTs: Power of 2 choices
A quick example (power of two choices)
Using mathematica
In[67]:= Array[b, 50]; Do[b[i] = 0, {i, 1, 50}];
         Do[x = Ceiling[50 Random[]]; y = Ceiling[50 Random[]];
            If[b[x] > b[y], b[y] = b[y] + 1, b[x] = b[x] + 1], {j, 1, 1000}];
         Histogram[Array[b, 50], FrequencyData -> True]

[Histogram: bin loads are much more concentrated around the expected value of 20 than with a single choice]
[Figure: balls into bins with d = 2 choices per ball]
So far ...
• Bootstrapping DHTs
  – Uniform key distributions
  – Peers joined the network quasi-sequentially
    • The network was partitioned incrementally
• Next
  – Non-uniform keys
  – Parallelized construction

CAN network construction
Beyond DHTs: Data-oriented overlays
Preserve ordering information (as occurring in natural language, say).
Needs more sophisticated (storage) load-balancing mechanisms to support range-partitioned data.
Uniform hashing (used in DHTs) destroys ordering information!
Resource → Key: What is a suitable function? Depends on the application needs!
Figure courtesy Sarunas
Beyond DHTs: Data-oriented overlays
• Complex queries– Approximate or similarity queries
• DHTs can only support exact search
– Range queries– etc.
• e.g., Skyline queries
• Overlay supporting arbitrarily skewed load-distributions– DHTs are just a special case
Parallelized construction of overlays
• Shortcomings of sequential construction– Implicitly assumes some coordinator
• Implicit centralization
– Slow• Since peers join one by one
• Parallelized construction
  – Faster
  – Analogous to (re-)indexing a new attribute in a DB
  – Can be useful for recovery from catastrophic failures
Parallelized construction of overlays
[Figure: peers 1–8 under a skewed load-distribution]
• Given– A mechanism to meet other random peers
• e.g., an existing unstructured overlay
– A parameter p• Determined according to the load-skew
[Aberer05]
Distributed proportional partitioning:
- p fraction of peers take one half of the space (partition 0)
- 1−p fraction of peers take the other half (partition 1)
- Needed for partitioning the key-space at a granularity adaptive to the load-skew
[Figure: peers 1–8 under a skewed load-distribution, partitioned into halves 0 and 1 with p = 0.75]
Parallelized construction of overlays
Referential integrity:
- Each peer needs to know some peer from the complementary partition
- Needed for overlay routing
- This constraint necessitates a non-trivial algorithm (in order to reduce communication cost during overlay construction)
[Figure: skewed load-distribution; each peer in partition 0 needs to know peer 7 or 8 in partition 1, and vice versa]
Parallelized construction of overlays
Markov partitioning process
- peers are decided (0/1) or undecided
- each undecided peer interacts with some random peer, which has decided 0/1 or is still undecided
[Figure: skewed load-distribution; a peer in partition 0 knows peer 6, a peer in partition 1 knows peer 1, and vice versa]
Parallelized construction of overlays
[Aberer05]
Used recursively:
- Partitions are repartitioned (using appropriate parameters)
- A load-balanced overlay is formed
[Figure: recursive repartitioning of peers 1–8 into partitions 00, 010, 011, and 1]
Several other practical issues - local estimates of parameter p - replication factor (re-)balancing
Now we can build a load-balanced overlay in a parallelized manner for rather arbitrary load-skews
Parallelized construction of overlays
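The recursive partitioning above can be caricatured in a few lines. This is a much-simplified sketch, not the [Aberer05] protocol: here each undecided peer decides locally with bias p, whereas the real Markov process decides through pairwise peer interactions, and p itself is estimated from local samples:

```python
import random

random.seed(11)

def partition(peers, p):
    """Split peers so that roughly a p-fraction lands in half '0'. In the real
    protocol each undecided peer decides through pairwise interactions; here
    it simply decides locally with bias p."""
    return {peer: ('0' if random.random() < p else '1') for peer in peers}

def build(peers, depth, p=0.75):
    """Recursively repartition, assigning each peer a key-space path."""
    if depth == 0 or len(peers) <= 1:
        return {peer: '' for peer in peers}
    side = partition(peers, p)
    paths = {}
    for half in ('0', '1'):
        sub = [peer for peer in peers if side[peer] == half]
        for peer, suffix in build(sub, depth - 1, p).items():
            paths[peer] = half + suffix
    return paths

paths = build(list(range(200)), depth=3)
```

With p = 0.75, about three quarters of the peers take responsibility for half 0 at every level, matching a load-skew where that half holds more data.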
• Advantages– Parallelized and fast construction– No need of coordination
• Since no need of sequential joins
– Load-balancing for arbitrary load-skews
• Disadvantages (with the specific realization)
– Complex• Algorithm design, analysis and implementation• Needs partial global information
– e.g., parameter choices (based on sampling)
Distributed proportional partitioning
• Pairing and merging virtual trees– Pair nodes randomly
• By probing potential successors – and accepting/rejecting probes
• Paired nodes act as virtual supernode• Repeat the process
– needs a mechanism to merge such virtual trees
Other parallelized construction mechanism: Sorting peer-IDs to build a ring
[Figure: pairing peers into virtual nodes, then merging the resulting trees (with sorted peers)]
[Angluin05]
• Gossip-based mechanism
  – Nodes start with random subsets
    • Leaf-set: maintain a constant number of nodes
      – Arrange them as potential predecessors/successors (ideally an equal number of each)
    • Gossip leaf-set information with nodes it knows
      – e.g., its current leaf-set nodes (may also include past ones)
    • Refine information & repeat the process
Other parallelized construction mechanism: Sorting peer-IDs to build a ring
node : leaf-set
7 : 4, 5, 9, 10
9 : 6, 8, 12, 14
…

Node 7 gossips its leaf-set with nodes it knows (including node 9), and each node refreshes its leaf-set.

After gossip, the recalculated leaf-sets:
7 : 5, 6, 8, 9
9 : 7, 8, 10, 12
…
Gradually converges to form a sorted list [Montresor05]
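A toy version of such leaf-set gossip is sketched below. The ring size, leaf-set size, initial-view size, and round count are arbitrary choices, and this is a simplification of [Montresor05], not the actual protocol:

```python
import random

random.seed(5)
SPAN, N, K = 1000, 40, 4          # ring size, number of peers, leaf-set size
ids = random.sample(range(SPAN), N)

def ring_dist(a, b):
    d = abs(a - b) % SPAN
    return min(d, SPAN - d)

def closest(me, known):
    """Node me keeps the K ring-wise closest known peers as its leaf-set."""
    return sorted((x for x in known if x != me), key=lambda x: ring_dist(me, x))[:K]

# Nodes start with random subsets, then repeatedly gossip: two views are
# merged, and each side keeps the K closest peers it now knows of.
view = {i: closest(i, random.sample(ids, 8)) for i in ids}
for _ in range(15):
    for i in ids:
        j = random.choice(view[i])
        merged = set(view[i]) | set(view[j]) | {i, j}
        view[i] = closest(i, merged)
        view[j] = closest(j, merged)
```

Because each node keeps only ring-wise close peers and gossips with them, knowledge of true neighbors spreads along the ring, and the views converge toward a sorted list.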
• Advantages
  – Parallelized and fast construction
  – No need for coordination
    • Since no need for sequential joins
  – Relatively simple (no global information)
  – Gossip-based mechanism is robust against churn during the sorting process
• Disadvantages (with the specific realizations)
  – Do not take into account load-balancing issues
  – Not directly applicable for systems using structural replication / zone over-loading
  – Just builds the basic ring, but not the long-range links
    • Though these are not complicated to build once the ring is in place
  – Pairing & merging mechanism is vulnerable to churn during the sorting process
Sorting peer IDs
So far …
Bootstrapping issues: – Load-balance– How?
• Sequential• Parallelized
Assumes (implicitly) that any peer can potentially meet any other peer, i.e., that all peers are already part of one connected network, and then builds a single structured overlay composed of all these peers.
Figure from http://www.tellagate.com/kojima/blog/
[Figure: clusters A, B, C, …, X; networks 1 and 2, each formed over time, join]
Merger is accomplished trivially & transparently
The network can thus grow by organic merger of smaller (originally isolated) networks, allowing decentralized bootstrapping of Gnutella like unstructured overlays
Bootstrapping in unstructured networks
• Overlay merger– Needed for decentralized bootstrapping
– Needed for recovery from partitioning• Ignored in P2P literature!
– Lack of experience with real deployments
– Focus on other issues like churn
• Trivial in unstructured and super-peer networks
– Merger of index is a standard DB issue
Merging structured overlays
[Datta07]
• Overlay merger– Correctness of routing
• Maintain routing table
– Correct and complete key binding• Ship data to responsible peer(s)
– Replica synchronization
For locating the desired data/content, both are essential!
Merging structured overlays
• Merging a tree topology with structural replication (e.g., P-Grid) poses significantly different challenges from merging ring-based networks.
  – Merging P-Grid networks transparently is much simpler algorithmically.
    • So we use this case to illustrate the idea, the challenges, and important metrics …
Merging structured overlays
• If they have the same path– Synchronize replicas
• If one has a strict prefix path– Extend path and routing table, synchronize replica
• Stimulate new interactions
When peers from different networks meet
When peers from different networks meet
• Keys continue to be accessible to peers
  – Replica synchronization is needed to access keys from the other network

The merger process should be transparent: at the application level, all keys that were once accessible continue to be accessible to individual users (unless the application deletes them).
When peers from different networks meet
- This transparency may be violated until the replicas are actually synchronized!
- Ideally, we would detect when the sync process has completed throughout the network, while still being able to distinguish peers from the originally different networks in the meanwhile; then we could retain transparency.
- Instead of detecting global completion, use a heuristic timeout (once local sync is completed).
When peers from different networks meet
• 3 axes
– Network sizes
– Duplicate content in the original unmerged networks
– Heuristic parameter (timeout)
Parameter space
• Recall (over time) R_{i/j}
  – Metric from information retrieval
    • Recall is the fraction of the documents relevant to the query that are successfully retrieved
  – R_{i/i} should always be 1 for the merger to be transparent to applications
• Volume of data transferred
Important performance metrics
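The recall metric itself is straightforward to compute. The key sets below are a hypothetical post-merger snapshot, purely for illustration:

```python
def recall(retrieved: set, relevant: set) -> float:
    """Recall: the fraction of relevant documents that were successfully retrieved."""
    if not relevant:
        return 1.0
    return len(retrieved & relevant) / len(relevant)

# R_{i/j}: how well peers of network i can retrieve keys originating in network j.
# Hypothetical snapshot mid-merger: network 1 still sees all of its own keys
# (R_{1/1} = 1, required for transparency) but only part of network 2's keys.
keys_net1 = {"a", "b", "c"}
keys_net2 = {"d", "e", "f", "g"}
reachable_from_net1 = {"a", "b", "c", "d", "e"}

r_11 = recall(reachable_from_net1, keys_net1)   # must stay 1.0
r_12 = recall(reachable_from_net1, keys_net2)   # grows toward 1.0 as replicas sync
```

Tracking R_{i/j} over time during the merger is exactly what the following plots show.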
Recall
Volume of data transferred
Increases when there is more common data!
- The allotment of peers' key-space changes, so …

Merger of two networks with originally arbitrarily different key-space partitions
Volume of data transferred
Volume of data transferred
Recall (with worst choice of timeout)
Concluding remarks
Part IV
DHTs
• Characteristic property– Self-manage responsibilities in presence:
• Node joins• Node leaves• Node failures• Load-imbalance• Replicas
• Basic structure of DHTs– Metric space– Embed graph with efficient search algo– Let each node simulate many virtual nodes
The future of DHTs
• DHTs automatically handle– Replication, faults, load-balancing, joins, leaves, …
• One-size-fits-all?
  – Need dynamically auto-tunable DHTs
  – Different applications have different needs
• Stronger guarantees– Consistency models– Transactions– Access layers
To P2P or not to P2P
• P2P vs. dedicated infrastructure
  – Is it technically feasible to realize everything using P2P?
    • Maybe: it's an open issue
    • Unlikely, in terms of performance
    • Harder to guarantee reliability (no one is fully accountable)
To P2P or not to P2P
• P2P vs. dedicated infrastructure
  – Shall P2P be preferred whenever technically possible?
    • Don't think so …
– A matter of risk/cost vs. benefit trade-offs
To P2P or not to P2P
• P2P vs. dedicated infrastructure
  – How about scalability?
    • Again, it depends
      – With enough money, client-server can scale in many cases (e.g., Google)
      – Network-resource-consuming applications like content distribution may scale better using a P2P approach than client-server
To summarize …
• P2P makes sense if:– Budget/resource is limited
• Dedicated infrastructure is unsustainable or makes less economic sense
– Wide interest and relevance • To form a critical mass of users contributing resources
– Trust between participants is reasonably `high’• What’s `high’ depends on the application
– Rate of change is manageable• E.g., membership dynamics is not `too high’
– Criticality is `low'
  • Since it is harder to guarantee reliability or QoS in P2P
  • E.g., Skype's disclaimer states it's not for making emergency calls!
To summarize …
• P2P systems exhibit following characteristics:– Autonomy from central servers– Use of edge resources
• Instead of dedicated infrastructure– Intermittent connectivity– Reliance on self-organizing mechanisms using
limited (locally available) information • No global coordination and control
– Unlike other distributed systems like Grid
References
[Aberer05] Indexing Data-oriented Overlay Networks. K. Aberer, A. Datta, M. Hauswirth, R. Schmidt (VLDB 2005)
[Angluin05] Fast Construction of Overlay Networks. D. Angluin, J. Aspnes, J. Chen, Y. Wu, Y. Yin (SPAA 2005)
[Byers03] Simple Load Balancing for Distributed Hash Tables. J. Byers, J. Considine, M. Mitzenmacher (IPTPS 2003)
[Datta07] Merging Intra-Planetary Index Structures: Decentralized Bootstrapping of Overlays. A. Datta (SASO 2007)
[Ghodsi06] Distributed k-ary System: Algorithms for Distributed Hash Tables. A. Ghodsi, Dissertation, KTH (Royal Institute of Technology), Sweden, 2006
[Godfrey05] Heterogeneity and Load Balance in Distributed Hash Tables. P. B. Godfrey, I. Stoica (INFOCOM 2005)
[Montresor05] Chord on Demand. A. Montresor, M. Jelasity, O. Babaoglu (P2P 2005)
[Rao03] Load Balancing in Structured P2P Systems. A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, I. Stoica (IPTPS 2003)
[Rieche06] Reliability and Load-Balancing in DHTs. S. Rieche, K. Wehrle, H. Niedermayer, S. Götz. In: R. Steinmetz, K. Wehrle (Eds.), Peer-to-Peer Systems and Applications
[Xu03] On the Fundamental Tradeoffs between Routing Table Size and Network Diameter in Peer-to-Peer Networks. J. Xu, A. Kumar, X. Yu (JSAC 2003)
References
[Aberer04] Efficient, Self-contained Handling of Identity in Peer-to-Peer Systems. K. Aberer, A. Datta, M. Hauswirth. IEEE Transactions on Knowledge and Data Engineering (TKDE) 16(7), 2004
[Kleinberg00] The Small-World Phenomenon: An Algorithmic Perspective. J. Kleinberg (STOC 2000)
[Li05] Bandwidth-efficient Management of DHT Routing Tables. J. Li, J. Stribling, R. Morris, M. F. Kaashoek (NSDI 2005)
[Maymounkov02] Kademlia: A Peer-to-peer Information System Based on the XOR Metric. P. Maymounkov, D. Mazières (IPTPS 2002)
[Rhea04] Handling Churn in a DHT. S. Rhea, D. Geels, T. Roscoe, J. Kubiatowicz (USENIX Annual Technical Conference, 2004)