1
Structured Overlays: self-organization and scalability
Acknowledgement: based on slides by Anwitaman Datta (Nanyang) and Ali Ghodsi
2
Self-organization
• Self-organizing systems are common in nature
  – Physics, biology, ecology, economics, sociology, cybernetics
  – Microscopic (local) interactions
  – Limited information, individual decisions
• Distribution of control => decentralization
  – Symmetry in roles / peer-to-peer
  – Emergence of macroscopic (global) properties
• Resilience
  – Fault tolerance as well as recovery
  – Adaptivity
3
A Distributed Systems Perspective (P2P)
• Centralized solutions undesirable or unattainable
• Exploit resources at the edge
  – no dedicated infrastructure/servers
  – peers act as both clients and servers (servent)
• Autonomous participants
  – large scale
  – dynamic system and workload
  – source of unpredictability, e.g., correlated failures
• No global control or knowledge
  – rely on self-organization
4
One solution: structured overlays / distributed hash tables
5
What’s a Distributed Hash Table?
• An ordinary hash table, which is distributed
• Every node provides a lookup operation
  – Given a key: return the associated value
• Nodes keep routing pointers
  – If an item is not found locally, route to another node

Key        Value
Anwitaman  Singapore
Ali        Berkeley
Alberto    Trento
Kurt       Kassel
Ozalp      Bologna
Randy      Berkeley
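A minimal sketch of this interface (Python is used throughout for illustration; the class and attribute names are hypothetical): each node answers a lookup from its local store and otherwise forwards the request along a routing pointer.

```python
# Sketch only: a node answers lookups locally and routes onwards otherwise.
class DHTNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}         # locally stored <key, value> items
        self.next_node = None   # routing pointer to another node

    def get(self, key, hops_left=16):
        if key in self.store:
            return self.store[key]                         # found locally
        if self.next_node is not None and hops_left > 0:
            return self.next_node.get(key, hops_left - 1)  # route to another node
        return None                                        # give up

a, b = DHTNode("A"), DHTNode("B")
a.next_node = b
b.store.update({"Anwitaman": "Singapore", "Ali": "Berkeley"})
print(a.get("Ali"))   # -> "Berkeley", resolved by routing from A to B
```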
6
Why’s that interesting?
• Characteristic properties
  – Self-management in the presence of joins/leaves/failures
    • Routing information
    • Data items
  – Scalability
    • The number of nodes can be huge (to store a huge number of items)
    • However: search and maintenance costs scale sub-linearly (often logarithmically) with the number of nodes
7
short interlude
applications
8
Global File System
• Similar to a DFS (e.g. NFS, AFS)
  – But files/metadata stored in the directory
  – E.g. Wuala, WheelFS, …
• What is new?
  – Application logic is self-managed
    • Add/remove servers on the fly
    • Automatic failure handling
    • Automatic load-balancing
  – No manual configuration for these operations
Key        Value
/home/...  130.237.32.51
/usr/…     193.10.64.99
/boot/…    18.7.22.83
/etc/…     128.178.50.12
…          …
[Figure: directory entries distributed over nodes A–D on the ring]
9
P2P Web Servers
• Distributed community Web server
  – Pages stored in the directory
• What is new?
  – Application logic is self-managed
    • Automatically load-balances
    • Add/remove servers on the fly
    • Automatically handles failures
• Example:
  – CoralCDN
Key      Value
www.s... 130.237.32.51
www2     193.10.64.99
www3     18.7.22.83
cs.edu   128.178.50.12
…        …
[Figure: entries distributed over nodes A–D on the ring]
10
Name-based Communication Pattern
• Map node names to locations
  – Can store all kinds of contact information
    • Mediator peers for NAT hole punching
    • Profile information
• Used this way by:
  – Internet Indirection Infrastructure (i3)
  – Host Identity Protocol (HIP)
  – P2P Session Initiation Protocol (P2PSIP)
Key      Value
anwita   130.237.32.51
ali      193.10.64.99
alberto  18.7.22.83
ozalp    128.178.50.12
…        …
[Figure: entries distributed over nodes A–D on the ring]
11
towards DHT construction
consistent hashing
12
Hash tables
• Ordinary hash tables
  – put(key, value)
    • Store <key,value> in bucket hash(key) mod 7
  – get(key)
    • Fetch <key,v> s.t. <key,v> is in bucket hash(key) mod 7

Buckets: 0 1 2 3 4 5 6
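A short sketch of this non-distributed case, using Python's built-in hash as a stand-in for the slide's hash function:

```python
# Plain hash table with 7 buckets: bucket index = hash(key) mod 7.
NUM_BUCKETS = 7
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    buckets[hash(key) % NUM_BUCKETS].append((key, value))

def get(key):
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

put("Ali", "Berkeley")
print(get("Ali"))   # -> Berkeley
```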
13
DHT by mimicking Hash Tables
• Let each bucket be a server
  – n servers means n buckets
• Problem
  – How do we remove or add buckets?
  – A single bucket change requires re-shuffling a large fraction of the items
14
Consistent Hashing Idea
• Logical name space, called the identifier space, consisting of identifiers {0,1,2,…, N-1}
• Identifier space is a logical ring modulo N
• Every node picks a random identifier
• Example:
  – Space N=16: {0, …, 15}
  – Five nodes a, b, c, d, e
    • a picks 6
    • b picks 5
    • c picks 0
    • d picks 5
    • e picks 2
[Figure: identifier ring {0, …, 15} with the nodes at their chosen positions]
15
Definition of Successor
• The successor of an identifier is the first node met going in clockwise direction starting at the identifier
• Example
  – succ(12)=14
  – succ(15)=2
  – succ(6)=6
16
Where to store items?
• Use a globally known hash function H
• Each item <key,value> gets the identifier H(key)
• Store the item at the successor of H(key)
  – Term: that node is responsible for the item
• Example
  – H("Anwitaman")=12
  – H("Ali")=2
  – H("Alberto")=9
  – H("Ozalp")=14
Key        Value
Anwitaman  Singapore
Ali        Berkeley
Alberto    Trento
Kurt       Kassel
Ozalp      Bologna
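A small sketch of successor-based placement, assuming for illustration the node identifiers 0, 2, 5, 6 and 11 (the ids used in the later routing examples) and the H(key) values from this slide:

```python
from bisect import bisect_left

N = 16
node_ids = sorted([0, 2, 5, 6, 11])       # illustrative node identifiers

def successor(ident):
    """First node met going clockwise from ident (wraps around the ring)."""
    i = bisect_left(node_ids, ident % N)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

# H(key) values taken from the slide's example
H = {"Anwitaman": 12, "Ali": 2, "Alberto": 9, "Ozalp": 14}
for key, ident in H.items():
    print(key, "-> stored at node", successor(ident))
```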
17
Consistent Hashing: Summary
• + Scalable
  – Each node stores on average D/n items (for D total items, n nodes)
  – On average D/n items are reshuffled for every join/leave/failure
• - However: global knowledge, everybody knows everybody
  – Akamai works this way
  – Amazon Dynamo too
• + Load balancing
  – w.h.p. O(log n) imbalance
  – Can eliminate the imbalance by having each server "simulate" log(n) random buckets
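A sketch of the simulated-bucket (virtual node) idea: each server appears on the ring several times, which evens out the load. The server names, SHA-1 as the hash function and 8 virtual nodes per server are illustrative choices, not part of the original slides.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

N = 2**32
VNODES = 8                       # "simulated" buckets per server (slides: ~log n)

def h(s):
    # SHA-1 stands in for the globally known hash function
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % N

servers = ["s1", "s2", "s3", "s4"]                        # hypothetical servers
ring = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(VNODES))

def responsible(key):
    idx = bisect_left(ring, (h(key), ""))
    return ring[idx % len(ring)][1]                        # successor's server

load = Counter(responsible(str(k)) for k in range(10000))
print(load)   # more virtual nodes per server -> more even counts
```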
18
towards DHT construction
giving up on global knowledge
19
Where to point (Chord)?
• Each node points to its successor
  – The successor of a node p is succ(p+1)
  – Known as the node's succ pointer
• Each node points to its predecessor
  – The first node met in anti-clockwise direction starting at p-1
  – Known as the node's pred pointer
• Example
  – 0's successor is succ(1)=2
  – 2's successor is succ(3)=5
  – 5's successor is succ(6)=6
  – 6's successor is succ(7)=11
  – 11's successor is succ(12)=0
20
DHT Lookup
• To look up a key k
  – Calculate H(k)
  – Follow succ pointers until item k is found
• Example
  – Look up "Alberto" at node 2
    • H("Alberto")=9
    • Traverse nodes: 2, 5, 6, 11 (BINGO)
    • Return "Trento" to the initiator
Key Value
Anwitaman Singapore
Ali Berkeley
Alberto Trento
Kurt Kassel
Ozalp Bologna
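A sketch of this lookup walk on the example ring (nodes 0, 2, 5, 6, 11, with H("Alberto")=9 as above); it reproduces the hop sequence 2, 5, 6, 11:

```python
N = 16

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = None
        self.store = {}

def between(x, a, b):
    """x in the ring interval (a, b], identifiers mod N."""
    if a == b:
        return True
    return 0 < (x - a) % N <= (b - a) % N

def lookup(start, key_id):
    node = start
    while not between(key_id, node.id, node.succ.id):
        node = node.succ                  # follow succ pointers
    return node.succ                      # node responsible for key_id

# Build the ring 0 -> 2 -> 5 -> 6 -> 11 -> 0 used in the example.
ids = [0, 2, 5, 6, 11]
nodes = {i: Node(i) for i in ids}
for a, b in zip(ids, ids[1:] + ids[:1]):
    nodes[a].succ = nodes[b]

nodes[11].store[9] = ("Alberto", "Trento")   # node 11 is responsible for id 9
print(lookup(nodes[2], 9).id)                # hops 2, 5, 6 -> prints 11
```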
21
towards DHT construction
handling joins/leaves/failures
22
Dealing with failures
• Each node keeps a successor-list
  – Pointers to the f closest successors
    • succ(p+1)
    • succ(succ(p+1)+1)
    • succ(succ(succ(p+1)+1)+1)
    • ...
• Rule: if the successor fails
  – Replace it with the closest alive successor
• Rule: if the predecessor fails
  – Set pred to nil
• Set f = log(n)
  – With failure probability 0.5, w.h.p. not all nodes in the list fail: the probability that all of them fail is (1/2)^log(n) = 1/n
23
Handling Dynamism
• Periodic stabilization used to make pointers eventually correct
– Try pointing succ to closest alive successor
– Try pointing pred to closest alive predecessor
Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q
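A direct Python rendering of the stabilization pseudocode above (a sketch only: RPC, concurrency and failure handling are left out).

```python
N = 16

def between(x, a, b):
    """x in the ring interval (a, b], identifiers mod N."""
    if a == b:
        return True
    return 0 < (x - a) % N <= (b - a) % N

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self
        self.pred = None

    def stabilize(self):               # "Periodically at node p"
        v = self.succ.pred             # 1. set v := succ.pred
        if v is not None and between(v.id, self.id, self.succ.id):
            self.succ = v              # 2-3. adopt a closer successor
        self.succ.notify(self)         # 4. send notify(p) to succ

    def notify(self, q):               # "When receiving notify(q) at node p"
        if self.pred is None or between(q.id, self.pred.id, self.id):
            self.pred = q              # adopt a closer predecessor
```

A joining node only sets its succ pointer (via a lookup of its own identifier); repeated rounds of stabilize/notify then weave it into the ring, which is what the next slide relies on.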
24
Handling joins
• When new node n joins– Find n’s successor with lookup(n)– Set succ to n’s successor– Stabilization fixes the rest
Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q

[Figure: ring segment with nodes 11, 13, 15]
25
Handling leaves
• When n leaves
  – Just disappear (like a failure)
• When pred is detected as failed
  – Set pred to nil
• When succ is detected as failed
  – Set succ to the closest alive node in the successor list

[Figure: ring segment with nodes 11, 13, 15]

Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q
26
Speeding up lookups with fingers
• If only the pointer to succ(p+1) is used
  – Worst-case lookup time is n, for n nodes
• Improving lookup time (binary-search style)
  – Point to succ(p+1)
  – Point to succ(p+2)
  – Point to succ(p+4)
  – Point to succ(p+8)
  – …
  – Point to succ(p+2^((log N)-1))
• The distance to the destination is at least halved in each hop, giving log hops
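A sketch of finger construction, using a global view of the example node set instead of real lookups (so it only illustrates which identifiers the fingers point to):

```python
from bisect import bisect_left

N = 16
node_ids = sorted([0, 2, 5, 6, 11])   # illustrative node set

def successor(ident):
    i = bisect_left(node_ids, ident % N)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

def fingers(p):
    # finger i points to succ(p + 2**i), for i = 0 .. log2(N) - 1
    return [successor((p + 2**i) % N) for i in range(N.bit_length() - 1)]

print(fingers(0))   # fingers of node 0: succ(1), succ(2), succ(4), succ(8)
```

Lookups then greedily forward to the finger closest to, but not past, the target identifier, so the remaining distance is at least halved per hop.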
27
Handling Dynamism of Fingers and SList
• Node p periodically:
  – Update fingers
    • Look up p+2^1, p+2^2, p+2^3, …, p+2^((log N)-1)
  – Update the successor-list
    • slist := trunc(succ · succ.slist), i.e. prepend succ to succ's successor-list and truncate to f entries
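A tiny sketch of the slist rule above, with the successor-list kept as a plain Python list and f=4 chosen arbitrarily:

```python
f = 4   # length of the successor-list (illustrative)

def refresh_slist(succ_id, succ_slist):
    # prepend succ to succ's successor-list, then truncate to f entries
    return ([succ_id] + succ_slist)[:f]

print(refresh_slist(5, [6, 11, 0, 2]))   # -> [5, 6, 11, 0]
```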
28
Chord: Summary
• Lookup hops are logarithmic in n
  – Fast routing/lookup, like in a dictionary
• Routing table size is logarithmic in n
  – Few nodes to ping
29
Reliable Routing
• Iterative lookup
  – Generally slower
  – Reliability easy to achieve
    • Initiator is in full control
• Recursive lookup
  – Generally fast (uses established links)
  – Several ways to achieve reliability
    • End-to-end timeouts
    • Any-node timeouts
      – Difficult to determine the timeout value
30
Replication of items
• Successor-list replication (most systems)
  – Idea: replicate nodes
    • If node p is responsible for a set of items K
    • Replicate K on p's immediate successors
• Symmetric replication
  – Idea: replicate identifiers
    • Items with keys 0, 16, 32, 48 are equivalent
    • Whoever is responsible for 0 also stores 16, 32, 48
    • Whoever is responsible for 16 also stores 0, 32, 48
    • …
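A sketch of the key-equivalence classes used by symmetric replication; an identifier space of N=64 and replication degree f=4 are assumed so the classes match the 0/16/32/48 example above:

```python
N, f = 64, 4   # identifier space size and replication degree (assumed)

def equivalence_class(k):
    # keys k, k + N/f, k + 2N/f, ... are equivalent and stored together
    return [(k + i * N // f) % N for i in range(f)]

print(equivalence_class(0))    # -> [0, 16, 32, 48]
print(equivalence_class(16))   # -> [16, 32, 48, 0]
```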
31
towards proximity awareness
plaxton mesh (PRR), pastry/tapestry
32
Plaxton Mesh [PRR]
• Identifiers represented with radix/base k
  – Often k=16 (hexadecimal radix)
  – Ring size N is a large power of k, e.g. 16^40
33
Plaxton Mesh (2)
• Additional routing table on top of the ring
• Routing table construction by example
  – Node 3a7f keeps the following routing table
• Kleene star * for wildcards
  – Flexibility to choose proximate neighbors
• Invariant: any node whose identifier matches the prefix of a row-i entry can fill that entry, so the candidates for each slot are interchangeable

Row 1: 0*    1*    2*    self  4*    5*    6*    7*    8*    9*    a*    b*    c*    d*    e*    f*
Row 2: 30*   31*   32*   33*   34*   35*   36*   37*   38*   39*   self  3b*   3c*   3d*   3e*   3f*
Row 3: 3a0*  3a1*  3a2*  3a3*  3a4*  3a5*  3a6*  self  3a8*  3a9*  3aa*  3ab*  3ac*  3ad*  3ae*  3af*
Row 4: 3a70* 3a71* 3a72* 3a73* 3a74* 3a75* 3a76* 3a77* 3a78* 3a79* 3a7a* 3a7b* 3a7c* 3a7d* 3a7e* self
34
Plaxton Routing
• To route from 1234 to abcd:
  1. 1234 uses rt row 1: jump to a*, e.g. a999
  2. a999 uses rt row 2: jump to ab*, e.g. ab11
  3. ab11 uses rt row 3: jump to abc*, e.g. abc0
  4. abc0 uses rt row 4: jump to abcd
• Routing terminates in log(N) hops
  – In practice log(n), where N is the identifier-space size and n is the number of nodes
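A sketch of prefix routing that reproduces the hop sequence above; the node list acts as a toy global view rather than per-node routing tables:

```python
# Each hop forwards to a node sharing at least one more leading digit with
# the destination (node ids are hex strings).
nodes = ["1234", "a999", "ab11", "abc0", "abcd"]

def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(src, dst):
    hops, cur = [src], src
    while cur != dst:
        p = shared_prefix_len(cur, dst)
        # pick any node that matches one more digit of the destination
        cur = next(n for n in nodes if shared_prefix_len(n, dst) > p)
        hops.append(cur)
    return hops

print(route("1234", "abcd"))   # -> ['1234', 'a999', 'ab11', 'abc0', 'abcd']
```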
35
Pastry – extension to Plaxton mesh
• Leaf set
  – Successor-list in both directions
  – Periodically gossiped to all leaf-set members, O(n^2) [Bamboo]
• Plaxton mesh on top of the ring
  – Failures in the routing table
    • Get a replacement from any node on the same row
• Routing
  1) Route directly to the responsible node if it is in the leaf set, otherwise
  2) Route to a closer (longer prefix) node, otherwise
  3) Route on the ring
36
architecture of structured overlays
a formal view of DHTs
37
General Architecture for DHTs
• Metric space S with distance function d
  – d(x,y) ≥ 0
  – d(x,x) = 0
  – d(x,y) = 0 => x = y
  – d(x,z) ≤ d(x,y) + d(y,z)
  – d(x,y) = d(y,x) (not always in reality)
• E.g.:
  – d(x,y) = (y - x) mod N                         (Chord)
  – d(x,y) = x xor y                               (Kademlia)
  – d(x,y) = sqrt((x1-y1)^2 + … + (xd-yd)^2)       (CAN)
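The three example distance functions written out (the ring size N and the CAN coordinate tuples are illustrative):

```python
import math

N = 2**32

def d_chord(x, y):
    return (y - x) % N           # one-directional ring distance

def d_kademlia(x, y):
    return x ^ y                 # XOR metric

def d_can(x, y):
    # Euclidean distance between equal-length coordinate tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(d_chord(3, 1), d_kademlia(3, 1), d_can((0, 0), (3, 4)))
```

Note that the Chord distance is not symmetric, which is what the last bullet of the metric-space list alludes to.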
38
Graph Embedding
• Embed a virtual graph for routing
  – Powers of 2 (Chord)
  – Plaxton mesh (Pastry/Tapestry)
  – Hypercube
  – Butterfly (Viceroy)
• A node is responsible for many virtual identifiers (keys)
  – E.g. a Chord node is responsible for all virtual ids between its predecessor's id and its own id
39
XOR routing
40
numerous optimizations
41
Predicting routing entry liveness
• U: known uptime; A: time since last contacted
  [Timeline: joined … last contacted … now]
• With Pareto session times:
  Pr(alive) = Pr(lifetime > U + A | lifetime > U) = U / (U + A)
• Delete an entry if U / (U + A) < threshold
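A sketch of the resulting eviction rule, assuming the U/(U+A) estimate reconstructed above; the function names and the 0.9 threshold are arbitrary example choices:

```python
def estimated_alive(uptime, since_contact):
    # Pr(alive) = U / (U + A) under Pareto-distributed session times (assumed)
    return uptime / (uptime + since_contact)

def should_evict(uptime, since_contact, threshold=0.9):
    return estimated_alive(uptime, since_contact) < threshold

print(estimated_alive(3600, 300))   # ~0.92: long uptime, recently heard from
print(should_evict(60, 300))        # True: short uptime, long silence
```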
42
Evaluation: performance/cost tradeoff
[Plot: average lookup latency (msec) vs. bandwidth budget (bytes/node/sec)]
43
Comparing with parameterized DHTs
[Plot: average lookup latency (msec) vs. average bandwidth consumed (bytes/node/sec)]
44
Convex hull outlines best tradeoffs
[Plot: average lookup latency (msec) vs. average bandwidth consumed (bytes/node/sec)]
45
Lowest latency for varying churn
[Plot: average lookup latency (msec) vs. median node session time (hours); fixed budget, variable churn]
• Accordion has the lowest latency at low churn
• Accordion's latency increases slightly at high churn
46
Accordion stays within budget
[Plot: average bandwidth (bytes/node/sec) vs. median node session time (hours); fixed budget, variable churn]
• Other protocols' bandwidth increases with churn
47
DHTs
• Characteristic property
  – Self-manage responsibilities in the presence of:
    • Node joins
    • Node leaves
    • Node failures
    • Load imbalance
    • Replicas
• Basic structure of DHTs
  – Metric space
  – Embed a graph with an efficient search algorithm
  – Let each node simulate many virtual nodes