1
Structured Overlays: self-organization and scalability
Acknowledgement: based on slides by Anwitaman Datta (Nanyang) and Ali Ghodsi
2
Self-organization
• Self-organizing systems are common in nature
  – Physics, biology, ecology, economics, sociology, cybernetics
  – Microscopic (local) interactions
  – Limited information, individual decisions
• Distribution of control => decentralization
  – Symmetry in roles / peer-to-peer
  – Emergence of macroscopic (global) properties
• Resilience
  – Fault tolerance as well as recovery
  – Adaptivity
3
A Distributed Systems Perspective (P2P)
• Centralized solutions undesirable or unattainable
• Exploit resources at the edge
  – no dedicated infrastructure/servers
  – peers act as both clients and servers (servent)
• Autonomous participants
  – large scale
  – dynamic system and workload
  – source of unpredictability, e.g., correlated failures
• No global control or knowledge
  – rely on self-organization
4
One solution: structured overlays / distributed hash tables
5
What’s a Distributed Hash Table?
• An ordinary hash table, which is distributed
• Every node provides a lookup operation
  – Given a key: return the associated value
• Nodes keep routing pointers
  – If an item is not found locally, route to another node

Key        Value
Anwitaman  Singapore
Ali        Berkeley
Alberto    Trento
Kurt       Kassel
Ozalp      Bologna
Randy      Berkeley
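A minimal sketch of this interface (Python is used throughout for illustration; the class and attribute names are hypothetical): each node answers a lookup from its local store and otherwise forwards the request along a routing pointer.

```python
# Sketch only: a node answers lookups locally and routes onwards otherwise.
class DHTNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}         # locally stored <key, value> items
        self.next_node = None   # routing pointer to another node

    def get(self, key, hops_left=16):
        if key in self.store:
            return self.store[key]                         # found locally
        if self.next_node is not None and hops_left > 0:
            return self.next_node.get(key, hops_left - 1)  # route to another node
        return None                                        # give up

a, b = DHTNode("A"), DHTNode("B")
a.next_node = b
b.store.update({"Anwitaman": "Singapore", "Ali": "Berkeley"})
print(a.get("Ali"))   # -> "Berkeley", resolved by routing from A to B
```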
6
Why’s that interesting?
• Characteristic properties
  – Self-management in the presence of joins/leaves/failures
    • Routing information
    • Data items
  – Scalability
    • The number of nodes can be huge (to store a huge number of items)
    • However: search and maintenance costs scale sub-linearly (often logarithmically) with the number of nodes
7
short interlude
applications
8
Global File System
• Similar to a DFS (e.g. NFS, AFS)
  – But files/metadata stored in the directory
  – E.g. Wuala, WheelFS, …
• What is new?
  – Application logic is self-managed
    • Add/remove servers on the fly
    • Automatic failure handling
    • Automatic load-balancing
  – No manual configuration for these operations
Key        Value
/home/...  130.237.32.51
/usr/…     193.10.64.99
/boot/…    18.7.22.83
/etc/…     128.178.50.12
…          …
[Figure: directory entries distributed over nodes A–D on the ring]
9
P2P Web Servers
• Distributed community Web server
  – Pages stored in the directory
• What is new?
  – Application logic is self-managed
    • Automatically load-balances
    • Add/remove servers on the fly
    • Automatically handles failures
• Example:
  – CoralCDN
Key      Value
www.s... 130.237.32.51
www2     193.10.64.99
www3     18.7.22.83
cs.edu   128.178.50.12
…        …
[Figure: entries distributed over nodes A–D on the ring]
10
Name-based Communication Pattern
• Map node names to locations
  – Can store all kinds of contact information
    • Mediator peers for NAT hole punching
    • Profile information
• Used this way by:
  – Internet Indirection Infrastructure (i3)
  – Host Identity Protocol (HIP)
  – P2P Session Initiation Protocol (P2PSIP)
Key      Value
anwita   130.237.32.51
ali      193.10.64.99
alberto  18.7.22.83
ozalp    128.178.50.12
…        …
[Figure: entries distributed over nodes A–D on the ring]
11
towards DHT construction
consistent hashing
12
Hash tables
• Ordinary hash tables
  – put(key, value)
    • Store <key,value> in bucket hash(key) mod 7
  – get(key)
    • Fetch <key,v> s.t. <key,v> is in bucket hash(key) mod 7

Buckets: 0 1 2 3 4 5 6
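A short sketch of this non-distributed case, using Python's built-in hash as a stand-in for the slide's hash function:

```python
# Plain hash table with 7 buckets: bucket index = hash(key) mod 7.
NUM_BUCKETS = 7
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key, value):
    buckets[hash(key) % NUM_BUCKETS].append((key, value))

def get(key):
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

put("Ali", "Berkeley")
print(get("Ali"))   # -> Berkeley
```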
13
DHT by mimicking Hash Tables
• Let each bucket be a server
  – n servers means n buckets
• Problem
  – How do we remove or add buckets?
  – A single bucket change requires re-shuffling a large fraction of the items
14
Consistent Hashing Idea
• Logical name space, called the identifier space, consisting of identifiers {0,1,2,…, N-1}
• Identifier space is a logical ring modulo N
• Every node picks a random identifier
• Example:
  – Space N=16: {0, …, 15}
  – Five nodes a, b, c, d, e
    • a picks 6
    • b picks 5
    • c picks 0
    • d picks 5
    • e picks 2
[Figure: identifier ring {0, …, 15} with the nodes at their chosen positions]
15
Definition of Successor
• The successor of an identifier is the first node met going in clockwise direction starting at the identifier
• Example
  – succ(12)=14
  – succ(15)=2
  – succ(6)=6
16
Where to store items?
• Use a globally known hash function H
• Each item <key,value> gets the identifier H(key)
• Store the item at the successor of H(key)
  – Term: that node is responsible for the item
• Example
  – H("Anwitaman")=12
  – H("Ali")=2
  – H("Alberto")=9
  – H("Ozalp")=14
Key        Value
Anwitaman  Singapore
Ali        Berkeley
Alberto    Trento
Kurt       Kassel
Ozalp      Bologna
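A small sketch of successor-based placement, assuming for illustration the node identifiers 0, 2, 5, 6 and 11 (the ids used in the later routing examples) and the H(key) values from this slide:

```python
from bisect import bisect_left

N = 16
node_ids = sorted([0, 2, 5, 6, 11])       # illustrative node identifiers

def successor(ident):
    """First node met going clockwise from ident (wraps around the ring)."""
    i = bisect_left(node_ids, ident % N)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

# H(key) values taken from the slide's example
H = {"Anwitaman": 12, "Ali": 2, "Alberto": 9, "Ozalp": 14}
for key, ident in H.items():
    print(key, "-> stored at node", successor(ident))
```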
17
Consistent Hashing: Summary
• + Scalable
  – Each node stores on average D/n items (for D total items, n nodes)
  – On average D/n items are reshuffled for every join/leave/failure
• - However: global knowledge, everybody knows everybody
  – Akamai works this way
  – Amazon Dynamo too
• + Load balancing
  – w.h.p. O(log n) imbalance
  – Can eliminate the imbalance by having each server "simulate" log(n) random buckets
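A sketch of the simulated-bucket (virtual node) idea: each server appears on the ring several times, which evens out the load. The server names, SHA-1 as the hash function and 8 virtual nodes per server are illustrative choices, not part of the original slides.

```python
import hashlib
from bisect import bisect_left
from collections import Counter

N = 2**32
VNODES = 8                       # "simulated" buckets per server (slides: ~log n)

def h(s):
    # SHA-1 stands in for the globally known hash function
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % N

servers = ["s1", "s2", "s3", "s4"]                        # hypothetical servers
ring = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(VNODES))

def responsible(key):
    idx = bisect_left(ring, (h(key), ""))
    return ring[idx % len(ring)][1]                        # successor's server

load = Counter(responsible(str(k)) for k in range(10000))
print(load)   # more virtual nodes per server -> more even counts
```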
18
towards DHT construction
giving up on global knowledge
19
Where to point (Chord)?
• Each node points to its successor
  – The successor of a node p is succ(p+1)
  – Known as the node's succ pointer
• Each node points to its predecessor
  – The first node met in anti-clockwise direction starting at p-1
  – Known as the node's pred pointer
• Example
  – 0's successor is succ(1)=2
  – 2's successor is succ(3)=5
  – 5's successor is succ(6)=6
  – 6's successor is succ(7)=11
  – 11's successor is succ(12)=0
20
DHT Lookup
• To look up a key k
  – Calculate H(k)
  – Follow succ pointers until item k is found
• Example
  – Look up "Alberto" at node 2
    • H("Alberto")=9
    • Traverse nodes: 2, 5, 6, 11 (BINGO)
    • Return "Trento" to the initiator
Key Value
Anwitaman Singapore
Ali Berkeley
Alberto Trento
Kurt Kassel
Ozalp Bologna
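A sketch of this lookup walk on the example ring (nodes 0, 2, 5, 6, 11, with H("Alberto")=9 as above); it reproduces the hop sequence 2, 5, 6, 11:

```python
N = 16

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = None
        self.store = {}

def between(x, a, b):
    """x in the ring interval (a, b], identifiers mod N."""
    if a == b:
        return True
    return 0 < (x - a) % N <= (b - a) % N

def lookup(start, key_id):
    node = start
    while not between(key_id, node.id, node.succ.id):
        node = node.succ                  # follow succ pointers
    return node.succ                      # node responsible for key_id

# Build the ring 0 -> 2 -> 5 -> 6 -> 11 -> 0 used in the example.
ids = [0, 2, 5, 6, 11]
nodes = {i: Node(i) for i in ids}
for a, b in zip(ids, ids[1:] + ids[:1]):
    nodes[a].succ = nodes[b]

nodes[11].store[9] = ("Alberto", "Trento")   # node 11 is responsible for id 9
print(lookup(nodes[2], 9).id)                # hops 2, 5, 6 -> prints 11
```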
21
towards DHT construction
handling joins/leaves/failures
22
Dealing with failures
• Each node keeps a successor-list
  – Pointers to the f closest successors
    • succ(p+1)
    • succ(succ(p+1)+1)
    • succ(succ(succ(p+1)+1)+1)
    • ...
• Rule: if the successor fails
  – Replace it with the closest alive successor
• Rule: if the predecessor fails
  – Set pred to nil
• Set f = log(n)
  – With failure probability 0.5, w.h.p. not all nodes in the list fail: the probability that all of them fail is (1/2)^log(n) = 1/n
23
Handling Dynamism
• Periodic stabilization used to make pointers eventually correct
– Try pointing succ to closest alive successor
– Try pointing pred to closest alive predecessor
Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q
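A direct Python rendering of the stabilization pseudocode above (a sketch only: RPC, concurrency and failure handling are left out).

```python
N = 16

def between(x, a, b):
    """x in the ring interval (a, b], identifiers mod N."""
    if a == b:
        return True
    return 0 < (x - a) % N <= (b - a) % N

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = self
        self.pred = None

    def stabilize(self):               # "Periodically at node p"
        v = self.succ.pred             # 1. set v := succ.pred
        if v is not None and between(v.id, self.id, self.succ.id):
            self.succ = v              # 2-3. adopt a closer successor
        self.succ.notify(self)         # 4. send notify(p) to succ

    def notify(self, q):               # "When receiving notify(q) at node p"
        if self.pred is None or between(q.id, self.pred.id, self.id):
            self.pred = q              # adopt a closer predecessor
```

A joining node only sets its succ pointer (via a lookup of its own identifier); repeated rounds of stabilize/notify then weave it into the ring, which is what the next slide relies on.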
24
Handling joins
• When new node n joins– Find n’s successor with lookup(n)– Set succ to n’s successor– Stabilization fixes the rest
Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q

[Figure: ring segment with nodes 11, 13, 15]
25
Handling leaves
• When n leaves
  – Just disappear (like a failure)
• When pred is detected as failed
  – Set pred to nil
• When succ is detected as failed
  – Set succ to the closest alive node in the successor list

[Figure: ring segment with nodes 11, 13, 15]

Periodically at node p:
1. set v := succ.pred
2. if v ≠ nil and v is in (p, succ]
3.   set succ := v
4. send a notify(p) to succ

When receiving notify(q) at node p:
1. if pred = nil or q is in (pred, p]
2.   set pred := q
26
Speeding up lookups with fingers
• If only the pointer to succ(p+1) is used
  – Worst-case lookup time is n, for n nodes
• Improving lookup time (binary-search style)
  – Point to succ(p+1)
  – Point to succ(p+2)
  – Point to succ(p+4)
  – Point to succ(p+8)
  – …
  – Point to succ(p+2^((log N)-1))
• The distance to the destination is at least halved in each hop, giving log hops
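A sketch of finger construction, using a global view of the example node set instead of real lookups (so it only illustrates which identifiers the fingers point to):

```python
from bisect import bisect_left

N = 16
node_ids = sorted([0, 2, 5, 6, 11])   # illustrative node set

def successor(ident):
    i = bisect_left(node_ids, ident % N)
    return node_ids[i] if i < len(node_ids) else node_ids[0]

def fingers(p):
    # finger i points to succ(p + 2**i), for i = 0 .. log2(N) - 1
    return [successor((p + 2**i) % N) for i in range(N.bit_length() - 1)]

print(fingers(0))   # fingers of node 0: succ(1), succ(2), succ(4), succ(8)
```

Lookups then greedily forward to the finger closest to, but not past, the target identifier, so the remaining distance is at least halved per hop.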
27
Handling Dynamism of Fingers and SList
• Node p periodically:
  – Update fingers
    • Look up p+2^1, p+2^2, p+2^3, …, p+2^((log N)-1)
  – Update the successor-list
    • slist := trunc(succ · succ.slist), i.e. prepend succ to succ's successor-list and truncate to f entries
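A tiny sketch of the slist rule above, with the successor-list kept as a plain Python list and f=4 chosen arbitrarily:

```python
f = 4   # length of the successor-list (illustrative)

def refresh_slist(succ_id, succ_slist):
    # prepend succ to succ's successor-list, then truncate to f entries
    return ([succ_id] + succ_slist)[:f]

print(refresh_slist(5, [6, 11, 0, 2]))   # -> [5, 6, 11, 0]
```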
28
Chord: Summary
• Lookup hops are logarithmic in n
  – Fast routing/lookup, like in a dictionary
• Routing table size is logarithmic in n
  – Few nodes to ping
29
Reliable Routing
• Iterative lookup
  – Generally slower
  – Reliability easy to achieve
    • Initiator is in full control
• Recursive lookup
  – Generally fast (uses established links)
  – Several ways to achieve reliability
    • End-to-end timeouts
    • Any-node timeouts
      – Difficult to determine the timeout value
30
Replication of items
• Successor-list replication (most systems)
  – Idea: replicate nodes
    • If node p is responsible for a set of items K
    • Replicate K on p's immediate successors
• Symmetric replication
  – Idea: replicate identifiers
    • Items with keys 0, 16, 32, 48 are equivalent
    • Whoever is responsible for 0 also stores 16, 32, 48
    • Whoever is responsible for 16 also stores 0, 32, 48
    • …
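A sketch of the key-equivalence classes used by symmetric replication; an identifier space of N=64 and replication degree f=4 are assumed so the classes match the 0/16/32/48 example above:

```python
N, f = 64, 4   # identifier space size and replication degree (assumed)

def equivalence_class(k):
    # keys k, k + N/f, k + 2N/f, ... are equivalent and stored together
    return [(k + i * N // f) % N for i in range(f)]

print(equivalence_class(0))    # -> [0, 16, 32, 48]
print(equivalence_class(16))   # -> [16, 32, 48, 0]
```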
31
towards proximity awareness
plaxton mesh (PRR), pastry/tapestry
32
Plaxton Mesh [PRR]
• Identifiers represented with radix/base k
  – Often k=16 (hexadecimal radix)
  – Ring size N is a large power of k, e.g. 16^40
33
Plaxton Mesh (2)
• Additional routing table on top of the ring
• Routing table construction by example
  – Node 3a7f keeps the following routing table
• Kleene star * for wildcards
  – Flexibility to choose proximate neighbors
• Invariant: any node whose identifier matches the prefix of a row-i entry can fill that entry, so the candidates for each slot are interchangeable

Row 1: 0*    1*    2*    self  4*    5*    6*    7*    8*    9*    a*    b*    c*    d*    e*    f*
Row 2: 30*   31*   32*   33*   34*   35*   36*   37*   38*   39*   self  3b*   3c*   3d*   3e*   3f*
Row 3: 3a0*  3a1*  3a2*  3a3*  3a4*  3a5*  3a6*  self  3a8*  3a9*  3aa*  3ab*  3ac*  3ad*  3ae*  3af*
Row 4: 3a70* 3a71* 3a72* 3a73* 3a74* 3a75* 3a76* 3a77* 3a78* 3a79* 3a7a* 3a7b* 3a7c* 3a7d* 3a7e* self
34
Plaxton Routing
• To route from 1234 to abcd:
  1. 1234 uses rt row 1: jump to a*, e.g. a999
  2. a999 uses rt row 2: jump to ab*, e.g. ab11
  3. ab11 uses rt row 3: jump to abc*, e.g. abc0
  4. abc0 uses rt row 4: jump to abcd
• Routing terminates in log(N) hops
  – In practice log(n), where N is the identifier-space size and n is the number of nodes
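A sketch of prefix routing that reproduces the hop sequence above; the node list acts as a toy global view rather than per-node routing tables:

```python
# Each hop forwards to a node sharing at least one more leading digit with
# the destination (node ids are hex strings).
nodes = ["1234", "a999", "ab11", "abc0", "abcd"]

def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(src, dst):
    hops, cur = [src], src
    while cur != dst:
        p = shared_prefix_len(cur, dst)
        # pick any node that matches one more digit of the destination
        cur = next(n for n in nodes if shared_prefix_len(n, dst) > p)
        hops.append(cur)
    return hops

print(route("1234", "abcd"))   # -> ['1234', 'a999', 'ab11', 'abc0', 'abcd']
```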
35
Pastry – extension to Plaxton mesh
• Leaf set
  – Successor-list in both directions
  – Periodically gossiped to all leaf-set members, O(n^2) [Bamboo]
• Plaxton mesh on top of the ring
  – Failures in the routing table
    • Get a replacement from any node on the same row
• Routing
  1) Route directly to the responsible node if it is in the leaf set, otherwise
  2) Route to a closer (longer prefix) node, otherwise
  3) Route on the ring
36
architecture of structured overlays
a formal view of DHTs
37
General Architecture for DHTs
• Metric space S with distance function d
  – d(x,y) ≥ 0
  – d(x,x) = 0
  – d(x,y) = 0 => x = y
  – d(x,z) ≤ d(x,y) + d(y,z)
  – d(x,y) = d(y,x) (not always in reality)
• E.g.:
  – d(x,y) = (y - x) mod N                         (Chord)
  – d(x,y) = x xor y                               (Kademlia)
  – d(x,y) = sqrt((x1-y1)^2 + … + (xd-yd)^2)       (CAN)
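The three example distance functions written out (the ring size N and the CAN coordinate tuples are illustrative):

```python
import math

N = 2**32

def d_chord(x, y):
    return (y - x) % N           # one-directional ring distance

def d_kademlia(x, y):
    return x ^ y                 # XOR metric

def d_can(x, y):
    # Euclidean distance between equal-length coordinate tuples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(d_chord(3, 1), d_kademlia(3, 1), d_can((0, 0), (3, 4)))
```

Note that the Chord distance is not symmetric, which is what the last bullet of the metric-space list alludes to.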
38
Graph Embedding
• Embed a virtual graph for routing
  – Powers of 2 (Chord)
  – Plaxton mesh (Pastry/Tapestry)
  – Hypercube
  – Butterfly (Viceroy)
• A node is responsible for many virtual identifiers (keys)
  – E.g. a Chord node is responsible for all virtual ids between its predecessor's id and its own id
39
XOR routing
40
numerous optimizations
41
Predicting routing entry liveness
• U: known uptime; A: time since last contacted
  [Timeline: joined … last contacted … now]
• With Pareto session times:
  Pr(alive) = Pr(lifetime > U + A | lifetime > U) = U / (U + A)
• Delete an entry if U / (U + A) < threshold
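A sketch of the resulting eviction rule, assuming the U/(U+A) estimate reconstructed above; the function names and the 0.9 threshold are arbitrary example choices:

```python
def estimated_alive(uptime, since_contact):
    # Pr(alive) = U / (U + A) under Pareto-distributed session times (assumed)
    return uptime / (uptime + since_contact)

def should_evict(uptime, since_contact, threshold=0.9):
    return estimated_alive(uptime, since_contact) < threshold

print(estimated_alive(3600, 300))   # ~0.92: long uptime, recently heard from
print(should_evict(60, 300))        # True: short uptime, long silence
```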
42
Evaluation: performance/cost tradeoff
[Plot: average lookup latency (msec) vs. bandwidth budget (bytes/node/sec)]
43
Comparing with parameterized DHTs
[Plot: average lookup latency (msec) vs. average bandwidth consumed (bytes/node/sec)]
44
Convex hull outlines best tradeoffs
[Plot: average lookup latency (msec) vs. average bandwidth consumed (bytes/node/sec)]
45
Lowest latency for varying churn
[Plot: average lookup latency (msec) vs. median node session time (hours); fixed budget, variable churn]
• Accordion has the lowest latency at low churn
• Accordion's latency increases slightly at high churn
46
Accordion stays within budget
[Plot: average bandwidth (bytes/node/sec) vs. median node session time (hours); fixed budget, variable churn]
• Other protocols' bandwidth increases with churn
47
DHTs
• Characteristic property
  – Self-manage responsibilities in the presence of:
    • Node joins
    • Node leaves
    • Node failures
    • Load imbalance
    • Replicas
• Basic structure of DHTs
  – Metric space
  – Embed a graph with an efficient search algorithm
  – Let each node simulate many virtual nodes