Distributed Hash Tables: Chord
Brad Karp
(with many slides contributed by Robert Morris)
UCL Computer Science
CS M038 / GZ06, 27th January, 2009
Today: DHTs, P2P
• Distributed Hash Tables: a building block
• Applications built atop them
• Your task: “Why DHTs?”
  – vs. centralized servers?
  – vs. non-DHT P2P systems?
What Is a P2P System?
• A distributed system architecture:
  – No centralized control
  – Nodes are symmetric in function
• Large number of unreliable nodes
• Enabled by technology improvements
[Figure: many symmetric nodes connected to one another across the Internet]
The Promise of P2P Computing
• High capacity through parallelism:
  – Many disks
  – Many network connections
  – Many CPUs
• Reliability:
  – Many replicas
  – Geographic distribution
• Automatic configuration
• Useful in public and proprietary settings
What Is a DHT?
• Single-node hash table:
    key = Hash(name)
    put(key, value)
    get(key) -> value
  – Service: O(1) storage
• How do I do this across millions of hosts on the Internet?
  – Distributed Hash Table
What Is a DHT? (and why?)
Distributed Hash Table:
    key = Hash(data)
    lookup(key) -> IP address                 (Chord)
    send-RPC(IP address, PUT, key, value)
    send-RPC(IP address, GET, key) -> value
Possibly a first step towards truly large-scale distributed systems:
  – a tuple in a global database engine
  – a data block in a global file system
  – rare.mp3 in a P2P file-sharing system
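To make the interface concrete, here is a minimal Python sketch of the client side. It is illustrative only: make_key follows the slide's key = Hash(data), while the lookup and rpc callables are assumed stand-ins for Chord and a real RPC transport.

    import hashlib

    def make_key(data: bytes) -> int:
        # key = Hash(data), as on the slide (SHA-1 here).
        return int.from_bytes(hashlib.sha1(data).digest(), "big")

    class DHTClient:
        def __init__(self, lookup, rpc):
            self.lookup = lookup  # lookup(key) -> IP address (Chord's job)
            self.rpc = rpc        # rpc(ip, op, *args) -> value (assumed transport)

        def put(self, key: int, value: bytes) -> None:
            ip = self.lookup(key)            # find the node responsible for key
            self.rpc(ip, "PUT", key, value)  # send-RPC(IP address, PUT, key, value)

        def get(self, key: int) -> bytes:
            ip = self.lookup(key)
            return self.rpc(ip, "GET", key)  # send-RPC(IP address, GET, key) -> value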
DHT Factoring
[Figure: three layers spread over many nodes]
    Distributed application
        put(key, data)            get(key) -> data
    Distributed hash table        (DHash)
        lookup(key) -> node IP address
    Lookup service                (Chord)
    node    node    node    ….
• Application may be distributed over many nodes
• DHT distributes data storage over many nodes
Why the put()/get() interface?
• API supports a wide range of applications
  – DHT imposes no structure/meaning on keys
• Key/value pairs are persistent and global
  – Can store keys in other DHT values
  – And thus build complex data structures (sketched below)
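As an illustration of that last point, a hypothetical sketch (not from the slides) that stores a file's blocks under their content hashes, then stores the list of block keys as a DHT value of its own, reusing the DHTClient sketch above:

    import json

    def store_file(dht, blocks):
        # Store each block under key = Hash(block), then store a root
        # value that is nothing but a list of those keys.
        keys = []
        for block in blocks:
            k = make_key(block)
            dht.put(k, block)
            keys.append(k)
        root = json.dumps(keys).encode()
        root_key = make_key(root)
        dht.put(root_key, root)   # a DHT value containing other DHT keys
        return root_key           # one key now names the whole structure

    def fetch_file(dht, root_key):
        keys = json.loads(dht.get(root_key))
        return b"".join(dht.get(k) for k in keys)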
Why Might DHT Design Be Hard?
• Decentralized: no central authority
• Scalable: low network traffic overhead
• Efficient: find items quickly (latency)
• Dynamic: nodes fail, new nodes join
• General-purpose: flexible naming
The Lookup Problem
[Figure: publisher calls Put(key=“title”, value=file data…) at one of nodes N1–N6 scattered across the Internet; a client calls Get(key=“title”). Which node holds the pair?]
• At the heart of all DHTs
Motivation: Centralized Lookup (Napster)
[Figure: publisher N4 calls SetLoc(“title”, N4) on a central DB, holding key=“title”, value=file data…; the client sends Lookup(“title”) to the DB, which directs it to N4]
Simple, but O(N) state and a single point of failure
Motivation: Flooded Queries (Gnutella)
[Figure: the client’s Lookup(“title”) query is flooded from node to node (N1–N9) until it reaches publisher N4, which holds key=“title”, value=file data…]
Robust, but worst case O(N) messages per lookup
Motivation: FreeDB, Routed DHT Queries (Chord, &c.)
[Figure: the client’s Lookup(H(audio data)) is routed hop by hop through the nodes to publisher N4, which holds key=H(audio data), value={artist, album title, track title}]
DHT Applications
They’re not just for stealing music anymore…
  – global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT]
  – naming services [Chord-DNS, Twine, SFR]
  – DB query processing [PIER, Wisc]
  – Internet-scale data structures [PHT, Cone, SkipGraphs]
  – communication services [i3, MCAN, Bayeux]
  – event notification [Scribe, Herald]
  – file sharing [OverNet]
Chord Lookup Algorithm Properties
• Interface: lookup(key) -> IP address
• Efficient: O(log N) messages per lookup
  – N is the total number of servers
• Scalable: O(log N) state per node
• Robust: survives massive failures
• Simple to analyze
Chord IDs
• Key identifier = SHA-1(key)
• Node identifier = SHA-1(IP address)
• SHA-1 distributes both uniformly
• How to map key IDs to node IDs?
Consistent Hashing [Karger 97]
A key is stored at its successor: node with next higher ID
[Figure: circular 7-bit ID space with nodes N32, N90, N105 and keys K5, K20, K80; K5 and K20 are stored at N32, and K80 at N90]
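A minimal sketch of this placement rule, assuming a global view of all node IDs (which a real Chord node does not have; that is what the lookup algorithm on the following slides provides):

    import hashlib
    from bisect import bisect_left

    M = 7  # 7-bit ID space, as in the figure

    def chord_id(data: bytes, m: int = M) -> int:
        # Truncate SHA-1 to an m-bit identifier.
        return int.from_bytes(hashlib.sha1(data).digest(), "big") % (2 ** m)

    def successor(node_ids, key_id):
        # A key is stored at the node with the next-higher ID, wrapping
        # around the circle past the highest node back to the lowest.
        ids = sorted(node_ids)
        i = bisect_left(ids, key_id)   # first node ID >= key ID
        return ids[i % len(ids)]       # i == len(ids) wraps to ids[0]

    nodes = [32, 90, 105]
    assert successor(nodes, 5) == 32     # K5  -> N32
    assert successor(nodes, 20) == 32    # K20 -> N32
    assert successor(nodes, 80) == 90    # K80 -> N90
    assert successor(nodes, 120) == 32   # wraps around past N105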
Simple lookup algorithm
Lookup(my-id, key-id)
    n = my successor
    if my-id < n < key-id
        call Lookup(key-id) on node n   // next hop
    else
        return my successor             // done
• Correctness depends only on successors
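The same algorithm as runnable Python, sketched with in-process node objects instead of RPCs; the between helper spells out the circular interval test that the flat my-id < n < key-id comparison glosses over:

    def between(x, a, b):
        # Is x in the circular interval (a, b]?
        if a < b:
            return a < x <= b
        return x > a or x <= b         # interval wraps past zero

    class SimpleNode:
        def __init__(self, node_id):
            self.id = node_id
            self.successor = None      # filled in when the ring is built

        def lookup(self, key_id):
            if between(key_id, self.id, self.successor.id):
                return self.successor  # done: my successor owns the key
            return self.successor.lookup(key_id)   # next hop: O(N) worst case

    # Build a tiny ring: N10 -> N32 -> N90 -> N105 -> back to N10.
    ring = [SimpleNode(i) for i in (10, 32, 90, 105)]
    for a, b in zip(ring, ring[1:] + ring[:1]):
        a.successor = b
    assert ring[0].lookup(80).id == 90   # K80 is stored at N90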
Lookup with Fingers
Lookup(my-id, key-id)
    look in local finger table for
        highest node n s.t. my-id < n < key-id
    if n exists
        call Lookup(key-id) on node n   // next hop
    else
        return my successor             // done
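A finger-based sketch building on the between helper and SimpleNode class above; each hop forwards to the furthest known node that still precedes the key:

    class FingerNode(SimpleNode):
        def __init__(self, node_id, m=7):
            super().__init__(node_id)
            self.m = m
            self.fingers = []   # ideally successor(id + 2^k) for each k

        def lookup(self, key_id):
            # Fingers strictly between my ID and the key (my-id < n < key-id).
            candidates = [f for f in self.fingers
                          if between(f.id, self.id, key_id) and f.id != key_id]
            if candidates:
                # Highest = furthest clockwise around the ring from me.
                n = max(candidates,
                        key=lambda f: (f.id - self.id) % (2 ** self.m))
                return n.lookup(key_id)   # next hop
            return self.successor         # done

With fingers kept at power-of-two distances, each such hop at least halves the remaining ID-space distance to the key, which is where the O(log N) hop bound comes from.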
Join (4)
[Figure: N36 has joined between N25 and N40, which holds K30 and K38; step 4 sets N25’s successor pointer to N36]
Predecessor pointer allows link to new host
Update finger pointers in the background
Correct successors produce correct lookups
Failures Might Cause Incorrect Lookup
[Figure: ring with N10, N80, N85, N102, N113, N120; N10 issues Lookup(90)]
N80 doesn’t know correct successor, so incorrect lookup
Solution: Successor Lists
• Each node knows r immediate successors
• After failure, will know first live successor
• Correct successors guarantee correct lookups
• Guarantee is with some probability
Choosing Successor List Length
• Assume 1/2 of nodes fail
• P(successor list all dead) = (1/2)^r
  – i.e., P(this node breaks the Chord ring)
  – Depends on independent failure
• P(no broken nodes) = (1 - (1/2)^r)^N
  – r = 2 log2(N) makes prob. ≈ 1 - 1/N
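A quick numeric check of these formulas (a sketch; the formulas are the slide's, the concrete N is illustrative):

    import math

    def p_ring_intact(n: int, r: int, p_fail: float = 0.5) -> float:
        # P(no node's entire successor list is dead) = (1 - p_fail^r)^N
        return (1.0 - p_fail ** r) ** n

    N = 1000
    r = round(2 * math.log2(N))     # r = 2 log2(N), about 20 for N = 1000
    print(r, p_ring_intact(N, r))   # 20, ~0.999  (i.e., roughly 1 - 1/N)

With N = 1,000 this gives r ≈ 20, matching the 20-entry successor list used in the failure experiment later in the deck.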
Lookup with Fault Tolerance
Lookup(my-id, key-id)
    look in local finger table and successor list
        for highest node n s.t. my-id < n < key-id
    if n exists
        call Lookup(key-id) on node n   // next hop
        if call failed,
            remove n from finger table
            return Lookup(my-id, key-id)
    else
        return my successor             // done
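The same logic as runnable Python, a sketch extending the FingerNode class from the earlier sketch; LookupFailed is an assumed stand-in for an RPC timeout, which nothing in this in-process toy actually raises:

    class LookupFailed(Exception):
        pass   # a real implementation would raise this on RPC timeout

    class RobustNode(FingerNode):
        def __init__(self, node_id, m=7):
            super().__init__(node_id, m)
            self.successors = []   # r immediate successors, not just one

        def lookup(self, key_id):
            # Consider fingers and the successor list together.
            candidates = [f for f in self.fingers + self.successors
                          if between(f.id, self.id, key_id) and f.id != key_id]
            while candidates:
                n = max(candidates,
                        key=lambda f: (f.id - self.id) % (2 ** self.m))
                try:
                    return n.lookup(key_id)      # next hop
                except LookupFailed:
                    candidates.remove(n)         # retry without the dead node
                    if n in self.fingers:
                        self.fingers.remove(n)   # as in the pseudocode
            return self.successor                # done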
Experimental Overview
• Quick lookup in large systems
• Low variation in lookup costs
• Robust despite massive failure
Experiments confirm theoretical results
Failure Experimental Setup
• Start 1,000 CFS/Chord servers
  – Successor list has 20 entries
• Wait until they stabilize
• Insert 1,000 key/value pairs
  – Five replicas of each
• Stop X% of the servers
• Immediately perform 1,000 lookups
DHash Replicates Blocks at r Successors
[Figure: ring of nodes N5, N10, N20, N40, N50, N60, N68, N80, N99, N110; Block 17 is stored at N20 and replicated at the successors that follow]
• Replicas are easy to find if successor fails
• Hashed node IDs ensure independent failure
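A sketch of this placement rule with a global view of the ring (a real node would find these successors via Chord); replica_nodes and the replication factor r=3 are illustrative:

    from bisect import bisect_left

    def replica_nodes(node_ids, key_id, r):
        # The key's successor plus the r-1 nodes that follow it on the ring.
        ids = sorted(node_ids)
        i = bisect_left(ids, key_id)
        return [ids[(i + k) % len(ids)] for k in range(r)]

    ring_ids = [5, 10, 20, 40, 50, 60, 68, 80, 99, 110]
    assert replica_nodes(ring_ids, 17, r=3) == [20, 40, 50]   # Block 17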
Massive Failures Have Little Impact
[Figure: failed lookups (percent) vs. failed nodes (percent), for 5–50% failed nodes; the failed-lookup fraction stays below about 1.5% throughout. Annotation: (1/2)^6 is 1.6%]
DHash Properties
• Builds key/value storage on Chord
• Replicates blocks for availability
  – What happens when DHT partitions, then heals? Which (k, v) pairs do I need?
• Caches blocks for load balance
• Authenticates block contents