Peer-to-peer file systems

Presented by: Serge Kreiker

Page 1: Peer-to-peer file systems Presented by: Serge Kreiker

peer-to-peer file systems

Presented by: Serge Kreiker

Page 2: Peer-to-peer file systems Presented by: Serge Kreiker

“P2P” in the Internet

Napster: a peer-to-peer file sharing application that let Internet users exchange files directly. A simple idea … hugely successful: the fastest-growing Web application, with 50 million+ users by January 2001; shut down in February 2001

Similar systems and startups followed in rapid succession (Gnutella, Freenet, …)

Page 3: Peer-to-peer file systems Presented by: Serge Kreiker

Napster

Central Napster server

(xyz.mp3, 128.1.2.3)

128.1.2.3

Page 4: Peer-to-peer file systems Presented by: Serge Kreiker

Napster

Central Napster server

xyz.mp3 ?

128.1.2.3

128.1.2.3

Page 5: Peer-to-peer file systems Presented by: Serge Kreiker

Napster

Central Napster server

xyz.mp3 ?

128.1.2.3

Page 6: Peer-to-peer file systems Presented by: Serge Kreiker

Gnutella

Page 7: Peer-to-peer file systems Presented by: Serge Kreiker

Gnutella

xyz.mp3 ?

Page 8: Peer-to-peer file systems Presented by: Serge Kreiker

Gnutella

Page 9: Peer-to-peer file systems Presented by: Serge Kreiker

Gnutella

xyz.mp3

Page 10: Peer-to-peer file systems Presented by: Serge Kreiker

So far

Centralized: Napster
- Table size – O(N)
- Number of hops – O(1)

Flooded queries: Gnutella
- Table size – O(1)
- Number of hops – O(N)

Page 11: Peer-to-peer file systems Presented by: Serge Kreiker

Storage management systems: challenges

Distributed: nodes have identical capabilities and responsibilities
Anonymity
Storage management: spread the storage burden evenly; tolerate unreliable participants
Robustness: surviving massive failures; resilience to DoS attacks, censorship, and other node failures
Cache management: cache additional copies of popular files

Page 12: Peer-to-peer file systems Presented by: Serge Kreiker

Routing challenges

Efficiency: O(log N) messages per lookup, where N is the total number of servers
Scalability: O(log N) state per node
Robustness: surviving massive failures

Page 13: Peer-to-peer file systems Presented by: Serge Kreiker

We are going to look at

PAST (Rice and Microsoft Research, routing substrate - Pastry)

CFS (MIT, routing substrate - Chord)

Page 14: Peer-to-peer file systems Presented by: Serge Kreiker

What is PAST?

An archival storage and content distribution utility
Not a general-purpose file system
Stores multiple replicas of files
Caches additional copies of popular files in the local file system

Page 15: Peer-to-peer file systems Presented by: Serge Kreiker

How it works

Built over a self-organizing, Internet-based overlay network
Based on the Pastry routing scheme
Offers persistent storage services for replicated read-only files
Owners can insert/reclaim files; clients just look up

Page 16: Peer-to-peer file systems Presented by: Serge Kreiker

PAST nodes

The collection of PAST nodes forms an overlay network
Minimally, a PAST node is an access point
Optionally, it contributes to storage and participates in the routing

Page 17: Peer-to-peer file systems Presented by: Serge Kreiker

PAST operations

fileId = Insert(name, owner-credentials, k, file);
file = Lookup(fileId);
Reclaim(fileId, owner-credentials);
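
To make the interface concrete, here is a minimal Python-style sketch of these three operations as a client-facing API; the class name, parameter names, and types are illustrative assumptions, not the actual PAST interface:

class PastClient:
    """Illustrative only; PAST's real implementation does not expose this class."""

    def insert(self, name: str, owner_credentials: bytes, k: int, data: bytes) -> bytes:
        """Store k replicas of data under a new fileId and return that fileId."""
        raise NotImplementedError

    def lookup(self, file_id: bytes) -> bytes:
        """Retrieve the file content from a nearby replica."""
        raise NotImplementedError

    def reclaim(self, file_id: bytes, owner_credentials: bytes) -> None:
        """Reclaim the storage; later lookups may or may not still succeed."""
        raise NotImplementedError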

Page 18: Peer-to-peer file systems Presented by: Serge Kreiker

Insertion

fileId computed as the secure hash of name, owner’s public key, salt

Stores the file on the k nodes whose nodeIds are numerically closest to the 128 msb of fileId

How to map Key IDs to Node IDs? Use Pastry
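
A small sketch of the two steps just described, assuming SHA-1 as the secure hash and a globally known list of nodeIds (both simplifications for illustration; the real mapping is done by Pastry, as noted above):

import hashlib

def compute_file_id(name: str, owner_public_key: bytes, salt: bytes) -> int:
    # fileId = secure hash of (file name, owner's public key, salt); SHA-1 here.
    digest = hashlib.sha1(name.encode() + owner_public_key + salt).digest()
    return int.from_bytes(digest, "big")           # 160-bit integer

def k_closest_nodes(file_id: int, node_ids: list[int], k: int) -> list[int]:
    key = file_id >> 32                            # the 128 most significant bits
    ring = 1 << 128
    def distance(n: int) -> int:
        d = abs(n - key)
        return min(d, ring - d)                    # numeric closeness on the circular space
    return sorted(node_ids, key=distance)[:k]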

Page 19: Peer-to-peer file systems Presented by: Serge Kreiker

Insert (contd.)

The required storage is debited against the owner's storage quota
A file certificate is returned, signed with the owner's private key; it contains the fileId, a hash of the content, the replication factor, and other fields
The file and certificate are routed via Pastry
Each of the k replica-storing nodes attaches a store receipt
An ack is sent back after all k nodes have accepted the file

Page 20: Peer-to-peer file systems Presented by: Serge Kreiker

Example: insert a file with fileId = 117, k = 4 (nodes 115, 120, 122, 124 in the figure)

1. Node 200 (the source) inserts file 117
2. Node 122 (the destination) is one of the 4 nodes closest to 117; 125 was reached first because it is the node nearest to 200
3. Node 122 returns a store receipt

Page 21: Peer-to-peer file systems Presented by: Serge Kreiker

Lookup & Reclaim

Lookup: Pastry locates a “near” node that has a copy and retrieves it
Reclaim: weak consistency. After a reclaim, a lookup is no longer guaranteed to retrieve the file, but there is no guarantee that the file is no longer available

Page 22: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: peer-to-peer routing

Provides generic, scalable indexing, data location, and routing
Inspired by Plaxton's algorithm (used in web content distribution, e.g., Akamai) and landmark hierarchy routing
Goals: efficiency, scalability, fault resilience, self-organization (completely decentralized)

Page 23: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: how it works

Each node has a unique nodeId; each message has a key. Both are uniformly distributed and lie in the same namespace
A Pastry node routes a message to the node with the nodeId closest to the key
The number of routing steps is O(log N)
Pastry takes network locality into account
PAST uses the fileId as the key and stores the file on the k closest nodes

Page 24: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: nodeId space

Each node is assigned a 128-bit node identifier (nodeId)
The nodeId is assigned randomly when joining the system (e.g., using a SHA-1 hash of the node's IP address or public key)
Nodes with adjacent nodeIds are diverse in geography, ownership, network attachment, etc.
nodeIds and keys are written in base 2^b; b is a configuration parameter with typical value 4
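
A minimal illustration of this assignment, assuming SHA-1 over the node's IP address and b = 4 (these specifics are assumptions of the sketch, not mandated by Pastry):

import hashlib

def node_id(ip: str, bits: int = 128) -> int:
    # Random-looking 128-bit nodeId derived from a SHA-1 hash of the node's IP address.
    return int.from_bytes(hashlib.sha1(ip.encode()).digest(), "big") >> (160 - bits)

def node_digits(node: int, b: int = 4, bits: int = 128) -> list[int]:
    """The nodeId viewed as 128/b digits in base 2^b, most significant digit first."""
    mask = (1 << b) - 1
    return [(node >> (bits - b * (i + 1))) & mask for i in range(bits // b)]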

Page 25: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: nodeId space (contd.)

128-bit identifiers (at most 2^128 nodes)
A nodeId is a sequence of L base-2^b digits, i.e., L levels of b = 128/L bits each
The namespace is circular, running from 0 to 2^128 − 1
Page 26: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: node state (1)

Each node maintains a routing table R, a neighborhood set M, and a leaf set L
The routing table is organized into log_{2^b} N rows with 2^b − 1 entries each
Each entry in row n contains the IP address of a nearby node whose nodeId matches the present node's nodeId in the first n digits and differs in digit n + 1
The choice of b is a trade-off between the size of the routing table and the length of routes
Page 27: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: node state (2)

Neighborhood set M – nodeIds and IP addresses of the |M| nodes closest to the present node according to the network proximity metric
Leaf set L – the |L| nodes with nodeIds closest to the current node, divided in two: |L|/2 with the closest larger nodeIds and |L|/2 with the closest smaller nodeIds
Typical values for |L| and |M| are 2^b
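
The per-node state from the last two slides could be summarized roughly as follows; the Python representation and field names are illustrative only:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PastryNodeState:
    node_id: int
    b: int = 4
    # Routing table: up to log_{2^b}(N) rows of 2^b - 1 entries; the entry in
    # row n for digit d points to a nearby node whose nodeId shares the first
    # n digits with node_id and has digit d at position n + 1.
    routing_table: List[List[Optional[int]]] = field(default_factory=list)
    # Leaf set: |L|/2 numerically closest smaller and |L|/2 closest larger nodeIds.
    leaf_smaller: List[int] = field(default_factory=list)
    leaf_larger: List[int] = field(default_factory=list)
    # Neighborhood set: the |M| nodes closest by the network proximity metric.
    neighborhood: List[int] = field(default_factory=list)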

Page 28: Peer-to-peer file systems Presented by: Serge Kreiker

Example: nodeId = 10233102, b = 2, nodeIds are 16 bits. All numbers are in base 4.

Page 29: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: routing requests

Route(my-id, key-id, message):
  if key-id is in the range of my leaf set:
    forward to the numerically closest node in the leaf set
  else if the routing table has a node-id that shares a longer prefix with key-id than my-id does:
    forward to that node
  else:
    forward to a node-id that shares a prefix with key-id of the same length as my-id,
    but is numerically closer to key-id

Routing takes O(log N) messages
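
The same rule as runnable Python, assuming nodeIds are equal-length base-4 digit strings (b = 2, as in the example on the next slide) and that the leaf set and routing table are available as plain local data, which simplifies real Pastry state:

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(my_id: str, key: str, leaf_set: list[str],
             routing_table: dict[tuple[int, str], str]) -> str:
    # Case 1: key falls within the leaf-set range -> go straight to the closest node.
    candidates = leaf_set + [my_id]
    if min(candidates) <= key <= max(candidates):
        return min(candidates, key=lambda n: abs(int(n, 4) - int(key, 4)))
    # Case 2: routing table has an entry sharing a longer prefix with the key.
    l = shared_prefix_len(my_id, key)
    entry = routing_table.get((l, key[l]))
    if entry is not None:
        return entry
    # Case 3 (rare): fall back to any known node with at least as long a shared
    # prefix that is numerically closer to the key than we are.
    known = leaf_set + list(routing_table.values())
    better = [n for n in known
              if shared_prefix_len(n, key) >= l
              and abs(int(n, 4) - int(key, 4)) < abs(int(my_id, 4) - int(key, 4))]
    return min(better, key=lambda n: abs(int(n, 4) - int(key, 4))) if better else my_id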

Page 30: Peer-to-peer file systems Presented by: Serge Kreiker

Routing example: b = 2, ℓ = 4 digits, key = 1230

The message is routed from the source node 2331 toward the destination 1223, matching one more digit of the key at each hop (2331 → 1331 → 1211 → 1233 → 1223). Routing-table rows and leaf set used along the way:
X0: 0130, 1331, 2331, 3001
X1: 1030, 1123, 1211, 1301
X2: 1201, 1213, 1223, 1233
L: 1232, 1223, 1300, 1301

Page 31: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: node addition

X – the joining node; A – a node near X (network proximity); Z – the node with nodeId numerically closest to X's

Routing state of X:
leaf-set(X) = leaf-set(Z)
neighborhood-set(X) = neighborhood-set(A)
routing table of X, row i = routing table of N_i, row i, where N_i is the i-th node encountered along the route from A to Z
X notifies all nodes in leaf-set(X), which update their state

(Figure: the join message Lookup(216) is routed from A = 10 toward Z = 210 via intermediate nodes N1 and N2; nodes 36 and 240 are also shown.)
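
A compact sketch of this state-initialization rule, with nodes represented as plain dictionaries purely for illustration (real Pastry exchanges this state through messages); X would then notify its new leaf-set members as described above:

def init_joining_node_state(A: dict, Z: dict, route_from_A_to_Z: list) -> dict:
    """Build the state of the joining node X; nodes are plain dicts in this sketch."""
    return {
        # Leaf set comes from Z, the node with nodeId numerically closest to X.
        "leaf_set": list(Z["leaf_set"]),
        # Neighborhood set comes from A, the node close to X in the network.
        "neighborhood_set": list(A["neighborhood_set"]),
        # Row i of X's routing table is copied from N_i, the i-th node
        # encountered along the route from A to Z.
        "routing_table": [list(Ni["routing_table"][i])
                          for i, Ni in enumerate(route_from_A_to_Z)],
    }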

Page 32: Peer-to-peer file systems Presented by: Serge Kreiker

X joins the system, first stage

(Figure: X sends a join message to the nearby node A; A routes a message with key = X toward Z via nodes B and C. X takes its neighborhood set (M-set) from A and its leaf set from Z; routing-table rows come from the nodes on the route: row 0 from A, row 1 from B, row 2 from C.)

Page 33: Peer-to-peer file systems Presented by: Serge Kreiker

Pastry: node failures and recovery

A soft-state protocol deals with node failures: neighboring nodes in the nodeId space periodically exchange keep-alive messages
Nodes that are unresponsive for a period T are removed from leaf sets
A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
Randomized routing deals with malicious nodes that cause repeated query failures

Page 34: Peer-to-peer file systems Presented by: Serge Kreiker

Security

Each PAST node and each user of the system holds a smartcard
A private/public key pair is associated with each card
Smartcards generate and verify certificates and maintain storage quotas

Page 35: Peer-to-peer file systems Presented by: Serge Kreiker

More on security

Smartcards ensure the integrity of nodeId and fileId assignments
Store receipts prevent malicious nodes from creating fewer than k copies
File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content, and to enforce the storage quota

Page 36: Peer-to-peer file systems Presented by: Serge Kreiker

Storage management

Based on local coordination among nodes with nearby nodeIds
Responsibilities: balance the free storage among nodes; maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId

Page 37: Peer-to-peer file systems Presented by: Serge Kreiker

Causes of storage imbalance & solutions

The number of files assigned to each node may vary
The size of the inserted files may vary
The storage capacity of PAST nodes differs

Solutions: replica diversion, file diversion

Page 38: Peer-to-peer file systems Presented by: Serge Kreiker

Replica diversion

Recall: each node maintains a leaf set, the l nodes with nodeIds numerically closest to the given node
If a node A cannot accommodate a copy locally, it considers replica diversion
A chooses a node B in its leaf set and asks it to store the replica
A then enters a pointer to B's copy in its table and issues a store receipt

Page 39: Peer-to-peer file systems Presented by: Serge Kreiker

Policies for accepting a replica

If (file size / remaining free storage) > t, reject; t is a fixed threshold
t has different values for a primary replica (stored on one of the k numerically closest nodes) and a diverted replica (stored on a node in the same leaf set but not among the k closest): t(primary) > t(diverted)
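
The acceptance test fits in a few lines; the threshold values below are placeholders, since PAST chooses t(primary) > t(diverted) experimentally:

def accept_replica(file_size: int, free_space: int, primary: bool,
                   t_primary: float = 0.1, t_diverted: float = 0.05) -> bool:
    if free_space <= 0:
        return False
    threshold = t_primary if primary else t_diverted
    # Reject if the file would consume too large a fraction of the remaining free storage.
    return (file_size / free_space) <= threshold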

Page 40: Peer-to-peer file systems Presented by: Serge Kreiker

File diversion

When one of the k nodes declines to store a replica, replica diversion is tried first
If the node chosen for the diverted replica also declines, the entire file is diverted
A negative ack is sent; the client generates another fileId and starts again
After 3 rejections, the user is notified of the failure

Page 41: Peer-to-peer file systems Presented by: Serge Kreiker

Maintaining replicas

Pastry uses keep-alive messages and adjusts the leaf set after failures; the same adjustment takes place at join
What happens with the copies stored by a failed node?
What about the copies stored by a node that leaves or enters a new leaf set?

Page 42: Peer-to-peer file systems Presented by: Serge Kreiker

Maintaining replicas (contd.)

To maintain the invariant (k copies), the replicas have to be re-created in the previous cases
This is a big overhead
Proposed solution for join: lazy re-creation – first insert a pointer to the node that holds the copies, then migrate them gradually

Page 43: Peer-to-peer file systems Presented by: Serge Kreiker

Caching

The k replicas are maintained in PAST for availability
The fetch distance is measured in overlay-network hops (which says little about real network distance)
Caching is used to improve performance

Page 44: Peer-to-peer file systems Presented by: Serge Kreiker

Caching (contd.)

PAST nodes use the “unused” portion of their advertised disk space to cache files
When storing a new primary or diverted replica, a node evicts one or more cached copies
How it works: a file that is routed through a node by Pastry (insert or lookup) is inserted into the local cache if its size < c, where c is a fraction of the current cache size

Page 45: Peer-to-peer file systems Presented by: Serge Kreiker

Evaluation

PAST implemented in Java; network emulation using a Java VM
2 workloads (based on NLANR traces) for file sizes
4 normal distributions of node storage sizes

Page 46: Peer-to-peer file systems Presented by: Serge Kreiker

Key results

STORAGE
Replica and file diversion improve global storage utilization from 60.8% to 98%; insertion failures drop from 51% to < 5%
Caveat: the storage capacities used in the experiment are roughly 1000x below what might be expected in practice

CACHING
Routing hops with caching are lower than without caching, even at 99% storage utilization
Caveat: median file sizes are very low; caching performance will likely degrade if they are higher

Page 47: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: introduction

Peer-to-peer read-only storage system
Decentralized architecture focusing mainly on efficiency of data access, robustness, load balance, and scalability
Provides a distributed hash table for block storage; uses Chord to map keys to nodes
Does not provide anonymity or strong protection against malicious participants
The focus is on providing an efficient and robust lookup and storage layer with simple algorithms

Page 48: Peer-to-peer file systems Presented by: Serge Kreiker

CFS software structure

CFS client: FS layer on top of DHash on top of Chord
CFS servers: DHash on top of Chord
Layers on different machines communicate through an RPC API; within a machine the layers use a local API
Page 49: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: layer functionalities

The client file system uses the DHash layer to retrieve blocks
The server and client DHash layers use the Chord layer to locate the servers that hold the desired blocks
The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
The Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block

Page 50: Peer-to-peer file systems Presented by: Serge Kreiker

The client identifies the root block using a public key generated by the publisher. It uses the public key as the root-block identifier to fetch the root block and checks the validity of the block using its signature. The key of a file's inode is obtained by the usual search through directory blocks, which contain the keys of the file inode blocks; these keys are used to fetch the inode blocks. The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks.
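
A hedged sketch of that fetch path; dhash_get, verify_sig, and the dictionary-shaped block layouts are stand-ins invented for this illustration and differ from the real CFS block formats:

import hashlib

def fetch_file(dhash_get, verify_sig, public_key: bytes, path: list[str]) -> bytes:
    """dhash_get returns a directory/inode dict for metadata blocks and raw bytes
    for data blocks; both it and verify_sig are placeholders in this sketch."""
    # 1. The root block is named by (a hash of) the publisher's public key and is
    #    signed, so the client can authenticate it.
    root = dhash_get(hashlib.sha1(public_key).digest())
    if not verify_sig(root, public_key):
        raise ValueError("root block signature check failed")

    # 2. Walk directory blocks to find the key of the file's inode block.
    directory = root
    for name in path[:-1]:
        directory = dhash_get(directory["entries"][name])
    inode = dhash_get(directory["entries"][path[-1]])

    # 3. The inode block lists the keys of the data blocks, in order.
    return b"".join(dhash_get(block_key) for block_key in inode["block_keys"])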

Page 51: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: properties

decentralized control – no administrative relationship between servers and publishers
scalability – a lookup uses space and messages at most logarithmic in the number of servers
availability – a client can retrieve data as long as at least one replica is reachable using the underlying network
load balance – for large files, achieved by spreading blocks over a number of servers; for small files, blocks are cached at the servers involved in the lookup
persistence – once data is inserted, it is available for the agreed-upon interval
quotas – implemented by limiting the amount of data inserted by any particular IP address
efficiency – the delay of file fetches is comparable with FTP, due to efficient lookup, pre-fetching, caching, and server selection

Page 52: Peer-to-peer file systems Presented by: Serge Kreiker

Chord: consistent hashing

Maps a node's IP address + virtual host number to an m-bit node identifier
Maps block keys into the same m-bit identifier space
The node responsible for a key is the successor of the key's id, with wrap-around in the m-bit identifier space
Consistent hashing balances the keys so that all nodes share an equal load with high probability, and requires minimal movement of keys as nodes enter and leave the network
For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability
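
A minimal sketch of this mapping, assuming SHA-1 (m = 160) and a global view of all node identifiers; the distributed O(log N) lookup is sketched with the next slide:

import hashlib

M = 160                  # identifier bits when SHA-1 is the hash (an assumption here)
RING = 1 << M

def chord_id(value: bytes) -> int:
    return int.from_bytes(hashlib.sha1(value).digest(), "big")

def node_identifier(ip: str, virtual_host: int) -> int:
    # Node id = hash of (IP address, virtual host number).
    return chord_id(f"{ip}:{virtual_host}".encode())

def successor(key: int, node_ids: list[int]) -> int:
    """The node responsible for key: the first node at or after the key, with wrap-around."""
    nodes = sorted(node_ids)
    key %= RING
    for n in nodes:
        if n >= key:
            return n
    return nodes[0]      # wrapped past the top of the identifier circle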

Page 53: Peer-to-peer file systems Presented by: Serge Kreiker

Chord details

Two data structures are used for performing lookups:
Successor list: maintains the next r successors of the node. The successor list alone can be used to traverse the nodes and find the node responsible for the data in O(N) time
Finger table: the ith entry contains the identity of the first node that succeeds n by at least 2^(i−1) on the ID circle

Lookup pseudo-code: find the id's predecessor; its successor is the node responsible for the key
To find the predecessor, check whether the key lies between the node's id and its successor; otherwise, using the finger table and successor list, find the node that is the closest predecessor of the id and repeat this step
Since finger-table entries point to nodes at power-of-two intervals around the ID ring, each iteration of the above step reduces the distance between the predecessor and the current node by half
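
An illustrative, non-distributed version of that lookup loop; the nodes dictionary standing in for RPCs to remote Chord nodes is an assumption of this sketch:

RING = 1 << 160   # assumed 160-bit identifier space, as with SHA-1

def between(x: int, a: int, b: int, inclusive_right: bool = False) -> bool:
    """True if x lies in the circular interval (a, b) -- or (a, b] if requested."""
    if a < b:
        return a < x < b or (inclusive_right and x == b)
    return x > a or x < b or (inclusive_right and x == b)

def find_successor(start: int, key: int, nodes: dict) -> int:
    """nodes[n] = {"successor": ..., "fingers": [...]} stands in for contacting node n."""
    n = start
    # successor(n) is responsible for key exactly when key lies in (n, successor(n)].
    while not between(key, n, nodes[n]["successor"], inclusive_right=True):
        # Jump to the closest preceding finger of the key: the finger farthest
        # around the ring from n that still lies before the key. Each such jump
        # roughly halves the remaining distance, giving O(log N) hops.
        preceding = [f for f in nodes[n]["fingers"] if between(f, n, key)]
        n = max(preceding, key=lambda f: (f - n) % RING) if preceding else nodes[n]["successor"]
    return nodes[n]["successor"]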

Page 54: Peer-to-peer file systems Presented by: Serge Kreiker

Finger i points to the successor of n + 2^i

(Figure: node N80's fingers; successive fingers cover ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ring. Nodes N112 and N120 are shown as examples.)

Page 55: Peer-to-peer file systems Presented by: Serge Kreiker

Chord: node join/failure

Chord tries to preserve two invariants: each node's successor is correctly maintained, and for every key k, node successor(k) is responsible for k
To preserve these invariants, when a node n joins the network: initialize the predecessor, successors, and finger table of n; update the existing finger tables of other nodes to reflect the addition of n; notify higher-layer software so that state can be transferred
For concurrent operations and failures, each Chord node periodically runs a stabilization algorithm to update its finger tables and successor lists to reflect the addition/failure of nodes
If lookups fail during the stabilization process, the higher layer can look up again; Chord guarantees that the stabilization algorithm results in a consistent ring

Page 56: Peer-to-peer file systems Presented by: Serge Kreiker

Chord: server selection

Added to Chord as part of the CFS implementation
Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
Latencies are measured during finger-table creation, so no extra measurements are necessary
This works well only if latency is roughly transitive: low latency from a to b and from b to c implies low latency from a to c
Measurements suggest this is true [A case study of server selection, Master's thesis]

Page 57: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: node ID authentication

An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data
To prevent this, when a new node joins the system, existing nodes check:
that the hash of (node IP + virtual number) matches the professed node ID
and send a random nonce to the claimed IP address to check for IP spoofing
To succeed, an attacker would have to control a large number of machines in order to target the blocks of the same file, which are randomly distributed over multiple servers
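
A sketch of the first check; the SHA-1 hash and the "IP:virtual-number" encoding are assumed details, and the nonce exchange is omitted:

import hashlib

def expected_node_id(ip: str, virtual_number: int) -> bytes:
    return hashlib.sha1(f"{ip}:{virtual_number}".encode()).digest()

def verify_professed_id(professed_id: bytes, ip: str, virtual_number: int) -> bool:
    # Existing nodes recompute the hash and compare it with the claimed node ID;
    # a separate random nonce sent to that IP guards against spoofing (not shown).
    return professed_id == expected_node_id(ip, virtual_number)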

Page 58: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: DHash layer

Provides a distributed hash table for block storage
Reflects a key CFS design decision: split each file into blocks and randomly distribute the blocks over many servers
This provides good load distribution for large files; the disadvantage is that lookup cost increases, since a lookup is executed for each block. The lookup cost is nevertheless small compared to the much higher cost of block fetches
Also supports pre-fetching of blocks to reduce user-perceived latency
Supports replication, caching, quotas, and updates of blocks
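
A sketch of the split-and-key step implied here, using SHA-1 content hashes and the 8 KB block size from the experiments later in the talk (both assumed details of the sketch):

import hashlib

BLOCK_SIZE = 8 * 1024   # 8 KB

def split_into_blocks(data: bytes) -> list[tuple[bytes, bytes]]:
    """Return (key, block) pairs, keying each block by the SHA-1 hash of its content."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [(hashlib.sha1(block).digest(), block) for block in blocks]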

Page 59: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: replication

Replicates the blocks on k servers to increase availability
Places the replicas on the k servers that are the immediate successors of the node responsible for the key; these servers are easily found from the successor list (r >= k)
Provides fault tolerance: when the successor fails, the next server can serve the block
Since successor nodes are in general not physically close to each other (a node id is a hash of IP + virtual number), this provides robustness against the failure of multiple servers located on the same network
The client can fetch the block from any of the k servers; latency can be used as a deciding factor, which also has the side effect of spreading the load across multiple servers. This works under the assumption that proximity in the underlying network is transitive

Page 60: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: caching

DHash implements caching to avoid overloading servers that hold popular data
Caching is based on the observation that as a lookup proceeds closer to the desired key, the distance traveled across the key space with each hop decreases. This implies that, with high probability, the nodes just before the key are involved in a large number of lookups for the same block. So when the client fetches the block from the successor node, it also caches it at the servers that were involved in the lookup
The cache replacement policy is LRU. Blocks cached on servers far from the key are evicted faster, since few lookups touch those servers; blocks cached on closer servers remain in the cache as long as they are referenced
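
To illustrate the mechanism, a toy LRU block cache plus the “cache at every server on the lookup path” step; the data structures here are assumptions, not DHash's actual implementation:

from collections import OrderedDict

class LRUBlockCache:
    """Toy LRU cache keyed by block ID; not DHash's real data structure."""
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()          # key -> block, least recently used first

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)     # mark as most recently used
            return self.blocks[key]
        return None

    def put(self, key, block):
        if key in self.blocks:
            self.blocks.move_to_end(key)
            return
        self.blocks[key] = block
        self.used += len(block)
        while self.used > self.capacity:     # evict least recently used blocks
            _, evicted = self.blocks.popitem(last=False)
            self.used -= len(evicted)

def fetch_with_path_caching(key, lookup_path, fetch_from_successor):
    # After fetching the block from the successor, also cache it at every
    # server that was involved in the lookup.
    block = fetch_from_successor(key)
    for server_cache in lookup_path:
        server_cache.put(key, block)
    return block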

Page 61: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: implementation

Implemented in 7000 lines of C++ code, including 3000 lines of Chord
User-level programs communicate over UDP with RPC primitives provided by the SFS toolkit
The Chord library maintains the successor lists and the finger tables. For multiple virtual servers on the same physical server, the routing tables are shared for efficiency
Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency
The client FS implementation exports an ordinary Unix-like file system. The client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers

Page 62: Peer-to-peer file systems Presented by: Serge Kreiker

CFS: experimental results

Two sets of tests:
To test real-world client-perceived performance, the first test explores performance on a subset of 12 machines of the RON testbed. A 1-megabyte file is split into 8 KB blocks; all machines download the file one at a time, and the download speed is measured with and without server selection
The second test is a controlled test in which a number of servers run on the same physical machine and use the local loopback interface for communication. In this test, the robustness, scalability, load balancing, etc. of CFS are studied

Pages 63–73: experimental result graphs (figures not reproduced in this text version)

Future research

Support keyword search: by adopting an existing centralized search engine (as Napster did), or by using a distributed set of index files stored in CFS
Improve security against malicious participants: colluding nodes can form a consistent internal ring, route all lookups to nodes inside that ring, and then deny the existence of the data
Content hashes help guard against block substitution
Future versions will add periodic “routing table” consistency checks by randomly selected nodes to try to detect malicious participants
Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time

Page 74: Peer-to-peer file systems Presented by: Serge Kreiker

Conclusions

PAST (Pastry) and CFS (Chord) represent peer-to-peer routing and location schemes for storage
The ideas are much the same in both; CFS load management is less complex
Questions raised at SOSP about them:
Is there any real application for them?
Who will trust these infrastructures to store his/her files?