
SPLAY: Distributed Systems Made Simple



Page 1: SPLAY: Distributed Systems Made Simple

ᔕᕈᒪᐱᓭ

Distributed Systems Made Simple

Pascal Felber, Raluca Halalai, Lorenzo Leonini, Etienne Rivière, Valerio Schiavoni, José Valerio

Université de Neuchâtel, Switzerland

www.splay-project.org

ᔕᕈᒪᐱᓭ

Page 2: SPLAY: Distributed Systems Made Simple

Motivations

• Developing, testing, and tuning distributed applications is hard

• In Computer Science research, bridging the gap between the simplicity of a pseudocode description and a real implementation is hard

• Using worldwide testbeds is hard

Page 3: SPLAY: Distributed Systems Made Simple

• Set of machines for testing a distributed application/protocol

• Several different testbeds!

Networks of idle workstations

Your machine

A cluster@UniNE

Testbeds

Page 4: SPLAY: Distributed Systems Made Simple

What is PlanetLab?

Page 5: SPLAY: Distributed Systems Made Simple

What is PlanetLab?

• Machines contributed by universities, companies, etc.

• 1098 nodes at 531 sites (02/09/2011)

• Shared resources, no privileged access

• University-quality Internet links

• High resource contention

• Faults, churn, and packet loss are the norm

• Challenging conditions

Page 6: SPLAY: Distributed Systems Made Simple

Daily Job With Distributed Systems

code

1 • Write (testbed-specific) code
• Local tests, in-house cluster, PlanetLab...

Page 7: SPLAY: Distributed Systems Made Simple

Daily Job With Distributed Systems

code debug

2 • Debug (in this context, a nightmare)

Page 8: SPLAY: Distributed Systems Made Simple

Daily Job With Distributed Systems

code debug deploy

3 • Deploy, with testbed-specific scripts

Page 9: SPLAY: Distributed Systems Made Simple

Daily Job With Distributed Systems

code debug deploy get logs

4 • Get logs, with testbed-specific scripts

Page 10: SPLAY: Distributed Systems Made Simple

Daily Job With Distributed Systems

code debug deploy get logs plots

5 • Produce plots, hopefully

Page 11: SPLAY: Distributed Systems Made Simple

code debug deploy get logs plots

SPLAY at a Glance

• Supports the development, evaluation, testing, and tuning of distributed applications on any testbed:

• In-house cluster, shared testbeds, emulated environments...

• Provides an easy-to-use pseudocode-like language based on Lua

• Open-source: http://www.splay-project.org

ᔕᕈᒪᐱᓭ

gnuplot...

Page 12: SPLAY: Distributed Systems Made Simple

The Big Picture

SplayController

SQL DB

application

splayd

splayd

splayd

splayd

Page 13: SPLAY: Distributed Systems Made Simple

The Big Picture

SplayController

SQL DB

application

splayd

splayd

splayd

splayd

Page 14: SPLAY: Distributed Systems Made Simple

The Big Picture

SplayController

SQL DB

application

splayd

splayd

splayd

splayd

Page 15: SPLAY: Distributed Systems Made Simple

SPLAY architecture

• Portable daemons (ANSI C) on the testbed(s)

• Testbed-agnostic for the user

• Lightweight (memory & CPU)

• Retrieve results by remote logging

Page 16: SPLAY: Distributed Systems Made Simple

SPLAY language

• Based on Lua

• High-level scripting language, simple interaction with C, bytecode-based, garbage collection, ...

• Close to pseudo code (focus on the algorithm)

• Favors natural ways of expressing algorithms (e.g., RPCs)

• SPLAY applications can be run locally without the deployment infrastructure (for debugging & testing)

• Comprehensive set of libraries (extensible)

It is noteworthy that the churn management system relieves the need for fault injection systems such as Loki [13]. Another typical use of the churn management system is for long-running applications, e.g., a DHT that serves as a substrate for some other distributed application under test and needs to stay available for the whole duration of the experiments. In such a scenario, one can ask the churn manager to maintain a fixed-size population of nodes and to automatically bootstrap new ones as faults occur in the testbed.

3.3 Language and Applications

SPLAY applications are written in the Lua language [17], a highly efficient scripting language. This design choice was dictated by four major factors. First, the most important reason is the support of sandboxing for remote processes and, as a result, increased security both for the testbed owner and its users. As mentioned earlier, sandboxing is a sound basis for execution in non-dedicated environments, where resources need to be constrained and where the hosting operating system must be shielded from possibly buggy or ill-behaved code. Second, one of SPLAY's goals is to support large numbers of processes within a single host of the testbed. This calls for a low footprint for both the daemons and the associated libraries. This excludes languages such as Java that require several megabytes of memory just for their execution environment. Third, SPLAY must ensure that the achieved performance is as good as the host system permits, and the features offered to the distributed system designer shall not interfere with the performance of the application. Fourth, SPLAY allows deployment of applications on any hardware and on any operating system. This requires a "write-once, run everywhere" approach that calls for either an interpreted or bytecode-based language. Lua's unique features allow us to meet these goals of lightweightness, simplicity, performance, security and genericity.

Lua was designed from the ground up to be an efficient scripting language with very low footprint. According to recent benchmarks [2], Lua is among the fastest interpreted scripting languages. It is reflective, imperative, and procedural with extensible semantics. Lua is dynamically typed and has automatic memory management with incremental garbage collection. The small footprint of Lua results from its design that provides flexible and extensible meta-features, rather than a complete set of general-purpose facilities. The full interpreter is less than 200 kB and can be easily embedded or use libraries written in different languages (especially C/C++). This allows for low-level programming if need be. Our experiments (Section 5) highlight the lightweightness of SPLAY applications using Lua, in terms of memory footprint, load, and scalability.

Lua's interpreter can directly execute source code, as well as hardware-dependent (but operating system-independent) bytecode. In SPLAY, the favored way of submitting applications is in the form of source code, but bytecode programs are also supported (e.g., for intellectual property protection).

Isolation and sandboxing are achieved thanks to Lua's support for first-class functions with lexical scoping and closures, which allow us to restrict access to I/O and networking libraries. We modify the behavior of these functions to implement the restrictions imposed by the administrator or by the user at the time he/she submits the application for deployment over SPLAY.

Lua also supports cooperative multitasking by the means of coroutines, which are at the core of SPLAY's event-based model (discussed below).
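To make the coroutine-based model concrete, here is a minimal sketch in plain Lua (standard coroutine API only, not SPLAY's actual event loop): two tasks voluntarily yield control to a tiny round-robin scheduler, which is the style of cooperative multitasking the runtime builds on.

local function make_task(name, steps)
  return coroutine.create(function()
    for i = 1, steps do
      print(name .. " step " .. i)
      coroutine.yield()  -- hand control back to the scheduler
    end
  end)
end

-- Round-robin over the tasks until all of them have finished.
local tasks = { make_task("ping", 3), make_task("pong", 3) }
while #tasks > 0 do
  for i = #tasks, 1, -1 do
    coroutine.resume(tasks[i])
    if coroutine.status(tasks[i]) == "dead" then
      table.remove(tasks, i)
    end
  end
end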

Figure 6: Overview of the main SPLAY libraries. splay::app sits on top of events/threads, log, rpc, llenc, json*, misc, crypto*, io (fs)*/sb_fs, stdlib*/sb_stdlib and socket/events/sb_socket over luasocket* (*: main dependencies, third-party and Lua libraries).

3.4 The Libraries

SPLAY includes an extensible set of shared libraries (see Figure 6) tailored for the development of distributed applications and overlays. These libraries are meant to be used outside of the deployment system, when developing the application. We briefly describe the major components of these libraries.

Networking. The luasocket library provides basic networking facilities. We have wrapped it into a restricted socket library, sb_socket, which includes a security layer that can be controlled by the local administrator (the person who has instantiated the local daemon process) and further restricted remotely by the controller. This secure layer allows us to limit: (1) the total bandwidth available for SPLAY applications (instantaneous bandwidth can be limited using shaping tools if need be); (2) the maximum number of sockets used by an application and (3) the addresses that an application can or cannot connect to. Restrictions are specified declaratively in configuration files by the local user that starts the daemon, or at the controller via the command-line and Web-based APIs.

We have implemented higher-level abstractions for simplifying communication between remote processes. Our API supports message passing over TCP and UDP, as well as access to remote functions and variables using RPCs. Calling a remote function is almost as simple as calling a local one (see code in next section). Communication errors are reported using extra return values.²

² Lua allows multiple values to be returned by a function.
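As an illustration, a minimal sketch of such a call, using only the splay.urpc functions that appear in the demo code later in this deck; the node address is made up, and the exact error-reporting convention depends on the rpc variant (a nil result on failure for rpc.call, as in the demo, or a separate status value as in the fault-tolerant Chord listing).

rpc = require "splay.urpc"

-- Hypothetical remote node descriptor; in a deployed SPLAY job the
-- entries of job.nodes already have this {ip, port} shape.
local node = { ip = "10.0.0.2", port = 20000 }

-- Call the remote function "pull" with one table argument and a 15 s
-- timeout; as in the demo code, r is nil if the call fails or times out.
local r = rpc.call(node, { "pull", {} }, 15)
if not r then
  print("RPC to " .. node.ip .. ":" .. node.port .. " failed or timed out")
end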

Page 17: SPLAY: Distributed Systems Made Simple

Why Lua?

• Light & Fast

• (Very) close to equivalent code in C

• Concise

• Allow developers to focus on ideas more than implementation details

• Key for researchers and students

• Sandboxing thanks to the possibility of easily redefining (even built-in) functions

Page 18: SPLAY: Distributed Systems Made Simple

Concise

Pseudo code as published in the original paper

Executable code using SPLAY libraries

Page 19: SPLAY: Distributed Systems Made Simple

http://www.splay-project.org

Page 20: SPLAY: Distributed Systems Made Simple

Sandbox: Motivations

• Experiments should access only their own resources

• Required for non-dedicated testbeds

• Idle workstations in universities or companies

• Must protect the regular users from ill-behaved code (e.g., full disk, full memory, etc.)

• Memory-allocation, filesystem, network resources must be restricted

Page 21: SPLAY: Distributed Systems Made Simple

Sandboxing with Lua

• In Lua, all functions can be re-defined transparently

• Resource control done in re-defined functions before calling the original function

• No code modification required for the user

SPLAY Lua libraries (includes all stdlib and system calls)

Sandboxing

SPLAY Application

Operating system
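A minimal sketch of this redefinition trick in plain Lua (not SPLAY's actual sandbox code): the original function is captured in a closure and the replacement enforces a made-up quota before delegating, so application code keeps calling io.open unchanged.

-- Keep a private reference to the original io.open, then replace it
-- with a wrapper that enforces a (hypothetical) open-file quota
-- before delegating to the original function.
local real_open = io.open
local open_files, max_open_files = 0, 16

io.open = function(path, mode)
  if open_files >= max_open_files then
    return nil, "sandbox: too many open files"
  end
  local f, err = real_open(path, mode)
  if f then open_files = open_files + 1 end
  return f, err
end

A real sandbox would also wrap close(), the networking and memory-allocation functions, and read its limits from the daemon's configuration instead of hard-coding them.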

Page 22: SPLAY: Distributed Systems Made Simple

misc log rpc

crypto llenc/benc json

luasocket events io

for _,module in pairs(splay)

Modules sandboxed to prevent access to sensitive resources

present novelties introduced due to dist systems

Page 23: SPLAY: Distributed Systems Made Simple

splay.events

events

• Distributed protocols can use the message-passing paradigm to communicate

• Nodes react to events

• Local, incoming, outgoing messages

• The core of the Splay runtime

• Libraries splay.socket & splay.io provide a non-blocking operational mode

• Based on Lua’s coroutines
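As an illustrative sketch built only from calls that appear in the demo code later in this deck (events.loop, events.thread, events.sleep, log.print): two cooperative tasks sharing the runtime without OS threads.

require "splay.base"

-- The event loop schedules cooperative threads; every events.sleep()
-- yields to the scheduler, which is how Lua coroutines give SPLAY its
-- non-blocking behaviour.
events.loop(function()
  events.thread(function()
    for i = 1, 3 do
      log.print("background task, round " .. i)
      events.sleep(1)
    end
  end)
  log.print("main function done; the background task keeps running")
end)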

Page 24: SPLAY: Distributed Systems Made Simple

splay.rpc

• Default mechanism to communicate between nodes

• Support for UDP/TCP

• Efficient BitTorrent-like encoding

• Experimental binary encoding

Page 25: SPLAY: Distributed Systems Made Simple

(Backdrop of the slide: several hundred lines of a Java DHT node implementation (a Bamboo-style DHTNode class with identifier arithmetic, distance computations, callbacks, message-size accounting and message dispatching), shown to illustrate how much code a distributed-systems experiment can require without SPLAY.)

Life Before SPLAY

• Time spent on developing testbed-specific protocols

• Or fallback to simulations...

• The focus should be on ideas

• Researchers usually have no time to produce industrial-quality code

ᔕᕈᒪᐱᓭ

there is more :-(

Page 26: SPLAY: Distributed Systems Made Simple

Life With SPLAY

Chord robust enough for running on error-prone platforms such as PlanetLab. The first step is to take into account the absence of a reply to an RPC. Consider the call to predecessor in method stabilize(). One simply needs to replace this call by the code of Figure 4.

function stabilize()  -- rpc.a_call() returns both status and results
  local ok, r = rpc.a_call(finger[1], 'predecessor', 60)  -- RPC, 1m timeout
  if not ok then
    suspect(finger[1])  -- will prune the node out of local routing tables
  else
    (...)

Listing 4: Fault-tolerant RPC call

We omit the code of function suspect() for brevity. Depending on the reliability of the links, this function prunes the suspected node after a configurable number of missed replies. One can tune the RPC timeout according to the target platform (here, 1 minute instead of the standard 2 minutes), or use an adaptive strategy (e.g., exponentially increasing timeouts). Finally, as suggested by [27] and similarly to the leafset structure used in Pastry [25], we replace the single successor and predecessor by a list of 4 peers in each direction on the ring.
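The paper omits suspect(); a possible sketch of what such pruning could look like (the missed-replies table, threshold, and remove_from_tables() helper are all hypothetical, introduced only for illustration):

-- Hypothetical sketch: count consecutive missed replies per node and
-- prune the node from the local routing state once a configurable
-- threshold is reached.
missed, max_missed = {}, 3

function suspect(node)
  local key = node.ip .. ":" .. node.port
  missed[key] = (missed[key] or 0) + 1
  if missed[key] >= max_missed then
    missed[key] = nil
    remove_from_tables(node)  -- made-up helper: drop node from fingers / successor list
  end
end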

Our Chord implementation without fault-tolerance is only 59 lines long, which represents an increase of 18%⁴ over the pseudo-code from the original paper. Our fault-tolerant version is only 101 lines long, i.e., 73% more than the base implementation (29% for fault tolerance, and 44% for the leafset-like structure). We detail the procedure for deployment and the results obtained with both versions on a ModelNet cluster and on PlanetLab, respectively, in Section 5.2.

5 Evaluation

This section presents a thorough evaluation of SPLAY performance and capabilities. Evaluating such an infrastructure is a challenging task as the way users will use it plays an important role. Therefore, our goal in this evaluation is twofold: (1) to present the implementation, deployment and observation of real distributed systems by using SPLAY's capability to easily reproduce experiments that are commonly used in evaluations and (2) to study the performance of SPLAY itself, both by comparing it to other widely-used implementations and by evaluating its costs and scalability. The overall objective is to demonstrate the usefulness and benefits of SPLAY rather than evaluate the distributed applications themselves.

We first demonstrate in Section 5.1 SPLAY's capabilities to easily express complex systems in a concise manner. We present in Section 5.2 the deployment and performance evaluation of the Chord DHT proposed in Section 4, using a ModelNet [29] cluster and PlanetLab [7]. We then compare in Section 5.3 the performance and scalability of the Pastry [25] DHT written with SPLAY against a legacy Java implementation, FreePastry [3]. Sections 5.4 and 5.5 evaluate SPLAY's ability to easily (1) deploy applications in complex network settings (mixed PlanetLab and ModelNet deployment) and (2) reproduce arbitrary churn conditions. Section 5.6 focuses on SPLAY performance for deploying and undeploying applications on a testbed. We conclude in Section 5.7 with an evaluation of SPLAY's performance with bandwidth-intensive applications (BitTorrent and tree-based content dissemination).

⁴ Note that the pseudocode in [27] does not include initialization code, while our code does.

Experimental setup. Unless specified otherwise, our experimentations were performed either on PlanetLab, using a set of 400 to 450 hosts,⁵ or on our local cluster. The cluster is composed of 11 nodes, each equipped with a 2.13 GHz Core 2 Duo processor and 2 GB of memory, linked by a 1 Gbps switched network. All nodes run GNU/Linux 2.6.9. A separate node running FreeBSD 4.11 is used as a ModelNet router, when required by the experiment. Our ModelNet configuration emulates 1,100 hosts connected to a 500-node transit-stub topology. The bandwidth is set to 10 Mbps for all links. RTT between nodes of the same domain is 10 ms, stub-stub and stub-transit RTT is 30 ms, and transit-transit (i.e., long range links) RTT is 100 ms. These settings result in delays that are approximately twice higher than those experienced in PlanetLab.

5.1 Development complexity

We developed the following applications using SPLAY: Chord [27] and Pastry [25], two DHTs; Scribe [12], a publish-subscribe system; SplitStream [11],⁶ a bandwidth-intensive multicast protocol; BitTorrent [14], a content distribution infrastructure; and Cyclon [30], a gossip-based membership management protocol. We have also implemented a number of classical algorithms, such as epidemic diffusion on Erdos-Renyi random graphs [16] and various types of distribution trees [10] (n-ary trees, parallel trees). As one can note from the following figure, all implementations are extremely concise in terms of lines of code (LOC):⁷

Lines of code per protocol (darker bars: the protocol itself; lighter bars: protocols acting as a substrate): Chord: 59 (base) + 17 (FT) + 26 (leafset) = 101; Pastry: 265; Scribe: 79 (over Pastry); SplitStream: 58 (over Pastry and Scribe); BitTorrent: 420; Cyclon: 93; Epidemic: 35; Trees: 47.

Although the number of lines is clearly just a rough indicator of the expressiveness of a system, it is still a

⁵ Splay daemons have been continuously running on these hosts for more than one year.

⁶ Scribe is based on Pastry, and SplitStream is based on both Pastry and Scribe.

⁷ Note that we did not try to compact the code in a way that would impair readability. All lines have been counted, including those that only contain block delimiters. Numbers and darker bars represent LOC for the protocol, while lighter bars represent protocols acting as a substrate.


• Lines of pseudocode ~== Lines of executable code

Page 27: SPLAY: Distributed Systems Made Simple

Churn management

• Churn is key for testing large-scale applications

• “Natural churn” on PlanetLab not reproducible

• Hard to compare algorithms under same conditions

• SPLAY allows users to finely control churn

• Traces (e.g., from file sharing systems)

• Synthetic descriptions: dedicated language

applications that must be terminated receive a FREE message. The state machine of a SPLAY job is depicted in Figure 3.

Figure 3: State machine of a SPLAY job on a host (states: idle, selected, running; transitions: REGISTER, LIST, START, STOP, FREE).

The reason why we initially select a larger set of nodes than requested clearly appears when considering the availability of hosts on testbeds like PlanetLab, where transient failures and overloads are the norm rather than the exception. Figure 4 shows the distribution of round-trip times (RTT) for a 20-Kbytes message over an already established TCP connection from the controller to PlanetLab hosts. Figure 4 shows both the cumulative distribution of delays and their discretized distribution. One can observe that only 17.10% of the nodes reply within 250 milliseconds, and over 45% need more than 1 second. Selecting a larger set of candidates allows us to choose the most responsive nodes for deploying the application.

Figure 4: RTT between the controller and PlanetLab hosts over pre-established TCP connections, with a 20 KB payload (CDF and discretized distribution of delays, in seconds).

Daemons. SPLAY daemons are installed on participating hosts by a local user or administrator. Ideally, they should be automatically started at boot time (e.g., by an init.d script on Unix). The local administrator can configure the daemon via a configuration file, specifying various instance parameters (e.g., daemon name, access key, etc.) and restrictions on the resources available for SPLAY applications. These restrictions encompass memory, network, and disk usage. If an application exceeds these limitations, it is killed (memory usage) or I/O operations fail (disk or network usage). The controller can specify stricter—but not weaker—restrictions at deployment time.

Upon startup, a SPLAY daemon receives a blacklist of forbidden addresses expressed as IP or DNS masks. By default, the addresses of the controllers are blacklisted so that applications cannot actively connect to them. Blacklists can be updated at runtime.

The daemon also receives the address of a log process to connect to for logging, together with a unique identification key. SPLAY applications instantiated by the local daemon can only connect to that log process; other processes will reject any connection request.

3.2 Churn Management

In order to fully understand the behavior and robustness of a distributed protocol, it is necessary to evaluate it under different churn conditions. These conditions can range from rare but unpredictable hardware failures, to frequent application-level disconnections, as usually found in user-driven peer-to-peer systems, or even to massive failure scenarios. It is also important to allow comparison of competing algorithms under the very same churn scenarios. Relying on the natural, non-reproducible churn of testbeds such as PlanetLab often proves to be insufficient.

There exist several characterizations of churn that can be leveraged to reproduce realistic conditions for the protocol under test. First, synthetic descriptions issued from analytical studies [23] can be used to generate churn scenarios and replay them in the system. Second, several traces of the dynamics of real networks have been made publicly available by the community (e.g., see the repository at [1]); they cover a wide range of applications such as highly churned file-sharing systems [8] or high performance computing clusters [26].

1 at 30s join 10
2 from 5m to 10m inc 10
3 from 10m to 15m const churn 50%
4 at 15m leave 50%
5 from 15m to 20m inc 10 churn 150%
6 at 20m stop

Figure 5: Example of a synthetic churn description (node population and leaves/joins per minute over time, with phases 1-6 marked).

SPLAY incorporates a component, churn (see Figure 2), dedicated to churn management. This component can send instructions to the daemons for stopping and starting processes on-the-fly. Churn can be specified as a trace, in a format similar to that used by [1], or as a synthetic description written in a simple script language. The trace indicates explicitly when each node enters or leaves the system while the script allows users to express phases of the application's lifetime, such as a steady increase or decrease in the number of peers over a given time duration, periods with continuous churn, massive failures, join flashcrowds, etc. An example script is shown in Figure 5 together with a representation of the evolution of the node population and the number of arrivals and departures during each one-minute period: an initial set of nodes joins after 30 seconds, then the system stabilizes before a regular increase, a period with a constant population but a churn that sees half of the nodes leave and an equal number join, a massive failure of half of the nodes, another increase under high churn, and finally the departure of all the nodes.

Section 5.5 presents typical uses of the churn management mechanism in the evaluation of a large-scale distributed system.

Binned number of joins and leaves per minute

Page 28: SPLAY: Distributed Systems Made Simple

Synthetic churn

Figure 10: Pastry on PlanetLab, ModelNet, and both (CDF of route delays, in seconds, for the PlanetLab, ModelNet and mixed deployments).

a mixed deployment over both testbeds at the same time (i.e., 500 nodes on each). We notice that the delays of the mixed deployment are distributed between the delays of PlanetLab and the higher delays of our ModelNet cluster. The "steps" on the ModelNet cumulative delays representation are a result of routes of increasing number of hops (both in Pastry and in the emulated topology), and the fixed delays for ModelNet links.

5.5 Using Churn Management

This section evaluates the use of the churn management module, both using traces and synthetic descriptions. Using churn is as simple as launching a regular SPLAY application with a trace file as extra argument. SPLAY provides a set of tools to generate and process trace files. One can, for instance, speed up a trace, increase the churn amplitude whilst keeping its statistical properties, or generate a trace from a synthetic description.

Figure 11: Using churn management to reproduce massive churn conditions for the SPLAY Pastry implementation (5th, 25th, 50th, 75th and 90th delay percentiles and route failure rate over time; 50% of the network fails after 5 minutes).

Figure 11 presents a typical experiment of a massive failure using the synthetic description. We ran Pastry on our local cluster with 1,500 nodes and, after 5 minutes, trigger a sudden failure of half of the network (750 nodes). This models, for example, the disconnection of an inter-continental link or a WAN link between two corporate LANs. We observe that the number of failed lookups reaches almost 50% after the massive failure due to routing table entries referring to unreachable nodes. Pastry recovers all its routing capabilities in about 5 minutes and we can observe that delays actually decrease after the failure because the population has shrunk (delays are shown for successful routes only). While this scenario is

amongst the simplest ones, churn descriptions allow users to experiment with much more complex scenarios, as discussed in Section 3.2.

Our second experiment is representative of a complex test scenario that would usually involve much engineering, testing and post-processing. We use the churn trace observed in the Overnet file sharing peer-to-peer system [8]. We want to observe the behavior of Pastry, deployed on PlanetLab, when facing churn rates that are way beyond the natural churn rates suffered in PlanetLab. As we want increasing levels of churn, we simply "speed up" the trace, that is, with a speed-up factor of 2, 5 and 10, a minute in the original trace represents 30, 12 and 6 seconds respectively. Figure 12 presents both the churn description and the evolution of delays and failure rates, for increasing levels of churn. The churn description presents the population of nodes and the number of joins/leaves as a function of time, and the performance observations plot the evolution of the delay distribution as a function of time. We observe that (1) Pastry handles churn pretty well as we do not observe a significant failure rate when as much as 14% of the nodes are changing state within a single minute; (2) running this experiment is neither more complex nor longer than on a single cluster without churn, as we did for Figure 8(a). Based on our own experience, we estimate that it takes at least one order of magnitude less human effort to conduct this experiment using SPLAY than with any other deployment tools. We strongly believe that the availability of tools such as SPLAY will encourage the community to further test and deploy their protocols under adverse conditions, and to compare concurrent systems under published churn models.

Figure 13: Deployment times of Pastry for SPLAY, as a function of (1) the number of nodes requested (out of 450 daemons) and (2) the size of the superset of daemons used (110% to 200%).

5.6 Deployment Performance

This section presents an evaluation of the deployment time of an application on an adversarial testbed, PlanetLab. This further conveys our position from Section 3.1 that one needs to initially select a larger set of nodes than requested to ensure that one can rely on reasonably responsive nodes for deploying the application. Traditionally, such a selection process is done by hand, or using simple heuristics based on the load or response time of the nodes. SPLAY relieves the need for the user to proceed with this selection. Figure 13 presents the deployment time for the Pastry application on PlanetLab. We vary


Stabilized

Massive failure: half of the nodes fail

Self-repair

Routing performance in a Pastry DHT

Page 29: SPLAY: Distributed Systems Made Simple

http://www.splay-project.org

Trace-driven churn

• Traces from the OverNet file sharing network [IPTPS03]

• Speed up of 2, 5, 10 (1 minute = 30, 12, 6 seconds)

• Evolution of hit-ratio & delays of Pastry on PlanetLab

Figure 11: Study of the effect of churn on Pastry deployed on PlanetLab. Churn is derived from the trace of the Overnet file sharing system and sped up for increasing volatility (three panels, Churn x2, x5 and x10, showing joins/leaves per minute, node population, route delays and route failure rates over time).

Figure 12: Deployment times of Pastry for SPLAY, as a function of (1) the number of nodes requested (out of 450 daemons) and (2) the size of the superset of daemons used (110% to 200%).

evaluation of a cooperative data distribution algorithm based on parallel trees using both SPLAY and a native (C) implementation on ModelNet and (2) a distributed cooperative Web cache for HTTP accesses, which has been running for several weeks under a constant and significant load.

Dissemination using trees. This experiment compares two versions of a simple cooperative protocol [13] based on parallel n-ary trees written with SPLAY and in C. We create n = 2 distinct trees in the same manner as SplitStream [14] does: each of the 63 nodes is an inner member in one tree and a leaf in the other. The data to be transmitted is split into blocks, which are propagated along one of the 2 trees according to a round-robin policy. This experiment allows us to observe how SPLAY compares against a native application, CRCP, written in C [6]. Using a tree for this comparison bears the advantage of highlighting the additional delays and overheads of the platform and its network libraries (such as the sandboxing of network operations). These overheads accumulate at each level of the tree, from the root to the leaves.

Tests were run in a ModelNet testbed configured with a symmetric bandwidth of 1 Mbps for each node. Results are shown in Figure 13 for binary trees, a 24 MB file, and different block sizes (16 KB, 128 KB, 512 KB). We observe that both implementations produce similar results, which tends to demonstrate that the overhead of SPLAY's language and libraries is negligible. Differences in shape are due to CRCP nodes sending chunks sequentially to their children, while SPLAY nodes send chunks in parallel. In our settings (i.e., homogeneous bandwidth), this should not change the completion time of the last peer as links are saturated at all times.

Figure 13: File distribution using trees (completions over time for SPLAY trees and CRCP trees with 16 KB, 128 KB and 512 KB blocks).

Long-term experiment: cooperative Web cache. Our last experiment presents the performance over time of a cooperative Web cache built using SPLAY following the same base design as Squirrel [21]. This experiment highlights the ability of SPLAY to support long-running applications under constant load. The cache uses our Pastry DHT implementation deployed in a cluster, with 100 nodes that proxy requests and store remote Web resources for speeding up subsequent accesses. For this experiment, we limit the number of entries stored by each node to 100. Cached resources are evicted according to an LRU policy or when they are older than 120 seconds. The cooperative Web cache has been run for three weeks. Figure 14 presents the evolution of the HTTP request delay distribution for a period of 100 hours along with the cache hit ratio. We injected a continuous stream of 100 requests per second extracted from real Web access traces [7] corresponding to 1.7 million hits to 42,000 different URLs. We observe a steady cache hit ratio of 77.6%. The experienced delay distribution has remained stable throughout the whole run of the application. Most accesses (75th percentile) are cached and served in less than 25 to 100 ms, compared to non-cached accesses that require 1 to 2 seconds on average.
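As a rough, plain-Lua illustration of the per-node eviction policy just described (at most 100 entries, LRU eviction, 120-second maximum age); this is only a sketch, not the actual Squirrel-like cache code.

-- Keep at most `capacity` entries; evict the least recently used one
-- when full and treat entries older than `max_age` seconds as expired.
local cache, capacity, max_age = {}, 100, 120

local function cache_get(url)
  local e = cache[url]
  if e and os.time() - e.stored <= max_age then
    e.used = os.time()        -- refresh the LRU timestamp
    return e.body
  end
  cache[url] = nil            -- missing or expired
end

local function cache_put(url, body)
  local count, oldest = 0, nil
  for k, e in pairs(cache) do -- count entries and find the LRU victim
    count = count + 1
    if not oldest or e.used < cache[oldest].used then oldest = k end
  end
  if count >= capacity then cache[oldest] = nil end
  cache[url] = { body = body, stored = os.time(), used = os.time() }
end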

6 Conclusion

SPLAY is an infrastructure that aims at simplifying the development, deployment and evaluation of large-scale distributed applications. It incorporates several novel features not found in existing tools and testbeds. SPLAY applications are specified in a high-level, efficient scripting language very close to the pseudo-code commonly used by researchers in their publications. They execute

Churn x2 Churn x5 Churn x10

Page 30: SPLAY: Distributed Systems Made Simple

Live Demo

www.splay-project.org

Page 31: SPLAY: Distributed Systems Made Simple

Live demo: Gossip-based dissemination

• Also called epidemic dissemination

• A message spreads like the flu among a population

• Robust way to spread a message in a random network

• Two actions: push and pull

• Push: a node chooses f other random nodes to infect

• A newly infected node infects f others in turn

• Done for a limited number of steps (TTL)

• Pull: a node tries to fetch the message from a random node

Page 32: SPLAY: Distributed Systems Made Simple

SOURCE

Every node knows a random set of other nodes.

A node wants to send a message to all nodes.

Page 33: SPLAY: Distributed Systems Made Simple

A first initial push in which nodes that receive the message for the first time forward it to f random neighbors, up to TTL times.

Page 34: SPLAY: Distributed Systems Made Simple

Periodically, nodes pull for new messages from one random neighbor. The message eventually reaches all peers with high probability.

Page 35: SPLAY: Distributed Systems Made Simple

Algorithm 1: A simple push-pull dissemination protocol, on node P.

Constants:
  f: fanout (push: P forwards a new message to f random peers)
  TTL: time to live (push: a message is forwarded to f peers at most TTL times)
  Δ: pull period (pull: P asks for new messages every Δ seconds)

Variables:
  R: set of received message IDs
  N: set of node identifiers

/* Invoked when a message is pushed to node P by node Q */
function Push(Message m, hops)
  if m received for the first time then
    R ← R ∪ {m}                      /* store the message locally */
    if hops > 0 then                 /* propagate the message if required */
      invoke Push(m, hops - 1) on f random peers from N

/* Periodic pull */
thread PeriodicPull()
  every Δ seconds do
    invoke Pull(R, P) on a random node from N

/* Invoked when a node Q requests messages from node P */
function Pull(R_Q, Q)
  foreach m such that m ∈ R and m ∉ R_Q do
    invoke Push(m, 0) on Q

Page 36: SPLAY: Distributed Systems Made Simple

require"splay.base"rpc = require"splay.urpc"

--[[ SETTINGS ]]--

fanout = 3
ttl = 3
pull_period = 3

rpc_timeout = 15

--[[ VARS ]]--

msgs = {}

Loading Base and UDP RPC Libraries.

Set the timeout for RPC operations (i.e., an RPC will return nil after 15 seconds).

function push(q, m)
  if not msgs[m.id] then
    msgs[m.id] = m
    log.print("NEW: "..m.id.." (ttl:"..m.t..") from "..q.ip..":"..q.port)
    m.t = m.t - 1
    if m.t > 0 then
      for _, n in pairs(misc.random_pick(job.nodes, fanout)) do
        log.print("FORWARD: "..m.id.." (ttl:"..m.t..") to "..n.ip..":"..n.port)
        events.thread(function()
          rpc.call(n, {"push", job.me, m}, rpc_timeout)
        end)
      end
    end
  else
    log.print("DUP: "..m.id.." (ttl:"..m.t..") from "..q.ip..":"..q.port)
  end
end

Logging facilities: all logs are available at the controllers, where they are timestamped and stored in the DB.

The RPC call itself is embedded in an inner anonymous function in a new anonymous thread: we do not care about its success (epidemics are naturally robust to message loss). Testing for success would be easy: the second return value is nil in case of failure.

Page 37: SPLAY: Distributed Systems Made Simple

function periodic_pull()
  while events.sleep(pull_period) do
    local q = misc.random_pick_one(job.nodes)
    log.print("PULLING: "..q.ip..":"..q.port)
    local ids = {}
    for id, _ in pairs(msgs) do
      ids[id] = true
    end
    local r = rpc.call(q, {"pull", ids}, rpc_timeout)
    if r then
      for id, m in pairs(r) do
        if not msgs[id] then
          msgs[id] = m
          log.print("NEW REPLY: "..m.id.." from "..q.ip..":"..q.port)
        else
          log.print("DUP REPLY: "..m.id.." from "..q.ip..":"..q.port)
        end
      end
    end
  end
end

SPLAY provides a library for cooperative multithreading, events, scheduling and synchronization. SPLAY also provides various commodity libraries.

(In Lua, arrays and maps are the same type)

Variable r is nil if the RPC times out.

function pull(ids)
  local r = {}
  for id, m in pairs(msgs) do
    if not ids[id] then
      r[id] = m
    end
  end
  return r
end

There is no difference between a local function and one called by a distant node. Even variables can be accessed by a distant node via an RPC! SPLAY does all the work of serialization, network calls, etc. while preserving low footprint and high performance.

Page 38: SPLAY: Distributed Systems Made Simple

function source()
  local m = {id = 1, t = ttl, data = "data"}
  for _, n in pairs(misc.random_pick(job.nodes, fanout)) do
    log.print("SOURCE: "..m.id.." to "..n.ip..":"..n.port)
    events.thread(function()
      rpc.call(n, {"push", job.me, m}, rpc_timeout)
    end)
  end
end

events.loop(function()
  rpc.server(job.me)
  log.print("ME", job.me.ip, job.me.port, job.position)
  events.sleep(15)
  events.thread(periodic_pull)
  if job.position == 1 then
    source()
  end
end)

Calling a remote function.

Embed a function in a separate thread.

Start the RPC server: it is up and running.

Page 39: SPLAY: Distributed Systems Made Simple

• Distributed systems raise a number of issues for their evaluation

• Hard to implement, debug, deploy, tune

• SPLAY leverages Lua and a centralized controller to produce an easy-to-use yet powerful working environment

Take-away Slide

ᔕᕈᒪᐱᓭ
