Scalability and Replication Marco Serafini COMPSCI 532 Lecture 13


Page 1

Scalability and Replication
Marco Serafini
COMPSCI 532, Lecture 13

Page 2

Scalability

Page 3

Scalability
• Ideal world
  • Linear scalability
• Reality
  • Bottlenecks
    • For example: central coordinator
• When do we stop scaling?

[Figure: speedup vs. parallelism, with an ideal (linear) curve and a flattening “reality” curve.]

Page 4

Scalability
• Capacity of a system to improve performance by increasing the amount of resources available
  • Typically, resources = processors
• Strong scaling
  • Fixed total problem size, more processors
• Weak scaling
  • Fixed per-processor problem size, more processors
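The two notions can be made concrete with a small sketch. The runtimes and helper names (`speedup`, `efficiency`) below are illustrative, not from the lecture; for strong scaling we fix the problem size and report speedup T1/Tp and parallel efficiency speedup/p:

```rust
// Strong-scaling bookkeeping on made-up runtimes: speedup = T1 / Tp,
// efficiency = speedup / p. Ideal (linear) scalability keeps efficiency at 1.
fn speedup(t1: f64, tp: f64) -> f64 {
    t1 / tp
}

fn efficiency(t1: f64, tp: f64, p: u32) -> f64 {
    speedup(t1, tp) / p as f64
}

fn main() {
    // Fixed total problem size (strong scaling), runtimes at 1..8 processors.
    let runs = [(1u32, 100.0), (2, 55.0), (4, 32.0), (8, 21.0)];
    let t1 = runs[0].1;
    for &(p, tp) in &runs {
        println!("p={:>2}  speedup={:.2}  efficiency={:.2}",
                 p, speedup(t1, tp), efficiency(t1, tp, p));
    }
}
```

Weak scaling would instead grow the total problem with p and compare per-processor times, which is why the two regimes must not be conflated when reading a scaling plot.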

Page 5

Scaling Up and Out
• Scaling up
  • More powerful server (more cores, memory, disk)
  • Single server (or fixed number of servers)
• Scaling out
  • Larger number of servers
  • Constant resources per server

Page 6

Scalability! But at what COST?

Frank McSherry (Unaffiliated), Michael Isard (Microsoft Research), Derek G. Murray (Unaffiliated*)

Abstract
We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.

We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all of their reported configurations.

1 Introduction
“You can have a second computer once you’ve shown you know how to use the first one.”
- Paul Barham

The published work on big data systems has fetishized scalability as the most important feature of a distributed data processing platform. While nearly all such publications detail their system’s impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?

Contrary to the common wisdom that effective scaling is evidence of solid systems building, any system can scale arbitrarily well with a sufficient lack of care in its implementation. The two scaling curves in Figure 1 present the scaling of a Naiad computation before (system A) and after (system B) a performance optimization is applied. The optimization, which removes parallelizable overheads, damages the apparent scalability despite resulting in improved performance in all configurations. While this may appear to be a contrived example, we will

* Derek G. Murray was unaffiliated at the time of his involvement, but is now employed by Google Inc.

Figure 1: Scaling and performance measurements for a data-parallel algorithm, before (system A) and after (system B) a simple performance optimization. The unoptimized implementation “scales” far better, despite (or rather, because of) its poor performance.

argue that many published big data systems more closely resemble system A than they resemble system B.

1.1 Methodology

In this paper we take several recent graph processing papers from the systems literature and compare their reported performance against simple, single-threaded implementations on the same datasets using a high-end 2014 laptop. Perhaps surprisingly, many published systems have unbounded COST—i.e., no configuration outperforms the best single-threaded implementation—for all of the problems to which they have been applied.

The comparisons are neither perfect nor always fair, but the conclusions are sufficiently dramatic that some concern must be raised. In some cases the single-threaded implementations are more than an order of magnitude faster than published results for systems using hundreds of cores. We identify reasons for these gaps: some are intrinsic to the domain, some are entirely avoidable, and others are good subjects for further research.

We stress that these problems lie not necessarily with the systems themselves, which may be improved with time, but rather with the measurements that the authors provide and the standard that reviewers and readers demand. Our hope is to shed light on this issue so that future research is directed toward distributed systems whose scalability comes from advances in system design rather than poor baselines and low expectations.


Page 7

What Does This Plot Tell You?

[Same excerpt of the COST paper as the previous slide, with Figure 1 as the plot in question.]

Page 8

How About Now?

[Same excerpt of the COST paper, shown once more.]

Page 9

COST
• Configuration that Outperforms a Single Thread (COST)
• The number of cores after which the system achieves a speedup over 1 core

scalable system        cores   twitter   uk-2007-05
GraphLab               128     242s      714s
GraphX                 128     251s      800s
Single thread (SSD)    1       153s      417s
Union-Find (SSD)       1       15s       30s

Table 5: Times for various connectivity algorithms.

fn UnionFind(graph: GraphIterator) {
    let mut root = Vec::from_fn(graph.nodes, |x| x);
    let mut rank = Vec::from_elem(graph.nodes, 0u8);
    graph.map_edges(|mut x, mut y| {
        while (x != root[x]) { x = root[x]; }
        while (y != root[y]) { y = root[y]; }
        if x != y {
            match rank[x].cmp(&rank[y]) {
                Less    => { root[x] = y; },
                Greater => { root[y] = x; },
                Equal   => { root[y] = x; rank[x] += 1; },
            }
        }
    });
}

Figure 4: Union-Find with weighted union.
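The listing above is the paper’s pre-1.0 Rust and no longer compiles. A minimal modern re-sketch of the same weighted union-find (not the authors’ code; a plain edge list stands in for their `GraphIterator`, and path halving is added inside `find`):

```rust
// Union-find with union by rank and path halving. Each node starts as its
// own root; union merges the two trees, keeping the deeper one on top.
struct UnionFind {
    root: Vec<usize>,
    rank: Vec<u8>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        UnionFind { root: (0..n).collect(), rank: vec![0u8; n] }
    }

    fn find(&mut self, mut x: usize) -> usize {
        while x != self.root[x] {
            self.root[x] = self.root[self.root[x]]; // path halving
            x = self.root[x];
        }
        x
    }

    fn union(&mut self, x: usize, y: usize) {
        let (x, y) = (self.find(x), self.find(y));
        if x == y { return; }
        match self.rank[x].cmp(&self.rank[y]) {
            std::cmp::Ordering::Less => self.root[x] = y,
            std::cmp::Ordering::Greater => self.root[y] = x,
            std::cmp::Ordering::Equal => { self.root[y] = x; self.rank[x] += 1; }
        }
    }
}

fn main() {
    let mut uf = UnionFind::new(6);
    for &(x, y) in &[(0, 1), (1, 2), (3, 4)] {
        uf.union(x, y);
    }
    // {0,1,2} and {3,4} are connected components; 5 is isolated.
    assert_eq!(uf.find(0), uf.find(2));
    assert_ne!(uf.find(2), uf.find(3));
    assert_ne!(uf.find(4), uf.find(5));
    println!("components verified");
}
```

The algorithmic content is unchanged: one pass over the edges, near-constant work per edge, and no per-vertex message passing, which is what lets one thread beat the "think like a vertex" systems in Table 5.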

[…] implementations of label propagation, faster than the fastest of them (the single-threaded implementation) by over an order of magnitude.

There are many other efficient algorithms for computing graph connectivity, several of which are parallelizable despite not fitting in the “think like a vertex” model. While some of these algorithms may not be the best fit for a given distributed system, they are still legitimate alternatives that must be considered.

4 Applying COST to prior work
Having developed single-threaded implementations, we now have a basis for evaluating the COST of systems. As an exercise, we will retrospectively apply these baselines to the published numbers for existing scalable systems, even though the single-threaded implementations are on more modern hardware.

4.1 PageRank
Figure 5 presents the published scaling information from PowerGraph (GraphLab) [7], GraphX [8], and Naiad [14], as well as two single-threaded measurements as horizontal lines. The intersection with the upper line indicates the point at which the system outperforms a simple resource-constrained implementation, and is a suitable baseline for systems with similar limitations (e.g., GraphChi and X-Stream). The intersection with the

Figure 5: Published scaling measurements for PageRank on twitter_rv. The first plot is the time per warm iteration. The second plot is the time for ten iterations from a cold start. Horizontal lines are single-threaded measurements.

lower line indicates the point at which the system outperforms a feature-rich implementation, including pre-processing and sufficient memory, and is a suitable baseline for systems with similar resources (e.g., GraphLab, Naiad, and GraphX).

From these curves we would say that Naiad has a COST of 16 cores for PageRanking the twitter_rv graph. Although not presented as part of their scaling data, GraphLab reports a 3.6s measurement on 512 cores, and achieves a COST of 512 cores. GraphX does not intersect the corresponding single-threaded measurement, and we would say it has unbounded COST.
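Reading a COST off a plot like Figure 5 is mechanical: given the single-threaded baseline and a system’s scaling curve, find the smallest configuration that beats the baseline. A sketch with made-up measurements (the `cost` helper and all numbers are illustrative, not from the paper):

```rust
// COST: the smallest core count at which the scalable system beats the
// single-threaded baseline, or None (unbounded COST) if it never does.
fn cost(baseline_secs: f64, curve: &[(u32, f64)]) -> Option<u32> {
    curve.iter()
        .filter(|&&(_, secs)| secs < baseline_secs)
        .map(|&(cores, _)| cores)
        .min()
}

fn main() {
    // Hypothetical scaling curve: (cores, seconds) for some system.
    let curve = [(1u32, 1200.0), (4, 400.0), (16, 140.0), (64, 60.0), (512, 20.0)];
    assert_eq!(cost(150.0, &curve), Some(16)); // crosses the baseline at 16 cores
    assert_eq!(cost(15.0, &curve), None);      // never crosses: unbounded COST
    println!("ok");
}
```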

4.2 Graph connectivity
The published works do not have scaling information for graph connectivity, but given the absolute performance of label propagation on the scalable systems relative to single-threaded union-find we are not optimistic that such scaling data would have led to a bounded COST.

Instead, Figure 6 presents the scaling of two Naiad implementations of parallel union-find [12], the same examples from Figure 1. The two implementations differ in their storage of per-vertex state: the slower one uses hash tables where the faster one uses arrays. The faster implementation has a COST of 10 cores, while the slower implementation has a COST of roughly 100 cores.

The use of hash tables is the root cause of the factor of ten increase in COST, but it does provide some value: node identifiers need not lie in a compact set of integers. This evaluation makes the trade-off clearer to both system implementors and potential users.

5 Lessons learned
Several aspects of scalable systems design and implementation contribute to overheads and increased COST. The computational model presented by the system restricts the programs one may express. The target hard-


Page 10

Possible Reasons for High COST
• Restricted API
  • Limits algorithmic choice
  • Makes assumptions
    • MapReduce: no memory-resident state
    • Pregel: the program can be specified as “think like a vertex”
  • BUT it also simplifies programming
• Lower-end nodes than the laptop
• Implementation adds overhead
  • Coordination
  • Cannot use application-specific optimizations

Page 11

Why Not Just a Laptop?
• Capacity
  • Large datasets and complex computations don’t fit in a laptop
• Simplicity, convenience
  • Nobody ever got fired for using Hadoop on a cluster
• Integration with the toolchain
  • Example: ETL → SQL → graph computation on Spark

Page 12

Disclaimers
• Graph computation is peculiar
  • Some algorithms are computationally complex…
  • …even for small datasets
  • Good use case for single-server implementations
• Similar observations hold for machine learning

Page 13

Replication

Page 14

Replication
• Pros
  • Good for reads: can read any replica (if consistent)
  • Fault tolerance
• Cons
  • Bad for writes: must update multiple replicas
  • Coordination needed for consistency

Page 15

Replication Protocol
• Mediates client-server communication
• Ideally, clients cannot “see” replication

[Diagram: each client talks to a replication agent; the agents run a replication protocol among the replicas.]

Page 16

Consistency Properties
• Strong consistency
  • All operations take effect in some total order in every possible execution of the system
  • Linearizability: the total order respects real-time ordering
  • Sequential consistency: a total order is sufficient
• Weak consistency
  • We will talk about that in another lecture
• Many other semantics

Page 17

What to Replicate?
• Read-only objects: trivial
• Read-write objects: harder
  • Need to deal with concurrent writes
  • Only the last write matters: previous writes are overwritten
• Read-modify-write objects: very hard
  • Current state is a function of the history of previous requests
  • We consider deterministic objects

Page 18

Fault Assumptions
• Every fault-tolerant system is based on a fault assumption
• We assume that up to f replicas can fail (crash)
• The total number of replicas is determined based on f
• If the system has more than f failures, there are no guarantees

Page 19

Synchrony Assumptions
• Consider the following scenario
  • Process s sends a message to process r and waits for a reply
  • The reply from r does not arrive at s before a timeout
• Can s assume that r has crashed?
  • We call a system asynchronous if we do not make this assumption
  • Otherwise we call it (partially) synchronous
    • This is because we make additional assumptions on the speed of round trips

Page 20

Distributed Shared Memory (R/W)
• Simple case
  • 1 writer client, m reader clients
  • n replicas, up to f faulty ones
  • Asynchronous system
• Clients send messages to all n replicas and wait for n-f replies (otherwise they may hang forever waiting for crashed replicas)
• Q: How many replicas do we need to tolerate 1 fault?
• A: 2 are not enough
  • The writer and each reader can only wait for 1 reply (otherwise they block forever if a replica crashes)
  • The writer and the readers may therefore contact disjoint sets of replicas

Page 21

Quorum Intersection
• To tolerate f faults, use n = 2f+1 replicas
• Writes and reads wait for replies from a set of n-f = f+1 replicas (i.e., a majority), called a majority quorum
• Two majority quorums always intersect!

[Diagram: the writer sends w(v) to all replicas and waits for n-f acks; a reader sends r to all replicas and waits for n-f replies.]
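The intersection claim can be checked exhaustively for small f: two sets of size f+1 drawn from n = 2f+1 replicas must overlap, because (f+1) + (f+1) = 2f+2 > n. A small sketch (my own encoding of replica sets as bitmasks, not from the lecture):

```rust
// Enumerates every majority quorum (size f+1) of n = 2f+1 replicas as a
// bitmask over replica IDs, then checks that all pairs share a replica.
fn majority_quorums(f: u32) -> Vec<u32> {
    let n = 2 * f + 1;
    (0u32..1 << n).filter(|m| m.count_ones() == f + 1).collect()
}

fn all_pairs_intersect(f: u32) -> bool {
    let quorums = majority_quorums(f);
    // Non-zero AND of two bitmasks means the quorums share a replica.
    quorums.iter().all(|&a| quorums.iter().all(|&b| a & b != 0))
}

fn main() {
    for f in 1..=3 {
        let count = majority_quorums(f).len();
        assert!(all_pairs_intersect(f));
        println!("f = {}: all {} majority quorums pairwise intersect", f, count);
    }
}
```

This intersection is exactly what rules out the "disjoint sets of replicas" failure mode from the previous slide: with n = 2f+1, any write quorum and any read quorum share at least one replica that has seen the write.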

Page 22

Consistency is Expensive
• Q: How do we get linearizability?
• A: The reader needs to write back to a quorum

Protocol sketch (from the figure): the writer sends w(v,t) to all replicas and waits for n-f acks; a replica updates its pair (vi,ti) to (v,t) only if t > ti. A reader (1) sends r to all replicas, (2) waits for n-f replies (vi,ti), (3) writes back the pair (vi,ti) with the maximum ti, and (4) waits for n-f acks before returning.

Reference: Attiya, Bar-Noy, Dolev. “Sharing memory robustly in message-passing systems”
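A toy sketch of this read/write-back scheme, with in-memory replicas standing in for message passing (no real network or failures; the quorum index sets are passed in explicitly to make the overlaps visible; all names are mine, not from the Attiya, Bar-Noy, Dolev paper):

```rust
// Single-writer register over n = 2f+1 replicas; each stores (timestamp, value).
type Replica = (u64, i64);

fn store(rep: &mut Replica, t: u64, v: i64) {
    if t > rep.0 { *rep = (t, v); } // accept only newer timestamps
}

// Write phase: tag the value with a fresh timestamp, push to a quorum.
fn write(replicas: &mut [Replica], quorum: &[usize], t: u64, v: i64) {
    for &i in quorum { store(&mut replicas[i], t, v); }
}

// Read: collect from a quorum, pick the max timestamp, then WRITE BACK to a
// quorum so that later reads cannot observe an older value.
fn read(replicas: &mut [Replica], read_q: &[usize], wb_q: &[usize]) -> i64 {
    let (t, v) = read_q.iter().map(|&i| replicas[i]).max().unwrap();
    for &i in wb_q { store(&mut replicas[i], t, v); }
    v
}

fn main() {
    // n = 3, f = 1: any 2 replicas form a majority quorum. Initial value 4.
    let mut reps: Vec<Replica> = vec![(0, 4); 3];
    write(&mut reps, &[0, 1], 1, 5);            // write reaches replicas {0,1}
    let v1 = read(&mut reps, &[1, 2], &[1, 2]); // overlaps the write quorum at 1
    let v2 = read(&mut reps, &[0, 2], &[0, 2]); // sees 5 via the write-back at 2
    assert_eq!((v1, v2), (5, 5));
    println!("both reads return 5");
}
```

Without the write-back in `read`, the second read could return the stale value 4, which is precisely the scenario the next slide rules out.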

Page 23

Why Write Back?
• We want to avoid the following scenario
  • Assume the initial value is v = 4
  • Writer: write(v = 5)
  • Reader 1: read(v) → 5
  • Reader 2 (starting after Reader 1’s read returns): read(v) → 4
• No valid total order that respects real-time order exists in this execution

Page 24

State Machine Replication (SMR)
• Read-modify-write objects
• Assume a deterministic state machine
• Consistent sequence of inputs (consensus)

[Diagram: concurrent client requests R1, R2, R3 go through consensus, which produces a consistent decision on a sequential execution order (e.g., R2, R1, R3); each replica’s state machine (SM) executes that same sequence, yielding consistent outputs.]
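The key point is that a deterministic state machine fed the same decided sequence produces the same state on every replica. A toy sketch (the counter with Add/Mul requests is illustrative, chosen because the operations do not commute, so the agreed order matters):

```rust
// SMR in miniature: a deterministic state machine applied, on every replica,
// to the same sequential order decided by consensus.
#[derive(Clone, Copy)]
enum Request { Add(i64), Mul(i64) }

fn apply(state: i64, r: Request) -> i64 {
    match r {
        Request::Add(x) => state + x,
        Request::Mul(x) => state * x,
    }
}

fn main() {
    // Suppose consensus decided this sequential execution order.
    let decided = [Request::Add(2), Request::Mul(10), Request::Add(1)];
    // Three replicas each execute the same log from the same initial state...
    let outputs: Vec<i64> = (0..3)
        .map(|_| decided.iter().fold(0i64, |s, &r| apply(s, r)))
        .collect();
    // ...so all replicas end in the same state: (0 + 2) * 10 + 1 = 21.
    assert_eq!(outputs, vec![21, 21, 21]);
    println!("all replicas computed {}", outputs[0]);
}
```

Had two replicas applied Add(2) and Mul(10) in opposite orders, their states would diverge (21 vs. 3), which is why SMR needs consensus on the order, not just delivery of the requests.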

Page 25

Impossibility Result
• Fischer, Lynch, Paterson (FLP) result
  • “It is impossible to reach distributed consensus in an asynchronous system with one faulty process” (because fault detection is not accurate)
• Implication: practical consensus protocols are
  • Always safe: they never allow an inconsistent decision
  • Live (terminating) only in periods when additional synchrony assumptions hold; in periods when these assumptions do not hold, the protocol may stall and make no progress

Page 26

Leader Election
• Consider the following scenario
  • There are n replicas, of which up to f can fail
  • Each replica has a pre-defined unique ID
• Simple leader election protocol
  • Periodically, every T seconds, each replica sends a heartbeat to all other replicas
  • If a replica p does not receive a heartbeat from a replica r within T + D seconds from the last heartbeat from r, then p considers r faulty (D = maximum assumed message delay)
  • Each replica considers as leader the non-faulty replica with the lowest ID
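The failure-detector rule above can be written as a pure function: replica p suspects r when r’s last heartbeat is older than T + D, and elects the lowest-ID unsuspected replica. All IDs and timings below are illustrative, not from the lecture:

```rust
// Returns the leader as seen by one replica at time `now`: the lowest-ID
// replica whose last heartbeat is at most T + D old, or None if all are
// suspected. Replica IDs are the indices into `last_heartbeat`.
fn leader(now: u64, last_heartbeat: &[u64], t: u64, d: u64) -> Option<usize> {
    last_heartbeat.iter()
        .enumerate()
        .find(|&(_, &hb)| now.saturating_sub(hb) <= t + d)
        .map(|(id, _)| id)
}

fn main() {
    let (t, d) = (5, 2); // heartbeat period T, maximum assumed delay D
    // Last heartbeat times for replicas 0..3 as seen by p at time now = 20.
    let hb = [10u64, 15, 18, 19];
    assert_eq!(leader(20, &hb, t, d), Some(1)); // replica 0 suspected: 20-10 > 7
    assert_eq!(leader(30, &hb, t, d), None);    // everyone suspected
    println!("leader at t=20: replica 1");
}
```

Note that the answer depends on each replica’s local view of heartbeat arrival times, which is exactly why replicas can temporarily disagree on the leader, as the next slide explains.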

Page 27

Eventual Single Leader Assumption
• Typically, the system respects the synchrony assumption
  • All heartbeats take at most D to arrive
  • All replicas elect the same leader
• In the remaining asynchronous periods
  • Some heartbeats might take more than D to arrive
  • Replicas might disagree over who is faulty and who is not
  • Different replicas might see different leaders
• Eventually, all replicas see a single leader
  • Asynchronous periods are glitches that are limited in time

Page 28

The Paxos Protocol• Paxos is a consensus protocol

• All replicas start with their own proposal• In SMR, a proposal is a batch of requests the replica has received from clients, ordered according to the order in which the replica received them

• Eventually, all replicas decide the same proposal
• In SMR, this is the batch of requests to be executed next

• Paxos terminates when there is a single leader
• The assumption is that eventually there will be a single leader

• Paxos potentially stalls when there are multiple leaders
• But it prevents divergent decisions during these asynchronous periods

Page 29: Scalability and Replication

Paxos (Simplified)
• Newly elected leader picks a unique ballot number b; it has its own proposed value v
• Leader sends read(b) to all replicas
• Replica: if it has previously accepted a proposal (vi, bi) and b > bi, it (1) replies with (vi, bi) and (2) promises not to accept messages with ballot < b; if it has no prior accepted proposal, it replies with ack
• Leader waits for n-f replies; if some reply is (vi, bi), it sets v to the vi with the highest bi, then sends proposal (v, b)
• Replica: (1) accepts (v, b) unless this breaks a promise; (2) if it accepts, it replies ack
• Leader waits for n-f acks, then decides on v and broadcasts the decision

Reference: L. Lamport. “Paxos Made Simple”

If progress gets stuck (not enough replies), the leader picks a larger ballot number and restarts the protocol. Eventually, there will be a single leader with a large enough ballot number that completes all the steps.
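The two phases above can be sketched as a single-decree round in Python. This is a toy simulation under strong assumptions (messages are function calls, no network, no failures during the round); the class and message names are illustrative, not Lamport's terminology:

```python
class Acceptor:
    """Single-decree Paxos acceptor, following the simplified slide."""
    def __init__(self):
        self.promised = -1      # highest ballot promised so far
        self.accepted = None    # (value, ballot) of last accepted proposal, or None

    def read(self, b):
        # Phase 1: promise not to accept ballots < b; report prior accepted proposal
        if b > self.promised:
            self.promised = b
            return ('promise', self.accepted)
        return ('nack', None)

    def propose(self, v, b):
        # Phase 2: accept (v, b) unless it breaks an earlier promise
        if b >= self.promised:
            self.promised = b
            self.accepted = (v, b)
            return 'ack'
        return 'nack'

def run_round(acceptors, my_value, ballot, f):
    """One leader round: returns the decided value, or None if stuck."""
    quorum = len(acceptors) - f
    # Phase 1: send read(b), wait for n-f promises
    replies = [a.read(ballot) for a in acceptors]
    promises = [acc for kind, acc in replies if kind == 'promise']
    if len(promises) < quorum:
        return None   # stuck: a real leader would retry with a larger ballot
    # Adopt the previously accepted value with the highest ballot, if any
    prior = [acc for acc in promises if acc is not None]
    value = max(prior, key=lambda vb: vb[1])[0] if prior else my_value
    # Phase 2: send proposal (value, b), wait for n-f acks
    acks = sum(1 for a in acceptors if a.propose(value, ballot) == 'ack')
    return value if acks >= quorum else None
```

The key invariant from the next slide is visible here: once a first leader gets value 'A' chosen, a later leader with a higher ballot is forced to adopt 'A' from the phase-1 replies instead of its own value.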

Page 30: Scalability and Replication

Properties
• Definition of chosen proposal (v, b):
• Accepted by a majority of replicas at a given point in time

• Proposal (v, b) decided by one replica ⇒ (v, b) chosen at some point in time

• Invariant
• Once (v, b) is chosen, future proposals (v’, b’) from different leaders such that b’ > b have v’ = v
• Note that proposals from old leaders cannot overwrite the ones from newer leaders

Page 31: Scalability and Replication

Typical Applications of Paxos
• State machine replication is hard
• Hard to implement: consensus is only one of the problems
• Writing deterministic applications on top of SMR is hard

• Typical approach: use a system that uses consensus
• Storage systems
• Coordination services to keep system metadata

• Google Chubby lock server uses Paxos
• Apache ZooKeeper uses a variant of Paxos
• ZooKeeper is used by Apache HBase, Kafka, …

Page 32: Scalability and Replication

Transactions


Page 33: Scalability and Replication

How About Multiple Objects?
• Transaction: read and modify multiple objects

begin txn
  write z = 2
  read x
  read y
  if x > y
    write y = x
    commit
  else
    abort
end txn
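The same transaction can be run against a real storage system. A sketch using Python's built-in SQLite (the in-memory database, the `kv` table name, and the initial values are assumptions for illustration):

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so we control the transaction boundaries explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [("x", 5), ("y", 3), ("z", 0)])

conn.execute("BEGIN")                                     # begin txn
conn.execute("UPDATE kv SET val = 2 WHERE key = 'z'")     # write z = 2
x = conn.execute("SELECT val FROM kv WHERE key = 'x'").fetchone()[0]  # read x
y = conn.execute("SELECT val FROM kv WHERE key = 'y'").fetchone()[0]  # read y
if x > y:
    conn.execute("UPDATE kv SET val = ? WHERE key = 'y'", (x,))       # write y = x
    conn.execute("COMMIT")    # both writes become visible atomically
else:
    conn.execute("ROLLBACK")  # abort: neither write is visible
```

Commit makes both the write to z and the write to y visible together; after a rollback, neither is, which is exactly the atomicity guarantee discussed on the next slide.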

Page 34: Scalability and Replication

ACID Properties
• Guarantees of a storage system / DBMS
• Atomicity: all or nothing
• Consistency: respect application invariants (e.g. balance > 0)
• Isolation: transactions run as if there were no concurrency
• Durability: committed transactions are persisted

• Consistency here has a different meaning!
• Consistency with single objects relates to Isolation with transactions

Page 35: Scalability and Replication

Isolation Levels
• Serializability: total order of transactions
• Strict serializability: total + real-time order
• Snapshot isolation
• Read from consistent snapshots
• Writes only visible inside transaction until commit
• Abort if writes conflict

• Many others

Page 36: Scalability and Replication

Distributed Transactions
• Transactions on objects on different nodes
• Typically expensive
• Two-phase commit protocol

• Voting (prepare) phase
• Coordinator sends query
• Participants execute and send back vote (commit or abort) to coordinator

• Commit phase
• Coordinator waits for replies from all participants
• If all participants vote commit, the coordinator sends a commit request to the participants; otherwise it sends an abort request
• Participants send an acknowledgement; the coordinator terminates the transaction

• Comments
• Simplified description: abstracted away logging to disk
• Q: fault tolerant?
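The two phases can be sketched directly. As on the slide, this abstracts away disk logging and failures, so it shows the message flow only; the class and function names are illustrative:

```python
from enum import Enum

class Vote(Enum):
    COMMIT = 1
    ABORT = 2

class Participant:
    """Toy 2PC participant: votes in the prepare phase, then applies
    the coordinator's decision. No logging, no failures."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = 'init'

    def prepare(self):
        # Voting phase: execute locally (elided) and send back a vote
        self.state = 'prepared' if self.can_commit else 'aborted'
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def decide(self, decision):
        # Commit phase: apply the coordinator's decision and acknowledge
        self.state = decision
        return 'ack'

def two_phase_commit(participants):
    """Coordinator: phase 1 collects votes, phase 2 broadcasts the decision."""
    votes = [p.prepare() for p in participants]
    decision = 'committed' if all(v == Vote.COMMIT for v in votes) else 'aborted'
    acks = [p.decide(decision) for p in participants]
    assert all(a == 'ack' for a in acks)   # coordinator terminates the txn
    return decision
```

A single participant voting abort forces the whole transaction to abort, which is why the protocol blocks on the slowest (or failed) participant; that is the fault-tolerance question the slide ends on.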