Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms — Eli Sadovnik and Steven Homberg, Second Annual MIT PRIMES Conference, May 19-20, 2012




Page 1

Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms

Eli Sadovnik and Steven Homberg

Second Annual MIT PRIMES Conference, May 19-20, 2012

Page 2

Introduction

• Shared memory supports concurrent access
  – Read & write interface
• Memory models: single writer, multiple reader (SWMR) and multiple writer, multiple reader (MWMR)
  – Consistency is important
  – Strong consistency provides useful semantics
• Abstraction for message-passing networks
  – Shared memory can be emulated
  – Difficult to do, but solutions exist
  – For example, applications for the Internet, such as Dropbox

Page 3

Our Research Project

THE RAMBO PROJECT
• Framework for emulating shared memory
  – Introduced by Lynch and Shvartsman, extended by Gilbert
  – Implements the MWMR model with strong consistency
  – Designed for dynamic distributed message-passing settings

OUR GOAL
• RAMBO is elegant but not always efficient
• Extend RAMBO with intelligent data management

Page 4

Consistency & Atomicity
• There are many consistency models
• We are interested in atomicity

[Figure: three timelines of a write(8) over an initial value 0, each with three sequential reads:
 – read(3), read(0), read(8) — Violation (Safety)
 – read(8), read(0), read(8) — Violation (Regularity)
 – read(8), read(8), read(8) — Atomicity]
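The three timelines above can be checked mechanically. A minimal sketch, following the slide's labeling, assuming a single write of 8 over an initial value 0 and reads that return sequentially (the function name is illustrative):

```python
def classify(reads, old=0, new=8):
    """Classify a sequence of sequential read results against one write.

    - 'safety violation': a read returned a value that was never written
    - 'regularity violation': a read returned the old value after some
      earlier read already returned the new value
    - 'atomic': reads return only written values, never going backward
    """
    if any(r not in (old, new) for r in reads):
        return "safety violation"
    seen_new = False
    for r in reads:
        if r == new:
            seen_new = True
        elif seen_new:  # old value observed after the new one
            return "regularity violation"
    return "atomic"

print(classify([3, 0, 8]))  # safety violation
print(classify([8, 0, 8]))  # regularity violation
print(classify([8, 8, 8]))  # atomic
```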

Page 5

Emulating Shared Memory

[Diagram: a single WORKING server holding Data: 5 serves three users — User 1 (Reader), User 2 (Writer), and User 3 (Reader) — each of whom sees the value 5.]

Page 6

Weakness of the Centralized Approach

[Diagram: the central server has FAILED; all three users (Reader, Writer, Reader) receive errors and lose access to the data.]

Page 7

Replication in Distributed Setting

[Diagram: the data (5) is replicated across three servers; one server has FAILED, but the two WORKING replicas still serve the value 5 to all three users.]

Page 8

The ABD Algorithm
Hagit Attiya, Amotz Bar-Noy, Danny Dolev

A SWMR algorithm
• Operation-level wait-freedom
  – Termination unaffected by concurrency
• Designed for a message-passing setting
  – Allows limited failures
  – Communication is reliable
  – Messages can be delayed

Page 9

Quorum Systems and ABD

• ABD is a quorum-based algorithm
  – A quorum system is a collection of intersecting sets
• For example, a voting majority quorum system
• Data is replicated in a quorum system
  – Quorum system members are networked servers
• Guarantee of atomicity
  – Quorum intersection and read/write protocols
• Reads must write! (…sometimes, as we will see later)
  – A reader must write the latest data
  – Writer cannot be trusted to complete
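The defining property of a quorum system — any two quorums intersect — is easy to demonstrate for the voting-majority example mentioned above. A minimal sketch (the helper name is illustrative):

```python
from itertools import combinations

def majority_quorums(servers):
    """All subsets of size floor(n/2)+1 form a majority quorum system."""
    k = len(servers) // 2 + 1
    return [set(q) for q in combinations(servers, k)]

servers = ["s1", "s2", "s3", "s4", "s5"]
quorums = majority_quorums(servers)  # every 3-of-5 subset

# The key property: any two quorums share a server, so a reader's
# quorum always overlaps the quorum the last writer contacted.
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
print(len(quorums))  # 10, i.e. C(5,3)
```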

Page 10

Phased Read/Write Protocols

[Diagram: replicas grouped into overlapping quorums Q1 and Q2.]

User 2 writes its data, a 5, to quorum Q1.

Page 11

Phased Read/Write Protocols

[Diagram: the members of Q1 now hold the 5; quorums Q1 and Q2 overlap at one server.]

User 1 queries quorum Q2, sees the latest data is a 5, and writes that back to the computer that does not have the latest data.

Page 12

Data Versions & Timestamps

[Diagram: some replicas still hold the old version 5,t=1 while others already hold the new version 7,t=2 written by User 2.]

Timestamps allow us to distinguish among different versions of the data.
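The phased protocol from the preceding slides can be sketched with timestamped replicas. A minimal single-object sketch (names like `Server` and `abd_read` are illustrative, not taken from the ABD paper):

```python
class Server:
    """A replica holding one (value, timestamp) pair."""
    def __init__(self):
        self.value, self.ts = 0, 0

    def read(self):
        return self.value, self.ts

    def write(self, value, ts):
        # Keep only strictly newer versions.
        if ts > self.ts:
            self.value, self.ts = value, ts

def abd_read(query_quorum, propagate_quorum):
    # Phase 1: query a quorum, take the highest-timestamped version.
    value, ts = max((s.read() for s in query_quorum), key=lambda vt: vt[1])
    # Phase 2: write that version back so later readers see it too.
    for s in propagate_quorum:
        s.write(value, ts)
    return value

servers = [Server() for _ in range(5)]
# Suppose a writer reaches only Q1 = servers[0:3] before stopping.
for s in servers[:3]:
    s.write(5, 1)
# A reader using Q2 = servers[2:5] still sees the 5 via the overlap,
# and propagates it to the servers that missed it.
assert abd_read(servers[2:], servers[2:]) == 5
assert servers[4].ts == 1
```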

Page 13

Data Versions & Timestamps

[Diagram: after propagation, every replica in Q1 and Q2 holds 7,t=2.]

Page 14

Quorum Viability

[Diagram: the servers of one quorum have FAILED; all of the users' requests return errors.]

A weakness of the ABD algorithm is that it depends on a quorum of servers always being viable. When no quorum is available, operations are blocked.

Page 15

The RAMBO Framework (Reconfigurable Atomic Memory for Basic Objects)

Seth Gilbert, Nancy Lynch, Alexander Shvartsman

Page 16

Quorum Reconfiguration

[Diagram: the old quorum system, with one FAILED server, is replaced by a new, fully WORKING quorum system with quorums Q1 and Q2.]

RAMBO uses quorum reconfiguration to ensure service longevity.

A new quorum system (a new set of servers) is installed to replace the old ones, allowing progress in spite of failures.

Page 17

Replica Transfer

[Diagram: the replicas 7,t=2 are copied from the old configuration's servers to the new configuration's empty servers.]

After a new set of servers is installed, these servers do not have any information.

The replica information (copies of data) must be transferred to the new configuration.

Page 18

Garbage Collection

[Diagram: the new configuration now holds 7,t=2; the old configuration, including its FAILED server, is being phased out.]

After information is transferred to the new servers, the old servers are phased out of use.

This process is called "garbage collection".

The mechanism for garbage collection has two phases and is analogous to read/write operations (introduced in the next slides).

Page 19

Read/Write Operations

[Diagram: User 1's read contacts both the old and the new configuration; every contacted replica returns 7,t=2.]

What if reads and writes occur during reconfiguration?

Concurrent operations contact all existing configurations to ensure the latest information is accessed.

Multi-Configuration Access

Page 20

Read/Write Operations

Old configurations need to be removed from use.

Ongoing read/write operations use their existing configuration knowledge. New operations ignore the old configuration.

[Diagram: User 1's read completes using its known configurations while the old configuration is garbage collected; the remaining replicas hold 7,t=2.]

Garbage Collection

Page 21

Research Questions

Q1: Can a reader (respectively, writer) avoid contacting configurations that it learned have been marked as garbage collected?

Q2: When can a reader avoid its second phase, and can a reader propagate selectively?

Q3: Can we propagate to the most recent configuration only?

Page 22

Concurrent Garbage Collection (Q1)

[Diagram: a numbered trace (steps 1–7) of a read that spans a reconfiguration — the reader queries the active configurations, learns mid-operation that the old configuration was garbage collected, obtains 7,t=2 from the new configuration, and returns 7.]

We believe that the garbage collected configuration can in fact be ignored, because the reader learns of that configuration's information regardless.

Page 23

Improved Configuration Management (Q1)

• Authors of RAMBO conjecture that operations must contact all configurations discovered during the query (respectively, propagate) phase.
• Communicating with configurations learned to be garbage collected mid-operation is unnecessary
  – A garbage collected configuration is discovered mid-operation from another server
  – That server knows a tag at least as recent as any tag known in the old configurations
• IMPACT: improves operation liveness
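The proposed rule can be sketched as a simple filter applied before each round of an in-flight operation (a sketch with illustrative names; the real RAMBO bookkeeping is richer):

```python
def configurations_to_contact(known_configs, gc_learned):
    """Configurations an in-flight operation still needs to contact.

    known_configs: ordered list of configuration ids the operation has
    discovered so far. gc_learned: ids some server reported as garbage
    collected mid-operation. The conjectured justification: such a
    server already carries a tag at least as recent as any tag in the
    garbage collected configuration, so dropping it loses no information.
    """
    return [c for c in known_configs if c not in gc_learned]

# Without the optimization the operation would contact all three.
print(configurations_to_contact(["c1", "c2", "c3"], {"c1"}))  # ['c2', 'c3']
```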

Page 24

Improved Bookkeeping (Q2)

[Diagram: the reader's query phase returns 7,t=2 from every contacted server in the quorum.]

After querying, the reader learns that a majority of nodes already has the up-to-date information, making propagation needless.

Page 25

Semi-Fast Read Operations (Q2)

• Read operations always propagate
  – Regardless of the actual replica dissemination
  – Redundant messages and slow operation
• The proposed solution
  – During the query phase, the reader records the latest timestamps of the servers with which it communicated
  – The reader contacts only servers that are not up to date
  – Sometimes this allows omitting the propagation phase entirely ("semi-fast" read operations)
• IMPACT: improves operation latency and reduces communication costs
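The bookkeeping described above can be sketched as follows: the reader records each queried server's timestamp, then propagates only to the stragglers, skipping the phase entirely when a full quorum is already current (a sketch with illustrative names, not RAMBO's actual interface):

```python
def semi_fast_read(quorum_size, query_replies):
    """Decide what a reader must do after its query phase.

    query_replies: {server_id: (value, ts)} gathered in the query phase.
    Returns (value to read, set of servers that still need the latest
    version propagated to them); an empty set means the propagation
    phase can be omitted entirely — a "semi-fast" read.
    """
    value, ts = max(query_replies.values(), key=lambda vt: vt[1])
    stale = {sid for sid, (_, t) in query_replies.items() if t < ts}
    if len(query_replies) - len(stale) >= quorum_size:
        # A full quorum already holds the latest version.
        return value, set()
    return value, stale

# 3-of-5 majority quorums; every queried server is already current.
value, to_contact = semi_fast_read(3, {i: (7, 2) for i in range(5)})
assert value == 7 and to_contact == set()

# One straggler still holds the old version: propagate only to it.
value, to_contact = semi_fast_read(3, {0: (7, 2), 1: (7, 2), 2: (5, 1)})
assert value == 7 and to_contact == {2}
```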

Page 26

Overly Extensive Propagation (Q3)

[Diagram: a writer propagates 7,t=2 to all servers in every active configuration, old and new alike.]

Currently, RAMBO both queries and propagates to all active configurations. In fact, just the query phase covering all active configurations is sufficient for atomicity.

Page 27

Propagate to the Latest Configuration (Q3)

• We believe it is not necessary to propagate to any configuration but the last active configuration.
• Properties of configuration information:
  – All configurations are totally ordered.
  – Configurations have a forward link.
  – Discovery is faster than reconfiguration.
• Operations query all active configurations
• IMPACT: reduces communication cost

Page 28

Summary

• Algorithmic optimizations
• Opportunistic benefits
  – A clear advantage when
    • Servers gossip, and
    • Configurations have members in common
• Changes are minimally intrusive
  – Modest increase in bookkeeping and the size of messages

Page 29

Future Work

• Formal reasoning
  – Use the Input/Output Automata framework to demonstrate that the new changes preserve the consistency guarantees of RAMBO
• Simulation
  – Use the TEMPO toolkit to simulate RAMBO executions and build confidence in our proofs
• Empirical experiments
  – Augment the existing implementations of RAMBO and collect behavior data on PlanetLab

Page 30

Special Thanks to:
The MIT PRIMES Program

Supervisor: Prof. Nancy Lynch

Mentor: Dr. Peter Musial