Copyright
by
Nalini Moti Belaramani
2009
The Dissertation Committee for Nalini Moti Belaramani certifies that this is the approved version of the following dissertation:
Policy Architecture for Distributed
Storage Systems
Committee:
Mike Dahlin, Supervisor
Lorenzo Alvisi
Mohamed Gouda
Lili Qiu
Petros Maniatis
Policy Architecture for Distributed
Storage Systems
by
Nalini Moti Belaramani, B.Eng.; M.Phil.
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
August 2009
To Mom, Dad, and Kiran.
asato ma satgamaya,
tamaso ma jyotirgamaya,
mrityor ma amritamgamaya.
From untruth to truth,
From darkness to light,
From death to immortality.
Acknowledgments
So many people have helped make my dream a reality. This dis-
sertation would not have been possible were it not for the following people.
My adviser, Mike Dahlin. I have the utmost respect and gratitude for
my adviser. Mike is probably the best adviser a Ph.D. student can have. What
I admire most about Mike is his quest for perfection in everything he does—be
it coding, writing, managing time, or managing students. I am grateful for his
gentle prodding that has helped me push my limits and achieve what I have.
Thank you Mike for imparting your knowledge to me.
My graduate committee members, who were great to work with. Their
questions and comments have helped me to understand the value and the
pitfalls of my work better. And of course, Petros Maniatis for the technical
discussions that have helped shape my work and for always encouraging me
to brag about my work.
Jiandan Zheng, Robert Soule, and Robert Grimm with whom collabo-
ration was easy and rewarding.
The LASR lab, which is full of a cool bunch of people who sincerely
want every one to succeed. I thank everyone for sitting through my numerous
practice talks, reading my drafts, and just lending me ears when I wanted
to talk things out. The support staff, which is phenomenal. If it weren’t
for Gloria Ramirez, Sara Strandtman, and Lydia Griffith, I would be lost in
the bureaucratic maze of the UT-system. The technical support staff is also
wonderful to work with. I would have been handicapped without their help in
setting up and managing machines for experiments.
I have been lucky to meet a great set of friends who really care about
me and have, in one way or the other, made sure I do not lock myself in a hole.
They include: Edmund Wong for introducing me to gastronomic delights that
make stress disappear into thin air, for feeding me, for sharing my frustrations,
for answering my questions, and most importantly, for brightening up my day
with his energy. Amol Nayate, Joseph Modayil, Prem Melville, Shilpa Chhada,
and Misha Bilenko for taking me under their wings, for helping me adapt to
the new culture I found myself in, and for being my family away from home.
Richa Gargh and Yannis Korkolis for always being there when I need to fill
my people quota, when I want to celebrate, or when I just want to bum
around. The Brown Trash family for making me feel that I belong, for hosting
awesome pot-lucks, and for being such great dance party crashers! My Hong
Kong friends for being my personal cheerleaders, for not letting distance come
in the way of friendships, and for always believing that I can achieve great
things.
In the final stretch, when all I did was write my dissertation, I thank
the following persons for helping me maintain my sanity: Alexandre Dumas
and Stephenie Meyer, who gave me my daily dose of drama through their
works, the Count of Monte Cristo and the Twilight series. DJ Tiesto, whose
weekly podcast helped me drown out the rest of the world in techno beats and
just focus on writing.
Lastly, I thank the four most important persons in my life:
Jason Brandt, for being there for me, sharing my happiness and frus-
trations, making sure I am well-fed and well-slept, for bearing with my silly
and foul moods with a laugh. Thank you Jason for being in my life and for
being you!
My mom, my dad, and my sister Kiran, for staying by my side through
thick and thin, cheering me on, and for always believing in me. They have a
big part in shaping who I am today and what I have achieved. Despite being
far away, you are close to my heart. I love you.
Policy Architecture for Distributed
Storage Systems
Nalini Moti Belaramani, Ph.D.
The University of Texas at Austin, 2009
Supervisor: Mike Dahlin
Distributed data storage is a building block for many distributed sys-
tems such as mobile file systems, web service replication systems, enterprise file
systems, etc. New distributed data storage systems are frequently built as new
environments, requirements, or workloads emerge. The goal of this dissertation
is to develop the science of distributed storage systems by making it easier
to build new systems. In order to achieve this goal, it proposes a new policy
architecture, PADS, that is based on two key ideas: first, by providing a set of
common mechanisms in an underlying layer, new systems can be implemented
by defining policies that orchestrate these mechanisms; second, policy can be
separated into routing and blocking policy, each of which addresses a different part of the
system design. Routing policy specifies how data flow among nodes in order
to meet performance, availability, and resource usage goals, whereas blocking
policy specifies when it is safe to access data in order to meet consistency and
durability goals.
This dissertation presents a PADS prototype that defines a set of dis-
tributed storage mechanisms that are sufficiently flexible and general to sup-
port a large range of systems, a small policy API that is easy to use and cap-
tures the right abstractions for distributed storage, and a declarative language
for specifying policy that enables quick, concise implementations of complex
systems.
We demonstrate that PADS is able to significantly reduce development
effort by constructing a dozen significant distributed storage systems spanning
a large portion of the design space over the prototype. We find that each
system required only a couple of weeks of implementation effort and a few
dozen lines of policy code.
Table of Contents
Acknowledgments v
Abstract viii
List of Tables xvi
List of Figures xviii
Chapter 1. Introduction 1
1.0.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2. Need for a General Framework 9
2.1 Large Design Space . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 PRACTI Taxonomy . . . . . . . . . . . . . . . . . . . . 12
2.2 Need for New Storage Systems . . . . . . . . . . . . . . . . . . 16
2.3 Difficulty in Building a New Storage System . . . . . . . . . . 17
2.4 Need for a General Framework . . . . . . . . . . . . . . . . . . 19
Chapter 3. The PADS Approach 20
3.1 Separation of Mechanisms and Policy . . . . . . . . . . . . . . 20
3.2 Separation of Policy into Routing and Blocking . . . . . . . . . 21
3.3 PADS Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Using PADS . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 4. Routing Policy 28
4.1 What is Routing? . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2 Subscriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Event-driven API . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.2 Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.3 Stored Events . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 R/OverLog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.1 R/OverLog Language Semantics . . . . . . . . . . . . . 40
4.4.1.1 Tuples, Events and Data Types . . . . . . . . . 40
4.4.1.2 Tables and Facts . . . . . . . . . . . . . . . . . 41
4.4.1.3 Timing Events . . . . . . . . . . . . . . . . . . 42
4.4.1.4 Rules . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.1.5 Aggregates . . . . . . . . . . . . . . . . . . . . 44
4.4.1.6 Functions . . . . . . . . . . . . . . . . . . . . . 45
4.4.1.7 Watch Statements . . . . . . . . . . . . . . . . 45
4.4.2 R/OverLog Policies . . . . . . . . . . . . . . . . . . . . 46
4.4.3 R/OverLog vs. OverLog . . . . . . . . . . . . . . . . . . 47
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 5. Blocking Policy 49
5.1 What is Blocking? . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Blocking Points . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Blocking Conditions . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.1 Monotonic-read coherence . . . . . . . . . . . . . . . . . 57
5.4.2 Delta Coherence . . . . . . . . . . . . . . . . . . . . . . 57
5.4.3 Causal Consistency . . . . . . . . . . . . . . . . . . . . 58
5.4.4 Sequential Consistency . . . . . . . . . . . . . . . . . . 59
5.4.5 Linearizability . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.6 TACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4.7 Two-phase Commit . . . . . . . . . . . . . . . . . . . . 64
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Chapter 6. The Mechanisms Layer 68
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.1 Basic Model . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Local Storage Module . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.1 Update Log and Object Store . . . . . . . . . . . . . . . 71
6.2.2 Local Access API . . . . . . . . . . . . . . . . . . . . . 72
6.3 Communication Module . . . . . . . . . . . . . . . . . . . . . . 73
6.3.1 Basic Subscriptions . . . . . . . . . . . . . . . . . . . . 75
6.3.2 Separate Invalidation and Body Subscriptions. . . . . . 75
6.3.3 Supporting Flexible Consistency . . . . . . . . . . . . . 76
6.3.4 State-based Synchronization . . . . . . . . . . . . . . . 78
6.3.5 Supporting Efficient Dynamic Subscriptions . . . . . . . 78
6.3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4 Consistency Bookkeeping Module . . . . . . . . . . . . . . . . 81
6.4.1 Dimensions of Local Consistency State . . . . . . . . . . 81
6.4.2 Local Consistency State . . . . . . . . . . . . . . . . . . 82
6.4.3 Updating the State . . . . . . . . . . . . . . . . . . . . . 85
6.5 Conflict Detection . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 7. Bridging the Gap 92
7.1 Mechanisms API . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.2 Making it Easier to Use . . . . . . . . . . . . . . . . . . . . . . 96
7.2.1 R/M Interface . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.2 R/OverLog Runtime . . . . . . . . . . . . . . . . . . . . 101
7.2.3 BlockingPolicyModule (BPM) . . . . . . . . . . . . . . . 104
7.2.4 BlockingRoutingBridge (BRB) . . . . . . . . . . . . . . 106
7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Chapter 8. Evaluation Part 1: Microbenchmarks 110
8.1 Fundamental Overheads . . . . . . . . . . . . . . . . . . . . . 111
8.2 Quantifying the Constants . . . . . . . . . . . . . . . . . . . . 116
8.3 Absolute Performance . . . . . . . . . . . . . . . . . . . . . . . 122
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Chapter 9. Evaluation 2: Case-studies 126
9.1 Evaluation Overview . . . . . . . . . . . . . . . . . . . . . . . 127
9.2 Architectural Equivalence . . . . . . . . . . . . . . . . . . . . . 129
9.3 Case-studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
9.3.1 Simple Client Server (SCS) . . . . . . . . . . . . . . . . 132
9.3.1.1 System Overview . . . . . . . . . . . . . . . . . 133
9.3.1.2 P-SCS Implementation . . . . . . . . . . . . . . 134
9.3.2 Full Client Server (FCS) . . . . . . . . . . . . . . . . . . 137
9.3.3 Coda . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
9.3.3.1 System Overview . . . . . . . . . . . . . . . . . 139
9.3.3.2 P-Coda Implementation . . . . . . . . . . . . . 139
9.3.4 TRIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.3.4.1 P-TRIP Implementation . . . . . . . . . . . . . 141
9.3.5 Bayou . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
9.3.5.1 P-Bayou Implementation . . . . . . . . . . . . . 143
9.3.6 Chain Replication . . . . . . . . . . . . . . . . . . . . . 144
9.3.6.1 P-Chain-Replication Implementation . . . . . . 145
9.3.7 TierStore . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.3.7.1 System Overview . . . . . . . . . . . . . . . . . 148
9.3.7.2 P-TierStore Implementation . . . . . . . . . . . 148
9.3.8 Pangaea . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
9.3.8.1 P-Pangaea Implementation . . . . . . . . . . . 151
9.4 Properties of Constructed Systems . . . . . . . . . . . . . . . . 153
9.4.1 Realism . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
9.4.2 Agility . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.4.3 Absolute Performance . . . . . . . . . . . . . . . . . . . 160
9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Chapter 10. Related Work 163
10.1 Replication Protocols . . . . . . . . . . . . . . . . . . . . . . . 163
10.2 Declarative Domain Specific Languages . . . . . . . . . . . . . 167
10.3 Conflict Detection . . . . . . . . . . . . . . . . . . . . . . . . . 169
Chapter 11. Limits and Future Work 172
11.1 Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
11.1.1 PADS Model . . . . . . . . . . . . . . . . . . . . . . . . 173
11.1.2 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 176
11.1.3 Current Implementation Limits . . . . . . . . . . . . . . 178
11.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
11.2.1 Streamlined Implementation . . . . . . . . . . . . . . . 181
11.2.2 Formal Model and Verification . . . . . . . . . . . . . . 183
11.2.3 Self-tuning File System . . . . . . . . . . . . . . . . . . 185
11.2.4 New Applications . . . . . . . . . . . . . . . . . . . . . 186
Chapter 12. Conclusion 188
12.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
12.2 Research Experience . . . . . . . . . . . . . . . . . . . . . . . 190
Appendices 192
Appendix A. Code Listings 193
A.1 Simple Client Server (P-SCS) . . . . . . . . . . . . . . . . . . 193
A.1.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 193
A.1.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 195
A.2 Full Client Server (P-FCS) . . . . . . . . . . . . . . . . . . . . 196
A.2.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 196
A.2.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 200
A.3 P-Coda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.3.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 200
A.3.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 203
A.3.3 Co-operative Caching . . . . . . . . . . . . . . . . . . . 203
A.4 P-TRIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
A.4.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 204
A.4.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 205
A.4.3 Hierarchical topology . . . . . . . . . . . . . . . . . . . . 205
A.5 P-Bayou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
A.5.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 206
A.5.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 206
A.6 P-ChainReplication . . . . . . . . . . . . . . . . . . . . . . . . 207
A.6.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 207
A.6.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 211
A.7 P-TierStore . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.7.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 211
A.7.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 212
A.8 P-Pangaea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
A.8.1 Routing Policy . . . . . . . . . . . . . . . . . . . . . . . 213
A.8.2 Blocking Policy . . . . . . . . . . . . . . . . . . . . . . . 217
Bibliography 218
Vita 233
List of Tables
1.1 Features covered by case-study systems. . . . . . . . . . . . . 5
2.1 Partial list of distributed storage systems published in the past 25 years. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Storage systems design space and examples . . . . . . . . . . . 11
2.3 Approximate lines of code of several existing distributed data storage systems. . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Routing triggers provided by PADS. . . . . . . . . . . . . . . . 33
4.2 Routing actions provided by PADS. . . . . . . . . . . . . . . . 34
4.3 PADS’s stored events interface. . . . . . . . . . . . . . . . . . 36
5.1 Conditions available for defining blocking predicates. . . . . . 53
6.1 Components of invalidations and bodies . . . . . . . . . . . . 72
6.2 Mapping between blocking conditions and local consistency state maintained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1 Actions available to policy . . . . . . . . . . . . . . . . . . . . 93
7.2 Triggers sent from the underlying layer to the controller. . . . 94
7.3 Flags available for specifying consistency . . . . . . . . . . . . 95
7.4 Flags provided by the BlockingPolicyModule. . . . . . . . . . 105
8.1 Network overheads associated with subscription primitives. . . 111
8.2 Storage and network overheads under network disruptions. . . 115
8.3 Messages per interested update sent by a coherence-only system and PADS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.4 Read/write performance for 1KB objects/files in milliseconds. 122
8.5 Read/write performance for 100KB objects/files in milliseconds. 122
8.6 Performance numbers for processing NULL trigger to produce NULL event. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.1 Read and write latencies in milliseconds for Coda and P-Coda. 160
List of Figures
2.1 PR-AC-TI taxonomy for classifying families of storage systems. 13
3.1 PADS approach to system development. . . . . . . . . . . . . 24
6.1 Internal structures of the mechanisms layer. . . . . . . . . . . 69
6.2 Diagram comparing the messages sent on invalidation streams without and with multiplexing. . . . . . . . . . . . . . . . . . 79
6.3 Pseudocode for processing received invalidations. . . . . . . . . 86
6.4 Pseudocode for processing other received messages on invalidation streams. . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.1 Bridging the mechanisms and policy layers. . . . . . . . . . . . 97
7.2 Internal queues in the R/M Interface . . . . . . . . . . . . . . 98
7.3 Major components of the R/OverLog Runtime. . . . . . . . . 101
7.4 Blocking Policy Module intercepting operations. . . . . . . . . 104
7.5 Blocking policy implementation for causal consistency. . . . . 105
7.6 Blocking policy implementation for the simple client-server system (refer to Section 9.3.1) . . . . . . . . . . . . . . . . . . . . 107
8.1 Network bandwidth cost to synchronize 1000 10KB files, 100 of which are modified. . . . . . . . . . . . . . . . . . . . . . . . . 116
8.2 Bandwidth to synchronize 500 objects of size 3KB . . . . . . . 118
8.3 Bandwidth to subscribe to varying number of single-object interest sets with and without subscription multiplexing. . . . . 120
8.4 Bandwidth to subscribe to varying number of updates to 500 object sets for checkpoint and log synchronization. . . . . . . 121
9.1 Demonstration of full client-server system, P-FCS, under failures. 157
9.2 Demonstration of TierStore under a workload similar to that in Figure 9.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9.3 Average read latency of P-Coda and P-Coda with cooperative caching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Chapter 1
Introduction
Distributed data storage is a building block for many distributed sys-
tems such as mobile file systems, web service replication systems, enterprise
file systems, etc. Often, these systems take advantage of data replication (i.e.
storing copies of data on multiple nodes) to provide availability, durability,
and performance guarantees.
However, there are tradeoffs inherent in any replication-based system
design. For example, the CAP dilemma [30] states that replication systems
cannot provide strong Consistency and high Availability in a network prone
to Partitions. Also, there is a fundamental tradeoff between Performance and
Consistency (known as the PC dilemma [55]). Because of these tradeoffs, it
is difficult to design a single replication system that satisfies the needs of all
environments and workloads. Every replication system needs to make tradeoffs
between availability, consistency, and performance to suit the environment it
is targeting. As new workloads, environments, and technologies emerge, new
tradeoffs need to be made and new systems need to be developed.
For example, consider the current technology trends: increasing popu-
larity of devices such as smartphones, set-top boxes and cameras, with ade-
quate storage and networking capabilities, and the availability of cheap always-
available cloud storage. Users would want a distributed storage system that
allows them to share data (such as photos, music, and files) among their per-
sonal devices and the cloud, so that they can access the latest version of their data
from any device. Due to the mobility and the resource limitations of the de-
vices, the system needs to support arbitrary synchronization topologies, allow
different devices to store different subsets of data, and take advantage of the
various networking technologies for greater energy-efficiency.
Unfortunately, building a new storage system is a huge task. It can take
months or even years to build such a system from scratch. Modifying existing
systems to accommodate new requirements may still require a lot of effort.
On the other hand, using current state-of-the-art replication frameworks may
not be a viable option because they are limited in the range of systems they
support (refer to Chapter 2).
The goal of this thesis is to make it easier to design and build dis-
tributed storage systems so that new systems can be quickly developed to
address the needs of new environments. In order to achieve this goal, this the-
sis proposes a new policy architecture, PADS, with a new set of abstractions
for building distributed storage systems. This work provides both practical
and scientific benefits. Its practical benefits include significant reduction in
effort required to build new distributed storage systems and increased flexibil-
ity allowing constructed systems to quickly adapt to new workloads. As for
scientific benefits, this work enhances the understanding of distributed storage
systems by identifying the right set of building blocks.
PADS is based on two key ideas: first, the separation of mechanisms and
policy, and second, the separation of policy into routing and blocking concerns.
The first idea is based on the observation that there is large overlap in
the low-level details of various replication systems. Every system stores data,
transfers updates, and maintains bookkeeping information. The differences
between the systems are the choices they make to accommodate their design
tradeoffs, i.e. where data items are stored, how and when information is
propagated, and what consistency and durability semantics to guarantee. If
all the necessary primitives were provided by a common mechanisms layer, a
system can be built by specifying policies that orchestrate these primitives.
The second idea separates the design of a system into routing policy
and blocking policy.
• Routing policy: Many of the design choices of distributed storage systems
are simply routing decisions about data flows between nodes. These
decisions provide answers to questions such as: “When and where to
send updates?” or “Which node to contact on a read miss?”. They
largely determine how a system meets its performance, availability, and
resource consumption goals.
• Blocking policy: Blocking policy specifies predicates for when nodes must
block incoming updates or local read/write requests to maintain system
invariants. Blocking is important for meeting consistency and durability
goals. For example, a policy might block the completion of a write until
the update reaches at least 3 other nodes.
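As a concrete illustration of this blocking idea, the minimal sketch below shows a write blocking until the durability invariant above holds. The class and method names are hypothetical and are not part of the PADS API.

```java
// Illustrative sketch only: a durability invariant ("update reaches at least 3 other
// nodes") expressed as a barrier that a write blocks on. Names are hypothetical.
final class DurabilityBarrier {
    private final int requiredAcks;   // e.g., 3 other nodes must acknowledge the update
    private int acks = 0;             // acknowledgments received so far

    DurabilityBarrier(int requiredAcks) {
        this.requiredAcks = requiredAcks;
    }

    // Called when a remote node acknowledges that it has stored the update.
    synchronized void ackReceived() {
        acks++;
        notifyAll();
    }

    // Completion of the local write blocks here until the invariant holds.
    synchronized void awaitDurability() throws InterruptedException {
        while (acks < requiredAcks) {
            wait();
        }
    }
}
```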
Using the PADS approach has several advantages: First, designers focus
on high-level design rather than on low-level implementation. As a result,
development time is greatly reduced. Designers can quickly build systems
with new innovative techniques for new environments as they emerge. Second,
because the underlying framework provides mechanisms to take care of diffi-
cult details such as failure recovery and consistency enforcement, designers
can handle corner-cases easily. Third, because the underlying framework is
general, it allows developers to quickly adapt existing systems if requirements
change. Fourth, since it is easier to informally review and formally verify
higher-level specification, this approach facilitates the development of correct
and deadlock-free systems.
However, the main challenge in designing PADS is to ensure that (1) it
is sufficiently general to support a wide range of systems including client-server
systems, object-replication systems and server-replication systems, (2) it has
an API that is easy to learn and allows easy specification of common replication
techniques such as demand caching, leases, and callbacks, (3) it makes it easy
to reason about corner-cases such as failure detection and recovery, and (4) it
is sufficiently efficient so that systems do not have to pay a high cost for PADS’s
generality.
[Table 1.1 compares the case-study systems built on PADS (Simple Client Server, Full Client Server, Coda [49], Coda + Coop Cache, TRIP [64], TRIP + Hier, TierStore [26], TierStore + CC, Chain Repl [88], Bayou [69], Bayou + Small Dev, and Pangaea [77]) along the following features: number of routing rules and blocking conditions, topology, replication (full or partial), demand caching, prefetching, cooperative caching, consistency semantics, callbacks, leases, invalidation vs. whole-update propagation, disconnected operation, crash recovery, and object store / file system interface.]
Table 1.1: Features covered by case-study systems. Each column corresponds to a system implemented on PADS, and the rows list the set of features covered by the implementation. *Note that the original implementations of some systems provide interfaces that differ from the object store or file system interfaces we provide in our prototypes.
To test the usefulness of PADS, we built a PADS prototype as an
instantiation of the architecture. The prototype consists of three parts: First,
it defines a common set of mechanisms that are general enough to build a
wide range of systems. Second, it separates system policy into two parts:
routing policy that defines how data flows between nodes and blocking policy
that defines when access to data should block. This dichotomy provides a
separation of concerns for designers allowing them to deal with one part of the
design at a time. Third, it distills the main abstractions of distributed storage
systems into a small API and supports declarative policy specification making
it easy for designers to write complex policy concisely.
Ultimately, the evidence of PADS’s benefits is simple: we were able
to build 12 diverse systems (summarized in Table 1.1) in 4 months. These
systems cover a large part of the design space suggesting that PADS is suf-
ficiently general to support new systems. In addition, PADS’s ease of use is
demonstrated by the fact that each system was constructed in less than a hun-
dred lines of policy code. Despite the conciseness, every constructed system is
concrete and deployable—the implementations handle challenging real-world
issues such as configuration and crash recovery. In addition, we find it easy
to add new features to PADS-constructed systems. For example, we added co-
operative caching to P-Coda, our implementation of Coda [49], in only one
week.
Our experience provides evidence of PADS’s generality, ease of use,
concreteness, and ease of adaptability. This flexibility comes only at a modest
cost to absolute performance. Microbenchmark performance of an implemen-
tation of one system (P-Coda) built on our user-level Java PADS prototype is
within ten to fifty percent of the original system (Coda [49]) in the majority
of cases and 3.3 times worse in the worst case we measure.
1.0.1 Contributions
By defining a new policy architecture, this dissertation makes the fol-
lowing contributions:
• It provides a new way to build distributed storage systems that signifi-
cantly reduces development effort.
• It defines the set of mechanisms required for a common underlying layer
over which a wide range of systems can be constructed.
• It proposes an intuitive and structured dichotomy for policy specification.
• It defines a policy API that is small yet sufficient for easy specification
of complex systems.
• It provides a prototype with which designers can build new storage sys-
tems quickly with modest performance overheads.
• It provides ready-to-use prototypes of 12 diverse systems that cover a
large part of the design space.
This dissertation is organized as follows: Chapter 2 makes a case for
why a new framework is needed for building storage systems. It looks at the
design space covered by distributed storage systems, current approaches for
building a system and where they fall short, and defines the requirements for
a new framework. Chapter 3 provides a high-level discussion of the PADS
approach and the prototype. Chapters 4 and 5 discuss the routing policy and
blocking policy respectively including their requirements, specification and im-
plementation. Chapter 6 details the implementation of the mechanisms layer
and how it meets the requirements of the policy layer. Chapter 7 details the
interface glue between the mechanisms and the policy layers. Chapters 8 and
9 evaluate the benefits of PADS for system development. Chapter 8 evaluates
the performance of PADS with various microbenchmarks and Chapter 9 eval-
uates the benefits of PADS as a development platform by describing systems
constructed with PADS. Chapter 10 presents some related work. Chapter 11
discusses the limitations of the architecture and lays out directions for future
work. We summarize the main contributions and the lessons learned from this
dissertation in Chapter 12.
Chapter 2
Need for a General Framework
This chapter establishes the need for a general framework to build dis-
tributed storage systems. First, it explores the design range which replication-
based storage systems cover and introduces a taxonomy to classify them. Next,
it provides reasons for why new distributed storage systems will continue to
be needed. Then, it describes the tremendous amount of effort required to
build a new storage system and how current frameworks fall short in terms of
generality. Finally, it defines the requirements for an ideal framework.
2.1 Large Design Space
Time Frame: Systems
Pre-1985: NFS [68], Replicated Dictionary [90]
1985-1994: AFS [42], Coda [49], Deceit [80], Ficus [36], Sprite [65], XFS [89], Zebra [38]
1995-2004: Bayou [86], BlueFS [66], Chain Replication [88], CFS [21], Farsite [12], Google File System [29], Ivy [63], OceanStore [50], Pangaea [77], PAST [76], Segank [82], TACT [43]
2005-present: Cimbiosys [70], Dynamo [25], Omnistore [46], TierStore [26], WinFS [59], WheelFS [84]
Table 2.1: Partial list of distributed storage systems published in the past 25 years.
Data storage is the basic building block for many distributed systems
such as mobile file systems, web service replication systems, enterprise file
systems, and grid replication systems. Distributed storage systems often use
data replication in order to provide durability, availability, and performance
guarantees. In fact, over the years, a significant number of replication-based
distributed storage systems have been built. Table 2.1 lists a subset of
systems that have been published in a few major conferences over the
last 25 years.
Because distributed storage systems need to deal with a wide range of
heterogeneity in terms of devices with diverse capabilities (e.g., phones, set-
top-boxes, laptops, servers), workloads (e.g., streaming media, interactive web
services, private storage, widespread sharing, demand caching, preloading),
connectivity (e.g., wired, wireless, disruption tolerant), and environments (e.g.,
mobile networks, wide area networks, developing regions), they cover a wide
design space—systems employ different techniques in order to accommodate
the requirements, workloads, or environment they target. For example, some
systems may replicate all data on every node whereas others may replicate
specific subsets on each node. Also, systems propagate updates via different
paths (i.e. update propagation topology) and implement different propaga-
tion techniques such as demand caching, prefetching, and co-operative caching. In
addition, different systems may guarantee different levels of availability and
consistency that are enforced by different means. For example, one system
may guarantee data availability at all times at the expense of consistency
whereas another may use callbacks and leases to implement stronger consis-
Design Choices / Options: Example Systems
Replication
  Full Replication: Bayou [69], Chain Replication [88], TRIP [64], WinFS [67]
  Partial Replication: BlueFS [66], Coda [49], Sprite [65], Pangaea [77], TierStore [26]
Update Topology
  Client-Server: BlueFS [66], Coda [49], TRIP [64]
  Structured: Chain Replication [88], Sprite [65], TierStore [26], WheelFS [84]
  Ad-hoc: Bayou [69], IceCube [47], Pangaea [77]
Propagation Type
  Operation: Bayou [69], IceCube [47]
  State (Update only): Ficus [36], TACT [43], TierStore [26], WheelFS [84]
  State (Update and Invalidation): BlueFS [66], Coda [49], Fluid Replication [20]
Consistency Guarantees
  Strong: Chain Replication [88], TRIP [64]
  Causal: Bayou [69], TACT [43]
  Eventual: Ficus [36], Pangaea [77], WinFS [67]
Commit Policy
  Implicit: Pangaea [77], TierStore [26]
  Primary-commit: Bayou [69], Coda [49]
Availability
  Always Available: Bayou [69], TierStore [26], Pangaea [77]
  Only When Connected: AFS [42], Coda [49], Chain Replication [88]
Durability
  Local Sufficient: Bayou [69], TierStore [26], Pangaea [77]
  Multi-node: Coda [49], Chain Replication [88], WheelFS [84]
Table 2.2: Storage systems design space and examples
tency semantics and be unavailable in cases of network partitions. Another
design difference between systems is when a system considers an update to be
durable. For some systems, an update is sufficiently durable if it is persistent
to local disk, whereas other systems may require an update to be persistent
on multiple nodes to be considered durable. Table 2.2 provides a partial list
of the design space covered by distributed storage systems.
2.1.1 PRACTI Taxonomy
In order to aid the understanding of the design space covered by replication-
based systems, we define the PRACTI taxonomy. The taxonomy is based on
three properties:
• Partial Replication (PR): a system can place any subset of data and
metadata on any node. In contrast, some systems require a node to
maintain copies of all data in the system [69, 90, 94].
• Arbitrary Consistency (AC): a system can provide both strong and weak
consistency guarantees and only applications that require strong guaran-
tees pay for them. In contrast, some systems can only enforce relatively
weak coherence guarantees but make no guarantees about stronger con-
sistency properties [36, 77].
• Topology Independence (TI): any node can exchange updates with any
other node. In contrast, many systems restrict communication to client-
server [42, 49, 65] or hierarchical [17, 92] patterns.
[Figure 2.1 shows three overlapping properties: Partial Replication (PR), Any Consistency (AC), and Topology Independence (TI), with client-server, hierarchical, DHT, object replication, and server replication system families placed according to the properties they provide.]
Figure 2.1: PR-AC-TI taxonomy for classifying families of storage systems.
Every distributed storage system can be categorized according to which
of the PR-AC-TI properties it provides, as illustrated in Figure 2.1. As the
figure indicates, many existing storage systems can be viewed as belonging to
one of four protocol families, each providing at most two of the PR-AC-TI
properties.
Server replication systems like Replicated Dictionary [90] and Bayou [69]
provide log-based peer-to-peer update exchange that allows any node to send
updates to any other node (TI) and that consistently orders writes across all
objects. Lazy Replication [51] and TACT [94] use this approach to provide
a wide range of tunable consistency guarantees (AC). Some server replication
systems like Chain Replication [88] do not assume topology independence and
allow updates to propagate via fixed paths. Unfortunately, server replication
protocols fundamentally assume full replication: all nodes store all data from
any volume they export and all nodes receive all updates. As a result, these
systems are unable to exploit workload locality to efficiently use networks and
storage, and they may be unsuitable for devices with limited resources.
Client-server systems like Sprite [65] and Coda [49] and hierarchical
caching systems like hierarchical AFS [62] permit nodes to cache arbitrary sub-
sets of data (PR). Although specific systems generally enforce a set consistency
policy, a broad range of consistency guarantees are provided by variations of
the basic architecture (AC). However, these protocols fundamentally require
communication to flow between a child and its parent. Even when systems
permit limited client-client communication for cooperative caching [14, 23, 28],
they must still serialize control messages at a central server for consistency [18].
These restricted communication patterns (1) hurt performance when network
topologies do not match the fixed communication topology or when network
costs change over time (e.g., in environments with mobile nodes), (2) hurt
availability when a network path or node failure disrupts a fixed communi-
cation topology, and (3) limit sharing during disconnected operation when a
set of nodes can communicate with one another but not with the rest of the
system.
DHT-based storage systems such as BH [87], PAST [76], and CFS [21]
implement a specific—if sophisticated—topology and replication policy: they
can be viewed as generalizations of client-server systems where the server is
split across a large number of nodes on a per-object or per-block basis for scal-
ability and replicated to multiple nodes for availability and reliability. This
division and replication, however, introduce new challenges for providing con-
sistency. For example, the Pond OceanStore prototype assigns each object
to a set of primary replicas that receive all updates for the object, uses an
agreement protocol to coordinate these servers for per-object coherence, and
does not attempt to provide cross-object consistency guarantees [73].
Object replication systems such as Ficus [36], Pangaea [77], and WinFS [59]
allow nodes to choose arbitrary subsets of data to store (PR) and arbitrary
peers with whom to communicate (TI). But, these protocols enforce no or-
dering constraints on updates across multiple objects, so they can provide
coherence but not consistency guarantees. Unfortunately, reasoning about
the corner cases of consistency protocols is complex, so systems that provide
only weak consistency or coherence guarantees can complicate constructing,
debugging, and using the applications built over them. Furthermore, support
for only weak consistency may prevent deployment of applications with more
stringent requirements.
Our analysis indicates that there is a fundamental challenge in providing
all three PR-AC-TI properties: supporting flexible consistency (AC) requires
careful ordering of how updates propagate through the system, but consistent
ordering becomes more difficult if nodes communicate in ad-hoc patterns (TI)
or if some nodes know about updates to some objects but not other objects
(PR). Existing systems resolve this dilemma in one of three ways. The full
replication of AC-TI replicated server systems ensures that all nodes have
enough information to order all updates. Restricted communication in PR-AC
client-server and hierarchical systems ensures that the root server of a subtree
can track what information is cached by descendants; the server can then
determine which invalidations it needs to propagate down and which it can
safely omit. Finally, PR-TI object replication systems simply give up the ability to
consistently order writes to different objects and instead allow inconsistencies.
2.2 Need for New Storage Systems
Despite the plethora of existing systems, new systems are constantly
developed to meet the needs of new environments and workloads. The main
reason is that there are fundamental tradeoffs inherent in every storage system
design. For example, the CAP dilemma [30] states that replication systems
cannot provide strong Consistency and high Availability in a network prone to
Partitions. Also, designs often need to make tradeoffs between Performance
and Consistency (PC dilemma [55]). Because of these tradeoffs, every system
makes design choices that balance performance, resource usage, consistency
and availability to best meet the needs of the workload or environment it is
targeting. As new environments and new workloads emerge, they often require
different tradeoffs that existing systems cannot accommodate, creating a need
for new systems.
System: Lines of Code
Coda [49]: 167,564
TierStore [26]: 22,566
Pangaea [77]: 30,000
Bayou [69]: 20,425
TRIP [64]: 16,909
Table 2.3: Approximate lines of code of several existing distributed data storage systems.
2.3 Difficulty in Building a New Storage System
Unfortunately, building a new storage system is not easy. A common
approach to build a storage system is to implement it from scratch. The
advantage of this approach is that it allows designers to hand-craft optimal
tradeoffs for the target workload or environment. Unfortunately, this approach
can take months or years of development effort: designers need to implement all
the low-level details, often re-implementing common techniques. Also, getting
the consistency guarantees right can be tricky. Table 2.3 provides an indication
of the amount of effort required to implement several systems by listing their
lines of code.
Another approach involves modifying existing systems to accommodate
new tradeoffs. Although this approach would enable a lot of code-reuse, it is
not straight-forward. For many systems, their design tradeoffs are embedded
in their low-level implementation and there is a limit to how much they can
be modified.
The third approach involves using an existing distributed storage frame-
work. Unfortunately, current frameworks fall short in terms of generality: they
support only full replication or make assumptions regarding update propaga-
tion topology. Systems built with these frameworks can only target specific
environments and may have limited capabilities to adapt to new requirements.
For example, the TACT toolkit [43] allows designers to specify their
consistency requirements in terms of order error, staleness, and numerical
error. The toolkit automatically sets up synchronization streams to meet these
requirements. Unfortunately, it assumes full replication of data among nodes.
Swarm [85] is another wide-area peer-replication middleware that allows de-
signers to pick consistency-related options for replica synchronization, failures,
and concurrency. Again, it only supports full replication and carries out up-
date propagation only in a hierarchical fashion.
WheelFS [84], a wide-area distributed file system, supports partial repli-
cation and allows applications to provide semantic cues to specify consistency,
replication level, and placement. However, WheelFS only supports two levels
of consistency on a per-object basis: open-to-close consistency or eventual.
Also, it assumes a single-primary-multiple-backup scheme for update propa-
gation. Deceit [80], a flexible distributed file system that also supports partial
replication and allows users to set parameters for a file to achieve different
levels of availability, performance, and consistency. Unfortunately, it is only
applicable to specific environments: replication among well-connected servers
and client-side caching. Fluid replication [20] also provides a menu of options
for consistency and automatically creates replicas to meet those requirements.
Unfortunately, it is restricted to hierarchical caching.
Stackable layers [39] allows designers to create flexible storage systems
with layers that provide different services. Unfortunately, it supports only
local storage, not distributed storage.
2.4 Need for a General Framework
Because current approaches for building new distributed storage sys-
tems require substantial effort or are suited for very specific environments,
there is a need for a general framework that can make it easier and faster to
develop a wide range of new distributed storage systems. An ideal framework
must support all PR-AC-TI properties. In particular, it must
• allow any node to store any subset of data,
• not restrict update propagation to a specific topology,
• allow designers to specify a wide range of consistency semantics, and
• facilitate development and evolution of systems to accommodate new
requirements.
This dissertation addresses this need by introducing the PADS archi-
tecture for constructing distributed storage systems. Later chapters provide
details of PADS, how it provides all three PR-AC-TI properties, and how it sim-
plifies system development.
Chapter 3
The PADS Approach
The previous chapter argues that a new general framework would make
it easier to build distributed storage systems. In order to meet that need, this
thesis proposes the PADS approach to simplify the design and implementation
of storage systems. The PADS approach is based on two key ideas: First, by
separating policy from mechanisms, system developers can focus their energies
on high-level design rather than implementing low-level details. Second, by
separating policy into routing and blocking concerns, the task of building a
system is divided into two smaller sub-problems. This chapter discusses the
intuition behind these approaches and briefly describes how these ideas are
realized in the PADS prototype.
3.1 Separation of Mechanisms and Policy
During the development of a distributed replication system, a lot of
effort is spent on the implementation and the debugging of low-level details,
such as data storage, transmission of data, bookkeeping, conflict detection, etc.
Also, there is a considerable overlap in the low-level techniques that different
systems require, leading to wasted re-engineering effort.
PADS argues that by constructing a flexible mechanisms layer that
provides low-level primitives, system development is reduced to writing policies
that orchestrate these primitives, thereby reducing the implementation effort
required.
PADS casts the mechanisms as part of a data plane and policies as part
of a control plane. The data plane encapsulates a set of common mechanisms
that handle details of storing and transmitting data and maintaining consis-
tency information. System designers then build storage systems by specifying
a control plane policy that calls upon these mechanisms to implement the system
design.
The separation of policy from mechanisms significantly reduces devel-
opment effort for several reasons: First, since the mechanisms layer takes care
of the grunt work, system designers can focus their energies on high-level de-
sign issues. Second, debugging effort is reduced because developers only need
to focus on policy debugging rather than debugging the underlying mecha-
nisms layer. The effort spent on debugging the mechanisms layer is amortized
over all the systems built over it. Third, if system requirements change, only
policy needs to be updated rather than the low-level primitives.
3.2 Separation of Policy into Routing and Blocking
Since distributed storage systems span a wide design space, the next
question is how policies should be specified so that the designs of different
systems can be easily constructed over the mechanisms layer.
Our observation is that a lot of the design choices that distributed
systems make can be seen as how data is routed among nodes and when it is
“safe” to access data. PADS is built around this intuition and separates policy
specification along this dichotomy: routing policy and blocking policy.
Routing policy sets up data flows between nodes. It provides answers
to questions such as: “When and where to send updates?” or “Which node
to contact on a read miss?”, and largely determines how a system meets its
performance, availability, and resource consumption goals. For example, in
Bayou [69], a server replication system, a node periodically sets up an update
flow to a random peer to exchange updates. In chain replication [88], another
server replication protocol, updates occur at the head of each chain and flow
down to the tail. In Coda [49], a client-server protocol, a client will fetch data
from the server on a miss and send updates to the server when a file is closed
or when an unreachable server becomes reachable.
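To make the event-driven flavor of routing concrete, the sketch below shows a Coda-like reaction to a read miss: the missing object is pulled from the server by establishing a data flow. The trigger and action names (onReadMiss, addSubscription) are hypothetical stand-ins rather than the actual PADS policy API described in Chapter 4.

```java
// Illustrative sketch only: an event-driven routing reaction for a client-server design.
// onReadMiss and addSubscription are hypothetical names, not the PADS policy API.
interface RoutingActions {
    // Establish a unidirectional update flow for the given objects from src to dst.
    void addSubscription(String srcNode, String dstNode, String objectPrefix);
}

final class ClientServerRouting {
    private final RoutingActions actions;
    private final String serverId;

    ClientServerRouting(RoutingActions actions, String serverId) {
        this.actions = actions;
        this.serverId = serverId;
    }

    // Trigger: a local read found no valid copy of the object.
    void onReadMiss(String localNode, String objectId) {
        // Routing decision: pull the missing data from the server.
        actions.addSubscription(serverId, localNode, objectId);
    }
}
```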
Blocking policy can be seen as specifying the system invariants that
must hold before or after an operation can continue. If the system invariants
do not hold, then the operation is blocked. For example, for durability require-
ments, the completion of a write might block until the update reaches three
other nodes. For consistency requirements, the read of a local data object
might block until the local version matches the latest version in the system.
The advantage of this separation is that development is split into two
smaller sub-problems presenting a division of concerns to the designer. Also,
because routing and blocking concerns are very different, their separation en-
ables PADS to provide a specialized policy API designed to make the specifica-
tion of each concern simpler. On another level, blocking policy and routing
policy can be seen as working together to enforce the safety constraints and
the liveness goals of a system. Blocking policy enforces safety conditions by
ensuring that an operation blocks until system invariants are met, whereas
routing policy guarantees liveness by ensuring that an operation will eventu-
ally unblock—by setting up data flows to ensure the conditions are eventually
satisfied.
3.3 PADS Prototype
To test the effectiveness of PADS, we built a Java-based prototype and
constructed a dozen systems with it. The prototype consists of three parts: a
flexible and efficient mechanisms layer, a routing policy API, and a blocking
policy API. The mechanisms layer implements all the basic primitives for stor-
ing data objects, sending and receiving updates, and maintaining consistency
information in a way that is able to support all PR-AC-TI properties. The
policy API allows designers to invoke the mechanisms but does so in a way
that is general enough to support a wide range of systems.
The key features of the prototype that enable it to simplify development
include
• its concise policy API that is intuitive, easy to learn, and captures the
right abstractions for distributed storage, and
!"#$%!#"&!
'()*+,!
"(-./0!
1-(/2.+,!
"(-./0!
"#$%!
3(45.-67!"#$%!
86/9:+.;4;!
<=6/)>:?-6!
'()*+,!
"(-./0!
1-(/2.+,!
3(+@,!!
A.-6!
"(-./0!
%56/.@/:*(+!"(-./0!
3(45.-:*(+!
%0;>64!
$65-(046+>!
B(C6!D!
"#$%!#"&!
Control
Plane !Data
Plane !
B(C6!E! B(C6!F! B(C6!G!
$:>:!
A-(H;!
I(/:-!
'6:CJK7.>6 !
Figure 3.1: PADS approach to system development.
• its support for a declarative policy specification that enables quick and
concise implementation of complex systems. Routing policy is written as
a set of rules in a declarative domain specific language called R/OverLog.
Blocking policy is written as a set of predicates in a configuration file.
The following chapters (Chapters 4, 5, 6, and 7) detail each part of the proto-
type. Chapter 9 provides evidence of PADS’s effectiveness by describing the
range of systems built with PADS.
3.3.1 Using PADS
As Figure 3.1 illustrates, in order to build a distributed storage system
on PADS, a system designer writes a routing policy as a set of rules and a
blocking policy as a set of predicates. She uses a PADS compiler to translate
her routing rules to Java byte-code and places the blocking predicates in a
configuration file. Finally, she distributes a Java jar file containing PADS’s
standard data plane mechanisms and her system’s control policy to the sys-
tem’s nodes. Once the system is running at each node, users can access locally
stored data, and the system synchronizes data among nodes according to the
policy.
A PADS node can be seen as an instantiation of a routing and a blocking
policy over the mechanisms layer on a single system node.
3.4 Scope and Limitations
A PADS policy is a specific set of directives rather than a statement of a
system’s high-level goals. Distributed storage design is a creative process and
PADS does not attempt to automate it: a designer must still devise a strategy
to resolve trade-offs among factors like performance, availability, resource con-
sumption, consistency, and durability. For example, a policy designer might
decide on a client-server architecture and specify “When an update occurs at
a client, the client should send the update to the server within 30 seconds”
rather than stating “Machine X has highly durable storage” and “Data should
be durable within 30 seconds of its creation” and then relying on the system
to derive a client-server architecture with a 30 second write buffer.
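For instance, the client-side directive quoted above amounts to a single timed reaction to a local write. The sketch below is only illustrative; the hook names (onLocalWrite, sendUpdateTo) are assumptions rather than the PADS interface.

```java
// Illustrative sketch only: buffer local writes and push them to the server within
// 30 seconds. onLocalWrite and sendUpdateTo are hypothetical hooks, not PADS API.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class ThirtySecondWriteBuffer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final String serverId;

    ThirtySecondWriteBuffer(String serverId) {
        this.serverId = serverId;
    }

    // Trigger: a write was applied at this client.
    void onLocalWrite(String objectId) {
        // Ensure the update reaches the server within 30 seconds.
        scheduler.schedule(() -> sendUpdateTo(serverId, objectId), 30, TimeUnit.SECONDS);
    }

    private void sendUpdateTo(String server, String objectId) {
        // Placeholder: a real policy would set up or reuse an update subscription here.
    }
}
```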
PADS targets distributed storage environments with mobile devices,
nodes connected by WAN networks, or nodes in developing regions with lim-
ited or intermittent connectivity. In these environments, factors like limited
bandwidth, heterogeneous device capabilities, network partitions, or workload
properties force interesting tradeoffs among data placement, update propaga-
tion, and consistency. Conversely, we do not target environments like well-
connected clusters.
Within this scope, there are four design issues for which the current
PADS prototype restricts a designer’s choices:
First, the prototype does not support security specification. Ultimately,
our policy architecture should also define flexible security primitives, and pro-
viding such primitives is important future work [57].
Second, the prototype exposes an object-store interface for local reads
and writes. It does not expose a file system or a tuple store interface. We
believe that these interfaces are not difficult to incorporate. Indeed, we have
implemented an NFS [68] interface over our prototype.
Third, the prototype provides a single mechanism for conflict resolution.
Write-write conflicts are detected and logged in a way that is data-preserving
and consistent across nodes to support a broad range of application-level re-
solvers. We implement a simple last-writer-wins resolution scheme and believe
that it is straightforward to extend PADS to support other schemes [26, 47,
49, 79, 86].
Fourth, the prototype only supports state-transfer (i.e. updates and
meta-data about updates) and not operation-transfer [47, 69].
3.5 Summary
In order to simplify the development of distributed storage systems,
the PADS approach incorporates two ideas: First, systems are constructed
by specifying high-level policy over a common mechanisms layer. Second,
policy specification is separated into routing policy and blocking policy, each
of which addresses different design concerns. Later chapters provide details
of the policy layer, the mechanisms layer, and the advantages of PADS for
system development.
Chapter 4
Routing Policy
In PADS, routing policy specifies how data and meta-data propagate
among nodes. This chapter discusses what exactly constitutes routing policy
and how it can be specified. The challenge is to define an API that is suf-
ficiently expressive to construct a wide range of systems and yet sufficiently
simple for a designer to comprehend. We discuss how PADS meets this chal-
lenge.
4.1 What is Routing?
A large part of a distributed storage system’s design defines how up-
dates and information propagate among nodes to ensure that the system meets its
performance and availability goals. The design provides answers to questions
such as: “When and where to send updates?” or “Which node to contact on
a read miss?”. In effect, the design sets up a overlay among the nodes for
information propagation. For example, in a traditional client-server system,
a client fetches data from the server and sends updates to it. Its design can
be seen as setting up a star-shaped overlay among the nodes with the server
in the center and clients at the edge. To implement hierarchical caching, a
system defines a tree overlay among nodes so that updates or notifications
propagate in a hierarchical manner.
What is needed from a policy layer are primitives to set up flows among
nodes, ways to retrieve information about important events (e.g. read miss),
and a means to invoke the flow primitives based on the information received.
In order to address these needs, PADS provides the following tools:
• Subscription: a flexible primitive to set up data flow between two nodes.
• Event-driven API: an interface that informs policy about important events and allows
routing policy to set up data flows at appropriate times.
• R/OverLog: a domain specific declarative language for easy policy spec-
ification.
In short, in PADS, routing policy is written as an event-driven program
comprising a set of declarative rules that set up or remove subscriptions
when particular events occur. We detail them in later sections of this chapter.
Note that we do not claim that these primitives are minimal or that they are
the only way to realize this approach. However, they have worked well for us
in practice.
4.2 Subscriptions
Systems set up data flows differently. In client-server systems, clients
only send updates to the server, whereas in peer-to-peer systems, any node
can send updates to any other node. Nodes in a server replication system
often exchange updates to the whole data set, whereas clients in a client-server
system only retrieve updates for the data they cache. Some systems propagate
notifications rather than updates to quickly propagate update information.
In order to support different types of data flows, the primitives provided
by PADS must be efficient, support all three PR-AC-TI properties, support
separate data and meta-data propagation, expose options for different perfor-
mance tradeoffs, and allow incremental progress in cases of disruptions.
Instead of having multiple data flow primitives, PADS is able to encap-
sulate sufficient flexibility into a single primitive—a subscription. A subscrip-
tion is a unidirectional stream of updates between a pair of nodes. For each
subscription, the policy writer can choose
• The source and the destination nodes: Updates are sent from the source
node to the destination node. Since subscriptions can be set up between
any nodes, they support topology independence (TI).
• The subset of data that is of interest to the subscription: Only updates
to objects that are in the interest set will be sent on the subscription.
This allows subscriptions to support partial replication (PR).
• What type of information should be sent: There are two choices: invalida-
tions or bodies. An invalidation indicates that an update to a particular
object occurred at a particular instant in time, whereas a body contains
the data for a specific update. The advantage of separating invalidation
and body streams is that it enables meta-data and data to propagate
via different paths. A node can quickly and cheaply inform other nodes
about an update, via an invalidation, without having to send the entire
update. Also, nodes can choose to only receive bodies of the updates
they care about, leading to better bandwidth efficiency.
• The logical start time: Only updates that have occurred to the objects of
interest from the start time will be sent on the subscription. Therefore, a
new subscription can continue where the previous one left off supporting
incremental progress.
• The catchup method: If the start time of an invalidation subscription is
earlier than the sender’s current time, then the sender has two options,
each of which has different performance tradeoffs: the sender can trans-
mit either a log of the updates that have occurred since the start time
or a checkpoint of the latest versions of the objects. Body subscriptions,
however, only send checkpoints and do not support the log option.
Hence, setting up data flows is reduced to setting up subscriptions
among nodes. For example, if a designer wants to implement hierarchical
caching, the routing policy would set up subscriptions among nodes to send
updates up and to fetch data down the hierarchy. If a designer wants nodes
to randomly gossip updates, the routing policy would set up subscriptions
between random nodes. If a designer wants mobile nodes to exchange updates
when they are in communication range, the routing policy would probe for
available neighbors and set up subscriptions at opportune times.
The advantage of having a single primitive is that it makes system
design cleaner—all routing is set up in terms of subscriptions.
4.3 Event-driven API
In order to make it easier to write routing policies, the interface should
allow designers to do four things: to set up subscriptions, to receive information
about events, to store and look up data objects in the underlying layer, and
to communicate with blocking policy. For example, a client's routing policy
needs to know when a local read miss occurs so that it can establish a sub-
scription to retrieve the required data from the server. In a publish/subscribe
system, a node may store and read from a configuration object the sets of data
it is interested in. For consistency reasons, a blocking policy may block the
completion of a write until the routing policy indicates (by sending a message
to the blocking policy) that the update has propagated to the required nodes.
PADS provides three sets of primitives for specifying routing policies:
(1) a set of 9 triggers that expose events pertaining to the state of the data
plane, (2) a set of 7 actions for establishing and removing subscriptions, and (3) a
set of 5 stored events that allow a policy to persistently store and access data
objects for configuration options and information affecting routing decisions.
Consequently, writing routing policy entails invoking the appropriate actions
or storing events based on the triggers and events received.
Local Read/Write Triggers
  Operation block     obj, off, len, blocking point, failed condition
  Write               obj, off, len, writerId, time
  Delete              obj, writerId, time
Message Arrival Triggers
  Inval arrives       srcId, obj, off, len, writerId, time
  Send body success   srcId, obj, off, len, writerId, time
  Send body failed    srcId, destId, obj, off, len, writerId, time
Connection Triggers
  Subscription start      srcId, destId, objS, inval|body
  Subscription caught-up  srcId, destId, objS, inval
  Subscription end        srcId, destId, objS, reason, inval|body

Table 4.1: Routing triggers provided by PADS. objId, off, and len indicate the object
identifier, offset, and length of the update. blocking point and failed condition indicate at
which point an operation blocked and what condition failed (refer to Chapter 5). writerId
and time indicate the logical time of a particular update. srcId and destId are the
identifiers of the source node and the destination node respectively. objS indicates the
interest set of the subscription. inval|body indicates the type of subscription. reason
indicates whether the subscription ended due to failure or termination.
In fact, because the number of primitives provided by PADS is small,
designers are not overwhelmed by too much choice. Most importantly, since we
were able to express a wide range of systems with these primitives (refer to
Chapter 9), we believe that these primitives capture the fundamental abstractions
for distributed storage systems.
The following subsections detail these primitives.
4.3.1 Triggers
PADS triggers expose to the routing policy events that occur in the
data plane. As Table 4.1 details, these events fall into three categories.
Subscription Actions
  Add Inval Sub     srcId, destId, objS, [startTime], LOG|CP|CP+Body
  Add Body Sub      srcId, destId, objS, [startTime]
  Send Body         srcId, destId, objId, off, len, writerId, time
  Remove Inval Sub  srcId, destId, objS
  Remove Body Sub   srcId, destId, objS
Consistency-related Actions
  Assign Seq        objId, off, len, writerId, time
  B Action          <policy defined>

Table 4.2: Routing actions provided by PADS. objId, off, and len indicate the object
identifier, offset, and length of the update to be sent. startTime specifies the logical start
time of the subscription. writerId and time indicate the logical time of a particular update.
The fields for the B Action are policy defined.
• Local operation triggers inform the routing policy when an operation
blocks because it needs additional information to complete or when a
local write or delete occurs.
• Message receipt triggers inform the routing policy when an invalidation
arrives or whether a send body succeeds or fails.
• Connection triggers inform the routing policy when a subscription is suc-
cessfully established, when a subscription has caused a receiver’s state to
be caught up with a sender’s state (i.e., the subscription has transmitted
all updates to the subscription set up to the sender’s current time), or
when a subscription is removed or fails.
4.3.2 Actions
PADS actions allow the routing policy to set up data flows (i.e. sub-
scriptions) among nodes. As Table 4.2 details, these actions fall into two
categories.
• Add subscription actions that establish streams of invalidations or bodies
for specific subsets of objects between two nodes. The send body action
is a special case of a subscription in which only the body of a single
object is transferred.
• Remove subscription actions that terminate invalidation or body streams
between two nodes.
PADS also defines a third type of action that aids the enforcement
of consistency: the Assign Seq action and the B Action. The Assign Seq action
marks a previous update with a commit sequence number, allowing policies
to implement various consistency and commit protocols. For example, in a
client-server system, in order to enforce serializability, when a server receives
an update to an object, it uses the Assign Seq action to commit the update once
it is sure that all other versions of the object (located on other clients) have
been invalidated. Serializability can be guaranteed at clients by reading only
valid committed versions (refer to Chapter 9). The B Action allows routing policy
to send messages to blocking policy. Blocking policy may define conditions
based on information available to routing policy, for example, whether the
node is disconnected from the server, or whether an update has reached 3 other
nodes. Once the conditions are satisfied, the routing policy uses the B Action
to inform the blocking policy so that the operation can continue.

Stored Events
  Write event           objId, eventName, field1, ..., fieldN
  Read event            objId
  Read and watch event  objId
  Stop watch            objId
  Delete events         objId

Table 4.3: PADS's stored events interface. objId specifies the object in which the events
should be stored or read from. eventName defines the name of the event to be written and
field* specify the values of fields associated with it.
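As a concrete illustration of these consistency-related actions (using the R/OverLog notation introduced in Section 4.4), the following sketch shows the last step of the client-server commit example above. The rule identifiers and the invalArrives, ackInval, allAcked, and assignSeq event names are hypothetical: PADS defines the Inval arrives trigger and the Assign Seq action themselves, but not their exact R/OverLog spellings, and the bookkeeping that determines when every cached copy has acknowledged is elided.

  sv1 ackInval(@S, X, Obj, WId, T) :- invalArrives(@X, Src, Obj, Off, Len, WId, T),
      serverId(@X, S).
  sv2 assignSeq(@S, Obj, Off, Len, WId, T) :- allAcked(@S, Obj, Off, Len, WId, T).

Rule sv1 runs at each client: when an invalidation arrives, the client acknowledges it to the server looked up from its serverId table. Rule sv2 runs at the server: once elided rules determine that all cached copies have acknowledged (the allAcked event), the server invokes Assign Seq to commit the update.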
4.3.3 Stored Events
Many systems look up or maintain persistent state to make routing de-
cisions. For example, in a client-server system, routing policy needs to look up
the identity of the server in order to determine who to establish subscriptions
to. In a publish/subscribe system, the routing policy may need to look up or
maintain the sets of objects for which to establish subscriptions.
Supporting this need is challenging both because we want an abstrac-
tion that meshes well with our event-driven programming model and because
the techniques must handle a wide range of scales. In particular, the abstrac-
tion must not only handle simple, global configuration information (e.g., the
server identity in a client-server system like Coda [49]), but it must also scale
up to per-file information (e.g., which nodes store the gold copies of each object
in Pangaea [77].)
PADS defines a new abstraction, the stored events primitives, in order
to store and retrieve events from a data object in the underlying persistent
object store. Table 4.3 details the full API for stored events. A Write Event
stores an event into an object and a Read Event causes all events stored in
an object to be fed as input to the routing program. The API also includes
Read and Watch to produce new events whenever they are added to an object,
Stop Watch to stop producing new events from an object, and Delete Events
to delete all events in an object.
The stored events primitives enable routing policy to implement various
scenarios intuitively in an event-driven fashion.
For example, in a client-server system, clients need to know the iden-
tity of the server to contact. At configuration time, the installer writes the
event <add server, serverID> to the object /config/server. At startup,
the routing policy generates a <read event, /config/server> action that
looks up the /config/server object. The add server event is retrieved from
the object which, in turn, triggers actions in the routing policy to establish
subscriptions to the server.
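A minimal sketch of this configuration flow in R/OverLog might look as follows; the readEvent and addServer event names are assumed spellings of the Read event action and the stored add server event, and the one-shot periodic event is assumed to model node startup:

  materialize (serverId, infinity, 1, keys(1)).
  cfg1 readEvent(@X, ObjId) :- periodic(@X, 1, 1), ObjId = "/config/server".
  cfg2 serverId(@X, S) :- addServer(@X, S).

Rule cfg1 issues the Read event action once at startup; the stored add server event is then replayed into the program, and rule cfg2 records the server's identity in the serverId table, which later rules (such as sub01 in Section 4.4.2) consult when establishing subscriptions.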
In a hierarchical information dissemination system, a parent p keeps
track of what volumes a child subscribes to so that the appropriate sub-
scriptions can be set up. When a child c subscribes to a new volume v, p
stores the information in a configuration object /subInfo by generating a
<write event, /subInfo, child sub, p, c, v> action. When this infor-
mation is needed, for example on startup or recovery, the parent generates
a <read event, /subInfo> action that causes a <child sub, p, c, v>
event to be generated for each item stored in the object. The child sub
events, in turn, trigger actions in the routing policy that re-establish subscrip-
tions.
In some client-server systems [49], a client maintains a list of objects to
prefetch from the server (i.e. a hoard list). The routing policy can store an item
on the hoard list by using the write event action to store the <hoard item,
objId> event into the persistent object /config/[nodeID]/hoardlist. Later
when the policy wishes to walk the hoard list to prefetch objects, it can use
the read event action to produce all hoard item events stored in the hoard
list. The production of these events, in turn, activates policy actions to fetch
an item from the server.
In Pangaea [77], each file’s directory entry includes a list of gold nodes
that store copies of that file. To implement such fine-grained, per-file routing
information, a PADS routing policy creates a .meta object for each directory,
stores the gold node information for each of its children as <file entry,
objId, goldNodeList>, and updates a file’s file entry whenever the file’s
set of gold nodes changes (e.g., due to a long-lasting failure). When a read
miss occurs, the routing policy produces the stored file entry event from the
parent .meta object, and this event activates rules that route a read request
to one of the file’s gold nodes.
4.4 R/OverLog
In order to make it easier to write routing policies, PADS provides
designers with a domain specific language, R/OverLog—a declarative language
based on the OverLog [56] network routing language.
The advantage of using R/OverLog for policy specification is three-
fold: First, R/OverLog is event-driven so it fits well with the PADS paradigm.
Second, since R/OverLog is based on a network routing language, it allows
policies to use the same interface to do network management and invoke PADS
actions based on network events. For example, a policy can easily specify
“add a subscription when a peer is detected”. Third and most importantly,
because PADS uses the routing abstraction for replication, for which OverLog
and R/OverLog have been specially designed, complex policies can be easily
and concisely specified. Statements in an R/OverLog program closely follow
pseudocode, making programs easy to write. Also, most R/OverLog programs
fit in a couple of pages, making it easier to do design reviews. Granted, there
is a learning curve to learning R/OverLog, but the compactness of policies
and the ease with which they can be written make the effort worthwhile.
Since PADS mechanisms are written in Java, an R/OverLog routing
policy is translated into Java and is compiled before being instantiated over
the mechanisms. This section provides details of the R/OverLog language
semantics. Details of how R/OverLog is translated into Java are provided in
Chapter 7.
4.4.1 R/OverLog Language Semantics
An R/OverLog program defines a set of tables and a set of rules. Tables
store tuples that represent internal state of the routing program, whereas rules
define the logic of the program and are fired when events occur. We discuss
each part of the language semantics in turn.
4.4.1.1 Tuples, Events and Data Types
Tuples represent the state of an R/OverLog program and are stored in
tables. Every tuple has a name and consists of a series of fields. An event can
be considered as a special case of a tuple that is not stored in a table.
Since R/OverLog is a typed language, it defines 5 data types for fields:
• int: an integer (For example, 1)
• float: a float (For example, 3.56)
• boolean: a boolean value (For example, true)
• string: a string of characters enclosed in double quotes (For example,
"orange")
• location: network address of a node, including the hostname or IP
address and port number. (For example, myhost.mydomain.com:5000)
In order to define field types, type declaration statements need to be
used as follows:
tuple <name>(location [, type*]).
where name is a string that starts with a lower-case letter. The first field of a
tuple or an event must be of the location type. The location represents the
node on which the tuple is stored or the node to which the event should be sent. For
example, the following type declaration:
tuple neighbor(location, location, float).
declares a neighbor tuple with 3 fields: 2 locations and 1 float.
In practice, it is not necessary to explicitly write tuple declarations for
all tuples in the program. The number and types of fields are inferred from
their use in program rules. However, if type information cannot be inferred,
an error is reported and a tuple declaration is needed.
4.4.1.2 Tables and Facts
Tables store the state of the program as tuples. They must be explicitly
defined via materialization statements that specify the name of the table, how
long the tuples are retained, the maximum number of tuples stored in the
table, and the fields making up the primary key. For example, the following
table declarations
materialize (neighbor, 10, infinity, keys(1,2)).
materialize (sequence, infinity, 1, keys(1)).
define two tables: first, a neighbor table whose tuples are retained for 10
seconds, whose size is unbounded, and whose primary key is made up of
the first two fields; second, a sequence table that stores only 1 tuple (i.e. the
latest one added to the table) indefinitely and whose primary key is the
first field of the sequence tuple. Note that table names, like tuple and event
names, always begin with a lower-case letter.
Tuples are added to tables when they are generated due to an execution
of a rule or are hard-coded in the program as facts. For example, the following
statements
neighbor(host1.mydomain.com:5000, host2.mydomain.com:5000, 3).
neighbor(host1.mydomain.com:5000, host3.mydomain.com:5000, 0.5).
add two tuples to the neighbor table.
4.4.1.3 Timing Events
R/OverLog provides a built-in event generator, periodic, that gen-
erates events at periodic intervals. For example, the following rule:
r0 refreshEvent(@X):- periodic(@X, 5, -1).
generates a refreshEvent event every 5 seconds. The second field in the periodic
tuple specifies the length of the period (in seconds), and the third field specifies
the number of events that should be generated. The value -1 specifies that
events should be generated as long as the program is running.
4.4.1.4 Rules
The logic of the program is written as a set of rules. Rules have the
form:
<ruleId> <head>:- <body>.
The ruleId is a string identifier for the rule. The head of the rule
specifies the events or the tuples generated when the rule is fired (i.e. executed).
Generated events fire other rules. All generated tuples are added to their
corresponding tables. The body of the rule specifies the constraints that must
hold in order for the rule to fire. It consists of at most one event, table lookups,
variable assignments, and constraints separated by commas. The commas are
interpreted as an AND logical construct. Variables in the body begin with
upper-case letters. Consider the following rules:
r1 refreshSeq(@X, NewSeq):-refreshEvent(@X), sequence(@X, Seq), NewSeq=Seq+1.
r2 sequence(@X, NewSeq):-refreshSeq(@X, NewSeq).
The variable X and the at sign (@) represent the node at which the program is
running. The rule r1 specifies that whenever a refreshEvent is received, the
sequence number is looked up from the sequence table and is incremented. A
new refreshSeq event is generated with the new sequence number. The rule
r2 stores the new sequence number back in the sequence table.
Note that the at sign (@) is also used to send an event to other nodes. Con-
sider the following rule.
p1 ping(@Y, X) :- doPing(@X), neighbor(@X, Y, _).
When a doPing event is received, the rule p1 looks up the neighbor table and
sends a ping event to every neighbor, represented by @Y. The underscore sign
(_) is a wild-card and matches any value.
The body of a rule must conform to the following two constraints: First,
all table lookups and events in the body of a single rule must be localized to a
single node. Second, the body consists of at most one event. In the case that
the body has no event, the rule is fired whenever a new tuple is added to the
tables specified in the body.
In order to delete a tuple from a table, the head of the rule contains the
delete keyword. For example, the following rule
d1 delete neighbor(@X, Y, L) :- removeNeighbor(@X), neighbor(@X, Y, L),
   Y == host2.mydomain.com:5000.
removes the tuple for host2 from the neighbor table.
The R/OverLog program exits if an exit(@X) tuple is
generated due to the firing of a rule.
4.4.1.5 Aggregates
R/OverLog allows a rule to carry out aggregation on the tuples that
match the constraints in the body. For example, the following rule is fired by
the checkStatus event, looks up the neighbor table and finds the minimum
latency to generate the minLatency event.
minLatency(@X, a_MIN<L>) :- checkStatus(@X), neighbor(@X, _, L).
Other supported aggregates include finding the maximum or average value of
a field (a_MAX/a_AVG), counting the number of matched tuples (a_COUNT), and
picking a random value (a_RANDOM). Note that aggregates are specified in the
head of the rule and not the body.
4.4.1.6 Functions
R/OverLog defines two built-in functions that can be used for specifying
assignments or conditions in the body: f_now() returns the current clock
time as an integer and f_rand() returns a random float between 0 and 1.
For example, the following rule generates a currTime event that contains the
current clock time every minute.
t1 currTime(@X, T) :- periodic(@X, 60, -1), T = f_now().
For ease of specification, programmers can extend R/OverLog to de-
fine their own functions. They must declare the function in the R/OverLog
program and define it in the R/OverLog runtime (refer to Chapter 7). For
example, the following declaration defines an f_getParent function that takes in
a string argument and returns a string.
fun string f_getParent(string).
4.4.1.7 Watch Statements
Watch statements are used to print local events or table updates and
are useful debugging tools. For example, the following statement prints every
ping event received or generated.
watch(ping).
4.4.2 R/OverLog Policies
The routing policy is written as an R/OverLog program. In order to
support the routing policy API, the underlying layer inserts triggers as events
into the R/OverLog program and whenever an event that matches an action
is generated by a rule, it invokes the appropriate mechanism primitive. For
example, the following rule:
sub01 addInvalSub(@X, S, X, Obj, CTP) :- operationBlock(@X, Obj, Off, Len, BPoint, _),
      serverId(@X, S), CTP = "CP", BPoint == "readNowBlock".
specifies that whenever node X receives an operationBlock trigger informing
it of an operation that is blocked at the readNowBlock blocking point, it
should look up the server id from the serverId table and produce an
addInvalSub event that will invoke the action to establish a subscription from
the server for the object with checkpoint catchup.
Hence, for a rule in the routing policy, the input event can be a trigger injected
from the data plane, a stored event injected from the data plane’s persistent
state, or an internal event produced by another rule on a local machine or a
remote machine. Every rule generates a single event that invokes an action in
the data plane, fires another local or remote rule, or is stored in a table as a
tuple.
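For instance, a policy that wants failed subscriptions to be re-established could couple a Connection trigger to a subscription action in a single rule. The subscriptionEnd event name and the literal strings for its type and reason fields are assumed here; PADS defines the Subscription end trigger and the Add Inval Sub action, but not these exact spellings:

  re1 addInvalSub(@X, Src, X, ObjS, CTP) :- subscriptionEnd(@X, Src, X, ObjS, Reason, Type),
      Type == "inval", Reason == "failure", CTP = "CP".

When an incoming invalidation subscription fails, the rule immediately requests a new subscription for the same interest set with checkpoint catchup.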
4.4.3 R/OverLog vs. OverLog
The first version of our prototype employed OverLog as the language
for writing routing policies. Unfortunately, the runtime to execute OverLog
was slow, unstable, and did not provide a consistent rule execution model. We
realized that even though the language was good for policy specification, the
runtime made it impossible to use. Hence, we developed our own language
and execution runtime for PADS. The main differences between R/OverLog
and OverLog are as follows:
R/OverLog restricts a rule from doing remote table lookups. In other
words, for a single rule, all the right side predicates (i.e. the triggering event
and table lookups) reside on the same node. Since all communication between
nodes is carried out with explicit message transfers, it was easier to implement
an efficient runtime for R/OverLog.
R/OverLog implements fixed point semantics making it easier to reason
about rule execution. The semantics guarantees that all rules triggered by the
appearance of the same event are executed atomically in isolation from one
another. Once all such rules are executed, their table updates are applied,
their actions are invoked, and the events they produce are enqueued for future
execution. Then, another new event is selected and all of the rules it trig-
gers are executed. The original OverLog runtime did not support fixed-point
semantics, but subsequent implementations have added support for it.
R/OverLog adds an interface to insert and receive events from a
running R/OverLog program. This interface is important for PADS to inject
triggers to and receive actions from the policy. On the other hand, the Over-
Log runtime was not designed to receive external events. R/OverLog also adds
type information to events making the language safer.
Finally, in order to execute an R/OverLog program, it needs to be trans-
lated to Java and compiled before it can run on the provided runtime. Over-
Log, on the other hand, executes directly on its runtime. The design and the
implementation of the runtime is detailed in Section 7.2.2.
4.5 Summary
Routing policy specifies how data and metadata flow in a distributed
storage system. This chapter defines the primitives provided by PADS to support in-
formation routing as well as the policy API provided to designers for easy
specification. Routing policy is written as an event-driven program, a set of
rules in R/OverLog, a declarative language designed for this purpose, over an
API comprising a set of actions that set up data flows, a set of triggers that
expose local node information, and a set of stored events to store and retrieve
persistent state.
Chapter 5
Blocking Policy
The second part in the development of storage systems consists of writ-
ing blocking policies to specify the consistency and durability requirements.
In order to be useful, PADS needs to make it easy to implement a wide
range of semantics. This chapter defines what constitutes blocking policy and
specifies the primitives PADS provides for writing blocking policy.
5.1 What is Blocking?
The consistency and durability guarantees required by different systems
fall on a broad spectrum. Some require weaker guarantees such as eventual
consistency [26], whereas others may implement stronger guarantees such as
linearizability [94] or sequential consistency [53]. Some systems may differenti-
ate between tentative and committed updates [86] and use different policies for
update commitment. For some systems, persistence of an update on a single
node is sufficient [69], whereas for others, an update needs to be persistent on
all nodes [88].
One approach is to pre-implement standard consistency and durability
semantics as libraries and allow designers to pick the one they need. However,
since the libraries are pre-defined, this approach does not allow designers to
implement customized semantics or take advantage of the knowledge of the
system to construct a more efficient implementation.
PADS takes a different approach: it provides designers with sufficient
information about the consistency state of local objects so that they can im-
plement their own consistency and durability guarantees. Since the state is
automatically maintained by the mechanisms layer, designers do not need to
write tricky bookkeeping code.
PADS casts consistency and durability as blocking policy because each
defines the circumstances under which the processing of a request or the return
of a response should block or continue. In particular, enforcing consistency
semantics generally requires blocking reads until a sufficient set of updates
is reflected in the locally accessible state, blocking writes until the resulting
updates make it to some or all of the system’s nodes, or both. Similarly,
durability policies often require writes to propagate to some subset of nodes
(e.g., a central server, a quorum, or an object’s “gold” nodes [77]) before the
write is considered complete or before the updates are read by another node.
PADS considers these durability and consistency constraints as invari-
ants that must hold when an object is accessed. The system designer specifies
these invariants as a set of predicates that block a read request, a write request,
or the application of a received update to local state until the predicates are
satisfied. To that end, PADS (1) defines 5 blocking points for which a system
designer specifies predicates, (2) provides 4 built-in conditions that a designer
can use to specify predicates, and (3) exposes a B Action interface that allows
a designer to specify custom conditions based on routing information. The
set of conditions for each blocking point makes up the blocking policy of the
system.
This small set of primitives is sufficient to specify any order-error or
staleness error constraint in Yu and Vahdat’s TACT model [94] and imple-
ment a broad range of consistency models from best effort coherence to delta
coherence [81] to causal consistency [52] to sequential consistency [53] to lin-
earizability [94].
5.2 Blocking Points
PADS defines five points for which a policy can supply a predicate and
a timeout value to block a request until the predicate is satisfied or the timeout
is reached. The first three are the most important:
• readNowBlock blocks a read until it will return data from a moment that
satisfies the predicate. Blocking here is useful for ensuring consistency
(e.g., block until a read is guaranteed to return the latest sequenced write.)
• writeAfterBlock blocks a write request after it has updated the local object
but before it returns. Blocking here is useful for ensuring consistency
(e.g., block until all previous versions of this data are invalidated) and
durability (e.g., block here until the update is stored at the server.)
• applyUpdateBlock blocks an invalidation received from the network be-
fore it is applied to the local data object. Blocking here is useful to
increase data availability by allowing a node to continue serving local
data, which it might not have been able to if the data had been invali-
dated. (e.g., block applying a received invalidation until the corresponding
body is received.)
PADS also defines writeBeforeBlock to block a write before it modifies the
underlying data object and readEndBlock to block a read after it has retrieved
data from the data plane but before it returns.
5.3 Blocking Conditions
PADS defines a set of predefined conditions, listed in Table 5.1, to
specify predicates at each blocking point. A blocking predicate can use any
combination of these conditions. The first four conditions provide an interface
to the consistency bookkeeping information maintained in the data plane on
each node.
• isValid requires that the last body received for an object is as new as the
last invalidation received for that object. isValid is useful for returning
the latest known version of the object and for maximizing availability
by ensuring that invalidations received from other nodes are not applied
until they can be applied with their corresponding bodies [26, 64].
Predefined Conditions on Local Consistency State
  isValid                       Block until node has received the body corresponding to the
                                highest received invalidation for the target object.
  isComplete                    Block until target object's consistency state reflects all
                                updates before the node's current logical time.
  isSequenced VV|CSN            Block until object's total order is established via Golding's
                                algorithm (VV) [31] or an explicit commit (CSN) [69].
  maxStaleness nodes, count, t  Block until all writes up to (operationStartTime - t) from
                                count nodes in nodes have been received.
User Defined Conditions on Local or Distributed State
  B Action event-spec           Block until an event with fields matching event-spec is
                                received from routing policy.

Table 5.1: Conditions available for defining blocking predicates. Note: for the maxStaleness
condition, setting nodes and count to -1 represents all nodes in the system.
• isComplete requires that a node has received all invalidations for the
target object up to the node’s current logical time. Since routing policies
can direct arbitrary subsets of invalidations to a node, a node may have
gaps in its consistency state for some objects. isComplete specifies that
no such gap exists. If the predicate for readNowBlock is set to isValid
and isComplete, reads are guaranteed to see causal consistency.
• isSequenced requires that the most recent write to the target object has
been assigned a position in a total order. Policies that ensure sequential
or stronger consistency can use the AssignSeq routing action (see Table
4.2) to allow a node to sequence other nodes' writes and specify the
isSequenced condition as a readNowBlock predicate to block reads of
unsequenced data. Alternatively, a variation of isSequenced can regard a
write as sequenced when its logical time is earlier than the latest update
a node has received from all other nodes in the system; at that point, no
earlier updates can arrive, and a total order is well defined [31].
• maxStaleness is useful for bounding real time staleness.
The fifth condition on which a blocking predicate can be based is
B Action. A B Action condition provides an interface with which a routing
policy can signal an arbitrary condition to a blocking predicate. An operation
waiting for event-spec unblocks when the routing rules produce an event whose
fields match the specified event-spec.
Rationale. The first four built-in consistency bookkeeping primitives ex-
posed by this API were developed because they are simple and inexpensive
to maintain within the data plane [16, 97] but would be complex or expensive
to maintain in the control plane. Note that they are primitives, not solu-
tions. For example, to enforce linearizability, one must not only ensure that
one reads only sequenced updates (e.g., via blocking at readNowBlock on isSe-
quenced) but also that a write operation blocks until all prior versions of the
object have been invalidated (e.g., via blocking at writeAfterBlock on, say, the
B Action allInvalidated which the routing policy produces by tracking data
propagation through the system).
Beyond the four pre-defined conditions, a policy-defined B Action con-
dition is needed for two reasons. The most obvious need is to avoid having to
predefine all possible interesting conditions. The other reason for allowing con-
ditions to be met by actions from the event-driven routing policy is that when
conditions reflect distributed state, policy designers can exploit knowledge of
their systems to produce better solutions than a generic implementation of the
same condition.
For example, in the client-server system we describe in Chapter 9, a
client blocks a write until it is sure that all other clients caching the object
have been invalidated. A generic implementation of the condition might have
required the client that issued the write to contact all other clients. However,
a policy-defined event can take advantage of the client-server topology for a
more efficient implementation. The client sets the writeAfterBlock predicate to
a policy-defined writeComplete event. When an object is written, an invali-
dation is sent to other clients via the server. When other clients receive an
invalidation, they send acknowledgements to the server. The server gathers
acknowledgements from all other clients and generates a writeComplete action
for the client that issued the write. In this custom scheme, the client that
issued the write only needs to contact the server rather than all other clients.
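A hypothetical R/OverLog fragment for this scheme is sketched below. The invalArrives, ack, ackCount, and writeComplete event names, the ackStore and clientCount tables, and the assumption that writerId can be used as the destination address of the writing node are all illustrative; only the writeComplete predicate itself comes from the example above:

  materialize (ackStore, 60, infinity, keys(2,3,4,5)).
  materialize (clientCount, infinity, 1, keys(1)).
  w1 ack(@S, C, Obj, WId, T) :- invalArrives(@C, Src, Obj, Off, Len, WId, T),
      serverId(@C, S).
  w2 ackStore(@S, C, Obj, WId, T) :- ack(@S, C, Obj, WId, T).
  w3 ackCount(@S, Obj, WId, T, a_COUNT<C>) :- ackStore(@S, C, Obj, WId, T).
  w4 writeComplete(@WId, Obj, T) :- ackCount(@S, Obj, WId, T, Cnt),
      clientCount(@S, N), M = N - 1, Cnt == M.

Each client acknowledges an invalidation to the server (w1); the server records and counts acknowledgements (w2, w3) and, once every client other than the writer has acknowledged, sends the writeComplete event to the writer (w4), unblocking its writeAfterBlock predicate.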
Conversely, in an earlier version of the interface, we included a built-in
predicate hasPropagated(nodes, count, p), which allowed a node to block until
its pth most recent write had propagated to at least count nodes out of the
list of nodes in nodes. This condition is potentially useful because it maps
directly to Yu and Vahdat’s order error condition for implementing tunable
consistency [94]—it provides a way to wait until a write is durably stored at
a quorum of servers. However, our built-in implementation of this primitive
had a node poll other nodes directly to determine when a write had reached
them, which makes sense for some systems but not others. By constructing
such a condition using the B Action primitive instead, a designer can gather
the same information in a natural way for a particular system. For example, if
the nodes are set up in a star topology, instead of having every node keep track
of where its writes have propagated, it is more natural to implement routing
rules so that the center node keeps track of update propagation and informs
the original writer when the update has propagated to sufficient nodes. The
routing policy then generates an appropriate message for the blocking policy,
which was waiting on the B Action primitive.
5.4 Examples
This section describes how PADS blocking primitives can be used to
implement a wide range of consistency semantics from (weak) monotonic coher-
ence to (strong) linearizability. To implement stronger semantics, the blocking
policy makes extensive use of the B Action condition. It relies on the routing
policy to ensure that the necessary invariants are satisfied before generating
a message for the blocking policy to unblock the operation. We describe be-
low how various semantics can be implemented and the requirements from the
routing policy when the B Action condition is specified.
5.4.1 Monotonic-read coherence
Monotonic-read coherence is one of the weakest reasonable consistency
semantics for distributed storage. It specifies that if a node reads the value of
an object, any successive read operation by that node will always return the
same value or a more recent version.
PADS guarantees monotonic-read coherence by default because the
mechanisms layer replaces the body of an object only if a newer version is
received. By setting the blocking points to the default true predicate, read
and write operations continue without blocking and coherence is guaranteed.
5.4.2 Delta Coherence
Delta coherence [81] specifies that a read of a data object returns the
last value that was produced at most delta time units preceding that read
operation.
In order to implement delta coherence in PADS, the readNowBlock is
set to the predicate maxStaleness(-1,-1, delta) AND isValid. The maxStaleness
condition ensures that only updates that have occurred in the past delta time
units are read. The isValid condition ensures that the body read matches the
update. Note that if the isValid condition is not specified, then there is no
guarantee that the body matches the latest invalidation received, and therefore
a read may return a much older body, violating delta coherence.
An alternate implementation is to set the readNowBlock to
maxStaleness(-1, -1, delta) and the applyUpdateBlock to isValid. By setting the applyUpdate-
Block to isValid, we ensure that all locally stored bodies match the latest
invalidations received (i.e. all local objects are valid). However, this imple-
mentation may reduce the amount of data that can be read. Invalidations are
received in causal order but bodies can be received in any order. If the body
of an invalidation is delayed, then its application and all subsequently re-
ceived invalidations will be delayed (even though their bodies may have already
arrived). By the time the required body arrives and all received updates are
applied, it may be too late. Therefore, by setting the isValid condition for the
readNowBlock instead, we prevent application of updates from being delayed,
thus increasing the chances of meeting the temporal deadline.
5.4.3 Causal Consistency
Causal consistency defines that writes that potentially are causally re-
lated are seen by every node of the system in the same order. Concurrent
writes (i.e. ones that are not causally related) may be seen in different order
by different nodes [13].
In order to guarantee causal consistency in PADS, the readNowBlock
is set to isComplete AND isValid. The isComplete condition guarantees that,
for the target object, the locally stored meta-data reflects the latest update
in casual order up to the node’s current logical time and the isValid condition
guarantees that the body returned matches that update. If the isComplete
condition is not used, then it is possible that the local version is causally much
older than the versions of other objects stored and reading it would lead to a
violation of causal order. Chapter 6 details how causality is maintained in the
underlying data layer.
5.4.4 Sequential Consistency
Sequential consistency [53] specifies that the result of any execution
is the same as if all operations (reads and writes) were executed in some
sequential order and operations of each individual processor appear in this
sequence in program order.
Sequential consistency can be implemented in PADS as follows: First,
when a write occurs at a node, the write is blocked until all other
copies of the object have been invalidated. This ensures that future reads can-
not return old values that existed before the write. The blocking policy sets
the writeAfterBlock to B Action invalidatedAll and relies on the routing policy
to ensure that all other copies have been invalidated. In a fully-connected
and full-replication scenario, the routing policy implements rules that send an
ACK to the original writer whenever a node receives an invalidation. When
sufficient ACKS have been collected, the routing policy generates an invali-
datedAll message for the blocking policy. Second, after all the ACKS have been
received, the routing policy implements a commit protocol to assign the write
a position in the total order. The protocol can use a primary-commit scheme
or Golding’s algorithm to determine the order. The routing policy uses the
AssignSeq action to indicate that the write has been assigned a fixed position
in the total order. Finally, a read only returns the version of an object if the
last write to that object has been sequenced. Note that
because invalidation subscriptions transmit the commit information along with
invalidations in causal order, reading valid, complete, and sequenced objects
is consistent with the total order determined by the commit protocol. There-
fore, the readNowBlock is set to isSequenced AND isValid AND isComplete.
This scheme satisfies the two requirements (i.e. program order and write
atomicity) defined by Adve et al. [11] for enforcing sequential consistency.
5.4.5 Linearizability
Linearizability guarantees that each operation (read or write) takes ef-
fect instantaneously at some point between its invocation and its response [40].
Linearizability is similar to sequential consistency in the sense that all opera-
tions appear to have executed in some sequential order. However, linearizabil-
ity has the additional requirement that the sequential order is consistent with
the global order (i.e. real-time order).
In order to implement linearizability, we add an additional condition
to the sequential consistency implementation to ensure that the total order is
consistent with the real-time write order: the system ensures that only one
write occurs at a time and that the invalidation for that write propagates to all
other nodes before another write is allowed. This scheme can be implemented
by defining a write token that a node needs to possess. The management of
the token is carried out by routing rules. Before a write, a node checks if it
has the token (i.e. the writeBeforeBlock is set to B Action getToken). If it
gets the token, it continues with the write. If not, it blocks until it receives the
token. Like the sequential consistency implementation, the writeAfterBlock is
set to B Action invalidatedAll to ensure that the write completes only after
all nodes have acknowledged that they have received the update. Once the
node receives ACKS from all other nodes, it commits the update by using
the AssignSeq action and gives up the token so that another blocked write
can continue. The readNowBlock is set to isSequenced AND isValid AND
isComplete so that only valid, complete, and sequenced objects are read.
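The requester's side of the token management might be sketched as follows (the tokenHolder table and the tokenRequest and tokenGrant events are hypothetical; only the getToken B Action event is taken from the description above):

  materialize (tokenHolder, infinity, 1, keys(1)).
  t1 tokenRequest(@H, X, Obj) :- operationBlock(@X, Obj, Off, Len, BPoint, _),
      BPoint == "writeBeforeBlock", tokenHolder(@X, H).
  t2 getToken(@X, Obj) :- tokenGrant(@X, Obj).

When a write blocks at writeBeforeBlock, rule t1 asks the node currently believed to hold the token for it; when the grant arrives, rule t2 produces the getToken event that unblocks the write. The holder-side rules that queue requests, hand over the token after the outstanding write commits, and update tokenHolder are elided.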
5.4.6 TACT
TACT [94] defines a continuous consistency model based on 3 parame-
ters: numerical error, order error, and staleness. Numerical error defines the
maximum number of writes not seen by a replica (i.e. unseen writes). Order
error defines the maximum number of writes that have not established their
commit order at the local replica (i.e. uncommitted writes). Staleness de-
fines the maximum amount of time before a replica is guaranteed to observe
a write accepted by a remote replica (i.e. the maximum staleness of the data
observed).
Default conditions. The weakest semantics that TACT guarantees is causal
consistency. The default blocking predicates are set as follows: the readNowBlock
is set to isValid AND isComplete and the applyUpdateBlock is set
to isValid so that local data is always valid (for availability).
Numerical error. One way to support the numerical error constraint is to
block reads if the number of unseen writes exceeds the specified bound. The
routing policy is defined such that every node maintains information about
how many writes have occurred at other nodes that it has not yet received.
Whenever a node writes to an object, it informs other nodes of its write. The
writeAfterBlock predicate is set to B Action informedAll to ensure that the
write completes only when other nodes are aware of the write. Also, before
reading an object, the node checks with its routing policy to ensure that the
number of writes it has not seen does not exceed the numerical error bound.
This can be implemented by adding the condition B Action okToRead to the
readNowBlock. When the routing policy is informed of the read block due
to the B Action predicate, it checks to see if the numerical error bound is
satisfied. If so, it generates the okToRead message. If not, it will establish
data flows to ensure that the bound is satisfied before unblocking the
read.
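The read-side check might be sketched as follows, assuming (hypothetically) that the routing policy maintains an unseenTotal table counting writes it has not yet received and a bound table holding the numerical error bound, that okToRead is the B Action event named above, and that a <= comparison is available alongside ==:

  materialize (unseenTotal, infinity, 1, keys(1)).
  materialize (bound, infinity, 1, keys(1)).
  n1 okToRead(@X, Obj) :- operationBlock(@X, Obj, Off, Len, BPoint, _),
      BPoint == "readNowBlock", unseenTotal(@X, U), bound(@X, B), U <= B.

If the comparison fails, no okToRead event is produced and other rules (not shown) would establish data flows to fetch the missing writes and produce okToRead once the bound is met.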
Depending on the topology of the system, it might be more efficient
for a single primary node to keep track of the number of unseen writes in
the system. In that case, whenever a local write occurs or when a remote
write is received, the node informs the primary and before a read, the node
checks with the primary node to ensure that the read can continue. For this
scheme, the blocking conditions still remain the same, but the routing policy
is implemented differently.
An alternate way to implement the numerical error constraint is to not
accept a write unless it is known for certain that the write will not violate the
numerical error bound. In this scheme, the routing policy of every node (e.g.,
A) maintains an unseen writes vector that keeps track of the number of local
writes that have not been seen by another node (e.g., B), and a bound vector
that specifies the upper bound on the number of writes that A can accept without
B seeing those writes. The blocking policy defines the writeBeforeBlock to
B Action okToWrite and the writeAfterBlock to B Action informedAll. Note
that this scheme is equivalent to the implementation mentioned in [43].
Order error. In order to implement the order error bound, writes are blocked
if the number of local writes that are uncommitted violates the bound. For this
scheme, the routing policy at each node keeps track of how many local writes
have not been committed yet. The system can use any commit policy (e.g.,
a primary-commit scheme or Golding’s algorithm [31]) as long as the routing
policy is informed of the commit. The blocking policy adds to readNowBlock
the condition B Action okToRead. It is expected that the routing policy gen-
erates the okToRead message only if the number of uncommitted writes is
within the order error bound. If not, the routing policy initiates data flows so
that the bound can be satisfied.
Staleness. In order to implement the staleness bound, the maxStaleness(-1, -1, bound)
condition is added to the readNowBlock. This ensures that reads will be
blocked if the current replica is considered to be stale. Note this scheme
is similar to the one described in [43].
5.4.7 Two-phase Commit
The two-phase-commit protocol is often employed to commit transac-
tions among a set of nodes. One node is designated as the co-ordinator which
initiates the commitment protocol and the other nodes are considered as par-
ticipants. The protocol is executed in the following steps:
• Step 1: At the end of the transaction, the co-ordinator sends a vote
message to participants indicating that the transaction is ready for com-
mit.
• Step 2: On receiving the vote message, every participant replies with ei-
ther a yes message or a no message depending on whether it can commit
the transaction.
• Step 3: If the co-ordinator receives a yes message from all participants, it
commits the transaction and sends a commit message to the participants.
Otherwise, the co-ordinator aborts the transaction and sends an abort
message instead.
• Step 4: On receiving a commit or an abort message, the participant
either commits or aborts the transaction accordingly and sends an ack
message to the co-ordinator.
• Step 5: After receiving the ack messages, the protocol is considered
complete (i.e. the transaction is either committed or aborted on all
nodes).
In PADS, we implement the two-phase commit protocol for a single
write rather than for a transaction. The writer acts as the co-ordinator of the
protocol. We assume that invalidation and body subscriptions are established
between the writer and all other nodes. The blocking policy is set as follows:
the applyUpdateBlock is set to isValid, the writeAfterBlock is set to B Action
2PC-Complete and the readNowBlock is set to isValid and isCommitted. The
routing policy implements the protocol as follows:
• Step 1: When a write occurs, the update is sent to other nodes via
subscriptions. This update is considered as the vote message of the
protocol.
• Step 2: When the participant receives the update (i.e. the precise in-
validation), it sends a yes message to the co-ordinator (i.e. the writer).
Note that in this implementation, a participant cannot cast
a no vote.
• Step 3: Once the co-ordinator receives the yes message from all other
nodes, it will commit the write using the AssignSeq action. This action
will cause the underlying layer to generate a commit invalidation that
is sent to all other nodes via subscriptions. The commit invalidation
serves as the commit message of the protocol. If the co-ordinator does
not hear back from all nodes in time, it aborts the write. In PADS,
aborting a write is equivalent to not committing a write. In this case,
the co-ordinator simply does nothing.
• Step 4: On receiving the commit invalidation, the participants send an
ack message to the co-ordinator. Note that the underlying layer will au-
tomatically commit the write when the commit invalidation is received.
• Step 5: If the write was aborted in Step 3, the co-ordinator generates a
2PC-complete message to the blocking layer to unblock the write. Oth-
erwise, the co-ordinator waits for ack messages from the participants
before generating the 2PC-complete message for the blocking layer.
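Following the acknowledgement-counting pattern sketched in Section 5.3, the co-ordinator's side of Steps 3 and 5 might be written as below; allYes, allAcked, assignSeq, and twoPCComplete are hypothetical event names, and the counting rules that produce allYes and allAcked, as well as the participant-side rules, are elided:

  p1 assignSeq(@W, Obj, Off, Len, WId, T) :- allYes(@W, Obj, Off, Len, WId, T).
  p2 twoPCComplete(@W, WId, T) :- allAcked(@W, WId, T).

Rule p1 commits the write once every participant has voted yes, which causes the commit invalidation to flow to the participants; rule p2 produces the event that satisfies the writeAfterBlock predicate (B Action 2PC-Complete) once all acks have been received.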
5.5 Summary
In PADS, consistency and durability guarantees of a distributed storage
system are cast as conditions that must be satisfied before and after access
to data. These conditions are considered as part of the blocking policy of
the system. For specifying blocking policy, PADS defines five blocking points
in the data plane’s processing of read, write, and receive-update actions. At
each blocking point, a designer specifies a blocking predicate that indicates
when the processing of these actions must block. A blocking predicate can use
any combination of the four pre-defined blocking conditions and the extensible
B Action condition.
Chapter 6
The Mechanisms Layer
In PADS, the mechanisms layer has two main responsibilities: first, it
implements the primitives defined by the policy layer, and second, it ensures
that implementation is flexible and efficient. As a result, with the low-level
details taken care of by the mechanisms, system designers only need to focus
on writing higher level policy.
This chapter focuses on the primitives provided by the mechanisms layer in
order to support the policy API. A more complete discussion of the PRACTI
mechanisms layer on which PADS is based is available in Zheng’s disserta-
tion [96] as well as in another publication [16] and a technical report [97]. We
include a discussion of the mechanisms here for context and for completeness.
6.1 Overview
The policy layer has defined primitives that provide designers with
sufficient options to build a wide range of systems. The flexibility is useless
if the underlying mechanisms layer cannot support these options efficiently.
It is, therefore, imperative that the primitives are implemented such that the
constructed systems do not pay high costs for generality.
[Figure 6.1 depicts the internal structure of the mechanisms layer: a storage module
comprising an object store and an update log, accessed through a local access API for
read/write/delete operations; a communication module that implements subscriptions; and
a consistency bookkeeping module.]

Figure 6.1: Internal structures of the mechanisms layer.
The mechanisms layer must satisfy three requirements:
• For basic operation, it must provide locally accessible data storage for
any subsets of data.
• For routing policy specification, it must implement the subscription
primitive so that all PR-AC-TI properties can be supported.
• For blocking policy specification, it must maintain sufficient consistency-
related book-keeping so that blocking conditions can be supported.
In order to meet the above requirements, as depicted in Figure 6.1, the
mechanisms layer consists of the following parts:
• a storage module that stores data objects and information about updates
locally.
• a communication module that implements a flexible synchronization pro-
tocol in order to support subscriptions.
• a consistency bookkeeping module that keeps track of the consistency
state of objects stored locally.
We detail each module in subsequent sections.
6.1.1 Basic Model
Data objects. In PADS, data are stored as objects identified by unique ob-
ject identifier strings. Sets of objects can be compactly represented as interest
sets that impose a hierarchical structure on object IDs. For example, the inter-
est set “/a/*:/b” includes object IDs with the prefix “/a/” and also includes
the object ID “/b”. An interest set is also used to represent the set of objects a node is interested in replicating locally.
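As an illustration of this naming convention, the following small helper (ours, not part of the PADS interface) checks whether an object ID falls within an interest set written in the "/a/*:/b" style used above.

// Illustrative interest-set membership check for specifications such as "/a/*:/b".
class InterestSet {
    private final String[] patterns;

    InterestSet(String spec) {
        this.patterns = spec.split(":");
    }

    boolean contains(String objId) {
        for (String p : patterns) {
            if (p.endsWith("/*")) {
                // "/a/*" matches every object ID with the prefix "/a/".
                if (objId.startsWith(p.substring(0, p.length() - 1))) return true;
            } else if (objId.equals(p)) {
                // "/b" matches exactly the object ID "/b".
                return true;
            }
        }
        return false;
    }
}

For example, new InterestSet("/a/*:/b").contains("/a/x") and new InterestSet("/a/*:/b").contains("/b") both return true, while the same set does not contain "/c".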
Keeping time. The mechanisms layer relies heavily on Lamport's clocks [52] and version vectors to keep track of logical time and consistency information. Every node maintains a time stamp, lc@n, where lc is a logical counter and n is the node identifier. To allow events to be causally ordered, the time stamp is incremented whenever a local update occurs and advanced to exceed any observed event whenever a remote update is received. Every node also maintains a version vector, currentVV, that indicates all the local or remote updates it is aware of. The currentVV is a representation of the current logical time of the node.
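The timekeeping rules above can be summarized by the following sketch; the class and method names are illustrative rather than the actual PADS implementation.

import java.util.HashMap;
import java.util.Map;

// Sketch of per-node timekeeping: a Lamport counter (lc@n) plus currentVV.
class LogicalClock {
    private long counter = 0;                              // lc in lc@n
    private final String nodeId;                           // n in lc@n
    private final Map<String, Long> currentVV = new HashMap<>();

    LogicalClock(String nodeId) {
        this.nodeId = nodeId;
    }

    // A local update increments the counter and is recorded in currentVV.
    long onLocalUpdate() {
        counter++;
        currentVV.merge(nodeId, counter, Math::max);
        return counter;
    }

    // A remote update lc@n advances the local counter past the observed event
    // and is folded into currentVV.
    void onRemoteUpdate(String remoteNode, long remoteCounter) {
        counter = Math.max(counter, remoteCounter) + 1;
        currentVV.merge(remoteNode, remoteCounter, Math::max);
    }
}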
Invalidations and bodies. Whenever an object is updated, an invalidation and a body are generated. An invalidation contains the object ID and the logical time of the update. A body contains the actual data of the update in the case of a write. When an object is assigned a sequence or is deleted, only an invalidation is generated.
6.2 Local Storage Module
The storage module stores data and updates in an object store and an update log. It also provides an API through which applications can access the data stored on a node.
6.2.1 Update Log and Object Store
The update log stores local or received invalidations in causal order. In
order to prevent the log from becoming arbitrarily large, the node truncates
older portions of the log when the log hits a locally configurable size limit and
maintains a version vector, omitVV, to keep track of the cut-off time.
The object store stores the latest bodies of objects the node chooses to replicate, along with per-object meta-data such as the logical time writeTimeStamp, the real time realTimeStamp, and additional consistency-related state (refer to Section 6.4) for the last-known write to each byte range. When an object is deleted, the body is deleted from the object store but the delete time, deleteAS, is maintained.
Invalidation type        Components
Write Invalidation       objId, offset, length, writeTimeStamp
Commit Invalidation      objId, targetTimeStamp, commitTimeStamp
Delete Invalidation      objId, deleteTimeStamp
Imprecise Invalidation   targetSet, startVV, endVV
Body                     objId, offset, length, writeTimeStamp, data
Table 6.1: Components of invalidations and bodies
6.2.2 Local Access API
Data stored on a node can be accessed via four methods: read, write, delete, and commit. The read method requires the object ID and the byte range, i.e., <objId, offset, length>. When a write occurs, it is assigned a logical time stamp, writeTimeStamp, and a real time stamp, realTimeStamp. When an object is deleted, a logical time stamp, deleteTimeStamp, is assigned. The commit method takes an object ID and the time stamp, targetTimeStamp, of the update it wants to commit. The commit operation is also assigned a time stamp, commitTimeStamp, to indicate the logical time of the commit.
The mechanisms also support multi-object writes. For ease of discussion, we only consider single-object writes to whole objects rather than to specific byte-ranges.
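For reference, the four methods can be summarized by an interface along the following lines; the exact PADS signatures may differ, and the types shown (for example, reducing logical timestamps to longs) are placeholders.

// Sketch of the local access API; timestamps are assigned internally by the store.
interface LocalStore {
    byte[] read(String objId, long offset, long length);
    void write(String objId, long offset, long length, byte[] data);  // assigns writeTimeStamp and realTimeStamp
    void delete(String objId);                                        // assigns deleteTimeStamp
    void commit(String objId, long targetTimeStamp);                  // assigns commitTimeStamp
}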
Whenever an object is updated (via a write, delete, or commit), an invalidation and possibly a body are generated. Table 6.1 summarizes the different
types of invalidations and bodies and their components.
When a write occurs, a write invalidation and a body are generated.
A write invalidation consists of four fields. The first three fields specify the object and byte-range that were updated. The last field, writeTimeStamp, specifies the time at which the object was updated. A body has an additional field, data, that stores the actual contents of the update.
When an object is updated via a commit, a commit invalidation consisting of the object ID, objId, the time of the committed write, targetTimeStamp, and the commit time, commitTimeStamp, is generated. When an object is deleted, a delete invalidation consisting of the object ID, objId, and the delete time, deleteTimeStamp, is generated.
Another type of invalidation is an imprecise invalidation. It summarizes all the updates that have occurred to a set of objects, targetSet, between a start time, represented by a partial version vector startVV, and an end time, another partial version vector endVV. Imprecise invalidations are used to transfer update summaries from one node to another (refer to Section 6.3). In contrast, write invalidations, commit invalidations, and delete invalidations are considered precise invalidations because they contain the complete meta-data of an update.
6.3 Communication Module
Propagating data and meta-data among nodes is an important design
aspect of a distributed storage systems. In PADS, it is considered part of
the routing policy. In order to aid the specification of various types of data
propagation, the policy API defines subscription primitive—a uni-directional
73
stream of updates from one node to another. A subscription is associated with
a subscription set, SS, that specifies the set of objects a node is interested
in synchronizing and a start time, startV V , which specifies that only the
updates that have occurred after that time should be sent. It can be seen as
an agreement that the sender will send all updates that it is aware of and that
have occurred after the start time to the objects in the subscription set.
The communication module, in the mechanisms layer, implements a
synchronization protocol that is able to provide all the PR-AC-TI and flexi-
bility properties that subscriptions require and is sufficiently efficient to keep
the costs for generality to a minimum. In particular, the protocol
• sends information that is proportional to the required updates on a syn-
chronization stream (i.e. supporting PR)
• allows any node to establish a stream with any other node (i.e. support-
ing TI)
• sends sufficient information to implement flexible consistency guarantees
but that information is kept to a minimum (i.e. supporting AC)
• supports separate invalidation and body subscriptions
• is incremental
• supports both log-based and state-based synchronization
• efficiently supports dynamic subscription establishment between two nodes
Subsequent subsections detail the implementation of the protocol.
6.3.1 Basic Subscriptions
Suppose all the objects stored on Node A lie in the interest set A.IS, and Node A knows about all updates to A.IS up to its current time, A.currentVV. Node A wants to receive new updates to A.IS and requests a subscription from Node B with A.IS as the subscription set and A.currentVV as the start time.

If Node B stores the same objects as A, i.e. A.IS = B.IS, synchronization is simple. Node B just sends all the updates that have occurred from A.currentVV to B.currentVV in causal order. A causal stream is used because it provides flexibility for applications to implement a wide range of weaker or stronger consistency guarantees with little additional overhead. Also, a causal stream is incremental, i.e., in case of disconnection, synchronization can continue where it left off.
A subscription can be seen as having two phases: a catchup phase, in which the sender, Node B, sends all updates to objects in the subscription set from the start time, startVV, to the sender's currentVV, and a connected phase, in which the sender forwards any new updates it receives until the subscription is removed (i.e. dismantled).
6.3.2 Separate Invalidation and Body Subscriptions.
In order to provide greater flexibility, PADS separates meta-data and
data synchronization by having separate invalidation and body subscriptions.
Invalidation subscriptions propagate the meta-data of updates that occurred after startVV in causal order. They satisfy the prefix property—if an invalidation, I1, is sent before another invalidation, I2, then I1 causally precedes I2. Body streams are simply unordered streams of the bodies of updates that occurred after startVV. Ordering of body streams is unnecessary because received bodies are applied to the object store only after their corresponding invalidations are received. The causality of invalidation streams is sufficient to guarantee causal consistency.

In fact, the unordered-ness of body subscriptions has its benefits. First, it allows the propagation of high-priority updates before other updates, improving performance and availability. Second, for frequently updated objects, if multiple bodies for the same object byte-range are sitting in the send queue, only the latest body need be sent, reducing bandwidth [64].
An invalidation stream is associated with a subscription set, SS, a start time, stream.startVV, and a stream time, stream.VV, that keeps track of the logical time progression of the stream. Since a body stream is unordered, it is associated with a subscription set, SS, and a start time, stream.startVV.
6.3.3 Supporting Flexible Consistency
Say a node wants to receive updates to a subset of data from another
node. In addition to the updates, it is necessary to send enough information
so that systems can enforce whatever level of consistency they require.
Consider the scenario in which an invalidation subscription is established from Node B to Node A for the subscription set A.IS. Because of partial replication, however, different nodes may store different subsets of data. In other words, Node B may not store all the objects that Node A wants. In addition to sending all the updates to objects in A.IS, Node B must warn Node A if there are any causal gaps – updates that have occurred to objects that Node A cares about but Node B does not have. In this case, Node B sends an imprecise invalidation. Imprecise invalidations can be seen as a summary of multiple updates. They summarize updates that occurred to a set of objects, targetSet, between a start time, startVV, and an end time, endVV. Note that the start and end times are partial version vectors rather than full version vectors. Also note that imprecise invalidations are only sent over invalidation subscriptions and not over body subscriptions.
Imprecise invalidations must be sent in two cases. First, when the sender, Node B, does not know about the updates the receiver, Node A, has asked for. Second, imprecise invalidations are sent for updates to objects Node A did not ask for, so that Node A can pass on the knowledge of the gaps in its information to another node, Node C, that it synchronizes with in the future. It is this propagation of imprecise invalidations that allows PADS to simultaneously support partial replication and topology independence while still supporting a broad range of consistency semantics.
6.3.4 State-based Synchronization
PADS also supports state-based synchronization by sending a checkpoint of the final state of objects in an interest set instead of sending an ordered log of updates. A checkpoint consists of an imprecise invalidation for A.IS from A.stream.startVV to B.currentVV and the latest meta-data of objects in A.IS updated between stream.startVV and B.currentVV.
Log and checkpoint synchronizations have different tradeoffs. The
bandwidth requirement for log synchronization is always proportional to the
number of updates that occurred to objects in the subscription set. Log syn-
chronization is useful for incrementally and continuously exchanging updates
between pairs of nodes. On the other hand, the bandwidth requirement for
checkpoint synchronization is proportional to the number of objects updated.
Hence, for a small, frequently updated subscription set, checkpoint synchronization might be a better option. Also, log synchronization is impossible if the update log has been truncated beyond the start time of the subscription, i.e., the subscription requires invalidations older than what is currently stored in the log. In that case, the only option is to fall back to checkpoint synchronization.
6.3.5 Supporting Efficient Dynamic Subscriptions
Next, what if Node A decides that it wants to synchronize more objects? Say Node A already has an invalidation subscription established for A.IS from Node B and it wants to get updates for objects in A.newIS from the time A.newIS.startVV.
[Figure 6.2 (diagram): two timelines of an invalidation stream from a sender to a receiver. Without multiplexing, separate streams for SS = /x/* and SS = /y/* each carry their own imprecise invalidations, so redundant information is sent. With multiplexing, a single stream for SS = /x/*:/y/* pauses, runs a catchup for /y/* that sends the missing updates, and then continues.]

Figure 6.2: Diagram comparing the messages sent on invalidation streams without and with multiplexing.
A simple approach would be to establish a new, separate stream for A.newIS. The problem is that a lot of redundant information would be sent—in order to ensure that each stream provides a causally ordered series of events from its stream.startVV, each must include imprecise invalidations for any omitted events, and Node A may receive imprecise invalidations for the same updates twice. If Node A creates a large number of subscriptions, the redundant information sent can be quite substantial.
PADS reduces the overhead of dynamic synchronization requests by multiplexing subscription requests onto a single stream. Since updates on the stream are sent in causal order, simply sending updates to A.newIS from the current logical time of the stream, stream.VV, will not work because that would
not “fix the gap” that Node A has for A.newIS from A.newIS.startVV to stream.VV. Instead, Node B “pauses” the stream at stream.VV and sends a catchup stream that includes all updates for A.newIS from A.newIS.startVV to stream.VV that Node B knows about. After catchup, Node B continues sending invalidations for A.IS and A.newIS, and gap markers for everything else, starting from stream.VV. Figure 6.2 illustrates the multiplexing of the stream.
6.3.6 Summary
The following list highlights the features that enable subscriptions to
be flexible and efficient.
• Because subscriptions are peer-to-peer streams, they are able to support
arbitrary synchronization topologies.
• The synchronization set for each subscription is not pre-defined. Hence,
nodes can synchronize arbitrary subsets of data.
• By separating invalidation and body subscriptions, PADS provides the
flexibility to propagate meta-data and data via separate paths.
• The causal propagation of updates provides systems with the flexibility
to build a wide range of consistency semantics and allows synchronization
to make incremental progress between failures.
• By supporting both log-based and state-based synchronization, subscriptions allow systems to make appropriate performance tradeoffs.
• By summarizing unwanted meta-data, PADS ensures that updates are
propagated with minimal overhead while supporting partial replication
and guaranteeing causal consistency.
• By multiplexing subscription requests over a single stream, PADS en-
sures that dynamic requests for synchronizations are efficiently handled.
6.4 Consistency Bookkeeping Module
The responsibility of the bookkeeping module is to maintain sufficient local state to support the primitives (i.e. blocking conditions) defined by the blocking policy API. The advantage of maintaining this state in the mechanisms layer is that it relieves designers from writing tricky bookkeeping code. Also, since the state maintained is general, designers can quickly adapt the system if consistency requirements change.
In this section, we discuss the state maintained by the module and how
it is updated as messages are received.
6.4.1 Dimensions of Local Consistency State
In PADS, every node keeps track of the consistency-related state of the
objects it stores. Consistency has three dimensions:
• Preciseness: An interest set is precise if the node has received all the precise invalidations affecting the objects in the interest set up to the current logical time, currentVV. An interest set is imprecise if the node has received imprecise invalidations for any subset of the interest set. An object is precise if it lies in a precise interest set.
• Validity: An object is valid if the object store contains the body of the last received write invalidation; otherwise it is invalid.
• Commit: An object is committed if the commit operation or a commit invalidation has been received for the last-known write update.
In fact, different consistency semantics can be guaranteed by reading
objects in specific states. For example, causal consistency can be easily guar-
anteed by accessing only precise and valid objects. Preciseness guarantees that
the node is aware of the latest causal update to the object and validity implies
that the node is storing the body of that update. On the other hand, systems
that do not require causal consistency have the option of accessing data even
from imprecise interest sets.
6.4.2 Local Consistency State
In order to keep track of the consistency of local objects, a node, Node
A, maintains the following state:
• currentVV: Node A maintains a version vector that indicates the latest update it has seen, either due to a precise invalidation or to an imprecise invalidation. This implies that Node A is not aware of any updates after currentVV.
• currRealVV: Node A also maintains a vector of the real timestamps, rather than the logical timestamps, of the updates it has received. currRealVV can be seen as the real-time equivalent of currentVV.
• stream.VV: For every invalidation stream, Node A maintains a logical time that includes the last update and all causally preceding updates received on the stream. It implies that Node A has seen, either via precise or imprecise invalidations, all events from stream.startVV to stream.VV.
• IS.noGapVV: For every interest set, Node A maintains a noGapVV indicating that Node A has seen all updates, and no imprecise invalidations, to the interest set up until this time. For a particular interest set IS1, if IS1.noGapVV < currentVV then the interest set is considered imprecise – Node A is missing one or more invalidations that affect IS1 between IS1.noGapVV and currentVV. Hence, causal consistency cannot be assured for reads of objects in IS1.
• obj.writeTimeStamp: For every object currently stored, Node A stores the logical timestamp of the latest write invalidation it has received for the object.
• obj.realTimeStamp: For every object currently stored, Node A stores the real timestamp of the latest write invalidation it has received for the object.
Blocking Condition              Consistency Module Equivalent
isValid                         obj.isValid = true
isComplete                      IS.precise = true
isSequenced                     obj.isCommitted = true
maxStaleness(nodes, count, t)   count values corresponding to nodes in currRealVV are less than
                                t milliseconds old

Table 6.2: Mapping between blocking conditions and the local consistency state maintained. obj represents the object for which the condition is specified and IS represents the interest set enclosing the object.
• obj.isValid: A flag, stored for every object, that indicates whether the node stores the body of the last received write invalidation for the object. If the isValid flag is not set, the object is considered invalid. Causal consistency cannot be assured for a read of that object, because the body is older than the last invalidation received.

• obj.isCommitted: A flag, stored for every object, that indicates whether the node has received a commit invalidation for the last write invalidation it has received for the object. If the isCommitted flag is set, the object is considered committed.
Note that object-related consistency state is stored in the object store and interest-set-related state is stored in a hierarchical structure called ISSTATUS.

The state maintained by the module is sufficient to implement the blocking conditions defined by the blocking policy API. Table 6.2 shows how the blocking conditions can be implemented.
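A sketch of how these mappings could be evaluated from the bookkeeping state is given below. The helper class and the flat version-vector representation are our own simplifications; the actual module works directly over the state described above.

import java.util.Map;

// Illustrative evaluation of the blocking conditions in Table 6.2.
class BlockingConditions {
    // isValid: the store holds the body of the last received write invalidation.
    static boolean isValid(boolean objIsValid) {
        return objIsValid;
    }

    // isSequenced: a commit invalidation has been received for the last-known write.
    static boolean isSequenced(boolean objIsCommitted) {
        return objIsCommitted;
    }

    // isComplete: the enclosing interest set is precise, i.e. noGapVV has caught up to currentVV.
    static boolean isComplete(Map<String, Long> noGapVV, Map<String, Long> currentVV) {
        for (Map.Entry<String, Long> e : currentVV.entrySet()) {
            if (noGapVV.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    // maxStaleness(nodes, count, t): at least `count` of the given nodes have
    // entries in currRealVV that are less than t milliseconds old.
    static boolean maxStaleness(Map<String, Long> currRealVV, String[] nodes, int count, long t) {
        long cutoff = System.currentTimeMillis() - t;
        int fresh = 0;
        for (String n : nodes) {
            if (currRealVV.getOrDefault(n, 0L) >= cutoff) fresh++;
        }
        return fresh >= count;
    }
}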
6.4.3 Updating the State
When a receiver receives messages on an invalidation or body stream,
in addition to updating the log and the store, the key job for the receiver is to
make sure that the consistency state is correctly updated. We describe below
how incoming invalidation streams and body streams are processed.
Processing incoming invalidation streams. In order to eliminate the need to update IS.noGapVV every time an invalidation is received, we introduce the concept of “attaching” an interest set to a stream. An interest set is “attached” to a stream if no imprecise invalidations for the interest set have been received on the stream, i.e., IS.noGapVV includes stream.VV. When an invalidation is received, only stream.VV needs to be updated. The node also keeps track of which streams an interest set is attached to by maintaining an IS.attachedStreams set. If an imprecise invalidation, IMP, is received on a stream, IMP.targetSet is “detached” from the stream by explicitly storing its noGapVV for the interest set in ISSTATUS. The details of how an invalidation stream is processed are described in Figures 6.3 and 6.4.
Processing incoming body streams. Processing a body stream is simple. When a node receives a body, it checks whether body.writeTimeStamp matches the local timestamp for the object, obj.writeTimeStamp. If there is a match, the body corresponds to the latest received invalidation for the object, so the body is applied to the store and the obj.isValid flag is set to true.
if received message is a subscription start message, SubStart then
    // set up subscription stream:
    stream.SS ⇐ subStart.SS
    stream.VV ⇐ subStart.VV
else if received message is a write invalidation, I then
    // update log, timing state, and per-object state:
    store I in update log
    update stream.VV and currentVV to include I.writeTimeStamp
    update currRealVV to include I.realTimeStamp
    obj.writeTimeStamp ⇐ I.writeTimeStamp
    obj.realTimeStamp ⇐ I.realTimeStamp
    obj.isValid ⇐ false
    obj.isCommitted ⇐ false
else if received message is a commit invalidation, CI then
    // update log, timing state, and per-object state:
    store CI in update log
    update stream.VV and currentVV to include CI.commitTimeStamp
    update currRealVV to include CI.realTimeStamp
    // if the target timestamp matches the current timestamp, update the commit flag:
    if obj.writeTimeStamp equals CI.targetTimeStamp then
        obj.isCommitted ⇐ true
    end if
else if received message is a delete invalidation, DI then
    // update log, timing state, and per-object state:
    store DI in update log
    update stream.VV and currentVV to include DI.deleteTimeStamp
    update currRealVV to include DI.realTimeStamp
    obj.deleteTimeStamp ⇐ DI.deleteTimeStamp
else if received message is an imprecise invalidation, IM then
    // update log, timing state, and interest set state:
    store IM in update log
    update stream.VV and currentVV to include IM.endVV
    update currRealVV to include IM.realEndVV
    // check for intersecting interest sets:
    IIS ⇐ stream.SS ∩ IM.targetSet
    if IIS ≠ ∅ then
        // detach IIS from the stream
        IIS.noGapVV ⇐ min(IIS.noGapVV, IM.startVV − 1)
        stream.SS ⇐ stream.SS \ IIS
        remove stream from IIS.attachedStreams
    end if
end if

Figure 6.3: Pseudocode for processing received invalidations.
if received message is a checkpoint, CP then
    // apply received meta-data to local structures
    for all IS in CP do
        update local IS.noGapVV to include CP.IS.noGapVV
    end for
    for all object metadata in CP do
        if CP.obj.metadata is newer than local.obj.metadata then
            update local.obj.metadata to include CP.obj.metadata
        end if
    end for
else if received message is a catchup start message, CStart then
    // switch to catchup mode
    stream.pendingSS ⇐ CStart.SS
    stream.pendingVV ⇐ CStart.VV
    for all invalidations or gap markers received do
        process as above, except update pendingVV instead of stream.VV
    end for
else if received message is a catchup end message, CEnd then
    // switch to normal mode
    if stream.pendingVV equals or includes stream.VV then
        // attach stream.pendingSS to the stream
        stream.SS ⇐ stream.SS ∪ stream.pendingSS
        add stream to consistencyModule.pendingSS.attachedStreams
    end if
end if

Figure 6.4: Pseudocode for processing other received messages on invalidation streams.
If the body is older than the timestamp, the body is discarded. If the body is newer than the timestamp, its corresponding invalidation has not yet been received. Instead of discarding it, the body is stored in a body buffer and applied to the store when its corresponding invalidation arrives.
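The body-application rule can be summarized by the following sketch; the classes and the buffering strategy are illustrative, and timestamps are reduced to comparable longs for brevity.

import java.util.HashMap;
import java.util.Map;

// Sketch of applying a received body against the locally stored write timestamp.
class BodyProcessor {
    static class ObjMeta {
        long writeTimeStamp;
        boolean isValid;
        byte[] body;
    }

    private final Map<String, ObjMeta> store = new HashMap<>();
    private final Map<String, byte[]> bodyBuffer = new HashMap<>();  // bodies waiting for their invalidation

    void onBody(String objId, long bodyTimeStamp, byte[] data) {
        ObjMeta obj = store.get(objId);
        if (obj == null || bodyTimeStamp > obj.writeTimeStamp) {
            // Newer than the last invalidation: buffer until the invalidation arrives.
            bodyBuffer.put(objId + "@" + bodyTimeStamp, data);
        } else if (bodyTimeStamp == obj.writeTimeStamp) {
            // Matches the last received invalidation: apply and mark valid.
            obj.body = data;
            obj.isValid = true;
        }
        // Older than the last invalidation: discard.
    }
}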
6.5 Conflict Detection
Conflict detection is an important feature for distributed storage sys-
tems. An object may be independently updated on multiple nodes leading to
diverging versions. Updates to the same byte of the same object are consid-
ered to be conflicting if there is no causal relationship between them. Such
conflicts need to be detected so that appropriate resolution, either automatic
or manual, can be invoked to resolve the differences and achieve eventual consistency [49, 86].
Existing synchronization protocols [16, 33, 45, 59] often store or trans-
mit extra information for the sole purpose of conflict detection. For mobile
environments where network bandwidth and storage capacity can be limited,
it is important that such information be minimal. Fortunately, the mecha-
nisms layer in PADS can detect conflicts without having to store or transmit
any extra information – it simply derives the information it needs from the
state already maintained for consistency.
Conflict detection is carried out as described below. If no conflict is detected, the received update is applied; otherwise the conflict flag is set and all the information is stored in a special file for resolution. PADS provides mechanisms for conflict detection, but it leaves conflict resolution to system-specific policies. For convenience, PADS provides a default last-writer-wins policy.
Dependency summary vectors. PADS uses a dependency summary vec-
tor (DSV) scheme for conflict detection. A dependency summary vector (DSV)
is a vector associated with an update that summarizes all the causally preced-
ing updates to the object being updated.
In particular, a DSV of an update U,
• includes the timestamp of all causally preceding updates to the object.
• may include the timestamp of the current update, U .
• may include the timestamps of updates to other objects.
• excludes any updates that are causally ordered after U .
Note that there is not necessarily a unique DSV for a single update. For example, suppose all the causally ordered updates on an object are (1@A), (3@A), (10@B). Two legal DSVs for the second update, (3@A), are <1@A, 9@B> and <2@A, 6@B>, but <0@A, 9@B> and <3@A, 10@B> are not, because the former does not include the first update and the latter does not exclude the third update.
Conflict detection simply requires comparing the write times and the DSVs of two updates. To detect whether two different updates, U1 and U2, to the same object conflict, we carry out the following comparisons: if U1.ts is included in U2.dsv, then U1 causally precedes U2, by definition. Similarly, if U2.ts is included in U1.dsv, then U2 causally precedes U1. Otherwise, U1 and U2 are marked as conflicting.
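The comparison can be written compactly as follows. Representing timestamps (lc@n) and DSVs as per-node counters is a simplification of ours, but the logic mirrors the rule just described.

import java.util.Map;

// Sketch of the DSV-based conflict check between two updates to the same object.
class ConflictDetector {
    // A timestamp lc@n is included in a DSV if the DSV's entry for n is at least lc.
    static boolean includes(Map<String, Long> dsv, long lc, String node) {
        return dsv.getOrDefault(node, 0L) >= lc;
    }

    // Two updates conflict if neither causally precedes the other.
    static boolean conflict(long ts1, String node1, Map<String, Long> dsv1,
                            long ts2, String node2, Map<String, Long> dsv2) {
        boolean u1PrecedesU2 = includes(dsv2, ts1, node1);  // U1.ts included in U2.dsv
        boolean u2PrecedesU1 = includes(dsv1, ts2, node2);  // U2.ts included in U1.dsv
        return !u1PrecedesU2 && !u2PrecedesU1;
    }
}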
Deriving DSVs. It would be inefficient to transmit a DSV with each update and store a DSV with each object. PADS therefore derives DSVs from the meta-data already maintained by the synchronization protocol. In order to do so, it ensures that a node is aware of all previous updates to an object before the object is updated. A new update (a local write or a received invalidation) can only be applied if there is no gap in the object's update information (i.e. the enclosing interest set is precise).

By definition, the noGapVV of an interest set covers all the causally preceding updates to the objects in the interest set up to that time. Hence, for an object in the interest set, noGapVV includes all the causally preceding updates to that object. If the interest set is precise, then the noGapVV and the DSV are equal to currentVV.
To determine the DSVs of received invalidations, PADS takes advantage of the causal property of the stream: for a received invalidation, all causally preceding updates have already been received, and any newer updates will not arrive before the current invalidation. streamVV therefore includes the current update and all causally preceding updates. The DSV for invalidations received in the connected phase is streamVV. For invalidations received during log synchronization, the DSV is pendingVV, and for updates received via a checkpoint, the DSV is the received noGapVV.
Detecting conflicts. PADS detects conflicts by comparing the timestamp
and the DSV of the received invalidation with the locally stored object times-
tamp and the noGapV V of the enclosing interest set. Conflicting updates are
flagged and logged for the application or the user to handle.
6.6 Summary
The PADS mechanisms layer provides a set of common primitives so
that designers can simply invoke these mechanisms to construct systems with-
out having to spend the effort to reimplement them. In particular, the mech-
anisms layer implements (1) an object store and an update log for storing and
accessing data objects locally, (2) subscriptions for flexible and efficient update
propagation between nodes, (3) bookkeeping of consistency state that can be
used to implement various consistency semantics, and (4) a conflict detection
scheme so that concurrent updates can be eventually detected and resolved
without additional overhead.
Chapter 7
Bridging the Gap
The mechanisms layer implements the set of primitives required by the PADS policy layer. Unfortunately, the mechanisms API does not match the policy API. This chapter describes how that gap is bridged. In particular, it briefly describes the API exposed by the mechanisms layer, explains why development on the raw API is difficult, and presents the modules implemented to transform the raw API into the routing and blocking policy APIs.
7.1 Mechanisms API
The mechanisms layer implements sufficient primitives that a distributed storage system can be built directly over it. It provides an extensive API consisting of 10 actions (listed in Table 7.1) that can be used to invoke mechanisms primitives, 37 triggers (listed in Table 7.2) that provide information about events or messages received by the mechanisms layer, and 6 flags at the read/write interface (listed in Table 7.3) to specify consistency semantics. In order to build a distributed storage system over the raw interface, a system designer specifies the system design by setting consistency flags at the read and write interface and defining a Controller that implements methods to handle the received triggers.
Type                    List of Actions

Subscription Related    subscribeInval, subscribeBody, subscribeCheckpoint,
                        removeSubscribeInval, removeSubscribeBody, removeConnection

Other Commands          issueDemandRead, requestSync, getCurrentVV, unbind

Table 7.1: Actions available to policy
Unfortunately, it is difficult to build a storage system directly on the exposed interface, for five reasons:
First, the API exposes too many abstractions. The API exposes at
least 6 types of streams that send different types of information. For example,
bodies are sent on a body stream whereas information about node state is
sent over a sync-reply stream. Also, the API differentiates between streams
and subscriptions—a subscription is instantiated over a stream which may be
serving multiple subscriptions. On the contrary, the PADS routing API only
exposes the single abstraction of a subscription keeping things simple.
Second, the differences between the triggers received are subtle and require an understanding of the implementation of the underlying layer. Consider the two triggers informInvalStreamInitiated and informSubscribeInvalSucceeded. The first one indicates that an invalidation stream was initiated between two nodes, and the second one indicates that an invalidation subscription was successfully established on a stream. If a designer is not careful, it is easy to confuse the two triggers.
Type                                 List of Triggers

Stream Initialization                informInvalStreamInitiated, informCommitStreamInitiated,
                                     informBodyStreamInitiated, informUnbindStreamInitiated,
                                     informSyncRplyStreamInitiated, informCheckpointStreamInitiated,
                                     informOutgoingInvalStreamInitiated,
                                     informOutgoingBodyStreamInitiated,
                                     informOutgoingCheckpointStreamInitiated

Stream Termination                   informInvalStreamTerminated, informCommitStreamTerminated,
                                     informBodyStreamTerminated, informUnbindStreamTerminated,
                                     informSyncRplyStreamTerminated, informCheckpointStreamTerminated,
                                     informOutgoingInvalStreamTerminated,
                                     informOutgoingBodyStreamTerminated,
                                     informOutgoingCheckpointStreamTerminated

Message Received                     recvSyncReply, informReceiveInval, informReceivePushBody,
                                     informReceiveDemandReply, informCheckpointStreamReceiveStatus

Local Events                         informLocalWrite, informLocalDelete,
                                     informLocalReadImprecise, informLocalReadInvalid

Demand Read Related                  informDemandReadMiss, informDemandReadHit,
                                     informDemandImprecise

Consistency Related                  informBecameImprecise, informBecamePrecise

Invalidation Subscription Related    informGapExistForSubscribeInv, informSubscribeInvalFailed,
                                     informSubscribeInvalSucceeded,
                                     informOutgoingSubscribeInvalInitiated,
                                     informOutgoingSubscribeInvalTerminated

Body Subscription Related            informSubscribeBodySucceeded, informSubscribeBodyRemoved,
                                     informOutgoingSubscribeBodyInitiated,
                                     informOutgoingSubscribeBodyTerminated

Table 7.2: Triggers sent from the underlying layer to the controller.
Operation   Available Flags
Read        blockInvalid, blockImprecise, maxTemporalError
Write       priority, bound, targetOrderError

Table 7.3: Flags available for specifying consistency
Another pair of triggers that can be easily mixed up is the informReceivePushBody and informReceiveDemandReply
triggers. Both indicate that a single body was received. However, the first
one implies that the sender of the body initiated the transfer and the sec-
ond one implies that the body was sent in response to a request. The PADS API combines the two triggers into a single sendBodySuccess trigger, reducing opportunities for misuse.
Third, the level of detail exposed by some triggers is often not required
in practice. For example, the triggers informDemandReadMiss and informDe-
mandReadImprecise indicate that a request for a body from another node could
not be satisfied because the local copy was either invalid (for the first trigger)
or imprecise (for the second trigger). In all the systems we constructed with
PADS, we did not need this level of detail—it was sufficient to know that
a remote body request could not be locally satisfied but the reason was not
necessary.
Fourth, despite its size, the API is missing some important triggers and consistency flags. Even though write operations expose a flag to set the TACT order error constraint, there is no trigger that informs policy when the constraint is not satisfied. Also, the default range of consistency semantics supported is limited because the API only supports two of the three TACT parameters (it does not support the numerical error constraint).
Fifth, and most importantly, the API puts a lot of burden on developers. Because the API does not impose a structure on system specification (i.e. no separation of policy into routing and blocking), programmers are left to decipher the different parts of system design (i.e. consistency, durability, and information propagation) and then pick the right methods in the large API to implement their systems. The API also does not make it easy to write a customized consistency implementation—a designer would need to explicitly implement a wrapper that blocks read/write operations, sends a new trigger to the controller, receives messages from the controller, and unblocks the operation when appropriate.
7.2 Making it Easier to Use
For ease of development, we convert the raw API to the PADS policy API. In particular,
• we expose a much smaller API to developers. We trim excessive methods
and add methods to make the API more complete.
• we introduce a separation of concerns by splitting policy specification into blocking and routing, making systems easier to design and implement.
!"#$%&'
()*+,-./0/'
"1('&-2)34,*)'
"156)3789'":-;0)'
%3,-/<,2)='"8:;-9'!8<.*>'
"),='
?3.2)'
@:A/*3.B;8-/'Triggers Actions
Blocking Triggers/
Action
C!(' C!('
C"C' C"C'
Figure 7.1: Bridging the mechanisms and policy layers.BPM stands for BlockingPolicyModule and BRB stands for BlockingRoutingBridge.
• we implement common tasks such as retries on subscription establish-
ment failure and communication between routing and blocking policies
so that designers need not reimplement them.
In order to support the PADS API, we implement four modules, as illustrated by Figure 7.1: (1) the Routing/Mechanisms (R/M) interface, which translates the API exposed by the mechanisms layer to the routing policy API; (2) the R/OverLog runtime, which translates R/OverLog policies into Java and executes them over the mechanisms; (3) the BlockingPolicyModule (BPM), which acts as a wrapper layer to intercept reads, writes, and update applications in order to support blocking points and expose the pre-defined blocking conditions; and (4) the BlockingRoutingBridge (BRB), which allows routing policies and blocking policies to communicate with each other. We provide details of each module in subsequent subsections.
7.2.1 R/M Interface
!"#$%&'(")*+,'-(.'
/0+12%*34'-(.'
R/M
Interface
Policy
Mechanisms
Action Queue
Trigger Queue
Figure 7.2: Internal queues in the R/M Interface
The R/M interface is the middle layer between the policy and mechanisms layers and serves three purposes: to translate the mechanisms API to the concise PADS routing API, to prevent deadlocks between the policy and mechanisms layers, and to implement common tasks.
Translating the API. As described earlier, the API exposed by the mech-
anisms layer is difficult to program on. The R/M interface translates the API
so that developers have a small, intuitive API to work with. It derives the
new API as follows:
• by reducing the abstractions exposed. Only a single abstraction for up-
date flows is exposed—subscriptions. Designers only take care of body
and invalidation subscriptions. They do not need to worry about streams
or multiple subscriptions on a single stream.
• by hiding triggers or actions that are excessive, redundant, or do not
fit in the PADS model. For example, we hide triggers pertaining to
streams because we do not expose streams to designers. We also hide
the actions and triggers that get information about another node’s state
(i.e. requestSync action and recvSyncReply triggers), because in PADS,
this functionality is part of meta-data propagation and should be imple-
mented by the routing rules rather than the underlying layer. Triggers
that are related to consistency of local objects (i.e. informBecamePrecise
and informBecameImprecise) are hidden from the routing policy layer,
because PADS uses this information for blocking policy. Also, since the
PADS model cleanly separates data propagation and meta-data propaga-
tion paths, the primitives for binding bodies together with invalidations
for propagation are hidden (i.e. no bound writes and no unbind action).
• by combining similar triggers and in some cases, making them more
general. For example, we combine the read miss triggers (informLocal-
ReadImprecise and informLocalReadInvalid) to a single operationBlock
trigger which not only informs the routing policy of read misses but also
of write blocks. We combine the informReceivePushBody and informRe-
ceiveDemandReply triggers into a single sendBodySuccess trigger.
Even though the API exposed to the routing policy is much smaller
than that provided by the mechanisms layer, we find that the routing API
provides sufficient control to policy writers to specify system designs.
Preventing deadlocks. The R/M interface adds a level of indirection in
the threading model to prevent deadlocks. Without the interface, a thread
that is carrying out some important task in the mechanisms layer could get
transferred to the policy layer via a trigger and then come back down to
the mechanisms layer via an action without having finished the task it was
originally performing. This situation can cause blocking or deadlocks. The
R/M interface prevents deadlocks by cleanly separating policy threads from
mechanisms threads. Whenever a mechanisms thread needs to send a trigger
to policy, it puts a trigger object in an R/M interface queue and returns
to its processing. A separate R/M thread retrieves the trigger in the queue
and calls into the policy layer. This separation allows mechanisms threads to
asynchronously send triggers to the policy. Policy actions are also passed to
the mechanisms in a similar fashion. Figure 7.2 depicts the internal queues of
the R/M interface.
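The hand-off can be sketched with a standard producer-consumer queue, as below; the PADS implementation differs in detail, and the Trigger and Policy types are stand-ins.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the trigger hand-off between mechanisms threads and a dedicated R/M thread.
class TriggerHandoff {
    interface Trigger { }
    interface Policy { void handle(Trigger t); }

    private final BlockingQueue<Trigger> triggerQueue = new LinkedBlockingQueue<>();

    // Called by a mechanisms thread: enqueue the trigger and return immediately.
    void post(Trigger t) {
        triggerQueue.add(t);
    }

    // Run by a separate R/M thread: drain the queue and call into the policy layer.
    void deliverLoop(Policy policy) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            policy.handle(triggerQueue.take());
        }
    }
}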
Implementing common tasks. The R/M interface implements functionality for common tasks. For example, if there is a problem establishing a subscription, the R/M interface automatically retries the subscription twice, waiting 3 seconds between attempts, before reporting failure. The number of retries and the time between retries can be specified by the policy writer.
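A sketch of this retry behaviour might look as follows; the helper is ours, and the real interface exposes the retry count and delay as configuration.

// Illustrative retry helper: retry a failed subscription a fixed number of times.
class SubscriptionRetry {
    interface Attempt {
        boolean tryOnce();  // true if the subscription was established
    }

    static boolean establishWithRetry(Attempt attempt, int retries, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i <= retries; i++) {
            if (attempt.tryOnce()) return true;
            if (i < retries) Thread.sleep(delayMillis);
        }
        return false;  // report failure to the policy after exhausting retries
    }
}

With the defaults described above, establishWithRetry(attempt, 2, 3000) makes the initial attempt plus two retries spaced three seconds apart.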
7.2.2 R/OverLog Runtime
!"#$%&$'(
)*"+,(
-"+"#.,%#(
)*"+,(/0"0"(
1.+&2"#3(4.5(
)6"'07%+(
)+8$+"(
9%::0+$'.7%+(
4%&02"(
4.#3;.22"#3(
<0=3'#$="(
4%&02"(
<0=3'#$="#3(
>.,.(4%&02"(
?.=2"3(
Figure 7.3: Major components of the R/OverLog Runtime.The solid arrows represent event or table update flows and the dashed arrows
represent handler or table lookups.
The R/OverLog runtime is a Java-based engine to run translated R/OverLog
policies. As Figure 7.3 depicts, major parts of the runtime include:
• Event Queue: keeps events that have been generated or received but not
yet been handled.
• Periodic Event Generator: generates periodic events.
• Handler Map: keeps track of the handlers associated with every event.
A handler contains the logic of a rule that is fired by the event.
• Communication Module: responsible for sending and receiving remote events. The communication module also maintains a list of marshallers for serializing and deserializing events. When a remote event is received, it is put into the event queue.
• Data Module: responsible for storing the R/OverLog state as tables.
Every table has interfaces for inserting tuples, removing tuples, and car-
rying out simple aggregation.
• Subscribe Module: allows an external source to listen for a particular
event type by registering a subscriber for it. Whenever an event of the
registered type is generated, the associated subscriber is invoked.
R/OverLog to Java Translation. The translation from R/OverLog to Java is done using the XTC toolkit [34, 41]—a framework for building extensible source-to-source translators and compilers. The translation is carried out as follows:
• For every event or tuple, the translator generates a tuple class that con-
tains fields and marshalling code.
• For every rule, the translator generates a handler class that implements
the logic of the rule (i.e. table lookups, assignments, and constraints) to
generate either another event or a table update.
• For a single R/OverLog program, the translator generates a main OLG
class that initializes the tables, registers periodic events with the Periodic
Event Generator, registers print subscribers for watch statements, and
for every event, registers the appropriate handlers and marshallers with
the Handler Map and Communication Module respectively.
The execution of the program can be started by invoking the start
method in the generated main OLG class.
Program execution. The main execution thread dequeues an event from the event queue, looks up the handler table, and executes all the handlers associated with that event. Any new events generated as a result of the execution are put back in the event queue. The event queue checks whether events are local or remote. Local events are put at the back of the queue, whereas remote events are handed over to the communication module for transfer. Any table updates are handed over to the data module.
Fixed point semantics. The semantics guarantees that all rules triggered
by the appearance of the same event are executed atomically in isolation from
one another. Once all such rules are executed, their table updates are applied,
their actions are invoked, and the events they produce are enqueued for future
execution.
In order to support fixed point semantics, the event queue is implemented internally as two queues: an external queue that holds events belonging to different fixed points and an internal queue that holds events corresponding to the current fixed point. Events received from remote nodes and periodic events are put in the external queue.
The execution thread picks an event from the internal queue, unless it is empty, and invokes the handlers associated with the event. All new local events generated from the handling of that event are put in the internal queue. Table updates and remote events are buffered. When the internal queue becomes empty (i.e. the end of the fixed point is reached), table updates are committed and the buffered remote events are sent. The next time the execution thread needs an event, it picks one from the external queue, starting a new fixed point.
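The two-queue execution described above can be sketched as follows; the Event and Handler types, and the buffering of table updates, are simplified placeholders for the runtime's internals.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of one fixed point: drain the internal queue, then release buffered effects.
class FixedPointLoop {
    interface Event { boolean isLocal(); }
    interface Handler { List<Event> handle(Event e); }  // returns newly generated events

    private final Deque<Event> externalQueue = new ArrayDeque<>();
    private final Deque<Event> internalQueue = new ArrayDeque<>();

    void runOneFixedPoint(Handler handlers) {
        if (externalQueue.isEmpty()) return;
        internalQueue.add(externalQueue.poll());          // start a new fixed point
        List<Event> bufferedRemote = new ArrayList<>();   // held until the fixed point ends
        while (!internalQueue.isEmpty()) {
            Event e = internalQueue.poll();
            for (Event produced : handlers.handle(e)) {
                if (produced.isLocal()) internalQueue.add(produced);
                else bufferedRemote.add(produced);
            }
        }
        commitTableUpdates();                             // apply buffered table updates
        for (Event e : bufferedRemote) send(e);           // then send buffered remote events
    }

    private void commitTableUpdates() { /* placeholder */ }
    private void send(Event e)        { /* placeholder */ }
}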
7.2.3 BlockingPolicyModule (BPM)
!"#$%&'()(*
!"#$%
&'()"% *+,-.'(/012%
324#5-%
+,-#.'&/*0-,'#1*!-23,"*
Figure 7.4: Blocking Policy Module intercepting operations.
The blocking policy API consists of 5 blocking points that correspond
to access points and a list of conditions that can be specified for each ac-
cess point. On the other hand, the mechanisms API only provides blocking
flags for reads (equivalent to the readNowBlock) and an order error bound for writes (equivalent to the writeBeforeBlock). The gap between the two layers is bridged by the BlockingPolicyModule—a wrapper layer, as depicted in Figure 7.4, that intercepts reads and writes before and after access to local state, and received invalidations before they are applied locally. Each interception point corresponds to a blocking point.
Flags at Blocking Points
isValid
isComplete
isSequenced
maxStaleness

Table 7.4: Flags provided by the BlockingPolicyModule.
public class MyCausalBPM extends BlockingPolicyModule {
    readNowBlock.isValid = true;
    readNowBlock.isComplete = true;
    applyUpdateBlock.isValid = true;
}

Figure 7.5: Blocking policy implementation for causal consistency.
The built-in blocking conditions are provided as flags (refer to Table
7.4). By default, the flags are set to false so no operation is blocked. Most
flags are set to either true or false values. For the maxStaleness condition, the
designer provides a list of nodes, a count and a value for the maximum staleness
allowed. For example, as illustrated by Figure 7.5, the blocking policy for causal consistency can be implemented by setting the flags such that only valid and complete objects are read and the application of invalidations is delayed until the corresponding body has arrived.
If an operation blocks due to built-in conditions, the BlockingPolicyModule informs the routing policy of the block via operationBlocked triggers, one for each condition that is not satisfied. When all the conditions are satisfied, the operation automatically unblocks.
For custom conditions, the BlockingPolicyModule provides abstract
methods for each point that a designer can extend and methods to look up
consistency state for each object.
7.2.4 BlockingRoutingBridge (BRB)
Some systems implement custom blocking conditions based on routing conditions (i.e. the B ACT condition), which requires the blocking policy to send and receive messages from the routing policy. For this purpose, the BlockingPolicyModule provides a BlockingRoutingBridge that allows a blocking policy to send a trigger to the routing policy and receive events from it. Triggers are inserted in the routing runtime's event queue.
first needs to register the eventName for the routing event it wants to listen
to. When an eventName is registered, the bridge sets up a producer-consumer
mailbox for the received events that match that eventName. Whenever an
event that matches a registered eventName is generated in the routing run-
time, it is put in the mailbox. To receive a routing event, the blocking policy calls the waitFor method with the eventName and an eventSpec. The call causes the policy to wait until a matching event is received from the routing policy, unless one is already available in the mailbox.
For example, for the simple client-server system described in Section 9.3.1, the blocking policy specifies the B ACT writeComplete condition for the writeAfterBlock.
1:  public class SCSPolicy extends BlockingPolicyModule {
2:
3:    // built-in conditions
4:    readNowBlock.isValid = true;
5:    readNowBlock.isComplete = true;
6:    readNowBlock.isSequenced = true;
7:    applyUpdateBlock.isValid = true;
8:
9:    // initialization
10:   public SCSPolicy() {
11:     // register the events to listen to
12:     bridge.registerMsgToListen("writeComplete");
13:   }
14:
15:   // custom write after block
16:   public void writeAfterBlock(ObjId objId, long offset,
17:                               long length, AcceptStamp as) {
18:
19:     // inform routing policy of a block
20:     Object[] msgFields = {objId.toString(), offset, length,
21:                           "writeAfter", "custom"};
22:     bridge.sendMsgToRouting("operationBlocked", msgFields);
23:
24:
25:     // create the required event spec
26:     Object[] spec = {objId.toString()};
27:
28:     // wait for a writeComplete message for the object
29:     bridge.waitFor("writeComplete", spec);
30:   }
31: }
32:

Figure 7.6: Blocking policy implementation for the simple client-server system (refer to Section 9.3.1)
As Figure 7.6 illustrates, the implementation of this condition
involves four steps: First, the blocking policy registers the event it wants to
listen to (line 12). Second, the designer extends the appropriate method of
the BlockingPolicyModule (line 16). Third, in the method implementation,
the routing policy is informed that the operation is blocked (line 22). Finally,
the method waits until the required routing event is received (line 29).
In fact, the bridge allows designers to specify custom event matchers that determine how received events are matched against the provided event spec. Custom matchers are useful for implementing optimizations. For example, a client may have multiple outstanding writes, each waiting for an acknowledgement from the server to ensure that the server has received that particular write. However, because invalidation streams are causal, when a server acknowledges a particular write, it is safe to assume that the server has already received all previous writes from the client. When the server sends an ack for the last-received write, the default matcher only unblocks that particular write. By customizing the event matcher, all previous outstanding writes can be unblocked too.
7.3 Summary
There is a need to bridge the gap between the API expected by the policy layers and that provided by the mechanisms layer. This chapter provides details of the glue. In particular, it describes (1) the R/M Interface, which allows deadlock-free translation between the mechanisms and the routing API; (2) the R/OverLog runtime, which translates R/OverLog to Java and executes it; (3) the BlockingPolicyModule, which intercepts access to local state at the blocking points and exposes blocking conditions as flags for each point; and (4) the BlockingRoutingBridge, which handles communication between the routing and blocking policies for routing-based conditions.
Chapter 8
Evaluation Part 1: Microbenchmarks
Other than ease of development, PADS's usefulness also depends on the performance of systems built with it. The systems should not have to pay a high cost for PADS's generality. This chapter examines the overheads associated with PADS. Evaluation of systems built with PADS is discussed in Chapter 9. To aid our evaluation, we consider three criteria:

• Synchronization overhead: The overheads associated with storage and synchronization should be within a reasonable constant factor of ideal implementations, and the cost of generality should be minimal.

• Flexibility: The synchronization options should allow designers to pick the appropriate tradeoffs for their target workloads.

• Performance: The absolute performance of the research prototype should be reasonable enough for prototyping and for supporting moderate workloads.
In order to determine whether PADS meets the above criteria, we carry out our evaluation in three steps. First, we evaluate the theoretical network and storage overheads associated with PADS primitives. Second, we quantify the overheads and evaluate flexibility by running microbenchmarks on our prototype. Lastly, we examine the absolute read/write performance and R/OverLog communication performance.
8.1 Fundamental Overheads
We evaluate the fundamental network and storage costs associated with the PADS prototype by examining the costs of synchronization via subscriptions, of data and update log storage, and of conflict detection.
                                           Ideal                 PADS Prototype
Subscription Setup
  Inval Subscription with LOG catch-up     O(NSSPrevUpdates)     O(Nnodes + NSSPrevUpdates)
  Inval Subscription with CP from time=0   O(NSSObj)             O(NSSObj)
  Inval Subscription with CP from time=VV  O(NSSObjUpd)          O(Nnodes + NSSObjUpd)
  Body Subscription                        O(NSSObjUpd)          O(NSSObjUpd)
Transmitting Updates
  Inval Subscription                       O(NSSNewUpdates)      O(NSSNewUpdates)
  Body Subscription                        O(NSSNewUpdates)      O(NSSNewUpdates)

Table 8.1: Network overheads associated with subscription primitives. Here, Nnodes is the number of nodes, NSSObj is the number of objects in the subscription set, NSSPrevUpdates and NSSObjUpd are the number of updates that occurred and the number of objects in the subscription set that were modified from a subscription start time to the current logical time, and NSSNewUpdates is the number of updates to the subscription set that occur after the subscription has caught up to the sender's logical time.
Synchronization costs. Table 8.1 shows the network costs associated with our prototype's implementation of PADS's primitives and indicates that our costs are close to the ideal of having actual costs be proportional to the amount of new information transferred between nodes. Note that these ideal costs may not always be achievable in practice.
During invalidation subscription setup, the sender transmits a version
vector indicating the start time of the subscription and catch-up information
as either a log of past updates or a checkpoint of the current state. The start
time is simply a partial version vector that is proportional to the number of
nodes in the system. That cost is amortized over all the updates sent on the
connection.
For log catchup, in an ideal implementation, the sender would send in-
formation proportional to the number of updates that have occurred to objects
in the subscription set. In PADS, the sender does the same—it sends precise
invalidations for those updates. However, in order to support flexible consis-
tency, invalidation subscriptions also carry extra information such as imprecise
invalidations. Imprecise invalidations summarize updates to objects not part
of the subscription set and are sent to mark logical gaps in the causal stream
of invalidations. The number of imprecise invalidations sent depends on the
workload but is never more than the number of precise invalidations sent for
updates to objects in the subscription set. The size of imprecise invalidations depends
on the locality of the workload and how compactly the invalidations compress
into imprecise invalidations.
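As an aside, the compression step can be sketched in plain Java (an illustrative
sketch written for this discussion, not code from the prototype): updates outside
the subscription set are folded into a pending imprecise invalidation that is flushed
just before the next precise invalidation, so the stream never carries more imprecise
than precise invalidations.

    import java.util.*;

    /** Sketch: fold updates outside the subscription set into imprecise invalidations. */
    class InvalStreamSketch {
        record Update(String objId, long logicalTime) {}

        /** Builds the mixed stream of precise and imprecise invalidation records. */
        static List<String> buildStream(List<Update> causalUpdates, Set<String> subscriptionSet) {
            List<String> stream = new ArrayList<>();
            Set<String> pendingTargets = new HashSet<>();   // objects summarized so far
            long gapStart = -1, gapEnd = -1;                // logical-time span of the gap

            for (Update u : causalUpdates) {
                if (subscriptionSet.contains(u.objId())) {
                    if (!pendingTargets.isEmpty()) {        // close the causal gap first
                        stream.add("IMPRECISE " + pendingTargets + " [" + gapStart + "," + gapEnd + "]");
                        pendingTargets.clear();
                    }
                    stream.add("PRECISE " + u.objId() + " @" + u.logicalTime());
                } else {                                    // summarize instead of sending
                    if (pendingTargets.isEmpty()) gapStart = u.logicalTime();
                    gapEnd = u.logicalTime();
                    pendingTargets.add(u.objId());
                }
            }
            if (!pendingTargets.isEmpty())
                stream.add("IMPRECISE " + pendingTargets + " [" + gapStart + "," + gapEnd + "]");
            return stream;
        }
    }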
For checkpoint catchup, the sender sends meta-data for all the objects
that were updated after the start time. Hence the cost is proportional to the
number of objects that were modified plus the initial start time. Also, by
starting the subscription at logical time 0, the overhead for the start time can
be avoided. Note that checkpoint catch-up is particularly cheap when interest sets
are small.
Once the invalidation subscription is set up, the sender simply sends
precise invalidations of any new updates that occur to objects in the subscription
set. The precise invalidations are interspersed with imprecise invalidations.
But, as mentioned before, the number of imprecise invalidations is no more
than the number of precise invalidations.
PADS body subscriptions send the same information as an ideal imple-
mentation. The sender sends the bodies of objects in the subscription set that
were modified after the start time. Once the receiver has caught up, it sends
bodies of any new updates it receives.
Two things should be noted. First, the costs of the primitives are
proportional to the useful information sent, so they capture the idea that a
designer should be able to send just the right data to just the right places.
Second, there are only two ways that PADS sends extra information over
the ideal: the start-time vector during invalidation subscription set up and
the imprecise invalidations sent on a stream to maintain flexible consistency.
Overall, we expect PADS to scale well to systems with large numbers of objects
or nodes—the per-node overheads associated with the version vectors used to
set up subscriptions are amortized over all of the updates sent on the stream,
and subscription sets and imprecise invalidations ensure that the number of
records transferred is proportional to the amount of data of interest (and not to
the overall size of the database).
Storage costs. The minimal storage requirement is that a node stores the
objects it is interested in.
However, it is impossible for a system to provide any reasonable consis-
tency semantics if no bookkeeping information is maintained (even monotonic-
reads coherence requires some bookkeeping for accept stamps). Therefore,
in addition to data objects in the store, PADS maintains per-object meta-
information, consistency bookkeeping per-interest set, and an update log. For
every object, PADS maintains a logical time stamp, a real time stamp, a
deleted flag, a valid flag and a commit flag. PADS also stores a version vector
for every interest set in ISStatus. The number of interest sets is generally
fixed. However, in the worst case, a version vector is maintained per object.
The update log has a fixed maximum size and hence imposes a constant storage
overhead.
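A minimal sketch of this per-object and per-interest-set bookkeeping, with
illustrative field names rather than the prototype's actual classes, looks as follows:

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of PADS's local bookkeeping (field names are illustrative). */
    class PerObjectMeta {
        long logicalTimestamp;   // accept stamp of the last update applied
        long realTimestamp;      // wall-clock time of that update
        boolean deleted;         // object has been removed
        boolean valid;           // body matches the newest known invalidation
        boolean committed;       // update has been assigned its final position
    }

    class ISStatusSketch {
        // one version vector (nodeId -> logical time) per interest set
        final Map<String, Map<String, Long>> perInterestSetVV = new HashMap<>();
    }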
Conflict detection. For conflict detection, PADS utilizes the consistency
information already maintained and therefore incurs no extra overhead. How-
ever, for several other conflict-detection schemes, the amount of bookkeeping
information increases with network disruptions. For PADS, the bookkeep-
ing information remains the same because the number of interest sets a node
maintains is not affected by disruptions. Hence, for k interest sets, the storage
                        Version vectors   PVE         Vector sets   PADS (log sync)   PADS (cp sync)
  Storage lower bound   O(N × R)          O(N + R)    O(N + R)      O(N + k × R)      O(N + k × R)
  Storage upper bound   O(N × R)          unbounded   O(N × R)      O(N + k × R)      O(N + k × R)
  Network lower bound   O(p × R)          O(p + R)    O(p + R)      O(p + R)          O(p + R)
  Network upper bound   O(p × R)          unbounded   O(N × R)      O(p × R)          O(N × R)

Table 8.2: Storage and network overheads under network disruptions, for R nodes and N
objects stored on a node in k interest sets with p recent updates.
overhead is O(N + k × R). If invalidation subscriptions are disrupted, they
simply restart where they left off, incurring extra version vector overhead due
to the resending of the subscription start time. Hence, in the worst case, for log
synchronization, there is a version vector overhead per update sent and for
checkpoint synchronization, there is a version vector overhead per object sent.
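For illustration (the numbers are arbitrary), consider a node that stores N = 1,000,000
objects in k = 10 interest sets in a system of R = 100 nodes. Its consistency bookkeeping
is then

    N + k × R = 1,000,000 + 10 × 100 = 1,001,000 entries,

so the interest-set version vectors add roughly 0.1% to the per-object state and, unlike
schemes whose bookkeeping grows during disruptions, this term stays fixed while
subscriptions are disconnected.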
Given the amount of flexibility that PADS supports, conflict detection
costs are reasonable (see Table 8.2). In fact, the overheads of conflict detection
are comparable to existing state-of-the-art approaches that do not provide such
flexibility. We only compare the catchup phase because it is not clear whether
the schemes have an equivalent “connected” phase. PADS performs as well as
vector sets while providing stronger consistency guarantees.
[Bar chart omitted: total bandwidth (KB) for the Fine Random, Fine Seq, Coarse Random,
and Coarse Seq configurations, broken down into subscription setup, invalidations,
consistency overhead, and body data, with the ideal cost shown for comparison.]
Figure 8.1: Network bandwidth cost to synchronize 1000 10KB files, 100 of which are modified.
8.2 Quantifying the Constants
We investigate whether the actual performance of the prototype matches
the cost model by running experiments to quantify the overheads associated
with subscription setup, with flexible consistency support, and with the dif-
ferent options exposed by the primitives.
Synchronization cost under different workloads. Figure 8.1 illustrates
the synchronization cost for a simple scenario. In this experiment, there are
10,000 objects in the system organized into 10 groups of 1,000 objects each,
and each object’s size is 10KB. The reader registers to receive updates for one
of these groups by establishing subscriptions. Then, the writer updates 100 of
the objects in each group. Finally, the reader reads all the objects.
We look at four scenarios representing combinations of coarse-grained
vs. fine-grained synchronization and of writes with locality vs. random writes.
For coarse-grained synchronization, the reader creates a single invalidation
subscription and a single body subscription spanning all 1000 objects in the
group of interest and receives 100 updated objects. For fine-grained synchro-
nization, the reader creates 1000 invalidation subscriptions, one for each object,
and fetches each of the 100 updated bodies. For writes with locality, the writer
updates 100 objects in the i-th group before updating any in the (i+1)-th group.
For random writes, the writer intermixes writes to different groups in random
order.
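For concreteness, the two reader configurations can be sketched in Java as follows;
addInvalSub and addBodySub are hypothetical stand-ins for the subscription primitives
rather than the actual PADS API:

    /** Sketch of the coarse- and fine-grained reader setups used in this experiment. */
    class SyncGranularitySketch {
        interface Subscriptions {                       // assumed facade, not the real API
            void addInvalSub(String sender, String receiver, String subscriptionSet);
            void addBodySub(String sender, String receiver, String subscriptionSet);
        }

        /** Coarse-grained: one subscription pair covers the whole group of 1,000 objects. */
        static void coarseGrained(Subscriptions subs, String writer, String reader, int group) {
            subs.addInvalSub(writer, reader, "/group" + group + "/*");
            subs.addBodySub(writer, reader, "/group" + group + "/*");
        }

        /** Fine-grained: one invalidation subscription per object; bodies fetched on demand. */
        static void fineGrained(Subscriptions subs, String writer, String reader, int group) {
            for (int i = 0; i < 1000; i++) {
                subs.addInvalSub(writer, reader, "/group" + group + "/obj" + i);
            }
        }
    }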
Four things should be noted. First, the synchronization overheads are
small compared to the body data transferred. Second, the “extra” overheads
associated with PADS subscription setup and flexible consistency over the
best case are a small fraction of the total overhead in all cases. Third, when
writes have locality, the overhead of flexible consistency drops further because
larger numbers of invalidations are combined into an imprecise invalidation.
Fourth, coarse-grained synchronization has lower overhead than fine-grained
synchronization because it avoids per-object subscription setup costs.
Advantage of partial replication. Figure 8.2 demonstrates the bandwidth
savings achieved due to PADS’s support for partial replication. Node A up-
dates 500 objects of size 3KB and Node B establishes subscriptions to syn-
chronize these updates. Two scenarios are investigated: in one, subscriptions
are established with the subscription set covering 100% of the data, and in the
[Plot omitted: data transferred (KB) vs. number of writes for subscription sets covering
100% and 10% of the data, compared with the ideal.]
Figure 8.2: Bandwidth to synchronize 500 objects of size 3KB.
                              Coherence-only   PADS
  Bursty workload (1 in 10)   1                1.1
  Worst case (1 in 2)         1                2

Table 8.3: Messages per interested update sent by a coherence-only system and PADS.
other, the subscriptions only cover 10% of the data. The bandwidth required
by the subscriptions for synchronization is measured. The results depicted in
Figure 8.2 allow us to come to two conclusions. First, the results demonstrate
that the overhead for synchronization is relatively small even for small files
compared to an ideal implementation (plotted by counting the bytes of data
that must be sent ignoring all metadata overheads). More importantly, they
demonstrate that if a node requires only a fraction (e.g., 10%) of the data,
PADS keeps information that is not relevant to that subset to a minimum and
in the process greatly reduces the bandwidth required for synchronization.
Cost of flexible consistency. We evaluate the cost PADS pays to support
flexible consistency. Systems that require only weak consistency, such as coher-
ence, simply send updates without imprecise invalidations. Systems that require
strong consistency need to send imprecise invalidations to
ensure causal semantics over which stronger guarantees can be implemented.
In particular, we quantify the cost of sending imprecise invalidations in an
invalidation stream. Table 8.3 compares the number of messages per update
between a coherence-only system and PADS. In a coherence-only system, only
updates to objects in the synchronization set are sent on the stream. On the
other hand, PADS also sends imprecise invalidations for updates outside the
subscription set.
The number of imprecise invalidations sent depends on the workload.
For a bursty workload, say if 9 out of 10 updates occur to objects in the sub-
scription set, an imprecise invalidation is only sent after nine invalidations.
In the worst case workload, PADS sends an imprecise invalidation after every
precise invalidation. Thus, PADS sends at most twice the number of messages
when compared to a coherence-only system. However, since imprecise invalida-
tions are significantly smaller than actual bodies, the overhead remains within
reasonable bounds.
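The entries in Table 8.3 follow directly from this argument. With 9 of every 10
updates falling in the subscription set, the stream carries 9 precise plus at most 1
imprecise invalidation for every 9 interested updates,

    (9 + 1) / 9 ≈ 1.1 messages per interested update,

whereas in the worst case of alternating in-set and out-of-set updates it carries
(1 + 1) / 1 = 2.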
Advantage of multiplexing subscriptions. Efficient support for multiple
dynamic subscriptions is important because they are used to implement de-
mand caching with per-object callbacks [49]. For example, each time a client
[Plot omitted: synchronization bandwidth (bytes) vs. number of files, with multiplexing,
without multiplexing, and for the ideal.]
Figure 8.3: Bandwidth to subscribe to varying numbers of single-object interest sets with
and without subscription multiplexing.
caches a new object, it creates a new subscription for that object.
We compare the efficiency of establishing multiple dynamic subscrip-
tions with and without subscription multiplexing. Figure 8.3 depicts the
bandwidth costs for establishing single-object invalidation subscriptions for
the ideal case, without subscription multiplexing, and with subscription mul-
tiplexing (i.e. PADS). In the ideal case, only the object is sent for every sub-
scription request. Without subscription multiplexing, a separate invalidation
stream is established for each request. In addition to subscription start and end
messages, imprecise invalidations are sent on each stream. PADS multiplexes
subscription requests on a single stream—only the catchup start, the object,
and catchup end messages are sent for each request. The major cost savings
comes from the reduction of redundant invalidation information received by
a node. Since the prototype is implemented in Java, the inefficiency of Java
serialization does affect the bandwidth cost. Even so, the bandwidth achieved
with subscription multiplexing comes much closer to the ideal than the band-
width achieved without multiplexing.
[Plot omitted: subscription bandwidth (bytes) vs. number of updates to 500 objects, for
log catchup and checkpoint (CP) catchup.]
Figure 8.4: Bandwidth to subscribe to varying numbers of updates to a set of 500 objects
for checkpoint and log synchronization.
Log vs. checkpoint catchup. Figure 8.4 compares the bandwidth cost
for log and checkpoint synchronization. A set of 500 objects was updated
uniformly, and invalidation subscriptions were established separately for each
object. As the figure illustrates, the synchronization cost for both options is
proportional to the number of updates when each object is not updated more
than once. Checkpoint synchronization does worse because the size of the
meta-data sent in a checkpoint is slightly larger than an imprecise invalidation
sent for log synchronization. However, when an object is updated multiple
times, checkpoint catchup outperforms log synchronization.
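A back-of-the-envelope comparison makes the crossover point clear (the symbols here
are ours, not the prototype's). If U updates touch D distinct objects in the subscription
set, log catch-up sends roughly U × s_inv bytes while checkpoint catch-up sends roughly
D × s_cp bytes, where s_cp, the per-object meta-data size, is slightly larger than s_inv,
the per-invalidation size. When U ≤ D (each object written at most once), the log is
slightly cheaper; once U exceeds D × (s_cp / s_inv), which happens as soon as objects
are rewritten a few times, the checkpoint becomes cheaper.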
                      Write (sync)   Write (async)   Read (cold)   Read (warm)
  ext3                6.64           0.02            0.04          0.02
  PADS Object Store   8.47           1.27            0.25          0.16

Table 8.4: Read/write performance for 1KB objects/files in milliseconds.
                      Write (sync)   Write (async)   Read (cold)   Read (warm)
  ext3                19.08          0.13            0.20          0.19
  PADS Object Store   52.43          43.08           0.90          0.35

Table 8.5: Read/write performance for 100KB objects/files in milliseconds.
8.3 Absolute Performance
This section examines the absolute performance of the PADS proto-
type. Our goal is to provide sufficient performance for the system to be useful,
but we expect to pay some overheads relative to a local file system for three
reasons. First, PADS is a relatively untuned prototype rather than well-tuned
production code. Second, our implementation emphasizes portability and sim-
plicity, so PADS is written in Java and stores data using BerkeleyDB rather
than running on bare metal. Third, PADS provides additional functionality
such as tracking consistency metadata not required by a local file system.
Local performance. Tables 8.4 and 8.5 summarize the performance for
reading and writing 1KB and 100KB objects stored locally in PADS and com-
pare it to the performance for reading or writing a file on the local ext3 file
system. In each run, we read/write 100 randomly selected objects/files from a
collection of 10,000 objects/files. We measure the performance of synchronous
writes (i.e. the write returns after the update is committed to disk), asynchronous
writes (i.e. the write returns after the update is committed to memory), cold
reads (i.e. the data needs to be fetched from disk), and warm reads (i.e. the
data is already located in memory). The values reported are averages of 5
runs.
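The measurement loop itself is straightforward; the sketch below conveys its flavor
against a hypothetical store interface and is not the actual benchmark code:

    import java.util.Random;

    /** Sketch of the timing loop behind Tables 8.4 and 8.5. */
    class LocalPerfSketch {
        interface Store {
            void write(String objId, byte[] data, boolean sync);  // sync => durable before return
            byte[] read(String objId);
        }

        /** Average latency (ms) of 100 writes to randomly chosen objects. */
        static double avgWriteMs(Store store, int objectCount, int objSize, boolean sync) {
            Random rnd = new Random(42);
            byte[] payload = new byte[objSize];
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                store.write("/obj" + rnd.nextInt(objectCount), payload, sync);
            }
            return (System.nanoTime() - start) / 100.0 / 1e6;
        }
    }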
Overheads are significant. Upon further investigation, we find that the bot-
tleneck lies in the mechanism layer's object store and consistency bookkeeping im-
plementation. When an update occurs, the disk may be accessed 4 times:
to store the invalidation in the log, to store the body in the object store, to
update the object meta-data, and to update the byte-range meta-data. Also,
allowing writes to any object byte-range introduces a fair amount of com-
plexity: per-range meta-data and the bodies stored in the database must be
cleaned up when a new write overlaps multiple existing writes. We have
strong evidence that by optimizing object and meta-data storage,
by restricting writes to fixed-size chunks, and by implementing a customized
disk sync policy, performance can be greatly improved. In the meantime,
the current prototype still provides sufficient performance for a wide range of
systems.
R/OverLog performance. Execution of routing rules has a significant im-
pact on the performance of systems built on PADS. An earlier version of our
system used P2 [56] to execute the routing rules. Unfortunately, the P2 run-
                           P2 Runtime       R/OverLog Runtime
  Local Ping Latency       3.8 ms           0.322 ms
  Local Ping Throughput    232 requests/s   9,390 requests/s
  Remote Ping Latency      4.8 ms           1.616 ms
  Remote Ping Throughput   32 requests/s    2,079 requests/s

Table 8.6: Performance numbers for processing a NULL trigger to produce a NULL event.
time was not designed to receive external events. With a workaround, it was
possible to insert and receive events from P2, but with inferior performance.
As described in Chapter 7, we currently use a customized translator that con-
verts R/OverLog programs to Java and a R/OverLog runtime that natively
implements an interface to allow PADS to inject and receive events from the
rules. In addition, R/OverLog imposes a local-lookups-only limitation, i.e. a
rule can only look up local tables in its body. In OverLog, a single rule
can look up tables on multiple nodes. The P2 runtime establishes and main-
tains tuple streams between nodes in order to support that. However, by
imposing the local-lookups-only limitation, communication between nodes in
R/OverLog is reduced to explicit event transfers, making the runtime smaller
and more streamlined. As a result, we were able to achieve a significant per-
formance boost. Table 8.6 quantifies the performance difference between the
P2 and R/OverLog runtimes.
8.4 Summary
In this chapter, we evaluate the overheads associated with the generality
of PADS. We find the benefits of using PADS as a development platform
compelling. Architecturally, PADS is sound—its network and storage over-
heads are within small constant factors of ideal implementations. Also, PADS
provides sufficient flexibility to allow designers to pick the tradeoffs best suited
for their target workloads. Finally, although the PADS prototype is a user-
level Java implementation, its absolute performance is sufficient for prototyping
and moderately demanding workloads. However, this
drawback can be seen as an artifact of the prototype implementation rather
than that of the approach.
Chapter 9
Evaluation 2: Case-studies
This chapter evaluates PADS’s approach for building new distributed
systems. The goal of the evaluation is to demonstrate that the primitives are
sufficient to implement a broad range of systems, that the policy architecture
is easy to use, and that the systems developed with it are real and can be
easily adapted to new requirements. One way to evaluate PADS would be to
construct a new system for a new demanding environment and report on that
experience. We choose a different approach—constructing a broad range of
existing systems—for three reasons. First, a single system may not cover all
of the design choices or test the limits of PADS. Second, it might not be clear
how to generalize the experience from building one system to building others.
Third, it might be difficult to disentangle the challenges of designing a new
system for a new environment from the challenges of realizing a given design
using PADS.
This chapter provides details of our experience. First, it defines the
evaluation criteria for a good framework from a developer’s point of view.
Then, it defines the notion of architectural equivalence that is used as a yardstick
to compare our implementation of a system with its original description. It
then details the implementation of each case-study system. Finally, it describes
the experiments carried out to evaluate the performance, realism, or the agility
of the systems developed.
9.1 Evaluation Overview
The goal of PADS is to make it easier to develop new systems. Our
evaluation aims to answer the following questions from a system developer’s
point of view:
• Is PADS flexible enough to support my system? We build a wide range of
systems inspired by the literature, including client-server systems, server-
replication systems, and ad-hoc peer-to-peer systems. These systems
were chosen because they embody a wide range of replication techniques
that are often used by new systems. PADS’s ability to support this wide
range of systems suggests that PADS is sufficiently powerful to define
systems spanning a major portion of the design space.
• Is development easier on PADS? The fact that a small team could quickly
build a dozen significant replication systems is evidence that PADS
greatly reduces development time and effort. Developers
only need to focus on implementing and debugging control plane policy
instead of implementing data plane mechanisms such as object storage,
consistency management, and communication. In addition, each system
built on PADS requires relatively few lines of code – typically several
dozens of routing rules and a handful of blocking conditions. In our
experience, it is easier to understand and debug a system that has a
hundred lines rather than thousands of lines of code. There is, how-
ever, an initial learning curve: one must learn the routing language and
the API and adapt to the blocking and routing approach. That said, students
in a graduate operating systems class have used PADS as a development
platform for their projects
without much trouble after the initial learning curve.
• Does PADS facilitate evolving system design to incorporate new features
or meet new demands? We added significant new features to several
systems we built that allowed order-of-magnitude performance improve-
ments over the baseline design. The features were added by changing
fewer than 10 routing rules.
• Can PADS be used to build real systems? When building a concrete
distributed storage system, a system designer often needs to deal with
practical issues such as setting up configuration options, handling crashes
and recovery, maintaining consistency and durability during periods of
crashes, and dynamic addition and removal of nodes. PADS makes it
easy to implement the above details: First, the stored events primi-
tive allows routing policies to dynamically store and retrieve configura-
tion options from data objects. Second, during recovery, the underlying
mechanisms take care of complex details such as state recovery so that
policy writers only need to re-establish subscriptions. Third, blocking
predicates make it easy to reason about consistency during crashes be-
cause they prevent any “unsafe” data from being accessed. All systems
we built are concrete in the sense that they handle the above issues.
To demonstrate the “concreteness” of the systems, we examine the be-
haviour of some of the systems under adverse network conditions (refer
to Section 9.4).
• What are the overheads? We evaluate the overheads associated with the
generality of PADS. As described in Chapter 8, architecturally PADS
seems sound—its network and storage overheads are within small con-
stant factors of hand-tuned systems. In terms of absolute performance,
since the PADS prototype is a user-level Java implementation, a system
built on PADS is within ten to fifty percent of the original system in
most cases and 3.3 times worse in the worst case we measured.
9.2 Architectural Equivalence
We build systems based on system designs from the literature, but
constructing perfect, “bug-compatible” duplicates of the original systems using
PADS is not a realistic (or useful) goal. On the other hand, if we were free
to pick and choose arbitrary subsets of features to exclude, then the bar for
evaluating PADS would be too low: we could claim to have built any system by simply
excluding any features PADS has difficulty supporting.
Section 3 identifies three aspects of system design—security, interface,
and conflict resolution—for which PADS provides limited support, and our
implementations of the above systems do not attempt to mimic the original
designs in these dimensions.
Beyond that, we have attempted to faithfully implement the designs
in the papers cited. More precisely, although our implementations certainly
differ in some details, we believe we have built systems that are architecturally
equivalent to the original designs. We define architectural equivalence in terms
of three properties:
E1. Equivalent overhead. A system’s network bandwidth between any pair
of nodes and its local storage at any node are within a small constant
factor of the target system.
E2. Equivalent consistency. The system provides consistency and staleness
properties that are at least as strong as the target system’s.
E3. Equivalent local data. The set of data that may be accessed from the sys-
tem’s local state without network communication is a superset of the set
of data that may be accessed from the target system’s local state. Notice
that this property addresses several factors including latency, availability,
and durability.
There is a principled reason for believing that these properties capture some-
thing about the essence of a replication system: they highlight how a system
resolves the fundamental CAP (Consistency vs. Availability vs. Partition-
resilience) [30] and PC (Performance vs. Consistency) [55] tradeoffs that any
distributed storage system must make. More specifically, omitting any of these
properties could allow a system to significantly cut corners. For example, one
can improve read performance by increasing network and storage resource con-
sumption to speculatively replicate more data to each node. Similarly, one can
improve the availability a system offers for a given level of consistency by using
more network bandwidth to synchronize more often [94], or one can reduce the
resources consumed by reducing the amount of data cached at a node.
9.3 Case-studies
In this section, we evaluate PADS’s flexibility and ease of use as a
development platform by constructing 8 distributed storage systems inspired
from literature. The constructed systems include
• client-server systems such as Simple Client-Server (SCS), Full Client Server (FCS),
Coda [49], and TRIP [64],
• server replication systems such as Bayou [69], and Chain Replication [88],
and
• object-replication systems such as TierStore [26] and Pangaea [77].
These systems were chosen because they cover a wide range of design
features in a number of key dimensions. For example,
• Replication: full replication (Bayou, Chain Replication, and TRIP),
partial replication (Coda, Pangaea, FCS, and TierStore), and demand
caching (Coda, Pangaea, and FCS),
• Topology: structured topologies such as client-server (Coda, FCS, and
TRIP), hierarchical (TierStore), and chain (Chain Replication); unstruc-
tured topologies (Bayou and Pangaea). Invalidation-based (Coda and
FCS) and update-based (Bayou, TierStore, and TRIP) propagation.
• Consistency: monotonic-reads coherence (Pangaea and TierStore), causal
(Bayou), sequential (FCS and TRIP), and linearizability (Chain
Replication); techniques such as callbacks (Coda, FCS, and TRIP) and
leases (Coda and FCS).
• Availability: Disconnected operation (Bayou, Coda, TierStore, and TRIP),
crash recovery (all), and network reconnection (all).
Table 1.1 on Page 5 summarizes the features for each system and the following
subsections provide details of the implementations. In order to differentiate
the PADS implementations from the original implementations, we add a prefix
“P-” to the PADS implementations.
9.3.1 Simple Client Server (SCS)
This simple system includes support for a client-server architecture,
invalidation callbacks [42], sequential consistency [53], correctness in the face
of crash/recovery of any node, and configuration. For simplicity, it assumes
that a write overwrites an entire file.
We choose this example not because it is inherently interesting but be-
cause it is simple yet sufficient to illustrate the main aspects of PADS, including
support for coarse- and fine-grained synchronization, consistency, durability,
and configuration. In Section 9.3.2, we extend the example with features
that are relevant for practical deployments, including leases [32], cooperative
caching [23], and support for partial-file writes.
9.3.1.1 System Overview
The server stores the full set of data. Clients cache a subset of data
locally. On a read miss, a client fetches the missing object from the server
and registers for a callback on the fetched object. On a write, a client sends the
update to the server and waits for an acknowledgement from the server. Upon
receiving an update of an object, a server sends invalidations to all clients
registered to receive callbacks for that object. Upon receiving an invalidation
of an object, a client sends an acknowledgment to the server and cancels the
callback. Once the server has received all the acknowledgements for an update
from other clients, the server informs the original writer so it can continue.
For durability reasons, a write by a client is not seen by any other client until
the server has persistently stored the update.
Since multiple clients can issue concurrent writes to multiple objects, for
sequential consistency, the system defines a global sequential order on those
writes and ensures that they are observed in that order. In a client-server
system, it is natural to have the server set that total order. Therefore, upon
receiving acknowledgements for all of an update’s invalidations, the server
assigns the update a position in the global total order. The system guarantees
sequential consistency by ensuring that a write of an object is blocked until
all earlier versions have been invalidated, a read of an object is blocked until
the reader holds a valid, consistent copy of the object, and a read or write of
an object is blocked until the client is guaranteed to observe the effects of all
earlier updates in the sequence of updates defined by the server.
9.3.1.2 P-SCS Implementation
The implementation of P-SCS requires 24 routing rules and 5 blocking
conditions. In this section, we provide a brief overview of the implementation.
Details of the implementation are described in Appendix A.1. We include rule
counts in our descriptions to emphasize how concisely the design features can
be specified.
System configuration. Every client needs to know the address of the server
to contact. At startup, a node looks up a configuration file using the stored
events interface and stores the identifier in an R/OverLog table (2 rules: ini1,
ini2).
Sending updates to server. As soon as a client receives the server address,
it establishes invalidation and body subscriptions to the server so that local
updates (2 rules: csSb1, csSb2) can be automatically transferred to the server.
In order to deal with failures, the client tries to reestablish the subscriptions
when they fail (2 rules: csSb3, csSb4).
Read misses and callbacks. Whenever a client suffers a read miss, the
routing policy is informed by an operationBlock trigger for the readNowBlock.
When that happens, the client informs the server and stores outstanding misses
in a local table (2 rules: rm1, rm2). When the server is informed of a client
read miss, it sends the body to the client and establishes a callback. In PADS,
a callback is simply an invalidation subscription from the server to the client.
The server also puts an entry into a table to keep track of the clients that have
callbacks for objects (3 rules: cb1, cb2, cb3).
Client writes and ack management. Whenever a client updates an ob-
ject, the invalidation and the body propagate to the server via the subscriptions
established at startup. An invalidation of the update automatically propagates
to other clients that hold callbacks via the established invalidation subscrip-
tions. When a client receives an invalidation, it will check if the received
invalidation corresponds to an outstanding read miss (1 rule: rinv0). If not,
it breaks the callback (by removing the invalidation subscription) and sends
an ack to the server (2 rules: rinv1, rinv2). Otherwise, the client will con-
sider the miss satisfied and delete the entry from its read miss table (1 rule:
rinv3). The server gathers acks from the clients. Once all the required acks
have been received, the server assigns a sequence number to the write and
informs the original writer. The server-side ack management takes 8 rules in
total: ack1-ack8.
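The server-side logic behind rules ack1-ack8 can be summarized by the following Java
sketch; it is illustrative only, since the real policy is expressed as R/OverLog rules and
the names here are invented for this discussion:

    import java.util.*;

    /** Sketch of the server's ack gathering and sequencing for P-SCS. */
    class AckManagerSketch {
        static class PendingWrite {
            final String writer;
            final Set<String> awaitingAcks;              // clients holding callbacks on this object
            PendingWrite(String writer, Set<String> holders) {
                this.writer = writer;
                this.awaitingAcks = new HashSet<>(holders);
            }
        }

        private final Map<String, PendingWrite> pending = new HashMap<>();  // updateId -> state
        private long nextSeqNo = 0;

        /** Called when the server receives a new update from a client. */
        void onWrite(String updateId, String writer, Set<String> callbackHolders) {
            if (callbackHolders.isEmpty()) {
                notifyWriter(writer, updateId, nextSeqNo++);          // nothing to invalidate
            } else {
                pending.put(updateId, new PendingWrite(writer, callbackHolders));
            }
        }

        /** Called when an invalidation ack arrives from a client. */
        void onAck(String updateId, String fromClient) {
            PendingWrite w = pending.get(updateId);
            if (w == null) return;
            w.awaitingAcks.remove(fromClient);
            if (w.awaitingAcks.isEmpty()) {
                pending.remove(updateId);
                notifyWriter(w.writer, updateId, nextSeqNo++);        // assign global position
            }
        }

        void notifyWriter(String writer, String updateId, long seqNo) {
            // send the message that unblocks the writer's writeAfterBlock
        }
    }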
Enforcing durability and consistency. The blocking policy helps enforce
durability and consistency guarantees by specifying 5 conditions.
In order to ensure durability, the isValid condition is specified for the
applyUpdateBlock at the server and at clients. Since a received invalidation is
not applied until the corresponding body is received, it is guaranteed that no
update can be seen by any other client until the server has stored it.
Consistency is guaranteed both by routing rules and access blocks. A
routing rule (ack7) ensures that an update is assigned a sequence number only
after all other copies of the data have been invalidated. On the client side, the
readNowBlock is set to 3 conditions: isValid and isComplete and isSequenced.
These conditions ensure that reads only return valid consistent copies that
have been sequenced by the server. The writeAfterBlock on the client is set
to B action writeComplete. In other words, the write only unblocks after it
receives a message from the server, which, due to the ack management routing
rules, is only sent to the writer when all other copies have been invalidated.
Even when multiple clients are updating the same object, the order observed
at any client is the same as the sequenced order observed at the server.
9.3.2 Full Client Server (FCS)
The full client server adds several features to the simple client server
implementation to make it more practical including volume leases [93], coop-
erative caching [23], partial-file writes, and blind writes. Complete details of
the implementation can be found in Appendix A.2.
To ensure liveness for all clients that can communicate with the server,
we use volume leases to expire callbacks from unreachable clients. Adding
volume leases requires an additional blocking condition (i.e. MaxStaleness) to
block client reads if the client’s view of the server’s state is too stale. The
routing implementation keeps the client’s view up-to-date by sending periodic
heartbeats via a volume lease object. It requires 3 routing rules to have clients
maintain subscriptions to the volume lease object and have the server put
heartbeats into that object, and 4 more to check for expired leases and to
allow a write to proceed once all leases expire. Note that by transporting
heartbeats via a PADS object, we ensure that a client observes a heartbeat
only after it has observed all causally preceding events, which greatly simplifies
reasoning about consistency.
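The client-side check that the MaxStaleness condition expresses can be sketched as
follows; the names and time units are illustrative:

    /** Sketch of the volume-lease freshness check on a P-FCS client. */
    class VolumeLeaseSketch {
        private volatile long lastHeartbeatMs;      // set when the volume-lease object is updated
        private final long leaseDurationMs;

        VolumeLeaseSketch(long leaseDurationMs) { this.leaseDurationMs = leaseDurationMs; }

        /** Invoked when a heartbeat arrives via the subscription to the lease object. */
        void onHeartbeat() { lastHeartbeatMs = System.currentTimeMillis(); }

        /** Reads may proceed only while the client's view of the server is fresh enough. */
        boolean readMayProceed() {
            return System.currentTimeMillis() - lastHeartbeatMs <= leaseDurationMs;
        }
    }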
We add cooperative caching by replacing the rule that sends a body
from the server with 5 rules: 3 rules (cc2, cc3, cc4) to find a helper and get
data from the helper, and 2 rules to fall back on the server if no helper is found
(cc5) or when the helper fails to satisfy the request (cc1). Note that reason-
ing about cache consistency remains easy because invalidation metadata still
follows client-server paths, and the blocking predicates ensure that a body is
not read until the corresponding invalidation has been processed. In contrast,
some previous implementations of cooperative caching found it challenging to
reason about consistency [18].
We add support for partial-file writes by removing one rule and adding six
rules (pfw1-pfw6) to track which blocks each client is caching and to cancel a
callback subscription for a file only when all blocks have been invalidated.
Finally, we add three rules (bw1-bw3) to the server that check for blind
writes when no callback is held and to establish callbacks for them.
In total, P-FCS requires 43 routing rules and 6 blocking conditions.
9.3.3 Coda
We implement P-Coda, a system inspired by the version of Coda de-
scribed by Kistler et al. [49]. P-Coda supports disconnected operation, reinte-
gration, crash recovery, whole-file caching, open/close consistency (when con-
nected), causal consistency (when disconnected), and hoarding. We know of
one feature from this version that we are missing: we do not support cache
replacement prioritization. In Coda, some files and directories can be given a
lower priority and will be discarded from cache before others. Coda is a long-
running project with many papers' worth of ideas. We omit features discussed
in other papers like server replication [78], trickle reintegration [60], and vari-
able granularity cache coherence [61]. We see no fundamental barriers to
adding them in P-Coda. We also illustrate the ease with which co-operative
caching can be added to P-Coda.
9.3.3.1 System Overview
P-Coda is a client-server system, similar to the simple client server
system (SCS) discussed earlier. The main differences between P-Coda and
P-SCS are detailed below.
First, P-Coda provides open-to-close semantics, which means that when
a file is opened at a client, the client will return a local valid copy or retrieve
the newest version from the server. Subsequent updates to the file are buffered
and are sent to the server when the file is closed.
Second, every client has a list of files, the “hoard set”, that it will
prefetch from the server and store in its local cache whenever it connects to
the server.
Third, P-Coda supports disconnected operation by weakening sequen-
tial consistency to causal consistency when a client is disconnected from the
server: allowing writes to continue uncommitted and reads to access uncom-
mitted but valid objects.
9.3.3.2 P-Coda Implementation
As detailed in Appendix A.3, we extend the routing rules of the simple
client-server model to implement P-Coda in 31 rules. First, like the full client-
server example, we add 3 rules to check for blind writes when no callback is
held and establish callbacks for them. Second, we add 4 rules (ss1-ss4) to
keep track of server status and two rules (li1 and li2) to let writes and reads
continue when disconnected from the server. Third, we implement hoarding in
2 rules (hd1-hd2) by storing the hoard set as tuples in a configuration file and
establishing invalidation and body subscriptions for each of them whenever
the client connects to the server. Fourth, in order to allow a client to quickly
get information about the updates it has missed, we add a rule (csSb6) so that
when a client reconnects to a server, it establishes an invalidation subscription
for an empty set. This action causes the server to send an imprecise invalidation
for the missed updates.
The blocking policy adds the B ACT isDisconnected condition to read-
NowBlock and writeAfterBlock so as to allow reads and writes to continue when
operating in disconnected mode.
In order to support open-to-close semantics, we implement a wrapper
layer over the underlying read/write interface. When a file is opened, a read
for the whole object is issued at the read interface. All writes are buffered by
the wrapper layer and are issued to the underlying write interface when the file
is closed. The blocking policy ensures that the close returns after all updates
are propagated to the server and all copies cached on other clients have been
invalidated.
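A minimal sketch of such a wrapper, assuming a hypothetical Store interface for the
underlying read/write calls, might look like this:

    import java.io.ByteArrayOutputStream;

    /** Sketch of P-Coda's open-to-close wrapper over the underlying read/write interface. */
    class OpenCloseWrapperSketch {
        interface Store {
            byte[] read(String objId);               // blocks per the readNowBlock conditions
            void write(String objId, byte[] data);   // blocks per the writeAfterBlock conditions
        }

        private final Store store;
        private final String objId;
        private final ByteArrayOutputStream buffered = new ByteArrayOutputStream();

        OpenCloseWrapperSketch(Store store, String objId) {
            this.store = store;
            this.objId = objId;
            buffered.writeBytes(store.read(objId));  // open: fetch a valid whole-file copy
        }

        void write(byte[] data) {                    // whole-file writes are buffered until close
            buffered.reset();
            buffered.writeBytes(data);
        }

        void close() {
            // a single underlying write; it returns only after the server has stored the
            // update and cached copies at other clients have been invalidated
            store.write(objId, buffered.toByteArray());
        }
    }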
P-Coda and cooperative caching. In P-Coda, on a read miss, a client is
restricted to retrieving data from the server. We add cooperative caching to
P-Coda by adding 8 rules: 5 to monitor the reachability of nearby nodes, 1
to retrieve data from nearby peers on a read miss, and 2 to get invalidations
from nearby clients if the server is not reachable. Cooperative caching has two
advantages: first, it can greatly reduce the read latency on a miss if the peer
is much closer than the server, and second, it allows disconnected clients to
access files they previously could not.
9.3.4 TRIP
TRIP [64] is a distributed storage system for large-scale information
dissemination: all updates occur at a server and all reads occur at clients.
TRIP uses a self-tuning prefetch algorithm to send higher priority updates
sooner and delays applying invalidations to a client’s locally cached data to
maximize the amount of data that a client can serve from its local state.
TRIP guarantees sequential consistency via a simple algorithm that exploits
the constraint that all writes are carried out by a single server.
9.3.4.1 P-TRIP Implementation
Information propagation in P-TRIP follows a star topology—every client
is connected to the server via invalidation and body subscriptions.
To support self-tuning, whenever the server issues a write, it assigns a
priority to the update. Since body subscriptions have no ordering constraints,
they send bodies of higher priority updates before lower-priority updates. Also,
the application of invalidations is delayed up to a threshold by setting the
applyUpdateBlock to (isValid or maxStaleness). Since all writes occur at the
server, sequential consistency can be guaranteed by setting the readNowBlock
to (isValid and isComplete). As detailed in Appendix A.4, the routing policy
consists of 6 rules in total and the blocking policy of 4 conditions.
P-TRIP and hierarchical topology. TRIP assumes a single server and
a star topology. We can improve scalability by changing the topology from
a star to a static tree by simply changing a node’s configuration file to list a
different node as its parent. PADS’s mechanisms are general enough so that
this “just works”—invalidations and bodies flow as intended and consistency is
still maintained. Better still, if one writes a topology policy that dynamically
reconfigures a tree when nodes become available or unavailable [56], a few
additional rules to establish or remove subscriptions suffice to produce a dynamic-
tree version of TRIP that still enforces sequential consistency. Note that we
have implemented the static tree policy (refer to Appendix A.4) but not the
dynamic tree policy.
9.3.5 Bayou
Bayou [69] is a server-replication protocol that focuses on peer-to-peer
data sharing. Every node has a local copy of all of the system’s data. From
time to time, a node picks a peer to exchange updates with via anti-entropy
sessions. Strictly speaking, Bayou transfers updates as operations; since
PADS does not support operation transfer, we implement a state-transfer ver-
sion of Bayou. Bayou guarantees causal and eventual consistency.
9.3.5.1 P-Bayou Implementation
As detailed in Appendix A.5, an anti-entropy session in PADS is imple-
mented by setting up invalidation and body subscriptions between two nodes.
Once all the updates have been transferred, the subscriptions are removed.
Note that if the log at the sender has been truncated to a point beyond the re-
ceiver’s consistency state, the invalidation subscription will automatically send
a checkpoint rather than the log; Bayou’s approach is similar. The complete
implementation of P-Bayou’s routing policy requires 10 rules: 1 for picking a
random peer, 4 for carrying out anti-entropy and 5 for connection manage-
ment.
Causal consistency is guaranteed by setting the readNowBlock to (is-
Valid and isComplete). Also, the applyUpdateBlock is set to isValid to ensure
that no local data is invalid.
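In plain Java, one anti-entropy round might look like the sketch below; the Subscriptions
facade is a hypothetical stand-in for the PADS primitives, and the real policy is
event-driven rather than polling:

    import java.util.List;
    import java.util.Random;

    /** Sketch of one P-Bayou anti-entropy round. */
    class AntiEntropySketch {
        interface Subscriptions {
            void addInvalSub(String sender, String receiver, String set);
            void addBodySub(String sender, String receiver, String set);
            void removeSubs(String sender, String receiver);
            boolean caughtUp(String sender, String receiver);   // receiver has the sender's updates
        }

        static void antiEntropyRound(Subscriptions subs, String self, List<String> peers, Random rnd)
                throws InterruptedException {
            String peer = peers.get(rnd.nextInt(peers.size())); // pick a random reachable peer
            subs.addInvalSub(peer, self, "/*");                 // receive all invalidations ...
            subs.addBodySub(peer, self, "/*");                  // ... and all bodies
            while (!subs.caughtUp(peer, self)) {
                Thread.sleep(100);                              // wait until the transfer finishes
            }
            subs.removeSubs(peer, self);                        // tear the session down
        }
    }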
P-Bayou and small device support. Since the protocol propagates up-
dates for the whole data set to every node, P-Bayou cannot efficiently support
smaller devices that have limited storage or bandwidth.
It is easy to change P-Bayou to support small devices. In the original
P-Bayou design, when anti-entropy is triggered, a node connects to a reachable
peer and subscribes to receive invalidations and bodies for all objects using a
subscription set “/*”. In our small device variation, a node uses stored events
to read a list of directories from a per-node configuration file and subscribes
only for the listed sub-directories. This change required us to modify two
routing rules.
This change raises an issue for the designer. If a small device C syn-
chronizes with a first complete server S1, it will not receive updates to objects
outside of its subscription sets. These omissions will not affect C since C
will not access those objects. However, if C later synchronizes with a second
complete server S2, S2 may end up with causal gaps in its update logs due
to the missing updates that C doesn’t subscribe to. The designer has three
choices: to weaken consistency from causal to per-object coherence; to restrict
communication to avoid such situations (e.g., prevent C from synchronizing
with S2); or to weaken availability by forcing S2 to fill its gaps by talking
to another server before allowing local reads of potentially stale objects. We
choose the first, so we change the blocking predicate for reads to no longer
require the isComplete condition. Other designers may make different choices
depending on their environment and goals.
9.3.6 Chain Replication
Chain Replication [88] is a server replication protocol in which the
nodes are arranged as a chain to provide high availability and linearizability.
All updates are introduced at the head of the chain and queries are handled
by the tail. An update does not complete until all live nodes in the chain have
received it. Chain management is carried out by an always-available master.
P-Chain-Replication implements this protocol with support for vol-
umes, the addition of new nodes to the chain, and node failure and recovery.
9.3.6.1 P-Chain-Replication Implementation
P-ChainReplication implements each link in the chain as an invalidation
and a body subscription. When an update occurs at the head, the update
flows down the chain via subscriptions (Rules sub1 and sub2). The master
process is also implemented in R/OverLog and is assumed to never fail. We
see no difficulty in extending the master to a Paxos-based [54] implementation.
Appendix A.6 details the routing and blocking policies of P-Chain-Replication.
Volume support. P-Chain-Replication assumes that the prefix of every ob-
jectId indicates its volume. For example, the object /V1/a/b belongs to vol-
ume V1. In order to aid the specification of the routing policy, we extend
the R/OverLog runtime to implement a function f_getVol() that will return the
volume for a given object. We also assume that there is a volume list that
specifies the volumes that should be replicated on each node.
Enforcing linearizability. In order to support linearizability, an update
occurs only at the head and is blocked until it has propagated down all nodes
to the tail of the chain. Once the tail receives an update, it sends an ack
upstream. In the original chain replication system, every node keeps track
of the updates it has sent downstream for which it has not received an ack.
However, in P-ChainReplication, it is not necessary to carry out this complex
bookkeeping. Because updates propagate along the links in causal order, there
is no need to send an ack through each link. Instead, the tail simply sends an
ack to the head. The ack indicates that all nodes in the chain have received
that update and all causally preceding updates. Rules con1 to con6 provide
details of ack management.
The blocking policy is defined so that reads return only causal data
(i.e. readNowBlock is set to isValid and isComplete), and a write is blocked
until an ack is received (i.e. writeAfterBlock is set to B Action(AckFromTail)).
Also, the applyUpdateBlock is set to isValid to ensure that every node has the
body of an update before its invalidation is applied.
Chain management. As in the original system, chain management is car-
ried out by the master. The master keeps track of the reachability of all
nodes and maintains a table that stores the configuration of the volume chain.
When the master detects a new node or a node failure, it rearranges the chain
as described below.
Addition of new nodes. When the master detects a new node, it is added
to the tail of the chain as follows: The master sends messages to the new
node informing it of the current head and tail of the volume chain and its
predecessor. The node then establishes subscriptions to its predecessor. Once
the subscription has caught up (i.e. the node's state reflects that of the chain),
it can start functioning as a tail and notifies all other nodes. Rules nn1 to
nn15 handle new node addition.
Failure recovery. When a node fails, the master detects the failure and
rearranges the chain. If the head fails, the next node in the chain becomes
the head. If the tail fails, the node before it in the chain becomes the new tail.
If a node in the middle fails, then its predecessor and successor are linked by
establishing subscriptions. Note that if a node recovers, it is added to the end
of the chain, just like a new node.
In the original system, recovery from mid-chain failure involves a com-
plex algorithm to ensure that the node preceding the failure sends all updates
that the new successor is missing and that the successor sends all the missing
acks to the predecessor. However, in the PADS implementation, all this com-
plexity is eliminated. The preceding node only needs to establish subscriptions
and the underlying mechanisms will automatically send the missing updates
down the chain. Also, since the tail sends an ack directly to the head, hop-
by-hop ack management is not needed. Rules dn1 to dn7 provide details of
failure management.
9.3.7 TierStore
TierStore [26] is a distributed object storage system that targets devel-
oping regions where networks are bandwidth-constrained and unreliable. Each
node reads and writes specific subsets of data. Since nodes must often oper-
ate in disconnected mode, the system prioritizes 100% availability over strong
consistency.
9.3.7.1 System Overview
In order to achieve these goals, TierStore employs a hierarchical pub-
lish/subscribe system: all nodes are arranged in a tree. To propagate updates
up the tree, every node sends all of its updates and its children’s updates to
its parent. To flood data down the tree, data are partitioned into “publica-
tions” and every node subscribes to a set of publications from its parent node
covering its own interests and those of its children. For consistency, TierStore
only supports single-object monotonic reads coherence.
9.3.7.2 P-TierStore Implementation
A 14-rule routing policy establishes and maintains the publication ag-
gregation and multicast trees. A full listing of these rules is available in the
Appendix A.7. In terms of PADS primitives, each connection in the tree is
simply an invalidation subscription and a body subscription between a pair of
nodes. Every PADS node stores in configuration objects the ID of its parent
and the set of publications to subscribe to.
On start up, a node uses the stored events interface to read configura-
tion objects and store the configuration information in R/OverLog tables (4
rules: in0, pp0, pp1, pSb0). When it knows of the ID of its parent, it adds
subscriptions for every item in the publication set (2 rules: pSb1, pSb2). For
every child, it adds subscriptions for “/*” to receive all updates from the child
(2 rules: cSb1, cSb2). If an application decides to subscribe to another publi-
cation, it simply writes to the configuration object. When this update occurs,
a new stored event is generated and the routing rules add subscriptions for the
new publication.
Recovery. If an incoming or an outgoing subscription fails, the node pe-
riodically tries to re-establish the connection (2 rules: f1, f2). Crash recov-
ery requires no extra policy rules. When a node crashes and starts up, it
simply re-establishes the subscriptions using its local logical time as the sub-
scription’s start time. The underlying subscription mechanisms automatically
detect which updates the receiver is missing and send them.
Delay tolerant network (DTN) support. P-TierStore supports DTN
environments by allowing one or more mobile PADS nodes to relay information
between a parent and a child in a distribution tree. In this configuration,
whenever a relay node arrives, a node subscribes to receive any new updates
the relay node brings and pushes all new local updates for the parent or child
subscription to the relay node (4 rules: dtn1, dtn2, dtn3, dtn4). An alternative
approach would be to make use of existing DTN network protocols. This
approach is straightforward to implement if the DTN layer informs the policy
layer when it has an opportunity to send to another node and when that
opportunity ends. An opportunity could be that a TCP connection opens up
or a USB drive was inserted. The routing policy would establish subscriptions
to send updates within that connection opportunity as a DTN bundle.
Blocking policy. Blocking policy is simple because TierStore has weak con-
sistency requirements. Since TierStore prefers stale available data to unavail-
able data, we set the applyUpdateBlock to isValid to avoid applying an invali-
dation until the corresponding body is received.
TierStore vs. P-TierStore. Publications in TierStore are defined by a
container name and depth to include all objects up to that depth from the
root of the publication. However, since P-TierStore uses a name hierarchy to
define publications (e.g., /publication1/*), all objects under the directory tree
become part of the subscription with no limit on depth.
Also, PADS provides a single conflict-resolution mechanism, which dif-
fers from that of TierStore in some details. Similarly, TierStore provides native
support for directory objects, while PADS supports a simple untyped object
store interface.
9.3.8 Pangaea
Pangaea [77] is a peer-to-peer distributed storage system for wide area
networks that supports high degrees of replication and high availability. For
each object, Pangaea maintains a connected graph and updates to that object
are pushed along graph edges. Pangaea also maintains three gold replicas for
every object to ensure data durability. The location of gold replicas for each
object is stored in the object’s parent (i.e. the directory entry for the file).
For consistency, Pangaea guarantees only weak best-effort coherence.
9.3.8.1 P-Pangaea Implementation
P-Pangaea implements object creation, replica creation, update prop-
agation, 3 gold nodes and 3-connected graph maintenance, temporary failure
recovery, and permanent failure recovery. The graph edges in P-Pangaea are
simply invalidation and body subscriptions. We currently do not implement
harbingers. However, since harbingers map to invalidations, support for them
can be easily added by only using invalidation subscriptions as graph edges
and fetching the bodies from the fastest link as required. Also, we do not
implement the "red button" feature, which provides applications with confirmation
of update delivery or a list of unavailable replicas, but we do not see any difficulty
in integrating it. Appendix A.8 details the implementation of P-Pangaea.
Network management. The original system implements a gossip-based
membership module to keep track of node liveness and latency between
node pairs. P-Pangaea uses a simple ping-based membership protocol and
assumes that the latency between node pairs is static and is provided in a
configuration file.
Gold node information. In P-Pangaea, instead of mapping a file/directory
to a single object, it is mapped to two: a .data object to store the data
and a .meta to store the gold node location of its children. For example,
the file “/a/b” actually maps to objects “/a/b/.data” and “/a/b/.meta” and
gold node locations of “/a/b” are stored in “/a/.meta”. In order to aid the
specification of routing rules, we extend the R/OverLog runtime with string
manipulation functions to extract the parent directory of a file from its ID
(f_getParent()), to get the fileId from the meta or data objectId (f_getObj()),
and vice versa (f_getData(), f_getMeta()). Also, a wrapper is implemented at
the local interface to translate all reads and writes to an object to its .data
object. Like the original system, in P-Pangaea, all read and write requests are
preceded by a directory lookup.
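Java equivalents of these helpers might look like the following sketch; the actual
functions live inside the R/OverLog runtime, so this is illustrative only:

    /** Sketch of the object-naming helpers used by P-Pangaea's routing rules. */
    class PangaeaPathsSketch {
        static String getData(String fileId) { return fileId + "/.data"; }   // f_getData
        static String getMeta(String fileId) { return fileId + "/.meta"; }   // f_getMeta

        /** f_getObj: strip a trailing /.data or /.meta to recover the file id. */
        static String getObj(String objId) {
            return objId.replaceAll("/\\.(data|meta)$", "");
        }

        /** f_getParent: parent directory of a file id, e.g. "/a/b" -> "/a". */
        static String getParent(String fileId) {
            int cut = fileId.lastIndexOf('/');
            return cut <= 0 ? "/" : fileId.substring(0, cut);
        }

        public static void main(String[] args) {
            // gold-node locations of "/a/b" live in the parent's meta object, "/a/.meta"
            assert getMeta(getParent("/a/b")).equals("/a/.meta");
            assert getData("/a/b").equals("/a/b/.data");
        }
    }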
Whenever an object is locally replicated, the routing rules use the stored
events interface to read and watch the associated .meta object to be informed
of any updates to gold node information.
Update propagation. Replicas of an object are arranged in a connected
graph with invalidation and body subscriptions as the edges. Whenever an
update occurs, the update automatically floods the graph via the subscriptions.
The blocking policy sets the applyUpdateBlock to isValid so that both the body
and invalidation of an update are stored locally before the update is propagated
to other nodes.
Replica addition. When a node, N, tries to access an object that is not
present locally (i.e. read miss), the system proceeds to create a replica locally.
It will look up the location of the gold node for that object from the parent
directory. If the parent directory is not locally available, it looks up its parent.
This can continue recursively up to the root directory.
In order to create a local replica, N contacts the gold node of the object
to retrieve the data and creates links to 3 replicas (1 gold and 2 others) so
that the local replica is integrated in the connected graph with at least one
direct link to a gold node.
File creation. In order to create a new object, a node will pick two other
nodes and establish connections to them for that object. It then declares the
two nodes and itself as gold nodes for that object and updates the parent
directory object with that information. Note that the update automatically
floods the parent object graph.
Failures. As in the original system, temporary failures are simply dealt with
by retrying. For permanent failures, if a non-gold replica fails, its peers es-
tablish connections to other replicas to ensure the connectivity of the graph.
If a gold node fails, a non-gold node is promoted to gold and connections are
established to maintain the gold node clique.
In total, the implementation of P-Pangaea requires 59 routing rules and
1 blocking predicate.
9.4 Properties of Constructed Systems
We explore the properties of constructed systems in terms of their real-
ism, agility, and performance. In particular, we run experiments to investigate
how the systems handle failures, the ease of adding new features for perfor-
mance benefits, and the absolute performance of constructed systems.
9.4.1 Realism
When building a distributed storage system, a system designer needs to
address issues that arise in practical deployments such as configuration options,
local crash recovery, distributed crash recovery, and maintaining consistency
and durability despite crashes and network failures. PADS makes it easy to
tackle these issues for three reasons. First, since the stored events primitive
allows routing policies to access local objects, policies can store and retrieve
configuration and routing options on-the-fly. For example, in P-TierStore, a
node stores in a configuration object the publications it wishes to access. In
P-Pangaea, the parent directory object of each object stores the list of nodes
from which to fetch the object on a read miss.
Second, for failure and crash recovery, the underlying subscription mech-
anisms insulate the designer from implementing low-level recovery logic. Upon
recovery, local mechanisms first reconstruct local state from persistent logs.
Also, PADS’s subscription primitives abstract away many challenging details
of resynchronizing node state. Notably, these mechanisms track consistency
state even across crashes that could introduce gaps in the sequences of in-
validations sent between nodes. As a result, crash recovery in most systems
simply entails restoring lost subscriptions and letting the underlying mecha-
nisms ensure that the local state reflects any updates that were missed.
Third, blocking predicates greatly simplify maintaining consistency dur-
ing failures. If there is a failure and the required consistency semantics cannot
be guaranteed, the system will simply block access to “unsafe” data. On re-
covery, once the subscriptions are restored and the predicates are satisfied, the
data become accessible again.
For example, in an anti-entropy session in P-Bayou, Node A receives up-
dates from Node B via invalidation and body subscriptions. Every time A applies a received update to its local state, its current logical time currentVV advances to include that update. Let's say that, due to a network failure, the subscription disconnects before all the updates are transferred. Two things must be noted. First, reads of local data at A still guarantee causal consistency—
since the applyUpdateBlock is set to isValid, invalidations are only applied
if the corresponding body is received, and since invalidations are applied in
causal order, all locally stored objects are valid and causally consistent. Sec-
ond, recovery is as simple as re-establishing subscriptions to receive updates
from another node. A's currentVV indicates the latest update in causal order it has received. By setting the subscription start time to currentVV, all the updates that A missed will be transferred. The new subscription, in effect, starts from where the previous one left off. Even if A crashes before the anti-entropy session is complete, on recovery, A will recover its currentVV from
the persistent logs. By establishing subscriptions, it can retrieve the updates
it missed.
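As an illustration, the recovery step can be expressed with routing rules of roughly the following shape; the peers table and the requestAE tuple are assumptions for this sketch, and we assume the underlying mechanisms resume the catch-up stream from the receiver's currentVV as described above:

// Sketch: when an incoming anti-entropy subscription from peer P ends,
// ask another peer Q to re-establish the invalidation and body streams to A.
ae1 requestAE(@Q, A, SS) :-
    TRIG subEnd(@A, P, A, SS, _, Type), Type == "inval",
    TBL peers(@A, Q), Q != P.
ae2 ACT addInvalSub(@Q, Q, A, SS, CTP) :-
    requestAE(@Q, A, SS), CTP = "LOG".
ae3 ACT addBodySub(@Q, Q, A, SS) :-
    requestAE(@Q, A, SS).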
In P-Coda, even if a client loses the connection to the server, either due to a network partition or due to a server crash, it is not necessary to implement
special logic to maintain consistency during failures. The blocking conditions
will automatically prevent the read of locally invalid data by blocking the read
until the connection to the server is re-established and a valid version of the
object is retrieved.
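A minimal sketch of the blocking predicate responsible for this behaviour, written in the style of the blocking policy listings in the appendix (the full P-Coda policy appears there; this line is illustrative rather than the exact listing):

Blocking Point    Predicate
ReadNowBlock      isValid (at Client)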
In each of the PADS systems we constructed, we implemented support
for these practical concerns. Due to space limitations we focus this discussion
on the behaviour of two systems under failure: the full featured client-server
system (P-FCS) and TierStore (P-TierStore). Both are client-server systems,
but they have very different consistency guarantees. We demonstrate that the
systems are able to provide their corresponding consistency guarantees despite
failures.
Consistency, durability, and crash recovery in P-FCS and P-TierStore.
Our experiment uses one server and two clients. To highlight the interactions,
we add a 50ms delay on the network links between the clients and the server.
Client C1 repeatedly reads an object and then sleeps for 500ms, and Client C2
repeatedly writes increasing values to the object and sleeps for 2000ms. We
plot the start time, finish time, and value of each operation.
Figure 9.1 illustrates the behavior of P-FCS under failures. P-FCS guar-
antees sequential consistency by maintaining per-object callbacks [42], main-
taining object leases [32], and blocking the completion of a write until the
server has stored the write and invalidated all other client caches.

Figure 9.1: Demonstration of full client-server system, P-FCS, under failures. The x-axis shows time in seconds and the y-axis shows the value of each read or write operation. The plot marks the intervals during which the server and the reader are unavailable: reads continue until the lease expires and then block until the server recovers; writes block until the server recovers or, when the reader is down, until its lease expires.

We configure the system with a 10 second lease timeout. During the first 20 seconds
of the experiment, as the figure indicates, sequential consistency is enforced.
We kill (kill -9) the server process 20 seconds into the experiment and restart
it 10 seconds later. While the server is down, writes block immediately but
reads continue until the lease expires after which reads block as well. When
we restart the server, it recovers its local state and resumes processing re-
quests. Both reads and writes resume shortly after the server restarts, and
the subscription reestablishment and blocking policy ensure that consistency
is maintained.
We kill the reader, C1, at 50 seconds and restart it 15 seconds later.
Initially, writes block; but as soon as the lease expires, they proceed. When
the reader restarts, reads resume as well.
Figure 9.2: Demonstration of TierStore under a workload similar to that in Figure 9.1. The plot marks the intervals during which the server and the reader are unavailable: reads are satisfied locally and writes continue throughout.
Figure 9.2 illustrates a similar scenario using P-TierStore. P-TierStore
enforces monotonic reads coherence rather than sequential consistency, and
propagates updates via subscriptions when the network is available. As a result, all reads and writes complete locally without blocking during network failures. During failure-free periods, the reader receives updates quickly and reads return recent values. However, if the server is unavailable, writes still make progress, and reads return locally stored values even if they are stale.
9.4.2 Agility
A system's requirements change as workloads and goals evolve.
We explore how systems built with PADS can be adapted. We highlight two
cases in particular: our implementation of Bayou and Coda. Even though
they are simple examples, they demonstrate that being able to easily adapt a
system to send the right data along the right paths can pay big dividends.

Figure 9.3: Average read latency (in milliseconds) of P-Coda and P-Coda with cooperative caching.
P-Bayou and small device enhancement. P-Bayou was extended to sup-
port small devices by a simple change of two routing rules. The results are consistent with those depicted in Figure 8.2. If a node requires only a fraction (say, 10%) of the data, the small-device enhancement allows the node to
synchronize only the required subset of data thereby reducing the bandwidth
required for anti-entropy.
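For illustration, the enhancement essentially replaces the "/*" subscription set of the anti-entropy rules with the device's interest set; a sketch in the style of the appendix listings follows, where the interestSet table, the doAntiEntropy and requestAE tuples, and the rule names are assumptions rather than the exact P-Bayou listing:

// Sketch: subscribe only to the subset of data this device is interested
// in, rather than to "/*".
sd1 requestAE(@P, X, SS) :-
    doAntiEntropy(@X, P), TBL interestSet(@X, SS).
sd2 ACT addInvalSub(@P, P, X, SS, CTP) :-
    requestAE(@P, X, SS), CTP = "LOG".
sd3 ACT addBodySub(@P, P, X, SS) :-
    requestAE(@P, X, SS).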
P-Coda and cooperative caching. In P-Coda, support for cooperative
caching was implemented by adding only 8 rules. Figure 9.3 depicts the dif-
ference in read latencies for misses on a 1KB file with and without support
for cooperative caching. For the experiment, the round-trip latency between
the two clients is 10ms, whereas the round-trip latency between a client and
server is almost 500ms.

                     1KB objects            100KB objects
                     Coda    P-Coda         Coda     P-Coda
Cold read            1.51    4.95 (3.28)    11.65    9.10 (0.78)
Hot read             0.15    0.23 (1.53)    0.38     0.43 (1.13)
Connected write      36.07   47.21 (1.31)   49.64    54.75 (1.10)
Disconnected write   17.2    15.50 (0.88)   18.56    20.48 (1.10)

Table 9.1: Read and write latencies in milliseconds for Coda and P-Coda. The numbers in parentheses indicate factors of overhead. The values are averages of 5 runs.

When data can be retrieved from a nearby client, read performance is greatly improved. More importantly, with this new capability, clients can share data even when disconnected from the server.
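For illustration, the core of the added rules looks roughly as follows; the nearbyPeer table, the peerRead tuple, and the rule names are assumptions for this sketch (the fallback to the server mirrors rule cc1 of P-FCS in Appendix A.2):

// Sketch: on a read miss, ask a nearby client for the body; fall back to
// the server if the peer cannot supply it.
co1 peerRead(@P, C, Obj, Off, Len) :-
    TRIG operationBlock(@C, Obj, Off, Len, Bpoint, _),
    Bpoint == "readAt", TBL nearbyPeer(@C, P).
co2 ACT sendBody(@P, P, C, Obj, Off, Len) :-
    peerRead(@P, C, Obj, Off, Len).
co3 ACT sendBody(@S, S, C, Obj, Off, Len) :-
    TRIG sendBodyFailed(@C, _, C, Obj, Off, Len, _, _), TBL serverId(@C, S).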
9.4.3 Absolute Performance
Our goal is to provide sufficient performance to be useful. We compare
the performance of a hand-crafted implementation of a system (Coda) that
has been in production use for over a decade and a PADS implementation of
the same system (P-Coda). We expect P-Coda to pay some overheads over
Coda because PADS is a relatively untuned prototype rather than well-tuned
production code.
Table 9.1 compares the client-side read and write latencies of Coda and
P-Coda. The systems are set up in a two client configuration. To measure the
read latencies, client C1 has a collection of 1,000 objects and Client C2 has
none. For cold reads, Client C2 randomly selects 100 objects to read. Each
read fetches the object from the server and establishes a callback for the object.
C2 re-reads those objects to measure the hot-read latency. To measure the
connected write latency, both C1 and C2 initially store the same collection of
1,000 objects. C2 selects 100 objects to write. The write will cause the server
to store the update and break a callback with C1 before the write completes
at C2. Disconnected writes are measured by disconnecting C2 from the server
and writing to 100 randomly selected objects.
The performance of PADS’s implementation is comparable to the orig-
inal system in most cases and is at most 3.3 times worse in the worst case we
measured.
9.5 Summary
This section evaluates the benefits of PADS as a development plat-
form for distributed storage systems. First, by constructing 8 very different systems, this section demonstrates that the primitives provided by PADS are sufficient to implement a wide range of systems. Second, the systems were easy to build—each required only a couple dozen routing rules and a few blocking conditions, and only a couple of weeks of development time. Third, despite the conciseness of the specification, each system was concrete in the sense that it could handle real-world issues such as recovery from fail-
ures and configuration options. Fourth, the systems could be easily adapted to
new environments by changing just a handful of rules. Lastly, the performance
of constructed systems was reasonable—in the case of P-Coda, read/write per-
formance is within 50% of the original Coda system and three times worse in
the worst case we measured. This evaluation has led us to the conclusion
that PADS does indeed achieve the goal set out by this dissertation—to make
development of new distributed replication systems easier.
Chapter 10
Related Work
What sets PADS apart is the flexibility of its replication protocol, its
use of declarative specification, and its efficient conflict detection mechanisms.
In this chapter, we describe other replication protocols and the limits of their
flexibility, other applications for declarative domain specific languages, and
other approaches for conflict detection.
10.1 Replication Protocols
The ability of PADS to support a wide range of systems stems from
the bookkeeping information maintained at each node and the update transfer
protocol. Every node is associated with an interest set—the set of data the
node replicates locally and is interested in receiving updates about. In PADS,
instead of maintaining data and meta-data pertaining to its interest set only, a
node also maintains information about updates to objects not belonging to its
interest set. This information is summarized as logical gaps in its causal update
log. In addition, when a node transfers updates to another node, it transmits
summaries of updates to objects not in the receiver’s interest set as logical gap
markers. This extra information allows a node to maintain consistency invariants
despite topology independence (TI) and partial replication (PR). With these
invariants, various consistency semantics (AC) can be implemented. Other
replication protocols do not send or maintain such information, and therefore,
cannot simultaneously provide all three PR-AC-TI properties.
Client-server and hierarchical systems. Client-server systems like Spri-
te [65] and Coda [49] and hierarchical systems like TierStore [26] and hierar-
chical AFS [42] are able to support a range of consistency semantics (AC), from
monotonic read coherence to linearizability. They assume that there is a pri-
mary node (such as a central server) that maintains the full repository of data
and other nodes replicate subsets of the data (PR). However, these systems
allow updates to propagate via fixed paths only (i.e. along hierarchical connec-
tions). Even if some systems support co-operative caching, control messages
propagate via the fixed topology.
Even though these systems support partial replication, they require that a child node's interest set is a subset of its parent's. Because of the restricted communication paths and replication options, every node is able to track the interest sets of its children and only transmits updates and meta-
data pertaining to their interest sets. On the contrary, when these systems
are implemented on PADS, extra information (i.e. logical gap markers) for
updates not pertaining to the interest set is also sent down the tree.
Server-replication systems. The main requirements for server-replication
systems are strong (or at least causal) consistency (AC) and high availability.
Some server-replication systems like Replicated Dictionary [90] and Bayou [69]
allow any node to send updates to any other node (TI), whereas others like
Chain Replication [88] define specific paths for update propagation. These
systems fundamentally assume full replication: all nodes store all data from
any volume they export and all nodes receive all updates. Because of full
replication, the required consistency semantics can be efficiently implemented.
For example, in Bayou, a log-based update propagation protocol, per-object
time stamps, and a single version vector are sufficient to order all received
updates to enforce causal consistency. If Bayou’s protocol is used in a scenario
with partial replication and ad-hoc communication patterns, the bookkeeping
information would not be sufficient, and the system would fail to meet its
consistency requirements. On the other hand, PADS sends and maintains extra
information that allows a node to order all updates and enforce its consistency
invariants.
Object replication systems. Object replication systems like Ficus [36],
Pangaea [77], and WinFS [67] allow nodes to store arbitrary subsets of data
(PR) and support arbitrary transmission patterns (TI). Unfortunately, the or-
der in which updates to different objects are sent from one node to another
is not guaranteed. Therefore, these systems support state-based rather than
log-based transfer and can only provide weak consistency guarantees such as
monotonic-read coherence. On the other hand, PADS transmits updates in
causal order with logical gap markers for missing updates. In addition, detect-
ing concurrent updates to a single object requires more per-object bookkeeping
to be maintained (refer to Section 10.3). PADS takes advantage of the up-
date transfer ordering to keep the information maintained for an object to a
minimum.
Some systems, such as Cimbiosys [70], distribute data among nodes not
based on object identifiers or file names, but rather on content-based filters.
We see no fundamental barriers to incorporating filters in PADS to identify sets
of related objects. This would allow system designers to set up subscriptions
and maintain consistency state in terms of filters rather than object-name
prefixes.
DHT-based systems. DHT-based storage systems such as BH [87], PA-
ST [76], and CFS [21], allow different nodes to store subsets of different data
(PR). However, they implement a specific update propagation topology and
replication policy that makes it difficult to maintain update ordering informa-
tion across objects. Therefore, they only provide weak per-object coherence
guarantees.
Tunable systems. Systems like the TACT toolkit [43] and Swarm [85]
provide applications with a menu of options for consistency and dynamically
initiate update propagation (or reconciliations) in order to meet the specified
consistency levels. However, in order to support the range of consistency levels,
they impose restrictions either on replication or topology. For example, the
TACT toolkit only supports full replication and Swarm assumes hierarchical
topology in addition to full replication. On the contrary, PADS is able to
support the whole range of TACT consistency dimensions without imposing
the replication and topology restrictions. Fluid replication [20] also exposes a
range of consistency options. It dynamically creates replicas and carries out
reconciliations to meet performance and consistency needs. However, it only
supports hierarchical topology. In fact, like PADS, fluid replication employs update summaries to quickly transfer update information and establish update ordering at the server. However, unlike PADS, these summaries are only sent
from waystations to the server, rather than on all communication links.
Other systems like Deceit [80], WheelFS [84], and Dynamo [25], provide
applications with a wide range of options to control the level of replication,
placement of replicas, and the sizes of read/write quorums for a given topology.
Because every object may have a different replication policy, it is difficult to
support cross-object consistency. Instead, these systems provide a range of
single-object guarantees, from one-copy serializability to maximum staleness
to eventual consistency, that can be specified with the provided options.
10.2 Declarative Domain Specific Languages
R/OverLog falls in the same category of declarative domain specific lan-
guages (DSL) such as SQL [3], HTML [1], OverLog [56], and SQCK [35], that
have been used to aid development. The common benefit of these languages
is that they allow designers to specify high-level intent rather than implement
low-level details (i.e., what to do rather than how to do it). For example, the
SQL select statement describes what data is required and not how it should
be retrieved, an HTML file describes what the interface should look like and
not how it should be rendered, and an OverLog program describes what data
flows should be established but does not explicitly establish them. Similarly,
an R/OverLog policy describes what update paths should be established but
does not implement the update propagation protocol.
These languages rely on an underlying layer that takes care of the
low-level details required to implement the high-level intent. A high-level
specification has multiple benefits. First, it leads to concise programs making
them easy to maintain and modify. Second, it is portable in the sense that
the program is isolated from specific implementations and optimizations of the
underlying layer.
A common criticism of R/OverLog is its steep learning curve. However, as the popularity of SQL and HTML indicates, a declarative DSL can be the better approach for some tasks. R/OverLog demonstrates the feasibility of a declarative approach for this first generation of systems. Perhaps in
the next generation, effort should be spent on developing a more user-friendly
syntax to further aid development.
On a different note, Mace [48]/Macedon [74] is another event-driven language for setting up network overlays. It uses the finite-state-machine abstraction for describing overlays. We do not use Mace in PADS because
we worry about the number of states that need to be defined for a node in
order to handle failures and recovery of multiple subscriptions and for various
consistency policies. However, given that Ramasubramanian et al. [70] have
used Mace to implement the Cimbiosys distributed file system, it would be
interesting to compare the ease of defining PADS policy with Mace and with
R/OverLog.
10.3 Conflict Detection
PADS focuses on syntactic conflict detection based on the causal re-
lationship [52] rather than relying on any application-specific semantics. In
particular, any two updates to the same object that are not causally related are considered to be conflicting.
Current schemes to detect conflicts fall into three main families:
• Previous stamps. In this approach [16, 33], whenever an object is overwritten, the time stamp of the previous version (the previous stamp) is also stored. When the update is propagated to another node, the previous
stamp is also sent with the write stamp of the update. When a node re-
ceives an update, it compares the previous stamp of the received update
with the write stamp of the locally stored version. If they mismatch,
then the update is marked as a conflict. This approach can accurately
detect all conflicts in any log exchange protocol that ensures the prefix
property [22], but it adds an extra per-update overhead for both storage
and network bandwidth. More importantly, in the case when the log is
truncated and a node falls back to state-based exchange, false positives
are possible due to missing intermediary updates.
• Hash histories. Kang et al. [45] use hash histories to detect conflicts.
Whenever an update is locally applied, a node creates a new hash sum-
marizing the entire current state. Each node keeps a list of hashes or-
dered by when they were generated. Whenever a node A synchronizes
its state with another node B, it looks up B’s last hash in its own hash
history. If the hash exists, then A’s version is a newer version. Similarly,
if B finds A’s last hash in B’s hash history, then B’s version is a newer
version. If neither of the last hashes exists in the other’s history, then
the system marks the synchronization as conflicting. Although the size
of hash history is independent of the number of replicas, it grows pro-
portionally to the total number of updates. More importantly, because
hashes summarize the entire local state, false positives are can occur
when different objects are concurrently updated.
• Version vectors. Many systems [37, 49, 71, 77] use version vectors to de-
tect conflicts. A version vector [44] accurately captures the causality
relationship between two updates. Two writes are conflicting if and only
if neither of their version vectors dominates the other. Although this
approach can accurately detect conflicts, it is expensive to maintain a
version vector for every object, especially in large-scale systems.
Predecessor vectors with exceptions (PVE) [59] and vector sets [58] are
variations of the version vectors approach employed by WinFS [67] designed
to reduce the total number of version vectors maintained and communicated.
PVEs can reach an unbounded size if synchronizations are frequently dis-
rupted, making them unsuitable for environments with intermittent connec-
tions. Vector sets maintain predecessor vectors for subsets of data, and in
the worst case, have overheads equivalent to a simple version vector scheme.
PADS’s dependency summary vectors (DSV) are actually equivalent to prede-
cessor vectors. However, due to the properties of the synchronization protocol, instead of storing DSVs explicitly in a data structure, PADS derives them from the consistency meta-data already stored. In addition, the meta-data stored
and sent during synchronization does not increase with network disruptions.
Chapter 8 evaluates the overheads of PADS and demonstrates that PADS’s
overheads are equivalent to other state-of-the-art schemes.
Chapter 11
Limits and Future Work
PADS's flexibility is sufficient to cover a broad range of systems. However, because of the small API, there are several design aspects and trade-offs it does not expose to designers. This chapter discusses the limits of the PADS architecture and the avenues of future work that can be explored.
11.1 Limits
The limits of PADS can be roughly divided into three categories:
• Limits imposed by the model itself: There are several high-level design
aspects, such as security and transactions, that are not covered by the
routing and blocking abstractions.
• Limited extensibility of the implementation: The mechanisms layer pro-
vides a default implementation of PADS primitives. Even though the
default implementation is generally applicable, some parts may need to
be reimplemented to build systems that are better suited for the target
requirements.
• Limited parameters for performance tuning: The mechanisms layer pro-
vides limited options to tweak the current primitive implementation, which can lead to non-optimal system performance.
In this section, we discuss each category in turn.
11.1.1 PADS Model
The routing and blocking abstractions exposed by PADS do not cover
the following aspects that play an important role in distributed storage system
design:
Security. An important part of distributed data storage system design is defining the security guarantees a system provides. Security guarantees cover the
following properties:
• Data access control: ensures that data is accessible only to authorized
users.
• Data integrity: ensures that any malicious or accidental altering of data
can be detected.
• Data privacy: ensures that no information (such as the amount of data
stored or pattern of access) is exposed to unauthorized users.
Different systems provide different levels of guarantees for each security
property. In order to enforce these guarantees, a wide range of technologies,
such as encryption, secure channels, access control lists and trusted hardware,
are used.
Security specification does not match the blocking and routing abstrac-
tions provided by the current PADS model. We should consider augmenting
the PADS architecture to allow security specification as a separate policy com-
ponent. We suspect that the security abstraction would allow designers to
address three aspects:
• Basic access control: It seems plausible that straightforward use of en-
cryption and signatures could secure reads and writes so that one is only
able to read and write an object if one has the appropriate credentials.
• Flexible policy hooks: It seems plausible to add something similar to the
five blocking points to manage encryption and validation decisions.
• Consistency: In addition to enforcing security on reads and writes, securing update ordering is also needed. It may be plausible to apply the content-hash-based protocol described by Mahajan et al. [57] for that purpose.
Distributed transactions. The current PADS model is sufficient for a
small subset of distributed transactions. For example, for a single update,
two-phase commit can be implemented by writing appropriate routing policy
and blocking policy and using the assign seq action to commit updates (re-
fer to Section 5.4.7). However, for transactions that involve reads and writes
to multiple objects stored across multiple nodes, the PADS model falls short.
Applications have no means to specify what reads and writes make up a trans-
action. Also, because only a single version of an object is maintained, if the
transaction that created the update is aborted, the mechanisms do not sup-
port rollback to a previously committed version.
We suspect that support for distributed transactions can be easily
added by implementing a tentative buffer that stores intermediate results and
a primitive that moves tentative updates to stable store once the transaction
is committed. We would also require additional triggers that provide informa-
tion about the propagation of tentative updates so that appropriate commit
policies can be implemented.
Erasure coding. PADS provides data availability by employing data repli-
cation rather than erasure coding. Even though PADS objects cannot be
erasure-coded, PADS can be used to replicate erasure-coded fragments. A
library can be implemented at the read/write interface that translates a file
into fragments and stores each fragment as a separate PADS object. The routing
policy and the blocking policy can be used to specify how the fragments are
replicated.
Quorums. Some distributed storage systems [25] use quorums to provide
consistency guarantees. Write quorums can be easily implemented with cur-
rent blocking and routing abstractions—every node sends an ACK to the orig-
inal writer when it receives an update and the writeAfterBlock predicate is set
to unblock if sufficient ACKs have been received. However, implementing a
read quorum with PADS is complex. It involves implementing a library over
the read/write interface that enforces the read quorum—when an object is
read, out-of-band communication is used to retrieve data from other nodes,
consolidate the responses, and return the latest version. It would be good to allow designers to easily specify read and write quorums.
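For illustration, a write quorum of size W might be specified roughly as follows; the ackWriter and ackCount tuples, the quorumSize table, and the rule names are assumptions for this sketch, while the writeComplete event and the B Action predicate mirror the ones used by P-SCS in Appendix A.1:

// Sketch: each replica acks the writer when it receives the invalidation;
// the writer counts acks and signals the blocking policy once W acks are in.
wq1 ackWriter(@W, N, Obj, Stamp) :-
    TRIG invalArrives(@N, _, Obj, _, _, W, Stamp).
wq2 ackCount(@W, Obj, Stamp, a COUNT<*>) :-
    ackWriter(@W, N, Obj, Stamp).
wq3 BACT writeComplete(@W, Obj) :-
    ackCount(@W, Obj, Stamp, Count), TBL quorumSize(@W, Q), Count >= Q.

Blocking Point     Predicate
WriteAfterBlock    B Action(writeComplete, objId) (at Writer)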
11.1.2 Extensibility
The implementation of PADS makes specific choices that work well
for a broad range of environments. These choices are not fundamental to
the PADS architecture but have implications for performance, resource, and
availability guarantees of the system. Sometimes a designer may want to
change specific aspects of the system in order to optimize the system for the
target environment. The aspects can include:
• Local storage: In the current implementation, BerkeleyDB is used as the
back-end for the persistent update log and the in-memory data store.
Designers may want to control how data is locally stored for perfor-
mance reasons, such as allowing objects to be stored as fixed-size chunks
rather than as variable-sized byte-ranges, compressing stored bodies, im-
plementing an in-memory-only log, customizing when the log or store is
synchronized to disk, or using a different back-end for storing data.
• Replacement protocols: Currently, objects in the store are never replaced.
Designers may want to implement cache replacement protocols (such as
LRU or priority-based schemes) to control the resources taken up by the
storage.
• Network protocol: Currently, all communication occurs over TCP connections. Designers may want to use connections for data propagation and policy communication that are best suited to their needs, such as TCP-
Nice connections, UDP connections, secure connections, delay-tolerant
connections, and pluggable protocols.
• Namespace: PADS imposes a hierarchical namespace on objects. For
example, the string /a/* covers all objects with the prefix /a/ such as
/a/x and /a/y. This hierarchy is used as a basis for subscription es-
tablishment and for specifying imprecise invalidation target sets. The
hierarchical namespace is well suited for implementing systems that use
the file-name abstraction for access. However, increasingly tags or fil-
ters [70] are being used to access data and the hierarchical namespace is
not efficient for filter-based access. Designers may want to modify how
the object namespace is defined so that alternate access schemes can be
efficiently implemented. For example, by using bloom filters, a group of
unrelated objects can be concisely summarized.
• Conflict resolution: The current implementation uses the last-writer-wins
scheme to resolve conflicting updates (i.e. for two conflicting updates,
one of them will overwrite the other). With this scheme, one of the updates is lost, which may not be acceptable for all applications. Some
applications require that the final version of the object consolidate both
updates in a manner that may be application-dependent. For exam-
ple, an object that represents a directory object in a file system will be
consolidated differently from an object that represents a shopping-cart
object [25].
Currently, if a designer wants to change a certain aspect of the imple-
mentation, she has to make that modification by hand. The PADS imple-
mentation is sufficiently modular—changing one of the above aspects requires
updates to only a small part of the system. One can imagine making all
this extensible—by defining interfaces so that designers can pick or plug in
alternative implementations.
11.1.3 Current Implementation Limits
In addition to providing a single standard implementation of the prim-
itives, the current mechanisms layer also hides options that, if exposed, would
lead to more efficient implementations. These options include:
• Scheduling of outgoing updates: Currently, if an invalidation subscrip-
tion and a body subscription are established from Node A to Node B,
whenever an update occurs at Node A, the invalidation and the body
of the update are sent immediately to Node B. If the timing of when
the updates are sent can be controlled, then techniques to improve per-
formance can be incorporated [64]. These techniques include removing
updates pertaining to temporary objects, and batching updates so that
for quick consecutive updates to the same file, only a single body is sent.
• Splitting and joining interest-sets: Currently, consistency-related meta-
data such as preciseness is maintained on a per interest set basis. The
meta-data is stored as a version vector in a hierarchical structure, called
ISSTATUS, in which each node corresponds to an interest set. Depend-
ing on the subscriptions established and the imprecise invalidations re-
ceived, the hierarchical structure may be split to store version vectors
associated with smaller interest sets. For example, assuming that ini-
tially the ISSTATUS stores a version vector for /a/*. If an incoming
subscription is established for the object /a/x, the interest set in the
ISSTATUS is split so that it stores 2 version vectors—one for /a/x and
another version vector for all other objects in /a/*. If another incoming
subscription is established for /a/y, then the interest set is split further
to store a version vector for /a/y separately. If many such subscrip-
tions are established, it is possible that over the course of time, a version
vector is stored for every local object. By implementing the capability
of joining interest sets and allowing designers to specify the policy for
splitting and joining interest sets, designers gain fine-grained control over how meta-data is maintained, possibly leading to storage savings.
• Generation of imprecise invalidations: Currently, for an outgoing sub-
scription, an imprecise invalidation is generated for one of two reasons: a
precise invalidation needs to be sent on the subscription or the time since
the last precise invalidation was sent has reached the threshold time (i.e.
a time out). In the first case, the imprecise invalidation summarizes all
updates that have occurred to objects not in the subscription set since
the last precise invalidation sent. In the second case, the imprecise inval-
idation summarizes all updates that have occurred to objects not in the
subscription set since the last precise invalidation sent up to the current
time.
Since the current generation policy aims to be conservative and to achieve
good compression, the granularity of the imprecise invalidation may be
too big and make a larger-than-necessary portion of the receiver’s objects
imprecise. For example, in order to summarize updates to /a/x/p and
to /a/y/q, an imprecise invalidation for /a/* may be generated. This
invalidation will make the whole /a/* subtree imprecise and possibly make all objects in /a/* unavailable for access. By exposing the pol-
icy for the generation of imprecise invalidations, the granularity of the
imprecise invalidations can be controlled in order to improve the overall
data availability.
11.2 Future Work
This dissertation opens up several avenues for future work that can be
important improvements to PADS as a development platform.
11.2.1 Streamlined Implementation
The performance achieved by PADS is not sufficient for demanding
workloads. We suspect that the following aspects of the mechanisms layer can
be redesigned or reimplemented to improve the performance.
• Instead of allowing updates to cover variable-sized byte-ranges of an ob-
ject, we should consider restricting updates to fixed-size chunks, like the
NFS protocol. Because different parts of an object are stored as variable-
sized chunks corresponding to different byte-ranges, extra complexity is
introduced for storage, body look-up, and meta-data maintenance. For
example, when an update is received, it may span multiple chunks and only partially overlap some of them. Extra processing is needed to detect the overlap boundaries, consolidate the chunks, and update the meta-
data for each chunk. By storing only fixed size chunks, much complexity
can be reduced.
• The data store is currently a big bottleneck for local read/write performance. The back-end of the data store is BerkeleyDB. Even though the BerkeleyDB environment is configured to asynchronously sync to disk,
BerkeleyDB carries out a lot of processing under-the-hood, such as log
garbage collection, transaction locks, etc. We may want to consider im-
plementing a custom in-memory store that can give us better control
over the performance.
• We may want to redesign how meta-data is stored and how it is up-
dated when an invalidation is received. Currently, when an invalidation
is locally applied, it requires updating object meta-data, the meta-data
associated with the byte-range, and clearing up invalid bytes. This op-
eration involves several database accesses and can be pretty slow if mul-
tiple threads are updating the meta-data. Perhaps performance can be
improved by implementing in-memory meta-data maintenance and finer-
grained locks for meta-data and object stores.
• The current implementation of outgoing body subscriptions is inefficient.
Whenever an outgoing body subscription is established, every byte-range
in the object store is iteratively scanned in order to look for bodies that
belong to the subscription set and that are newer than the subscription
start-time. If the store contains a lot of objects, this scan can take a
very long time. It is plausible that by introducing secondary keys or by
maintaining extra bookkeeping at the object level, the time required for
subscription establishment can be reduced.
• The mechanisms layer maintains several buffers that have no clean-up policy, allowing them to increase in size indefinitely. These include the
body buffer that keeps bodies that arrived before their corresponding
invalidations, and the object store that does not implement any object
replacement policy. It would be beneficial to spend some effort to track down such buffers and implement appropriate clean-up policies.
• We should consider implementing protocols to summarize version vec-
tors. Since meta-data is maintained in terms of version vectors, the
overheads are proportional to the number of nodes in the system. For
environments that involve a large number of nodes, such as clusters and
data centers, PADS imposes unacceptable overheads.
11.2.2 Formal Model and Verification
Designs and implementations of distributed data storage systems are
complex and proving that they are correct can be difficult. In fact, ensur-
ing that a system meets its consistency guarantees is further complicated by
concurrent updates and failures.
With PADS, because a system is described with a high-level language
over a small set of abstractions, we can reason about it more easily. It is
plausible that we can take advantage of the abstractions provided by PADS
to develop a formal model that can allow designers to check or prove high-
level properties of their systems. The model should allow designers to specify,
check, and/or verify both the safety and the liveness properties of systems (i.e.
whether consistency guarantees are satisfied or whether the implementation is
free from deadlocks).
One approach to verify a system is to create a model of the system, ver-
ify the model and generate an executable/deployable system from the model.
This approach requires writing the PADS architecture, the system design, and
the system constraints in a modeling language such as Murφ [27], Promela [2],
or MACE [48]. A model checker is run to ensure that the design meets the
specified constraints. Once the model is verified, the model is translated so
that it can be executed. There are two options for translation. First, the trans-
lator can generate R/OverLog code and blocking conditions that work with
the current PADS implementation. Alternatively, the model can be translated
directly into C++ or Java code. This option is relevant if we were to use
MACE as the modeling language which directly generates C++ code. How-
ever, its state-machine abstraction does not easily match current PADS
abstractions. The implications of this option need to be further examined
since it may require a rewrite of the underlying framework.
Another possible approach is to do just the opposite. Instead of spec-
ifying a system in a modeling language, the system is written in terms of
R/OverLog rules and blocking conditions and is then either directly checked
or is translated into a modeling language that can be checked.
Whatever the approach, the main challenge is to define a constraint lan-
guage so that consistency, availability, performance and durability constraints
of a distributed storage system can be easily written and verified. Pip [72]
uses a declarative language for specifying expected behaviour of systems. Its
declarative nature makes it an attractive candidate for specifying constraints
in PADS. Its applicability in the context of distributed storage systems requires
further study.
On a different note, some checkers, such as CrystalBall [91], MaceODB [24],
and Pip, verify that a system is running correctly during execution. The ad-
vantage of this approach is that in addition to safety and liveness properties,
performance and resource constraints can also be specified.
11.2.3 Self-tuning File System
Current technology trends indicate that the amount of data users generate and access on a daily basis is increasing. In addition, users have access to a wide range of devices including mobile phones, laptops, netbooks, ebook readers, and portable music players. Because multiple copies of a user's data may
be spread over multiple devices, data management becomes a major headache.
For an average user, locating latest versions of data, consolidating divergent
copies, and ensuring that the required data is always available is no easy feat.
In fact, with most devices being equipped with multiple networking
technologies, such as WIFI, Bluetooth, and 3G connections, and the almost-
always availability of cloud storage, it should be possible to implement a dis-
tributed file system that transparently manages user data, enabling users to
access the latest version of their data from any device without any effort. In
addition, the file system should be self-tuning, i.e., it should be able to take into account available Internet and peer connectivity, costs for cloud storage, usage patterns, and available battery life in order to propagate data along paths that yield the best monetary cost, power, or performance tradeoffs.
Our preliminary design for such a file system is based on a client-server
scheme with support for peer-to-peer transmissions. Because of its availability,
cloud storage is best suited to act as the server for meta-data and a large
portion of recently accessed data. Updates are propagated based on an eager-
meta-data-lazy-data scheme—whenever an update occurs, the meta-data is
quickly propagated to the cloud whereas the data of the update is sent if
the opportunity is appropriate. The advantage of this approach is that the
eager meta-data propagation makes it easy to keep track of where the latest
version of an object is, and the lazy data propagation makes it possible to
delay sending the data if the network conditions are not good (i.e. either
the network is too slow or not energy-efficient). As the first step, we have
designed a policy that focuses on energy efficiency that can potentially lead
to two orders of magnitude of savings [15]. However, more work needs to be done to build a hero file system that self-tunes and ensures that data is accessible most of the time.
11.2.4 New Applications
PADS makes it easy to implement and enforce consistency guarantees
for distributed data. It would be interesting to explore whether we can apply
PADS to domains other than data storage. One domain in which PADS can
be useful is structured overlays. Structured peer-to-peer overlays are often
used as the underlying substrate for content-dissemination, file sharing and
VOIP applications. Most existing overlay protocols [75, 83, 95] focus on the
design of overlay topology, reducing maintenance costs, and improving rout-
ing performance. They do not focus on enforcing consistency of the routing
information stored at each node. The lack of consistency guarantees can re-
sult in routing errors especially during failures and churn. By enforcing routing
consistency and reducing routing errors, it is possible to provide better overlay
performance, availability, and in turn, better QOS guarantees to applications.
Unfortunately, enforcing routing consistency is complex [19] and so most sys-
tems give up on consistency and implement mechanisms to detect and remove
erroneous state instead [83].
We suspect that the complexity of enforcing consistency can be reduced
if PADS is used as the infrastructure to maintain and propagate routing state.
First, PADS provides high-level abstractions with which consistency semantics
can be easily specified. Second, PADS mechanisms ensure that updates are
sent in order. Third, with the support of formal verification, it is possible to
prove that consistency guarantees are enforced even during failures.
Chapter 12
Conclusion
This dissertation aims to make it easier to build new distributed storage
systems. In order to achieve this goal, it presents the PADS approach according
to which development is carried out by specifying policy over an underlying
mechanisms layer. In order to demonstrate the benefits of this approach,
instead of building a single new distributed storage system, this dissertation
constructs several systems inspired from literature that cover a broad spectrum
of the design space. This chapter summarizes the contributions of this work
and the research lessons learned.
12.1 Contributions
From a developer's point of view, using PADS has several advantages. First, it provides a clean separation of system design into routing pol-
icy and blocking policy. Second, it is sufficiently flexible to support a broad
range of systems, which is demonstrated by the fact that we have built client-
server systems, server-replication systems, and object-replication systems with
PADS. Third, development on PADS is significantly easier and shorter than
building a system from scratch. The systems constructed on PADS required
less than a hundred routing rules and a couple of blocking conditions; each required only a few weeks of development time. Fourth, the systems developed
on PADS can be easily modified to incorporate new features or address new
requirements. Lastly, the overheads associated with the PADS prototype are
within small constant factors of hand-tuned systems.
In the development and the evaluation of the PADS approach, this
thesis makes the following contributions:
• It demonstrates that the effort required to build distributed storage sys-
tems can be significantly reduced by adopting a policy specification-
based approach.
• It defines a new dichotomy for distributed storage system design. In
particular, system design and implementation is separated into routing
policy that addresses performance and availability goals, and blocking
policy that addresses consistency and durability concerns.
• It demonstrates that a small API is sufficient to build a broad range of
distributed storage systems.
• It defines and provides an implementation of a general set of replication
mechanisms that are sufficiently flexible to support all three PR-AC-TI
properties.
• It defines a new domain specific language, R/OverLog, that allows con-
cise and elegant specification of routing policy.
• It provides a set of useful prototypes for different distributed storage
systems that can be deployed for moderate workloads.
12.2 Research Experience
The most important lesson we learned during the pursuit of this thesis is
that a great solution is often the result of well-thought-out evolution. Similarly, the current PADS is the result of four stages of evolution.
We started with a basic mechanisms layer which had an extensive API.
Unfortunately, we had no concrete way to map the designs of existing systems onto the provided API. We realized that a lot of the design of distributed storage systems is the routing of information among nodes. That realization en-
abled us to develop the subscriptions primitive. We were also able to take
advantage of OverLog, and later R/OverLog, to define routing policy.
In the second step of the evolution, we were struggling with how to
cleanly define the “rest” of the system design. We realized that the “rest” boils
down to consistency and durability guarantees which could be implemented
by blocking access to data. At first, we required that systems designers imple-
ment libraries over the local interface and only some simple read-block flags
were provided. However, after implementing several consistency libraries, we
noticed that the enforcement of consistency guarantees often required looking
up local consistency bookkeeping and sending or waiting for messages from
routing policy. This observation led to the definition of blocking points and
blocking predicates.
The next step in the evolution was the definition of the stored events
interface. During the development of P-Pangaea, we realized that the routing
and blocking API were not sufficient to allow us to persistently store gold node
information. We designed the stored event interface so that local data objects
can be easily accessed in a manner that works well with the event-driven
model for routing policy. The stored events interface also allowed reading
and writing configuration information such as server IDs into objects. We
previously handled configuration information in ad-hoc ways such as hard-
coding it in the routing policy.
Finally, the last step in the evolution was the addition of the commit
operation and the isSequenced predicate. We found difficulty in implementing
the primary-commit protocol due to the lack of ordering guarantees provided
by R/OverLog. For example, when a primary commits updates, the commit
information may be sent to different nodes in different orders. By introducing
the commit operation and by ensuring that commit invalidations are causally
transferred, we were able to implement the primary-commit protocol in a clean,
concise manner.
Appendices
Appendix A
Code Listings
This appendix provides the code listings for the systems constructed
with PADS. In order to make the R/OverLog code easier to read, we add
prefixes to R/OverLog tuples: ACT for actions that call into the mechanisms
layer, TRIG for triggers from the mechanisms layer, TBL for table lookups, RCV
for events received via the stored events interface, and B ACT for actions that
go to blocking policy. For the blocking policy, we only list the predicates that
are not set to the default “true” value.
A.1 Simple Client Server (P-SCS)
The implementation of P-SCS requires 24 routing rules and 5 blocking
conditions.
A.1.1 Routing Policy
/***********************************************************************/
// Initialization: Read server configuration, and
// store serverId in a table
/***********************************************************************/
in1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.serverCFG".
in2 TBL serverId(@X, S) :-
RCV serverConfig(@X, S).
/***********************************************************************/
// Client establishes subscriptions to
// server for transferring updates.
/***********************************************************************/
csSb1 ACT addInvalSub(@X, X, S, SS, CTP) :-
RCV serverConfig(@X, S), SS="/*", CTP=="LOG", X 6=S.
csSb2 ACT addBodySub(@X, X, S, SS) :-
RCV serverConfig(@X, S), SS="/*", X 6=S.
csSb3 ACT addInvalSub(@X, X, S, SS, CTP) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="inval", CTP=="LOG".
csSb4 ACT addBodySub(@X, X, S, SS) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="body".
/***********************************************************************/
// On read miss, client store info in a table and informs server.
/***********************************************************************/
rm1 TBL readMiss(@C, Obj, Off, Len) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C 6=S.
rm2 clientRead(@S, C, Obj, Off, Len) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C 6=S.
/***********************************************************************/
// On client read miss, server sends body and establishes callback
/***********************************************************************/
cb1 ACT sendBody(@S, S, C, Obj, Off, Len) :-
clientRead(@S, C, Obj, Off, Len).
cb2 ACT addInvalSub(@S, S, C, Obj, CTP) :-
clientRead(@S, C, Obj, Off, Len), CTP = "CP".
cb3 TBL hasCallback(@S, Obj, C) :-
clientRead(@S, C, Obj, Off, Len).
/***********************************************************************/
// When client receives an invalidation, if it does
// not satisfy a read miss, ack server and cancel callback.
// Otherwise, remove it from the miss table
/***********************************************************************/
rinv0 satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, a COUNT<*>) :-
TRIG invalArrives(@C, S, Obj, Off, Len, Writer, Stamp),
TBL serverId(@C, S), S 6=C,
TBL readMiss(@C, S, Obj, Off, Len).
rinv1 ackServer(@S, C, Obj, Off, Len, Writer, Stamp) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count==0.
rinv2 ACT removeInvalSub(@C, S, C, Obj) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count==0.
rinv3 delete TBL readMiss(@C, S, Obj, Off, Len) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count > 1.
/***********************************************************************/
// When server receives an invalidation,
// gather acks from others who have callbacks.
// Note:
// - Acks are accumulative so one received ack also
// satisfies all previous required acks.
// - needAck table keeps track of which acks are needed.
/***********************************************************************/
ack1 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Writer, Stamp),
TBL hasCallback(@S, Obj, C2), C2 6= Writer,
Need = 1, TBL serverId(@S, S).
ack2 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Stamp, Writer),
C2 == Writer, Need = 0, TBL serverId(@S, S).
ack3 TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , , RecvStamp),
TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, ),
NeedStamp < RecvStamp, Need= 0, TBL serverId(@S, S).
ack4 TBL needAck(@S, Obj, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , RecvWriter, RecvStamp),
TBL needAck(@S, Obj, C, NeedWriter, NeedStamp, ),
NeedStamp == RecvStamp, NeedWriter <= RecvWriter,
Need= 0, TBL serverId(@S, S).
ack5 delete TBL hasCallback(@S, Obj, C) :-
TBL needAck(@S, Obj, , , C, , , Need),
Need = 0, TBL serverId(@S, S).
ack6 acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, a COUNT<*>) :-
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedTrig),
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedCount),
NeedTrig == 0, NeedCount == 1.
/***********************************************************************/
// If no more acks are needed, assign Sequence number,
// inform client, and clear out ack table
/***********************************************************************/
ack7 ACT assignSeq(@S, Obj, Off, Len, Stamp, Writer) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack8 BACT writeComplete(@Writer, Obj) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack9 delete TBL needACK(@S, Obj, C, WriterId, RecvStamp, Need) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
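Rules ack1-ack9 maintain a needAck entry for every client that holds a callback on the written object and sequence the write only once every such entry has been satisfied. The following is a minimal Python sketch of that bookkeeping, for illustration only; the table layout and helper names (has_callback, need_ack, assign_seq, notify_write_complete) are assumptions, not part of the system's interface.

    # Illustrative sketch of the callback/ack bookkeeping in rules ack1-ack9.
    class Server:
        def __init__(self):
            self.has_callback = {}   # obj -> set of clients holding callbacks
            self.need_ack = {}       # (obj, writer, stamp) -> clients still to ack

        def on_inval_arrives(self, obj, writer, stamp):
            # Every client with a callback, except the writer itself, must ack (ack1/ack2).
            pending = {c for c in self.has_callback.get(obj, set()) if c != writer}
            self.need_ack[(obj, writer, stamp)] = pending
            if not pending:
                self.finish(obj, writer, stamp)

        def on_ack_server(self, client, obj, stamp):
            # Acks are cumulative: an ack at `stamp` also covers earlier stamps (ack3/ack4).
            for (o, w, s), pending in list(self.need_ack.items()):
                if o == obj and s <= stamp:
                    pending.discard(client)
                    self.has_callback.get(obj, set()).discard(client)   # ack5
                    if not pending:
                        self.finish(o, w, s)

        def finish(self, obj, writer, stamp):
            # No acks outstanding: sequence the write and unblock the writer (ack7-ack9).
            self.need_ack.pop((obj, writer, stamp), None)
            self.assign_seq(obj, writer, stamp)
            self.notify_write_complete(writer, obj)

        def assign_seq(self, obj, writer, stamp): ...
        def notify_write_complete(self, writer, obj): ...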
A.1.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete and isSequenced (at Client)
WriteAfterBlock     B Action(writeComplete, objId) (at Client)
ApplyUpdateBlock    isValid (at Client & Server)
A.2 Full Client Server (P-FCS)
The implementation of P-FCS requires 43 routing rules and 6 blocking
conditions.
A.2.1 Routing Policy
/***********************************************************************/
// Initialization: Read server configuration, and
// store serverId in a table
/***********************************************************************/
in1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.serverCFG".
in2 TBL server(@X, S) :-
RCV serverConfig(@X, S).
/***********************************************************************/
// Client establishes subscriptions to
// server for transferring updates.
/***********************************************************************/
csSb1 ACT addInvalSub(@X, X, S, SS, CTP) :-
RCV serverConfig(@X, S), SS="/*", CTP=="LOG", X != S.
csSb2 ACT addBodySub(@X, X, S, SS) :-
RCV serverConfig(@X, S), SS="/*", X != S.
csSb3 ACT addInvalSub(@X, X, S, SS, CTP) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="inval", CTP=="LOG".
csSb4 ACT addBodySub(@X, X, S, SS) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="body".
/***********************************************************************/
// On read miss, client stores info in a table and informs server.
/***********************************************************************/
rm1 TBL readMiss(@C, Obj, Off, Len) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C 6=S.
rm2 clientRead(@S, C, Obj, Off, Len) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C 6=S.
/***********************************************************************/
// If peer fails to send body, ask server to send body.
/***********************************************************************/
cc1 ACT sendBody(@S, S, C, Obj, Off, Len) :-
TRIG sendBodyFailed(@C, , C, Obj, Off, Len, , ).
/***********************************************************************/
// On client read miss, server establishes callback and finds
// another client with callbacks to send the body.
/***********************************************************************/
cb1 ACT addInvalSub(@S, S, C, Obj, CTP) :-
clientRead(@S, C, Obj, Off, Len), CTP = "CP".
cb2 TBL hasCallback(@S, Obj, C, T) :-
clientRead(@S, C, Obj, Off, Len), T=f_now().
cc2 anyPeerAvailable(@S, C, Obj, Off, Len, a Count<*>) :-
clientRead(@S, C, Obj, Off, Len),
TBL hasCallback(@S, Obj, C2, ).
cc3 selectedPeer(@S, C, Obj, Off, Len, a RANDOM<C2>) :-
anyPeerAvailable(@S, C, Obj, Off, Len, Count), Count > 0,
TBL hasCallback(@S, Obj, C2, ).
cc4 ACT sendBody(@SP, SP, C, Obj, Off, Len, , ) :-
selectedPeer(@S, C, Obj, Off, Len, SP).
cc5 ACT sendBody(@S, S, C, Obj, Off, Len) :-
anyPeerAvailable (@S, C, Obj, Off, Len, Count), Count==0.
/***********************************************************************/
// When client receives an invalidation,
// checks if it satisfies a read miss. If so,
// removes it from the table. If not, ACK server.
/***********************************************************************/
rinv0 satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, a COUNT<*>) :-
TRIG invalArrives(@C, S, Obj, Off, Len, Writer, Stamp),
TBL serverId(@C, S), S != C, TBL readMiss(@C, S, Obj, Off, Len).
rinv1 ackServer(@S, C, Obj, Off, Len, Writer, Stamp) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count==0.
rinv2 delete TBL readMiss(@C, S, Obj, Off, Len) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count > 1.
/***********************************************************************/
// When server receives an invalidation,
// gather acks from others who have callbacks.
// Note:
// - Acks are accumulative so one received ack also
// satisfies all previous required acks.
// - needAck table keeps track of which acks are needed.
/***********************************************************************/
ack1 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Writer, Stamp),
TBL hasCallback(@S, Obj, C2, ), C2 != Writer,
Need = 1, TBL serverId(@S, S).
ack2 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Stamp, Writer),
C2 == Writer, Need = 0, TBL serverId(@S, S).
ack3 TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , , RecvStamp),
TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, ),
NeedStamp < RecvStamp, Need= 0, TBL serverId(@S, S).
ack4 TBL needAck(@S, Obj, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , RecvWriter, RecvStamp),
TBL needAck(@S, Obj, C, NeedWriter, NeedStamp, ),
NeedStamp == RecvStamp, NeedWriter <= RecvWriter,
Need= 0, TBL serverId(@S, S).
ack5 delete TBL hasCallback(@S, Obj, C, ) :-
TBL needAck(@S, Obj, , , C, , , Need),
Need = 0, TBL serverId(@S, S).
ack6 acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, a COUNT<*>) :-
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedTrig),
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedCount),
NeedTrig == 0, NeedCount == 1.
/***********************************************************************/
// If no more acks are needed, assign Sequence number,
// inform client, and clear out ack table
/***********************************************************************/
ack7 ACT assignSeq(@S, Obj, Off, Len, Stamp, Writer) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack8 BACT writeComplete(@Writer, Obj) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack9 delete TBL needACK(@S, Obj, C, WriterId, RecvStamp, Need) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
/***********************************************************************/
// Lease Support:
// Server writes to lease object every (lease interval/2)
// Client subscribes for lease object.
/***********************************************************************/
cl1 ACT writeEvent(@S, Obj, Value) :-
periodic(@S, I, -1), I=leaseTime/2,
Obj="/volumelease", TBL serverId(@S, S).
cl2 ACT addInvalSub(@X, X, S, SS, CTP) :-
RCV serverConfig(@X, S), SS="/volumelease", CTP=="CP", X 6=S.
cl3 ACT addBodySub(@X, X, S, SS) :-
RCV serverConfig(@X, S), SS="/volumelease", X 6=S.
/***********************************************************************/
// Lease maintenance:
// At the end of every interval, for all expired leases,
// remove inval subscriptions, generate acks if needed,
// and remove from hasCallback table.
/***********************************************************************/
ll01 expiredLease(@S, C, Obj) :-
periodic(@S, LeaseTime, -1), hasCallback(@S, Obj, C, ST),
TimeElapsed=f_now()-ST, TimeElapsed > LeaseTime, TBL serverId(@S, S).
ll02 ACT removeInvalSub(@C, S, C, Obj) :-
expiredLease(@S, C, Obj).
ll03 ackServer(@S, C,Obj, Off, Len, Writer, NeedStamp) :-
TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, Need),
expiredLease(@S, C, Obj), Need= 0.
ll04 delete hasCallback(@S, Obj, C, ) :-
expiredLease(@S, C, Obj).
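Rules cl1-cl3 and ll01-ll04 implement volume leases: the server refreshes a lease object every half lease interval, and at every interval it expires callbacks older than the lease. A rough Python sketch of the expiry pass follows, assuming callbacks are kept in a map from object to {client: establishment time}; all names and the 300-second interval are illustrative.

    # Illustrative sketch of the lease maintenance in rules ll01-ll04.
    import time

    LEASE_INTERVAL = 300.0   # seconds; illustrative value

    def expire_leases(has_callback, need_ack):
        now = time.time()
        for obj, holders in list(has_callback.items()):
            for client, established in list(holders.items()):
                if now - established > LEASE_INTERVAL:
                    remove_inval_sub(client, obj)        # ll02: drop the callback subscription
                    for pending in need_ack.values():    # ll03: stop waiting for this client
                        pending.discard(client)
                    del holders[client]                  # ll04: forget the callback

    def remove_inval_sub(client, obj): ...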
/***********************************************************************/
// Support for partial file writes:
// - for every local write, add to currentValid Table.
// - When an inval arrives, and if it satisfies a read miss,
// add to currentValid table,
// - If the inval does not correspond to read miss,
// update table accordingly and remove callback only when
// all block ranges for the object are invalid
/***********************************************************************/
pfw1 TBL currentValid(@C, Obj, Off, Len, State) :-
TRIG write(@C, Obj, Off, , ), State="Valid",
TBL serverId(@C, S), S != C.
pfw2 TBL currentValid(@C, Obj, Off, Len, State) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count),
Count==1, State="Valid".
pfw3 TBL currentValid(@C, Obj, Off, Len, NewState) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count),
TBL currentValid(@C, Obj, Off, Len, ), Count==0, NewState="Invalid".
pfw4 isAnythingElseValid(@C, Obj, a COUNT<*>) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count),
TBL currentValid(@C, Obj, Off2, Len2, State),
Count==0, State=="Valid", Off != Off2, Len != Len2.
pfw5 ACT removeInvalSub(@C, S, C, Obj) :-
isAnythingElseValid(@C, Obj, Count), Count==0.
pfw6 delete TBL currentValid(@C, Obj, , , ) :-
isAnythingElseValid(@C, Obj, Count), Count==0.
/***********************************************************************/
// If server receives a blind write,
// establishes a callback if there isn’t one.
/***********************************************************************/
bw1 blindWriteWithCallback(@S, C, Obj, a COUNT<*>) :-
TRIG invalArrives(@S, C, Obj, , , Writer, ),
TBL hasCallback(@S, Obj, C, ),
C == Writer, TBL serverId(@S, S).
bw2 TBL hasCallback(@S, Obj, C, T) :-
blindWriteWithCallback(@S, C, Obj, Count),
Count==0, T=f_now().
bw3 ACT addInvalSub(@S, S, C, Obj, CTP) :-
blindWriteWithCallback(@S, C, Obj, Count), Count==0.
A.2.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete and isSequenced and
                    maxStaleness(Server, 1, leaseInterval) (at Client)
WriteAfterBlock     B Action(writeComplete, objId) (at Client)
ApplyUpdateBlock    isValid (at Client & Server)
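The ReadNowBlock predicate above augments the usual validity checks with maxStaleness(Server, 1, leaseInterval), i.e. a local read may proceed only if the client has heard from the server within the lease interval. A small illustrative sketch of how such a predicate might be evaluated; the field and parameter names are assumptions.

    # Illustrative evaluation of the lease-augmented ReadNowBlock predicate.
    def read_now_ok(is_valid, is_complete, is_sequenced,
                    last_server_contact, now, lease_interval):
        return (is_valid and is_complete and is_sequenced
                and (now - last_server_contact) <= lease_interval)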
A.3 P-Coda
The implementation of P-Coda requires 37 routing rules and 7 blocking
conditions.
A.3.1 Routing Policy
/***********************************************************************/
// Initialization:
// Read server configuration and hoard list.
// Initialize server state to notConnected.
/***********************************************************************/
in1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.serverCFG".
in2 TBL server(@X, S) :-
RCV serverConfig(@X, S).
ss1 TBL serverState(@X, State) :-
initialize(@X), State="notConnected".
/***********************************************************************/
// Client establishes subscriptions to
// server for transferring updates.
/***********************************************************************/
csSb1 ACT addInvalSub(@X, X, S, SS, CTP) :-
RCV serverConfig(@X, S), SS="/*", CTP=="LOG", X != S.
csSb2 ACT addBodySub(@X, X, S, SS) :-
RCV serverConfig(@X, S), SS="/*", X != S.
/***********************************************************************/
// If subscriptions are successful, update server state
// otherwise try again
/***********************************************************************/
ss2 connectedToServer(@X) :-
TRIG subStart(@X, X, S, , ).
ss3 TBL serverState(@X, State) :-
TRIG connectedToServer(@X), State="Connected".
csSb3 ACT addInvalSub(@X , X, S, SS, CTP) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="inval", CTP=="LOG".
csSb4 ACT addBodySub(@X, X, S, SS) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="body".
ss4 TBL serverState(@X, State) :-
TRIG subEnd(@X, X, S, SS, , Type), Type=="inval", State="notConnected".
/***********************************************************************/
// If a server is detected, establish a subscription for "empty"
// to get information about missing updates.
/***********************************************************************/
csSb6 ACT addInvalSub(@X, S, X, SS, CTP) :-
connectedToServer(@X), TBL serverId(@X, S), SS="EMPTY", CTP=="LOG".
/***********************************************************************/
// If not connected to server, simply inform blocking policy so that
// local writes and local read misses do not block.
/***********************************************************************/
li1 BACT writeComplete(@C, Obj) :-
TRIG write(@C, Obj, , , , ), TBL serverId(@C, S), C != S,
TBL serverState(@C, St), St=="notConnected".
li2 BACT writeComplete(@C, Obj) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C != S,
TBL serverState(@C, St), St=="notConnected".
/***********************************************************************/
// On read miss, client stores it in a table and informs server.
/***********************************************************************/
rm1 readMiss(@C, Obj, Off, Len) :-
TRIG operationBlock(@C, Obj, Off, Len, Bpoint, ),
Bpoint == "readAt", TBL serverId(@C, S), C 6=S.
TBL serverState(@C, State), State=="Connected".
rm2 TBL readMiss(@C, Obj, Off, Len) :-
readMiss(@C, Obj, Off, Len), TBL serverState(@C, State), State=="Connected".
rm3 clientRead(@S, C, Obj, Off, Len) :-
readMiss(@C, Obj, Off, Len), TBL serverState(@C, State), State=="Connected".
/***********************************************************************/
// On client read miss, server sends body and establishes callback
/***********************************************************************/
cb1 ACT sendBody(@S, S, C, Obj, Off, Len) :-
clientRead(@S, C, Obj, Off, Len).
cb2 ACT addInvalSub(@S, S, C, Obj, CTP) :-
clientRead(@S, C, Obj, Off, Len), CTP := "CP".
cb3 TBL hasCallback(@S, Obj, C) :-
clientRead(@S, C, Obj, Off, Len),
/***********************************************************************/
// When client receives an invalidation, if it does
// not satisfy a read miss, ack server and cancel callback.
// Otherwise, remove it from the table
/***********************************************************************/
rinv0 satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, a COUNT<*>) :-
TRIG invalArrives(@C, S, Obj, Off, Len, Writer, Stamp),
TBL serverId(@C, S), S != C,
TBL readMiss(@C, S, Obj, Off, Len).
rinv1 ackServer(@S, C, Obj, Off, Len, Writer, Stamp) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count==0.
rinv2 ACT removeInvalSub(@C, S, C, Obj) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count==0.
rinv3 delete TBL readMiss(@C, S, Obj, Off, Len) :-
satsifiesReadMiss(@C, S, Obj, Off, Len, Writer, Stamp, Count), Count > 1.
/***********************************************************************/
// When server receives an invalidation,
// gather acks from others who have callbacks.
// Note:
// - Acks are accumulative so one received ack also
// satisfies all previous required acks.
// - needAck table keeps track of which acks are needed.
/***********************************************************************/
ack1 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Writer, Stamp),
TBL hasCallback(@S, Obj, C2), C2 != Writer,
Need := 1, TBL serverId(@S, S).
ack2 TBL needAck(@S, Obj, Off, Len, C2, Writer, Stamp, Need) :-
TRIG invalArrives(@S, C, Obj, Off, Len, Stamp, Writer),
C2 == Writer, Need := 0, TBL serverId(@S, S).
ack3 TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , , RecvStamp),
TBL needAck(@S, Obj, Off, Len, C, Writer, NeedStamp, ),
NeedStamp < RecvStamp, Need:= 0, TBL serverId(@S, S).
ack4 TBL needAck(@S, Obj, C, Writer, NeedStamp, Need) :-
ackServer(@S, C, , , , RecvWriter, RecvStamp),
TBL needAck(@S, Obj, C, NeedWriter, NeedStamp, ),
NeedStamp == RecvStamp, NeedWriter <= RecvWriter,
Need:= 0, TBL serverId(@S, S).
ack5 delete TBL hasCallback(@S, Obj, C) :-
TBL needAck(@S, Obj, , , C, , , Need),
Need = 0, TBL serverId(@S, S).
ack6 acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, a COUNT<*>) :-
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedTrig),
TBL needAck(@S, Obj, Off, Len, C, Writer, RecvStamp, NeedCount),
NeedTrig == 0, NeedCount == 1.
/***********************************************************************/
// If no more acks are needed, assign Sequence number,
// inform client, and clear out ack table
/***********************************************************************/
ack7 ACT assignSeq(@S,Obj, Off, Len, Stamp, Writer) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack8 BACT writeComplete(@Writer, Obj) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
ack9 delete TBL needACK(@S, Obj, C, WriterId, RecvStamp, Need) :-
acksNeeded(@S, Obj, Off, Len, Writer, RecvStamp, Count), Count == 0.
/***********************************************************************/
// Hoard files when connected to the server;
// implement it like a read miss so the server can establish callbacks
/***********************************************************************/
hd1 ACT readEvent(@C, ObjId) :-
connectedToServer(@C), ObjId = "/.hoardList".
hd2 readMiss(@C, Obj, Off, Len) :-
RCV hoardItems(@C, Obj), Off=0, Len=-1.
/***********************************************************************/
// If server receives a blind write,
// it establishes a callback if there isn’t one.
/***********************************************************************/
bw1 blindWriteWithCallback(@S, C, Obj, a COUNT<*>) :-
TRIG invalArrives(@S, C, Obj, , , Writer, ),
TBL hasCallback(@S, Obj, C),
C == Writer, TBL serverId(@S, S).
bw2 TBL hasCallback(@S, Obj, C) :-
blindWriteWithCallback(@S, C, Obj, Count), Count==0.
bw3 ACT addInvalSub(@S, S, C, Obj, CTP) :-
blindWriteWithCallback(@S, C, Obj, Count), Count==0.
A.3.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete and (isSequenced or B ACT(isDisconnected))
WriteAfterBlock     B Action(writeComplete, objId) or B ACT(isDisconnected)
ApplyUpdateBlock    isValid (at Client & Server)
A.3.3 Co-operative Caching
The following 8 rules were added to support co-operative caching.
/***********************************************************************/
// Liveness monitoring:
// - At initialization, read from neighbor config.
// - Periodically ping a neighbor.
// - Update neighbor table when ping event is received
// - Periodically, check which neighbors are dead
/***********************************************************************/
nin1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId="/.neighborList".
nin2 TBL neighbor(@X, N, T, Live) :-
RCV neighborConfig(@X, N), T=f_now(), Live=0.
png refresh(@N, X) :-
periodic(@X, 5, -1), TBL neighbor(@X, N, , ).
rPng TBL neighbor(@X, N, T, Live) :-
refresh(@X, N), T=f_now(), Live=1.
chk TBL neighbor(@X, N, T, newLive) :-
periodic(@X, 20, -1), TBL neighbor(@X, N, T, Live),
Live==1, TimePassed=f_now()-T, TimePassed>20, newLive=0.
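The nin/png/rPng/chk rules implement the ping-based liveness monitoring that recurs throughout this appendix. A compact Python sketch of the same behaviour follows; the 5- and 20-second constants come from the rules, everything else is illustrative.

    # Illustrative sketch of the liveness monitoring rules.
    import time

    PING_PERIOD = 5      # seconds between refresh pings (rule png)
    DEAD_AFTER = 20      # seconds of silence before a neighbor is marked dead (rule chk)

    neighbors = {}       # node -> (last_heard, live)

    def on_neighbor_config(node):
        neighbors[node] = (time.time(), False)

    def on_refresh(node):
        # A ping arrived from node: record the time and mark it live.
        neighbors[node] = (time.time(), True)

    def check_dead():
        now = time.time()
        for node, (last_heard, live) in list(neighbors.items()):
            if live and now - last_heard > DEAD_AFTER:
                neighbors[node] = (last_heard, False)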
/***********************************************************************/
// Co-operative Caching:
// On a read miss, ask all reachable peers for body.
// if server is not available, contact reachable peer
// and establish an inval subscription to get metadata.
// Note: we remove the inval subs when they are caught up
// (i.e., we know about the status of the peer)
/***********************************************************************/
cc1 ACT sendBody(@N, N, X, Obj, Off, Len) :-
readMiss(@C, Obj, Off, Len), TBL neighbor(@X, N, , Live), Live==1.
cc2 ACT addInvalSub(@X, N, X, Obj, CTP) :-
readMiss(@C, Obj, Off, Len), TBL serverState(@C, State),
TBL neighbor(@X, N, , Live), Live==1, State=="notConnected", CTP="CP".
cc3 ACT removeInvalSub(@C, N, C, Obj) :-
TRIG subCaughtup(@C, N, C, Obj), TBL serverId(@X, S), S != N.
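Rules cc1-cc3 handle a read miss co-operatively: the body is requested from every live neighbor, and while the server is unreachable an invalidation subscription is pulled from a live peer and dropped once it has caught up. A rough Python sketch follows; the helper names are illustrative stand-ins for the routing actions.

    # Illustrative sketch of the co-operative caching rules cc1-cc3.
    def on_read_miss(obj, off, length, neighbors, server_connected):
        for peer, live in neighbors.items():
            if not live:
                continue
            request_body(peer, obj, off, length)              # cc1: ask peer for the body
            if not server_connected:
                add_inval_sub(peer, subset=obj, catchup="CP")  # cc2: pull metadata

    def on_sub_caught_up(peer, obj, server):
        if peer != server:
            remove_inval_sub(peer, obj)                        # cc3: drop the peer subscription

    def request_body(peer, obj, off, length): ...
    def add_inval_sub(peer, subset, catchup): ...
    def remove_inval_sub(peer, obj): ...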
A.4 P-TRIP
The implementation of P-TRIP requires 6 routing rules and 4 blocking
conditions.
A.4.1 Routing Policy
in1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.serverCFG".
in2 TBL server(@X, S) :-
RCV serverConfig(@X, S).
/***********************************************************************/
// Client establishes subscriptions to get updates.
/***********************************************************************/
csSb1 ACT addInvalSub(@X, S, X, SS, CTP) :-
RCV serverConfig(@X, S), SS="/*", CTP=="LOG", X != S.
csSb2 ACT addBodySub(@X, S, X, SS) :-
RCV serverConfig(@X, S), SS="/*", X != S.
/***********************************************************************/
// Re-try in cases of failure
/***********************************************************************/
csSb3 ACT addInvalSub(@X, S, X, SS, CTP) :-
TRIG subEnd(@X, S, X, SS, , Type), Type=="inval", CTP=="LOG".
csSb4 ACT addBodySub(@X, S, X, SS) :-
TRIG subEnd(@X, S, X, SS, , Type), Type=="body".
A.4.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete
ApplyUpdateBlock    isValid or maxStaleness(Server, 1, threshold)
A.4.3 Hierarchical topology
We can easily change the topology from star to a static tree as follows.
in1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.parentCFG".
in2 TBL parent(@X, P) :-
RCV parentConfig(@X, P).
/***********************************************************************/
// Client establishes subscriptions to get updates.
/***********************************************************************/
csSb1 ACT addInvalSub(@X, P, X, SS, CTP) :-
RCV parentConfig(@X, P), SS="/*", CTP=="LOG", X 6=P.
csSb2 ACT addBodySub(@X, P, X, SS) :-
RCV parentConfig(@X, P), SS="/*", X 6=P.
/***********************************************************************/
// Re-try in cases of failure
/***********************************************************************/
csSb3 ACT addInvalSub(@X, P, X, SS, CTP) :-
TRIG subEnd(@X, P, X, SS, , Type), Type=="inval", CTP=="LOG".
csSb4 ACT addBodySub(@X, P, X, SS) :-
TRIG subEnd(@X, P, X, SS, , Type), Type=="body".
A.5 P-Bayou
The implementation of P-Bayou requires 10 routing rules and 3 blocking
conditions.
A.5.1 Routing Policy
/***********************************************************************/
// Periodically, pick a peer and set up subscriptions
// to get updates. Once caught up, remove subscriptions.
/***********************************************************************/
ae1 peerPicked(@X, a random<N>) :-
periodic(@X, Interval, -1), TBL neighbor(@X, N, , Live), Live==1.
ae2 ACT addInvalSub(@X, N, X, SS, CTP) :-
peerPicked(@X, N), SS="/*", CTP="LOG".
ae3 ACT addBodySub(@X, N, X, SS) :-
peerPicked(@X, N), SS="/*".
ae4 ACT removeInvalSub(@X, N, X, SS) :-
TRIG subCaughtup(@X,N,X,SS).
ae5 ACT removeBodySub(@X, N, X, SS) :-
TRIG subCaughtup(@X,N,X,SS).
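Rules ae1-ae5 implement Bayou-style anti-entropy: each round picks a random live neighbor, subscribes for its invalidations and bodies, and removes the subscriptions once they have caught up. A minimal Python sketch of one round; all names are illustrative.

    # Illustrative sketch of the anti-entropy round in rules ae1-ae5.
    import random

    def anti_entropy_round(neighbors):
        live = [n for n, is_live in neighbors.items() if is_live]
        if not live:
            return
        peer = random.choice(live)
        add_inval_sub(peer, subset="/*", catchup="LOG")
        add_body_sub(peer, subset="/*")

    def on_sub_caught_up(peer, subset):
        # Once the subscriptions have transferred everything the peer had, drop them.
        remove_inval_sub(peer, subset)
        remove_body_sub(peer, subset)

    def add_inval_sub(peer, subset, catchup): ...
    def add_body_sub(peer, subset): ...
    def remove_inval_sub(peer, subset): ...
    def remove_body_sub(peer, subset): ...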
/***********************************************************************/
// Liveness monitoring:
// - At initialization, read from neighbor config.
// - Periodically ping a neighbor.
// - Update neighbor table when ping event received.
// - Periodically, check which neighbors are dead.
/***********************************************************************/
nin1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId="/.neighborList".
nin2 TBL neighbor(@X, N, T, Live) :-
RCV neighborConfig(@X, N), T=f_now(), Live=0.
png refresh(@N, X) :-
periodic(@X, 5, -1), TBL neighbor(@X, N, , ).
rPng TBL neighbor(@X, N, T, Live) :-
refresh(@X, N), T=f_now(), Live=1.
chk TBL neighbor(@X, N, T, newLive) :-
periodic(@X, 20, -1), TBL neighbor(@X, N, T, Live),
Live==1, TimePassed=f_now()-T, TimePassed>20, newLive=0.
A.5.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete
ApplyUpdateBlock    isValid
A.6 P-ChainReplication
The implementation of P-ChainReplication requires 45 routing rules
and 4 blocking conditions.
A.6.1 Routing Policy
/***********************************************************************/
// Initialization: read master info
// If I am master, read in volume to node mappings
/***********************************************************************/
ini1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId = "/.masterCFG".
ini2 TBL master(@X, M) :-
RCV masterConfig(@X, M).
ini3 ACT readEvent(@M, ObjId) :-
RCV masterConfig(@X, M), ObjId="/.volList".
ini4 TBL volList(@M, Vol, N) :-
RCV volListConfig(@X, Vol, N).
/***********************************************************************/
// Keep head, tail, my successor, my predecessor information
// provided by the master in tables
/***********************************************************************/
s1 TBL head(@X, Vol, H) :-
currHead(@X, Vol, H).
s2 TBL tail(@X, Vol, H) :-
currTail(@X, Vol, H).
s3 TBL predecessor(@X, Vol, P) :-
currPredecessor(@X, Vol, P).
/***********************************************************************/
// Establish subscriptions to predecessors to receive updates
/***********************************************************************/
sub1 ACT addInvalSub(@X, P, X , Vol, CTP) :-
predecessor(@X, Vol, P), CTP=="LOG".
sub2 ACT addBodySub(@X, P, X , Vol) :-
predecessor(@X, Vol, P).
/***********************************************************************/
// Inform master when the subscription is caught up.
// i.e. if X is a new node, it can be used as a tail now
/***********************************************************************/
sub4 connEstablished(@M, X, Vol) :-
TRIG subCaughtup(@X,P, X, Vol), TBL master(@X, M).
/***********************************************************************/
// For consistency, every node keeps track of the latest
// received write.
/***********************************************************************/
con1 rcvdWrite(@X, Vol, Writer, Stamp) :-
TRIG invalArrives(@X, S, Obj, , , Writer, Stamp), Vol=f_getVol(Obj).
con2 TBL LatestRcvdWrite(@X, Vol, NWriter, NStamp) :-
rcvdWrite(@X, Vol, NWriter, NStamp),
TBL LatestRcvdWrite(@X, Vol, Writer, Stamp), Stamp < NStamp.
con3 TBL LatestRcvdWrite(@X, Vol, NWriter, NStamp) :-
rcvdWrite(@X, Vol, NWriter, NStamp),
TBL LatestRcvdWrite(@X, Vol, Writer, Stamp), Stamp == NStamp, Writer < NWriter.
/***********************************************************************/
// If I am the tail or I become the new tail, send an ACK to the head
/***********************************************************************/
con4 BACT ackFromTail(@H, Vol, Writer, Stamp) :-
rcvdWrite(@X, Vol, Writer, Stamp),
TBL head(@X, Vol, H), TBL tail(@X, Vol, T), X==T.
con5 BACT ackFromTail(@H, Vol, Writer, Stamp) :-
currTail(@X, Vol, T), X==T,
TBL LatestRcvdWrite(@X, Vol, Writer, Stamp).
/***********************************************************************/
// If there is a new head and I am the tail, send an ACK to the head
/***********************************************************************/
con6 BACT ackFromTail(@H, Vol, Writer, Stamp) :-
currHead(@X, Vol, H), TBL tail(@X, Vol, T), X==T,
TBL LatestRcvdWrite(@X, Vol, Writer, Stamp).
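Rules con1-con6 make the current tail acknowledge the latest write it has received to the head, including after a reconfiguration that promotes a new tail; the writer's WriteAfterBlock waits on this ackFromTail. A rough Python sketch of that bookkeeping; the names are illustrative.

    # Illustrative sketch of the tail-acknowledgement rules con1-con6.
    latest_write = {}   # vol -> (writer, stamp) of the newest write received

    def on_inval_arrives(vol, writer, stamp, i_am_tail, head):
        prev = latest_write.get(vol)
        if prev is None or prev[1] < stamp:
            latest_write[vol] = (writer, stamp)
        if i_am_tail:
            # The tail acks the write to the head, which unblocks the writer.
            ack_from_tail(head, vol, writer, stamp)

    def on_become_tail(vol, head):
        # After a reconfiguration the new tail re-acks its latest write so the head
        # is not left waiting for an ack the old tail never sent (con5/con6).
        if vol in latest_write:
            writer, stamp = latest_write[vol]
            ack_from_tail(head, vol, writer, stamp)

    def ack_from_tail(head, vol, writer, stamp): ...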
/***********************************************************************/
// When master detects a new node, it looks up
// what volume chain to add it to.
/***********************************************************************/
nn1 newVolMember(@M, Vol, N) :-
newNeighbor(@M, N), TBL volList(@M, Vol, N).
/***********************************************************************/
// If volume chain has not been set up, set the
// new node to head and tail and let it know.
/***********************************************************************/
nn2 currChainCount(@M, Vol, N, a COUNT<*>) :-
newVolMember(@M, Vol, N), TBL chain(@M, Vol, , ).
nn3 newHead(@N,Vol, N) :-
currChainCount(@M, Vol, N, C), C==0.
nn4 newTail(@N,Vol, N) :-
currChainCount(@M, Vol, N, C), C==0.
nn5 TBL chain(@M, Vol, N) :-
currChainCount(@M, Vol, N, C), C==0.
/***********************************************************************/
// When you have a new head or new tail,
// update tables and inform others.
/***********************************************************************/
nn6 currHead(@N, Vol, H) :-
newHead(@M, Vol, H), TBL chain(@M, Vol, N, ).
nn7 TBL head(@M, Vol, H) :-
newHead(@M, Vol, H).
nn8 currTail(@N, Vol, T) :-
newTail(@M, Vol, T), TBL chain(@M, Vol, N, ).
nn9 TBL tail(@M, Vol, T) :-
newTail(@M, Vol, T).
/***********************************************************************/
// If volume chain exists, send new node head and
// predecessor info and add it to the end of the chain
/***********************************************************************/
nn10 currHead(@N, Vol, H) :-
currChainCount(@M, Vol, N, C), C != 0, TBL head(@M, Vol, H).
nn11 currMaxChainPos(@M, Vol, N, a MAX<Pos>) :-
currChainCount(@M, Vol, N, C), C != 0, TBL chain(@M, Vol, , Pos).
nn12 currPredecessor(@N,Vol, Pred) :-
currMaxChainPos(@M, Vol, N, PredPos), TBL chain(@M, Vol, Pred, PredPos).
nn13 TBL chain(@M, Vol, N, Pos) :-
currMaxChainPos(@M, Vol, N, PredPos), Pos = PredPos + 1.
/***********************************************************************/
// When a node has caught up, if it is at the
// last position in the volume chain, set it as tail.
/***********************************************************************/
nn14 currTailPos(@M, Vol, N, NPos, a MAX<Pos>) :-
connEstablished(@M, Vol, N), TBL chain(@M, Vol, N, NPos),
TBL chain(@M, Vol, , Pos).
nn15 newTail(@M, Vol, T) :-
currTailPos(@M, Vol, N, NPos, MaxPos), NPos==MaxPos.
/***********************************************************************/
// Dead node: delete it from chain
/***********************************************************************/
dn1 delete TBL chain(@M, Vol, N, NPos) :-
deadNeighbor(@M, N), TBL chain(@M, Vol, N, NPos).
/***********************************************************************/
// find predecessor if dead node is not head
/***********************************************************************/
dn2 findPredecessor(@M, N, Vol, NPos, a MAX<PPos>) :-
deadNeighbor(@M, N), TBL chain(@M, Vol, N, NPos),
TBL chain(@M, Vol, , PPos), PPos < NPos,
TBL head(@M, Vol, H), H != N.
/***********************************************************************/
// If dead node is the current tail, set
// the predecessor as new tail.
/***********************************************************************/
dn3 newTail(@M, Vol, Pred) :-
findPredecessor(@M, N, Vol,NPos, PPos), TBL tail(@M, Vol, T),
T==N, TBL chain(@M, Vol, Pred, PPos).
/***********************************************************************/
// Otherwise, find successor and inform it of its
// new predecessor.
/***********************************************************************/
dn4 findSuccessor(@M, N, Vol, NPos, PPos, a MIN<SPos>) :-
findPredecessor(@M, N, Vol, NPos, PPos), TBL tail(@M, Vol, T),
T != N, TBL chain(@M, Vol, , SPos), SPos > NPos.
dn5 currPredecessor(@Succ, Vol, Pred) :-
findSuccessor(@M, N, Vol, NPos, PPos, SPos), TBL chain(@M, Vol, Succ, SPos),
TBL chain(@M, Vol, Pred, PPos).
/***********************************************************************/
// if dead node is current head,
// find successor and set it to new head.
/***********************************************************************/
dn6 findNewHead(@M, N, Vol, NPos, a MIN<SPos>) :-
deadNeighbor(@M, N), TBL chain(@M, Vol, N, NPos),
TBL chain(@M, Vol, , SPos), SPos > NPos,
TBL head(@M, Vol, H), H==N.
dn7 newHead(@M, Vol, H) :-
findNewHead(@M, N, Vol, NPos, HPos), TBL chain(@M, Vol, H, HPos).
/***********************************************************************/
// Liveness monitoring:
// - At initialization, read from neighbor config.
// - Periodically ping a neighbor.
// - when a ping event is received,
// check if it is a new neighbor and
// update the neighbor table
// - Periodically, check which neighbors are dead
/***********************************************************************/
nin1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId="/.neighborList".
nin2 TBL neighbor(@X, N, T, Live) :-
RCV neighborConfig(@X, N), T=f_now(), Live=0.
png refresh(@N, X) :-
periodic(@X, 5, -1), TBL neighbor(@X, N, , ).
rPng1 newNeighbor(@X, N) :-
refresh(@X, N), TBL neighbor(@X, N, , Live), Live==0.
rPng2 TBL neighbor(@X, N, T, Live) :-
refresh(@X, N), T=f_now(), Live=1.
chk1 deadNeighbor(@X, N) :-
periodic(@X, 20, -1), TBL neighbor(@X, N, T, Live),
Live==1, TimePassed=f_now()-T, TimePassed>20.
chk2 TBL neighbor(@X, N, T, Live) :-
deadNeighbor(@X, N), TBL neighbor(@X, N, T, ), Live=0.
A.6.2 Blocking Policy
Blocking Point      Predicate
ReadNowBlock        isValid and isComplete
WriteAfterBlock     B Action(ackFromTail, writer, Stamp)
ApplyUpdateBlock    isValid
A.7 P-TierStore
The implementation of P-TierStore requires 14 routing rules and 1
blocking condition.
A.7.1 Routing Policy
/***********************************************************************/
// Initialization: Read parent id.
/***********************************************************************/
in0 TRIG readEvent(@X, ObjId) :-
EVT initialize(@X), ObjId := "/.parent".
/***********************************************************************/
// When node X receives its own parent id, store it in
// a table and read subscription list.
/***********************************************************************/
pp0 TBL parent(@X, P) :-
RCV parent(@X, P).
pp1 TRIG readAndWatchEvent(@X, ObjId) :-
RCV initialize(@X), ObjId := "/.subList".
/***********************************************************************/
// When node X receives a subscription event for
// one of its subscriptions, store it in a
// subscription table and establish an inval
// and body subscription from the parent.
/***********************************************************************/
pSb0 TBL subscription(@X, SS) :-
RCV subscription(@X, SS).
pSb1 ACT addInvalSub(@X, P, X, SS, CTP) :-
RCV subscription(@X, SS), TBL parent(@X, P),
CTP=="LOG".
pSb2 ACT addBodySub(@X, P, X, SS) :-
RCV subscription(@X, SS), TBL parent(@X, P).
/***********************************************************************/
// If parent subscription fails, retry.
/***********************************************************************/
f1 ACT addInvalSub(@X, P, X, SS, CTP) :-
TRIG subEnd(@X, P, X, SS, , Type),
TBL parent(@X, P), Type=="Inval", CTP:="LOG".
f2 ACT addBodySub(@X, P, X, SS) :-
TRIG subEnd(@X, P, X, SS, , Type),
TBL parent(@X, P), Type=="Body".
/***********************************************************************/
// If a child contacts me, establish subscriptions
// for "/*’’ to receive updates.
/***********************************************************************/
cSb1 ACT addInvalSub(@X, C, X, SS, CTP) :-
TRIG subStart(@X, X, C, , Type), C != P,
Type == "Inval", SS := "/*", CTP := "LOG".
cSb2 ACT addBodySub(@X, C, X, SS) :-
TRIG subStart(@X, X, C, , Type), C != P,
Type == "Body", SS := "/*".
/***********************************************************************/
// DTN Support: if a relay node arrives,
// establish subscriptions to receive new updates
// and to send local updates.
/***********************************************************************/
dtn1 ACT addInvalSub(@X, R, X, SS, CTP) :-
EVT relayNodeArrives(@X, R),
TBL subscription(@X, SS), CTP=="LOG".
dtn2 ACT addBodySub(@X, R, X, SS) :-
EVT relayNodeArrives(@X, R),
TBL subscription(@X, SS).
dtn3 ACT addInvalSub(@X, X, R, SS, CTP) :-
EVT relayNodeArrives(@X, R), SS:="/*", CTP=="LOG".
dtn4 ACT addBodySub(@X, X, R, SS) :-
EVT relayNodeArrives(@X, R), SS:="/*".
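Rules dtn1-dtn4 handle a mobile relay node: when one arrives, the node pulls every publication it subscribes to from the relay and pushes its own updates ("/*") to it. A small Python sketch of the same reaction; the helper names are illustrative.

    # Illustrative sketch of the relay handling in rules dtn1-dtn4.
    def on_relay_node_arrives(relay, local_subscriptions):
        # Pull every locally subscribed publication from the relay (dtn1/dtn2) ...
        for subset in local_subscriptions:
            add_inval_sub(src=relay, dst="self", subset=subset, catchup="LOG")
            add_body_sub(src=relay, dst="self", subset=subset)
        # ... and push all local updates to it (dtn3/dtn4).
        add_inval_sub(src="self", dst=relay, subset="/*", catchup="LOG")
        add_body_sub(src="self", dst=relay, subset="/*")

    def add_inval_sub(src, dst, subset, catchup): ...
    def add_body_sub(src, dst, subset): ...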
A.7.2 Blocking Policy
Blocking Point      Predicate
ApplyUpdateBlock    isValid
A.8 P-Pangaea
The implementation of P-Pangaea requires 59 routing rules and 1 block-
ing predicate.
A.8.1 Routing Policy
/***********************************************************************/
// Assumptions:
// - the FileEntries table stores the gold nodes for every file entry for
// a directory as tuples FileEntries(@X, dir, child, GoldNode).
// - Every node has a Peer table to store links to gold and bronze replicas.
// - Every server has a latency table that stores all pairs RTT information.
/***********************************************************************/
/***********************************************************************/
// Initialization: Read in file entries of root directory
/***********************************************************************/
ini01 ACT readAndWatchEvent(@S, ObjMeta) :-
initialize(@S), ObjMeta="/.meta".
/***********************************************************************/
// When you receive new gold node information, add it to file entry table.
/***********************************************************************/
fe01 TBL fileEntry(@S, Dir, Obj, G1) :-
RCV fileEntryConfig(@S, Dir, Obj, G1, G2, G3).
fe02 TBL fileEntry(@S, Dir, Obj, G2) :-
RCV fileEntryConfig(@S, Dir, Obj, G1, G2, G3).
fe03 TBL fileEntry(@S, Dir, Obj, G3) :-
RCV fileEntryConfig(@S, Dir, Obj, G1, G2, G3).
/***********************************************************************/
// Initialization: Read latency information
/***********************************************************************/
ini02 ACT read(@S, Obj) :-
initialize(@S), Obj = "/.latencyConfig".
ini03 TBL latency(@S, X, Y, Lat) :-
RCV latencyConfig(@S, X, Y, Lat).
/***********************************************************************/
// On a read miss, find parentdir of the object
// and add to outstanding read miss table
/***********************************************************************/
rm1 readMiss(@S, Obj) :-
TRIG operationBlock(@S, ObjData, Off, Len, Bpoint, ),
Bpoint == "readAt", Obj=f getObjFromData(ObjData).
rm2 missParentChild(@S, Parent, Child) :-
readMiss(@S, Child), Parent=f_getParent(Child).
rm3 TBL outstandingMiss(@S, Parent, Child) :-
missParentChild(@S, Parent, Child).
/***********************************************************************/
// Check FileEntries table:
// if no parent entry found, issue read miss for parent
// if parent entries exist but no obj entry, new file needs to be created
// if obj entry found, pick Gold Node, establish connections to it
// and ask it to pick peers.
/***********************************************************************/
rm4 findParentEntries(@S, Parent, Child, a count<*>) :-
missParentChild(@S, Parent, Child), TBL fileEntries(@S, Parent, , ).
rm5 readMiss(@S, Parent) :-
findParentEntries(@S, Parent,Child, Count), Count==0.
rm6 findObjEntries(@S, Parent, Child, a count<*>) :-
findParentEntries(@S, Parent,Child, Count), Count>0,
TBL fileEntries(@S, Parent, Child, ).
rm7 createNewObj(@S, Parent, Child) :-
findObjEntries(@S, Parent, Child, Count), Count==0.
rm8 pickGoldNode(@S, Child, a RANDOM<GNode>) :-
findObjEntries(@S, Parent, Child, Count), Count != 0,
TBL fileEntries(@S, Parent, Child, GNode).
rm9 needReplica(@GNode, S, Child) :-
pickGoldNode(@S, Child, GNode).
rm10 establishConnections(@S, GNode, Child) :-
pickGoldNode(@S, Child, GNode).
/***********************************************************************/
// When a gold node recieves a needReplica message,
// it picks 2 peers from peer list for S to contact.
/***********************************************************************/
rm11 pickClosestPeer(@G, S, Obj, a MIN<Lat>) :-
needReplica(@G, S, Obj), TBL peer(@G, Obj, P),
TBL neighbor(@G, P, , Live), Live == 1,
TBL latency(@G, S, P, Lat).
rm12 pickRandomPeer(@G, S, Obj, CP, a RANDOM<RP>) :-
pickClosestPeer(@G, S, Obj, MinLat),
TBL latency(@G, S, CP, MinLat), TBL peer(@G, Obj, RP),
TBL neighbor(@G, RP, , Live), CP != RP, Live==1.
/***********************************************************************/
// Once peers are picked, ask S to establish connections
/***********************************************************************/
rm13 establishConnections(@S, Obj, CP) :-
pickRandomPeer(@G, S, Obj, CP, RP).
rm14 establishConnections(@S, Obj, RP) :-
pickRandomPeer(@G, S, Obj, CP, RP).
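Rules rm11-rm14 have the gold node choose two bronze peers for the requester: the live replica with the smallest RTT to the requester plus one other random live replica. A rough Python sketch of that selection; the data-structure and helper names are assumptions.

    # Illustrative sketch of the peer selection in rules rm11-rm14.
    # peers: obj -> list of replica nodes; latency: (src, dst) -> RTT; live: node -> bool.
    import random

    def pick_peers(requester, obj, peers, latency, live):
        candidates = [p for p in peers.get(obj, []) if live.get(p)]
        if not candidates:
            return []
        closest = min(candidates,
                      key=lambda p: latency.get((requester, p), float("inf")))
        others = [p for p in candidates if p != closest]
        chosen = [closest]
        if others:
            chosen.append(random.choice(others))
        # The requester is then told to establish inval/body subscriptions to
        # each chosen peer (establishConnections in the rules above).
        return chosen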
/***********************************************************************/
// Establish connections by setting up bi-directional
// body and inval connections for both data and meta data
/***********************************************************************/
con1 TRIG addInvalSub(@S, P, S, SS, CTP) :-
establishConnections(@S, Obj, P), SS=Obj+"/*",
CTP="CP".
con2 TRIG addInvalSub(@S, S, P, SS, CTP) :-
establishConnections(@S, Obj, P), SS=Obj+"/*",
CTP="CP".
con3 TRIG addBodySub(@S, P, S, SS) :-
establishConnections(@S, Obj, P), SS=Obj+"/*".
con4 TRIG addBodySub(@S, S, P, SS) :-
establishConnections(@S, Obj, P), SS=Obj+"/*".
/***********************************************************************/
// When an inval connection catches up, add to peer table.
// If there is an outstanding miss for object,
// readAndWatch meta data and remove from outstanding miss table.
/***********************************************************************/
con5 TBL peer(@S, Obj, P) :-
ACT subCaughtup(@S, P, S, SS), Obj=f_getObj(SS).
con6 receivedObj(@S, Obj) :-
ACT subCaughtup(@S, P, S, SS), Obj=f_getObj(SS).
con7 TRIG readAndWatchEvent(@S, Meta) :-
receivedObj(@S, Obj), TBL outstandingMiss(@S, , Obj), Meta=Obj+"/.meta".
con8 delete TBL outstandingMiss(@S, Parent, Obj) :-
receivedObj(@S, Obj), TBL outstandingMiss(@S, Parent, Obj).
/***********************************************************************/
// As we get new meta-data, check if there is an outstanding
// miss for the child. If so, reissue miss.
// Note: we use a retryMiss table to ensure that we issue the miss
// after the new file entries have been added to the table.
/***********************************************************************/
ret1 TBL retryMiss(@S, Dir, Obj) :-
RCV fileEntryConfig(@S, Dir, Obj, G1, G2, G3),
TBL outstandingMiss(@S, Dir, Obj).
ret2 missParentChild(@S, Dir, Obj) :-
TBL retryMiss(@S, Dir, Obj).
/***********************************************************************/
// if we have received all entries from parent of an outstanding
// miss, need to create the object
/***********************************************************************/
ret3 createNewObj(@S, Dir, Child) :-
RCV EndFileEvent(@S, ObjMeta), Dir = f_getObj(ObjMeta),
TBL outstandingMiss(@S, Dir, Child).
/***********************************************************************/
// File Creation:
// find new gold nodes, establish connections.
// Then update directory object.
/***********************************************************************/
fc1 pickFirstPeerGold(@S, Dir, Obj, a RANDOM<P>) :-
createNewObj(@S, Dir, Obj), TBL neighbor(@S, P, , Live), Live==1.
fc2 pickSecondPeerGold(@S, Dir, Obj, G1, a RANDOM<P>) :-
pickFirstPeerGold(@S, Dir, Obj, G1), TBL neighbor(@S, P, , Live),
G1 != P, Live==1.
fc3 establishConnections(@S, Obj, G1) :-
pickSecondPeerGold(@S, Dir, Obj, G1, G2).
fc4 establishConnections(@S, Obj, G2) :-
pickSecondPeerGold(@S, Dir, Obj, G1, G2).
fc5 ACT writeEvent(@S, ObjMeta, EName, Dir, Obj, S, G1, G2) :-
pickSecondPeerGold(@S, Dir, Obj, G1, G2), ObjMeta =Obj+"/.meta",
EName = "fileEntryConfig".
/***********************************************************************/
// Temporary Failures:
// retry connections when they become live
/***********************************************************************/
tf1 establishConnections(@S, Obj, P) :-
liveNeighbor(@S, P), TBL peer(@S, Obj, P).
/***********************************************************************/
// Permanent Failures:
// check how long a node has been dead.
// if deadPeer, ask a gold node to pick a new peer
/***********************************************************************/
pf1 deadNode(@S, Obj, P) :-
periodic(@S, 60*60, -1), TBL neighbor(@S, P, T, Live),
T < f_now() - 60*60, Live==0, TBL peer(@S, Obj, P).
pf2 delete TBL peer(@S, Obj, P) :-
deadNode(@S, Obj, P).
pf3 isNodeGold(@S, Obj, P, a COUNT<*>) :-
deadNode(@S, Obj, P), TBL fileEntry(@S, Obj, G), G==P.
pf4 deadPeer(@S, Obj, P) :-
isNodeGold(@S, Obj, P, C), C==0.
pf5 deadGoldPeer(@S, Obj, P) :-
isNodeGold(@S, Obj, P, C), C>0.
pf6 pickGoldForRecovery(@S, Obj, a random<G>) :-
deadPeer(@S, Obj, P), TBL fileEntry(@S, Obj, G).
pf7 needNewPeer(@G, Obj, S) :-
pickGoldForRecovery(@S, Obj, G).
/***********************************************************************/
// Gold node picks a peer and sends it over
/***********************************************************************/
pf8 needPeer(@S, G, Obj, NP) :-
needNewPeer(@G, Obj, S), TBL peer(@G, Obj, NP),
TBL neighbor(@G, NP, , Live), Live==1.
/***********************************************************************/
// When receive new peer, check if it is already a peer.
// if not, establish connections.
// if so, ask gold node for another peer.
/***********************************************************************/
pf9 checkIfExistingPeer(@S, G, Obj, NP, a COUNT<*>) :-
newPeer(@S, G, Obj,NP), TBL peer(@S, Obj, P), NP==P.
pf10 establishConnections(@S, Obj, NP) :-
checkIfExistingPeer(@S, G, Obj,NP, C), C==0.
pf11 needNewPeer(@G, S, Obj) :-
checkIfExistingPeer(@S,G, Obj,NP, C), C>0.
/***********************************************************************/
// Permanent Failures:
// if dead gold node, pick a random peer to upgrade
// to gold and update object meta data
/***********************************************************************/
pf12 pickNewGold(@G, Obj, OP, a RANDOM<NP>) :-
deadGoldPeer(@G, Obj, OP), TBL peer(@G, Obj, NP),
OP != NP, TBL neighbor(@G, NP, , Live), Live==1.
pf13 findOtherGold(@G, Dir, Obj, OP, LG, NP) :-
pickNewGold(@G, Obj, OP, NP), TBL fileEntry(@G, Dir, Obj, LG),
LG != G, LG != OP.
pf14 ACT writeEvent(@G, ObjMeta, EName, Dir, Obj, S, G1, G2) :-
findOtherGold(@G, Dir, Obj, OP, G1, G2), ObjMeta =Obj+"/.meta",
EName = "fileEntryConfig".
pf16 establishConnections(@NP, Obj, LG) :-
findOtherGold(@G, , Obj, , LG, NP).
/***********************************************************************/
// Liveness monitoring:
// - At initialization, read from neighbor config.
// - Periodically ping a neighbor.
// - when a ping event is received,
// check if it is a new neighbor and
// update the neighbor table
// - Periodically, check which neighbors are dead
/***********************************************************************/
nin1 ACT readEvent(@X, ObjId) :-
initialize(@X), ObjId="/.neighborList".
nin2 TBL neighbor(@X, N, T, Live) :-
RCV neighborConfig(@X, N), T=f_now(), Live=0.
png refresh(@N, X) :-
periodic(@X, 5, -1), TBL neighbor(@X, N, , ).
rPng1 liveNeighbor(@X, N) :-
refresh(@X, N), TBL neighbor(@X, N, , Live), Live==0.
rPng2 TBL neighbor(@X, N, T, Live) :-
refresh(@X, N), T=f_now(), Live=1.
chk1 deadNeighbor(@X, N) :-
periodic(@X, 20, -1), TBL neighbor(@X, N, T, Live),
Live==1, TimePassed=f_now()-T, TimePassed>20.
chk2 TBL neighbor(@X, N, T, Live) :-
deadNeighbor(@X, N), TBL neighbor(@X, N, T, ), Live=0.
A.8.2 Blocking Policy
Blocking Point      Predicate
ApplyUpdateBlock    isValid
Bibliography
[1] http://www.w3.org/html/.
[2] Spin formal verification. http://spinroot.com/spin/whatispin.html.
[3] SQL. http://en.wikipedia.org/wiki/SQL.
[4] Proceedings of the Summer 1994 USENIX Conference, June 1994.
[5] Proceedings of the 15th ACM Symposium on Operating Systems Princi-
ples, December 1995.
[6] Proceedings of the 18th ACM Symposium on Operating Systems Princi-
ples, October 2001.
[7] Proceedings of the 5th ACM Symposium on Operating Systems Design and
Implementation, December 2002.
[8] Proceedings of the 6th ACM Symposium on Operating Systems Design and
Implementation, December 2004.
[9] Proceedings of the 3rd USENIX Symposium on Networked Systems Design
and Implementation, May 2006.
[10] Proceedings of the 6th USENIX Symposium on Networked Systems Design
and Implementation, April 2009.
[11] S. Adve and K. Gharachorloo. Shared memory consistency models: A
tutorial. IEEE Computer, pages 66–76, December 1996.
[12] A. Adya, W. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. Douceur,
J. Howell, J. Lorch, M. Theimer, and R. Wattenhofer. Farsite: Federated,
available, and reliable storage for an incompletely trusted environment.
In Proceedings of the 5th ACM Symposium on Operating Systems Design
and Implementation [7].
[13] M. Ahamad, P. Hutto, and R. John. Implementing and programming
causal distributed shared memory. In Proceedings of the International
Conference on Distributed Computing Systems, May 1991.
[14] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang.
Serverless Network File Systems. ACM Transactions on Computer Sys-
tems, 14(1):41–79, February 1996.
[15] N. Belaramani and M. Dahlin. Achieving energy efficiency while syn-
chronizing personal data. Technical Report TR-09-21, The University of
Texas at Austin, August 2009.
[16] N. Belaramani, M. Dahlin, L. Gao, A. Nayate, A. Venkataramani, P. Yala-
gandula, and J. Zheng. PRACTI replication. In Proceedings of the 3rd
USENIX Symposium on Networked Systems Design and Implementation
[9].
[17] M. Blaze and R. Alonso. Dynamic Hierarchical Caching in Large-Scale
Distributed File Systems. In Proceedings of the 12th International Con-
ference on Distributed Computing Systems, June 1992.
[18] S. Chandra, M. Dahlin, B. Richards, R. Wang, T. Anderson, and J. Larus.
Experience with a Language for Writing Coherence Protocols. In Pro-
ceedings of USENIX Conference on Domain-Specific Languages, October
1997.
[19] W. Chen and X. Liu. Enforcing routing consistency in structured peer-
to-peer overlays: Should we and could we? In Proceedings of the 5th
International workshop on Peer-to-Peer Systems, February 2006.
[20] L. Cox and B. Noble. Fast reconciliations in fluid replication. In Pro-
ceedings of the 21st International Conference on Distributed Computing
Systems, April 2001.
[21] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Wide-
area cooperative storage with CFS. In Proceedings of the 18th ACM
Symposium on Operating Systems Principles [6].
[22] M. Dahlin, L. Gao, A. Nayate, A. Venkataramani, P. Yalagandula, and
J. Zheng. PRACTI replication for large-scale systems. Technical Report
TR-04-28, University of Texas at Austin, March 2004.
[23] M. Dahlin, R. Wang, T. Anderson, and D. Patterson. Cooperative
Caching: Using Remote Client Memory to Improve File System Per-
formance. In Proceedings of the First ACM Symposium on Operating
Systems Design and Implementation, pages 267–280, November 1994.
[24] D. Dao, J. Albrecht, C. Killian, and A. Vahdat. Live debugging of
distributed systems. In Proceedings of the 18th International Conference
on Compiler Construction, March 2009.
[25] G. Decandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman,
A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo:
Amazon’s highly available key-value store. In Proceedings of the 21st
ACM Symposium on Operating Systems Principles, October 2007.
[26] M. Demmer, B. Du, and E. Brewer. Tierstore: A distributed file-system
for challenged networks. In Proceedings of the 6th USENIX Conference
on File and Storage Technologies, February 2008.
[27] D. L. Dill. The Murphi verification system. In 8th International Con-
ference on Computer Aided Verification, 1996.
[28] M. Feeley, W. Morgan, F. Pighin, A. Karlin, H. Levy, and C. Thekkath.
Implementing Global Memory Management in a Workstation Cluster. In
Proceedings of the 15th ACM Symposium on Operating Systems Principles
[5], pages 201–212.
[29] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In Pro-
ceedings of the 19th ACM Symposium on Operating Systems Principles,
December 2003.
[30] S. Gilbert and N. Lynch. Brewer’s conjecture and the feasibility of Con-
sistent, Available, Partition-tolerant web services. In ACM SIGACT
News, 2002.
[31] R. Golding. A weak-consistency architecture for distributed information
services. Computing Systems, 5(4):379–405, 1992.
[32] C. Gray and D. Cheriton. Leases: An Efficient Fault-Tolerant Mechanism
for Distributed File Cache Consistency. In Proceedings of the 12th ACM
Symposium on Operating Systems Principles, pages 202–210, 1989.
[33] J. Gray, P. Helland, P. E. O’Neil, and D. Shasha. Dangers of replication
and a solution. In Proceedings of SIGMOD International Conference on
Management of Data, pages 173–182, 1996.
[34] Robert Grimm. Better extensibility through modular syntax. In Pro-
ceedings of SIGPLAN Conference on Programming Language Design and
Implementation, pages 38–51, June 2006.
[35] H. S. Gunawi, A. Rajimwale, A. C. Arpaci-Dusseau, and R. H. Arpaci-
Dusseau. SQCK: A declarative file system checker. In Proceedings of
the 8th ACM Symposium
on Operating Systems Design and Implementation, December 2008.
[36] R. Guy, J. Heidemann, W. Mak, T. Page, Gerald J. Popek, and D. Roth-
meier. Implementation of the Ficus Replicated File System. In Proceed-
ings of the Summer 1990 USENIX Conference, June 1990.
[37] R. Guy, P. Reiher, D. Ratner, M. Gunter, and W. Ma. Rumor: Mobile
data access through optimistic peer-to-peer replication. In Workshop on
Mobile Data Access, pages 254–265, 1998.
[38] J. Hartman and J. Ousterhout. The Zebra Striped Network File Sys-
tem. In Proceedings of the 14th ACM Symposium on Operating Systems
Principles, pages 29–43, December 1993.
[39] J. Heidemann and G. Popek. File-system development with stackable
layers. ACM Transactions on Computer Systems, 12(1):58–89, February
1994.
[40] M. Herlihy and J. Wing. Linearizability: A correctness condition for
concurrent objects. ACM Transactions on Programming Languages and
Systems, 12(3), 1990.
[41] Martin Hirzel and Robert Grimm. Jeannie: Granting Java native inter-
face developers their wishes. In The International Conference on Object
Oriented Programming, Systems, Languages and Applications, October
2007.
[42] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Side-
botham, and M. West. Scale and Performance in a Distributed File Sys-
tem. ACM Transactions on Computer Systems, 6(1):51–81, February
1988.
[43] H. Hu and A. Vahdat. Building replicated internet services using TACT:
A toolkit for tunable availability and consistency tradeoffs. In Proceedings
of 2nd International Workshop on Advance Issues of E-Commerce and
Web-Based Information Systems, 2000.
[44] D. S. Parker (Jr.), G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker,
E. Walton, J. M. Chow, S. Kiser, D. Edwards, and C. Kline. Detection
of Mutual Inconsistency in Distributed Systems. IEEE Transactions on
Software Engineering, SE-9(3):240–247, May 1983.
[45] B. Kang, R. Wilensky, and J. Kubiatowicz. Hash history approach for
reconciling mutual inconsistency in optimistic replication. In Proceedings
of the 23rd International Conference on Distributed Computing Systems,
May 2003.
[46] A. Karypidis and S. Lalis. Omnistore: A system for ubiquitous per-
sonal storage management. In Annual IEEE International Conference
on Pervasive Computing and Communications, pages 136–147, 2006.
[47] A. Kermarrec, A. Rowstron, M. Shapiro, and P. Druschel. The IceCube
approach to the reconciliation of divergent replicas. In Proceedings of the
20th Symposium on the Principles of Distributed Computing, 2001.
[48] Charles Killian, James W. Anderson, Ryan Braud, Ranjit Jhala, and
Amin Vahdat. Mace: Language support for building distributed systems.
In Proceedings of the ACM SIGPLAN 2007 Conference on Programming
Language Design and Implementation, June 2007.
[49] J. Kistler and M. Satyanarayanan. Disconnected operation in the coda
file system. ACM Transactions on Computer Systems, 10(1):3–25, Febru-
ary 1992.
[50] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels,
R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and
B. Zhao. Oceanstore: An architecture for global-scale persistent stor-
age. In Proceedings of the 9th International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS
2000), November 2000.
[51] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high avail-
ability using lazy replication. ACM Transactions on Computer Systems,
10(4):360–391, 1992.
[52] L. Lamport. Time, clocks, and the ordering of events in a distributed
system. Communications of the ACM, 21(7), July 1978.
[53] L. Lamport. How to make a multiprocessor computer that correctly
executes multiprocess programs. IEEE Transactions on Computers, C-
28(9):690–691, September 1979.
[54] L. Lamport. The part-time parliament. ACM Transactions on Computer
Systems, 16(2), May 1998.
[55] R. Lipton and J. Sandberg. PRAM: A scalable shared memory. Techni-
cal Report CS-TR-180-88, Princeton, 1988.
[56] B. Loo, T. Condie, J. Hellerstein, P. Maniatis, T. Roscoe, and I. Sto-
ica. Implementing declarative overlays. In Proceedings of the 20th ACM
Symposium on Operating Systems Principles, October 2005.
[57] P. Mahajan, S. Lee, J. Zheng, L. Alvisi, and M. Dahlin. Astro: Au-
tonomous and trustworthy data sharing. Technical Report TR-08-24,
The University of Texas at Austin, October 2008.
[58] D. Malkhi, L. Novik, and C. Purcell. P2P Replica Synchronization with
Vector Sets. ACM SIGOPS Operating Systems Review, 41(2):68–74,
2007.
[59] D. Malkhi and D. Terry. Concise version vectors in WinFS. In Proceed-
ings of the 20th Symposium on Distributed Computing, 2005.
[60] L. Mummert, M. Ebling, and M. Satyanarayanan. Exploiting Weak
Connectivity for Mobile File Access. In Proceedings of the 15th ACM
Symposium on Operating Systems Principles [5].
[61] L. Mummert and M. Satyanarayanan. Large Granularity Cache Coher-
ence for Intermittent Connectivity. In Proceedings of the Summer 1994
USENIX Conference [4].
[62] D. Muntz and P. Honeyman. Multi-level Caching in Distributed File
Systems or Your cache ain’t nuthin’ but trash. In Proceedings of the
Winter 1992 USENIX Conference, pages 305–313, January 1992.
[63] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen. Ivy: A read/write
peer-to-peer file system. In Proceedings of the 5th ACM Symposium on
Operating Systems Design and Implementation [7].
[64] A. Nayate, M. Dahlin, and A. Iyengar. Transparent information dis-
semination. In Proceedings of the ACM/IFIP/USENIX 5th International
Middleware Conference, October 2004.
[65] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite Network
File System. ACM Transactions on Computer Systems, 6(1), February
1988.
[66] E. Nightingale and J. Flinn. Energy-efficiency and storage flexibility
in the Blue file system. In Proceedings of the 6th ACM Symposium on
Operating Systems Design and Implementation [8].
[67] L. Novik, I. Hudis, D. Terry, S. Anand, V. Jhaveri, A. Shah, and Y. Wu.
Peer-to-peer replication in WinFS. Technical Report MSR-TR-2006-78,
Microsoft Research, June 2006.
[68] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz.
NFS Version 3 Design and Implementation. In Proceedings of the Summer
1994 USENIX Conference [4].
[69] K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and A. Demers. Flexible
update propagation for weakly consistent replication. In Proceedings
of the 16th ACM Symposium on Operating Systems Principles, October
1997.
[70] V. Ramasubramanian, T. L. Rodeheffer, D. B. Terry, M. Walraed-Sullivan,
T. Wobber, C. C. Marshall, and A. Vahdat. Cimbiosys: A platform for
content-based partial replication. In Proceedings of the 6th USENIX
Symposium on Networked Systems Design and Implementation [10].
[71] P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek. Resolving
File Conflicts in the Ficus File System. In Proceedings of the Summer
1994 USENIX Conference [4].
[72] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. A. Shah, and
A. Vahdat. Pip: Detecting the unexpected in distributed systems. In
Proceedings of the 3rd USENIX Symposium on Networked Systems Design
and Implementation [9].
[73] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubi-
atowicz. Pond: the OceanStore prototype. In Proceedings of the 2nd
USENIX Conference on File and Storage Technologies, March 2003.
[74] A. Rodriguez, C. Killian, S. Bhat, D. Kostic, and A. Vahdat. MACE-
DON: Methodology for automatically creating, evaluating, and designing
overlay networks. In Proceedings of the First USENIX Symposium on
Networked Systems Design and Implementation, March 2004.
[75] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object loca-
tion and routing for large-scale peer-to-peer systems. In Proceedings of
IFIP/ACM International Conference on Distributed Systems Platforms,
Nov 2001.
[76] A. Rowstron and P. Druschel. Storage management and caching in PAST,
a large-scale, persistent peer-to-peer storage utility. In Proceedings of the
18th ACM Symposium on Operating Systems Principles [6].
[77] Y. Saito, C. Karamanolis, M. Karlsson, and M. Mahalingam. Taming ag-
gressive replication in the Pangaea wide-area file system. In Proceedings
of the 5th ACM Symposium on Operating Systems Design and Implemen-
tation [7].
[78] M. Satyanarayanan. Scalable, Secure, and Highly Available Distributed
File Access. IEEE Computer, 23(5):9–21, May 1990.
[79] M. Shapiro, K. Bhargavan, and N. Krishna. A constraint-based formal-
ism for consistency in replicated systems. In Proceedings of the 8th Inter-
national Conference on the Principles of Distributed Systems, December
2004.
[80] A. Siegel, K. Birman, and K. Marzullo. Deceit: A flexible distributed file
system. Technical Report 89-1042, Cornell, November 1989.
[81] A. Singla, U. Ramachandran, and J. Hodgins. Temporal notions of syn-
chronization and consistency in Beehive. In Proceedings of the Ninth
Annual ACM Symposium on Parallel Algorithms and Architectures, 1997.
[82] S. Sobti, N. Garg, F. Zheng, J. Lai, E. Ziskind, A. Krishnamurthy, and
R. Y. Wang. Segank: a distributed mobile storage system. In Proceedings
of the 3rd USENIX Conference on File and Storage Technologies, 2004.
[83] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications.
In Proceedings of the ACM SIGCOMM ’01 Conference on Applications,
Technologies, Architectures, and Protocols for Computer Communication,
Aug 2001.
[84] J. Stribling, Y. Sovran, I. Zhang, X. Pretzer, J. Li, M. F. Kaashoek,
and R. Morris. Flexible, wide-area storage for distributed systems with
WheelFS. In Proceedings of the 6th USENIX Symposium on Networked
Systems Design and Implementation [10].
[85] S. Susarla and J. Carter. Flexible consistency for wide area peer replica-
tion. In Proceedings of the 25th International Conference on Distributed
Computing Systems, June 2005.
[86] D. Terry, M. Theimer, K. Petersen, A. Demers, M. Spreitzer, and C. Hauser.
Managing Update Conflicts in Bayou, a Weakly Connected Replicated
Storage System. In Proceedings of the 15th ACM Symposium on Operat-
ing Systems Principles [5].
[87] R. Tewari, M. Dahlin, H. Vin, and J. Kay. Beyond Hierarchies: Design
Considerations for Distributed Caching on the Internet. In Proceedings of
the Nineteenth International Conference on Distributed Computing Sys-
tems, May 1999.
[88] R. van Renesse and Fred B. Schneider. Chain replication for support-
ing high throughput and availability. In Proceedings of the 6th ACM
Symposium on Operating Systems Design and Implementation [8].
[89] R. Wang and T. Anderson. xFS: A Wide Area Mass Storage File System.
In Proceedings of the 3rd Workshop on Workstation Operating Systems,
pages 71–78, October 1993.
[90] G. Wuu and A. Bernstein. Efficient solutions to the replicated log and dic-
tionary problem. In Proceedings of the 3rd Symposium on the Principles
of Distributed Computing, pages 233–242, 1984.
[91] M. Yabandeh, N. Knezevic, D. Kostic, and V. Kuncak. CrystalBall: Pre-
dicting and preventing inconsistencies in deployed distributed systems. In
Proceedings of the 6th USENIX Symposium on Networked Systems Design
and Implementation [10].
[92] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Hierarchical Cache Consistency
in a WAN. In Proceedings of the 2nd USENIX Symposium on Internet
Technologies and Systems, October 1999.
[93] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Volume Leases to Support
Consistency in Large-Scale Systems. IEEE Transactions on Knowledge
and Data Engineering, February 1999.
[94] H. Yu and A. Vahdat. Design and evaluation of a conit-based contin-
uous consistency model for replicated services. ACM Transactions on
Computer Systems, 20(3), August 2002.
[95] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D.
Kubiatowicz. Tapestry: A resilient global-scale overlay for service deploy-
ment. IEEE Journal on Selected Areas in Communications, January 2004.
[96] J. Zheng. URA: A Universal Data Replication Architecture. PhD thesis,
The University of Texas at Austin, August 2008.
[97] J. Zheng, N. Belaramani, and M. Dahlin. Pheme: Synchronizing replicas
in diverse environments. Technical Report TR-09-07, University of Texas
at Austin, February 2009.
Vita
Nalini Belaramani was born in Hong Kong on the 2nd of November
1978 to two loving parents—Rani and Moti Belaramani. She was soon joined
by her sister, Kiran, who has been her constant source of entertainment ever
since.
Nalini got her first computer in 1996 and was totally intrigued by it.
She decided to “unravel” the mysteries of the computer by getting a BEng
degree in Computer Engineering and an MPhil degree in Computer Science
from The University of Hong Kong.
While working for Motorola Semi-Conductors HK Ltd as an engineer,
she realized that there were more mysteries to be solved and more knowledge
to be found. She joined the PhD program at the University of Texas at Austin,
where the weather and the people made her quest for knowledge a grueling
and yet delightful experience.
And now, with enough knowledge under her belt, she is ready to face
the world.
Permanent address: 16/A South Sea Mansion, 81 Chatham Road, TST, KLN, Hong Kong.
This dissertation was typed by the author.