
IBM Research, SRDS Workshop, © 2003 IBM Corporation

Group Communication, Still Complex After All These Years

Position paper, by Richard Golding and Ohad Rodeh


Table of contents

Introduction
Three specific examples
Why GCSes are complex to build
A look at two interesting research systems
Summary


Disclaimer

This is a position paper. It reflects the authors' opinion; it does not convey "scientific truth".

The analysis draws on the authors' experience in IBM storage research and HP storage research, and on published papers on IBM and HP storage systems.


Introduction

An ideal GCS would be "one size fits all": a single system that fits the needs of all applications.

The authors claim that this is not feasible. The application domain considered here is storage systems that use group communication.


Three examples

IBM General Parallel File System™ (GPFS)
IBM Total Storage File System™ (a.k.a. StorageTank)
IBM Total Storage Volume Controller™ (SVC)


GPFS

GPFS is the file system at the core of the SP™ and ASCI White supercomputers.

It scales from a few nodes to a few hundred.

[Figure: nodes N1 through N5 connected by the SP2 interconnect, sharing Disk1, Disk2, and Disk3]


GPFS (2)

The group-services component (HACMP™) allows cluster nodes to heartbeat each other, compute membership, perform multi-round voting schemes, and execute merge/failure protocols.

When a node fails, all of its file-system read/write tokens are lost. The recovery procedure needs to regenerate the lost tokens so that file-system operations can continue.
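To make the division of labor concrete, here is a minimal sketch of the kind of callback interface a file system might register with a group-services layer. The names (GroupServices, OnNodeFailure, the token-regeneration stub) are illustrative, not the actual HACMP API.

```go
package main

import "fmt"

// NodeID identifies a cluster node.
type NodeID int

// MembershipView is the set of live nodes agreed on by the group-services layer.
type MembershipView struct {
	Epoch int
	Nodes []NodeID
}

// GroupServices is an illustrative callback interface; it only captures the
// shape of the interaction between the GCS and the file system.
type GroupServices interface {
	// OnViewChange is invoked after membership is recomputed.
	OnViewChange(view MembershipView)
	// OnNodeFailure is invoked when heartbeating declares a node dead.
	OnNodeFailure(failed NodeID)
}

// gpfsRecovery stands in for the file-system recovery logic.
type gpfsRecovery struct{}

func (g *gpfsRecovery) OnViewChange(view MembershipView) {
	fmt.Printf("installing view %d with %d nodes\n", view.Epoch, len(view.Nodes))
}

func (g *gpfsRecovery) OnNodeFailure(failed NodeID) {
	// A failed node's read/write tokens are lost; regenerate them so that
	// file-system operations can continue.
	fmt.Printf("node %d failed: regenerating its tokens\n", failed)
}

func main() {
	var gs GroupServices = &gpfsRecovery{}
	gs.OnNodeFailure(3)
	gs.OnViewChange(MembershipView{Epoch: 7, Nodes: []NodeID{1, 2, 4, 5}})
}
```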


StorageTank

A SAN file system. Clients can perform IO directly to network-attached disks.

[Figure: hosts Host1 through Host4 and a metadata-server cluster (MDS1, MDS2, MDS3) exchange control messages over IP; hosts read/write directly to Disk1, Disk2, and Disk3 over the SAN; the MDS cluster can fence disks]


StorageTank (2)

StorageTank contains a set of metadata servers responsible for all file-system metadata: the directory structure and file attributes.

The file-system name space is partitioned into a forest of trees. The metadata for each tree is stored in a small database on shared disks. Each tree is managed by one metadata server, so the servers share nothing between them.

This leads to a rather lightweight requirement on group services: they only need to quickly (within roughly 100ms) detect when one server fails and tell another to take its place.

Recovery consists of reading the on-disk database and its logs, and recovering client state, such as file locks, from the clients.
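A minimal sketch of that lightweight requirement, assuming a heartbeat-based detector with a 100 ms timeout; the names are hypothetical, and the real detector, election, and fencing are of course more involved.

```go
package main

import (
	"fmt"
	"time"
)

// monitorServer watches one metadata server's heartbeats and, if none arrives
// within the timeout, asks a standby to take over its partition of the
// name-space forest.
func monitorServer(name string, heartbeat <-chan struct{}, takeover func(string)) {
	timeout := 100 * time.Millisecond // the "quickly (100ms)" target from the slide
	for {
		select {
		case <-heartbeat:
			// server is alive, keep waiting
		case <-time.After(timeout):
			takeover(name)
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		for i := 0; i < 3; i++ { // a few heartbeats, then silence
			hb <- struct{}{}
			time.Sleep(30 * time.Millisecond)
		}
	}()
	monitorServer("MDS2", hb, func(s string) {
		// Recovery: read the on-disk database and logs, then re-acquire
		// client state such as file locks from the clients themselves.
		fmt.Println("taking over for", s)
	})
}
```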


SVC

SVC is a storage controller and SAN virtualizer. It is architected as a small cluster of Linux-based servers situated between a set of hosts and a set of disks.

The servers provide caching, virtualization, and copy services (e.g. snapshot) to hosts. The servers use non-volatile memory to cache data writes so that they can reply to host write operations immediately.

[Figure: hosts connected over a SAN (Fibre Channel, iSCSI, Infiniband) to pairs of SVC servers (L1, L2, L3), which sit in front of Disk1, Disk2, and Disk3]


SVC (2)

Fault-tolerance requirements imply that the data must be written consistently to at least two servers. This requirement has driven the programming environment in SVC.

The cluster is implemented using a replicated state machine approach, with a Paxos-type two-phase protocol for sending messages.

However, the system runs inside a SAN, so all communication goes over Fibre Channel, which does not support multicast. Messages must be embedded in the SCSI over Fibre Channel (FCP) protocol. This is a major deviation from most GCSes, which are IP/multicast based.
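A sketch of what "group send" looks like when the transport is point-to-point only: a two-phase fan-out over individual links. This is a deliberately simplified illustration (no ballots, no recovery), not real Paxos and not SVC's actual protocol engine; the Transport interface is hypothetical.

```go
package main

import "fmt"

// Transport abstracts a point-to-point link; in SVC this would be SCSI over
// Fibre Channel (FCP), which has no multicast, so a "broadcast" is a fan-out
// of individual sends.
type Transport interface {
	Send(node int, msg string) bool // returns false if the node is unreachable
}

// twoPhaseBroadcast delivers msg to all nodes with a two-phase exchange:
// a propose round that must gather a majority, then a commit round.
func twoPhaseBroadcast(t Transport, nodes []int, msg string) bool {
	acks := 0
	for _, n := range nodes {
		if t.Send(n, "PROPOSE "+msg) {
			acks++
		}
	}
	if acks <= len(nodes)/2 {
		return false // no majority, the message is not committed
	}
	for _, n := range nodes {
		t.Send(n, "COMMIT "+msg)
	}
	return true
}

// fakeFCP is an in-memory stand-in for the FCP transport.
type fakeFCP struct{}

func (fakeFCP) Send(node int, msg string) bool {
	fmt.Printf("node %d <- %s\n", node, msg)
	return true
}

func main() {
	ok := twoPhaseBroadcast(fakeFCP{}, []int{1, 2, 3, 4}, "write cache-line 42")
	fmt.Println("committed:", ok)
}
```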


Why GCSes are complex to build

Many general-purpose GCSes have been built over the years, for example Spread and Isis.

These systems are complex. Why? Three reasons: scalability, semantics, and the target programming environment.


Semantics

What kind of semantics should a GCS provide? Some storage systems only want limited features from their group services: StorageTank only uses membership and failure detection, while SVC uses Paxos.

Some important semantic choices include:
- Ordering: FIFO order, causal order, total order, safe order
- Messaging: all systems support multicast, but what about virtually synchronous point-to-point?
- Lightweight groups
- Handling state transfer
- Channels outside the GCS: how is access to shared disk state coordinated with in-GCS communication?
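To make the variability concrete, here is a hypothetical configuration type a "universal" GCS would have to expose. Every field below changes the protocol underneath, and the names are purely illustrative.

```go
package main

import "fmt"

// Ordering is the delivery-order guarantee an application asks for.
type Ordering int

const (
	FIFOOrder Ordering = iota
	CausalOrder
	TotalOrder
	SafeOrder // delivered only once known stable at all members
)

// GroupConfig is a hypothetical request an application would hand to a
// one-size-fits-all GCS. Each field changes the machinery underneath.
type GroupConfig struct {
	Order             Ordering
	VirtuallySyncP2P  bool // virtually synchronous point-to-point messages
	LightweightGroups bool // many logical groups over one heavyweight group
	StateTransfer     bool // GCS-managed state transfer on join/merge
	OutOfBandChannels bool // coordinate with shared-disk state outside the GCS
}

func main() {
	// A minimal request: FIFO ordering and nothing else.
	cfg := GroupConfig{Order: FIFOOrder}
	fmt.Printf("%+v\n", cfg)
}
```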


Scalability

Traditional GCSes have scalability limits. Roughly speaking, the basic virtual synchrony model cannot scale beyond several hundred nodes. GPFS on cluster supercomputers pushes these limits.

Which pieces of a GCS are required for correct functioning, and which can be dropped? Should we keep failure detection but leave virtual synchrony out? What about state transfer?


Environment

The product's programming and execution environment bears on the ability to adapt an existing GCS to it.

Many storage systems are implemented using special purpose operating systems or programming environments with limited or eccentric functionality.

This makes products more reliable, more efficient, and less vulnerable to attacks.

SVC is implemented in a special-purpose development environment, designed around its replicated state machine and intended to reduce implementation errors through rigorous design.


Environment (2)

Some important aspects of the environment are:
- Programming language: C, C++, C#, Java, ML (Caml, SML)?
- Threading: what threading model is used? Co-routines, non-preemptive threads, in-kernel threads, user-space threads?
- Kernel: does the product run in-kernel, in user space, or both?
- Resource allocation: how are memory, message buffer space, thread context, and so on allocated and managed?
- Build and execution environment: what kind of environment does the customer use?
- API: what does the GCS API look like: callbacks? socket-like?

This wide variability in requirements and environments makes it impossible to build one all-encompassing solution.
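As an illustration of the last bullet, the sketch below contrasts a callback-style and a socket-like GCS API. Both interfaces are hypothetical and only meant to show how different the resulting application code looks.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a group message as seen by the application.
type Message struct {
	Sender  string
	Payload []byte
}

// CallbackGCS is a hypothetical callback-style API: the GCS owns the event
// loop and calls back into the application on messages and view changes.
type CallbackGCS interface {
	Join(group string, onMsg func(Message), onView func(members []string)) error
}

// SocketGCS is a hypothetical socket-like API: the application owns the loop
// and pulls events, much like reading from a socket.
type SocketGCS interface {
	Join(group string) error
	Recv(group string) (Message, error) // blocks until the next message
}

// consumeSocketStyle is the receive loop an application must write itself
// when the API is socket-like.
func consumeSocketStyle(g SocketGCS, group string) {
	for {
		m, err := g.Recv(group)
		if err != nil {
			return
		}
		fmt.Printf("from %s: %d bytes\n", m.Sender, len(m.Payload))
	}
}

// fakeGCS delivers a fixed number of messages and then reports the group as
// closed; it only exists to exercise consumeSocketStyle.
type fakeGCS struct{ left int }

func (f *fakeGCS) Join(group string) error { return nil }
func (f *fakeGCS) Recv(group string) (Message, error) {
	if f.left == 0 {
		return Message{}, errors.New("group closed")
	}
	f.left--
	return Message{Sender: "mds1", Payload: []byte("lock grant")}, nil
}

func main() {
	g := &fakeGCS{left: 2}
	if err := g.Join("metadata-servers"); err != nil {
		return
	}
	consumeSocketStyle(g, "metadata-servers")
}
```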


Two research systems

Two systems that attempt to bypass the scalability limits of a GCS: Palladio and zFS.


Palladio

Palladio is a research block-storage system [Golding & Borowsky, SRDS '99]. It is made of many small storage bricks, scales from tens of bricks to thousands, and stripes data redundantly across bricks.

The main issues are the performance of basic IO operations (read, write), reliability, and scaling to very large systems.

When performing IO, clients expect one-copy serializability; operations are actually atomic and serialized even when accessing discontiguous data ranges.

[Figure: a client obtains layout information from the volume manager, reads and writes data directly to storage devices, and the system control layer handles layout changes, failure detection, events, and reactions]


Palladio layering of concerns

Palladio has three layers. Each executes in a context provided by the layer below, is optimized for its specific task, and defers difficult or expensive things to the layer below.

[Figure: timeline of the three layers: IO operations execute within layout epochs; layout management reacts to failure reports with layout changes; system control runs the system management decision process]


Palladio layering (2)

IO operations
- Perform individual read, write, and data-copy operations
- Use an optimized 2PC write and a 1-phase read (lightweight transactions)
- Execute within one layout (membership) epoch, hence under a constant membership
- Freeze on failure

Layout management
- Detects failures
- Coordinates membership change
- Fault tolerant; uses atomic commitment (AC) and leader election subprotocols
- Concerned only with a single volume

System control
- Policy determination
- Reacts to failure reports, load measures, and new resources
- Changes the layout for one or more volumes
- Global view, but hierarchically structured
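A minimal sketch of the lightweight-transaction idea in the IO layer, assuming a two-phase write (prepare on every replica, then commit) and a one-phase read against any replica. The types and the freeze-on-failure behavior are illustrative, not Palladio's actual wire protocol.

```go
package main

import (
	"errors"
	"fmt"
)

// Replica is one storage brick holding a copy of a data range.
type Replica interface {
	Prepare(off int64, data []byte) error // stage the write
	Commit(off int64) error               // expose the staged write
	Read(off int64) ([]byte, error)
}

// errFrozen stands in for "freeze on failure": the IO layer stops and hands
// the problem to layout management rather than resolving it itself.
var errFrozen = errors.New("epoch frozen: defer to layout management")

// twoPhaseWrite stages the write on every replica in the current layout
// epoch, then commits; membership is constant within the epoch, so the
// replica set cannot change underneath us.
func twoPhaseWrite(replicas []Replica, off int64, data []byte) error {
	for _, r := range replicas {
		if err := r.Prepare(off, data); err != nil {
			return errFrozen
		}
	}
	for _, r := range replicas {
		if err := r.Commit(off); err != nil {
			return errFrozen
		}
	}
	return nil
}

// onePhaseRead reads from the first reachable replica; within one epoch all
// committed replicas hold the same data.
func onePhaseRead(replicas []Replica, off int64) ([]byte, error) {
	for _, r := range replicas {
		if b, err := r.Read(off); err == nil {
			return b, nil
		}
	}
	return nil, errFrozen
}

// memReplica is an in-memory brick used to exercise the protocol sketch.
type memReplica struct{ staged, stored map[int64][]byte }

func newMemReplica() *memReplica {
	return &memReplica{staged: map[int64][]byte{}, stored: map[int64][]byte{}}
}
func (m *memReplica) Prepare(off int64, data []byte) error { m.staged[off] = data; return nil }
func (m *memReplica) Commit(off int64) error               { m.stored[off] = m.staged[off]; return nil }
func (m *memReplica) Read(off int64) ([]byte, error) {
	if b, ok := m.stored[off]; ok {
		return b, nil
	}
	return nil, errors.New("no data at offset")
}

func main() {
	reps := []Replica{newMemReplica(), newMemReplica()}
	if err := twoPhaseWrite(reps, 0, []byte("stripe unit")); err != nil {
		fmt.Println("write froze:", err)
		return
	}
	b, _ := onePhaseRead(reps, 0)
	fmt.Printf("read back %q\n", b)
}
```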


Palladio as a specialized group communication system

IO processing fills the role of an atomic multicast protocol: it addresses an arbitrary, small set of message recipients; it has exactly two kinds of "messages", read and write; and its scope is limited to a volume.

Layout management is like failure suspicion plus group membership: it has specialized membership roles (manager and data storage devices), works with the storage-oriented policy layer below it, and its scope is limited to a volume.

System control is policy that sits outside the data storage/communication system.

Scalability comes from limiting IO and layout to a single volume, and hence to few nodes, and from structuring the system control layer hierarchically.


zFS

zFS is a server-less file system developed at IBM Research. Its components are hosts and object stores. It is designed to be extremely scalable.

It can be compared to clustered/SAN file systems that use group services to maintain lock and metadata consistency. The challenge is to maintain the same levels of consistency, availability, and performance without a GCS.


An Object Store

An object store (ObS) is a storage controller that provides an object-based protocol for data access. Roughly speaking, an object is a file; users can read, write, create, and delete objects. An object store handles allocation internally and secures its network connections.

In zFS we also assume an object store supports one major lease. Holding the major lease allows access to all objects on the object store. The ObS keeps track of the current owner of the major lease; this replaces a name service for locating lease owners.
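A minimal sketch of the major-lease interaction under the assumptions above; the ObS method names and the lease term are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// MajorLease grants its holder access to every object on one object store
// until it expires.
type MajorLease struct {
	Holder  string
	Expires time.Time
}

// ObjectStore is a hypothetical ObS: it hands out the major lease and
// remembers who currently holds it, replacing a separate name service.
type ObjectStore struct {
	lease *MajorLease
}

// AcquireMajorLease grants the lease to host if it is free or expired, and
// otherwise returns the current holder so the caller can contact it.
func (o *ObjectStore) AcquireMajorLease(host string, term time.Duration) (granted bool, holder string) {
	now := time.Now()
	if o.lease == nil || now.After(o.lease.Expires) {
		o.lease = &MajorLease{Holder: host, Expires: now.Add(term)}
		return true, host
	}
	return false, o.lease.Holder
}

func main() {
	obs := &ObjectStore{}
	ok, holder := obs.AcquireMajorLease("host1", 30*time.Second)
	fmt.Println("host1 granted:", ok, "holder:", holder)
	ok, holder = obs.AcquireMajorLease("host2", 30*time.Second)
	fmt.Println("host2 granted:", ok, "holder:", holder) // host2 learns host1 holds it
}
```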


Environment and assumptions

A zFS configuration can encompass a large number of ObSes and hosts. Hosts can fail; ObSes are assumed to be reliable, since a large body of work has been invested in making storage controllers highly reliable.

If an ObS fails, then a client requiring a piece of data on it will wait until that disk is restored.

zFS assumes the timed-asynchronous message-passing model.


File-system layout

A file is an object (a simplification). A directory is also an object. A directory entry contains a string name and a pointer to the object holding the file or directory; the pointer is the pair (object store, object id).

The file-system data and metadata are spread throughout the set of ObSes.

[Figure: a directory object on ObS1 with entries "foo" -> (2, 4) and "bar" -> (5, 1), pointing at object 4 on ObS2 and object 1 on ObS5]
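A minimal sketch of that layout, assuming the pointer really is just an (object-store, object-id) pair; the type names are illustrative.

```go
package main

import "fmt"

// ObjectID names an object within one object store.
type ObjectID uint64

// ObjectPointer locates an object anywhere in the system: which object
// store holds it, and its id on that store.
type ObjectPointer struct {
	ObS uint32   // object-store id
	OID ObjectID // object id within that store
}

// DirEntry is one entry in a directory object: a name plus a pointer to the
// object holding the file or sub-directory.
type DirEntry struct {
	Name   string
	Target ObjectPointer
}

func main() {
	// The two entries from the figure: "foo" -> (2, 4), "bar" -> (5, 1).
	dir := []DirEntry{
		{Name: "foo", Target: ObjectPointer{ObS: 2, OID: 4}},
		{Name: "bar", Target: ObjectPointer{ObS: 5, OID: 1}},
	}
	for _, e := range dir {
		fmt.Printf("%s -> ObS%d, oid=%d\n", e.Name, e.Target.ObS, e.Target.OID)
	}
}
```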


LMGRs

For each ObS in use there is a lease manager (LMGR). The LMGR manages access to the ObS; a strict cache-consistency policy is used.

To locate LMGRx, a host accesses ObS X and queries it for the current LMGR. If none exists, the host itself becomes LMGRx. There is no name server for locating LMGRs.

All operations are lease based. This makes the LMGR stateless, so no replication of LMGR state is required.

[Figure: an application uses lease managers L1, L2, and L3, one per object store ObS1, ObS2, and ObS3]
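The host-side lookup might look like the sketch below, assuming the query and become-manager operations described above; the names are illustrative, and the real protocol must of course handle races between hosts.

```go
package main

import "fmt"

// ObS is the slice of the object-store interface needed to locate a lease
// manager: query the current holder, or claim the role if it is free.
type ObS interface {
	CurrentLMGR() (host string, ok bool) // ok is false if no one holds the lease
	BecomeLMGR(host string) bool         // true if this host won the role
}

// locateLMGR returns the host acting as lease manager for obs, becoming the
// manager itself if none exists. There is no separate name server: the ObS
// is the single authority on who manages it.
func locateLMGR(obs ObS, self string) string {
	if host, ok := obs.CurrentLMGR(); ok {
		return host
	}
	if obs.BecomeLMGR(self) {
		return self
	}
	// Someone else raced us and became the manager; ask again.
	host, _ := obs.CurrentLMGR()
	return host
}

// fakeObS is an in-memory stand-in for exercising the lookup logic.
type fakeObS struct{ holder string }

func (f *fakeObS) CurrentLMGR() (string, bool) { return f.holder, f.holder != "" }
func (f *fakeObS) BecomeLMGR(h string) bool {
	if f.holder == "" {
		f.holder = h
		return true
	}
	return false
}

func main() {
	obs := &fakeObS{}
	fmt.Println(locateLMGR(obs, "hostA")) // hostA becomes the LMGR
	fmt.Println(locateLMGR(obs, "hostB")) // hostB finds hostA
}
```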


Meta Data operations

Metadata operations (create file, delete file, create directory, ...) are difficult to perform.

For example, file creation should be an atomic operation. It is implemented by (1) creating a directory entry and (2) creating a new object with the requested attributes. A host can fail in between these two steps, leaving the file system inconsistent.

A delete operation involves removing a directory entry and removing an object. A host performing the delete can fail midway through the operation, leaving dangling objects that are never freed.


Comparison

It is difficult to compare zFS and Palladio; this is not an apples-to-apples comparison.

Palladio:
- A block-storage system
- Uses group services outside the IO path
- Performs IO to a small set of disks per IO
- Uses lightweight transactions
- Does not assume reliable disks
- Uses a name server to locate internal services

zFS:
- A file system
- Does not use group services
- Performs IO to a small set of disks per transaction
- Uses full, heavyweight transactions
- Assumes reliable disks
- Does not use an internal name server


Summary

It is the authors' opinion that group communication is complex and is going to remain complex, without a one-size-fits-all solution.

There are three major reasons for this: semantics, scalability, and the operating environment.

Different customers want different semantics from a GCS, have different scalability and performance requirements, and have a wide range of programming environments.

We believe that in the future users will continue to tailor their proprietary GCSes to their particular environment.