
IBM Research, SRDS Workshop, © 2003 IBM Corporation

Group Communication, Still Complex After All These Years

Position paper, by Richard Golding and Ohad Rodeh


Table of contents

Introduction
Three specific examples
Why GCSes are complex to build
A look at two interesting research systems
Summary


Disclaimer

This is a position paper. It reflects the authors' opinion; it does not convey "scientific truth".

The analysis draws on the authors' experience in IBM storage research and HP storage research, and on published papers on IBM and HP storage systems.


Introduction

An ideal GCS would be "one size fits all": a single system that fits the needs of all applications.

The authors claim that this is not feasible. The application domain considered here is storage systems that use group communication.


Three examples

IBM General Parallel File System™ (GPFS)
IBM Total Storage File System™ (a.k.a. StorageTank)
IBM Total Storage Volume Controller™ (SVC)


GPFS

GPFS is the file system at the core of the SP™ and ASCI White supercomputers.

It scales from a few nodes to a few hundred.

[Figure: nodes N1 through N5 connected by the SP2 interconnect, sharing Disk1, Disk2, and Disk3]


GPFS (2)

The group-services component (HACMP™) allows cluster nodes to heartbeat each other, compute membership, perform multi-round voting schemes, and execute merge/failure protocols.

When a node fails, all of its file-system read/write tokens are lost. The recovery procedure needs to regenerate the lost tokens so that file-system operations can continue.
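To make the division of labor concrete, here is a minimal sketch of the kind of callback interface a file system might register with a group-services layer. The names (GroupServices, OnNodeFailure, the token-regeneration stub) are illustrative, not the actual HACMP API.

```go
package main

import "fmt"

// NodeID identifies a cluster node.
type NodeID int

// MembershipView is the set of live nodes agreed on by the group-services layer.
type MembershipView struct {
	Epoch int
	Nodes []NodeID
}

// GroupServices is an illustrative callback interface; it only captures the
// shape of the interaction between the GCS and the file system.
type GroupServices interface {
	// OnViewChange is invoked after membership is recomputed.
	OnViewChange(view MembershipView)
	// OnNodeFailure is invoked when heartbeating declares a node dead.
	OnNodeFailure(failed NodeID)
}

// gpfsRecovery stands in for the file-system recovery logic.
type gpfsRecovery struct{}

func (g *gpfsRecovery) OnViewChange(view MembershipView) {
	fmt.Printf("installing view %d with %d nodes\n", view.Epoch, len(view.Nodes))
}

func (g *gpfsRecovery) OnNodeFailure(failed NodeID) {
	// A failed node's read/write tokens are lost; regenerate them so that
	// file-system operations can continue.
	fmt.Printf("node %d failed: regenerating its tokens\n", failed)
}

func main() {
	var gs GroupServices = &gpfsRecovery{}
	gs.OnNodeFailure(3)
	gs.OnViewChange(MembershipView{Epoch: 7, Nodes: []NodeID{1, 2, 4, 5}})
}
```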


StorageTank

A SAN file system. Clients can perform IO directly to network-attached disks.

[Figure: hosts Host1 through Host4 and a metadata-server cluster (MDS1, MDS2, MDS3) exchange control messages over IP; hosts read/write directly to Disk1, Disk2, and Disk3 over the SAN; the MDS cluster can fence disks]


StorageTank (2)

StorageTank contains a set of metadata servers responsible for all file-system metadata: the directory structure and file attributes.

The file-system name space is partitioned into a forest of trees. The metadata for each tree is stored in a small database on shared disks. Each tree is managed by one metadata server, so the servers share nothing between them.

This leads to a rather lightweight requirement on group services: they only need to quickly (within roughly 100ms) detect when one server fails and tell another to take its place.

Recovery consists of reading the on-disk database and its logs, and recovering client state, such as file locks, from the clients.
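A minimal sketch of that lightweight requirement, assuming a heartbeat-based detector with a 100 ms timeout; the names are hypothetical, and the real detector, election, and fencing are of course more involved.

```go
package main

import (
	"fmt"
	"time"
)

// monitorServer watches one metadata server's heartbeats and, if none arrives
// within the timeout, asks a standby to take over its partition of the
// name-space forest.
func monitorServer(name string, heartbeat <-chan struct{}, takeover func(string)) {
	timeout := 100 * time.Millisecond // the "quickly (100ms)" target from the slide
	for {
		select {
		case <-heartbeat:
			// server is alive, keep waiting
		case <-time.After(timeout):
			takeover(name)
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		for i := 0; i < 3; i++ { // a few heartbeats, then silence
			hb <- struct{}{}
			time.Sleep(30 * time.Millisecond)
		}
	}()
	monitorServer("MDS2", hb, func(s string) {
		// Recovery: read the on-disk database and logs, then re-acquire
		// client state such as file locks from the clients themselves.
		fmt.Println("taking over for", s)
	})
}
```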


SVC

SVC is a storage controller and SAN virtualizer. It is architected as a small cluster of Linux-based servers situated between a set of hosts and a set of disks.

The servers provide caching, virtualization, and copy services (e.g. snapshot) to hosts. The servers use non-volatile memory to cache data writes so that they can reply to host write operations immediately.

[Figure: hosts connected over a SAN (Fibre Channel, iSCSI, Infiniband) to pairs of SVC servers (L1, L2, L3), which sit in front of Disk1, Disk2, and Disk3]


SVC (2)

Fault-tolerance requirements imply that the data must be written consistently to at least two servers. This requirement has driven the programming environment in SVC.

The cluster is implemented using a replicated state machine approach, with a Paxos-type two-phase protocol for sending messages.

However, the system runs inside a SAN, so all communication goes over Fibre Channel, which does not support multicast. Messages must be embedded in the SCSI over Fibre Channel (FCP) protocol. This is a major deviation from most GCSes, which are IP/multicast based.
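A sketch of what "group send" looks like when the transport is point-to-point only: a two-phase fan-out over individual links. This is a deliberately simplified illustration (no ballots, no recovery), not real Paxos and not SVC's actual protocol engine; the Transport interface is hypothetical.

```go
package main

import "fmt"

// Transport abstracts a point-to-point link; in SVC this would be SCSI over
// Fibre Channel (FCP), which has no multicast, so a "broadcast" is a fan-out
// of individual sends.
type Transport interface {
	Send(node int, msg string) bool // returns false if the node is unreachable
}

// twoPhaseBroadcast delivers msg to all nodes with a two-phase exchange:
// a propose round that must gather a majority, then a commit round.
func twoPhaseBroadcast(t Transport, nodes []int, msg string) bool {
	acks := 0
	for _, n := range nodes {
		if t.Send(n, "PROPOSE "+msg) {
			acks++
		}
	}
	if acks <= len(nodes)/2 {
		return false // no majority, the message is not committed
	}
	for _, n := range nodes {
		t.Send(n, "COMMIT "+msg)
	}
	return true
}

// fakeFCP is an in-memory stand-in for the FCP transport.
type fakeFCP struct{}

func (fakeFCP) Send(node int, msg string) bool {
	fmt.Printf("node %d <- %s\n", node, msg)
	return true
}

func main() {
	ok := twoPhaseBroadcast(fakeFCP{}, []int{1, 2, 3, 4}, "write cache-line 42")
	fmt.Println("committed:", ok)
}
```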


Why GCSes are complex to build

Many general-purpose GCSes have been built over the years, for example Spread and Isis.

These systems are complex. Why? Three reasons: scalability, semantics, and the target programming environment.


Semantics

What kind of semantics should a GCS provide? Some storage systems only want limited features from their group services: StorageTank only uses membership and failure detection, while SVC uses Paxos.

Some important semantic choices include:
- Ordering: FIFO order, causal order, total order, safe order
- Messaging: all systems support multicast, but what about virtually synchronous point-to-point?
- Lightweight groups
- Handling state transfer
- Channels outside the GCS: how is access to shared disk state coordinated with in-GCS communication?
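To make the variability concrete, here is a hypothetical configuration type a "universal" GCS would have to expose. Every field below changes the protocol underneath, and the names are purely illustrative.

```go
package main

import "fmt"

// Ordering is the delivery-order guarantee an application asks for.
type Ordering int

const (
	FIFOOrder Ordering = iota
	CausalOrder
	TotalOrder
	SafeOrder // delivered only once known stable at all members
)

// GroupConfig is a hypothetical request an application would hand to a
// one-size-fits-all GCS. Each field changes the machinery underneath.
type GroupConfig struct {
	Order             Ordering
	VirtuallySyncP2P  bool // virtually synchronous point-to-point messages
	LightweightGroups bool // many logical groups over one heavyweight group
	StateTransfer     bool // GCS-managed state transfer on join/merge
	OutOfBandChannels bool // coordinate with shared-disk state outside the GCS
}

func main() {
	// A minimal request: FIFO ordering and nothing else.
	cfg := GroupConfig{Order: FIFOOrder}
	fmt.Printf("%+v\n", cfg)
}
```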


Scalability

Traditional GCSes have scalability limits. Roughly speaking, the basic virtual synchrony model cannot scale beyond several hundred nodes. GPFS on cluster supercomputers pushes these limits.

Which pieces of a GCS are required for correct functioning, and which can be dropped? Should we keep failure detection but leave virtual synchrony out? What about state transfer?


Environment

The product's programming and execution environment bears on the ability to adapt an existing GCS to it.

Many storage systems are implemented using special purpose operating systems or programming environments with limited or eccentric functionality.

This makes products more reliable, more efficient, and less vulnerable to attacks.

SVC is implemented in a special-purpose development environment, designed around its replicated state machine and intended to reduce implementation errors through rigorous design.


Environment (2)

Some important aspects of the environment are:
- Programming language: C, C++, C#, Java, ML (Caml, SML)?
- Threading: what threading model is used? Co-routines, non-preemptive threads, in-kernel threads, user-space threads?
- Kernel: does the product run in-kernel, in user space, or both?
- Resource allocation: how are memory, message buffer space, thread context, and so on allocated and managed?
- Build and execution environment: what kind of environment does the customer use?
- API: what does the GCS API look like: callbacks? socket-like?

This wide variability in requirements and environments makes it impossible to build one all-encompassing solution.
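As an illustration of the last bullet, the sketch below contrasts a callback-style and a socket-like GCS API. Both interfaces are hypothetical and only meant to show how different the resulting application code looks.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a group message as seen by the application.
type Message struct {
	Sender  string
	Payload []byte
}

// CallbackGCS is a hypothetical callback-style API: the GCS owns the event
// loop and calls back into the application on messages and view changes.
type CallbackGCS interface {
	Join(group string, onMsg func(Message), onView func(members []string)) error
}

// SocketGCS is a hypothetical socket-like API: the application owns the loop
// and pulls events, much like reading from a socket.
type SocketGCS interface {
	Join(group string) error
	Recv(group string) (Message, error) // blocks until the next message
}

// consumeSocketStyle is the receive loop an application must write itself
// when the API is socket-like.
func consumeSocketStyle(g SocketGCS, group string) {
	for {
		m, err := g.Recv(group)
		if err != nil {
			return
		}
		fmt.Printf("from %s: %d bytes\n", m.Sender, len(m.Payload))
	}
}

// fakeGCS delivers a fixed number of messages and then reports the group as
// closed; it only exists to exercise consumeSocketStyle.
type fakeGCS struct{ left int }

func (f *fakeGCS) Join(group string) error { return nil }
func (f *fakeGCS) Recv(group string) (Message, error) {
	if f.left == 0 {
		return Message{}, errors.New("group closed")
	}
	f.left--
	return Message{Sender: "mds1", Payload: []byte("lock grant")}, nil
}

func main() {
	g := &fakeGCS{left: 2}
	if err := g.Join("metadata-servers"); err != nil {
		return
	}
	consumeSocketStyle(g, "metadata-servers")
}
```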


Two research systems

Two systems that attempt to bypass the scalability limits of a GCS: Palladio and zFS.


Palladio

Palladio is a research block-storage system [Golding & Borowsky, SRDS '99]. It is made of many small storage bricks, scales from tens of bricks to thousands, and stripes data redundantly across bricks.

The main issues are the performance of basic IO operations (read, write), reliability, and scaling to very large systems.

When performing IO, clients expect one-copy serializability; operations are actually atomic and serialized even when accessing discontiguous data ranges.

[Figure: a client obtains layout information from the volume manager, reads and writes data directly to storage devices, and the system control layer handles layout changes, failure detection, events, and reactions]


Palladio layering of concerns

Palladio has three layers. Each executes in a context provided by the layer below, is optimized for its specific task, and defers difficult or expensive things to the layer below.

[Figure: timeline of the three layers: IO operations execute within layout epochs; layout management reacts to failure reports with layout changes; system control runs the system management decision process]


Palladio layering (2)

IO operations
- Perform individual read, write, and data-copy operations
- Use an optimized 2PC write and a 1-phase read (lightweight transactions)
- Execute within one layout (membership) epoch, hence under a constant membership
- Freeze on failure

Layout management
- Detects failures
- Coordinates membership change
- Fault tolerant; uses atomic commitment (AC) and leader election subprotocols
- Concerned only with a single volume

System control
- Policy determination
- Reacts to failure reports, load measures, and new resources
- Changes the layout for one or more volumes
- Global view, but hierarchically structured
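A minimal sketch of the lightweight-transaction idea in the IO layer, assuming a two-phase write (prepare on every replica, then commit) and a one-phase read against any replica. The types and the freeze-on-failure behavior are illustrative, not Palladio's actual wire protocol.

```go
package main

import (
	"errors"
	"fmt"
)

// Replica is one storage brick holding a copy of a data range.
type Replica interface {
	Prepare(off int64, data []byte) error // stage the write
	Commit(off int64) error               // expose the staged write
	Read(off int64) ([]byte, error)
}

// errFrozen stands in for "freeze on failure": the IO layer stops and hands
// the problem to layout management rather than resolving it itself.
var errFrozen = errors.New("epoch frozen: defer to layout management")

// twoPhaseWrite stages the write on every replica in the current layout
// epoch, then commits; membership is constant within the epoch, so the
// replica set cannot change underneath us.
func twoPhaseWrite(replicas []Replica, off int64, data []byte) error {
	for _, r := range replicas {
		if err := r.Prepare(off, data); err != nil {
			return errFrozen
		}
	}
	for _, r := range replicas {
		if err := r.Commit(off); err != nil {
			return errFrozen
		}
	}
	return nil
}

// onePhaseRead reads from the first reachable replica; within one epoch all
// committed replicas hold the same data.
func onePhaseRead(replicas []Replica, off int64) ([]byte, error) {
	for _, r := range replicas {
		if b, err := r.Read(off); err == nil {
			return b, nil
		}
	}
	return nil, errFrozen
}

// memReplica is an in-memory brick used to exercise the protocol sketch.
type memReplica struct{ staged, stored map[int64][]byte }

func newMemReplica() *memReplica {
	return &memReplica{staged: map[int64][]byte{}, stored: map[int64][]byte{}}
}
func (m *memReplica) Prepare(off int64, data []byte) error { m.staged[off] = data; return nil }
func (m *memReplica) Commit(off int64) error               { m.stored[off] = m.staged[off]; return nil }
func (m *memReplica) Read(off int64) ([]byte, error) {
	if b, ok := m.stored[off]; ok {
		return b, nil
	}
	return nil, errors.New("no data at offset")
}

func main() {
	reps := []Replica{newMemReplica(), newMemReplica()}
	if err := twoPhaseWrite(reps, 0, []byte("stripe unit")); err != nil {
		fmt.Println("write froze:", err)
		return
	}
	b, _ := onePhaseRead(reps, 0)
	fmt.Printf("read back %q\n", b)
}
```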


Palladio as a specialized group communication system

IO processing fills the role of an atomic multicast protocol: it addresses an arbitrary, small set of message recipients; it has exactly two kinds of "messages", read and write; and its scope is limited to a volume.

Layout management is like failure suspicion plus group membership: it has specialized membership roles (manager and data storage devices), works with the storage-oriented policy layer below it, and its scope is limited to a volume.

System control is policy that sits outside the data storage/communication system.

Scalability comes from limiting IO and layout to a single volume, and hence to few nodes, and from structuring the system control layer hierarchically.


zFS

zFS is a server-less file system developed at IBM Research. Its components are hosts and object stores. It is designed to be extremely scalable.

It can be compared to clustered/SAN file systems that use group services to maintain lock and metadata consistency. The challenge is to maintain the same levels of consistency, availability, and performance without a GCS.


An Object Store

An object store (ObS) is a storage controller that provides an object-based protocol for data access. Roughly speaking, an object is a file; users can read, write, create, and delete objects. An object store handles allocation internally and secures its network connections.

In zFS we also assume an object store supports one major lease. Holding the major lease allows access to all objects on the object store. The ObS keeps track of the current owner of the major lease; this replaces a name service for locating lease owners.
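A minimal sketch of the major-lease interaction under the assumptions above; the ObS method names and the lease term are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// MajorLease grants its holder access to every object on one object store
// until it expires.
type MajorLease struct {
	Holder  string
	Expires time.Time
}

// ObjectStore is a hypothetical ObS: it hands out the major lease and
// remembers who currently holds it, replacing a separate name service.
type ObjectStore struct {
	lease *MajorLease
}

// AcquireMajorLease grants the lease to host if it is free or expired, and
// otherwise returns the current holder so the caller can contact it.
func (o *ObjectStore) AcquireMajorLease(host string, term time.Duration) (granted bool, holder string) {
	now := time.Now()
	if o.lease == nil || now.After(o.lease.Expires) {
		o.lease = &MajorLease{Holder: host, Expires: now.Add(term)}
		return true, host
	}
	return false, o.lease.Holder
}

func main() {
	obs := &ObjectStore{}
	ok, holder := obs.AcquireMajorLease("host1", 30*time.Second)
	fmt.Println("host1 granted:", ok, "holder:", holder)
	ok, holder = obs.AcquireMajorLease("host2", 30*time.Second)
	fmt.Println("host2 granted:", ok, "holder:", holder) // host2 learns host1 holds it
}
```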


Environment and assumptions

A zFS configuration can encompass a large number of ObSes and hosts. Hosts can fail; ObSes are assumed to be reliable, since a large body of work has been invested in making storage controllers highly reliable.

If an ObS fails, then a client requiring a piece of data on it will wait until that disk is restored.

zFS assumes the timed-asynchronous message-passing model.


File-system layout

A file is an object (a simplification). A directory is also an object. A directory entry contains a string name and a pointer to the object holding the file or directory; the pointer is the pair (object store, object id).

The file-system data and metadata are spread throughout the set of ObSes.

[Figure: a directory object on ObS1 with entries "foo" -> (2, 4) and "bar" -> (5, 1), pointing at object 4 on ObS2 and object 1 on ObS5]
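A minimal sketch of that layout, assuming the pointer really is just an (object-store, object-id) pair; the type names are illustrative.

```go
package main

import "fmt"

// ObjectID names an object within one object store.
type ObjectID uint64

// ObjectPointer locates an object anywhere in the system: which object
// store holds it, and its id on that store.
type ObjectPointer struct {
	ObS uint32   // object-store id
	OID ObjectID // object id within that store
}

// DirEntry is one entry in a directory object: a name plus a pointer to the
// object holding the file or sub-directory.
type DirEntry struct {
	Name   string
	Target ObjectPointer
}

func main() {
	// The two entries from the figure: "foo" -> (2, 4), "bar" -> (5, 1).
	dir := []DirEntry{
		{Name: "foo", Target: ObjectPointer{ObS: 2, OID: 4}},
		{Name: "bar", Target: ObjectPointer{ObS: 5, OID: 1}},
	}
	for _, e := range dir {
		fmt.Printf("%s -> ObS%d, oid=%d\n", e.Name, e.Target.ObS, e.Target.OID)
	}
}
```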


LMGRs

For each ObS in use there is a lease manager (LMGR). The LMGR manages access to the ObS; a strict cache-consistency policy is used.

To locate LMGRx, a host accesses ObS X and queries it for the current LMGR. If none exists, the host itself becomes LMGRx. There is no name server for locating LMGRs.

All operations are lease based. This makes the LMGR stateless, so no replication of LMGR state is required.

[Figure: an application uses lease managers L1, L2, and L3, one per object store ObS1, ObS2, and ObS3]
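The host-side lookup might look like the sketch below, assuming the query and become-manager operations described above; the names are illustrative, and the real protocol must of course handle races between hosts.

```go
package main

import "fmt"

// ObS is the slice of the object-store interface needed to locate a lease
// manager: query the current holder, or claim the role if it is free.
type ObS interface {
	CurrentLMGR() (host string, ok bool) // ok is false if no one holds the lease
	BecomeLMGR(host string) bool         // true if this host won the role
}

// locateLMGR returns the host acting as lease manager for obs, becoming the
// manager itself if none exists. There is no separate name server: the ObS
// is the single authority on who manages it.
func locateLMGR(obs ObS, self string) string {
	if host, ok := obs.CurrentLMGR(); ok {
		return host
	}
	if obs.BecomeLMGR(self) {
		return self
	}
	// Someone else raced us and became the manager; ask again.
	host, _ := obs.CurrentLMGR()
	return host
}

// fakeObS is an in-memory stand-in for exercising the lookup logic.
type fakeObS struct{ holder string }

func (f *fakeObS) CurrentLMGR() (string, bool) { return f.holder, f.holder != "" }
func (f *fakeObS) BecomeLMGR(h string) bool {
	if f.holder == "" {
		f.holder = h
		return true
	}
	return false
}

func main() {
	obs := &fakeObS{}
	fmt.Println(locateLMGR(obs, "hostA")) // hostA becomes the LMGR
	fmt.Println(locateLMGR(obs, "hostB")) // hostB finds hostA
}
```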


Meta Data operations

Metadata operations (create file, delete file, create directory, ...) are difficult to perform.

For example, file creation should be an atomic operation. It is implemented by (1) creating a directory entry and (2) creating a new object with the requested attributes. A host can fail in between these two steps, leaving the file system inconsistent.

A delete operation involves removing a directory entry and removing an object. A host performing the delete can fail midway through the operation, leaving dangling objects that are never freed.


Comparison

It is difficult to compare zFS and Palladio; this is not an apples-to-apples comparison.

Palladio:
- A block-storage system
- Uses group services outside the IO path
- Performs IO to a small set of disks per IO
- Uses lightweight transactions
- Does not assume reliable disks
- Uses a name server to locate internal services

zFS:
- A file system
- Does not use group services
- Performs IO to a small set of disks per transaction
- Uses full, heavyweight transactions
- Assumes reliable disks
- Does not use an internal name server


Summary

It is the authors' opinion that group communication is complex and is going to remain complex, without a one-size-fits-all solution.

There are three major reasons for this: semantics, scalability, and the operating environment.

Different customers want different semantics from a GCS, have different scalability and performance requirements, and have a wide range of programming environments.

We believe that in the future users will continue to tailor their proprietary GCSes to their particular environment.