IBM Research, SRDS 2003 Workshop, © 2003 IBM Corporation
Group Communication, Still Complex After All These Years
Position paper, by Richard Golding and Ohad Rodeh
Table of contents
- Introduction
- Three specific examples
- Why GCSes are complex to build
- A look at two interesting research systems
- Summary
Disclaimer
- This is a position paper
- It reflects the authors' opinion
- It does not convey "scientific truth"
- The analysis comes from the authors' experience in
  - IBM storage research
  - HP storage research
  - Published papers on IBM and HP storage systems
Introduction
- An ideal GCS would be "one-size-fits-all": one system that fits the needs of all applications
- The authors claim that this is not feasible
- The application domain: storage systems that use group communication
Three examples
- IBM General Parallel File System™ (GPFS)
- IBM Total Storage File System™ (a.k.a. StorageTank)
- IBM Total Storage Volume Controller™ (SVC)
GPFS
- GPFS is the file system at the core of the SP™ and ASCI White supercomputers
- It scales from a few nodes to a few hundred
[Figure: nodes N1..N5 connected over the SP2 interconnect to shared disks Disk1..Disk3]
GPFS (2)
- The group-services component (HACMP™) allows cluster nodes to
  - Heartbeat each other
  - Compute membership
  - Perform multi-round voting schemes
  - Execute merge/failure protocols
- When a node fails, all of its file-system read/write tokens are lost
- The recovery procedure needs to regenerate the lost tokens so that file-system operations can continue (see the sketch below)
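As a rough illustration, a membership-change handler of the kind GPFS needs might look like the sketch below. The callback shape, the TokenManager class, and the token naming are hypothetical illustrations, not the actual HACMP or GPFS interfaces.

```python
# Hypothetical sketch: reacting to a group-services membership change by
# regenerating the read/write tokens that a failed node held.
# None of these classes correspond to real GPFS/HACMP interfaces.

class TokenManager:
    def __init__(self):
        # token -> node currently holding it, e.g. {"/a/b:write": "N3"}
        self.holders = {}

    def tokens_held_by(self, node):
        return [t for t, n in self.holders.items() if n == node]

    def revoke(self, token):
        del self.holders[token]


def on_membership_change(old_members, new_members, token_mgr):
    """Called by group services after its voting/failure protocols complete."""
    failed = old_members - new_members
    for node in failed:
        # All tokens held by the failed node are lost; revoke them so that
        # surviving nodes can re-acquire them and file-system ops continue.
        for token in token_mgr.tokens_held_by(node):
            token_mgr.revoke(token)


# Example: N3 fails while holding a write token on /a/b.
mgr = TokenManager()
mgr.holders = {"/a/b:write": "N3", "/a/c:read": "N1"}
on_membership_change({"N1", "N2", "N3"}, {"N1", "N2"}, mgr)
print(mgr.holders)   # {'/a/c:read': 'N1'}
```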
StorageTank
- A SAN file system
- Clients can perform IO directly to network-attached disks
[Figure: hosts (Host1..Host4) and a meta-data server cluster (MDS1..MDS3) connected over IP; control messages flow between hosts and the MDS cluster, hosts read/write directly to disks Disk1..Disk3 over the SAN, and the MDS cluster performs fencing]
StorageTank (2)
- Contains a set of meta-data servers responsible for all file-system meta-data: the directory structure and file attributes
- The file-system name space is partitioned into a forest of trees
- The meta-data for each tree is stored in a small database on shared disks
- Each tree is managed by one meta-data server, so the servers share nothing between them
- This leads to a rather lightweight requirement on group services: they only need to quickly (~100 ms) detect when one server fails and tell another to take its place (see the sketch below)
- Recovery consists of reading the on-disk database and its logs, and recovering client state such as file locks from the clients
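A minimal sketch of this lightweight failover requirement, assuming a hypothetical MDSCluster class and heartbeat interface rather than StorageTank's real group services:

```python
# Hypothetical sketch of StorageTank-style failover: the group service only
# has to detect a dead meta-data server quickly and hand its name-space
# tree(s) to a survivor. Class and method names are illustrative only.
import time

HEARTBEAT_TIMEOUT = 0.1   # the ~100 ms detection target mentioned above

class MDSCluster:
    def __init__(self, servers, tree_assignment):
        self.last_beat = {s: time.monotonic() for s in servers}
        self.tree_assignment = tree_assignment   # tree name -> owning server

    def heartbeat(self, server):
        self.last_beat[server] = time.monotonic()

    def check_failures(self):
        now = time.monotonic()
        for server, beat in list(self.last_beat.items()):
            if now - beat > HEARTBEAT_TIMEOUT:
                self.fail_over(server)

    def fail_over(self, dead):
        del self.last_beat[dead]
        survivor = next(iter(self.last_beat))
        for tree, owner in self.tree_assignment.items():
            if owner == dead:
                # Recovery: the survivor replays the tree's on-disk database
                # and logs, then reclaims client state (e.g. file locks).
                self.tree_assignment[tree] = survivor
```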
SVC
- SVC is a storage controller and SAN virtualizer
- It is architected as a small cluster of Linux-based servers situated between a set of hosts and a set of disks
- The servers provide caching, virtualization, and copy services (e.g. snapshot) to hosts
- The servers use non-volatile memory to cache data writes so that they can immediately reply to host write operations
[Figure: hosts connected over a SAN (Fibre Channel, iSCSI, Infiniband) to a small cluster of SVC server pairs L1, L2, L3, which sit in front of disks Disk1..Disk3]
SVC (2)
- Fault-tolerance requirements imply that the data must be written consistently to at least two servers
- This requirement has driven the programming environment in SVC
- The cluster is implemented using a replicated state-machine approach, with a Paxos-type two-phase protocol for sending messages (a toy sketch follows)
- However, the system runs inside a SAN, so all communication is through Fibre Channel, which does not support multicast
- Messages must be embedded in the SCSI over Fibre Channel (FCP) protocol
- This is a major deviation from most GCSes, which are IP/multicast based
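A toy sketch of the two-phase, quorum-acknowledged update flow that a replicated state machine of this kind implies. It deliberately ignores the FCP transport, leader election, and the rest of Paxos; none of the names come from SVC itself.

```python
# Toy sketch of a two-phase update on a replicated state machine:
# the sender only completes a host write once at least two servers have
# acknowledged the proposal. A simplification, not SVC's actual protocol.

class Replica:
    def __init__(self, name):
        self.name = name
        self.pending = {}
        self.state = {}          # stands in for the non-volatile write cache

    def propose(self, seq, update):
        self.pending[seq] = update
        return True              # acknowledgement

    def commit(self, seq):
        key, value = self.pending.pop(seq)
        self.state[key] = value


def replicated_write(replicas, seq, update, quorum=2):
    # Phase 1: propose to all replicas, count acknowledgements.
    acks = sum(1 for r in replicas if r.propose(seq, update))
    if acks < quorum:
        raise RuntimeError("not enough replicas acknowledged the write")
    # Phase 2: commit everywhere; the host write can now be acknowledged.
    for r in replicas:
        r.commit(seq)


replicas = [Replica("L1"), Replica("L2")]
replicated_write(replicas, seq=1, update=("block-42", b"data"))
print(replicas[0].state)   # {'block-42': b'data'}
```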
Why GCSes are complex to build
- Many general-purpose GCSes have been built over the years: Spread, Isis
- These systems are complex. Why? Three reasons:
  - Scalability
  - Semantics
  - Target programming environment
Semantics
- What kind of semantics should a GCS provide?
- Some storage systems only want limited features from their group services
  - StorageTank only uses membership and failure detection
  - SVC uses Paxos
- Some important semantics include (a hypothetical API sketch follows this list):
  - Message ordering: FIFO order, causal order, total order, safe order
  - Messaging: all systems support multicast, but what about virtually synchronous point-to-point?
  - Lightweight groups
  - Handling state transfer
  - Channels outside the GCS: how is access to shared disk state coordinated with in-GCS communication?
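To make the ordering options concrete, here is a hypothetical GCS API surface showing where these choices would appear to an application. The enum values and method signatures are illustrative assumptions, not taken from Spread, Isis, or any other real toolkit.

```python
# Hypothetical GCS API surface: the application picks delivery semantics
# per message/group. Names are illustrative, not from a real toolkit.
from enum import Enum, auto

class Order(Enum):
    FIFO = auto()     # per-sender order
    CAUSAL = auto()   # respects happened-before
    TOTAL = auto()    # same delivery order at all members
    SAFE = auto()     # delivered only once stable at all members

class Group:
    def join(self, name, on_view_change, on_deliver): ...
    def multicast(self, payload, order=Order.TOTAL): ...
    def send(self, member, payload): ...            # virtually synchronous point-to-point?
    def request_state_transfer(self, member): ...   # who handles state transfer?
```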
Scalability
- Traditional GCSes have scalability limits
- Roughly speaking, the basic virtual synchrony model cannot scale beyond several hundred nodes
- GPFS on cluster supercomputers pushes these limits
- What pieces of a GCS are required for correct functioning, and what can be dropped?
  - Should we keep failure detection but leave virtual synchrony out?
  - What about state transfer?
Environment
- The product's programming and execution environment bears on the ability to adapt an existing GCS to it
- Many storage systems are implemented using special-purpose operating systems or programming environments with limited or eccentric functionality
  - This makes products more reliable, more efficient, and less vulnerable to attacks
- SVC is implemented in a special-purpose development environment
  - Designed around its replicated state machine
  - Intended to reduce errors in implementation through rigorous design
Environment (2)
- Some important aspects of the environment are:
  - Programming language: C, C++, C#, Java, ML (Caml, SML)?
  - Threading: what threading model is used? Co-routines, non-preemptive, in-kernel threads, user-space threads
  - Kernel: is the product in-kernel, user-space, or both?
  - Resource allocation: how are memory, message buffer space, thread context, and so on allocated and managed?
  - Build and execution environment: what kind of environment is used by the customer?
  - API: what does the GCS API look like? Callbacks? Socket-like? (see the sketch after this list)
- This wide variability in requirements and environments makes it impossible to build one all-encompassing solution
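To illustrate the last point, here is a hypothetical contrast between the two API shapes; neither corresponds to any real toolkit's interface.

```python
# Two common GCS API shapes, sketched hypothetically.

# Callback style: the GCS drives the application.
class CallbackEndpoint:
    def __init__(self, on_deliver, on_view_change):
        self.on_deliver = on_deliver          # called for each delivered message
        self.on_view_change = on_view_change  # called on each membership change

# Socket-like style: the application pulls events from the GCS.
class SocketEndpoint:
    def recv(self, timeout=None):
        """Blocks until the next event (message or view change) arrives."""
        raise NotImplementedError
```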
Two research systems
- Two systems that attempt to bypass the scalability limits of GCSes
  - Palladio
  - zFS
Palladio
- A research block-storage system [Golding & Borowsky, SRDS '99]
  - Made of lots of small storage bricks
  - Scales from tens of bricks to thousands
  - Data striped, redundant across bricks
- Main issues:
  - Performance of basic IO operations (read, write)
  - Reliability
  - Scaling to very large systems
- When performing IO:
  - Expect one-copy serializability
  - Actually atomic and serialized even when accessing discontiguous data ranges
[Figure: a client, the volume manager / storage devices, and the system-control layer; the client obtains layout information and reads/writes data to the storage devices, while layout changes, failure-detection events, and reactions flow through system control]
Palladio layering of concerns
- Three layers
- Each executes in the context provided by the layer below
- Each is optimized for its specific task and defers difficult/expensive work to the layer below
[Figure: a timeline of IO operations executing within successive layout epochs; the layout-management layer starts a new layout epoch on each layout change or failure report, beneath the system-management decision process of the system-control layer]
Palladio layering (2)
- IO operations
  - Perform individual read, write, and data-copy operations
  - Use an optimized 2PC write and a 1-phase read (lightweight transactions; sketched after this list)
  - Execute within one layout (membership) epoch -> constant membership
  - Freeze on failure
- Layout management
  - Detects failures
  - Coordinates membership change
  - Fault tolerant; uses atomic-commit (AC) and leader-election subprotocols
  - Concerned only with a single volume
- System control
  - Policy determination
  - Reacts to failure reports, load measures, new resources
  - Changes layout for one or more volumes
  - Global view, but hierarchically structured
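A minimal sketch of the lightweight-transaction idea, assuming hypothetical Device objects; the real Palladio protocol involves replica layouts and recovery paths that this omits.

```python
# Sketch of Palladio-style lightweight transactions: a two-phase write and a
# one-phase read that only proceed inside the current layout epoch and
# freeze when the epoch has changed (i.e. after a failure or layout change).
# The classes and method names are illustrative, not Palladio's actual code.

class EpochChanged(Exception):
    """Raised when a device is no longer in the epoch the client assumed."""

class Device:
    def __init__(self):
        self.epoch = 1
        self.staged = {}
        self.blocks = {}

    def prepare(self, epoch, block, data):
        if epoch != self.epoch:
            raise EpochChanged()      # IO freezes; the layout layer takes over
        self.staged[block] = data

    def commit(self, block):
        self.blocks[block] = self.staged.pop(block)

    def read(self, epoch, block):
        if epoch != self.epoch:
            raise EpochChanged()      # 1-phase read, still epoch-checked
        return self.blocks.get(block)


def two_phase_write(devices, epoch, block, data):
    for d in devices:                 # phase 1: stage on every replica
        d.prepare(epoch, block, data)
    for d in devices:                 # phase 2: make the write visible
        d.commit(block)


devs = [Device(), Device()]
two_phase_write(devs, epoch=1, block=7, data=b"x")
```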
Palladio as a specialized group communication system
- IO processing fills the role of an atomic multicast protocol
  - For an arbitrary, small set of message recipients
  - With exactly two kinds of "messages": read and write
  - Scope limited to a volume
- Layout management is like failure suspicion + group membership
  - Specialized membership roles: manager and data-storage devices
  - Works with the storage-oriented policy layer below it
  - Scope limited to a volume
- System control is policy that lives outside the data storage/communication system
- Scalability comes from
  - Limiting IO and layout to a single volume, and hence few nodes
  - Structuring the system-control layer hierarchically
zFS
- zFS is a server-less file system developed at IBM Research
- Its components are hosts and object stores
- It is designed to be extremely scalable
- It can be compared to clustered/SAN file systems that use group services to maintain lock and meta-data consistency
- The challenge is to maintain the same levels of consistency, availability, and performance without a GCS
An Object Store
- An object store (ObS) is a storage controller that provides an object-based protocol for data access
- Roughly speaking, an object is a file; users can read/write/create/delete objects
- An object store handles allocation internally and secures its network connections
- In zFS we also assume an object store supports one major lease (sketched below)
  - Taking the major lease allows access to all objects on the object store
  - The ObS keeps track of the current owner of the major lease; this replaces a name service for locating lease owners
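A small sketch of what the assumed major-lease interface might look like. The ObjectStore class, the lease duration, and the method names are illustrative assumptions, not the zFS or ObS specification.

```python
# Hypothetical sketch of the major-lease mechanism: taking the major lease
# grants access to all objects on the object store, and the ObS itself
# remembers who holds it, replacing a separate name service.
import time

class ObjectStore:
    LEASE_SECONDS = 30       # illustrative lease duration

    def __init__(self):
        self.owner = None
        self.expiry = 0.0

    def take_major_lease(self, host):
        now = time.monotonic()
        if self.owner is None or now > self.expiry:
            self.owner, self.expiry = host, now + self.LEASE_SECONDS
        return self.owner    # either the caller or the current holder

    def current_owner(self):
        return self.owner if time.monotonic() <= self.expiry else None
```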
Environment and assumptions
- A zFS configuration can encompass a large number of ObSes and hosts
- Hosts can fail; ObSes are assumed to be reliable
  - A large body of work has been invested in making storage controllers highly reliable
  - If an ObS fails, then a client requiring a piece of data on it will wait until that disk is restored
- zFS assumes the timed-asynchronous message-passing model
File-system layout
- A file is an object (simplification)
- A directory is an object
- A directory entry contains
  - A string name
  - A pointer to the object holding the file/directory
  - The pointer = (object disk, object id) (see the sketch after the figure)
- The file-system data and meta-data are spread throughout the set of ObSes
[Figure: directory objects (oid=6, oid=19) on ObS1 containing entries "bar" -> (5, 1) and "foo" -> (2, 4); the referenced objects live on ObS5 (oid=1, with another object oid=9) and ObS2 (oid=4)]
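A tiny illustrative sketch of the directory-entry pointer shown above; the dataclass is a hypothetical stand-in (zFS itself is not written in Python).

```python
# Sketch of the zFS naming scheme: a directory entry maps a string name to
# an (object store, object id) pair, so meta-data can live on any ObS.
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectPointer:
    obs_id: int      # which object store holds the object
    object_id: int   # the object's id on that store

# A directory is itself an object whose contents are entries like these.
directory = {"foo": ObjectPointer(obs_id=2, object_id=4),
             "bar": ObjectPointer(obs_id=5, object_id=1)}

print(directory["foo"])   # ObjectPointer(obs_id=2, object_id=4)
```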
LMGRs
- For each ObS in use there is a lease manager (LMGR)
  - The LMGR manages access to the ObS
  - A strict cache-consistency policy is used
- To locate LMGRx, a host accesses ObS X and queries who its LMGR is; if none exists, then the host becomes LMGRx (see the sketch after the figure)
  - There is no name server for locating LMGRs
- All operations are lease based
  - This makes the LMGR stateless
  - No replication of LMGR state is required
[Figure: an application interacting with lease managers L1, L2, L3, each managing one of ObS1, ObS2, ObS3]
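Building on the hypothetical ObjectStore sketch from the object-store slide, LMGR discovery might look roughly like this; the query interface is an assumption, not the real zFS protocol.

```python
# Sketch of LMGR discovery: a host asks ObS X who currently manages it; if
# nobody does, the host takes the (lease-based) role itself.
# Assumes the illustrative ObjectStore class sketched earlier.

def locate_or_become_lmgr(obs, me):
    owner = obs.current_owner()          # the ObS remembers the lease holder
    if owner is not None:
        return owner                     # someone else already manages ObS X
    return obs.take_major_lease(me)      # otherwise try to become LMGRx
```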
Meta Data operations
- Metadata operations are difficult to perform: create file, delete file, create directory, …
- For example, file creation should be an atomic operation. It is implemented by
  - Creating a directory entry
  - Creating a new object with the requested attributes
- A host can fail in between these two steps, leaving the file system inconsistent (see the sketch below)
- A delete operation involves:
  - Removing a directory entry
  - Removing an object
- A host performing the delete can fail midway through the operation
  - This leaves dangling objects that are never freed
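A sketch of why the two-step create is problematic, using a hypothetical in-memory ObS stand-in; the method names and layout are illustrative only, not the zFS implementation.

```python
# Sketch of why file creation needs to be atomic: it takes two separate
# steps, and a host crash between them leaves the file system inconsistent.
# FakeObS is a hypothetical stand-in, not the zFS ObS API.

class FakeObS:
    def __init__(self, obs_id):
        self.obs_id = obs_id
        self.objects = {}      # oid -> attributes
        self.entries = {}      # (dir_oid, name) -> (obs_id, oid)
        self.next_oid = 1

def create_file(dir_obs, dir_oid, name, file_obs, attrs, crash_between=False):
    # Step 1: create the directory entry pointing at the object-to-be.
    oid = file_obs.next_oid
    file_obs.next_oid += 1
    dir_obs.entries[(dir_oid, name)] = (file_obs.obs_id, oid)
    if crash_between:
        return None   # crash here: the entry points at an object that never existed
    # Step 2: create the object itself with the requested attributes.
    file_obs.objects[oid] = attrs
    return oid

d, f = FakeObS(1), FakeObS(2)
create_file(d, dir_oid=9, name="foo", file_obs=f, attrs={}, crash_between=True)
print((9, "foo") in d.entries, f.objects)   # True {}  -> inconsistent state
```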
Comparison
- It is difficult to compare zFS and Palladio; this isn't an apples-to-apples comparison
- Palladio:
  - A block-storage system
  - Uses group services outside the IO path
  - Performs IO to a small set of disks per IO operation
  - Uses lightweight transactions
  - Does not use reliable disks
  - Uses a name server to locate internal services
- zFS:
  - A file system
  - Does not use group services
  - Performs IO to a small set of disks per transaction
  - Uses full, heavyweight transactions
  - Uses reliable disks
  - Does not use an internal name server
Summary
- It is the authors' opinion that group communication is complex and is going to remain complex, without a one-size-fits-all solution
- There are three major reasons for this: semantics, scalability, and the operating environment
- Different customers
  - Want different semantics from a GCS
  - Have different scalability and performance requirements
  - Have a wide range of programming environments
- We believe that in the future users will continue to tailor their proprietary GCSes to their particular environment