A. Reinefeld, F. Schintke, ZIB 1
Replication and Consistency in Cloud File Systems
Alexander Reinefeld and Florian Schintke
Zuse-Institut Berlin
Cloud-Computing-Tag at the IKMZ of BTU Cottbus, 14.04.2011
A. Reinefeld, F. Schintke, ZIB 2
Let’s start with a little quiz
Who invented Cloud Computing?
a) Werner Vogels
b) Ian Foster
c) Konrad Zuse
The correct answer is … c)
“Eventually, computing centers, too, will be networked with one another over telecommunication lines, …”
Konrad Zuse in “Rechnender Raum” (1969), translated from the German
Konrad Zuse (22.06.1910 - 18.12.1995)
A. Reinefeld, F. Schintke, ZIB 3
Zuse Institute Berlin
Research institute for applied mathematics
and computer science
• Peter Deuflhard
chair for scientific computing, FU Berlin
• Martin Grötschel
chair for discrete mathematics, TU Berlin
• Alexander Reinefeld
chair for computer science, HU Berlin
A. Reinefeld, F. Schintke, ZIB 4
HPC Systems @ ZIB
1984: Cray 1M, 160 MFlops
1987: Cray X-MP, 471 MFlops
1994: Cray T3D, 38 GFlops
1997: Cray T3E, 486 GFlops
2002: IBM p690, 2.5 TFlops
2008/09: SGI ICE, XE, 150 TFlops
A 1,000,000-fold performance increase in 25 years (1984-2009).
HLRN
• 2 sites
• 98 computer racks
• 26,112 CPU cores
• 128 TB memory
• 1,620 TB disk
• 300 TFlops peak performance
Storage
• 3 SL8500 robots
• 39 tape drives
• 19,000 slots
A. Reinefeld, F. Schintke, ZIB 7
What is Cloud Computing?
Cloud Computing = Grid Computing on Datacenters?
• not that simple …
Cloud and Grid both abstract resources through interfaces.
• Grid: via new middleware.
Requires Grid APIs.
• Cloud: via virtualization.
Allows legacy APIs.
• Software as a Service (SaaS): Applications, Application Services
• Platform as a Service (PaaS): Programming Environment, Execution Environment
• Infrastructure as a Service (IaaS): Infrastructure Services, Resource Set
Alexander Reinefeld, ZIB 8
Why Cloud?
Pros
• It scales
because it's their resources, not yours
• It's simple
because they operate it
• Pay for what you need
don't pay for empty spinning disks
Cons
• It's expensive
Amazon S3 charges $0.15 / GB / month
= $1,800 / TB / year
• It's not 100% secure
S3 now lets you bring your own RSA key pair. But: would you put your
bank account into the cloud?
• It's not 100% available
S3 provides “service credits” if availability
drops (10% for 99.0-99.9% availability)
Alexander Reinefeld, ZIB 9
File System Landscape
• PC, local system: ext3, ZFS, NTFS
• Network FS / centralized: NFS, SMB, AFS/Coda
• Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, ...
• Cloud / Grid: Grid File System (GFarm), gridftp, GDM
Alexander Reinefeld, ZIB 10
Consistency, Availability, Partition tolerance:
Pick two of three!
Consistency: All clients have the
same view of the data.
Availability: Each client can
always read and write.
Partition tolerance: Operations
will complete, even if individual
components are unavailable.
Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.
• A + P: Amazon S3, Mercurial, Coda/AFS
• C + A: single server, Linux HA (one data center)
• C + P: distributed databases, distributed file systems
Alexander Reinefeld, ZIB 11
Which semantics do you expect?
Distributed file systems should provide C + P
But the recent hype has been about A + P + eventual consistency (e.g. Amazon S3)
Alexander Reinefeld, ZIB 12
Grid File System
• provides access to heterogeneous storage resources, but
• the middleware adds complexity and vulnerability
• requires explicit file transfer
• whole-file transfer: latency until first access, bandwidth, disk storage
• also partial file access (gridftp) and pattern access (falls)
• no consistency among replicas
• the user must take care of it
• no access control on replicas
Alexander Reinefeld, ZIB 13
Cloud File System: XtreemFS
Focus on
• data distribution
• data replication
• object-based
Key features
• MRCs are separated
from OSDs
• the fat client is the “link”
MRC = metadata & replica catalogue
OSD = object storage device
Client = file system interface
A. Reinefeld, F. Schintke, ZIB 14
A closer look at XtreemFS
• Features
• distributed, replicated, POSIX-compliant file system
• Server software (Java) runs on Linux, OS X, Solaris
• Client software (C++) runs on Linux, OS X, Windows
• secure: X.509 and SSL
• open source (GPL)
• Assumptions
• loosely synchronized clocks with a bounded maximum drift (needed for OSD lease
negotiation; a reasonable assumption in clouds; see the sketch after this list)
• an upper bound on the round-trip time
• no need for FIFO channels (runs on either TCP or UDP)
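As a rough illustration of why bounded clock drift matters for lease negotiation, here is a minimal Java sketch; it is not XtreemFS or Flease code, and the names and the drift bound MAX_DRIFT_MS are assumptions made for this example. A replica only acts as master while its lease would still be valid even if its local clock lagged real time by the maximum drift.

    // Hypothetical lease-validity check under a bounded clock drift.
    class Lease {
        static final long MAX_DRIFT_MS = 500;  // assumed worst-case drift, not an XtreemFS constant
        final long expiresAtMs;                // expiry time as stamped by the lease coordinator

        Lease(long expiresAtMs) { this.expiresAtMs = expiresAtMs; }

        // Safe even if the local clock lags real time by up to MAX_DRIFT_MS:
        // the holder stops acting as master MAX_DRIFT_MS before the nominal expiry.
        boolean stillValid(long localNowMs) {
            return localNowMs + MAX_DRIFT_MS < expiresAtMs;
        }
    }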
A. Reinefeld, F. Schintke, ZIB 15
XtreemFS Interfaces
Alexander Reinefeld, ZIB 16
File access protocol
[Sequence diagram: a user application accesses files through the Linux VFS and the XtreemFS client (FUSE), which talks to the MRC and the OSDs; the diagram shows a file size update, Update(Cap, FileSize=128k).]
A. Reinefeld, F. Schintke, ZIB 17
Client
• gets the list of OSDs from the MRC
• gets a capability (signed by the MRC) per file
• selects the best OSD(s) for parallel I/O
• various striping policies: scatter/gather, RAIDx, erasure codes (see the sketch below)
• scalable and fast access
• no communication between OSD and MRC needed
• client is the “missing link”
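To make the striping idea concrete, here is a minimal Java sketch of how a client could map a byte offset to an object and an OSD under a simple round-robin (RAID-0-style) policy. The class and field names are illustrative assumptions, not the XtreemFS client API.

    // Illustrative round-robin striping: object index and OSD for a byte offset.
    class StripingPolicy {
        final int stripeSizeBytes;   // e.g. 1 MB chunks, as in the benchmark slide later on
        final int numOsds;           // number of OSDs holding the file's stripes

        StripingPolicy(int stripeSizeBytes, int numOsds) {
            this.stripeSizeBytes = stripeSizeBytes;
            this.numOsds = numOsds;
        }

        long objectIndex(long byteOffset) {   // which object the offset falls into
            return byteOffset / stripeSizeBytes;
        }

        int osdForObject(long objectIndex) {  // round-robin placement across OSDs
            return (int) (objectIndex % numOsds);
        }
    }

With 1 MB stripes and 29 OSDs, for example, an offset of 30 MB falls into object 30, which is stored on OSD 1.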
A. Reinefeld, F. Schintke, ZIB 18
MRC – Metadata and Replication Catalogue
• provides
• open(), close(), readdir(), rename(), …
• attributes per file: size, last access, access rights, location (OSDs), …
• capability (file handle) to authorize a client to access objects on OSDs
• implemented with a key/value store (BabuDB), see the sketch after this list
• fast index
• append-only DB
• allows snapshots
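As a rough illustration of keeping file system metadata in an ordered key/value store, here is a minimal Java sketch; the key layout is an assumption for this example and not BabuDB's actual schema.

    // Hypothetical key layout: one key per attribute, ordered so that
    // readdir() becomes a cheap prefix scan.
    import java.util.TreeMap;

    class MetadataStore {
        private final TreeMap<String, String> kv = new TreeMap<>();  // stands in for the index

        void create(String volume, String path, long size, String osdList) {
            String prefix = volume + "/" + path;
            kv.put(prefix + ":size", Long.toString(size));
            kv.put(prefix + ":osds", osdList);
        }

        Iterable<String> readdir(String volume, String dir) {
            String from = volume + "/" + dir + "/";               // prefix scan over the directory
            return kv.subMap(from, from + Character.MAX_VALUE).keySet();
        }
    }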
A. Reinefeld, F. Schintke, ZIB 19
OSD – Object Storage Device
• serves file content operations
• read(), write(), truncate(), flush(), …
• implements object replication
• also partial replicas for read access
• data is filled in on demand
• gets the OSD list from the MRC
• a slave OSD redirects requests to the master OSD
• write operations run only on the master OSD
• POSIX requires linearizable reads, hence reads are redirected as well (see the sketch below)
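A minimal sketch of the redirect logic described above; the class and method names are assumptions for illustration, not the XtreemFS OSD interface.

    // Hypothetical master/slave redirect: both reads and writes are served
    // by the current master, which keeps POSIX reads linearizable.
    class ReplicatedObjectHandler {
        private final String localOsdUuid;
        private volatile String currentMasterUuid;   // determined via the lease service

        ReplicatedObjectHandler(String localOsdUuid, String currentMasterUuid) {
            this.localOsdUuid = localOsdUuid;
            this.currentMasterUuid = currentMasterUuid;
        }

        // Returns null if this OSD may serve the request, else the OSD to redirect to.
        String redirectTarget() {
            return localOsdUuid.equals(currentMasterUuid) ? null : currentMasterUuid;
        }
    }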
A. Reinefeld, F. Schintke, ZIB 20
OSD – Object Storage Device
• Which OSD to select? Possible criteria (see the sketch after this list):
• object list
• bandwidth
• rarest first
• network coordinates, datacenter map, …
• prefetching (for partial replicas)
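As an illustration of a pluggable replica/OSD selection policy, the sketch below simply orders candidates by an estimated cost (for instance a round-trip-time estimate derived from network coordinates or a datacenter map). The interface and scoring are assumptions for this example, not the actual XtreemFS policy API.

    // Hypothetical OSD selection: pick the candidate with the lowest estimated cost.
    import java.util.Comparator;
    import java.util.List;

    class OsdCandidate {
        final String uuid;
        final double estimatedCost;   // lower is better, e.g. estimated RTT

        OsdCandidate(String uuid, double estimatedCost) {
            this.uuid = uuid;
            this.estimatedCost = estimatedCost;
        }
    }

    class OsdSelection {
        static OsdCandidate pickBest(List<OsdCandidate> candidates) {
            return candidates.stream()
                    .min(Comparator.comparingDouble((OsdCandidate c) -> c.estimatedCost))
                    .orElseThrow(() -> new IllegalStateException("no OSD available"));
        }
    }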
A. Reinefeld, F. Schintke, ZIB 21
OSD – Object Storage Device
• implements concurrency control for replica consistency
• POSIX-compliant
• master/slave replication with failover
• group membership service provided by MRC
• lease service “Flease”: distributed, scalable and failure-tolerant
• 50,000 leases/sec with 30 OSDs
• based on quorum consensus (Paxos)
A. Reinefeld, F. Schintke, ZIB 22
Quorum consensus
• Basic algorithm
• When a majority is informed, every other majority has at least one member with up-to-date information (e.g. with five nodes, any two majorities of size three overlap in at least one node).
• A minority may crash at any time.
• Paxos consensus
• Step 1: check whether a consensus value c has already been established
• Step 2: re-establish c, or try to establish an own proposal
Alexander Reinefeld, ZIB 23
Proposer
Init:
    r = 1              // local round number
    rlatest = 0        // number of the highest acknowledged round
    latestv = ⊥        // value of the highest acknowledged round

// send a new proposal
    acknum = 0         // number of valid acknowledgements
    send prepare(r) to all acceptors

On receiving ack(rack, vi, ri) from acceptor i:
    if r == rack
        acknum++
        if ri > rlatest
            rlatest = ri      // more recently accepted round
            latestv = vi      // more recent value
    if acknum ≥ maj
        if latestv == ⊥: propose an own value latestv ≠ ⊥
        send accept(r, latestv) to all acceptors

Learner
Init:
    numaccepted = 0    // number of collected accepts

On receiving accepted(r, v) from acceptor i:
    if r increases: numaccepted = 0
    numaccepted++
    if numaccepted == maj
        decide v; inform client   // v is the consensus

Acceptor
Init:
    rack = 0           // last acknowledged round
    raccepted = 0      // last accepted round
    v = ⊥              // current local value

On receiving prepare(r) from a proposer:
    if r > rack ∧ r > raccepted   // higher round
        rack = r
        send ack(rack, v, raccepted) to the proposer   // end of phase 1

On receiving accept(r, w):
    if r ≥ rack ∧ r > raccepted
        raccepted = r
        v = w
        send accepted(raccepted, v) to the learners
A. Reinefeld, F. Schintke, ZIB 24
Striping Performance on Cluster
• Striping
• parallel transfer from/to many OSDs
• bandwidth scales with the number of OSDs
• the client is the bottleneck (slower reads are caused by a TCP ingress problem)
[Figure: READ and WRITE throughput. One client writes/reads a single 4 GB file using asynchronous writes, read-ahead, a 1 MB chunk size, and 29 OSDs. Nodes are connected with IP over InfiniBand (1.2 GB/s).]
A. Reinefeld, F. Schintke, ZIB 25
Snapshots & Backups
Metadata snapshots (MRC)
• needs an atomic operation without service interruption
• asynchronous consolidation in the background
• granularity: subdirectories or volumes
• implemented by BabuDB or Scalaris
File snapshots (OSD)
• taken implicitly when a file is idle, or explicitly on close or fsync()
• versioning of file objects: copy-on-write (see the sketch below)
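A minimal sketch of copy-on-write versioning of a file object; the data structures and names are illustrative assumptions, not the XtreemFS OSD implementation.

    // Hypothetical copy-on-write versioning: a write never modifies a version
    // that belongs to a snapshot, it creates a new version instead.
    import java.util.TreeMap;

    class VersionedObject {
        private final TreeMap<Long, byte[]> versions = new TreeMap<>();  // version -> contents
        private long snapshotVersion = -1;   // newest version frozen by a snapshot

        void write(byte[] data) {
            long current = versions.isEmpty() ? -1 : versions.lastKey();
            if (current <= snapshotVersion) {
                versions.put(current + 1, data.clone());  // copy-on-write: new version
            } else {
                versions.put(current, data.clone());      // overwrite the latest unfrozen version
            }
        }

        void snapshot() {                     // freeze the current version
            snapshotVersion = versions.isEmpty() ? -1 : versions.lastKey();
        }
    }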
A. Reinefeld, F. Schintke, ZIB 26
Atomic Snapshots in MRC
• implemented with BabuDB backend
• a large-scale DB for data that exceeds the system’s main memory
• 2 components:
• small mutable overlay trees (LSM trees)
• large immutable memory-mapped index on disk
• non-transactional key-value store
• prefix and range queries
• primary design goal: Performance!
• 300,000 lookups/sec (30M entries)
• fast crash recovery
• fast start-up
A. Reinefeld, F. Schintke, ZIB 27
Log-Structured Merge Trees
A lookup takes O(s log(n)), with s: #snapshots, n: #files (see the sketch below)
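To illustrate where the O(s log(n)) bound comes from, here is a minimal Java sketch under assumed data structures (not BabuDB's actual classes): a lookup checks the newest overlay tree first and falls back through older overlays to the immutable on-disk index, doing one O(log n) search per level.

    // Hypothetical LSM-style lookup: s overlay trees plus one on-disk index,
    // each searched in O(log n), hence O(s log n) in total.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.TreeMap;

    class LsmIndex {
        private final Deque<TreeMap<String, byte[]>> overlays = new ArrayDeque<>();  // newest first
        private final TreeMap<String, byte[]> onDiskIndex = new TreeMap<>();         // stands in for the mmap'd index

        void put(String key, byte[] value) {
            if (overlays.isEmpty()) overlays.addFirst(new TreeMap<>());
            overlays.peekFirst().put(key, value);     // writes go to the newest overlay
        }

        byte[] lookup(String key) {
            for (TreeMap<String, byte[]> overlay : overlays) {   // newest to oldest
                byte[] v = overlay.get(key);
                if (v != null) return v;
            }
            return onDiskIndex.get(key);
        }

        void snapshot() {                             // freeze the current overlay, start a new one
            overlays.addFirst(new TreeMap<>());
        }
    }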
Alexander Reinefeld, ZIB 28
Replicating MRC, OSDs
Master/Slave Scheme
• Pros
• fast local read
• no distributed transactions
• easy to implement
• Cons
• the master is a performance bottleneck
• interruption when the master fails: needs stable master election
Replicated State Machine (Paxos)
• Pros
• no master, no single point of failure
• no extra latency on failure
• Cons
• slower: 2 round trips per op
• needs distributed consensus
Alexander Reinefeld, ZIB 29
XtreemFS Features
Release 1.2.1 (current)
• RAID and parallel I/O
• POSIX compatibility
• Read-only replication
• Partial replicas (on-demand)
• Security (SSL, X.509)
• Internet ready
• Checksums
• Extensions
• OSD and replica selection (Vivaldi,
datacenter maps)
• Asynchronous MRC backups
• Metadata caching
• Graphical admin console
• Hadoop file system driver (experimental)
Release 1.3 (very soon)
• DIR and MRC replication with automatic
failover
• Read/write replication
Release 2.x
• Consistent Backups
• Snapshots
• Automatic replica creation, deletion and
maintenance
A. Reinefeld, F. Schintke, ZIB 30
Source Code
• XtreemFS
• http://code.google.com/p/xtreemfs
• 35,000 lines of C++ and Java code
• GNU GPL v2 license
• BabuDB
• http://code.google.com/p/babudb
• 10,000 lines of Java code
• new BSD license
• Scalaris
• http://code.google.com/p/scalaris
• 28,214 lines of Erlang and C++ code
• Apache 2.0 license
A. Reinefeld, F. Schintke, ZIB 31
Summary
• Cloud file systems require replication
• availability
• fast access, striping
• Replication requires a consistency algorithm
• when crashes are rare: use master/slave replication
• with frequent crashes: use Paxos
• Only Consistency + Partition tolerance from CAP theorem
• Our next step: Faster high-level data services
• for MapReduce, Dryad, key/value store, SQL, …