
Replication and Consistency in Cloud File Systems

Alexander Reinefeld and Florian Schintke

Zuse-Institut Berlin

Cloud Computing Day at the IKMZ, BTU Cottbus, 14 April 2011


Let’s start with a little quiz

Who invented Cloud Computing?

a) Werner Vogels

b) Ian Foster

c) Konrad Zuse

The correct answer is … c)

"Eventually, computing centers, too, will be interconnected with one another via telecommunication lines, …"

Konrad Zuse in "Rechnender Raum" (1969)

Konrad Zuse, 22 June 1910 - 18 December 1995


Zuse Institute Berlin

Research institute for applied mathematics and computer science

• Peter Deuflhard: chair for scientific computing, FU Berlin

• Martin Grötschel: chair for discrete mathematics, TU Berlin

• Alexander Reinefeld: chair for computer science, HU Berlin


HPC Systems @ ZIB

1984: Cray 1M, 160 MFlops

1987: Cray X-MP, 471 MFlops

1994: Cray T3D, 38 GFlops

1997: Cray T3E, 486 GFlops

2002: IBM p690, 2.5 TFlops

2008/09: SGI ICE, XE, 150 TFlops

1,000,000-fold performance increase in 25 years (1984 to 2009)

HLRN: 2 sites, 98 computer racks, 26112 CPU cores, 128 TB memory, 1620 TB disk, 300 TFlops peak performance

Storage: 3 SL8500 robots, 39 tape drives, 19000 slots


What is Cloud Computing?

Cloud Computing = Grid Computing on Datacenters?

• not that simple …

Cloud and Grid both abstract resources through interfaces.

• Grid: via new middleware. Requires Grid APIs.

• Cloud: via virtualization. Allows legacy APIs.

Layer model:

• Software as a Service (SaaS): applications, application services

• Platform as a Service (PaaS): programming environment, execution environment

• Infrastructure as a Service (IaaS): infrastructure services, resource set


Why Cloud?

Pros

• It scales: because it's their resources, not yours

• It's simple: because they operate it

• Pay for what you need: don't pay for empty spinning disks

Cons

• It's expensive: Amazon S3 charges $0.15 / GB / month, i.e. $1,800 / TB / year (see the quick calculation after this list)

• It's not 100% secure: S3 now lets you bring your own RSA key pair. But would you put your bank account into the cloud?

• It's not 100% available: S3 provides "service credits" if availability drops (10% for 99.0-99.9% availability)
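For reference, the per-terabyte figure above follows from simple arithmetic on the quoted S3 price:

0.15 $/GB/month × 1000 GB/TB × 12 months/year = 1800 $/TB/year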


File System Landscape

• PC, local system: ext3, ZFS, NTFS

• Network FS / centralized: NFS, SMB, AFS/Coda

• Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, …

• Cloud/Grid: grid file systems (GFarm), gridftp, GDM


Consistency, Availability, Partition tolerance:

Pick two of three!

Consistency: All clients have the same view of the data.

Availability: Each client can always read and write.

Partition tolerance: Operations will complete, even if individual components are unavailable.

Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.

Examples:

• A + P: Amazon S3, Mercurial, Coda/AFS

• C + A: single server, Linux HA (one data center)

• C + P: distributed databases, distributed file systems


Which semantics do you expect?

Distributed file systems should provide C + P

But recent hype was on A + P + eventual consistency (e.g. Amazon S3)


Grid File System

• provides access to heterogeneous storage resources, but

• middleware causes additional complexity and vulnerability

• requires explicit file transfer

• whole file: latency to first access, bandwidth, disk storage

• also partial file access (gridftp) and pattern access (falls)

• no consistency among replicas: the user must take care

• no access control on replicas


Cloud File System: XtreemFS

Focus on

• data distribution

• data replication

• object based

Key features

• MRCs are separated from OSDs

• the fat client is the "link"

MRC = metadata & replica catalogue

OSD = object storage device

Client = file system interface


A closer look at XtreemFS

• Features

• distributed, replicated POSIX compliant file system

• Server software (Java) runs on Linux, OS X, Solaris

• Client software (C++) runs on Linux, OS X, Windows

• secure: X.509 and SSL

• open source (GPL)

• Assumptions

• synchronous clocks with a maximum time drift (needed for OSD lease negotiation; a reasonable assumption in clouds; see the sketch after this list)

• upper limit on round-trip time

• no need for FIFO channels (runs on either TCP or UDP)
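The lease mechanics are not spelled out here, so the following is only a minimal sketch of why a bound on clock drift matters: a replica should treat its lease as valid only while it cannot have expired even if the local clock runs fast by the maximum drift. All names (Lease, LeaseChecker, maxDriftMillis) are illustrative, not the XtreemFS or Flease API.

// Illustrative only: a lease is honored only while it is still valid under
// the pessimistic assumption that the local clock is ahead by the drift bound.
class Lease {
    final String holder;
    final long expiresAtMillis;   // expiry timestamp agreed on by the replicas
    Lease(String holder, long expiresAtMillis) {
        this.holder = holder;
        this.expiresAtMillis = expiresAtMillis;
    }
}

class LeaseChecker {
    private final long maxDriftMillis;   // assumed upper bound on clock drift

    LeaseChecker(long maxDriftMillis) {
        this.maxDriftMillis = maxDriftMillis;
    }

    // The local node may act as master only while its lease cannot have expired.
    boolean mayActAsMaster(Lease lease, String self) {
        long now = System.currentTimeMillis();
        return lease != null
                && lease.holder.equals(self)
                && now + maxDriftMillis < lease.expiresAtMillis;
    }
}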


XtreemFS Interfaces


File access protocol

Diagram: the user application issues file operations through the Linux VFS to the XtreemFS client (FUSE); the client communicates with the MRC and the OSDs via the file access protocol. After a write that changes the file size (e.g. FileSize = 128k), an update Update(Cap, FileSize=128k) is propagated to the MRC.


Client

• gets the list of OSDs from the MRC

• gets a capability (signed by the MRC) per file

• selects the best OSD(s) for parallel I/O

• various striping policies: scatter/gather, RAIDx, erasure codes (see the striping sketch after this list)

• scalable and fast access

• no communication between OSD and MRC needed

• the client is the "missing link"
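To make the striping policies above concrete, here is a minimal RAID0-style sketch (my own illustration, not XtreemFS code; stripeSize and osds are assumed parameters) of how a byte offset maps to an object number and an OSD:

import java.util.List;

// RAID0-style striping sketch: file data is split into fixed-size objects
// that are distributed round-robin over the OSDs assigned to the file.
class StripingPolicy {
    private final long stripeSize;      // object size in bytes, e.g. 1 MiB
    private final List<String> osds;    // OSD identifiers, order fixed per file

    StripingPolicy(long stripeSize, List<String> osds) {
        this.stripeSize = stripeSize;
        this.osds = osds;
    }

    // Which object holds the byte at this offset?
    long objectNumber(long byteOffset) {
        return byteOffset / stripeSize;
    }

    // Which OSD stores a given object?
    String osdForObject(long objectNumber) {
        return osds.get((int) (objectNumber % osds.size()));
    }
}
// Example: with 1 MiB objects and 3 OSDs, byte offset 5,000,000 lies in
// object 4, which round-robin places on the OSD at index 1.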


MRC – Metadata and Replication Catalogue

• provides

• open(), close(), readdir(), rename(), …

• attributes per file: size, last access, access rights, location (OSDs), …

• capability (file handle) to authorize a client to access objects on OSDs

• implemented with a key/value store (BabuDB)

• fast index

• append-only DB

• allows snapshots
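Since the slide only says that the MRC sits on a key/value store, here is one plausible way (hypothetical, not the actual XtreemFS/BabuDB schema) to map directory metadata onto an ordered key/value store so that readdir() becomes a prefix scan:

import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical layout: directory entries are stored under keys
// "parentDirId/fileName" in an ordered map, so readdir() is a range scan.
class MetadataStore {
    private final TreeMap<String, String> kv = new TreeMap<>();

    void put(String parentDirId, String name, String serializedMetadata) {
        kv.put(parentDirId + "/" + name, serializedMetadata);
    }

    // All entries whose key starts with "parentDirId/"
    NavigableMap<String, String> readdir(String parentDirId) {
        String from = parentDirId + "/";
        String to = parentDirId + "0";   // '0' is the next character after '/' in ASCII
        return kv.subMap(from, true, to, false);
    }
}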


OSD – Object Storage Device

• serves file content operations

• read(), write(), truncate(), flush(), …

• implements object replication

• also partial replicas for read access

• data is filled on demand

• gets the OSD list from the MRC

• a slave OSD redirects to the master OSD (see the sketch after this list)

• write operations are performed only on the master OSD

• POSIX requires linearizable reads, hence reads are also redirected
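A minimal sketch of the redirect rule above, under the assumption of a single master chosen via the lease service; the classes and names are illustrative, not the XtreemFS OSD implementation:

// Only the current master (lease holder) executes reads and writes;
// any other replica answers with a redirect to the master.
enum OpType { READ, WRITE }

class OsdResponse {
    final boolean redirect;
    final String redirectTarget;   // master address if redirect, else null
    final byte[] data;
    OsdResponse(boolean redirect, String redirectTarget, byte[] data) {
        this.redirect = redirect;
        this.redirectTarget = redirectTarget;
        this.data = data;
    }
}

class ReplicaOsd {
    private final String self;
    private volatile String currentMaster;   // maintained via the lease service

    ReplicaOsd(String self, String initialMaster) {
        this.self = self;
        this.currentMaster = initialMaster;
    }

    OsdResponse handle(OpType op, long objectNo) {
        if (!self.equals(currentMaster)) {
            // Reads are redirected as well, to keep them linearizable.
            return new OsdResponse(true, currentMaster, null);
        }
        byte[] result = (op == OpType.WRITE) ? writeLocal(objectNo) : readLocal(objectNo);
        return new OsdResponse(false, null, result);
    }

    private byte[] readLocal(long objectNo)  { return new byte[0]; }   // placeholder
    private byte[] writeLocal(long objectNo) { return new byte[0]; }   // placeholder
}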


OSD – Object Storage Device

• Which OSD to select?

• object list

• bandwidth

• rarest first

• network coordinates, datacenter map, …

• prefetching (for partial replicas)
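As one example of such a selection policy, here is a minimal rarest-first sketch (hypothetical names, not XtreemFS code): among the objects still missing in a partial replica, fetch first the one with the fewest copies in the system.

import java.util.Map;
import java.util.Set;

// Rarest-first sketch: pick the missing object that currently has the fewest
// replicas, so that scarce objects are copied before widely available ones.
class RarestFirstSelector {
    // replicaCount: objectNo -> number of OSDs that already hold this object
    Long pickNextObject(Set<Long> missingObjects, Map<Long, Integer> replicaCount) {
        Long best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Long obj : missingObjects) {
            int count = replicaCount.getOrDefault(obj, 0);
            if (count < bestCount) {
                bestCount = count;
                best = obj;
            }
        }
        return best;   // null if nothing is missing
    }
}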


OSD – Object Storage Device

• implements concurrency control for replica consistency

• POSIX compliant

• master/slave replication with failover

• group membership service provided by MRC

• lease service “Flease”: distributed, scalable and failure-tolerant

• 50,000 leases/sec with 30 OSDs

• based on quorum consensus (Paxos)


Quorum consensus

• Basic algorithm

• When a majority is informed, every other majority has at least one member with up-to-date information (see the example below).

• A minority may crash at any time.

• Paxos consensus

• Step 1: Check whether a consensus c has already been established.

• Step 2: Re-establish c, or try to establish an own proposal.

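A quick numeric illustration of the majority-intersection argument (an added example, not from the slides): a majority of n acceptors has at least ⌊n/2⌋ + 1 members. For n = 5, two majorities have at least 3 members each, and since 3 + 3 = 6 > 5 they must share at least one acceptor, which carries the up-to-date information.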


Proposer

Init:
  r = 1                 // local round number
  rlatest = 0           // number of the highest acknowledged round seen so far
  latestv = ⊥           // value of the highest acknowledged round

Send a new proposal:
  acknum = 0            // number of valid acknowledgements
  send prepare(r) to all acceptors

On ack(rack, vi, ri) from acceptor i:
  if r == rack
    acknum++
    if ri > rlatest
      rlatest = ri      // more recently accepted round
      latestv = vi      // more recently accepted value
    if acknum ≥ maj
      if latestv == ⊥: propose an own value latestv ≠ ⊥
      send accept(r, latestv) to all acceptors

Learner

Init:
  numaccepted = 0       // number of collected accepts

On accepted(r, v) from acceptor i:
  if r increases: numaccepted = 0
  numaccepted++
  if numaccepted == maj
    decide v; inform the client   // v is the consensus

Acceptor

Init:
  rack = 0              // last acknowledged round
  raccepted = 0         // last accepted round
  v = ⊥                 // current local value

On prepare(r) from a proposer:
  if r > rack ∧ r > raccepted     // higher round
    rack = r
    send ack(rack, v, raccepted) to the proposer   // end of phase 1

On accept(r, w):
  if r ≥ rack ∧ r > raccepted
    raccepted = r
    v = w
    send accepted(raccepted, v) to the learners


Striping Performance on Cluster

• Striping

• parallel transfer from/to many OSDs

• bandwidth scales with the number of OSDs

• the client is the bottleneck (slower reads are caused by a TCP ingress problem)

Plot: READ and WRITE bandwidth. One client writes/reads a single 4 GB file using asynchronous writes, read-ahead, 1 MB chunk size, and 29 OSDs. Nodes are connected with IP over InfiniBand (1.2 GB/s).


Snapshots & Backups

Metadata snapshots (MRC)

• need atomic operation without service interrupt

• asynchronous consolidation in background

• granularity: subdirectories or volumes

• implemented by BabuDB or Scalaris

File snapshots (OSD)

• taken implicitly when the file is idle, or explicitly when closing the file or on fsync()

• versioning of file objects: copy-on-write (see the sketch after this list)
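A minimal copy-on-write sketch for the object versioning mentioned above (illustrative only, not the XtreemFS OSD code): writes never overwrite data visible to an older snapshot; they store content under the current version, and reads fall back to the newest version not younger than the requested snapshot.

import java.util.HashMap;
import java.util.Map;

// Copy-on-write versioning sketch: each write stores the object content under
// the current version number, so data belonging to older snapshots survives.
class VersionedObjectStore {
    // key "objectNo@version" -> content
    private final Map<String, byte[]> store = new HashMap<>();
    private long currentVersion = 1;

    void snapshot() {
        currentVersion++;            // later writes create new versions lazily
    }

    void write(long objectNo, byte[] newContent) {
        store.put(objectNo + "@" + currentVersion, newContent);
    }

    byte[] read(long objectNo, long version) {
        // Fall back to the newest version that is <= the requested one.
        for (long v = version; v >= 1; v--) {
            byte[] content = store.get(objectNo + "@" + v);
            if (content != null) return content;
        }
        return null;   // object did not exist at that version
    }
}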


Atomic Snapshots in MRC

• implemented with BabuDB backend

• a large-scale DB for data that exceeds the system’s main memory

• 2 components:

• small mutable overlay trees (LSM trees)

• large immutable memory-mapped index on disk

• non-transactional key-value store

• prefix and range queries

• primary design goal: Performance!

• 300,000 lookups/sec (30M entries)

• fast crash recovery

• fast start-up


Log-Structured Merge Trees

A lookup takes O(s · log(n)), where s is the number of snapshots (overlay trees) and n the number of files: each of the s overlays and the base index is searched with one logarithmic lookup (see the sketch below).
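A generic sketch of where this complexity comes from (not the BabuDB implementation): a lookup probes each of the s overlay trees, newest first, and finally the immutable base index, each probe costing O(log n).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

// Generic LSM-style lookup sketch: a stack of sorted in-memory overlays
// in front of an immutable base index.
class LsmStore {
    private final Deque<TreeMap<String, String>> overlays = new ArrayDeque<>();
    private final TreeMap<String, String> baseIndex = new TreeMap<>();

    LsmStore() {
        overlays.push(new TreeMap<>());   // current mutable overlay
    }

    void put(String key, String value) {
        overlays.peek().put(key, value);  // writes only touch the newest overlay
    }

    void snapshot() {
        overlays.push(new TreeMap<>());   // freeze the old overlay, start a new one
    }

    String get(String key) {
        for (TreeMap<String, String> overlay : overlays) {   // newest first
            String v = overlay.get(key);
            if (v != null) return v;
        }
        return baseIndex.get(key);        // finally the immutable on-disk index
    }
}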


Replicating MRC, OSDs

Master/Slave Scheme

• Pros

• fast local read

• no distributed transactions

• easy to implement

• Cons

• master is performance bottleneck

• interrupt when master fails: needs stable master election

Replicated State Machine (Paxos)

• Pros

• no master, no single point of failure

• no extra latency on failure

• Cons

• slower: 2 round trips per op

• needs distrib. consensus


XtreemFS Features

Release 1.2.1 (current)

• RAID and parallel I/O

• POSIX compatibility

• Read-only replication

• Partial replicas (on-demand)

• Security (SSL, X.509)

• Internet ready

• Checksums

• Extensions

• OSD and replica selection (Vivaldi, datacenter maps)

• Asynchronous MRC backups

• Metadata caching

• Graphical admin console

• Hadoop file system driver (experimental)

Release 1.3 (very soon)

• DIR and MRC replication with automatic failover

• Read/write replication

Release 2.x

• Consistent Backups

• Snapshots

• Automatic replica creation, deletion, and maintenance


Source Code

• XtreemFS: http://code.google.com/p/xtreemfs

• 35,000 lines of C++ and Java code

• GNU GPL v2 license

• BabuDB: http://code.google.com/p/babudb

• 10,000 lines of Java code

• new BSD license

• Scalaris: http://code.google.com/p/scalaris

• 28,214 lines of Erlang and C++ code

• Apache 2.0 license


Summary

• Cloud file systems require replication

• availability

• fast access, striping

• Replication requires consistency algorithm

• when crashes are rare: use master/slave replication

• with frequent crashes: use Paxos

• Only Consistency + Partition tolerance from CAP theorem

• Our next step: Faster high-level data services

• for MapReduce, Dryad, key/value store, SQL, …

Recommended