A. Reinefeld, F. Schintke, ZIB 1
Replication and Consistency in Cloud File Systems
Alexander Reinefeld and Florian Schintke
Zuse-Institut Berlin
Cloud-Computing-Tag at the IKMZ of BTU Cottbus, 14.04.2011
A. Reinefeld, F. Schintke, ZIB 2
Let’s start with a little quiz
Who invented Cloud Computing?
a) Werner Vogels
b) Ian Foster
c) Konrad Zuse
The correct answer is … c)
“Eventually, computing centers, too, will be networked with one another over telecommunication lines, …”
Konrad Zuse in “Rechnender Raum” (1969), translated from the German
Konrad Zuse (22.06.1910 - 18.12.1995)
A. Reinefeld, F. Schintke, ZIB 3
Zuse Institute Berlin
Research institute for applied mathematics
and computer science
• Peter Deuflhard
chair for scientific computing, FU Berlin
• Martin Grötschel
chair for discrete mathematics, TU Berlin
• Alexander Reinefeld
chair for computer science, HU Berlin
A. Reinefeld, F. Schintke, ZIB 4
HPC Systems @ ZIB
1984: Cray 1M, 160 MFlops
1987: Cray X-MP, 471 MFlops
1994: Cray T3D, 38 GFlops
1997: Cray T3E, 486 GFlops
2002: IBM p690, 2.5 TFlops
2008/09: SGI ICE, XE, 150 TFlops
A 1,000,000-fold performance increase in 25 years (1984-2009).
HLRN
• 2 sites
• 98 computer racks
• 26,112 CPU cores
• 128 TB memory
• 1,620 TB disk
• 300 TFlops peak performance
Storage
• 3 SL8500 robots
• 39 tape drives
• 19,000 slots
A. Reinefeld, F. Schintke, ZIB 7
What is Cloud Computing?
Cloud Computing = Grid Computing on Datacenters?
• not that simple …
Cloud and Grid both abstract resources through interfaces.
• Grid: via new middleware.
Requires Grid APIs.
• Cloud: via virtualization.
Allows legacy APIs.
• Software as a Service (SaaS): Applications, Application Services
• Platform as a Service (PaaS): Programming Environment, Execution Environment
• Infrastructure as a Service (IaaS): Infrastructure Services, Resource Set
Alexander Reinefeld, ZIB 8
Why Cloud?
Pros
• It scales
because it's their resources, not yours
• It's simple
because they operate it
• Pay for what you need
don't pay for empty spinning disks
Cons
• It's expensive
Amazon S3 charges $0.15 / GB / month
= $1,800 / TB / year
• It's not 100% secure
S3 now lets you bring your own RSA key pair. But: would you put your
bank account into the cloud?
• It's not 100% available
S3 provides “service credits” if availability
drops (10% for 99.0-99.9% availability)
Alexander Reinefeld, ZIB 9
File System Landscape
• PC, local system: ext3, ZFS, NTFS
• Network FS / centralized: NFS, SMB, AFS/Coda
• Cluster FS / datacenter: Lustre, Panasas, GPFS, Ceph, ...
• Cloud / Grid: Grid File System (GFarm), gridftp, GDM
Alexander Reinefeld, ZIB 10
Consistency, Availability, Partition tolerance:
Pick two of three!
Consistency: All clients have the
same view of the data.
Availability: Each client can
always read and write.
Partition tolerance: Operations
will complete, even if individual
components are unavailable.
Brewer, Eric. Towards Robust Distributed Systems. PODC Keynote, 2000.
• A + P: Amazon S3, Mercurial, Coda/AFS
• C + A: single server, Linux HA (one data center)
• C + P: distributed databases, distributed file systems
Alexander Reinefeld, ZIB 11
Which semantics do you expect?
Distributed file systems should provide C + P
But the recent hype has been about A + P + eventual consistency (e.g. Amazon S3)
Alexander Reinefeld, ZIB 12
Grid File System
• provides access to heterogeneous storage resources, but
• the middleware adds complexity and vulnerability
• requires explicit file transfer
• whole-file transfer: latency until first access, bandwidth, disk storage
• also partial file access (gridftp) and pattern access (falls)
• no consistency among replicas
• the user must take care of it
• no access control on replicas
Alexander Reinefeld, ZIB 13
Cloud File System: XtreemFS
Focus on
• data distribution
• data replication
• object-based
Key features
• MRCs are separated
from OSDs
• the fat client is the “link”
MRC = metadata & replica catalogue
OSD = object storage device
Client = file system interface
A. Reinefeld, F. Schintke, ZIB 14
A closer look at XtreemFS
• Features
• distributed, replicated, POSIX-compliant file system
• Server software (Java) runs on Linux, OS X, Solaris
• Client software (C++) runs on Linux, OS X, Windows
• secure: X.509 and SSL
• open source (GPL)
• Assumptions
• loosely synchronized clocks with a bounded maximum drift (needed for OSD lease
negotiation; a reasonable assumption in clouds; see the sketch after this list)
• an upper bound on the round-trip time
• no need for FIFO channels (runs on either TCP or UDP)
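As a rough illustration of why bounded clock drift matters for lease negotiation, here is a minimal Java sketch; it is not XtreemFS or Flease code, and the names and the drift bound MAX_DRIFT_MS are assumptions made for this example. A replica only acts as master while its lease would still be valid even if its local clock lagged real time by the maximum drift.

    // Hypothetical lease-validity check under a bounded clock drift.
    class Lease {
        static final long MAX_DRIFT_MS = 500;  // assumed worst-case drift, not an XtreemFS constant
        final long expiresAtMs;                // expiry time as stamped by the lease coordinator

        Lease(long expiresAtMs) { this.expiresAtMs = expiresAtMs; }

        // Safe even if the local clock lags real time by up to MAX_DRIFT_MS:
        // the holder stops acting as master MAX_DRIFT_MS before the nominal expiry.
        boolean stillValid(long localNowMs) {
            return localNowMs + MAX_DRIFT_MS < expiresAtMs;
        }
    }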
A. Reinefeld, F. Schintke, ZIB 15
XtreemFS Interfaces
Alexander Reinefeld, ZIB 16
File access protocol
[Sequence diagram: a user application accesses files through the Linux VFS and the XtreemFS client (FUSE), which talks to the MRC and the OSDs; the diagram shows a file size update, Update(Cap, FileSize=128k).]
A. Reinefeld, F. Schintke, ZIB 17
Client
• gets the list of OSDs from the MRC
• gets a capability (signed by the MRC) per file
• selects the best OSD(s) for parallel I/O
• various striping policies: scatter/gather, RAIDx, erasure codes (see the sketch below)
• scalable and fast access
• no communication between OSD and MRC needed
• client is the “missing link”
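To make the striping idea concrete, here is a minimal Java sketch of how a client could map a byte offset to an object and an OSD under a simple round-robin (RAID-0-style) policy. The class and field names are illustrative assumptions, not the XtreemFS client API.

    // Illustrative round-robin striping: object index and OSD for a byte offset.
    class StripingPolicy {
        final int stripeSizeBytes;   // e.g. 1 MB chunks, as in the benchmark slide later on
        final int numOsds;           // number of OSDs holding the file's stripes

        StripingPolicy(int stripeSizeBytes, int numOsds) {
            this.stripeSizeBytes = stripeSizeBytes;
            this.numOsds = numOsds;
        }

        long objectIndex(long byteOffset) {   // which object the offset falls into
            return byteOffset / stripeSizeBytes;
        }

        int osdForObject(long objectIndex) {  // round-robin placement across OSDs
            return (int) (objectIndex % numOsds);
        }
    }

With 1 MB stripes and 29 OSDs, for example, an offset of 30 MB falls into object 30, which is stored on OSD 1.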
A. Reinefeld, F. Schintke, ZIB 18
MRC – Metadata and Replication Catalogue
• provides
• open(), close(), readdir(), rename(), …
• attributes per file: size, last access, access rights, location (OSDs), …
• capability (file handle) to authorize a client to access objects on OSDs
• implemented with a key/value store (BabuDB), see the sketch after this list
• fast index
• append-only DB
• allows snapshots
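As a rough illustration of keeping file system metadata in an ordered key/value store, here is a minimal Java sketch; the key layout is an assumption for this example and not BabuDB's actual schema.

    // Hypothetical key layout: one key per attribute, ordered so that
    // readdir() becomes a cheap prefix scan.
    import java.util.TreeMap;

    class MetadataStore {
        private final TreeMap<String, String> kv = new TreeMap<>();  // stands in for the index

        void create(String volume, String path, long size, String osdList) {
            String prefix = volume + "/" + path;
            kv.put(prefix + ":size", Long.toString(size));
            kv.put(prefix + ":osds", osdList);
        }

        Iterable<String> readdir(String volume, String dir) {
            String from = volume + "/" + dir + "/";               // prefix scan over the directory
            return kv.subMap(from, from + Character.MAX_VALUE).keySet();
        }
    }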
A. Reinefeld, F. Schintke, ZIB 19
OSD – Object Storage Device
• serves file content operations
• read(), write(), truncate(), flush(), …
• implements object replication
• also partial replicas for read access
• data is filled in on demand
• gets the OSD list from the MRC
• a slave OSD redirects requests to the master OSD
• write operations run only on the master OSD
• POSIX requires linearizable reads, hence reads are redirected as well (see the sketch below)
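A minimal sketch of the redirect logic described above; the class and method names are assumptions for illustration, not the XtreemFS OSD interface.

    // Hypothetical master/slave redirect: both reads and writes are served
    // by the current master, which keeps POSIX reads linearizable.
    class ReplicatedObjectHandler {
        private final String localOsdUuid;
        private volatile String currentMasterUuid;   // determined via the lease service

        ReplicatedObjectHandler(String localOsdUuid, String currentMasterUuid) {
            this.localOsdUuid = localOsdUuid;
            this.currentMasterUuid = currentMasterUuid;
        }

        // Returns null if this OSD may serve the request, else the OSD to redirect to.
        String redirectTarget() {
            return localOsdUuid.equals(currentMasterUuid) ? null : currentMasterUuid;
        }
    }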
A. Reinefeld, F. Schintke, ZIB 20
OSD – Object Storage Device
• Which OSD to select? Possible criteria (see the sketch after this list):
• object list
• bandwidth
• rarest first
• network coordinates, datacenter map, …
• prefetching (for partial replicas)
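As an illustration of a pluggable replica/OSD selection policy, the sketch below simply orders candidates by an estimated cost (for instance a round-trip-time estimate derived from network coordinates or a datacenter map). The interface and scoring are assumptions for this example, not the actual XtreemFS policy API.

    // Hypothetical OSD selection: pick the candidate with the lowest estimated cost.
    import java.util.Comparator;
    import java.util.List;

    class OsdCandidate {
        final String uuid;
        final double estimatedCost;   // lower is better, e.g. estimated RTT

        OsdCandidate(String uuid, double estimatedCost) {
            this.uuid = uuid;
            this.estimatedCost = estimatedCost;
        }
    }

    class OsdSelection {
        static OsdCandidate pickBest(List<OsdCandidate> candidates) {
            return candidates.stream()
                    .min(Comparator.comparingDouble((OsdCandidate c) -> c.estimatedCost))
                    .orElseThrow(() -> new IllegalStateException("no OSD available"));
        }
    }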
A. Reinefeld, F. Schintke, ZIB 21
OSD – Object Storage Device
• implements concurrency control for replica consistency
• POSIX-compliant
• master/slave replication with failover
• group membership service provided by MRC
• lease service “Flease”: distributed, scalable and failure-tolerant
• 50,000 leases/sec with 30 OSDs
• based on quorum consensus (Paxos)
A. Reinefeld, F. Schintke, ZIB 22
Quorum consensus
• Basic algorithm
• When a majority is informed, every other majority has at least one member with up-to-date information (e.g. with five nodes, any two majorities of size three overlap in at least one node).
• A minority may crash at any time.
• Paxos consensus
• Step 1: check whether a consensus value c has already been established
• Step 2: re-establish c, or try to establish an own proposal
Alexander Reinefeld, ZIB 23
Proposer
Init:
    r = 1              // local round number
    rlatest = 0        // number of the highest acknowledged round
    latestv = ⊥        // value of the highest acknowledged round

// send a new proposal
    acknum = 0         // number of valid acknowledgements
    send prepare(r) to all acceptors

On receiving ack(rack, vi, ri) from acceptor i:
    if r == rack
        acknum++
        if ri > rlatest
            rlatest = ri      // more recently accepted round
            latestv = vi      // more recent value
    if acknum ≥ maj
        if latestv == ⊥: propose an own value latestv ≠ ⊥
        send accept(r, latestv) to all acceptors

Learner
Init:
    numaccepted = 0    // number of collected accepts

On receiving accepted(r, v) from acceptor i:
    if r increases: numaccepted = 0
    numaccepted++
    if numaccepted == maj
        decide v; inform client   // v is the consensus

Acceptor
Init:
    rack = 0           // last acknowledged round
    raccepted = 0      // last accepted round
    v = ⊥              // current local value

On receiving prepare(r) from a proposer:
    if r > rack ∧ r > raccepted   // higher round
        rack = r
        send ack(rack, v, raccepted) to the proposer   // end of phase 1

On receiving accept(r, w):
    if r ≥ rack ∧ r > raccepted
        raccepted = r
        v = w
        send accepted(raccepted, v) to the learners
A. Reinefeld, F. Schintke, ZIB 24
Striping Performance on Cluster
• Striping
• parallel transfer from/to many OSDs
• bandwidth scales with the number of OSDs
• the client is the bottleneck (slower reads are caused by a TCP ingress problem)
[Figure: READ and WRITE throughput. One client writes/reads a single 4 GB file using asynchronous writes, read-ahead, a 1 MB chunk size, and 29 OSDs. Nodes are connected with IP over InfiniBand (1.2 GB/s).]
A. Reinefeld, F. Schintke, ZIB 25
Snapshots & Backups
Metadata snapshots (MRC)
• needs an atomic operation without service interruption
• asynchronous consolidation in the background
• granularity: subdirectories or volumes
• implemented by BabuDB or Scalaris
File snapshots (OSD)
• taken implicitly when a file is idle, or explicitly on close or fsync()
• versioning of file objects: copy-on-write (see the sketch below)
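A minimal sketch of copy-on-write versioning of a file object; the data structures and names are illustrative assumptions, not the XtreemFS OSD implementation.

    // Hypothetical copy-on-write versioning: a write never modifies a version
    // that belongs to a snapshot, it creates a new version instead.
    import java.util.TreeMap;

    class VersionedObject {
        private final TreeMap<Long, byte[]> versions = new TreeMap<>();  // version -> contents
        private long snapshotVersion = -1;   // newest version frozen by a snapshot

        void write(byte[] data) {
            long current = versions.isEmpty() ? -1 : versions.lastKey();
            if (current <= snapshotVersion) {
                versions.put(current + 1, data.clone());  // copy-on-write: new version
            } else {
                versions.put(current, data.clone());      // overwrite the latest unfrozen version
            }
        }

        void snapshot() {                     // freeze the current version
            snapshotVersion = versions.isEmpty() ? -1 : versions.lastKey();
        }
    }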
A. Reinefeld, F. Schintke, ZIB 26
Atomic Snapshots in MRC
• implemented with BabuDB backend
• a large-scale DB for data that exceeds the system’s main memory
• 2 components:
• small mutable overlay trees (LSM trees)
• large immutable memory-mapped index on disk
• non-transactional key-value store
• prefix and range queries
• primary design goal: Performance!
• 300,000 lookups/sec (30M entries)
• fast crash recovery
• fast start-up
A. Reinefeld, F. Schintke, ZIB 27
Log-Structured Merge Trees
A lookup takes O(s log(n)), with s: #snapshots, n: #files (see the sketch below)
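To illustrate where the O(s log(n)) bound comes from, here is a minimal Java sketch under assumed data structures (not BabuDB's actual classes): a lookup checks the newest overlay tree first and falls back through older overlays to the immutable on-disk index, doing one O(log n) search per level.

    // Hypothetical LSM-style lookup: s overlay trees plus one on-disk index,
    // each searched in O(log n), hence O(s log n) in total.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.TreeMap;

    class LsmIndex {
        private final Deque<TreeMap<String, byte[]>> overlays = new ArrayDeque<>();  // newest first
        private final TreeMap<String, byte[]> onDiskIndex = new TreeMap<>();         // stands in for the mmap'd index

        void put(String key, byte[] value) {
            if (overlays.isEmpty()) overlays.addFirst(new TreeMap<>());
            overlays.peekFirst().put(key, value);     // writes go to the newest overlay
        }

        byte[] lookup(String key) {
            for (TreeMap<String, byte[]> overlay : overlays) {   // newest to oldest
                byte[] v = overlay.get(key);
                if (v != null) return v;
            }
            return onDiskIndex.get(key);
        }

        void snapshot() {                             // freeze the current overlay, start a new one
            overlays.addFirst(new TreeMap<>());
        }
    }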
Alexander Reinefeld, ZIB 28
Replicating MRC, OSDs
Master/Slave Scheme
• Pros
• fast local read
• no distributed transactions
• easy to implement
• Cons
• the master is a performance bottleneck
• interruption when the master fails: needs stable master election
Replicated State Machine (Paxos)
• Pros
• no master, no single point of failure
• no extra latency on failure
• Cons
• slower: 2 round trips per op
• needs distributed consensus
Alexander Reinefeld, ZIB 29
XtreemFS Features
Release 1.2.1 (current)
• RAID and parallel I/O
• POSIX compatibility
• Read-only replication
• Partial replicas (on-demand)
• Security (SSL, X.509)
• Internet ready
• Checksums
• Extensions
• OSD and replica selection (Vivaldi,
datacenter maps)
• Asynchronous MRC backups
• Metadata caching
• Graphical admin console
• Hadoop file system driver (experimental)
Release 1.3 (very soon)
• DIR and MRC replication with automatic
failover
• Read/write replication
Release 2.x
• Consistent Backups
• Snapshots
• Automatic replica creation, deletion and
maintenance
A. Reinefeld, F. Schintke, ZIB 30
Source Code
• XtreemFS
• http://code.google.com/p/xtreemfs
• 35,000 lines of C++ and Java code
• GNU GPL v2 license
• BabuDB
• http://code.google.com/p/babudb
• 10,000 lines of Java code
• new BSD license
• Scalaris
• http://code.google.com/p/scalaris
• 28,214 lines of Erlang and C++ code
• Apache 2.0 license
A. Reinefeld, F. Schintke, ZIB 31
Summary
• Cloud file systems require replication
• availability
• fast access, striping
• Replication requires a consistency algorithm
• when crashes are rare: use master/slave replication
• with frequent crashes: use Paxos
• Only Consistency + Partition tolerance from CAP theorem
• Our next step: Faster high-level data services
• for MapReduce, Dryad, key/value store, SQL, …