Cluster-based Storage
Antonio Cesarano Bonaventura Del Monte
Università degli studi di Salerno
16th May 2014
Advanced Operating Systems
Prof. Giuseppe Cattaneo
Agenda
Context
Goals of design
NASD
NASD prototype
Distributed file systems on NASD
NASD parallel file system
Conclusions
A Cost-Effective, High-Bandwidth Storage Architecture
Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka, 1997-2001
Agenda
The Google File System
Motivations
Architecture
Benchmarks
Comparisons and conclusions
[Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 2003]
Context - 1998
New drive attachment technology: Fibre Channel, and new network standards
I/O-bound applications: streaming audio/video, data mining
Context - 1998
Cost-ineffective storage servers
Excess of on-drive transistors in the drive controller
Context - 1998
[Diagram: big files are split and spread across multiple storage servers, Storage 1 … Storage N]
Goal
No traditional storage file server
Cost-effective bandwidth scaling
What is NASD?
Network-Attached Secure Disk
direct transfer to clients
secure interfaces via cryptographic support
asynchronous oversight
variable-size data objects map to blocks
Network-Attached Secure Disk Architecture
NASD prototype
Based on a Unix inode-like interface
Network with 13 NASD drives
Each NASD runs on:
• DEC Alpha 3000, 133 MHz, 64 MB RAM
• 2 x Seagate Medalist on a 5 MB/s SCSI bus
• Connected to 10 clients over ATM (155 Mb/s)
Ad hoc handling modules (~16K LOC)
NASD prototype: test results
It scales!
DFS on NASD
Porting NFS and AFS to the NASD architecture:
o OK, no performance loss
o But there are concurrency limitations
Solution:
A new, higher-level parallel file system must be used…
NASD parallel file system
Scalable I/O low-level interface
Cheops as the storage management layer:
• Exports the same object interface as the NASD devices
• Maps striped objects to objects on the devices (sketched below)
• Supports concurrency control for multi-disk accesses
(~10K LOC)
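A minimal sketch of what a striping layer like Cheops has to do: map a byte offset within a striped object onto a specific NASD drive and an offset within that drive's object. The stripe-unit size, drive count and function names below are illustrative assumptions, not taken from the Cheops implementation.

```python
# Hypothetical round-robin striping across NASD drives (not the real Cheops code).
STRIPE_UNIT = 64 * 1024  # bytes per stripe unit (assumed value)

def locate(byte_offset: int, num_drives: int):
    """Map a byte offset in a striped object to (drive index, offset on that drive)."""
    stripe_unit_index = byte_offset // STRIPE_UNIT
    drive = stripe_unit_index % num_drives
    # Stripe units already placed on this drive, plus the offset inside the current unit
    offset_on_drive = (stripe_unit_index // num_drives) * STRIPE_UNIT + byte_offset % STRIPE_UNIT
    return drive, offset_on_drive

# Example: with 13 drives (as in the prototype), byte 1,000,000 of a striped object
print(locate(1_000_000, 13))   # -> (2, 82496)
```

Because consecutive stripe units land on different drives, a large sequential read keeps all drives busy at once, which is where the aggregate-bandwidth scaling comes from.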
NASD parallel file system test: a clustering data-mining application
* Each NASD drive provides 6.2 MB/s
Conclusions
High scalability
Direct transfer to clients
Working prototype
Usable with existing file systems
But... very high costs:
• Network adapters
• ASIC microcontroller
• Workstation
increasing the total cost by over 80%
Change
From here…
The Google File System
• Started with their Search Engine
• They provided new services like:
Google Video, Gmail, Google Maps, Google Earth, Google App Engine… and many more
Design overview
Observing common operations in Google applications led developers to make several assumptions:
Multiple clusters distributed worldwide
Fault tolerance and auto-recovery need to be built into the system, because failures occur very often
A modest number of large files (100+ MB or multi-GB)
Workloads consist of either large streaming reads or small random reads, while writes are mostly sequential appends of large amounts of data to files
Google applications and GFS should be co-designed
Producer-consumer pattern
GFS Architecture
[Diagram: a single master holds all metadata in RAM; a client sends metadata requests to the master and receives metadata responses; read/write requests and responses go directly to the chunkservers, each of which stores chunks as files on its local Unix file system]
GFS Architecture: Chunks
Similar to standard file system blocks, but much larger
Size: 64 MB (configurable)
Advantages:
• Reduces the clients' need to contact the master (see the sketch below)
• A client may perform many operations on a single chunk
• Fewer chunks means less metadata in the master
• No internal fragmentation, thanks to lazy space allocation
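To illustrate why large chunks reduce master traffic: a client can translate a (file, byte offset) pair into a chunk index entirely locally, and only has to ask the master for the chunk handle and replica locations. A minimal sketch, with assumed function names rather than the real GFS client API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size cited on the slide

def to_chunk_coordinates(byte_offset: int):
    """Translate a byte offset within a file into (chunk index, offset inside that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# A 1 GB sequential read touches only 16 chunks, i.e. at most 16 master lookups
print(to_chunk_coordinates(300 * 1024 * 1024))  # -> (4, 46137344)
```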
GFS Architecture: Chunks
Disadvantages:
• Small files, made of a small number of chunks, may be accessed many times (hot spots)
• Not a major issue, since Google apps mostly read large multi-chunk files sequentially
• Moreover, this can be fixed by using a higher replication factor
GFS Architecture: Master
A single process running on a separate machine
Stores all metadata in its RAM (sketched below):
• File and chunk namespaces
• Mapping from files to chunks
• Chunk locations
• Access control information and file locking
• Chunk versioning (snapshot handling)
• And so on…
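A rough sketch of the kind of in-memory structures the list above implies; the field and type names are invented for illustration, since the slides do not describe the master's data structures at this level:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                           # globally unique chunk handle
    version: int                                          # chunk version number
    locations: list[str] = field(default_factory=list)    # chunkservers holding a replica

@dataclass
class FileInfo:
    chunks: list[int] = field(default_factory=list)       # ordered chunk handles of the file
    acl: dict = field(default_factory=dict)               # access control information

# Master state, all resident in RAM
namespace: dict[str, FileInfo] = {}      # full path name -> file metadata
chunk_table: dict[int, ChunkInfo] = {}   # chunk handle -> chunk metadata
```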
GFS Architecture: Master
The master has the following responsibilities:
Chunk creation, re-replication, rebalancing and deletion, for:
• Balancing space utilization and access speed
• Spreading replicas across racks to reduce correlated failures, usually 3 copies of each chunk (see the sketch after this list)
• Rebalancing data to smooth out storage and request load
Persistent and replicated logging of critical metadata updates
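A toy sketch of rack-aware placement for the usual 3 replicas. The rack map and the selection policy are assumptions for illustration; the slides only state that replicas are spread across racks.

```python
import random

def place_replicas(racks: dict[str, list[str]], copies: int = 3) -> list[str]:
    """Pick `copies` chunkservers, preferring distinct racks to limit correlated failures."""
    chosen = []
    rack_names = list(racks)
    random.shuffle(rack_names)
    for rack in rack_names:                      # first pass: at most one replica per rack
        if len(chosen) == copies:
            break
        chosen.append(random.choice(racks[rack]))
    while len(chosen) < copies:                  # fewer racks than copies: reuse racks
        chosen.append(random.choice(racks[random.choice(rack_names)]))
    return chosen

racks = {"rack1": ["cs1", "cs2"], "rack2": ["cs3"], "rack3": ["cs4", "cs5"]}
print(place_replicas(racks))   # e.g. ['cs3', 'cs5', 'cs1']
```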
GFS Architecture: M - CS Communication
Master and chunkservers communicate regularly (HeartBeat messages) so the master can track their state:
o Is the chunkserver down?
o Are there disk failures on any chunkserver?
o Are any replicas corrupted?
o Which chunk replicas does a given chunkserver store?
Moreover, the master handles garbage collection and deletes "stale" replicas (see the sketch below):
o The master logs the deletion, then renames the target file to a hidden name
o A lazy GC removes the hidden files after a given amount of time
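A minimal sketch of the lazy-deletion idea above. The hidden-name convention and the retention period are assumptions for illustration; the actual GC also reclaims the orphaned chunks during the master's regular namespace scans.

```python
import time

HIDDEN_PREFIX = ".deleted."   # assumed naming convention for hidden files
RETENTION = 3 * 24 * 3600     # keep hidden files for 3 days (illustrative value)

def delete_file(namespace: dict, path: str) -> None:
    """Log the deletion by renaming the file to a hidden, timestamped name."""
    hidden = f"{HIDDEN_PREFIX}{int(time.time())}.{path}"
    namespace[hidden] = namespace.pop(path)

def lazy_gc(namespace: dict, now: float) -> None:
    """Background scan: drop hidden files older than the retention period."""
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            ts = int(name.split(".")[2])
            if now - ts > RETENTION:
                del namespace[name]   # chunks become orphaned and are reclaimed later
```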
GFS Architecture: Server Requests
The client retrieves metadata from the master for the requested chunks
Read/write dataflows between client and chunkservers are decoupled from the master's control flow
The single master is no longer a bottleneck: its involvement in reads and writes is minimized:
• Clients communicate directly with chunkservers
• The master has to log operations as soon as they are completed
• Less than 64 bytes of metadata for each 64 MB chunk in the master's memory
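As a rough illustration of that last point (figures assumed, not from the slides): 1 PB of data stored in 64 MB chunks is about 16 million chunks; at under 64 bytes of metadata per chunk, that is less than about 1 GB of chunk metadata, which fits comfortably in the master's RAM.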
GFS Architecture: Reading
[Diagram: read flow]
1. The application asks the GFS client for a byte range of a file
2. The client translates it into a (file name, chunk index) pair and sends it to the master
3. The master replies with the chunk handle and the locations of the replicas
4. The client requests the data from one of the chunkservers, passing the chunk handle and the byte range
5. The chunkserver returns the data, and the client hands the data from the file back to the application
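A minimal sketch of the read path just described, seen from the client side. The RPC helpers are placeholder stubs, not the real GFS API, which the slides do not name; only the single-chunk case is shown.

```python
CHUNK_SIZE = 64 * 1024 * 1024

# Placeholder RPC stubs standing in for the actual master/chunkserver protocol.
def master_lookup(path, chunk_index):
    return 0xABCD, ["chunkserver-1", "chunkserver-2", "chunkserver-3"]

def chunkserver_read(server, handle, chunk_offset, length):
    return b"\x00" * length   # stand-in for the actual data transfer

def gfs_read(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes of `path` starting at `offset` (single-chunk case for brevity)."""
    chunk_index = offset // CHUNK_SIZE
    # Ask the master for the chunk handle and replica locations (steps 2-3 above)
    handle, replicas = master_lookup(path, chunk_index)
    # Read the byte range directly from one replica (steps 4-5); the master is not involved
    return chunkserver_read(replicas[0], handle, offset % CHUNK_SIZE, length)

print(len(gfs_read("/logs/web.0", 300 * 1024 * 1024, 4096)))   # -> 4096
```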
GFS Architecture: Writing
[Diagram: write flow]
1. The application passes the file name and the data to the GFS client, which sends a (file name, chunk index) pair to the master
2. The master replies with the chunk handle and the locations of the primary and secondary replicas
3. The client pushes the data to all replicas, which buffer it
4. The client sends the write command to the primary chunkserver
5. The primary applies the mutation, forwards the write command to the secondaries and collects their ACKs
6. The primary sends an ACK back to the client
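The same flow, sketched from the client side. Function names and the ACK handling are assumptions for illustration; in the real system the data push is also pipelined along a chain of chunkservers rather than sent to each replica independently.

```python
# Placeholder stubs for the client-side write flow; not the real GFS API.
def master_lookup(path, chunk_index):
    return 0xABCD, "primary-cs", ["secondary-cs-1", "secondary-cs-2"]

def push_data(server, handle, data):
    pass   # the replica buffers the data until the write command arrives

def send_write(primary, handle, secondaries):
    # The primary assigns a serial order to the mutation, applies it locally,
    # forwards the write command to the secondaries and collects their ACKs.
    return "ACK"

def gfs_write(path, chunk_index, data):
    # Steps 1-2: get the primary and secondary replica locations from the master
    handle, primary, secondaries = master_lookup(path, chunk_index)
    # Step 3: push the data to every replica (buffered, not yet applied)
    for server in [primary, *secondaries]:
        push_data(server, handle, data)
    # Steps 4-6: tell the primary to commit; it orders the write and gathers the ACKs
    return send_write(primary, handle, secondaries)

print(gfs_write("/logs/web.0", 0, b"record"))   # -> ACK
```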
Fault Tolerance
GFS has its own relaxed consistency model:
• Consistent: all replicas have the same value
• Defined: each replica reflects the performed mutation
GFS is highly available:
• Fast recovery (machines reboot quickly)
• Chunks replicated at least 3 times (take that, RAID-6)
• Shadow masters
• Data integrity checked with a checksum for each 64 KB block of a chunk (sketch below)
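A small sketch of the block-level checksumming idea, using CRC32 from Python's standard library as a stand-in; the actual checksum function used by GFS is not specified on the slides.

```python
import zlib

BLOCK = 64 * 1024   # 64 KB blocks, as on the slide

def chunk_checksums(chunk: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, stored: list[int]) -> bool:
    """On every read, recompute the checksums and compare them with the stored ones."""
    return chunk_checksums(chunk) == stored

data = b"x" * (3 * BLOCK)
sums = chunk_checksums(data)
print(verify(data, sums))                 # True
print(verify(data[:-1] + b"y", sums))     # False: corruption detected
```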
Benchmarking: small cluster
GFS tested on a small cluster:
1 master, 16 chunkservers, 16 clients
Server machines connected to a 100 Mbit/s central switch
Same for the client machines
The two switches are connected to a 1 Gbit/s switch
Benchmarking: small cluster
              Read rate    Write rate
1 client      10 MB/s      6.3 MB/s
16 clients    6 MB/s       2.2 MB/s
Network limit: 12.5 MB/s (one 100 Mbit/s link)
Benchmarking: real-world cluster
Cluster A: 342 PCs
Used for research and development
Tasks last a few hours, reading TBs of data, processing them and writing results back
Cluster B: 227 PCs
Continuously generates and processes multi-TB data sets
Typical tasks last much longer than the tasks on cluster A
Benchmarking: real-world cluster
Cluster                     A                      B
Chunkservers #              342                    227
Available disk space        72 TB                  180 TB
Used disk space             55 TB                  155 TB
# of files                  735,000                737,000
# of chunks                 992,000                1,550,000
Metadata at chunkservers    13 GB                  21 GB
Metadata at master          48 MB                  60 MB
Read rate (network limit)   580 MB/s (750 MB/s)    380 MB/s (1300 MB/s)
Write rate                  30 MB/s                100 MB/s (x3 network traffic due to replication)
Master ops                  202~380 ops/s          347~533 ops/s
Benchmarking: recovery time
One chunkserver killed in cluster B:
o This chunkserver had 15,000 chunks containing 600 GB of data
o All chunks were restored in 23.2 minutes, at a replication rate of about 440 MB/s
Two chunkservers killed in cluster B:
o Each with 16,000 chunks and 660 GB of data; 266 chunks were left with a single replica
o These 266 chunks were replicated at a higher priority within 2 minutes
Comparisons to other models
GFS compared with RAID/xFS, GPFS, AFS and NASD:
• Spreads file data across storage servers
• Simpler: uses only replication for redundancy
• Location-independent namespace (like AFS)
• Centralized approach rather than distributed management (unlike xFS)
• Commodity machines instead of network-attached disks (unlike NASD)
• Lazily allocated fixed-size chunks rather than variable-length objects (unlike NASD)
Conclusion
GFS demonstrates how to support large-scale processing workloads on commodity hardware:
• designed to tolerate frequent component failures
• optimised for huge files that are mostly appended to and then read sequentially
It has met Google's storage needs, and is therefore good enough for them
GFS has massively influenced computer science in the last few years