
Page 1

Cluster-based Storage

Antonio Cesarano, Bonaventura Del Monte

Università degli studi di Salerno

16th May 2014

Advanced Operating Systems

Prof. Giuseppe Cattaneo

Page 2

Agenda

• Context
• Goals of design
• NASD
• NASD prototype
• Distributed file systems on NASD
• NASD parallel file system
• Conclusions

A Cost-Effective, High-Bandwidth Storage Architecture

[Garth A. Gibson, David F. Nagle, Khalil Amiri, Jeff Butler, Fay W. Chang, Howard Gobioff, Charles Hardin, Erik Riedel, David Rochberg, Jim Zelenka, 1997-2001]

Page 3

Agenda

The Google File System

• Motivations
• Architecture
• Benchmarks
• Comparisons and conclusions

[Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 2003]

Page 4

Context - 1998

New drive attachment technologies (e.g., Fibre Channel) and new network standards

I/O-bound applications:
• Streaming audio/video
• Data mining

Page 5

Context - 1998

Cost-ineffective storage servers

Excess of on-drive transistors

Page 6

Context - 1998

[Diagram: big files are split by a controller across multiple storage devices (Storage1, Storage2, …, StorageN)]

Page 7

Goal

No traditional storage file server

Cost-effective bandwidth scaling

Page 8

What is NASD?

Network-Attached Secure Disk

• Direct transfer to clients
• Secure interfaces via cryptographic support
• Asynchronous oversight
• Variable-size data objects mapped to blocks by the drive
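To give an idea of how the «secure interfaces via cryptographic support» and «asynchronous oversight» fit together: a file manager hands the client a capability protected by a key shared with the drive, so the drive can check each request locally, without consulting the file manager. A minimal Python sketch, with illustrative names and a simplified capability format:

import hmac, hashlib, time

DRIVE_KEY = b"key shared by file manager and drive"  # assumed provisioning

def make_capability(object_id, rights, expiry):
    # File manager side: bind object id, rights and expiry with a keyed MAC.
    msg = f"{object_id}|{rights}|{expiry}".encode()
    return {"object_id": object_id, "rights": rights, "expiry": expiry,
            "mac": hmac.new(DRIVE_KEY, msg, hashlib.sha256).hexdigest()}

def drive_check(cap, op):
    # Drive side: recompute the MAC and check rights and expiry locally,
    # with no round trip to the file manager (asynchronous oversight).
    msg = f"{cap['object_id']}|{cap['rights']}|{cap['expiry']}".encode()
    good = hmac.compare_digest(
        cap["mac"], hmac.new(DRIVE_KEY, msg, hashlib.sha256).hexdigest())
    return good and op in cap["rights"] and time.time() < cap["expiry"]

cap = make_capability(42, "r", time.time() + 3600)  # 1-hour read capability
assert drive_check(cap, "r") and not drive_check(cap, "w")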

Page 9

Network-Attached Secure Disk Architecture

Page 10

NASD prototype

• Based on the Unix inode interface
• Network with 13 NASDs
• Each NASD runs on:
  • DEC Alpha 3000, 133 MHz, 64 MB RAM
  • 2 x Seagate Medalist on a 5 MB/s SCSI bus
  • Connected to 10 clients by ATM (155 Mb/s)
• Ad hoc handling modules (16K LOC)

Page 11

NASD prototype: test results

It scales!

Page 12

DFS on NASD

Porting NFS and AFS to the NASD architecture:

o OK, no performance loss
o But there are concurrency limitations

Solution: a new higher-level parallel file system must be used…

Page 13

NASD parallel file system

Scalable I/O low-level interface

Cheops as the storage management layer:
• Exports the same object interface as the NASD devices
• Maps these objects, including striped objects, to objects on the devices
• Supports concurrency control for multi-disk accesses
(10K LOC)
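To illustrate how a layer like Cheops can map a striped object onto per-drive objects, here is a minimal RAID-0-style sketch in Python; the stripe unit, drive count, and function name are assumptions, not Cheops' actual layout metadata:

STRIPE_UNIT = 64 * 1024  # bytes per stripe unit (assumed)
N_DRIVES = 4             # NASD drives in the stripe group (assumed)

def locate(logical_offset):
    # Map an offset in the striped object to (drive, offset in drive object).
    unit, within = divmod(logical_offset, STRIPE_UNIT)
    drive = unit % N_DRIVES        # stripe units rotate across the drives
    local_unit = unit // N_DRIVES  # position inside that drive's object
    return drive, local_unit * STRIPE_UNIT + within

# Example: byte 200,000 falls in stripe unit 3, so it lands on drive 3.
print(locate(200_000))  # -> (3, 3392)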

Page 14

NASD parallel file system test: clustering data-mining application

*Each NASD drive provides 6.2 MB/s

Page 15

Conclusions

• High scalability
• Direct transfer to clients
• Working prototype
• Usable with existing file systems

But… very high costs:
• Network adapters
• ASIC microcontroller
• Workstation

increasing the total cost by over 80%

Page 16

Change

From here…

Page 17

The Google File System

• Started with their search engine
• They provided new services like:
  Google Video, Gmail, Google Maps and Earth, Google App Engine … and many more

Page 18

Design overview

Observing common operations in Google applications led the developers to make several assumptions:

• Multiple clusters distributed worldwide
• Fault tolerance and auto-recovery need to be built into the system, because component failures are frequent
• A modest number of large files (100+ MB or multi-GB)
• Workloads consist of either large streaming reads or small random reads, while write operations are sequential and append large quantities of data to files
• Google applications and GFS should be co-designed
• Producer-consumer pattern

Page 19

GFS Architecture

[Diagram: the client sends a metadata request to the master, which keeps all metadata in RAM, and gets a metadata response; read/write requests and responses then flow directly between the client and the chunkservers, which store chunks as files in their local Unix file systems]

Page 20

GFS Architecture: Chunks

Similar to standard file system blocks, but much larger: 64 MB (configurable)

Advantages:

• Reduces the clients' need to contact the master
• A client may perform many operations on a single chunk
• Fewer chunks mean less metadata in the master
• No internal fragmentation, thanks to lazy space allocation
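A sketch of why large chunks cut master traffic: the client derives the chunk index from a byte offset and caches the master's answer, so repeated operations inside the same 64 MB chunk need no further master contact (the cache shape below is illustrative):

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

def chunk_index(offset):
    # Which chunk of the file holds this byte offset?
    return offset // CHUNK_SIZE

# Client-side cache: (file name, chunk index) -> (chunk handle, replicas).
# Reading 64 MB sequentially touches one entry, i.e. one master lookup.
cache = {}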

Page 21

GFS Architecture: Chunks

Disadvantages:

• Small files, consisting of a few chunks, may be accessed many times and become hot spots
• Not a major issue, since Google applications mostly read large multi-chunk files sequentially
• Moreover, this can be mitigated with a higher replication factor

Page 22

GFS Architecture: Master

A single process running on a separate machine

Stores all metadata in its RAM:

• File and chunk namespace
• Mapping from files to chunks
• Chunk locations
• Access control information and file locking
• Chunk versioning (snapshot handling)
• And so on…
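A rough sketch of how those mappings can sit in the master's RAM; the field names are assumptions, only the size bound (under 64 bytes of metadata per chunk, stated later in the deck) comes from the paper:

from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int   # globally unique chunk handle
    version: int  # version number, used to detect stale replicas
    locations: list = field(default_factory=list)  # chunkserver addresses

@dataclass
class FileInfo:
    chunk_handles: list = field(default_factory=list)  # ordered per file

namespace = {}  # full path name -> FileInfo
chunks = {}     # chunk handle  -> ChunkInfo

# At <64 bytes of metadata per 64 MB chunk, 1 PB of data (~16M chunks)
# fits in roughly 1 GB of master RAM.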

Page 23

GFS Architecture: Master

The master has the following responsibilities:

• Chunk creation, re-replication, rebalancing and deletion, for:
  o Balancing space utilization and access speed
  o Spreading replicas across racks to reduce correlated failures (usually 3 copies of each chunk)
  o Rebalancing data to smooth out storage and request load
• Persistent and replicated logging of critical metadata updates

Page 24

GFS Architecture: M - CS Communication

Master and chunkservers communicate regularly so that the master can track their state:

o Is the chunkserver down?
o Are there disk failures on any chunkserver?
o Are any replicas corrupted?
o Which chunk replicas does a given chunkserver store?

Moreover, the master handles garbage collection and deletes «stale» replicas:

o The master logs the deletion, then renames the target file to a hidden name
o A lazy GC removes the hidden files after a given amount of time
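A small sketch of that lazy deletion scheme; the hidden-name format is illustrative, while the three-day grace period is the paper's default:

import time

GRACE = 3 * 24 * 3600  # seconds before a hidden file is reclaimed

def delete(namespace, path):
    # Log the deletion (omitted here), then rename the file to a hidden
    # name that records the deletion timestamp.
    hidden = f".deleted.{int(time.time())}.{path}"
    namespace[hidden] = namespace.pop(path)

def gc_scan(namespace):
    # Background scan: drop hidden files older than the grace period.
    now = time.time()
    for name in [n for n in namespace if n.startswith(".deleted.")]:
        if now - int(name.split(".")[2]) > GRACE:
            del namespace[name]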

Page 25

GFS Architecture: Server Requests

• The client retrieves metadata from the master for the requested file
• Read/write data flows between client and chunkservers, decoupled from the master's control flow
• The single master is no longer a bottleneck, since its involvement in reads and writes is minimized:
  o Clients communicate directly with chunkservers
  o The master only has to log operations as soon as they are completed
  o Less than 64 BYTES of metadata are kept for each 64 MB chunk in master memory

Page 26

GFS Architecture: Reading

[Diagram: the application, the GFS client, the master (metadata in RAM), and three chunkservers, before the read starts]

Page 27

GFS Architecture: Reading

[Diagram: the application passes (file name, byte range) to the GFS client; the client sends (file name, chunk index) to the master, which replies with the chunk handle and the replica locations]

Page 28

GFS Architecture: Reading

[Diagram: the client sends (chunk handle, byte range) to one chunkserver, which reads the data from its local file and returns it; the client hands the data to the application]
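The three diagrams above correspond to a read path like the following Python sketch; the method names (lookup, read) and the choice of the first replica are assumptions, not the real GFS client API:

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, filename, offset, length):
    # 1. The application hands (name, byte range) to the GFS client,
    #    which converts the byte offset into a chunk index.
    index = offset // CHUNK_SIZE
    # 2. The client asks the master for (name, chunk index) and receives
    #    the chunk handle plus the replica locations (and caches them).
    handle, replicas = master.lookup(filename, index)
    # 3. The client fetches the data directly from one chunkserver;
    #    no file data ever flows through the master.
    #    (Assumes the range falls inside a single chunk.)
    return replicas[0].read(handle, offset % CHUNK_SIZE, length)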

Page 29

GFS Architecture: Writing

[Diagram: the application, the GFS client, the master (metadata in RAM), one primary chunkserver and two secondary chunkservers, each with a buffer and a chunk]

Page 30

GFS Architecture: Writing

[Diagram: the application passes (file name, data) to the GFS client; the client sends (file name, chunk index) to the master, which replies with the chunk handle and the locations of the primary and secondary replicas]

Page 31

GFS Architecture: Writing

[Diagram: the client pushes the data to the primary and secondary chunkservers, which hold it in their buffers]

Page 32

GFS Architecture: Writing

[Diagram: the client sends the write command to the primary chunkserver, which applies the buffered data to its chunk and forwards the command to the secondaries]

Page 33

GFS Architecture: Writing

[Diagram: the secondary chunkservers apply the mutation and send ACKs to the primary, which sends a final ACK to the GFS client]
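The five write diagrams correspond to a path like this sketch, which separates the bulk data push from the small control messages; the method names are illustrative, not the real GFS client API:

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_write(master, filename, offset, data):
    index = offset // CHUNK_SIZE
    # The master returns the primary (which holds the chunk lease)
    # and the secondary replicas for this chunk.
    handle, primary, secondaries = master.lookup_for_write(filename, index)
    # 1. Push the data into every replica's buffer; in GFS this transfer is
    #    pipelined along a chain of chunkservers and never touches the master.
    for server in [primary] + secondaries:
        server.push(handle, data)
    # 2. Send the (small) write command to the primary, which assigns the
    #    mutation a serial order and forwards the command to the secondaries.
    primary.write(handle, offset % CHUNK_SIZE)
    # 3. Secondaries apply the mutation in that order and ACK the primary,
    #    which finally ACKs the client.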

Page 34

Fault Tolerance

GFS has its own relaxed consistency model:

• Consistent: all replicas have the same value
• Defined: consistent, and each replica reflects the performed mutation in its entirety

GFS is highly available:

• Fast recovery (machines reboot quickly)
• Chunks replicated at least 3 times (take that, RAID-6)
• Shadow masters
• Data integrity checked with a checksum for each 64 KB block of a chunk
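A sketch of that per-block integrity check: each chunk is covered by one checksum per 64 KB block, so a chunkserver can verify just the blocks that overlap a read (the paper uses 32-bit checksums; CRC32 stands in for the real function here):

import zlib

BLOCK = 64 * 1024  # checksum granularity within a chunk

def block_checksums(chunk):
    # One 32-bit checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk, stored):
    # Recompute and compare before returning data to a client.
    return block_checksums(chunk) == stored

data = b"x" * (3 * BLOCK)
assert verify(data, block_checksums(data))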

Page 35

Benchmarking: small cluster

GFS tested on a small cluster:

• 1 master
• 16 chunkservers
• 16 clients

Server machines are connected to a 100 Mbit/s central switch; the same holds for the client machines. The two switches are connected by a 1 Gbit/s link.

Page 36

Benchmarking: small cluster

              Read rate   Write rate
1 client      10 MB/s     6.3 MB/s
16 clients    6 MB/s      2.2 MB/s

Rates are per client; a 100 Mbit/s link imposes a network limit of about 12.5 MB/s per client.

Page 37

Benchmarking: real-world clusters

Cluster A: 342 PCs

• Used for research and development
• Typical tasks last a few hours, reading TBs of data, processing them, and writing the results back

Cluster B: 227 PCs

• Continuously generates and processes multi-TB data sets
• Typical tasks last longer than cluster A's tasks

Page 38

Benchmarking: real-world clusters

Cluster                 A                       B
Chunkservers #          342                     227
Available disk space    72 TB                   180 TB
Used disk space         55 TB                   155 TB
# of files              735,000                 737,000
# of chunks             992,000                 1,550,000
Metadata at CSs         13 GB                   21 GB
Metadata at master      48 MB                   60 MB
Read rate               580 MB/s (750 MB/s)     380 MB/s (1300 MB/s)
Write rate              30 MB/s                 100 MB/s (x3 network traffic due to replication)
Master ops              202~380 ops/s           347~533 ops/s

Page 39

Benchmarking: recovery time

One chunkserver killed in cluster B:

o This chunkserver had 15,000 chunks containing 600 GB of data
o All chunks were restored in 23.2 minutes, at a replication rate of 440 MB/s

Two chunkservers killed in cluster B:

o Each had 16,000 chunks and 660 GB of data; 266 chunks were left with a single replica
o These 266 chunks were replicated at a higher priority within 2 minutes

Page 40

Comparisons to other models

Compared with RAIDxFS, GPFS, AFS and NASD, GFS:

• spreads file data across storage servers
• is simpler, using only replication for redundancy
• provides a location-independent namespace
• takes a centralized approach rather than distributed management
• uses commodity machines instead of network-attached disks
• uses lazily allocated fixed-size blocks rather than variable-length objects

Page 41

Conclusion

GFS demonstrates how to support large-scale processing workloads on commodity hardware:

• designed to tolerate frequent component failures
• optimised for huge files that are mostly appended to and then read

It has met Google's storage needs, and is therefore good enough for them.

GFS has massively influenced computer science in the last few years.