
The HP AutoRAID Hierarchical Storage System

John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan

“virtualized disk gets smart…”


File System Recap
o OS manages storage of files on storage media using a File System
o storage media:
    o comprised of an array of data units, called sectors
o File System:
    o organizes sectors into addressable storage units
    o establishes directory structure for accessing files
o FFS and LFS were both developed as improvements over previous file systems
    o improved performance by optimizing access
o FFS:
    o increased block size to reduce # of block addresses managed in directory
    o logically grouped cylinders to help ensure locality for blocks of a file
o LFS:
    o avoids most seek time on writes by always writing at the end of the log
    o introduced new addressable structure called extents
    o an extent is a large contiguous set of blocks
    o need extents so as to have plenty of room at end of log for writing new entries
    o requires Garbage Collection of old log entries
    o live blocks of partially filled extents are migrated to other extents to free up space


Crash Recovery
o issue is consistency of directory data after a crash or power failure
o directory information typically written after the file data is written
o FFS:
    o after a crash you have no way of knowing what you were last doing
    o requires a consistency check
    o all inode information must be verified against the data it maps to
    o inconsistencies cannot always be repaired, data can be lost
o LFS:
    o drastically reduces time to recover because of checkpointing
    o checkpoint = noted recent time when files and inode map were consistent
    o verify by rolling forward through log from last checkpoint
    o LFS keeps lots of other metadata information and stores some of it with the file
    o increased odds of restoring consistency
o But neither can recover from a hardware failure…


RAID! (round about the 1980’s)
o Redundant Array of Inexpensive (or Independent) Disks
    o connect multiple cheap disks into an ARRAY of disks, spread data across them!
    o a single disk has less reliability than an array of smaller drives with redundancy
o Virtualization!
    o multiple disks but the File System sees only one virtual unit (doesn’t know it’s virtual!)
    o requires an ARRAY CONTROLLER, a combination of hardware and software
    o handles mapping between where the FS thinks data is and where it actually is
o Redundancy!
    o partial, like parity
    o full, like an extra copy
    o if a single drive in the array is lost, its data can be automatically regenerated
    o no longer have to worry too much about drives failing!


RAID Levels
o RAID 1 - Mirroring
    o full redundancy!
    o zero recovery time in case of disk failure, just use the copy
    o storage capacity = 50% of total size of array
    o writes are serialized at some level between the two disks
        o so that after a crash or power failure the two disks are not left in an inconsistent state
        o this makes writes slower than just writing to one disk
        o a write request does not return until both copies have been updated
    o transfer rate = same as one disk
    o parallel reads!
        o each copy can service a read request


RAID Levels
o RAID 3 - Byte level striping, parity on check disk
    o spread data by striping: byte1 -> disk1, byte2 -> disk2, byte3 -> disk3
    o reads and writes of a stripe’s bytes happen at the same time!
    o transfer rate = (N - 1) * transfer rate of one disk
    o only partial redundancy!
    o check disk stores parity information
    o parity overhead amounts to one bit per group of corresponding bits in a stripe
    o redundancy overhead = 1/N of total capacity (one disk out of N)
o Oops! Byte striping means every disk is involved in every request!
    o No parallel reads or writes


Parity
o parity is computed using XOR ( ^ ): parity = d1 ^ d2 ^ ... ^ dn
o a lost block is regenerated by XORing the surviving blocks with the parity
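
The XOR rule can be tried out on a few bytes. This is a minimal sketch, not AutoRAID's actual code: the helper names `xor_parity` and `reconstruct` and the sample byte values are made up for illustration.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])

# Three hypothetical data disks, two bytes each.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_parity(data)                 # what the check disk would hold

# Pretend the second disk fails and regenerate its contents.
lost = data[1]
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, the same function serves for computing parity and for regenerating a lost block.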


RAID Levels
o RAID 5 - Block level striping, parity interleaved
    o striping unit is 1 block: block1 -> disk1, block2 -> disk2, block3 -> disk3, etc.
    o blocks of a stripe are written at the same time!
    o transfer rate = (N - 1) * transfer rate of one disk
    o only partial redundancy!
    o parity information dispersed round-robin among all disks
    o same redundancy overhead as level 3 = 1/N of total capacity
o Hey! Block striping can mean that NOT every disk is involved in a (small) request
    o parallel reads and writes can occur, depending on which disks store the involved blocks
o BUT writes get slower!
    o this happened in RAID 3 too
    o read - modify - write:
        o read old data and old parity
        o recompute/modify parity
        o write data and parity
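
The read-modify-write steps can be sketched on an in-memory stripe. This is an illustrative sketch only: `small_write` and `xor_bytes` are hypothetical names, and the "disks" are plain Python lists rather than real devices.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(stripe, parity, index, new_block):
    """Update one data block of a stripe and return the new parity block.

    Only the target block and the parity are touched, which is why a
    small RAID 5 write costs two reads plus two writes.
    """
    old_block = stripe[index]                    # read old data
    delta = xor_bytes(old_block, new_block)      # what changed
    new_parity = xor_bytes(parity, delta)        # read old parity, modify it
    stripe[index] = new_block                    # write data
    return new_parity                            # caller writes parity

# Hypothetical one-byte blocks on three data disks.
stripe = [b"\x01", b"\x02", b"\x04"]
parity = b"\x07"                                 # 0x01 ^ 0x02 ^ 0x04
parity = small_write(stripe, parity, 1, b"\x08")
```

The shortcut works because removing the old block's contribution and adding the new one are both XORs, so the other data disks never have to be read.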


RAID 1 vs RAID 5
o Reads:
    o RAID 1 (mirroring):
        o always offers parallel reads
    o RAID 5:
        o can only sometimes offer parallel reads
        o depends on where the needed blocks are
        o two read requests that require blocks on the same disk must be serialized
o Writes:
    o RAID 1 (mirroring):
        o must complete two writes before request returns
        o granularity of serialization can be smaller than a file
        o can’t do parallel writes
    o RAID 5:
        o typically does read-modify-write to recompute parity
        o (HP AutoRAID uses combo of read-modify-write and LFS!)
        o can’t do parallel writes either
o Redundancy Overhead:
    o RAID 1 = full redundancy, storage capacity reduced by 50%
    o RAID 5 = partial redundancy, storage capacity reduced by 1/N (one disk’s worth out of N)
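
The overhead figures can be turned into usable capacity with a little arithmetic. This is a sketch under assumed numbers: the six 2 GB disks and the function names are hypothetical and do not come from the slides.

```python
def raid1_overhead():
    """Mirroring keeps a full second copy, so half the raw space is redundancy."""
    return 0.5

def raid5_overhead(n_disks):
    """One disk's worth of parity is spread over N disks."""
    if n_disks < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return 1 / n_disks

def usable_capacity(n_disks, disk_gb, overhead):
    """Raw capacity minus the fraction spent on redundancy."""
    return n_disks * disk_gb * (1 - overhead)

# Assumed example array: six disks of 2 GB each (12 GB raw).
mirrored_gb = usable_capacity(6, 2, raid1_overhead())   # half of the raw space
raid5_gb = usable_capacity(6, 2, raid5_overhead(6))     # five sixths of the raw space
```

The gap between the two results grows with the number of disks, which is the space argument for keeping only the active data mirrored.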


Storage Hierarchy = HP AutoRAID
o RAID 1 = fast reads and writes, but 50% redundancy overhead
o RAID 5 = strong reads, slow writes, 1/N storage overhead
o RAID 1 is fast but expensive, like a cache!
o RAID 5 is slower but cheaper, like main memory!
o Neither is optimum under all circumstances…
o SO create a hierarchy:
    o use mirroring for active blocks
        o active set = blocks of regularly read and written files
    o use RAID 5 for inactive blocks
        o inactive set = blocks of read-only and rarely accessed files
o Sounds hard!
    o Who pushes the data back and forth between the sets?
    o How often do you have to do it?
        o if the sets change too often, no time for anything else!


Who Minds the Storage Hierarchy?
o The System Administrator?
    o as long as you don’t have to pay them much
    o and if they get it right all the time and don’t make any mistakes
o The File System?
    o if so, big plus: File System knows better than anything who is using which files
    o can best determine active and inactive sets based on tracking access patterns
    o BUT, there are a lot of different OSes with different File System options
    o that makes deployment hard
    o each File System must be modified in order to manage a storage hierarchy
o An Array Controller?
    o embed the software to manage the hierarchy in the hardware of a controller
    o no deployment issues, just add the hardware to the system
    o overrules the existing File System…
    o lose the ability to track access patterns…
    o need a reliable and often correct policy for determining active/inactive sets…
    o sounds like virtualization…


HP AutoRAID (local hard drive gets smart!)
o array controller’s embedded software manages active/inactive sets
o application level user interface for configuration parameters
    o set up LUNs (virtual logical units)
o virtualization
    o File System is out of the loop!
o Consider Mapping:
    o File System thinks it is addressing the blocks of a particular file
    o doesn’t know the file is actually in a storage hierarchy
    o is the requested file in the active set?
    o Or inactive set?
    o which disk is it on?
    o need some mapping between what the file system sees and where data actually resides on disk


Virtual to Physical Mapping
o Physically:
    o the array is structured by an address hierarchy:
        o PEGs contain 3 or more PEXs
        o PEXs address 1MB worth of 128KB segments
        o a segment holds 2 Relocation Blocks (RBs)
    o PEXs are typically 1MB of contiguous disk space
    o Segments are 128KB of contiguous sectors
    o Relocation Blocks serve as the:
        o striping unit in RAID 5, the mirroring unit in RAID 1,
        o and as the unit of migration between active and inactive sets
o Virtually, the File System sees:
    o LUNs: Logical Units
    o purely virtual, no superblock, no directory, not a partition
    o rather, a LUN is a set of RBs that get mapped to physical segments when actually used
    o user can create as many LUNs as they want
o Each LUN has a virtual device table that holds the list of RBs assigned to it
o RBs in virtual device table are mapped to RBs in PEG tables
o PEG tables map RBs to PEX tables in which RBs are assigned to actual segments


Mapping
o if RB3 migrates from inactive to active, simply update the PEX mapping in the PEG table that maps RB3
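
The table lookups and the RB3 migration can be sketched with a few dictionaries. This is a simplified model, not the controller's real data structures: the `Array` and `Peg` classes, the field names, and the `(pex, segment, half)` location tuple are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Peg:
    storage_class: str                           # "mirrored" or "raid5"
    slots: dict = field(default_factory=dict)    # (lun, rb) -> (pex, segment, half)

class Array:
    def __init__(self):
        self.pegs = {}              # peg id -> Peg (stands in for the PEG tables)
        self.virtual_device = {}    # lun -> {rb -> peg id} (virtual device tables)

    def locate(self, lun, rb):
        """Follow LUN table -> PEG table to find where an RB physically lives."""
        peg = self.pegs[self.virtual_device[lun][rb]]
        return peg.storage_class, peg.slots[(lun, rb)]

    def migrate(self, lun, rb, dst_peg_id, dst_location):
        """Move an RB by rewriting table entries; the host-visible address is unchanged."""
        src = self.pegs[self.virtual_device[lun][rb]]
        del src.slots[(lun, rb)]                              # leaves a hole behind
        self.pegs[dst_peg_id].slots[(lun, rb)] = dst_location
        self.virtual_device[lun][rb] = dst_peg_id

array = Array()
array.pegs["peg_raid5"] = Peg("raid5")
array.pegs["peg_mirror"] = Peg("mirrored")
array.pegs["peg_raid5"].slots[("lun0", 3)] = ("pex7", 2, 0)
array.virtual_device["lun0"] = {3: "peg_raid5"}

before = array.locate("lun0", 3)                 # RB3 starts in the inactive set
array.migrate("lun0", 3, "peg_mirror", ("pex1", 0, 1))
after = array.locate("lun0", 3)                  # same LUN address, new location
```

The File System keeps asking for RB3 of the same LUN before and after the move; only the controller's tables change.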


How cool is that…
o What you can do when you’re not in control anymore..
o Hot-pluggable disks
    o take one out and RAID immediately begins regenerating missing data
    o or, if one fails, activate a spare, if available
    o array still functions, no down time
    o requests for missing data are given top priority for regeneration
o Create a larger array on the fly
    o size of array is limited to the size of the smallest disk
    o so take a small disk out and put a larger disk in
    o systematically replace all disks, one by one, letting each regenerate
    o when last bigger disk goes in, array is automatically larger


HP AutoRAID Read and Write Operations
o RAID 1 Mirrored Storage Class
    o normal RAID Level 1 reads and writes
    o 2 reads can happen in parallel
    o a write is serialized (at the segment level) between the two disks
    o both updates must complete before request returns (remember the overhead!)
o RAID 5 Storage Class
    o reads are processed as normal RAID 5 read operations
    o reads are parallel if possible
    o writes are log structured
    o when they happen is more complicated
    o RAID 5 Writes happen for 1 of 3 reasons:
        o a File System request tries to write data at RAID 5:
            o results in promotion of requested data to active set
            o (no actual write happens at RAID 5 in this case)
        o Mirrored storage class runs out of space:
            o so data is demoted from active to inactive, RBs copied from active to inactive
        o when garbage collecting and cleaning in RAID 5
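
The promotion and demotion rules can be sketched as a tiny two-level store. This is a toy model under stated assumptions: the `TwoLevelStore` class is hypothetical, capacity is counted in RBs, and the demotion victim is the least recently written RB, which the slides do not specify.

```python
from collections import OrderedDict

class TwoLevelStore:
    def __init__(self, mirrored_capacity):
        if mirrored_capacity < 1:
            raise ValueError("mirrored class needs room for at least one RB")
        self.mirrored_capacity = mirrored_capacity
        self.mirrored = OrderedDict()    # rb -> data, oldest write first
        self.raid5 = {}                  # rb -> data

    def write(self, rb, data):
        """Host writes always land in the mirrored class."""
        self.raid5.pop(rb, None)         # promotion: the RAID 5 copy becomes a hole
        self.mirrored.pop(rb, None)      # a rewrite moves the RB to the newest end
        while len(self.mirrored) >= self.mirrored_capacity:
            victim, victim_data = self.mirrored.popitem(last=False)
            self.raid5[victim] = victim_data     # demotion: the real RAID 5 write
        self.mirrored[rb] = data

    def read(self, rb):
        """Reads are served from whichever class currently holds the RB."""
        if rb in self.mirrored:
            return self.mirrored[rb]
        return self.raid5[rb]

store = TwoLevelStore(mirrored_capacity=2)
store.write("a", b"A1")
store.write("b", b"B1")
store.write("c", b"C1")      # mirrored class is full, so "a" is demoted
store.write("a", b"A2")      # writing "a" promotes it again and demotes "b"
```

In this model the only data that ever reaches RAID 5 arrives through demotion, matching the point that a host write aimed at RAID 5 data causes a promotion instead.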


Holes, Cleaning, and Garbage Collection
o Holes come from:
    o demotion of RBs from active to inactive leaves ‘holes’ in PEXs of mirrored class
        o holes are managed as a free list
    o promotion of RBs from inactive to active leaves holes in PEXs of RAID 5
        o by the way, RAID 5 in HP AutoRAID uses LFS…
        o so holes must be garbage collected
o Cleaning:
    o plug the holes
    o RBs are migrated between PEGs to fill some, empty others
    o cleaning mirrored class frees up PEGs to accommodate bursts or to give to RAID 5
    o cleaning RAID 5 is an alternative to garbage collection
o Garbage Collection:
    o normal LFS garbage collection
    o or can be hole plugging garbage collection to fill/free PEGs
        o this performs much better, reduces garbage collection by up to 90%!


Performance
o depends most on how much of the active set fits into the mirrored class
o if it all fits, then RAID 5 goes unused. Performance is that of a RAID 1 array
o tested OLTP against weaker RAID and JBOD
    o JBOD = just a bunch of disks, striped, no redundancy (so performs the best!)
    o tested with all of active set fitting in Mirrored Storage class
    o so no migration overhead
    o AutoRAID lags due to redundancy overhead
o tested performance for different %’s of active set at mirrored level
    o more disks = higher % at Mirrored Storage Class
    o obviously performance rises with higher % because less migration
    o interesting to note at 8 drives, when all of active set fits
        o performance rises because transfer rate is increasing, more disks to write to

Figure: transaction rate of OLTP for slow RAID, HP AutoRAID, and JBOD

Figure: transaction rate as the number of disks in AutoRAID increases


Can the File System help?
o File System sees virtual disk,
    o probably has its own ideas of how best to lay data to blocks to optimize access
    o perhaps by assigning RBs of a LUN to a linear set of contiguous blocks…
o BUT are they really going to be contiguous?
    o in the array, RBs can be mapped anywhere and most likely are not stored linearly
    o so does this make seek times really bad?
o ran tests where they initially set up array:
    o with all RBs laid out completely linearly
    o with all RBs laid out completely randomly
o Resulted in only modest performance gains for initial linear layout
    o note there is no way to migrate data between sets and maintain a linear layout..
o Conclusion:
    o the 64KB RB allocation block may sound big, but works just fine
    o remember, large block sizes amortize seek times
    o RB size is subject to same considerations as block size on a normal hard drive
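
The amortization argument can be made concrete with a back-of-the-envelope calculation. The disk parameters below (10 ms to position the head, 5 MB/s transfer rate) are assumed round numbers for illustration, not measurements from the paper, and `transfer_efficiency` is a hypothetical helper.

```python
def transfer_efficiency(block_bytes, positioning_ms, bandwidth_bytes_per_s):
    """Fraction of each access spent moving data rather than positioning the head."""
    transfer_ms = block_bytes / bandwidth_bytes_per_s * 1000
    return transfer_ms / (positioning_ms + transfer_ms)

POSITIONING_MS = 10.0        # assumed seek plus rotational delay
BANDWIDTH = 5_000_000        # assumed sustained transfer rate in bytes per second

small_block = transfer_efficiency(4 * 1024, POSITIONING_MS, BANDWIDTH)
rb_sized = transfer_efficiency(64 * 1024, POSITIONING_MS, BANDWIDTH)
```

Under these assumptions a 4KB access spends under a tenth of its time transferring data, while a 64KB RB spends more than half, so one seek per RB costs far less per byte even when RBs are scattered.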


Mirrored Storage Class Read Selection Algorithm
o which copy should be read?
o possibilities:
    o strict alternation
    o keep one disk head on the outer track, the other on the inner
    o read from the disk with the shortest queue
    o read from the disk with the shortest seek time
o strict alternation and inner/outer can give big benefits under certain workloads…
    o AND can really punish under other workloads
o shortest queue and shortest seek time yield same modest gain
    o but it is hard to track shortest seek time
    o so shortest queue wins
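
The shortest-queue policy is simple enough to sketch in a few lines. This is an illustrative model only: the `MirroredPair` class and its request strings are hypothetical, and sending ties to disk 0 is an arbitrary choice.

```python
class MirroredPair:
    """Two disks holding identical data, each with its own pending-request queue."""

    def __init__(self):
        self.queues = [[], []]

    def submit_read(self, request):
        """Send the read to the copy with fewer outstanding requests."""
        disk = min((0, 1), key=lambda d: len(self.queues[d]))
        self.queues[disk].append(request)
        return disk

    def complete(self, disk):
        """Remove and return the oldest request on a disk once it finishes."""
        return self.queues[disk].pop(0)

pair = MirroredPair()
first = pair.submit_read("read RB 10")     # both idle, tie goes to disk 0
second = pair.submit_read("read RB 11")    # disk 0 is busy, so disk 1
pair.complete(1)                           # disk 1 finishes first
third = pair.submit_read("read RB 12")     # disk 1 now has the shorter queue
```

Queue length is something the controller already knows, whereas the shortest-seek policy would need the current head position of each disk.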


Conclusion
o redundancy protects from data loss due to hardware failure
o different striping units and levels of redundancy result in different performance
    o performance depends on type of workload
o redundancy also introduces overhead
    o 50% for mirroring
o reduce redundancy overhead by using a storage hierarchy
    o implementing different RAID levels for active and inactive data
o storage hierarchy managed by an array controller
    o management software embedded onto hardware controller
    o special mapping virtualizes the array
    o File System sees one (or more) virtual logical units