Cutting the cord: Why we took the file system out of our storage nodes
Manish Motwani, IBM Cloud Object Storage (previously Cleversafe)


2016 Storage Developer Conference. © IBM Corporation. All Rights Reserved.

Cutting the cord: Why we took the file system out of our storage nodes

Manish Motwani IBM Cloud Object Storage

(previously, Cleversafe)

Presenter
Presentation Notes
Hi, and good afternoon, everyone. Thank you for attending this talk. My name is Manish Motwani and I'm from IBM Cloud Object Storage. I'm going to be presenting "Cutting the cord: Why we took the filesystem out of our storage nodes."

Agenda

• Storage Paradigms
• Object Storage (and IBM Cloud Object Storage)
• High-Level System Architecture
• Evolution of our Storage Implementations
    • File Storage
    • Packed Storage
    • Zone Storage (Raw Disk)
• Building an ideal "object system"
• Results and Conclusions

2

Presenter
Presentation Notes
In this presentation, I'll cover some of the existing storage paradigms to show the bigger picture of where object storage falls. I'll go over a high-level architecture of IBM's Cloud Object Storage and how the back-end storage implementations have evolved over time. Initially we stored all data on the file system, an approach we call File Storage: each object received by the Storage Node is stored as a separate file. About 5 years ago we created a new implementation called Packed Storage, in which we have a bunch of large files and all objects get stored, "packed", within them. Most of our current customer deployments are on Packed Storage. Most recently, for optimal performance and to support the larger SMR drives, we decided that we needed to forgo the filesystem entirely and store data directly on the raw disk. We call this implementation Zone Storage. I'll describe what we had to do to build such a system and show some of our results.

Storage Paradigms

• Block Interfaces, e.g. SCSI, iSCSI, FC

• NAS Interfaces, e.g. CIFS, NFS

• Object Interfaces, e.g. S3, OpenStack, IBM COS (main focus of this talk)

3

Presenter
Presentation Notes
There are various storage interfaces used in the industry. At a high level, they can be broken down by the types of data they store. There are block interfaces, which store fixed-size, small (512 bytes – 4K) entities addressed by a logical block address (or block number); examples include SCSI, iSCSI, and Fibre Channel. Next came NAS interfaces, which are basically file systems accessed over network protocols such as CIFS and NFS. The most recent paradigm to emerge has been object storage, which typically uses HTTP REST APIs to access arbitrarily sized BLOBs in a flat namespace. Amazon S3, OpenStack, and of course IBM Cloud Object Storage all support object interfaces. Building an efficient back-end for object storage will be the main focus of this talk.

Object Storage

4

Presenter
Presentation Notes
Objects come in various sizes and shapes. What we need is a storage back end that is flexible enough to store and retrieve all object sizes, while doing so efficiently.

What is Object Storage?

5

• Object storage provides a virtually infinite key/value mapping
    • A way of organizing unstructured data
    • No hierarchical organization – flat namespace

• Accessed directly by an application using a developer-friendly RESTful HTTP interface

• Only allows full writes/overwrites of objects – no partial or in-place modification

Presenter
Presentation Notes
So what is Object Storage? Object storage provides a virtually infinite key/value mapping of unstructured data, like photos, videos, music, and other media, over a flat namespace. It's generally accessed directly by an application using a REST API. Unlike files in a filesystem, objects are generally immutable. This will be an important factor when it comes to designing the ideal back-end for an "object system".

IBM Cloud Object Storage (COS)

• Extremely scalable object storage
• Erasure coded across nodes for durability

6

Presenter
Presentation Notes
IBM Cloud Object storage, based on technology pioneered by Cleversafe, is extremely scalable with no central service or single point of failure (SPOF). It provides an object interface where objects are erasure coded across storage nodes, to provide robust yet cost-effective storage. But for the purposes of this presentation, it makes little difference whether the object system is storing replicas or erasure coded pieces. If you would like to learn more about erasure codes, you should attend my colleague Jason Resch’s presentation tomorrow.

IBM COS Architecture

7

Presenter
Presentation Notes
At a very high level, this is the IBM Cloud Object Storage architecture. On the top left you have the users, sending and receiving data via client servers, which talk to the access layer over some load balancers, which then sends and receives data from the storage nodes, which are all the way on the right. We also have a management node which allows an admin to configure reliability, users and their containers, and view status of the access layer and the storage nodes.

Storage Nodes: Very Dense JBODs

8

Presenter
Presentation Notes
This is what a typical storage node looks like, with hard drives sitting vertically in the box as shown here. Our storage nodes aim to support very high storage densities: as many drives per “U” as possible. Correspondingly the drives need to hold as much data as possible, no SSDs, no 10K RPM drives. SMR is preferred... Helium is preferred. It is not uncommon for our storage nodes to have 50 or 100 disks. The higher the density, the cheaper storage we can provide on a per-GB measure.

What Storage Nodes Store

• Objects are split into segments, which are stored across multiple storage nodes for reliability
• Segmentation allows HTTP ranged reads

• Let's call what storage nodes receive "slices"
• Storage nodes provide the following API:
    Store(slice name, slice data)
    Slice data = Get(slice name)
    Slice names = List(starting slice name, limit)

9

Presenter
Presentation Notes
Before the data reaches the storage nodes, it is broken into segments by the access layer. Segmenting objects provides several benefits: it enables objects to be partially read at given offsets, thus supporting HTTP ranged reads. It also allows large objects to be spread across many different nodes and drives to prevent overloading any one node or drive. And finally, it enables accessing encrypted or compressed objects without having to retrieve the entire object first. Let's call what the storage nodes receive "slices". Each object segment is stored across many nodes, either as replicas or as erasure-coded fragments of the segment. The storage node is responsible for remembering slices sent to it, and returning them upon a future request. Listing is needed to support rebuilding, migration, and other maintenance and data-health tasks.
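To make the storage node contract concrete, here is a minimal sketch of the three-call API described above, written as a Java interface. The names and types are illustrative assumptions, not the actual IBM COS interfaces.

```java
// Illustrative sketch of the three-call storage node API described above.
// Names and types are assumptions, not the actual IBM COS interfaces.
import java.util.List;

public interface SliceStore {
    /** Persist a slice under its name; full overwrite only, no partial update. */
    void store(byte[] sliceName, byte[] sliceData);

    /** Return the slice data previously stored under this name, or null if absent. */
    byte[] get(byte[] sliceName);

    /** Ordered listing used by rebuild/migration: up to `limit` names >= startingSliceName. */
    List<byte[]> list(byte[] startingSliceName, int limit);
}
```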

Storage Node Goals

• Efficiently satisfy storage requests
    • Hold up to billions of slices per drive
    • Minimize storage overhead

• Allow fast storage and retrieval of slices when requested
    • Ideally having at most one seek per slice read
    • Ideally zero seeks (amortized) per write

• Reliability and consistency
    • Recover from unclean shutdowns quickly
    • Support efficient ordered listing of slices held

10

Presenter
Presentation Notes
Our storage node requirements include an efficient support of the API. Each drive can have 100s of millions or billions of slices and the storage nodes can store 10s of billions of slices. So it's important to have low storage overhead. We want to allow fast storage and retrieval of slices, ideally only requiring a single seek per slice read and almost no seeks for writes. The storage nodes are also responsible for the reliability and consistency of slices stored, which means recovering from unclean shutdowns like power outages, and an efficient ordered listing of the slices.

Evolution of our Storage Implementations

11

Presenter
Presentation Notes
At this point I'll get into how our storage back-end implementations have evolved over several years.

Storage Back-end Progression

Our back-end has progressed through 3 phases:

• File Storage: minimal effort, okay performance, some overhead
• Packed Storage: significant effort, fast and with minimal overhead
• Zone Storage: Herculean effort, the most optimal "object system"

12

Presenter
Presentation Notes
We started with a paradigm – file storage – that was entirely based on the file system. This was easy, as the file system provided a lot for free. However, a file system is not a great match for storing objects. Our second approach, Packed Storage, is largely not reliant on the file system for locating or accessing slices. It implements the lookup and storage using our own data structures, but it's still storing these data structures on the disk over a filesystem, and hence, still leaves further room for optimization. The third step of optimization, Zone Storage, likely would not have been necessary, except for a new storage technology that has entered the horizon: SMR drives. SMR drives are extremely inefficient for in-place overwrites, and existing filesystems do not play well with them. This presented a problem for us, as SMR drives are the most data-dense drives that will exist in the near future. How are we to maintain our performance goals and still meet our data density goals? --- This required cutting the cord, and dropping the file system altogether from our storage nodes.

Feature Comparison

Feature             File Storage   Packed Storage   Zone Storage
Directories         EXT4           N/A              N/A
Read Cache          EXT4           EXT4             ZS
Write Cache         EXT4           EXT4             ZS
Crash Recovery      EXT4           EXT4 and PS      ZS
File Attributes     EXT4           N/A              N/A
Name Listing        EXT4           PS               ZS
Defragmentation     EXT4           PS               ZS
Permissions         EXT4           N/A              N/A
Metrics and Usage   EXT4           EXT4 and PS      ZS
Tools               EXT4           PS               ZS

13

Presenter
Presentation Notes
This Table shows what the file system does for us, which is a lot. Significant effort, many decades have gone into optimizing file systems. To abandon the file system entirely is not without its costs, it meant having to re-implement from scratch all of the things the file system had given us for free. In green, we can see everything EXT4 provided for us that was the basis for our initial implementation. As we progressed to Packed Storage, there were some things the file system provided that were not needed, shown in black with the N/A. We still used the read and write caches provided by the file system, but without directories we needed our own mechanisms for listing. In addition we also needed to implement our own crash recovery and usage tracking on top of those still provided by the file system. With Zone Storage, we get nothing. Every feature or capability we need, required us to implement it ourselves. We effectively were building an object system from scratch; one optimized for storage of immutable blobs on an SMR drive (which doesn’t allow overwrites). At this point I'll get into greater details of how each of these implementations actually work for supporting our Storage Node API.

First Solution: Using the Filesystem

14

Presenter
Presentation Notes
Our first approach, back in 2006 was to use the filesystem directly to store slices, storing each slice in its own file. The result was not too different from what you see in this picture. This is what a large filesystem looks like. A bunch of assorted files stuffed in folders, stuffed inside still other folders. As you might imagine, it takes a long time to find anything.

What a Filesystem Provides

• Directory structure
• Cache
• Journal
• Metadata storage (file and directory)
• Directory listing
• Defragmentation
• User permissions
• Usage metrics, tools like df/du

15

Presenter
Presentation Notes
But a filesystem is a great vehicle because it shields you from a lot of low-level complexity. It provides a hierarchical directory structure, a built-in caching mechanism, a journal (for crash recovery), metadata storage, listing capability, user permissions, usage metrics and tools. When we started off, we didn't really need anything more than this. This is partly because at the time we didn't have any really large-scale deployments – our largest deployments were in 10s or 100s of terabytes and we used to have smaller, 12-disk storage nodes. Since a filesystem tries to optimize for the general use case, it never did a great job for us (e.g. we don't need a directory structure at all; we don't need user permissions – all data is accessed by a single application running as a single user). I'll get into some of the limitations of using a filesystem later in the presentation.

File Storage explained

Store each slice as a file. Convert the slice name to a file path as follows:

• Base-64 encode the name (24 bytes ⇒ 32 chars)
• Create a two-level directory structure:
    • First level = first 2 chars of the base-64 name
    • Second level = next 2 chars of the base-64 name
    • File name = remaining 28 chars of the base-64 name

16

Presenter
Presentation Notes
In this approach, we essentially create a new file for each slice that is received by the node. A 24-byte slice name is converted into base 64, which is 32 characters. The first two characters make the first level directory name, the next two characters make the next level directory name and finally, the remaining 28 characters make up the file name. Creating multiple levels of subdirectories is a trade off. You don’t want directories with millions or billions of files in them, so creating sub directories in this way reduces the number of slices to a few thousand per directory. However, the more sub-directories there are, the more seeks you potentially have to do to open a file when those directories are not in cache.
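A small sketch of that name-to-path mapping in Java. It assumes a URL-safe Base64 alphabet so the encoded name cannot contain a path separator; the exact encoding alphabet used in the product is not stated in the slides.

```java
// Sketch of the file-per-slice path mapping described on this slide.
// A URL-safe Base64 alphabet is assumed so the encoded name contains no '/'.
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public final class FileStoragePaths {
    public static Path pathFor(Path mountPoint, byte[] sliceName24Bytes) {
        // 24 bytes -> 32 Base64 characters (no padding needed).
        String encoded = Base64.getUrlEncoder().withoutPadding()
                               .encodeToString(sliceName24Bytes);
        String level1 = encoded.substring(0, 2);   // first-level directory
        String level2 = encoded.substring(2, 4);   // second-level directory
        String file   = encoded.substring(4);      // remaining 28 characters
        return mountPoint.resolve(Paths.get(level1, level2, file));
    }
}
```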

File Storage Overhead

• Slices are generally in the 10KB – 1MB range
• Each slice requires a 256-byte inode header
    • In memory this ends up being about 350 bytes of RAM
• If the inode is not cached, it requires at least 2 seeks per read
• If the containing directory is not cached, it can be 4 or more seeks

17

Presenter
Presentation Notes
Slices that are received on the storage nodes are generally in the 10KB to 1MB range – they could be smaller but usually not larger because they get segmented and erasure coded on the access layer. Each slice write requires an inode write, a directory write and the file write. If the two directory levels are not already created, those also need to be created. As you can see, in this scheme we can end up seeking quite a bit for writes. For reads, if neither the directory nor the file inode are in the cache, we can seek up to 4 times before reading the data. Plus if the file is fragmented, it could be even more.

File Storage Gotchas

• Multiple seeks for slice writes
    • In the worst case, requires creating 2 directories and 1 file: 3 inode updates, 2 directory updates, 1 file update
    • Performance is unpredictable – if the buffer cache is full, writes can slow down a lot

• Multiple seeks for slice reads
    • Top-level directories are often cached
    • May require reading the second-level directory, directory inode, file inode, and data

• Poor disk utilization (at least 4% wasted)
    • Need to greatly over-provision inodes to prevent running out in any case (2%)
    • All files have to round up to the nearest multiple of 4K (>= 2%)

• Hard cap on max inodes after filesystem creation
• Deteriorating performance with a large number of files per directory
• Kernel bottleneck handling 50- or 100-drive systems

18

Presenter
Presentation Notes
Both reads and writes require multiple seeks in file storage. In the worst case, this can be 4 or more seeks. In the write case, up to 3 inodes, 2 directories and the actual file need to be written. If the filesystem buffer cache is full, this can hurt performance. Reads on average can require reading the directory entry, the directory inode, the file inode and then the file. Because we cannot predict the user IO pattern, we have to over-provision inodes so that we never run out of them. In our implementation we set it at 2% of the disk space. Even with this number of inodes, it is possible to run out of them when you have extremely small slices (sizes in bytes). Writes also waste space because they get rounded up to the nearest multiple of 4K. The inode table cannot be re-initialized or grown after it is first created, so once you run out of inodes, you can't write to the disk anymore. We've also seen that with a large number of files in a directory, performance deteriorates. While filesystems work fine on a small to medium number of files, they really break down when this number gets into the billions, and it's worse when a single kernel has to handle 50- or 100-drive systems.

HDD 101

• Data is stored in sectors and tracks
    • Tracks are concentric circles
    • Sectors are pie-shaped wedges on a track
    • A sector contains a fixed number of bytes – 4096

• Reading a random sector takes about 10 ms
    • 5 ms – position the head
    • 4 ms – wait until the data spins under it
    • 1 ms – read the data
    • 10 ms per read ⇒ about 100 seeks / sec

• Reading 1 byte or 1 MB takes similar time

19

Presenter
Presentation Notes
So I want to provide a quick, high-level overview of how a hard drive works. Hard drives have sectors and tracks. Tracks are these concentric circles – they are divided into sectors. Sectors contain a fixed number of bytes. While reading from a disk, 90% of the time is spent seeking (positioning the head, waiting for the platter to spin under the head). Reading 1 byte or 1 MB sequentially takes almost the same amount of time due to the seeking overhead.

Second Solution: Stuffing Slices Together

20

Presenter
Presentation Notes
In order to significantly reduce seeks, and also to cut down on overhead, our next approach was to use our own data structures on a filesystem. In practice this amounts to stuffing many thousands of slices together into a smaller number of files. As you can already see (in this picture), this is more organized and cleaner than the previous attempt at using the file system.

Packed Storage

Create a few large append-only files that each store tens of thousands of slices

Keep in memory pointers to find where to read a slice with a single seek

When bin files have too many holes, “compact” them

[Diagram: a bin file on disk, with a header and bins identified by bin offset and bin length, plus holes left by deleted slices; an in-memory pointer table maps each hash value to a bin file and an offset/length.]

21

Presenter
Presentation Notes
Packed storage creates a small number of large append-only files (which we call bin files). Each bin file stores tens to hundreds of thousands of slices. To decide which bin file to store a particular slice in, we hash the slice name to map it to a bin, just as a hash table maps keys to buckets. In memory pointers reference the file offset and the bin length. Therefore, a simple in memory table lookup tells us exactly where to seek to in order to read the slice.
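A minimal sketch of the in-memory lookup this describes: hash the slice name to pick a bin file, and keep a pointer (bin file, offset, length) so a read needs only one seek. Class and field names are illustrative assumptions, not the actual implementation.

```java
// Minimal sketch of Packed Storage's in-memory lookup, assuming one open
// RandomAccessFile per append-only bin file. Names are illustrative.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class PackedStoreSketch {
    /** In-memory pointer: which bin file a slice lives in, and where. */
    record SlicePointer(int binFileId, long offset, int length) {}

    private final Map<String, SlicePointer> pointers = new ConcurrentHashMap<>();
    private final RandomAccessFile[] binFiles;   // append-only bin files

    PackedStoreSketch(RandomAccessFile[] binFiles) { this.binFiles = binFiles; }

    /** Choose a bin file by hashing the slice name, like a hash table maps keys to buckets. */
    int binFor(String sliceName) {
        return Math.floorMod(sliceName.hashCode(), binFiles.length);
    }

    /** Read with a single seek: the table lookup tells us exactly where to go. */
    byte[] get(String sliceName) throws IOException {
        SlicePointer p = pointers.get(sliceName);
        if (p == null) return null;
        byte[] data = new byte[p.length()];
        RandomAccessFile bin = binFiles[p.binFileId()];
        synchronized (bin) {
            bin.seek(p.offset());
            bin.readFully(data);
        }
        return data;
    }
}
```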

Packed Storage Advantages

• Higher IOPS compared with the existing "file per slice" approach:
    • For writes – append operation to the end of existing files (it's fast!)
    • For reads – only one seek is required to retrieve a slice
    • For listing – specialized catalog to store all slice names

• More space efficient (no i-nodes / 4K blocks)

• Divorced from the underlying file system / operating system – runs in application space
    • We can use different filesystems, with more consistent performance

22

Presenter
Presentation Notes
Packed Storage overcame most of the problems of the file system. We're able to get significantly higher IOPS for writes because writes are appended to the end of existing files. We no longer need to create new files, lock directories, or seek to different positions as we update different file system structures every time we write a new slice. Reads also benefit significantly: in the worst case they require only a single seek. Packed Storage also frees up about 4% of the disk space in recovered overhead from the file system. It is more space efficient because it no longer needs to be 4K block-aligned nor reserve 2% of the disk for inodes. Finally, Packed Storage can really run on many different filesystems – it's essentially only using the filesystem to create a fixed number of files, and all user data gets appended to these files. It used to take a lot of tuning and testing to settle on EXT4 with the proper number of inodes; those things no longer matter to us.

Performance benefits of Packed Storage

[Chart: relative write IOPS per storage node over time (log scale), indexed to 1.0 in 2007. Series labels: Original File Storage, Packed Storage, 10 Gbps Storage Node. Data labels: 1.0, 1.5, 1.6, 3.5, 16.2, 20.2, 20.1.]

23

Presenter
Presentation Notes
So let's look at our performance over the years. Here we see the number of write IOPS a storage node could achieve, indexed to 1 for our performance in 2007. Note that this chart is on a log scale. While we nearly doubled our storage node performance going from 1 Gbps NICs to 10 Gbps NICs, we realized a whopping 4.7X improvement in performance going from File Storage to Packed Storage.

File Storage vs Packed Storage

24

Presenter
Presentation Notes
Here’s another look at File Storage and Packed Storage IOPS for both reads and writes across a range of file sizes. On the smaller end of file sizes is 1 to 100 KB, these could be excel sheets, word documents, emails, and so on. In the middle, the range is 1 to 5 MB, here you would have audio files or pictures. And on the larger end, 10 MB and beyond would be the video files, genomic data, etc. For larger objects, both file and packed storages achieve similar performance. This is because seek costs of file storage are small compared to the amount of data being written. However on the small to medium file range, Packed Storage is definitely superior, with about a 3 – 4X improvement for reads and writes.

Packed Storage: Lingering Limitations

• Multiple active open files on each disk
    • Can cause disk-head thrashing due to continuous seeking back and forth

• Does not work well on Shingled Magnetic Recording (SMR) drives
    • Slice file writes are sequential, but the file system does not guarantee this
    • Slice name catalog is not written sequentially

• Does not achieve the full entitlement of the drives
• Drive errors obscured by the file system

• Not possible to write a fully asynchronous storage stack
    • Certain POSIX calls are "blocking" – mkdir, rename, rm

• Caching performed by the OS
    • The OS will cache all written blocks, whether we want it to or not

25

Presenter
Presentation Notes
While Packed Storage is a great improvement over File Storage, there are still some shortcomings. Since we have about a hundred files per disk, there is a large number of open files actively being written to, and this causes a lot of seeking. Packed Storage also doesn't really work well on SMR drives: even though most writes are sequential, not all are, and SMR drives only support a limited number of active "write pointers". Additionally, the Linux kernel isn't able to keep 50 drives fully saturated for writes or reads when going through a file system. Another downside is that the OS and the file system hide a lot of internal details from us, details which might help us make more optimal storage decisions, such as when a flush completes or when a drive fails to read a sector. Using a filesystem, we also cannot build a fully async storage stack because certain POSIX calls are blocking. Lastly, the OS will do things we don't want it to do, such as hold recently read or written blocks in memory, even when the application knows certain things are not likely to be accessed again in the near future. An application-level cache can make more optimal caching decisions.

Third Solution: Dump the File System

26

Presenter
Presentation Notes
Due to the emergence of new drive technologies, we’ve finally decided that to build an optimal “object system” we were going to have to do so without a file system being anywhere in the storage stack.

Shingled Magnetic Recording (SMR) Drives

• All hard drive vendors have begun the transition to SMR drives
    • HDD vendors need to continue density growth to stay competitive

• The challenge for storage systems is to use SMR drives effectively
    • Sequential writes only
    • Normal usage can have a huge performance penalty

27

Presenter
Presentation Notes
Drive vendors like HGST (Hitachi Global Storage Technologies), Seagate, Western Digital, Toshiba etc. are all moving towards the SMR technology for hard drives. The challenge is to write only sequentially from the application for host-managed drives. The drive managed drives allow random writes but with a huge performance penalty. Host-managed drives have “zones”, or areas on the disk, that must be written to sequentially. According to a slide from HGST, the areal density growth for SMR drives is expected to be 25% per year vs. 15% per year on conventional drives, which means SMR drives will have 65% greater maximum capacity by 2020. In order to store data effectively on SMR drives, the storage systems need to be re-engineered to not have any non-sequential writes.

What is a Zone?

• A zone is an element of the SMR architecture, available for access in 'host managed' mode
• On SMR drives it is mapped directly to SMR zones
• Owns a range of the drive's logical blocks
• Has a fixed size (typically 256 MB)
• Only a limited number of zones can be writable at a given time
• A writable zone maintains a 4K-aligned write pointer (WP) which should survive power outages and unclean shutdowns
• Write operations target the WP of a writable zone and do not extend beyond that zone
• All operations must be 4K block-aligned

28

Presenter
Presentation Notes
So, what is a zone? A zone is an element of the SMR architecture, available for access in the "host managed" mode of the SMR drive. A zone comprises a range of the drive's logical blocks. It has a fixed size, typically 256 MB, but this may change in the future. One property of SMR drives we have to consider is that only a limited number of zones can be writeable at a given point in time. We also have to maintain a 4K-aligned write pointer for each writeable zone, which is basically the append point for all writes in that zone.
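A small sketch of the zone bookkeeping implied here: a fixed-size zone with a 4K-aligned write pointer that only moves forward and never lets a write cross the zone boundary. The constants and method names are illustrative assumptions, not the drive's actual interface.

```java
// Sketch of per-zone bookkeeping: fixed size, 4K-aligned forward-only write pointer.
// Constants and method names are illustrative, not the drive's actual API.
public final class ZoneSketch {
    static final long ZONE_SIZE = 256L * 1024 * 1024;  // typical zone size from the slide
    static final int  BLOCK     = 4096;                 // all operations are 4K block-aligned

    private final long zoneStartByte;  // byte offset of this zone on the raw disk
    private long writePointer = 0;     // next append position within the zone, 4K-aligned

    ZoneSketch(long zoneStartByte) { this.zoneStartByte = zoneStartByte; }

    /** Reserve room for an append: returns the absolute disk offset, or -1 if the zone is full. */
    synchronized long reserve(int payloadLength) {
        long padded = ((payloadLength + BLOCK - 1L) / BLOCK) * BLOCK;  // round up to 4K
        if (writePointer + padded > ZONE_SIZE) return -1;              // writes never cross a zone
        long offset = zoneStartByte + writePointer;
        writePointer += padded;   // sequential only: the WP only moves forward
        return offset;
    }
}
```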

Zone Storage

The Problem: no existing file systems work great on SMR; current approaches include:

• Use drive-managed drives (using a media cache)
• SMR_FS-EXT4 – a new "beta" filesystem for host-managed drives by Seagate, but only good for "archive only" workloads

We don't need a file system, we need an "object system"

The Solution:

• Use direct block-level access to store directly to disk
• Always append, never overwrite
• Maintain the single-seek reads of Packed Storage
• Employ application-aware "smarter" read and write caches
• Implement a fully asynchronous storage stack
    • Reduces the number of threads required to keep all disks saturated
• Fully crash-resilient design

29

Presenter
Presentation Notes
So what are we trying to solve here? The problem is that no existing filesystems work great with SMR drives – they are not zone-aware, they don't have knowledge of write pointers, and they don't guarantee append-only writes. Current approaches for utilizing SMR drives are to either get a drive-managed version of the drive or use a new "beta" filesystem which is made for "archive only" workloads. We could wait until a filesystem becomes ready for SMR drives, but it still won't do exactly what our application requires. Besides, we don't really need a file system – we need an "object system". The solution is to use block-level access to the raw disk: have an append-only design and never overwrite; reads should still only require a single seek, just like in Packed Storage; we should have custom, application-aware read/write caches; and it would be ideal to have a fully async storage stack and a fully crash-resilient design. The crash resilience is for consistency, but we may still lose recent writes; in our system, however, we would require a threshold of nodes to have simultaneous power loss to actually lose recent writes.

Building an Ideal “Object System”

30

Presenter
Presentation Notes
Let's talk about how to actually build this storage back end.

Things We Take for Granted

• Mapping file names and offsets to disk blocks
• Crash recovery
• Fast listing, as it appeared in Packed Storage
    • Existing libraries (like LevelDB) won't work for the catalog
• Maintain usage statistics
    • Tools such as df/du to determine free space
• Debugging and consistency checking tools (ls, lsof, find, hexdump, cat, grep, fsck)
• Efficient caches

31

Presenter
Presentation Notes
One of the things that comes easy with a filesystem is actually storing data. Without a filesystem, we have to figure out the data format – i.e. how the slices written to different offsets will be stored and mapped on a raw disk. The storage mechanism must be able to recover consistency after a crash or a power outage, and do so quickly. We also need to continue to support fast, ordered listing of slice names as in Packed Storage. On packed storage we could use libraries like LevelDB to implement a fast name index, but this won’t work without a file system. We also are now completely responsible for tracking our own usage statistics, much like a filesystem does, and also implement tools for viewing the usage. File systems also come with a lot of tools for debugging and introspection, which we’ll have to implement on our own. Lastly, we also need efficient caching for good performance.

Division of "Zone" Labor

We have defined four uses of zones in our design:

• The "Bootstrap Zone"
    • Like a file system's partition table
    • Contains disk info, zone states and logs

• Bin zones
    • Like a file system's allocated blocks: hold the actual content
    • Store the slice data within our defined data structures, like Packed Storage's bin files

• Journal zones
    • Like a file system's journal: used to replay events after a crash
    • Store structures that must be loaded into RAM after a restart

• Catalog zones
    • Like a file system's directories: keep a list of what is there for efficient ordered listing
    • Hold slice names and sizes for every slice on the drive

32

Presenter
Presentation Notes
In our design, we use zones for one of 4 possible purposes. The Bootstrap zone is the first zone on the disk – it persists permanent disk information, states of all zones on the disk and a journal allocation log. It is used to load all the zone state information and load the journal data in memory at startup. This is the only zone that has a fixed position. All other zones are in the dynamic pool of all zones and are allocated as they’re needed. The Bin zones are where the actual slice data resides. This takes up most of the disk. Journal zones are for the journal, which stores pointers, usage information and arbitrary system metadata, in log format to support crash recovery. Catalog zones contain the slice name index for listing
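Purely as an illustration, the four zone roles (plus the free pool the notes mention) could be modeled as a simple enum; the names below are assumptions, not the product's actual identifiers.

```java
// Purely illustrative: one way to model the zone roles described above.
public enum ZoneRole {
    BOOTSTRAP,  // fixed first zone: disk info, zone states, journal allocation log
    BIN,        // holds slice data, like Packed Storage's bin files
    JOURNAL,    // append-only log replayed at startup for crash recovery
    CATALOG,    // ordered slice-name index used for listing
    FREE        // unallocated; can be repurposed for any of the roles above
}
```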

Bin Zones

• Similar to Packed Storage, where each zone is a bin file, but:
    • Zones are smaller (256MB rather than tens of GB)
    • We only write to one zone at a time
    • We compact zones to the zone with the current write pointer

• Compaction
    1. Zones are sorted by highest compressible space
    2. Content is copied from the zone with the most free space to the active zone
    3. The old zone is freed and can be reused for any purpose

33

Presenter
Presentation Notes
The bin zones contain the actual data slices received from the client. The bin zones are similar to the Packed Storage Bin files, with a few differences. Zones are much smaller – just 256 MB – than the packed storage bin files, which could be 30 GB or bigger. We only write to one bin zone at a time so as to not require additional seeking for writes. Compaction occurs by reading from old zones and writing to the currently active bin zone. During compaction, basically we find the bin zones with the most free space, and copy the content from them onto the current zone. When everything has been copied out of the zone, it is put back into circulation and can be reused, perhaps as another bin zone, or maybe for a journal or catalog zone.
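A hedged sketch of that compaction loop: pick the bin zone with the most reclaimable space, copy its live slices to the zone holding the current write pointer, then return the emptied zone to the pool. All interfaces and names here are illustrative assumptions.

```java
// Hedged sketch of the compaction step described in the notes. All types here
// are illustrative stand-ins for the real zone, pointer, and pool machinery.
import java.util.Comparator;
import java.util.List;

final class CompactionSketch {
    interface BinZone {
        long freeSpace();              // bytes occupied by holes (reclaimable space)
        List<String> liveSlices();     // names of slices still referenced
        byte[] read(String sliceName);
    }
    interface ActiveZone {
        void append(String sliceName, byte[] data);   // also updates in-memory pointers
    }
    interface ZonePool {
        void release(BinZone zone);    // zone returns to the free pool for any future role
    }

    static void compactOne(List<BinZone> binZones, ActiveZone active, ZonePool pool) {
        // 1. Choose the zone with the most reclaimable space.
        BinZone victim = binZones.stream()
                .max(Comparator.comparingLong(BinZone::freeSpace))
                .orElseThrow();
        // 2. Copy its live content onto the zone with the current write pointer.
        for (String name : victim.liveSlices()) {
            active.append(name, victim.read(name));
        }
        // 3. Free the old zone so it can be reused for any purpose.
        pool.release(victim);
    }
}
```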

Journal Zones

Many components need to persist internal system data across restarts. The journal provides:

• A common persistence mechanism for all metadata
    • Stored as a series of appended journal entries
    • Checksummed against invalid information

• Fault tolerance (unclean shutdowns)
    • At startup the entire journal is replayed to load appropriate and necessary data into memory

34

Presenter
Presentation Notes
Many components within our application need to persist system data across restarts. The journal provides a common mechanism to do so. Each journal entry is checksummed against invalid information. It’s designed to be fault tolerant such that we can quickly return to a consistent state after a crash. At startup the entire journal is replayed to load data into memory. The journal is currently used for storing bin pointers and bin usage as well as system configuration information.
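A minimal sketch of an append-only, checksummed journal entry and the startup replay loop described above. The [length][checksum][payload] layout and the choice of CRC32 are assumptions for illustration, not the actual on-disk format.

```java
// Sketch of a checksummed, append-only journal entry and its replay at startup.
// The record layout is an assumption; it is not the actual on-disk format.
import java.nio.ByteBuffer;
import java.util.function.Consumer;
import java.util.zip.CRC32;

final class JournalSketch {
    /** Serialize one entry as [length][crc32][payload]; the writer handles 4K padding. */
    static byte[] encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        buf.putInt(payload.length).putInt((int) crc.getValue()).put(payload);
        return buf.array();
    }

    /** Replay entries until the data runs out or a checksum mismatch marks the torn tail. */
    static void replay(ByteBuffer journal, Consumer<byte[]> apply) {
        while (journal.remaining() >= 8) {
            int length = journal.getInt();
            int storedCrc = journal.getInt();
            if (length < 0 || length > journal.remaining()) break;   // torn or empty tail
            byte[] payload = new byte[length];
            journal.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload);
            if ((int) crc.getValue() != storedCrc) break;            // invalid entry: stop replay
            apply.accept(payload);                                    // load state into memory
        }
    }
}
```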

Slice Catalog

Problem: We need an efficient mechanism to support listing slices without reading all bin zones, with append-only support

Solution: Modify a lightweight key-value store
• Make it write to zones rather than files
• Index all stored slices

35

Presenter
Presentation Notes
In order to satisfy our fast listing requirement, we need to store a name index, which would just be an ordered set of slice names and slice sizes. LevelDB accomplishes this task very efficiently, but it only works on top of a filesystem. We have modified this library so it can write directly to one or more of a raw disk’s zones.
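The notes say a LevelDB-style key-value library was modified to write to zones rather than files. The sketch below shows the kind of append-only "file over zones" shim that implies; every type and method name is an illustrative assumption, not LevelDB's real API.

```java
// Illustrative "file over zones" shim for a catalog library that only appends.
// These interfaces are assumptions, not LevelDB's actual abstractions.
import java.io.IOException;

interface ZoneAllocator {
    long allocateZone() throws IOException;            // returns the zone's start offset on disk
}

interface RawZoneDevice {
    void appendAt(long diskOffset, byte[] block4kAligned) throws IOException;
}

/** Append-only "file" that the catalog library writes its sorted tables into. */
final class ZoneBackedAppendFile {
    private final RawZoneDevice device;
    private final long zoneStart;
    private long writePointer = 0;

    ZoneBackedAppendFile(RawZoneDevice device, ZoneAllocator zones) throws IOException {
        this.device = device;
        this.zoneStart = zones.allocateZone();
    }

    /** The catalog only ever appends; data is assumed to be 4K-padded already. */
    void append(byte[] block4kAligned) throws IOException {
        device.appendAt(zoneStart + writePointer, block4kAligned);
        writePointer += block4kAligned.length;
    }

    long size() { return writePointer; }
}
```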

Replacing the FS caches

• Read cache
    • Intelligent cache
    • Frequency-based eviction policy

• Write cache
    • Batches writes to reduce disk operations
    • Persists based on memory pressure or deadline
    • In-memory pointers updated after flush

36

Presenter
Presentation Notes
We must also implement our own caching for slices. For the read cache, we are using a 3rd party library called Caffeine, which provides a high performance read cache. It is based on TinyLFU, which uses a frequency based eviction policy. The write cache batches writes together to reduce disk operations. This means we can now write 10s or 100s of small objects without touching the disk and they’ll all get written in a single sequential-write disk operation, without ever seeking. Persistence is based on deadline and memory pressure, which can be configured on a per-disk basis. Pointers are only updated after the flush to disk completes.
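A sketch of what these application-level caches could look like in Java: a Caffeine read cache (Caffeine's eviction is frequency-based, via TinyLFU) plus a simple write buffer that batches slices until a size threshold and flushes them as one sequential append. The write-buffer class, thresholds, and flush hook are illustrative assumptions; the deadline-based flush mentioned in the notes is omitted for brevity.

```java
// Sketch of application-level caches: Caffeine read cache + batching write buffer.
// Thresholds, class names, and the flush hook are illustrative assumptions.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.LinkedHashMap;
import java.util.Map;

final class SliceCaches {
    // Read cache: Caffeine evicts by weight using its frequency-based (TinyLFU) policy.
    final Cache<String, byte[]> readCache = Caffeine.newBuilder()
            .maximumWeight(256L * 1024 * 1024)                 // illustrative memory budget
            .weigher((String name, byte[] data) -> data.length)
            .build();

    // Write cache: accumulate slices, then flush them as one sequential append.
    private final Map<String, byte[]> pending = new LinkedHashMap<>();
    private long pendingBytes = 0;
    private static final long FLUSH_BYTES = 8L * 1024 * 1024;  // illustrative threshold

    synchronized void write(String name, byte[] data, Runnable flushToActiveZone) {
        pending.put(name, data);
        pendingBytes += data.length;
        readCache.put(name, data);            // serve re-reads from memory
        if (pendingBytes >= FLUSH_BYTES) {
            flushToActiveZone.run();          // one sequential write carries many slices
            // in-memory pointers would be updated only after this flush completes
            pending.clear();
            pendingBytes = 0;
        }
    }
}
```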

Zone Storage Advantages

• Single active write pointer (per disk) for 99% of writes
    • Large sequential writes to disk
    • Remaining 1% are still sequential but require seeking

• Works well on all disk types (not just SMRs!)
    • 100% append-only writes
    • Random reads with a single seek

• Application-aware caches
    • Writes are batched, close to 0 disk ops per write
    • Smarter read cache

• Gets us closer to the full entitlement of the drives
• Clean handling of error and unclean shutdown scenarios
• Fully asynchronous storage stack
    • Storage application only needs a few threads

37

Presenter
Presentation Notes
To summarize the advantages of this zone storage implementation, a single write pointer for most of the writes will ensure sequential writes without seeking for almost anything, which is better for all kinds of HDDs. Application aware caches are better than generic filesystem caches. We can get much closer to the full hard drive entitlement on a 50 or 100 drive storage node which is hard to do with so many filesystems running over a single kernel. When we control the entire stack without the filesystem, all operations can be fully async, all the way down to the disk layer, which allows us to be more efficient and use fewer threads in the system.

Zone Storage Disadvantages

• It is a complex, low-level implementation
    • Difficult, error-prone, and time-consuming to implement

• Required modifications to our Storage API to make good use of memory in the storage node

38

Presenter
Presentation Notes
The obvious disadvantage of replacing a filesystem is that you have to build your own low-level implementation. This is difficult, error-prone, and takes time. For example, we had to invent our own compaction algorithms. Additionally, we had to make some changes to the storage node protocol that our access layer uses in order to make efficient use of memory. But if you have the resources to throw at this problem, we really believe this is the solution going forward for achieving superior performance on SMR drives.

Zone Storage Write Performance

[Chart: relative write IOPS per storage node over time (log scale), indexed to 1.0 in 2007. Series labels: Original File Storage, Packed Storage, Zone Storage, 10 Gbps Storage Node. Data labels: 1.0, 1.5, 1.6, 3.5, 16.2, 20.2, 20.1, 94.7.]

39

Presenter
Presentation Notes
We are projecting to achieve the full entitlement of the disk drive and realize significant IOPS gains when we move to Zone Storage. There are a lot of factors that play into this; all told, we should realize nearly a 100-fold improvement in our storage IOPS, primarily through improved software. These factors include, but are not limited to: almost 0 disk seeks on average for disk writes, because writes are batched and we keep a single write pointer; application-aware caches; and, most importantly, removing the kernel bottleneck of supporting filesystems over a large number of disks.

Conclusions

Why we're cutting the cord:
• Existing filesystems don't work well on SMRs (especially host-managed)
• The Linux kernel can't get the full HDD performance entitlement on 50- or 100-drive storage nodes with filesystems
• Filesystems are not designed for scalable object storage nodes
• No need for a directory structure for a non-hierarchical (flat) namespace
• It's possible to make an efficient "object storage" solution on a raw disk
• The solution is specific to media that prefers sequential writes

If the storage media economy changes, we will likely have to make radical changes to our software stack
• When NVRAM, 3D XPoint, or something else replaces HDDs in the future

40

Presenter
Presentation Notes
So what have we learned? We can skip the filesystem for object storage, Existing filesystems don’t really work for SMR, especially the host-managed ones as they require sequential writes. We can’t get full HDD performance entitlement on 50 or 100 drive storage nodes with the use of any known filesystem. Filesystems aren’t built for large scale object storage nodes. We don’t need a directory structure on a flat namespace. And we can make an efficient “object storage” solution over the raw disk. But to recap, the zone storage solution is specific to hard drives or media that prefers sequential writes. This could change in the future, however, if the storage media economy changes. New non-volatile memory companies are coming out with high density technologies all the time and if the economy changes or these new technologies become cheaper than hard drives, we’ll be making radical changes to our software stack.

Questions

My head hurts

41

Presenter
Presentation Notes
At this point I’ll open up for any questions from the audience.