
PrismDB: Read-aware Log-structured Merge Trees for Heterogeneous Storage

Ashwini Raina, Princeton University
Asaf Cidon, Columbia University
Kyle Jamieson, Princeton University
Michael J. Freedman, Princeton University

Abstract

In recent years, emerging hardware storage technologies have focused on divergent goals: better performance or lower cost-per-bit of storage. Correspondingly, data systems that employ these new technologies are optimized either to be fast (but expensive) or cheap (but slow). We take a different approach: by combining multiple tiers of fast and low-cost storage technologies within the same system, we can achieve a Pareto-efficient balance between performance and cost-per-bit.

This paper presents the design and implementation of PrismDB, a novel log-structured merge tree based key-value store that exploits a full spectrum of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We introduce the notion of "read-awareness" to log-structured merge trees, which allows hot objects to be pinned to faster storage, achieving better tiering and hot-cold separation of objects. Compared to the standard use of RocksDB on flash in datacenters today, PrismDB's average throughput on heterogeneous storage is 2.3× faster and its tail latency is more than an order of magnitude better, using hardware that is half the cost.

1 Introduction

Several new NVMe storage technologies have recently emerged, expressing the competing goals of improving performance and reducing storage costs. On one side, high performance non-volatile memory (NVM) technologies, such as Optane SSD [4, 63] and Z-NAND [8], provide single-digit µs latencies. On the other end of the spectrum, cheap and dense storage such as QLC NAND [41, 43] enables applications to store vast amounts of data on flash at a low cost-per-bit. Yet with this lower cost, QLC has a higher latency and is significantly less reliable than less dense flash technology.

Table 1 compares the wide range of cost and performance across three representative storage technologies, showing the tradeoffs between their reliability (P/E cycle lifetime), normalized cost, and read and write latency.

                          NVM       TLC       QLC
Lifetime (P/E cycles)     18,000    540       200
Cost ($/GB)               $1.3      $0.4      $0.1
Avg Read Latency (4KB)    26µs      195µs     391µs
Avg Write Latency (64MB)  121µs     216µs     456µs

Table 1: Comparing lifetime, cost, and latency between Optane SSD (NVM), and two generations of flash (TLC and QLC NAND, which have three and four bits per cell, respectively). The cost is based on quotes from July 25, 2019 on amazon.com (Intel's Optane SSD 900P, 750 Series and 760P), and the lifetime is based on publicly available information [21, 30, 61]. Latency is computed using Fio [3].

In short, each of these storage technologies introduces its own set of benefits, trade-offs, and limitations related to performance, cost, and endurance. In Table 1, for example, we observe that there is a roughly 15× performance difference between Optane SSD (NVM) and QLC on random reads, and sequential reads show a similar trend (not shown). Yet this performance comes at a steep cost: Optane SSD costs nearly 10× per GB compared to QLC. Endurance also varies widely: dense flash technologies such as QLC NAND can sustain a relatively small number of writes before exhibiting errors [42].

It is hard for system architects to reason about these unique performance, cost, and endurance characteristics when developing software data systems, and many studies have shown that simply running existing software systems on new hardware storage technologies often leads to poor results [18, 23, 24, 37]. Therefore, significant recent effort has sought to build new databases, file systems, and other software storage systems that are architected specifically for these new technologies [18, 23, 24, 37, 52, 55].

However, we argue that these new architectures do not go far enough in rethinking the use of emerging hardware technology, as they continue to take the legacy perspective of the storage substrate as a homogeneous and monolithic layer. They choose one point in the design space: fast but expensive [18, 24, 37] (e.g., using Intel SSD or Z-NAND), or cheap but slower [23, 52, 55] (e.g., using dense flash).


Conversely, this paper explores how a key-value store can simultaneously leverage multiple new storage technologies to realize more optimal trade-offs between performance, endurance, and cost. In particular, we investigate combining heterogeneous storage technologies within a Log-Structured Merge Tree [48] (LSM), a widely-used data structure that powers many modern flash-based databases and key-value stores, including Google's BigTable [16] and LevelDB, Apache Cassandra [36], Facebook's RocksDB [22] and MySQL storage backend [40], and MongoDB [44].

At a high level, LSM trees maintain high write rates by buffering multiple updates in memory, then sorting the updates' keys before writing the new keys and values as a block to disk, into the first level of the LSM tree. Multiple blocks from one level are then merged into lower levels, ensuring that blocks at each lower level are sorted and disjoint. As these sorted blocks can be written as large sequential writes, they are thought to be a good fit for flash-based storage, which requires large contiguous writes for maintaining performance and endurance.

Existing LSM tree-based databases assume all levels are stored on a homogeneous storage medium. However, the access patterns to these levels are not homogeneous. Objects stored in the higher levels of the LSM are read and updated much more often than objects stored at the bottom layers. Therefore, the upper levels of the LSM tree (e.g., L0, L1, L2) would benefit from using a high performance, high endurance (and more expensive) storage medium such as Optane SSD. On the other hand, the lower levels (e.g., L3, L4, or the last couple of layers of the LSM tree), which store 90% or more of the data, can use a much cheaper form of storage such as QLC NAND. In addition, since they are updated much less frequently, these lower levels can meet the lower endurance requirements of cheaper flash storage (i.e., fewer P/E cycles). To summarize, this heterogeneous LSM tree design would allow an LSM tree to enjoy the performance benefit of using fast storage (e.g., Optane) to speed up frequently read objects from high levels, while maintaining a low cost-per-bit, since over 90% of data would be stored in the lower, less frequently accessed levels on cheaper storage (e.g., QLC).

Yet, we have found that this observation cannot be naively applied to LSM trees. We show that a straightforward heterogeneous implementation, where different LSM levels are mapped to different storage technologies, performs only marginally better than an LSM tree fully mapped on the slowest storage. We make the observation that LSM trees by default are "write-aware", i.e., the key layout in the tree is dictated by the order of the writes. This fundamentally restricts the ability of LSM implementations to fully exploit the performance benefits of heterogeneous storage.

In this paper, we introduce the notion of "read-awareness" to LSM trees. In read-aware LSM trees, the key layout within the tree is influenced by both the write order as well as the read order of keys. Existing LSM implementations always compact all the keys down from upper to lower levels. We propose a new compaction algorithm called pinned compaction, in which keys that are read more often are retained in the same level. Unlike traditional compaction, our compaction algorithm also allows for keys to rise up the tree levels towards faster storage, while maintaining consistency.

We present the design and implementation of PrismDB, a key-value store built on top of RocksDB that implements read-awareness via pinned compactions. In order to add read awareness to the LSM tree, PrismDB needs to decide which objects to pin during compaction time. To this end, it uses a lightweight object popularity mechanism based on the clock algorithm. In order to convert the clock value to a pinning policy, PrismDB dynamically estimates the distribution of clock values across different keys using a lightweight concurrent data structure, and uses that distribution to determine which objects to pin during compaction.

PrismDB's pinned compaction also reduces the total number of compactions it performs compared to traditional LSMs, leading to lower write amplification, which in turn translates to higher performance. This also better preserves the limited write endurance of its lowest-layer storage (the limited P/E cycles supported by QLC). By storing more frequently-accessed keys in higher levels of faster storage media, PrismDB achieves both lower read latency and higher throughput, even while fully exploiting the standard in-memory caching of LSM objects (e.g., via Memtables, OS page caches, and block caches; see Fig. 1).

Our evaluation shows PrismDB significantly improves the throughput and tail latency of point-query workloads (i.e., for specific keys). We also highlight cases where PrismDB provides no benefit: read-only workloads that do not trigger compactions and highly-skewed workloads that can be cached entirely in DRAM. In addition, PrismDB provides a smaller benefit for scan-heavy workloads.

Our paper makes the following contributions:

1. We present the first holistic evaluation of LSM trees on heterogeneous storage that takes cost, performance, as well as endurance into account.

2. We identify limitations of deploying existing LSM tree-based key-value stores on heterogeneous storage.

3. We propose a new LSM tree variant, the read-aware LSM, and a new compaction algorithm that unlocks the full benefit of heterogeneous storage.

4. We compare our system PrismDB to Mutant, a prior key-value store for heterogeneous storage, and to RocksDB running on heterogeneous storage, and show that PrismDB achieves up to 5.3× and 3.6× higher throughput (respectively), reduces read p99 latency by 33× and 27×, and reduces update p99 latency by 29× and 25×.


Figure 1: Elements of a Log Structured Merge Tree.

2 Background

We provide a brief background on new storage technologies, as well as on log-structured merge (LSM) trees.

2.1 Trends in Storage

In recent years, NVMe storage devices have evolved in two orthogonal directions: faster (and more expensive) non-volatile memory and cheaper (and slower) dense flash. New fast and persistent memory technologies, such as 3D XPoint [4, 43] and Z-NAND [8], which we refer to collectively throughout the paper as Non-Volatile Memory (NVM), provide low random read and write latencies of 10s of µs or less.

On the other end of the spectrum, flash technology has become ever more dense and cheap. Flash manufacturers have been able to pack more bits in each device, both by stacking cells vertically (3D flash), and by packing more bits per memory cell. However, making devices denser also causes their latency to increase and makes them less reliable [28, 32, 43, 47, 52, 61]. The latest QLC technology, which packs 4 bits per memory cell, can only tolerate 100–200 write cycles before it becomes unusable [47, 53, 61]. For this reason, the main current use case for QLC is for applications that issue a small number of writes [52, 54]. Future dense flash technologies, such as the recently announced PLC (5 bits per cell), will only exacerbate this trade-off [32].

2.2 Log-Structured Merge (LSM) Trees

In LSMs (Figure 1), data is written first into DRAM, where it is stored in Memtables, which are in-memory data structures based on Skiplists [7]. Once a Memtable becomes full, its data is written into a Sorted String Table (SST). An SST is a file on disk that contains sorted variable-sized key-value pairs that are partitioned into a sequence of data blocks. In addition to data blocks, the SST stores meta blocks, which contain Bloom filters and the metadata of the file (e.g., data size, index size, number of entries).

SST files are stored in flash in multiple levels, called L0, L1, and so forth. L0 contains files that were recently flushed from the Memtable. After L0, each level is comprised of SSTs that are disjoint (in keyspace) from other SSTs on the same level, and the LSM maintains a sort order over each level's SSTs. To find a key, a binary search is first done over the start key of all SST files to identify which file contains the key. Only then, an additional binary search is done inside the file (using its index block) to locate the exact position of the key.
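To make the lookup path concrete, below is a minimal C++ sketch of the per-level search just described. The types and names are illustrative stand-ins rather than RocksDB's API, and the entries are held in memory for simplicity; a real implementation consults per-SST Bloom filters first and reads index and data blocks from disk.

```cpp
#include <algorithm>
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for an SST file: a sorted run of key-value pairs.
struct SST {
  std::string smallest;  // first key in the file
  std::vector<std::pair<std::string, std::string>> entries;  // sorted by key
};

// One level below L0: SSTs have disjoint key ranges and are sorted by range.
using Level = std::vector<SST>;

// Locate a key within one level: a binary search over SST start keys to find
// the candidate file, then a binary search inside that file.
std::optional<std::string> LookupInLevel(const Level& level,
                                         const std::string& key) {
  // Find the last SST whose smallest key is <= key.
  auto it = std::upper_bound(
      level.begin(), level.end(), key,
      [](const std::string& k, const SST& s) { return k < s.smallest; });
  if (it == level.begin()) return std::nullopt;  // key precedes every file
  const SST& sst = *std::prev(it);

  // Binary search inside the candidate SST; a real system would use the
  // file's index block to find the data block and only then read it.
  auto e = std::lower_bound(
      sst.entries.begin(), sst.entries.end(), key,
      [](const std::pair<std::string, std::string>& kv, const std::string& k) {
        return kv.first < k;
      });
  if (e != sst.entries.end() && e->first == key) return e->second;
  return std::nullopt;
}
```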

Each level also has a target size that specifies the volume of data that should be stored in the level, typically with an exponentially increasing capacity (e.g., in Figure 1, the level size multiplier is 10). Once a level reaches its target size, a compaction is triggered. During compaction, at least one SST file is picked and merged with its overlapping key range in the next level.
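A minimal sketch of this size-based trigger, under the assumption of a fixed level multiplier k; the helper names are hypothetical and L0's file-count trigger is elided.

```cpp
#include <cstdint>
#include <vector>

// Per-level target sizes grow geometrically with multiplier k (commonly 10).
// L0 is usually triggered by file count rather than size, so it is skipped.
std::vector<uint64_t> LevelTargets(uint64_t l1_target_bytes, double k,
                                   int num_levels) {
  std::vector<uint64_t> targets(num_levels, 0);
  double t = static_cast<double>(l1_target_bytes);
  for (int i = 1; i < num_levels; ++i) {
    targets[i] = static_cast<uint64_t>(t);
    t *= k;
  }
  return targets;
}

// A level becomes a compaction candidate once its actual size exceeds its
// target; compaction then merges one of its SSTs into the next level.
int PickLevelToCompact(const std::vector<uint64_t>& actual,
                       const std::vector<uint64_t>& target) {
  for (int i = 1; i + 1 < static_cast<int>(actual.size()); ++i) {
    if (actual[i] > target[i]) return i;
  }
  return -1;  // nothing to compact
}
```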

The motivation for the original LSM Tree [48] was to design an indexing data structure for higher write throughput, targeting storage devices that require large contiguous writes to exhibit good performance (initially magnetic disks, to avoid seeks, and later flash, where large writes avoid write amplification). An LSM tree inherently trades off read performance for more efficient writes. Due to its relatively efficient handling of writes, the core structure of the LSM tree has stayed largely unchanged over the past 25 years, even as workloads have become more read-heavy, which has led to the introduction of auxiliary data structures such as SST fence pointers, Bloom filters, and read caches to improve read performance [5, 7, 22].

In short, the order of write operations dictates the organization of keys at the tree levels, and so we refer to LSM trees as write aware. In particular, read operations do not change the key layout in the LSM tree, i.e., an LSM tree does not change the placement of particular objects based on the frequency that they are read. As we shall see, however, this distinction becomes crucial for achieving better read performance when employing LSMs over heterogeneous storage.

Read Optimizations in LSM Trees. As mentioned, LSM trees have introduced several auxiliary data structures to improve read performance. Recall that a key can be located in any level in the LSM tree (and in fact, in multiple levels simultaneously in the case of updates or deletions). To avoid reading a potential on-disk SSTable per level, LSM tree implementations such as RocksDB [7] employ in-memory Bloom filters [1] per SST to check whether the key possibly exists in the SST before reading it from disk.

LSM trees additionally cache SST blocks in memory for accelerating reads. The block cache caches entire SST file blocks using the least-recently-used (LRU) algorithm. In addition, LSM trees may rely on the operating system's page cache, which also caches at the block level. These block and page caches operate at the block level because storage is typically block addressable and reads happen at a page granularity (commonly 4 KB). Since key-value objects can be as small as tens of bytes [15], caching an individual object at block granularity would mean wasting valuable cache space and reduced I/O utilization. This mismatch between the granularity of caching (4-16 KB blocks) and object sizes (10s-100s of bytes) affects cache performance in LSMs, which we analyze in §3.3.

Figure 2: RocksDB throughput utilizing homogeneous and heterogeneous storage (Fig. 2a). Throughput is measured under the YCSB-B workload: 95% reads-5% updates, 200 million key dataset, 24 clients, 2 billion requests. The heterogeneous storage configuration (Fig. 2b) uses Optane for L0-L2 (2.4GB), TLC for L3 (20GB), and QLC for L4 (200GB).

Figure 3: Distribution of writes and reads across LSM levels (L0-L4), memtable (mt) and block cache (bc) under the YCSB-B workload: 95% reads-5% updates, 200 million key dataset, 24 clients, 2 billion requests.

3 LSM Performance on Emerging Storage

This section presents an evaluation of LSM tree performance on both homogeneous and heterogeneous storage, and highlights the shortcomings of the existing LSM tree design when attempting to utilize heterogeneous storage. Throughout the paper, we use RocksDB, an open-source key-value store built by Facebook [15], as our baseline LSM tree implementation.

3.1 Homogeneous Storage

We first evaluate RocksDB on a homogeneous, single-disk setup using the YCSB benchmark. We compare its performance on three homogeneous configurations: Optane, TLC, and QLC. Figure 2a shows the throughput and read latency of running RocksDB on these varying technologies. RocksDB running on Optane has roughly 3.5× and 5.9× higher throughput compared to TLC and QLC, respectively.

We make the observation that while an LSM tree is a single logical data structure, each one of its levels has very different performance and endurance requirements, as depicted in Figure 3. Objects stored in the upper levels of the LSM are read and updated more often than objects stored at the bottom layers. Therefore, the higher levels would benefit from using a high performance and high endurance (and more expensive) storage medium such as NVM. The lower levels, on the other hand, which store 90% or more of the data, can use a much cheaper form of storage such as QLC NAND. In addition, since they are updated much less frequently, they may meet the endurance requirements of cheaper flash storage. On the other hand, QLC has much slower read latency, so query latency can suffer if a high fraction of reads are served from L4 (as in Figure 3b).

Figure 4: Simulation of the trade-off between average read latency and storage cost when storing data in an LSM tree, with a minimum storage lifetime constraint of 3 years. The percentage of reads and writes to each level is based on RocksDB production data [22] with a total database size of 223 GB. The hardware configuration is represented by a five-tuple, corresponding to each level of the LSM tree, where N is Optane SSD (NVM), T is TLC, and Q is QLC. Blue represents homogeneous configurations, and the red configuration is used as the default one in this paper.

3.2 Heterogeneous Storage

We next examine how well the LSM tree performs on a heterogeneous setup. The LSM tree running on heterogeneous storage is referred to as LSM-het for the remainder of the paper. We use a 5-level LSM tree, where each level can be mapped to a different storage tier. In Figure 4, we simulate the cost vs. performance trade-off of different configurations, by simulating the read latency and cost of each level when placed on different storage technologies. Note that the read latency is computed by taking the average read latency of each storage device measured by Fio, and does not consider tail latencies or any database software overhead (e.g., due to lock contention or bloom filter false positives). The simulator is intended to provide a qualitative comparison of the trade-off between different heterogeneous configurations.

Figure 5: Heat map showing the percentage of reads served by different levels for the 100,000 most popular keys, for RocksDB (Fig. 5a) and PrismDB (Fig. 5b).

In order to simulate endurance constraints, we assume the storage devices need to last for at least 3 years, which is a typical lifetime for storage devices. Therefore, if a particular storage technology would be written to too heavily to last 3 years, we add additional spare storage capacity to that technology until we reach enough storage to achieve the 3 year limit. This method follows the same principle used by enterprise flash devices, which are provisioned with spare capacity to achieve higher endurance for write-heavy workloads.
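Assuming (as a simplification) that a device's total write endurance scales linearly with its provisioned capacity, this provisioning rule can be written as:

$$C \cdot E \;\ge\; W \cdot T \quad\Longrightarrow\quad C_{\min} = \frac{W \cdot T}{E}, \qquad \text{spare capacity} = \max\bigl(0,\; C_{\min} - C_{\text{data}}\bigr)$$

where C is the provisioned capacity, E is the technology's P/E-cycle limit, W is the sustained write rate to that level (bytes per year, including write amplification), T = 3 years is the lifetime target, and C_data is the capacity required by the data alone.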

The ideal points on the Pareto frontier of the curve, as expected, are those where upper levels use the same or a faster storage technology than the lower levels. In the paper, we examine one of the points on the Pareto frontier, NNNTQ (depicted in Figure 2b), which maps LSM levels L0-L2 to NVM, L3 to TLC, and L4 to QLC, which contains the key space of the entire database.

We modified RocksDB to map different levels onto different storage types, and again evaluate its performance with the YCSB [19] benchmark. We observed that even though LSM-het has faster storage components, it performs only marginally better than an LSM that is fully mapped to the slowest storage, QLC (Figure 2a), and it has about 5.8× worse throughput compared to the LSM running fully on Optane. In short: a heterogeneous RocksDB configuration pays the extra cost of faster storage but does not achieve any significant performance boost.

Why the lack of performance improvement? Figure 5a shows that a significant number of read queries for the top-100K popular keys are served from either L3 or L4, which are mapped to slower storage. This completely diminishes the impact that the faster storage tier would have on read latency. Optimizing LSM trees for heterogeneous storage can be thought of as a multi-tiered storage problem, where the first tier is DRAM, the second is NVM, and the remaining are one or more types of flash devices (or even HDDs). Yet we cannot rely on the LSM's traditional design to organize the tree into an efficient multi-tiered storage system, since it does not try to keep popular objects in higher levels of the LSM tree. In §4, we show how we can transform the existing LSM data structure to account for the read performance of different storage technologies using our new pinned compaction technique. Figure 5b shows how pinned compactions are effective at keeping popular keys in upper LSM levels, and serve more read queries from faster storage.

3.3 Caching Efficiency

Single and multi-disk experiments show that it is critical to consider LSM level dynamics when deploying LSM trees over heterogeneous storage. We now turn to another crucial aspect: the caching dynamics of LSM tree systems.

As discussed in §2.2, LSM trees cache at a block granularity (4 KB or more) to maximize I/O utilization. However, typical key-value pairs in production workloads are on the order of tens to hundreds of bytes [15, 22, 45]. Since the layout of keys within blocks depends only on writes, SST blocks contain objects with different read popularity. Figure 5a shows that the popular keys are scattered across different levels of the LSM and SST blocks.

Therefore, LSM tree block-level caching turns out to be less effective, as a significant percentage of the objects in the block cache may not be popular. This general limitation of LSM trees affects both the homogeneous and heterogeneous storage setups and is a key aspect of our design.
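As a rough illustration using the sizes quoted above (the specific numbers are illustrative, not measured): if a cached 4 KB block holds roughly forty 100-byte objects and only one of them is popular, the useful fraction of that block is only

$$\frac{100\ \text{B}}{4096\ \text{B}} \;\approx\; 2.4\%,$$

so the vast majority of the DRAM spent on that block holds cold data.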

4 PrismDB Design

In this section, we introduce PrismDB, a read-aware LSM tree. We first discuss design challenges, and then detail PrismDB's main design components. In addition, we discuss how to deploy PrismDB in practice.

PrismDB is comprised of three main components (Figure 6). The tracker is responsible for tracking which objects are popular. The mapper keeps track of the distribution of object popularity across the entire LSM tree and translates that distribution into an actionable algorithm for selecting which objects to pin. Then, the placer ensures that popular objects remain on higher levels of the LSM tree, where they will be stored on faster storage when using heterogeneous storage.

Figure 6: PrismDB system diagram.

4.1 Tracker: Lightweight Tracking of Keys

The first component in PrismDB, the tracker, is responsible for tracking which objects are frequently read, at a low overhead. There is a large body of work on how to track and estimate object popularity [13, 51, 55, 58, 59]. However, many of the existing mechanisms require a relatively large amount of data per object in order to track various access statistics (number of prior accesses, frequency, relationships with other objects, etc.) [13, 51]. Given that key-value objects are often small (e.g., less than 1 KB [15, 17, 45]), we need to limit the amount of metadata we use for tracking purposes per object, yet maintain an accurate prediction of how "hot" the object is, or how likely it is to be read or updated in the near future. In addition, LSM tree implementations support a high number of concurrent write and read operations to the database [5, 7]. This requires a high-performance popularity tracking mechanism that can track millions of small objects at high throughput.

Clock [20] is a well-known classical approach that approximates least recently used (LRU) while offering better space efficiency and concurrency [23, 25]. Using a single clock bit captures recency, while employing multiple clock bits can also track frequency. This makes clock an attractive option for estimating the relative popularity of different keys, in order to group more popular keys into higher LSM levels.

PrismDB's tracker implements the multi-bit clock algorithm for lightweight object tracking across all levels of the LSM. The tracker uses a concurrent hash map that maps keys to their clock bits. Each LSM read requires the tracker to update the clock bits of the object that was read. Since setting the clock bits is on the critical path of reads, the set operation needs to be lightweight. This leads to several performance optimizations in the tracker.

First, in order to save space, the tracker does not store the clock bits of all key-value pairs in the system, only the most recently read ones. Second, the tracker is optimized for concurrent key insertions and evictions. Traditional clock implementations use a doubly linked list or a circular buffer that contains the clock bits, along with a hash map that contains the mapping from key to clock bits, which requires that insertion and eviction operations be serialized. In contrast, PrismDB's tracker does not keep a separate buffer from the hash table. Therefore, it does not require any extra synchronization for those operations, and relies on the synchronization provided by the concurrent hash map. Third, the tracker conducts eviction offline, in a background process, so that when a read inserts a new key into the tracker and pushes it over its size limit, the eviction of an older key is not triggered on the critical path.
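The sketch below illustrates the recency update and the offline eviction sweep under a single lock, which is a deliberate simplification: as described above, the actual tracker relies on the concurrent hash map's own synchronization and atomic clock values, so the read-path update is never serialized behind the sweep.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Single-lock sketch of the tracker's recency update and offline eviction.
// The real tracker uses TBB's concurrent hash map and atomic clock values,
// so Touch() never serializes with the background sweep.
class TrackerSketch {
 public:
  explicit TrackerSketch(size_t capacity) : capacity_(capacity) {}

  // Read path: remember that `key` was just read.
  void Touch(const std::string& key) {
    std::lock_guard<std::mutex> g(mu_);
    auto it = clock_.find(key);
    if (it == clock_.end()) clock_[key] = 1;  // first touch starts "warm"
    else it->second = 3;                      // repeat touch becomes "hot"
  }

  // Background thread: one clock-hand sweep that ages entries and evicts
  // cold ones until the tracker is back under its capacity bound.
  void EvictionSweep() {
    std::lock_guard<std::mutex> g(mu_);
    for (auto it = clock_.begin();
         it != clock_.end() && clock_.size() > capacity_;) {
      if (it->second == 0) it = clock_.erase(it);  // cold: evict
      else { --(it->second); ++it; }               // otherwise: age it
    }
  }

 private:
  size_t capacity_;
  std::mutex mu_;
  std::unordered_map<std::string, uint8_t> clock_;  // key -> clock value (0..3)
};
```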

Figure 7: Clock value distributions under different workloads.

4.2 Mapper: Enforcing the Pinning Threshold

Ideally, we could set a "pinning" threshold on the popular keys. That is, at each pass of the compactor, PrismDB should pin some percentage (e.g., 10%) of the most popular objects that are being tracked by the tracker to the same level.

However, the actual relative popularity of an object depends on the clock bit distribution. For example, if PrismDB wants to enforce a pinning threshold of 10%, and exactly 10% of keys have all their clock bits set to 3 (using a two-bit clock, with 3 being most popular and 0 being least popular), then PrismDB should pin all the items with a clock value of 3. However, if 50% of the keys have a clock value of 3, then PrismDB should not pin all items with the clock value 3, otherwise it will significantly exceed the desired pinning threshold. To illustrate, Figure 7 shows that the clock value distributions change as a function of the workload.

To this end, the mapper is responsible for keeping track of the clock value distribution, and uses that distribution to enforce the pinning threshold. In order to maintain the clock value distribution, the mapper maintains the aggregate number of keys that have a particular clock value in an array. The array gets updated during insertion and eviction. During eviction, the tracker keeps count of the clock values of all the items it decremented and the item it evicted, and updates the mapper with the new distribution. It also updates the mapper when a new item is inserted into the hash map.

Pinning Threshold Algorithm. In order to enforce the pinning threshold, the mapper uses the following algorithm, which is best illustrated with an example. Suppose the clock distribution is similar to the YCSB-B workload depicted in Figure 7, where the percentage of keys with a clock value of 3 is about 10%, the percentage of those with a value of 2 is 10%, the percentage with a value of 1 is 30%, and the remaining 50% have a clock value of 0. Suppose the desired pinning threshold is 15%. If the placer encounters an item with a clock value of 3, it will always pin it. If it encounters an item with a clock value of 2, it will flip a coin on whether to keep it or not (in this example, with weight 0.5). If it encounters an item with a clock value of 1 or 0, or an item that is not currently being tracked (recall the tracker does not track all items in the database), that item will be compacted down.

To summarize, the mapper satisfies the pinning threshold using the highest-ranked clock items by descending rank, and if need be, randomly samples objects that belong to the lowest clock value needed to satisfy the threshold.
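A small sketch of this policy, assuming the mapper exposes its histogram as an array of per-clock-value counts; the function is a hypothetical helper rather than PrismDB's actual interface.

```cpp
#include <array>
#include <cstdint>

// Compute, for each clock value, the probability of pinning a key, given the
// mapper's histogram (counts[c] = number of tracked keys with clock value c)
// and a pinning threshold expressed as a fraction of tracked keys.
std::array<double, 4> PinProbabilities(const std::array<uint64_t, 4>& counts,
                                       double threshold_fraction) {
  uint64_t total = 0;
  for (uint64_t c : counts) total += c;
  double budget = threshold_fraction * static_cast<double>(total);

  std::array<double, 4> prob{0.0, 0.0, 0.0, 0.0};
  for (int clock = 3; clock >= 0 && budget > 0.0; --clock) {
    const double n = static_cast<double>(counts[clock]);
    if (n <= budget) {            // pin every key at this clock value
      prob[clock] = 1.0;
      budget -= n;
    } else {                      // pin a random sample at this clock value
      prob[clock] = budget / n;
      budget = 0.0;
    }
  }
  return prob;  // untracked keys are never pinned
}
```

Applied to the YCSB-B-like example above (10% of tracked keys at clock value 3, 10% at clock value 2, and a 15% threshold), this returns a pin probability of 1.0 for clock value 3, 0.5 for clock value 2, and 0 for the rest, matching the coin-flip example.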

4.3 Placer: Pinning Keys to Levels

The third component in our system, the placer, is responsible for pinning popular keys to LSM levels residing on different storage mediums.

Ideally, given our tiered storage design, we would like to keep all hot keys at the top of the LSM tree. However, we need to take LSM tree level sizing into account. Write amplification in LSM trees is minimized when the ratio between the sizes of each subsequent level is fixed across all levels, namely, that level Li is k-times larger than level Li−1, where k = 10 typically [22, 48]. This imposes a size constraint on each level, and means that we cannot indiscriminately pin all the hot objects to the top levels that reside on fast storage. Deviating from the LSM tree sizing rule can increase the overall write amplification and reduce overall performance, which we also confirmed experimentally.
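As a back-of-the-envelope check on the "90% or more" figure used throughout the paper (ignoring the comparatively tiny L0), with size multiplier k and n sized levels the bottom level holds a fraction

$$\frac{k^{\,n-1}}{\sum_{i=0}^{n-1} k^{\,i}} \;=\; \frac{10^{4}}{1 + 10 + 10^{2} + 10^{3} + 10^{4}} \;=\; \frac{10000}{11111} \;\approx\; 90\% \qquad (k = 10,\ n = 5)$$

of the data.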

Another design goal for the placer is that it must not violate the consistency guarantees of the LSM tree. LSM trees typically maintain consistency by always storing newer versions of objects in upper levels and older versions in the lower levels. Naively pinning hot keys to top levels, and thereby violating this versioning property, can break the consistency guarantee of LSM trees. (We discuss this further in §4.4.)

The placer also needs to be lightweight, since it is competing with reads, writes, and background compaction jobs for resources. It should avoid adding any locks on the database, since reads and writes are being served concurrently. The timing of when to trigger the pinning process is also important. Ideally, it should be done during periods of reduced database activity to avoid resource contention.

A popular object that is cached in DRAM may also get pinned by the placer to faster storage (double caching). In this case, pinning is still useful as it enables better hot-cold separation, which in turn improves DRAM cache efficiency. Ideally, the DRAM cache policy should be optimized to take tiered storage into account. We leave this for future work.

Pinned Compactions. To achieve the above stated goals, we combine the pinning logic with the compaction process of the LSM. We introduce a new compaction algorithm called pinned compaction. Compaction by default moves keys towards the bottom levels of the tree: it performs a merge sort on the SST files between two adjacent levels, and in the process it reads every single key from the SST files and writes them in sorted order into the output level. We extend the compaction process to not only retain popular keys in their level, but also to "pull up" popular keys stuck in the lower level, possibly on slower storage. We call this process, not found in traditional LSM trees, "up-compaction." Because each SST file is already sorted, the merge between SST files can be executed in an efficient, streaming manner, a property that up-compaction continues to maintain as well.

Figure 8: Example of pinned compaction.

Figure 8 shows an example of how pinned compactions work on files under compaction. Keys shown in red and gray are present in the tracker; keys in white are not. Keys 40, 35, and 65 (shown in red) are keys with clock value 3. Keys in gray have clock value 2. Assume that the pinning threshold pins all keys with clock value 3. During the merge process, instead of compacting everything down, key 40 is retained in Li−1. Also, the popular keys 35 and 65 from the lower level Li are compacted up and stored in Li−1. At a later time, if the key popularity changes and the keys residing in Li−1 either don't satisfy the pinning threshold or are evicted from the tracker, they will be un-pinned in the next compaction event and compacted down. Pinned compactions can temporarily generate small SST files comprised of popular keys. At a later point, when PrismDB selects these files again for compaction, they will be merged with other SST files. In our evaluation, we observed that less than 10% of the SST files in a level were small (less than 10MB).
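The core of pinned compaction can be sketched as a streaming two-way merge with two outputs; this is a simplified illustration (real compactions handle tombstones, sequence numbers, and many input files), and `should_pin` stands in for the tracker/mapper decision.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;  // (key, value), runs sorted by key

// Streaming merge of one sorted run from L(i-1) (newer data) and one from
// L(i) (older data). Keys the placer decides to pin go to the upper-level
// output; everything else is compacted down. On duplicate keys the newer
// (upper) version wins, preserving the LSM's version ordering.
void PinnedMerge(const std::vector<KV>& upper, const std::vector<KV>& lower,
                 const std::function<bool(const std::string&)>& should_pin,
                 std::vector<KV>* out_upper, std::vector<KV>* out_lower) {
  size_t i = 0, j = 0;
  auto emit = [&](const KV& kv) {
    if (should_pin(kv.first)) out_upper->push_back(kv);  // retain or pull up
    else out_lower->push_back(kv);                       // compact down
  };
  while (i < upper.size() && j < lower.size()) {
    if (upper[i].first < lower[j].first)      emit(upper[i++]);
    else if (lower[j].first < upper[i].first) emit(lower[j++]);
    else { emit(upper[i++]); ++j; }  // same key: keep newer, drop older
  }
  while (i < upper.size()) emit(upper[i++]);
  while (j < lower.size()) emit(lower[j++]);
}
```

Because both inputs are sorted and each key is emitted exactly once in global key order, both outputs remain sorted, which is what lets up-compaction keep the streaming property noted above.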

Choosing Which SST File to Compact. When a compaction job is triggered, it first chooses a level and then an SST file in that level for compaction. With a traditional LSM, picking the candidate SST file is based on some system-level objective like reclaiming storage space, reducing write amplification, and so forth.

For PrismDB, we introduce a new SST file selection criterion, which better aligns with the high-level goal of pinning keys to appropriate levels. When an SST file is created, a popularity score is assigned to it, based on the clock values of the objects present in that file. Objects not present in the tracker get a default score of -1. Since by default clock values range between [0,3], using them directly is problematic, since the difference between a highly popular key (clock value 3) and an unpopular key (clock value -1) is just 4, i.e., it only takes 3 unpopular keys to negate the contribution of a highly popular key. We devise a workaround to this problem by using a weight parameter n to boost the contributions of popular keys towards the SST popularity score:

$$\text{score} = \sum_{i} \bigl(\text{clock value of object}_i\bigr)^{n}$$

where i ranges over the keys in the SST file. In our experiments, we use n = 3.

During compaction, the SST files in a level are sorted in ascending order of popularity score, and the file with the lowest score is selected for compaction.
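A sketch of the scoring and selection step; `clock_value_of` is a hypothetical stand-in for a tracker lookup that returns -1 for untracked keys.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Popularity score of an SST file: the sum over its keys of (clock value)^n,
// where untracked keys contribute (-1)^n. n = 3 matches the paper's setting.
double SstPopularityScore(
    const std::vector<std::string>& keys,
    const std::function<int(const std::string&)>& clock_value_of, int n = 3) {
  double score = 0.0;
  for (const std::string& k : keys) {
    const int c = clock_value_of(k);  // -1 if the key is not tracked
    score += std::pow(static_cast<double>(c), n);
  }
  return score;
}

// Pick the file with the lowest score as the next compaction candidate, so
// that files full of unpopular keys are compacted down first.
// Assumes `scores` is non-empty (one entry per SST file in the level).
size_t PickFileToCompact(const std::vector<double>& scores) {
  return static_cast<size_t>(std::distance(
      scores.begin(), std::min_element(scores.begin(), scores.end())));
}
```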

4.4 Maintaining Consistency

LSM-backed databases like RocksDB ensure that database clients will always read the latest version of an object committed to the database. LSMs ensure this behavior by versioning all writes (inserts, updates, or deletes) to a key, where newer versions of the key reside in the upper levels of the LSM tree and older versions towards the tree's bottom. In this regard, when a read query traverses the tree from memtables, to L0, and then on downwards, the first version of the key it encounters is the latest (including a "tombstone" record for deletes). Even during compactions, this guarantee is maintained.

PrismDB's read-aware LSM tree maintains the same consistency guarantees. If a pinned compaction retains a set of objects in the same level, they will always be more recent versions than those present in the lower levels. Even when objects rise up a level during a pinned compaction, they don't break the LSM guarantee: if the same key is encountered when merging SST files from levels Li−1 and Li, the merge process preserves the (newer) key from level Li−1. In doing so, read-aware LSM trees continue to ensure the ordering of key versions down the tree.

4.5 Deploying PrismDB

We envision PrismDB being deployed in two different setups.

The first setup is on a single physical server containing more than one type of storage device. Similar to RocksDB, for load balancing purposes, each server would contain multiple instances of PrismDB, each of which would be assigned a certain ratio of each of the storage devices' capacity. In this paper, we focus on the NNNTQ setup to demonstrate how it can work with three storage devices. However, PrismDB's design is generalizable to other heterogeneous setups, including a simpler deployment scenario with just two storage tiers, such as NNNNQ. Such a configuration could be readily deployed with two devices (e.g., a 280 GB Intel 900P Optane SSD and a 2 TB Intel 760p QLC drive) using a 10:1 level ratio, where about 90% of the capacity would be allocated to QLC, and the rest to Optane.

The second setup is a disaggregated storage setting, where there are multiple disaggregated storage tiers, each of which has its own cost and performance characteristics. In this scenario, PrismDB's layers would be split across multiple different disaggregated storage tiers. Even the major public clouds today all support various forms of such disaggregated storage tiering.

Figure 9: PrismDB component interaction.

5 Implementation

We implemented PrismDB on top of RocksDB v6.2.0, which is written in C++. (A link to the PrismDB source code is elided for anonymity purposes.) The tracker is built on the concurrent hash map implementation from Intel's TBB library [29].

Intel's TBB concurrent hash map implements a hash table with chaining. Chained hashing offers constant-time insert operations; however, lookup and delete operations may be slower. In our system, inserts to the hash table happen on the read critical path, while lookups and deletes happen in the background. Additionally, the ratio of inserts to lookups in the tracker is 18:1, which makes hash chaining a good fit. In microbenchmarks, we found it took less than 2µs to insert a new key into the tracker.

The hash map is indexed by the object key, and each entry stores a 1-byte value. The top two bits are used for storing the clock value, while the bottom six bits store the last six bits of the hash of the key version. Multiple versions of the same client key can exist in the LSM, but only the latest version is useful for pinning objects.

Figure 9 shows the interaction between PrismDB's different design components. When a database client calls the get API, the key is inserted into the tracker. Typically an insert operation also invokes an eviction. However, since the tracker insert is on the critical path of reads for PrismDB, we defer the evictions to a background thread. If the key is not present in the tracker, insert writes the key with an initial clock value of 1. Setting the initial value to 3 can result in keys that are accessed only once staying in the tracker for a long time, since the clock value needs to be decremented down to 0 before the key is evicted. For a key that is already present in the tracker, if its version matches the tracker's version, then the clock value is set to 3; otherwise, it is treated as a new key.
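A sketch of the 1-byte entry layout and the get-path update rule just described; the helpers are illustrative, with the top two bits holding the clock value and the bottom six bits the version hash.

```cpp
#include <cstdint>

// One tracker entry is a single byte: the top two bits hold the clock value
// and the bottom six bits hold a hash of the key's version.
inline uint8_t PackEntry(uint8_t clock, uint8_t version_hash6) {
  return static_cast<uint8_t>(((clock & 0x3) << 6) | (version_hash6 & 0x3F));
}
inline uint8_t ClockOf(uint8_t entry) { return entry >> 6; }
inline uint8_t VersionHashOf(uint8_t entry) { return entry & 0x3F; }

// Update rule applied on each get(): a key not yet tracked starts at clock
// value 1; a repeat read of the same version jumps to 3; a read of a newer
// version is treated as a brand new key.
inline uint8_t OnGet(bool present, uint8_t old_entry, uint8_t version_hash6) {
  if (!present) return PackEntry(/*clock=*/1, version_hash6);
  if (VersionHashOf(old_entry) == (version_hash6 & 0x3F))
    return PackEntry(/*clock=*/3, version_hash6);
  return PackEntry(/*clock=*/1, version_hash6);  // version changed: reset
}
```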

A background eviction thread implements the clock hand movement using an iterator over the hash table. Note that new inserts into the tracker can happen concurrently while the background iterator is running, by relying on TBB's concurrent hash map guarantee [31] that as long as erase operations are serialized, an iterator running in one thread does not conflict with insert operations happening in other threads. The limitation of running concurrently is that the iterator may either miss visiting some items or visit them more than once within a full iteration. This makes our implementation an approximation of the clock algorithm. In the context of our system, such approximate behavior does not affect system performance. Finally, the clock value is stored as an atomic variable. Thus lookup does not need to be serialized with eviction.

Each client get updates the clock value distribution in the mapper. The mapper is implemented as an array of four atomic integers; each keeps track of the number of keys with a particular clock value. While running pinned compactions, the placer calls lookup to check if the key is present in the tracker. If so, it queries the mapper to determine whether to pin the key to the current level.

Workload   Description
YCSB-A     write intensive: 50% reads, 50% updates
YCSB-B     read intensive: 95% reads, 5% updates
YCSB-C     read only: 100% reads
YCSB-D     read intensive (latest): 95% reads, 5% inserts
YCSB-E     scan intensive: 95% scans, 5% inserts
YCSB-F     50% reads, 50% read-modify-writes

Table 2: YCSB workload description.

Figure 10: Throughput vs. storage cost of PrismDB, Mutant and RocksDB under YCSB-A (storage cost per configuration: NVM 91, TLC 21, QLC 9, het 11 cents/GB). The x-axis is log scale.

6 Evaluation

We evaluate PrismDB by answering the following questions:

1. How does PrismDB compare to RocksDB and Mutant?
2. In which workloads does PrismDB provide a benefit?
3. How does hot-cold separation affect DRAM hit rates?
4. What is the impact of pinned compactions on read and write amplification?
5. How does the pinning threshold affect performance?

Figure 11: Throughput comparison of RocksDB, Mutant and PrismDB under different YCSB workloads.

Configuration. We performed our experiments on a 32-core, 64 GB RAM machine running Ubuntu 18.04.2 LTS. Three different storage devices were locally attached to this machine, all running over PCIe 3.1: an Optane SSD (Intel 900p, which uses 3D XPoint), a TLC SSD (Intel 760p), and a QLC SSD (Intel 660p). All three are representative consumer NVMe devices of the type commonly employed in hyperscale or other datacenter settings to save costs.

Workload. We use YCSB [19] to compare performance across PrismDB, RocksDB, and Mutant. All YCSB experiments (Table 2) are run using 24 concurrent database clients. By default, we use YCSB-B with 95% reads-5% updates and a Zipfian request distribution. We use a 200 million key dataset, with an object size of 1KB, for a total database size of 223GB. We use 2 billion requests for the read intensive workloads (B, C and D), 600 million requests for the scan intensive workload (E) and 200 million requests for the update intensive workloads (A and F), with 30% of requests as warm-up. We use shorter workloads for the write-intensive workloads since they take a much longer time to run. We verified experimentally that the performance on these workloads remains stable if run with more keys. All the databases we compare in this section use a 1:10 ratio between DRAM and storage, where 20% of the DRAM is dedicated to the block cache (a common production configuration [22]). DRAM allocation is done using cgroups. In the heterogeneous configurations, the storage is divided roughly in a ratio of 1:9:90 between NVM, TLC, and QLC respectively, in the 5-level "NNNTQ" configuration shown in Figure 2b.

Baselines. We compare PrismDB against two baselines: RocksDB and Mutant [68]. Mutant is a storage layer for LSMs that assigns SST files to heterogeneous storage based on the access frequency of the SST files. It uses a background process called migration that moves SST files across storage types. To ensure an apples-to-apples performance comparison, we re-implemented Mutant on top of RocksDB v6.2.0. We set Mutant's SST file cooling coefficient α to 0.999 and its optimization epoch to 60 seconds. We did not implement Mutant's "migration resistance" optimization as it trades off storage space for a smaller number of migrations. For PrismDB, we set the tracker size to 10% of the database key space and the pinning threshold to 10%. All experiments use 8 threads for background compaction. SST index and filter block caching is enabled. Other settings are the defaults from RocksDB v6.2.0.

Figure 12: Average read and update latencies. Figures 12a and 12b use the YCSB-B workload (y-axis is log scale).

Figure 13: Performance with different YCSB-B Zipfian distributions.

6.1 Homogeneous vs. Heterogeneous

Figure 10 compares the average throughput and storage cost trade-off of PrismDB, RocksDB, and Mutant under four different configurations: three homogeneous configurations (NVM, TLC, and QLC) and the heterogeneous configuration (NNNTQ). We make a few observations.

First, as expected, NVM provides significantly higher throughput than TLC, which is slightly better than QLC. When we use RocksDB in the heterogeneous storage configuration, since its LSM tree is not read-aware, it provides a marginal performance benefit over the pure QLC setup.

Second, PrismDB's heterogeneous configuration outperforms the homogeneous configurations of TLC and QLC. Note that the cost of the heterogeneous setup is only 20% higher than pure QLC, because 90% of the database capacity (the bottom level) uses QLC. Therefore, it provides a superior throughput-cost trade-off than TLC, the standard flash technology used in most datacenter databases: RocksDB on TLC is almost 2× more expensive but 2.3× slower than PrismDB on heterogeneous storage.

Third, PrismDB outperforms RocksDB in all homogeneous setups. The primary reason for this is improved hot-cold separation of SST files and reduced background compactions from pinning popular keys in upper levels. We analyze this further in §6.3 and §6.4.

6.2 Heterogeneous Storage: PrismDB vs. RocksDB and Mutant

Figure 11 compares the average throughput of PrismDB to RocksDB and Mutant in the heterogeneous setup under different YCSB workloads.

In point-query workloads with different mixes of reads and writes (YCSB A, B, D, and F), PrismDB significantly outperforms the other systems due to its read-awareness.

In read-only and scan workloads, PrismDB provides no to low benefit. Recall that object pinning occurs during compaction, but in a 100% read workload (YCSB-C), compactions are not triggered, which leads to suboptimal placement. For the scan workload (YCSB-E), PrismDB provides only about a 15% improvement over RocksDB. The reason for this is that PrismDB makes pinning decisions on an individual object basis. Therefore, a scan that reads a large number of objects is likely to encounter at least some objects at all levels, diminishing the benefit of PrismDB's object-by-object placement.

Figures 12a and 12b present the read and update latencies of PrismDB compared to RocksDB and Mutant. PrismDB improves the average, 95th percentile (p95), and 99th percentile (p99) read latencies of RocksDB by 3.7×, 2.3× and 28×, respectively, and the corresponding update latencies by 3.1×, 2.2×, and 25.4×. As shown by prior work [12], the primary driver of high tail latencies in RocksDB is interference from compactions. As we show in §6.4, PrismDB's hot-cold separation and placement on different storage tiers significantly reduce compactions, which leads to greatly improved tail latencies for both reads and updates.

Figure 14: Comparing throughput, background compaction I/O and p99 tail latency under the YCSB-B workload (Fig. 14a shows write I/O per LSM level). Figures 14b and 14c include the warm-up phase, whereas 14d only shows the steady-state measurement.

Config    RocksDB    Mutant    PrismDB    Overall Improvement    Data Block Improvement
Optane    64.0%      n/a       76.3%      19.2%                  55.3%
TLC       56.4%      n/a       75.5%      33.8%                  82.6%
QLC       56.2%      n/a       75.4%      34.16%                 82.9%
Het       55.9%      51.6%     74.0%      32.4%                  85.5%

Table 3: DRAM cache hit rate improvement.

PrismDB improves the median (p50) read and update latency more modestly, by 18% and 20%. The primary reason for this is that more queries hit faster storage tiers with PrismDB. Figure 5, which compares the heatmaps of reads between RocksDB and PrismDB, confirms that PrismDB shifts more popular keys to faster devices.

PrismDB significantly outperforms Mutant for two reasons. First, Mutant's object placement scheme works at SST file granularity, whereas PrismDB is able to place at an object granularity. Thus, Mutant does not create hot-cold separation within SST blocks. Second, its migrations induce I/O and require entire files to be locked, both of which degrade throughput and cause latency to spike, even compared to RocksDB.

We also present the average latency for different YCSB workloads in Figures 12c and 12d. Similar to the throughput results, PrismDB's read latency actually benefits from some level of writes to generate compactions.

We compare PrismDB to RocksDB under different YCSB request distributions in Figures 13a, 13b and 13c. PrismDB provides no benefit for a Zipfian distribution with a parameter of 1.2 (and above). The reason for this is that the higher the Zipfian parameter, the more skewed the workload; therefore, once the workload becomes very skewed, almost all of its working set fits in RocksDB's block cache in DRAM. At that point, since PrismDB has a slightly higher overhead due to its tracker, which needs to be updated on every read operation, RocksDB slightly outperforms PrismDB.

6.3 Impact of Hot-Cold Separation

Since PrismDB pins hot objects to their levels, it naturally creates SST files that contain blocks with a similar level of read popularity. Since LSM trees cache data at the block level, this leads to a higher percentage of popular keys being read into DRAM, and therefore to higher hit rates.

To demonstrate the effect of PrismDB's hot-cold separation, Table 3 measures the hit rate of the block cache for SST data, index, and filter blocks. We also show the hit rate improvement for data blocks. (Note that, as mentioned in §2.2, LSM trees make extensive use of the OS page cache, for which it is difficult to measure the exact data block hit rate.) The table demonstrates that PrismDB significantly increases the data block hit rate for both homogeneous and heterogeneous configurations.

6.4 Impact on Compactions

We analyze the impact of PrismDB on I/O usage in Figure 14. The plot shows that PrismDB significantly reduces overall write I/O. Note that PrismDB has slightly higher write I/O in levels L1 and L2, which is expected as it pins popular objects to faster storage. By doing so, it prevents popular keys from compacting down the tree, thus reducing unnecessary compactions. This behavior is further confirmed in Figures 14b and 14c, which show the throughput degradation of RocksDB and Mutant as the background I/O increases, while PrismDB maintains a steady throughput. In particular, with YCSB-B, PrismDB conducts only 5,684 total compactions, compared to RocksDB, which does 9,047 compactions.

There are two primary reasons for the reduced I/O. The first is PrismDB's improved hot-cold separation, which results in a higher DRAM block hit rate. Therefore, reads as well as updates are more likely to be served from DRAM, and less likely to be read from and written to disk. The second reason is that by pinning more popular objects in the upper levels of the LSM tree, and by prioritizing the compaction of SST files with fewer popular objects (as described in §4.3), PrismDB reduces read amplification. The reason is that when looking up an object in storage, LSM trees sequentially search level 0, then level 1, and so forth. Since PrismDB keeps more popular objects in upper levels (e.g., see Figure 5), it reduces overall read amplification.
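To make the read-amplification argument concrete, the following simplified sketch shows the level-by-level lookup order of an LSM tree; the in-memory maps are stand-ins for memtables and SST files, not RocksDB's or PrismDB's actual structures.

// Simplified LSM read path: a lookup probes the memtable, then each level in
// order, stopping at the first hit. Objects kept in upper levels are found
// after fewer probes, which is the read-amplification reduction described above.
#include <map>
#include <optional>
#include <string>
#include <vector>

struct Level {
  std::map<std::string, std::string> entries;  // stand-in for a set of SST files
};

std::optional<std::string> Get(const std::map<std::string, std::string>& memtable,
                               const std::vector<Level>& levels,
                               const std::string& key,
                               int* probes_out) {
  int probes = 1;
  if (auto it = memtable.find(key); it != memtable.end()) {
    *probes_out = probes;
    return it->second;
  }
  for (const Level& level : levels) {        // L0, L1, L2, ... in order
    ++probes;
    if (auto it = level.entries.find(key); it != level.entries.end()) {
      *probes_out = probes;                  // fewer probes for keys kept high
      return it->second;
    }
  }
  *probes_out = probes;
  return std::nullopt;                       // not found at any level
}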

²Note that, as mentioned in §2.2, LSM trees make extensive use of the OS page cache, for which it is difficult to measure the exact data block hit rate.


[Figure 15: throughput (K ops/sec) as a function of the pinning threshold (0, 5, 10, 25, 50, 75, and 100% of tracker size).]

Figure 15: Effect of the pinning threshold on throughput. The experiment uses YCSB-B with 100M keys; the tracker size is 30M keys. The pinning threshold is depicted as a percentage of the tracker size.

6.5 Determining Pinning Threshold

We analyze the effect of the pinning threshold on the throughput of PrismDB in Figure 15. We observe that setting the threshold too low or too high is detrimental to performance. If the threshold is set too low, very few objects are pinned and PrismDB converges to a throughput similar to that of RocksDB. If it is set too high, the compactor becomes less effective at compacting SST files (since many objects are pinned), leading to higher I/O consumption and reduced throughput.³
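For illustration only, the threshold sweep in Figure 15 corresponds to a decision of the following shape; the tracker interface and rank-based cutoff below are assumptions made for this example, not PrismDB's exact implementation.

// Hypothetical pinning check: an object is pinned if its popularity rank falls
// within the top `threshold_pct` percent of the tracker's capacity. The rank
// function is an assumed helper, not part of PrismDB's published interface.
#include <cstddef>

struct PinningPolicy {
  std::size_t tracker_capacity;  // e.g., 30M tracked keys, as in Figure 15
  double threshold_pct;          // e.g., 5.0 means "pin the top 5%"

  // `popularity_rank` is the key's rank among tracked keys (0 = most popular).
  bool ShouldPin(std::size_t popularity_rank) const {
    std::size_t cutoff =
        static_cast<std::size_t>(tracker_capacity * threshold_pct / 100.0);
    return popularity_rank < cutoff;
  }
};

Under the setup of Figure 15 (a 30M-key tracker), a 5% threshold would pin on the order of 1.5M of the most popular keys, while a threshold near 100% pins most tracked keys and, as the figure shows, degrades compaction efficiency and throughput.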

7 Related Work

We review three bodies of related work that incorporate emerging storage technologies: databases and key-value stores, key-value caches, and file systems.

Databases and Key-value Stores: SplinterDB [18], KVell [37], and uDepot [34] are databases designed for fast NVMe devices (e.g., Optane SSD). They observe that unlike flash, NVM supports fast writes to small amounts of data. Therefore, they can avoid the buffering and large compactions employed by LSMs. However, since they are optimized for NVM, they would not perform well in a heterogeneous setting that includes traditional flash, which requires large contiguous writes for performance and endurance.

Mutant [68] is a key-value store that explores the performance-cost trade-off by using multi-tiered storage devices, although unlike PrismDB, it does not consider storage endurance aspects in its design. Mutant ranks the LSM tree's SST files based on their access frequencies and then places them on appropriate storage devices through a process called migration. However, as we showed in §6.2, Mutant performs poorly because migrations are expensive; during migration of SST files, the read latency can spike by an order of magnitude (as also reported in the Mutant paper). SST file migrations also increase background I/O significantly.

³This effect is reminiscent of the trade-off between disk capacity utilization and write costs in log-structured file systems [50].

MyNVM [24] and Wu et al. [64] incorporate fast NVMe devices into SQL databases as a first-level cache ahead of flash, but unlike PrismDB, they do not integrate the heterogeneous storage technologies into the basic LSM structure. Since neither of these systems is open source, we cannot compare to them directly. However, since these systems cache at block granularity, we expect that, similar to Mutant, their performance would suffer because they do not perform hot-cold separation of objects within the LSM levels.

In addition, there is extensive work on improving the performance of single-tier LSM trees. For example, TRIAD [11] employs several strategies to reduce write amplification, such as keeping frequently updated objects in memory and delaying compaction until the overlap between files is high. Other examples include PebblesDB [49], LSM-Trie [65], and WiscKey [38], each of which uses different data structures to minimize compaction I/O and write amplification. Other systems include EvenDB [26], which groups together key-value pairs that are likely to be accessed together. CLSM [27] and LOCS [60] improve the concurrency of LSM trees. AC-Key [62] adapts the sizes of different DRAM caches in the LSM tree to improve performance. SILK [12] uses an I/O scheduler that optimizes tail latency. The techniques employed by all of these systems are largely orthogonal to our work, and may be incorporated into PrismDB.

There have been many recent efforts to optimize key-value stores for persistent memory [9, 10, 33, 39, 46, 56, 57, 66, 67]. As we only consider heterogeneous NVMe devices, persistent memory is outside of our scope.

Caches: Flashield [23] is a hybrid memory and flash key-value cache that uses a clock-based algorithm to filter which objects should be stored in memory and which on flash. Similarly, RIPQ [55] uses segmented LRU to co-locate objects that have similar priorities. Fatcache [2], McDipper [6], and TAO [14] are all caches that use flash as a last-level cache to replace DRAM. All of these systems use some combination of storage and memory, but do not take advantage of both fast and cheap storage technologies.

File Systems: Strata [35] uses different types of storage (NVM, normal SSD, and HDD) to navigate the cost-performance trade-off. For example, similar to our design, data is asynchronously flushed down to cheaper storage devices. However, Strata focuses exclusively on the file system and its interactions with the kernel. As our results demonstrate, managing storage placement at the file level misses the fact that a file may contain hundreds or thousands of objects.

8 Summary

By combining multiple storage technologies within the same system, we can enable data systems that are both fast and cost-effective. In this paper, we demonstrate that by making LSM trees read-aware using pinned compactions, PrismDB achieves meaningful latency and throughput improvements while preserving lower costs. We believe that the general approach of simultaneously employing a mixture of storage technologies will likely prove useful for other areas of systems as well, and is an exciting area for future research.

References

[1] Bloom filter. https://en.wikipedia.org/wiki/Bloom_filter.

[2] Fatcache. https://www.github.com/twitter/fatcache.

[3] Flexible I/O tester. https://github.com/axboe/fio.

[4] Intel Optane memory. https://www.intel.com/content/www/us/en/architecture-and-technology/optane-memory.html.

[5] LevelDB. http://leveldb.org/.

[6] McDipper: A key-value cache for flash storage. https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920/.

[7] RocksDB wiki. github.com/facebook/rocksdb/wiki/.

[8] Samsung Z-SSD redefining fast responsiveness. https://www.samsung.com/semiconductor/ssd/z-ssd/.

[9] Joy Arulraj and Andrew Pavlo. How to build a non-volatile memory database management system. In Proc. Intl. Conference on Management of Data (SIGMOD), May 2017.

[10] Joy Arulraj, Matthew Perron, and Andrew Pavlo. Write-behind logging. Proc. VLDB Endowment, 10(4), November 2016.

[11] Oana Balmau, Diego Didona, Rachid Guerraoui, Willy Zwaenepoel, Huapeng Yuan, Aashray Arora, Karan Gupta, and Pavan Konka. TRIAD: Creating synergies between memory, disk and log in log structured key-value stores. In Proc. USENIX Annual Technical Conference (ATC), July 2017.

[12] Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. SILK: Preventing latency spikes in log-structured merge key-value stores. In Proc. USENIX Annual Technical Conference (ATC), July 2019.

[13] Nathan Beckmann, Haoxian Chen, and Asaf Cidon. LHD: Improving cache hit rate by maximizing hit density. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2018.

[14] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. TAO: Facebook's distributed data store for the social graph. In Proc. USENIX Annual Technical Conference (ATC), June 2013.

[15] Zhichao Cao, Siying Dong, Sagar Vemuri, and David H.C. Du. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proc. USENIX Conference on File and Storage Technologies (FAST), February 2020.

[16] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Computer Systems, 26(2), June 2008.

[17] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. Cliffhanger: Scaling performance cliffs in web memory caches. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), March 2016.

[18] Alexander Conway, Abhishek Gupta, Vijay Chidambaram, Martin Farach-Colton, Richard Spillane, Amy Tai, and Rob Johnson. SplinterDB: Closing the bandwidth gap for NVMe key-value stores. In Proc. USENIX Annual Technical Conference (ATC), 2020.

[19] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proc. Symposium on Cloud Computing (SoCC), June 2010.

[20] Fernando J. Corbato. A paging experiment with the Multics system. Technical report, DTIC Document, 1968.

[21] Disctech. Intel 760p. https://www.disctech.com/intel-760p-ssdpekkw128g8xt-128gb-pcie-nvme-solid-state-drive.

[22] Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, and Michael Strum. Optimizing space amplification in RocksDB. In Proc. Biennial Conference on Innovative Data Systems Research (CIDR), January 2017.

[23] Assaf Eisenman, Asaf Cidon, Evgenya Pergament, Or Haimovich, Ryan Stutsman, Mohammad Alizadeh, and Sachin Katti. Flashield: A hybrid key-value cache that controls flash write amplification. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), February 2019.


[24] Assaf Eisenman, Darryl Gardner, Islam AbdelRahman, Jens Axboe, Siying Dong, Kim Hazelwood, Chris Petersen, Asaf Cidon, and Sachin Katti. Reducing DRAM footprint with NVM in Facebook. In Proc. EuroSys Conference, April 2018.

[25] Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and concurrent MemCache with dumber caching and smarter hashing. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2013.

[26] Eran Gilad, Edward Bortnikov, Anastasia Braginsky, Yonatan Gottesman, Eshcar Hillel, Idit Keidar, Nurit Moscovici, and Rana Shahout. EvenDB: Optimizing key-value storage for spatial locality. In Proc. European Conference on Computer Systems (EuroSys), April 2020.

[27] Guy Golan-Gueta, Edward Bortnikov, Eshcar Hillel, and Idit Keidar. Scaling concurrent log-structured data stores. In Proc. European Conference on Computer Systems (EuroSys), April 2015.

[28] Laura M. Grupp, John D. Davis, and Steven Swanson. The bleak future of NAND flash memory. In Proc. USENIX Conference on File and Storage Technologies (FAST), February 2012.

[29] Intel. Intel Thread Building Blocks (TBB) library. https://software.intel.com/content/www/us/en/develop/tools/threading-building-blocks.html.

[30] Intel. Product brief: Intel Optane SSD 900P series PCIe. https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/optane-ssd-900p-brief.pdf.

[31] Intel. Traversing concurrent hash map concurrently. https://software.intel.com/en-us/blogs/2010/05/14/traversing-concurrent_hash_map-concurrently.

[32] Shehbaz Jaffer, Kaveh Mahdaviani, and Bianca Schroeder. Rethinking WOM codes to enhance the lifetime in new SSD generations. In Proc. USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2020.

[33] Olzhas Kaiyrakhmet, Songyi Lee, Beomseok Nam, Sam H. Noh, and Young-Ri Choi. SLM-DB: Single-level key-value store with persistent memory. In Proc. USENIX Conference on File and Storage Technologies (FAST), 2019.

[34] Kornilios Kourtis, Nikolas Ioannou, and Ioannis Koltsidas. Reaping the performance of fast NVM storage with uDepot. In Proc. USENIX Conference on File and Storage Technologies (FAST), February 2019.

[35] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In Proc. Symposium on Operating Systems Principles (SOSP), 2017.

[36] Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Operating Systems Review, 44(2), April 2010.

[37] Baptiste Lepers, Oana Balmau, Karan Gupta, and Willy Zwaenepoel. KVell: The design and implementation of a fast persistent key-value store. In Proc. Symposium on Operating Systems Principles (SOSP), 2019.

[38] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. WiscKey: Separating keys from values in SSD-conscious storage. In Proc. USENIX Conference on File and Storage Technologies (FAST), February 2016.

[39] Virendra J. Marathe, Margo Seltzer, Steve Byan, and Tim Harris. Persistent memcached: Bringing legacy code to byte-addressable persistent memory. In Proc. USENIX Conference on Hot Topics in Storage and File Systems (HotStorage), 2017.

[40] Yoshinori Matsunobu. MyRocks: A space and write-optimized MySQL database. code.facebook.com/posts/190251048047090/myrocks-a-space-and-write-optimized-mysql-database/.

[41] C. Mellor. Toshiba flashes 100TB QLC flash drive, may go on sale within months. Really. http://www.theregister.co.uk/2016/08/10/toshiba_100tb_qlc_ssd//.

[42] Micron. Comparing SSD and HDD endurance in the age of QLC SSDs. https://www.micron.com/-/media/client/global/documents/products/white-paper/5210_ssd_vs_hdd_endurance_white_paper.pdf.

[43] Micron. QLC NAND technology. https://www.micron.com/products/advanced-solutions/qlc-nand.

[44] MongoDB. WiredTiger. http://www.wiredtiger.com/.

[45] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.


[46] Matheus Almeida Ogleari, Ethan L. Miller, and Jishen Zhao. Steal but no force: Efficient hardware undo+redo logging for persistent memory systems. In Proc. Intl. Symposium on High Performance Computer Architecture (HPCA), 2018.

[47] S. Ohshima and Y. Tanaka. New 3D flash technologies offer both low cost and low power solutions. https://www.flashmemorysummit.com/English/Conference/Keynotes.html.

[48] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4), 1996.

[49] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In Proc. Symposium on Operating Systems Principles (SOSP), 2017.

[50] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Trans. Computer Systems, 10(1), 1992.

[51] Zhenyu Song, Daniel S. Berger, Kai Li, and Wyatt Lloyd. Learning relaxed Belady for content distribution network caching. In Proc. USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020.

[52] Amy Tai, Andrew Kryczka, Shobhit O. Kanaujia, Kyle Jamieson, Michael J. Freedman, and Asaf Cidon. Who's afraid of uncorrectable bit errors? Online recovery of flash errors with distributed redundancy. In Proc. USENIX Annual Technical Conference (ATC), 2019.

[53] Billy Tallis. The Crucial P1 1TB SSD review: The other consumer QLC SSD. https://www.anandtech.com/show/13512/the-crucial-p1-1tb-ssd-review.

[54] Billy Tallis. Intel details upcoming SSDs for datacenter including QLC NAND. https://www.anandtech.com/show/13157/intel-details-upcoming-ssds-for-datacenter-including-qlc-nand.

[55] Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. RIPQ: Advanced photo caching on flash for Facebook. In Proc. USENIX Conference on File and Storage Technologies (FAST), 2015.

[56] Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. Managing non-volatile memory in database systems. In Proc. Intl. Conference on Management of Data (SIGMOD), 2018.

[57] Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann, and Alfons Kemper. Persistent memory I/O primitives. In Proc. Intl. Workshop on Data Management on New Hardware (DaMoN), 2019.

[58] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park. Cache modeling and optimization using miniature simulations. In Proc. USENIX Annual Technical Conference (ATC), 2017.

[59] Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad. Efficient MRC construction with SHARDS. In Proc. USENIX Conference on File and Storage Technologies (FAST), 2015.

[60] Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. An efficient design and implementation of LSM-tree based key-value store on open-channel SSD. In Proc. European Conference on Computer Systems (EuroSys), 2014.

[61] Sean Webster. Intel SSD 660p. https://www.tomshardware.com/reviews/intel-ssd-660p-qlc-nvme,5719.html.

[62] Fenggang Wu, Ming-Hong Yang, Baoquan Zhang, and David H.C. Du. AC-Key: Adaptive caching for LSM-based key-value stores. In Proc. USENIX Annual Technical Conference (ATC), July 2020.

[63] Kan Wu, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. Towards an unwritten contract of Intel Optane SSD. In Proc. USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), July 2019.

[64] Kan Wu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Rathijit Sen, and Kwanghyun Park. Exploiting Intel Optane SSD for Microsoft SQL Server. In Proc. Intl. Workshop on Data Management on New Hardware (DaMoN), 2019.

[65] Xingbo Wu, Yuehai Xu, Zili Shao, and Song Jiang. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In Proc. USENIX Annual Technical Conference (ATC), 2015.

[66] Fei Xia, Dejun Jiang, Jin Xiong, and Ninghui Sun. HiKV: A hybrid index key-value store for DRAM-NVM memory systems. In Proc. USENIX Annual Technical Conference (ATC), 2017.

[67] Ting Yao, Yiwen Zhang, Jiguang Wan, Qiu Cui, Liu Tang, Hong Jiang, Changsheng Xie, and Xubin He. MatrixKV: Reducing write stalls and write amplification in LSM-tree based KV stores with matrix container in NVM. In Proc. USENIX Annual Technical Conference (ATC), July 2020.


[68] Hobin Yoon, Juncheng Yang, Sveinn Fannar Kristjansson, Steinn E. Sigurdarson, Ymir Vigfusson, and Ada Gavrilovska. Mutant: Balancing storage cost and latency in LSM-tree data stores. In Proc. Symposium on Cloud Computing (SoCC), 2018.
