
CaSSanDra: An SSD Boosted Key-Value Store


DESCRIPTION

This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA. The full paper and additional information are available at: http://msrg.org/papers/Menon2013

Abstract: With the ever-growing size and complexity of enterprise systems, there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses. With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential and random access is becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.


Page 1: CaSSanDra: An SSD Boosted Key-Value Store


UNIVERSITY OF TORONTO | MIDDLEWARE SYSTEMS RESEARCH GROUP | MSRG.ORG

CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi (*), Hans-Arno Jacobsen

Page 2: CaSSanDra: An SSD Boosted Key-Value Store


Outline

• Application Performance Management

• Cassandra and SSDs

• Extending Cassandra's Row Cache

• Implementing a Dynamic Schema Catalogue

• Conclusions

Page 3: CaSSanDra: An SSD Boosted Key-Value Store


Modern Enterprise Architecture

• Many different software systems

• Complex interactions

• Stateful systems often distributed/partitioned/replicated

• Stateless systems certainly duplicated

Page 4: CaSSanDra: An SSD Boosted Key-Value Store


Application Performance Management

• Lightweight agent attached to each software system instance

• Monitors system health

• Traces transactions

• Determines root causes

• Raw APM metric:

[Diagram: monitoring agents attached to every component of the enterprise architecture.]

Page 5: CaSSanDra: An SSD Boosted Key-Value Store


Application Performance Management

• Problem: Agents have short memories and only a local view
  • What was the average response time for requests served by servlet X between December 18-31, 2011?
  • What was the average time spent in each service/database to respond to client requests?


Page 6: CaSSanDra: An SSD Boosted Key-Value Store


APM Metrics Datastore

• All agents store metric data in a high write-throughput datastore

• Metric data is at a fine granularity (per-action, per-millisecond, etc.)

• The user now has a global view of metrics

• What is the best database to store APM metrics?

[Diagram: agents streaming metrics into a central, yet-to-be-chosen (?) datastore.]

Page 7: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra Wins APM

• APM experiments performed by Rabl et al. [1] show that Cassandra performs best for the APM use case
  • In-memory workloads with 95%, 50%, and 5% reads
  • Workloads requiring disk access with 95%, 50%, and 5% reads

[Embedded figures and text from Rabl et al. [1], Figures 3-10: throughput (ops/sec) and read/write latency (ms, logarithmic scale) versus number of nodes (2-12) for Cassandra, HBase, Project Voldemort, VoltDB, Redis, and MySQL under Workload R (95% read), Workload RW (50% read), and Workload W (99% write); the slide highlights the throughput panels for 95% and 50% reads.]

[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf

Page 8: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra

• Built at Facebook by former Dynamo engineers
  • Open sourced to Apache in 2009

• DHT with consistent hashing (see the routing sketch below)
  • MD5 hash of the key
  • Multiple nodes handle segments of the ring for load balancing

• Dynamo distribution and replication model + BigTable storage model

[Diagram: BigTable-style storage components: Commit Log, Memtable, SSTables.]
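To make the routing concrete, the following is a minimal, hedged sketch of MD5-based consistent hashing in the spirit of Cassandra's RandomPartitioner; it is not Cassandra's actual code, and the node names and tokens in main are invented for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of MD5-based consistent hashing: each node owns a token on a ring,
// and a key is routed to the first node whose token is >= the key's
// MD5-derived token, wrapping around at the end of the ring.
public class TokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    public String nodeFor(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        BigInteger token = new BigInteger(1, digest);             // 128-bit positive token
        SortedMap<BigInteger, String> tail = ring.tailMap(token); // nodes at or after the token
        return tail.isEmpty() ? ring.firstEntry().getValue()      // wrap around the ring
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        TokenRing ring = new TokenRing();
        // Hypothetical tokens; a real cluster assigns one (or more) per node.
        ring.addNode("node-a", BigInteger.ONE.shiftLeft(126));
        ring.addNode("node-b", BigInteger.ONE.shiftLeft(127));
        ring.addNode("node-c", BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE));
        System.out.println(ring.nodeFor("HostA/AgentX/AVGResponse"));
    }
}
```

Because a key always lands on the first node at or after its token, adding or removing a node only remaps the keys in one segment of the ring, which is what makes the scheme attractive for load balancing.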

Page 9: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra and SSDs

• Improve performance by either adding nodes or improving per-node performance

• Node performance is directly dependent on the disk I/O performance of the system

• Cassandra stores two entities on disk:
  • Commit Log
  • SSTables

• Should SSDs be used to store both?

• We evaluated each possible configuration

Page 10: CaSSanDra: An SSD Boosted Key-Value Store


Experiment Setup

• Server specification:
  • 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD
  • Apache Cassandra 1.10

• Used the YCSB benchmark
  • 100M rows, 50GB total raw data, 'latest' distribution
  • 95% read, 5% write

• Minimum of three runs per workload, fresh data on each run

• Broken into phases:
  • Data load
  • Fragmentation
  • Cache warm-up
  • Workload (> 12h process)

Page 11: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD

• Location of the log is irrelevant

• Location of the data is important
  • Dramatic performance improvement of SSD over HDD

• SSD benefits from high parallelism

Configuration | # of clients | # of threads/client | Location of Data | Location of Commit Log
C1            | 1            | 2                   | RAID (HDD)       | RAID (HDD)
C2            | 1            | 2                   | RAID (HDD)       | SSD
C3            | 1            | 2                   | SSD              | RAID (HDD)
C4            | 1            | 2                   | SSD              | SSD
C5            | 4            | 16                  | RAID (HDD)       | RAID (HDD)
C6            | 4            | 16                  | SSD              | SSD

[Fig. 4 of the paper: (a) HDD vs. SSD throughput (ops/sec) and (b) latency (ms) across configurations C1-C6; (c) throughput and (d) latency at 99% fill for HDD vs. SSD, empty vs. full disk.]

Page 12: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD (II)

• SSD offers more than a 7x improvement in throughput on an empty disk

• SSD performance degrades by half as the storage device fills up

• Filling the SSD or running it near capacity is not advisable

[Fig. 4, as on the previous slide: HDD vs. SSD throughput and latency, including the 99%-fill (empty vs. full disk) comparison.]

Page 13: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD: Summary

• Cassandra benefits most when storing data on SSD (not the log)

• Location of the commit log is not important

• SSD performance is inversely proportional to the fill ratio

• Storing all data on SSD is uneconomical
  • Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly
  • SSDs have a limited lifetime (10-50K write-erase cycles), so they must be replaced more frequently

• Rabl et al. [1] show that adding a node is 100% costlier, with a 100% throughput improvement

• Build a hybrid system to get comparable performance at marginal cost

Page 14: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: Read + Write Path

• Write path is fast:
  1. Write the update into the commit log
  2. Write the update into the Memtable

• Memtables flush to SSTables asynchronously when full
  • Never blocks writes

• Read path can be slow (see the merge sketch after the diagram):
  1. Read the key-value from the Memtable
  2. Read the key-value from each SSTable on disk
  3. Construct a merged view of the row from each input source

• Each read needs to do O(# of SSTables) I/O

[Diagram: the update path hits the commit log and Memtable in memory, while the read path touches the Memtable and every SSTable on disk.]
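The O(# of SSTables) cost comes from the merge step. The following is a small, hedged sketch in plain Java (not Cassandra's code) of that merge: every source that may hold a fragment of the row is probed, and the newest timestamp wins per column. The Cell class and the sample data are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the merged read: a row's columns may be spread over the Memtable
// and several SSTables, so a read probes every source and keeps the newest
// version of each column. The per-source probe is what makes reads
// O(# of SSTables) in I/O.
public class MergedRead {
    static final class Cell {                        // one column version: value + write timestamp
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    static Map<String, Cell> merge(List<Map<String, Cell>> fragments) {
        Map<String, Cell> row = new HashMap<>();
        for (Map<String, Cell> fragment : fragments) {            // one probe per source
            for (Map.Entry<String, Cell> e : fragment.entrySet()) {
                Cell current = row.get(e.getKey());
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    row.put(e.getKey(), e.getValue());            // newest write wins
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Cell> memtable = new HashMap<>();
        memtable.put("Age", new Cell("26", 300));                 // most recent update, in memory
        Map<String, Cell> sstable1 = new HashMap<>();
        sstable1.put("First Name", new Cell("Prashanth", 100));
        sstable1.put("Age", new Cell("25", 100));                 // stale version on disk
        Map<String, Cell> sstable2 = new HashMap<>();
        sstable2.put("Department ID", new Cell("MSRG", 200));
        merge(Arrays.asList(memtable, sstable1, sstable2))
                .forEach((name, cell) -> System.out.println(name + " = " + cell.value));
    }
}
```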

Page 15: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: SSTables

• Cassandra allows blind writes

• Row data can become fragmented over multiple SSTables over time

• Bloom filters and indexes can potentially help

• Ultimately, multiple fragments need to be read from disk

Example row whose columns are spread across several SSTables:

Employee ID | First Name | Last Name | Age | Department ID
99231234    | Prashanth  | Menon     | 25  | MSRG

Page 16: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: Row Cache

• Row cache buffers the full merged row in memory

• A cache miss follows the regular read path, constructs the merged row, and brings it into the cache (sketched below)

• Makes the read path faster for frequently accessed data

• Problem: the row cache occupies memory
  • Takes precious memory away from the rest of the system

• Extend the row cache efficiently onto SSD

[Diagram: read/update paths with the in-memory Row Cache sitting in front of the Memtable, commit log, and SSTables.]
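The cache-miss behaviour described above is essentially a read-through cache. Below is a hedged sketch of that pattern (not Cassandra's code); the loader function stands in for the expensive Memtable/SSTable merge, and all names are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a read-through row cache: a hit returns the already-merged row;
// a miss runs the regular (expensive) merged read path, stores the result,
// and returns it.
public class ReadThroughRowCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> mergedReadPath;        // stands in for the SSTable merge

    public ReadThroughRowCache(Function<String, String> mergedReadPath) {
        this.mergedReadPath = mergedReadPath;
    }

    public String get(String key) {
        return cache.computeIfAbsent(key, mergedReadPath);        // miss: full read, then cached
    }

    public static void main(String[] args) {
        ReadThroughRowCache rowCache = new ReadThroughRowCache(
                key -> "merged row for " + key);                  // invented loader for illustration
        System.out.println(rowCache.get("99231234"));             // miss: runs the read path
        System.out.println(rowCache.get("99231234"));             // hit: served from memory
    }
}
```

The catch, as the slide points out, is that this map competes with the rest of the system for memory, which is what motivates spilling it onto SSD on the next slides.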

Page 17: CaSSanDra: An SSD Boosted Key-Value Store


Extended Row Cache

• Extend the row cache onto SSD (a sketch follows the diagram below)

• Chained with the in-memory row cache
  • LRU in memory, overflowing onto an LRU SSD row cache

• Implemented as append-only cache files
  • Efficient sequential writes
  • Fast random reads

• Zero I/O for a hit in the first-level row cache

• One random I/O on SSD for the second-level row cache

[Diagram: first-level row cache and second-level cache index in memory; second-level row cache on SSD; commit log and SSTables on disk.]
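Below is a hedged sketch of the two-level cache idea, not the authors' implementation: a bounded in-memory LRU spills evicted rows to an append-only file (standing in for the SSD cache file), and an in-memory index records each row's offset and length so that a second-level hit costs one random read. The class and file names are invented.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a two-level row cache: a bounded in-memory LRU holds hot rows;
// rows evicted from it are appended to a cache file (standing in for the SSD),
// and an in-memory index maps each key to (offset, length) so that a
// second-level hit costs exactly one random read.
public class TwoLevelRowCache {
    private final int l1Capacity;
    private final Map<String, String> l1;                          // first level, in memory (LRU)
    private final Map<String, long[]> ssdIndex = new HashMap<>();  // key -> {offset, length}
    private final RandomAccessFile ssdFile;                        // append-only second-level file

    public TwoLevelRowCache(int l1Capacity, String path) throws IOException {
        this.l1Capacity = l1Capacity;
        this.ssdFile = new RandomAccessFile(path, "rw");
        this.l1 = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > TwoLevelRowCache.this.l1Capacity) {
                    try {
                        spillToSsd(eldest.getKey(), eldest.getValue());
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                    return true;                                   // evict from memory after spilling
                }
                return false;
            }
        };
    }

    private void spillToSsd(String key, String row) throws IOException {
        byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
        long offset = ssdFile.length();
        ssdFile.seek(offset);
        ssdFile.write(bytes);                                      // sequential append
        ssdIndex.put(key, new long[] { offset, bytes.length });
    }

    public void put(String key, String row) {
        l1.put(key, row);
    }

    public String get(String key) throws IOException {
        String hit = l1.get(key);
        if (hit != null) return hit;                               // zero I/O: first-level hit
        long[] loc = ssdIndex.get(key);
        if (loc == null) return null;                              // miss: fall back to the read path
        byte[] bytes = new byte[(int) loc[1]];
        ssdFile.seek(loc[0]);
        ssdFile.readFully(bytes);                                  // one random read on the SSD
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        TwoLevelRowCache cache = new TwoLevelRowCache(2, "rowcache.bin"); // invented file name
        cache.put("k1", "row-1");
        cache.put("k2", "row-2");
        cache.put("k3", "row-3");                                  // evicts k1 into the file
        System.out.println(cache.get("k1"));                       // served from the second level
    }
}
```

The append-only layout trades space for speed: evicted rows are only ever written sequentially, which suits both SSDs and the write-optimized character of the surrounding system.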

Page 18: CaSSanDra: An SSD Boosted Key-Value Store


Evaluation: SSD Row Cache

• Setup:
  • 100M rows, 50GB total data, 6GB row cache

• Results:
  • 75% improvement in throughput
  • 75% improvement in latency
  • A RAM-only cache has too low a hit ratio

[Fig. 5 of the paper: (a) row cache throughput and (b) latency at 95/85/75% reads for cache disabled, RAM, and RAM+SSD; (c) dynamic schema throughput and (d) latency at 95/50/5% reads for regular vs. dynamic schema.]

Page 19: CaSSanDra: An SSD Boosted Key-Value Store


Dynamic Schema

• Key-value stores covet a schema-less data model
  • Very flexible, good for highly varying data
  • Schemas often change, so defining them up front can be detrimental

• Observation: many big data applications have relatively stable schemas
  • e.g., click streams, APM, sensor data, etc.

• Redundant schemas carry significant overhead in I/O and space usage

On-disk format (column names serialized with every row):
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988833 | Value | 4 | Max | 6 | Min | 1
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988848 | Value | 5 | Max | 7 | Min | 1
Metric Name | HostA/AgentX/Failures    | Timestamp | 1332988849 | All   | 4 | Warn | 3 | Error | 1

Application format:
Metric Name              | Timestamp  | Value | Max | Min
HostA/AgentX/AVGResponse | 1332988833 | 4     | 6   | 1

Page 20: CaSSanDra: An SSD Boosted Key-Value Store


Dynamic Schema (III)

• Don't serialize the redundant schema with each row

• Extract the schema from the data, store it on SSD, and serialize only a schema ID with the data (see the sketch after the example below)

• Allows for a large number of schemas

Old disk format (column names repeated per row):
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988833 | Value | 4 | Max | 6 | Min | 1
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988848 | Value | 5 | Max | 7 | Min | 1
Metric Name | HostA/AgentX/Failures    | Timestamp | 1332988849 | All   | 4 | Warn | 3 | Error | 1

Schema Catalogue (stored on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New disk format (schema ID per row):
HostA/AgentX/AVGResponse | 1332988833 | S1 | 4 | 6 | 1
HostA/AgentX/AVGResponse | 1332988848 | S1 | 5 | 7 | 1
HostA/AgentX/Failures    | 1332988849 | S2 | 4 | 3 | 1
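As a hedged illustration of the catalogue idea shown above (not the paper's implementation): the ordered column names of a row are interned once and assigned a short schema ID, and each encoded row carries only that ID plus its values; in the real system the catalogue itself lives on the SSD. Class and method names are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a schema catalogue: the ordered column names of a row are interned
// once and assigned a short schema ID; each encoded row then carries only that
// ID plus its values instead of repeating every column name.
public class SchemaCatalogue {
    private final Map<List<String>, Integer> idsBySchema = new HashMap<>();
    private final List<List<String>> schemasById = new ArrayList<>();

    public int intern(List<String> columnNames) {
        List<String> schema = new ArrayList<>(columnNames);
        Integer existing = idsBySchema.get(schema);
        if (existing != null) return existing;
        schemasById.add(schema);                                   // catalogue entry (on SSD in the paper)
        idsBySchema.put(schema, schemasById.size() - 1);
        return schemasById.size() - 1;
    }

    // "On-disk" form: schema ID followed by the values, no column names.
    public List<Object> encode(Map<String, String> row) {
        List<Object> encoded = new ArrayList<>();
        encoded.add(intern(new ArrayList<>(row.keySet())));
        encoded.addAll(row.values());
        return encoded;
    }

    // Rebuild the full row by looking the schema up in the catalogue.
    public Map<String, String> decode(List<Object> encoded) {
        List<String> schema = schemasById.get((Integer) encoded.get(0));
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            row.put(schema.get(i), (String) encoded.get(i + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        SchemaCatalogue catalogue = new SchemaCatalogue();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Metric Name", "HostA/AgentX/AVGResponse");
        row.put("Timestamp", "1332988833");
        row.put("Value", "4");
        row.put("Max", "6");
        row.put("Min", "1");
        List<Object> onDisk = catalogue.encode(row);
        System.out.println(onDisk);                                // [0, HostA/AgentX/AVGResponse, ...]
        System.out.println(catalogue.decode(onDisk));              // round-trips to the original row
    }
}
```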

Page 21: CaSSanDra: An SSD Boosted Key-Value Store


Evaluation: Dynamic Schema

• Setup:
  • 40M rows, 5-10 variable columns (638 schemas), 6GB row cache

• Results:
  • 10% reduction in disk usage (6.8GB vs. 6GB)
  • Slightly improved throughput, stable latency

• Effective SSD usage (only random reads) and reduced I/O and space usage

[Fig. 5, as on the previous evaluation slide: dynamic schema throughput and latency at 95/50/5% reads, regular vs. dynamic.]

Page 22: CaSSanDra: An SSD Boosted Key-Value Store


Conclusions

• Storing Cassandra commit logs on SSD doesn't help

• Running an SSD at capacity degrades its performance

• Using SSDs as a secondary row cache dramatically improves performance

• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O

Page 23: CaSSanDra: An SSD Boosted Key-Value Store


Thanks!

• Questions?

• Contact:
  • Prashanth Menon ([email protected])

Page 24: CaSSanDra: An SSD Boosted Key-Value Store


Future Work

• What types of tables benefit most from a dynamic schema?

• Impact of compaction on read-heavy workloads
  • How can SSDs be used to improve the performance of compaction?

• How does performance change when storing only SSTable indexes on SSD?