
CaSSanDra: An SSD Boosted Key-Value Store


DESCRIPTION

This presentation was given by Prashanth Menon at ICDE '14 on April 3, 2014 in Chicago, IL, USA. The full paper and additional information are available at: http://msrg.org/papers/Menon2013

Abstract: With the ever-growing size and complexity of enterprise systems, there is a pressing need for more detailed application performance management. Due to the high data rates, traditional database technology cannot sustain the required performance. Alternatives are the more lightweight and, thus, more performant key-value stores. However, these systems tend to sacrifice read performance in order to obtain the desired write throughput by avoiding random disk access in favor of fast sequential accesses. With the advent of SSDs, built upon the philosophy of no moving parts, the boundary between sequential and random access is becoming blurred. This provides a unique opportunity to extend the storage memory hierarchy using SSDs in key-value stores. In this paper, we extensively evaluate the benefits of using SSDs in commercialized key-value stores. In particular, we investigate the performance of hybrid SSD-HDD systems and demonstrate the benefits of our SSD caching and our novel dynamic schema model.


Page 1: CaSSanDra: An SSD Boosted Key-Value Store


UNIVERSITY OF TORONTO | MIDDLEWARE SYSTEMS RESEARCH GROUP | MSRG.ORG

CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi (*), Hans-Arno Jacobsen

Page 2: CaSSanDra: An SSD Boosted Key-Value Store


Outline

• Application Performance Management

• Cassandra and SSDs

• Extending Cassandra's Row Cache

• Implementing a Dynamic Schema Catalogue

• Conclusions

Page 3: CaSSanDra: An SSD Boosted Key-Value Store


Modern Enterprise Architecture

• Many different software systems

• Complex interactions

• Stateful systems often distributed/partitioned/replicated

• Stateless systems certainly duplicated

Page 4: CaSSanDra: An SSD Boosted Key-Value Store


Application Performance Management

• Lightweight agent attached to each software system instance

• Monitors system health

• Traces transactions

• Determines root causes

• Raw APM metric:

[Diagram: monitoring agents attached to every component of the enterprise architecture.]

Page 5: CaSSanDra: An SSD Boosted Key-Value Store


Application Performance Management

• Problem: Agents have short memories and only a local view
  • What was the average response time for requests served by servlet X between December 18-31, 2011?
  • What was the average time spent in each service/database to respond to client requests?


Page 6: CaSSanDra: An SSD Boosted Key-Value Store


APM Metrics Datastore

• All agents store metric data in a high write-throughput datastore

• Metric data is at a fine granularity (per-action, per-millisecond, etc.)

• The user now has a global view of metrics

• What is the best database to store APM metrics?

[Diagram: agents streaming metrics into a central, yet-to-be-chosen (?) datastore.]

Page 7: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra Wins APM

• APM experiments performed by Rabl et al. [1] show that Cassandra performs best for the APM use case
  • In-memory workloads with 95%, 50%, and 5% reads
  • Workloads requiring disk access with 95%, 50%, and 5% reads

[Embedded figures and text from Rabl et al. [1], Figures 3-10: throughput (ops/sec) and read/write latency (ms, logarithmic scale) versus number of nodes (2-12) for Cassandra, HBase, Project Voldemort, VoltDB, Redis, and MySQL under Workload R (95% read), Workload RW (50% read), and Workload W (99% write); the slide highlights the throughput panels for 95% and 50% reads.]

[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf

Page 8: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra

• Built at Facebook by former Dynamo engineers
  • Open sourced to Apache in 2009

• DHT with consistent hashing (see the routing sketch below)
  • MD5 hash of the key
  • Multiple nodes handle segments of the ring for load balancing

• Dynamo distribution and replication model + BigTable storage model

[Diagram: BigTable-style storage components: Commit Log, Memtable, SSTables.]
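To make the routing concrete, the following is a minimal, hedged sketch of MD5-based consistent hashing in the spirit of Cassandra's RandomPartitioner; it is not Cassandra's actual code, and the node names and tokens in main are invented for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of MD5-based consistent hashing: each node owns a token on a ring,
// and a key is routed to the first node whose token is >= the key's
// MD5-derived token, wrapping around at the end of the ring.
public class TokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    public String nodeFor(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        BigInteger token = new BigInteger(1, digest);             // 128-bit positive token
        SortedMap<BigInteger, String> tail = ring.tailMap(token); // nodes at or after the token
        return tail.isEmpty() ? ring.firstEntry().getValue()      // wrap around the ring
                              : tail.get(tail.firstKey());
    }

    public static void main(String[] args) throws Exception {
        TokenRing ring = new TokenRing();
        // Hypothetical tokens; a real cluster assigns one (or more) per node.
        ring.addNode("node-a", BigInteger.ONE.shiftLeft(126));
        ring.addNode("node-b", BigInteger.ONE.shiftLeft(127));
        ring.addNode("node-c", BigInteger.ONE.shiftLeft(128).subtract(BigInteger.ONE));
        System.out.println(ring.nodeFor("HostA/AgentX/AVGResponse"));
    }
}
```

Because a key always lands on the first node at or after its token, adding or removing a node only remaps the keys in one segment of the ring, which is what makes the scheme attractive for load balancing.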

Page 9: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra and SSDs

• Improve performance by either adding nodes or improving per-node performance

• Node performance is directly dependent on the disk I/O performance of the system

• Cassandra stores two entities on disk:
  • Commit Log
  • SSTables

• Should SSDs be used to store both?

• We evaluated each possible configuration

Page 10: CaSSanDra: An SSD Boosted Key-Value Store


Experiment Setup

• Server specification:
  • 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel x520 SSD
  • Apache Cassandra 1.10

• Used the YCSB benchmark
  • 100M rows, 50GB total raw data, 'latest' distribution
  • 95% read, 5% write

• Minimum of three runs per workload, fresh data on each run

• Broken into phases:
  • Data load
  • Fragmentation
  • Cache warm-up
  • Workload (> 12h process)

Page 11: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD

• Location of the log is irrelevant

• Location of the data is important
  • Dramatic performance improvement of SSD over HDD

• SSD benefits from high parallelism

Configuration | # of clients | # of threads/client | Location of Data | Location of Commit Log
C1            | 1            | 2                   | RAID (HDD)       | RAID (HDD)
C2            | 1            | 2                   | RAID (HDD)       | SSD
C3            | 1            | 2                   | SSD              | RAID (HDD)
C4            | 1            | 2                   | SSD              | SSD
C5            | 4            | 16                  | RAID (HDD)       | RAID (HDD)
C6            | 4            | 16                  | SSD              | SSD

[Fig. 4 of the paper: (a) HDD vs. SSD throughput (ops/sec) and (b) latency (ms) across configurations C1-C6; (c) throughput and (d) latency at 99% fill for HDD vs. SSD, empty vs. full disk.]

Page 12: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD (II)

• SSD offers more than a 7x improvement in throughput on an empty disk

• SSD performance degrades by half as the storage device fills up

• Filling the SSD or running it near capacity is not advisable

[Fig. 4, as on the previous slide: HDD vs. SSD throughput and latency, including the 99%-fill (empty vs. full disk) comparison.]

Page 13: CaSSanDra: An SSD Boosted Key-Value Store


SSD vs. HDD: Summary

• Cassandra benefits most when storing data on SSD (not the log)

• Location of the commit log is not important

• SSD performance is inversely proportional to the fill ratio

• Storing all data on SSD is uneconomical
  • Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly
  • SSDs have a limited lifetime (10-50K write-erase cycles), so they must be replaced more frequently

• Rabl et al. [1] show that adding a node is 100% costlier, with a 100% throughput improvement

• Build a hybrid system to get comparable performance at marginal cost

Page 14: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: Read + Write Path

• Write path is fast:
  1. Write the update into the commit log
  2. Write the update into the Memtable

• Memtables flush to SSTables asynchronously when full
  • Never blocks writes

• Read path can be slow (see the merge sketch after the diagram):
  1. Read the key-value from the Memtable
  2. Read the key-value from each SSTable on disk
  3. Construct a merged view of the row from each input source

• Each read needs to do O(# of SSTables) I/O

[Diagram: the update path hits the commit log and Memtable in memory, while the read path touches the Memtable and every SSTable on disk.]
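The O(# of SSTables) cost comes from the merge step. The following is a small, hedged sketch in plain Java (not Cassandra's code) of that merge: every source that may hold a fragment of the row is probed, and the newest timestamp wins per column. The Cell class and the sample data are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the merged read: a row's columns may be spread over the Memtable
// and several SSTables, so a read probes every source and keeps the newest
// version of each column. The per-source probe is what makes reads
// O(# of SSTables) in I/O.
public class MergedRead {
    static final class Cell {                        // one column version: value + write timestamp
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    static Map<String, Cell> merge(List<Map<String, Cell>> fragments) {
        Map<String, Cell> row = new HashMap<>();
        for (Map<String, Cell> fragment : fragments) {            // one probe per source
            for (Map.Entry<String, Cell> e : fragment.entrySet()) {
                Cell current = row.get(e.getKey());
                if (current == null || e.getValue().timestamp > current.timestamp) {
                    row.put(e.getKey(), e.getValue());            // newest write wins
                }
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Cell> memtable = new HashMap<>();
        memtable.put("Age", new Cell("26", 300));                 // most recent update, in memory
        Map<String, Cell> sstable1 = new HashMap<>();
        sstable1.put("First Name", new Cell("Prashanth", 100));
        sstable1.put("Age", new Cell("25", 100));                 // stale version on disk
        Map<String, Cell> sstable2 = new HashMap<>();
        sstable2.put("Department ID", new Cell("MSRG", 200));
        merge(Arrays.asList(memtable, sstable1, sstable2))
                .forEach((name, cell) -> System.out.println(name + " = " + cell.value));
    }
}
```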

Page 15: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: SSTables

• Cassandra allows blind writes

• Row data can become fragmented over multiple SSTables over time

• Bloom filters and indexes can potentially help

• Ultimately, multiple fragments need to be read from disk

Example row whose columns are spread across several SSTables:

Employee ID | First Name | Last Name | Age | Department ID
99231234    | Prashanth  | Menon     | 25  | MSRG

Page 16: CaSSanDra: An SSD Boosted Key-Value Store


Cassandra: Row Cache

• Row cache buffers the full merged row in memory

• A cache miss follows the regular read path, constructs the merged row, and brings it into the cache (sketched below)

• Makes the read path faster for frequently accessed data

• Problem: the row cache occupies memory
  • Takes precious memory away from the rest of the system

• Extend the row cache efficiently onto SSD

[Diagram: read/update paths with the in-memory Row Cache sitting in front of the Memtable, commit log, and SSTables.]
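The cache-miss behaviour described above is essentially a read-through cache. Below is a hedged sketch of that pattern (not Cassandra's code); the loader function stands in for the expensive Memtable/SSTable merge, and all names are invented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a read-through row cache: a hit returns the already-merged row;
// a miss runs the regular (expensive) merged read path, stores the result,
// and returns it.
public class ReadThroughRowCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> mergedReadPath;        // stands in for the SSTable merge

    public ReadThroughRowCache(Function<String, String> mergedReadPath) {
        this.mergedReadPath = mergedReadPath;
    }

    public String get(String key) {
        return cache.computeIfAbsent(key, mergedReadPath);        // miss: full read, then cached
    }

    public static void main(String[] args) {
        ReadThroughRowCache rowCache = new ReadThroughRowCache(
                key -> "merged row for " + key);                  // invented loader for illustration
        System.out.println(rowCache.get("99231234"));             // miss: runs the read path
        System.out.println(rowCache.get("99231234"));             // hit: served from memory
    }
}
```

The catch, as the slide points out, is that this map competes with the rest of the system for memory, which is what motivates spilling it onto SSD on the next slides.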

Page 17: CaSSanDra: An SSD Boosted Key-Value Store


Extended Row Cache

• Extend the row cache onto SSD (a sketch follows the diagram below)

• Chained with the in-memory row cache
  • LRU in memory, overflowing onto an LRU SSD row cache

• Implemented as append-only cache files
  • Efficient sequential writes
  • Fast random reads

• Zero I/O for a hit in the first-level row cache

• One random I/O on SSD for the second-level row cache

[Diagram: first-level row cache and second-level cache index in memory; second-level row cache on SSD; commit log and SSTables on disk.]
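Below is a hedged sketch of the two-level cache idea, not the authors' implementation: a bounded in-memory LRU spills evicted rows to an append-only file (standing in for the SSD cache file), and an in-memory index records each row's offset and length so that a second-level hit costs one random read. The class and file names are invented.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a two-level row cache: a bounded in-memory LRU holds hot rows;
// rows evicted from it are appended to a cache file (standing in for the SSD),
// and an in-memory index maps each key to (offset, length) so that a
// second-level hit costs exactly one random read.
public class TwoLevelRowCache {
    private final int l1Capacity;
    private final Map<String, String> l1;                          // first level, in memory (LRU)
    private final Map<String, long[]> ssdIndex = new HashMap<>();  // key -> {offset, length}
    private final RandomAccessFile ssdFile;                        // append-only second-level file

    public TwoLevelRowCache(int l1Capacity, String path) throws IOException {
        this.l1Capacity = l1Capacity;
        this.ssdFile = new RandomAccessFile(path, "rw");
        this.l1 = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > TwoLevelRowCache.this.l1Capacity) {
                    try {
                        spillToSsd(eldest.getKey(), eldest.getValue());
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                    return true;                                   // evict from memory after spilling
                }
                return false;
            }
        };
    }

    private void spillToSsd(String key, String row) throws IOException {
        byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
        long offset = ssdFile.length();
        ssdFile.seek(offset);
        ssdFile.write(bytes);                                      // sequential append
        ssdIndex.put(key, new long[] { offset, bytes.length });
    }

    public void put(String key, String row) {
        l1.put(key, row);
    }

    public String get(String key) throws IOException {
        String hit = l1.get(key);
        if (hit != null) return hit;                               // zero I/O: first-level hit
        long[] loc = ssdIndex.get(key);
        if (loc == null) return null;                              // miss: fall back to the read path
        byte[] bytes = new byte[(int) loc[1]];
        ssdFile.seek(loc[0]);
        ssdFile.readFully(bytes);                                  // one random read on the SSD
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        TwoLevelRowCache cache = new TwoLevelRowCache(2, "rowcache.bin"); // invented file name
        cache.put("k1", "row-1");
        cache.put("k2", "row-2");
        cache.put("k3", "row-3");                                  // evicts k1 into the file
        System.out.println(cache.get("k1"));                       // served from the second level
    }
}
```

The append-only layout trades space for speed: evicted rows are only ever written sequentially, which suits both SSDs and the write-optimized character of the surrounding system.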

Page 18: CaSSanDra: An SSD Boosted Key-Value Store


Evaluation: SSD Row Cache

• Setup:
  • 100M rows, 50GB total data, 6GB row cache

• Results:
  • 75% improvement in throughput
  • 75% improvement in latency
  • A RAM-only cache has too low a hit ratio

[Fig. 5 of the paper: (a) row cache throughput and (b) latency at 95/85/75% reads for cache disabled, RAM, and RAM+SSD; (c) dynamic schema throughput and (d) latency at 95/50/5% reads for regular vs. dynamic schema.]

Page 19: CaSSanDra: An SSD Boosted Key-Value Store


Dynamic Schema

• Key-value stores covet a schema-less data model
  • Very flexible, good for highly varying data
  • Schemas often change, so defining them up front can be detrimental

• Observation: many big data applications have relatively stable schemas
  • e.g., click streams, APM, sensor data, etc.

• Redundant schemas carry significant overhead in I/O and space usage

On-disk format (column names serialized with every row):
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988833 | Value | 4 | Max | 6 | Min | 1
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988848 | Value | 5 | Max | 7 | Min | 1
Metric Name | HostA/AgentX/Failures    | Timestamp | 1332988849 | All   | 4 | Warn | 3 | Error | 1

Application format:
Metric Name              | Timestamp  | Value | Max | Min
HostA/AgentX/AVGResponse | 1332988833 | 4     | 6   | 1

Page 20: CaSSanDra: An SSD Boosted Key-Value Store


Dynamic Schema (III)

• Don't serialize the redundant schema with each row

• Extract the schema from the data, store it on SSD, and serialize only a schema ID with the data (see the sketch after the example below)

• Allows for a large number of schemas

Old disk format (column names repeated per row):
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988833 | Value | 4 | Max | 6 | Min | 1
Metric Name | HostA/AgentX/AVGResponse | Timestamp | 1332988848 | Value | 5 | Max | 7 | Min | 1
Metric Name | HostA/AgentX/Failures    | Timestamp | 1332988849 | All   | 4 | Warn | 3 | Error | 1

Schema Catalogue (stored on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New disk format (schema ID per row):
HostA/AgentX/AVGResponse | 1332988833 | S1 | 4 | 6 | 1
HostA/AgentX/AVGResponse | 1332988848 | S1 | 5 | 7 | 1
HostA/AgentX/Failures    | 1332988849 | S2 | 4 | 3 | 1
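As a hedged illustration of the catalogue idea shown above (not the paper's implementation): the ordered column names of a row are interned once and assigned a short schema ID, and each encoded row carries only that ID plus its values; in the real system the catalogue itself lives on the SSD. Class and method names are invented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a schema catalogue: the ordered column names of a row are interned
// once and assigned a short schema ID; each encoded row then carries only that
// ID plus its values instead of repeating every column name.
public class SchemaCatalogue {
    private final Map<List<String>, Integer> idsBySchema = new HashMap<>();
    private final List<List<String>> schemasById = new ArrayList<>();

    public int intern(List<String> columnNames) {
        List<String> schema = new ArrayList<>(columnNames);
        Integer existing = idsBySchema.get(schema);
        if (existing != null) return existing;
        schemasById.add(schema);                                   // catalogue entry (on SSD in the paper)
        idsBySchema.put(schema, schemasById.size() - 1);
        return schemasById.size() - 1;
    }

    // "On-disk" form: schema ID followed by the values, no column names.
    public List<Object> encode(Map<String, String> row) {
        List<Object> encoded = new ArrayList<>();
        encoded.add(intern(new ArrayList<>(row.keySet())));
        encoded.addAll(row.values());
        return encoded;
    }

    // Rebuild the full row by looking the schema up in the catalogue.
    public Map<String, String> decode(List<Object> encoded) {
        List<String> schema = schemasById.get((Integer) encoded.get(0));
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            row.put(schema.get(i), (String) encoded.get(i + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        SchemaCatalogue catalogue = new SchemaCatalogue();
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Metric Name", "HostA/AgentX/AVGResponse");
        row.put("Timestamp", "1332988833");
        row.put("Value", "4");
        row.put("Max", "6");
        row.put("Min", "1");
        List<Object> onDisk = catalogue.encode(row);
        System.out.println(onDisk);                                // [0, HostA/AgentX/AVGResponse, ...]
        System.out.println(catalogue.decode(onDisk));              // round-trips to the original row
    }
}
```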

Page 21: CaSSanDra: An SSD Boosted Key-Value Store


Evaluation: Dynamic Schema

• Setup:
  • 40M rows, 5-10 variable columns (638 schemas), 6GB row cache

• Results:
  • 10% reduction in disk usage (6.8GB vs. 6GB)
  • Slightly improved throughput, stable latency

• Effective SSD usage (only random reads) and reduced I/O and space usage

[Fig. 5, as on the previous evaluation slide: dynamic schema throughput and latency at 95/50/5% reads, regular vs. dynamic.]

Page 22: CaSSanDra: An SSD Boosted Key-Value Store


Conclusions

• Storing Cassandra commit logs on SSD doesn't help

• Running an SSD at capacity degrades its performance

• Using SSDs as a secondary row cache dramatically improves performance

• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O

Page 23: CaSSanDra: An SSD Boosted Key-Value Store


Thanks!

• Questions?

• Contact:
  • Prashanth Menon ([email protected])

Page 24: CaSSanDra: An SSD Boosted Key-Value Store


Future Work

• What types of tables benefit most from a dynamic schema?

• Impact of compaction on read-heavy workloads
  • How can SSDs be used to improve the performance of compaction?

• How does performance change when storing only SSTable indexes on SSD?