CASSANDRA & SOLID STATE DRIVES Rick Branson, DataStax

Cassandra and Solid State Drives


Page 1: Cassandra and Solid State Drives

CASSANDRA & SOLID STATE DRIVES

Rick Branson, DataStax

Page 2: Cassandra and Solid State Drives

FACT

CASSANDRA’S STORAGE ENGINE WAS OPTIMIZED FOR SPINNING DISKS

Page 3: Cassandra and Solid State Drives

LSM-TREES

Page 4: Cassandra and Solid State Drives

WRITE PATH

Page 5: Cassandra and Solid State Drives

Client -> Cassandra Node:

insert({ cf1: { row1: { col3: foo } } })

On-Disk Commit Log:

{ cf1: { row1: { col1: abc } } }
{ cf1: { row1: { col2: def } } }
{ cf1: { row1: { col1: <del> } } }
{ cf1: { row2: { col1: xyz } } }
{ cf1: { row1: { col3: foo } } }

In-Memory Memtable for “cf1”:

row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”

COMMIT
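The commit step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's actual code: the `Node` class, the `insert` method, and the `TOMBSTONE` marker (standing in for the slide's `<del>`) are hypothetical names, and a plain list stands in for the on-disk commit log.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # hypothetical marker standing in for the slide's <del>

@dataclass
class Node:
    commit_log: list = field(default_factory=list)  # append-only; stands in for the on-disk log
    memtables: dict = field(default_factory=dict)   # cf name -> {row key -> {column -> value}}

    def insert(self, mutation):
        # 1. Append the mutation to the commit log (a sequential write).
        self.commit_log.append(mutation)
        # 2. Apply it to the in-memory memtable of each column family it touches.
        for cf, rows in mutation.items():
            table = self.memtables.setdefault(cf, {})
            for row_key, cols in rows.items():
                table.setdefault(row_key, {}).update(cols)

# Replay the five mutations shown in the commit log on the slide.
node = Node()
node.insert({"cf1": {"row1": {"col1": "abc"}}})
node.insert({"cf1": {"row1": {"col2": "def"}}})
node.insert({"cf1": {"row1": {"col1": TOMBSTONE}}})
node.insert({"cf1": {"row2": {"col1": "xyz"}}})
node.insert({"cf1": {"row1": {"col3": "foo"}}})
```

After the replay, the log holds all five mutations in arrival order, while the memtable holds only the latest state per column, with the delete kept as a marker rather than removing the column.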

Page 6: Cassandra and Solid State Drives

On disk: SSTable 1, SSTable 2, SSTable 3, SSTable 4

In-Memory Memtable for “cf1”:

row1: col1: [del], col2: “def”, col3: “foo”
row2: col1: “xyz”

FLUSH
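A flush can be pictured as writing the memtable out once, in sorted order, as a file that is never modified again. The sketch below assumes a hypothetical `flush` function, uses nested tuples to stand in for the on-disk file, and uses `None` in place of the slide's `[del]` marker.

```python
def flush(memtable):
    """Return an immutable, key-sorted snapshot of a memtable (a stand-in for an SSTable)."""
    return tuple(
        (row_key, tuple(sorted(cols.items())))   # columns sorted by name
        for row_key, cols in sorted(memtable.items())  # rows sorted by key
    )

# The memtable from the slide, deliberately built out of order.
memtable = {
    "row2": {"col1": "xyz"},
    "row1": {"col3": "foo", "col2": "def", "col1": None},
}
sstable = flush(memtable)
```

Because the result is written in a single sorted pass and never edited afterwards, the disk sees only one sequential write per flush.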

Page 7: Cassandra and Solid State Drives

COMPACT

SSTables are merged to maintain read performance

[Diagram: SSTables 1, 2, 3, and 4 are merged into a single new SSTable]

Page 8: Cassandra and Solid State Drives

[Diagram: the four old SSTables, each crossed out with an X, above the new SSTable]

New SSTable is streamed to disk and old SSTables are erased

Page 9: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable
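Because every SSTable is already sorted, compaction reduces to a streaming merge that reads each input once. The sketch below is an illustration under assumptions, not Cassandra's implementation: `compact` is a hypothetical name, each SSTable is a sorted list of `(key, value)` pairs, and the position of a table in the input list (oldest first) decides which value wins. With `heapq.merge` the cost is O(N log k) for k input tables, which is effectively linear in N for a small, fixed k.

```python
import heapq

def compact(sstables):
    """Merge key-sorted SSTables (oldest first) into one sorted list; the newest value per key wins."""
    # Tag each entry with its table's age so equal keys come out oldest -> newest.
    tagged = (
        [(key, age, value) for key, value in table]
        for age, table in enumerate(sstables)
    )
    merged = []
    for key, _age, value in heapq.merge(*tagged):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)  # a newer SSTable overrides the older entry
        else:
            merged.append((key, value))
    return merged

older = [("row1", "abc"), ("row3", "x")]
newer = [("row1", "def"), ("row2", "xyz")]
result = compact([older, newer])
```

The inputs are only read and the output is only appended to, so the merge never needs to modify a file in place.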

Page 10: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable

IMPORTANT

Page 11: Cassandra and Solid State Drives

COMPARED

• Most popular data storage engines rewrite modified data in-place: MySQL (InnoDB), PostgreSQL, Oracle, MongoDB, Membase, BerkeleyDB, etc

• Most perform similar buffering of writes before flushing to disk

• ... but flushes are RANDOM writes

Page 12: Cassandra and Solid State Drives

SPINNING DISKS

• Dirt cheap: $0.08/GB

• Seek time limited by time it takes for drive to rotate: IOPS = RPM/60

• 7,200 RPM = ~120 IOPS

• 15,000 RPM has been the max for decades

• Sequential operations are best: 125MB/sec for modern SATA drives
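The rule of thumb above is easy to check with a little arithmetic. The sketch below uses the slide's figures; the 4 KB size of a random I/O is an assumption added for illustration and does not come from the slide.

```python
def max_random_iops(rpm):
    """Slide's rule of thumb: at most one random operation per rotation."""
    return rpm / 60

SEQUENTIAL_BYTES_PER_SEC = 125_000_000  # 125MB/sec, from the slide
IO_SIZE_BYTES = 4096                    # assumed size of one random I/O (not from the slide)

iops_7200 = max_random_iops(7_200)
iops_15000 = max_random_iops(15_000)
random_bytes_per_sec = iops_7200 * IO_SIZE_BYTES
sequential_advantage = SEQUENTIAL_BYTES_PER_SEC / random_bytes_per_sec

print(f"7,200 RPM: ~{iops_7200:.0f} IOPS")
print(f"random 4 KB I/O: ~{random_bytes_per_sec / 1e6:.2f} MB/sec")
print(f"sequential advantage: ~{sequential_advantage:.0f}x")
```

Under that assumption, a 7,200 RPM drive moves a few hundred times more data per second when it writes sequentially than when it seeks for every 4 KB.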

Page 13: Cassandra and Solid State Drives

THAT WAS THE WORLD IN WHICH CASSANDRA WAS BUILT

Page 14: Cassandra and Solid State Drives

2012: MLC NAND FLASH*

• Affordable: ~$1.75/GB street

• Massive IOPS: 39,500/sec read, 23,000/sec write

• Latency of less than 100µs

• Good sequential throughput: 270MB/sec read, 205MB/sec write

• Way cheaper per IOPS: $0.02 vs $1.25

* based on specifications provided by Intel for 300GB Intel 320 drive
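The cost-per-IOPS comparison can be roughly reproduced from the numbers on these slides. This is a back-of-the-envelope sketch: the slides do not say which IOPS rating or which hard drive size was used, so the match with the write rating and the implied hard drive price below are inferences, not stated facts.

```python
SSD_GB = 300
SSD_PRICE_PER_GB = 1.75
SSD_READ_IOPS = 39_500
SSD_WRITE_IOPS = 23_000

HDD_IOPS = 120             # 7,200 RPM figure from the spinning-disk slide
HDD_PRICE_PER_GB = 0.08
HDD_COST_PER_IOPS = 1.25   # as quoted on the slide

ssd_price = SSD_GB * SSD_PRICE_PER_GB                 # $525 for the whole drive
ssd_cost_per_read_iops = ssd_price / SSD_READ_IOPS
ssd_cost_per_write_iops = ssd_price / SSD_WRITE_IOPS

# Working backwards: the drive price and size that $1.25 per IOPS would imply.
hdd_implied_price = HDD_COST_PER_IOPS * HDD_IOPS
hdd_implied_gb = hdd_implied_price / HDD_PRICE_PER_GB

print(f"SSD: ${ssd_cost_per_write_iops:.3f} per write IOPS, ${ssd_cost_per_read_iops:.3f} per read IOPS")
print(f"HDD: ${hdd_implied_price:.0f} implied drive price, ~{hdd_implied_gb:.0f} GB")
```

The quoted $0.02 lines up with the write IOPS rating, and the $1.25 figure is consistent with a hard drive of a little under 2 TB at $0.08/GB.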

Page 15: Cassandra and Solid State Drives

WITH RANDOM ACCESS STORAGE, ARE CASSANDRA’S LSM-TREES OBSOLETE?

Page 16: Cassandra and Solid State Drives
Page 17: Cassandra and Solid State Drives

SOLID STATE HAS SOME MAJOR BUTS...

Page 18: Cassandra and Solid State Drives

... BUT

• Cannot overwrite directly: must erase first, then write


• Can write in small increments (4KB), but only erase in ~512KB blocks

• Latency: write is ~100µs, erase is ~2ms

• Limited durability: ~5,000 cycles (MLC) for each erase block
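To see why these constraints matter, consider what a true in-place update of a single 4 KB page would cost if the drive had to erase and rewrite the whole block around it. The sketch uses the slide's figures and assumes, for simplicity, that pages are programmed one after another and that reading the old contents is free.

```python
PAGE_KB = 4
ERASE_BLOCK_KB = 512
WRITE_US = 100    # time to write one page, from the slide
ERASE_US = 2_000  # time to erase one block (~2ms), from the slide

pages_per_block = ERASE_BLOCK_KB // PAGE_KB

# Naive in-place update: erase the block, then rewrite every page in it.
naive_overwrite_us = ERASE_US + pages_per_block * WRITE_US

# Appending the new version to an already-erased page costs one page write.
append_us = WRITE_US

slowdown = naive_overwrite_us / append_us
print(f"{pages_per_block} pages per erase block")
print(f"naive overwrite: {naive_overwrite_us} µs vs append: {append_us} µs ({slowdown:.0f}x)")
```

Under these assumptions, the gap between the two paths is two orders of magnitude.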

Page 19: Cassandra and Solid State Drives

WEAR LEVELING spreads writes across erase blocks so that erase operations are distributed evenly and no block wears out early

Page 20: Cassandra and Solid State Drives

WEAR LEVELING

Page 21: Cassandra and Solid State Drives

WEAR LEVELING

Erase Block

Page 22: Cassandra and Solid State Drives

WEAR LEVELING

Page 23: Cassandra and Solid State Drives

WEAR LEVELING

Page 24: Cassandra and Solid State Drives

WEAR LEVELING

Disk Page

Page 25: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Page 26: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Write 2

Page 27: Cassandra and Solid State Drives

WEAR LEVELING

Write 1

Write 2

Write 3

Page 28: Cassandra and Solid State Drives

Write 1

Write 2

Write 3

How is data from only Write 2 modified?

Remember: the whole block must be erased

Page 29: Cassandra and Solid State Drives

Mark Garbage

Page 30: Cassandra and Solid State Drives

Mark Garbage

Append Modified Data

Empty Block
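The "mark garbage, then append" behavior from the last few slides can be modeled with a small mapping table. This is a toy sketch, not how any particular drive's firmware works: `ToyFlash` and its fields are hypothetical names, and it tracks individual pages without modeling erase blocks or garbage collection.

```python
class ToyFlash:
    """Hypothetical translation layer: pages are append-only and stale copies become garbage."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages  # physical pages; None means empty
        self.garbage = set()             # physical page numbers holding stale data
        self.mapping = {}                # logical address -> physical page number
        self.next_free = 0

    def write(self, logical_addr, data):
        if self.next_free >= len(self.pages):
            raise RuntimeError("no empty pages left: garbage collection would be needed")
        old = self.mapping.get(logical_addr)
        if old is not None:
            self.garbage.add(old)        # cannot overwrite in place, so mark the old copy
        self.pages[self.next_free] = data
        self.mapping[logical_addr] = self.next_free
        self.next_free += 1

flash = ToyFlash(num_pages=8)
flash.write("A", "write 1")
flash.write("B", "write 2")
flash.write("C", "write 3")
flash.write("B", "write 2, modified")  # modify only the data from Write 2
```

After the last call, the old copy of "B" still occupies its physical page but is flagged as garbage, and the logical address points at the newly appended page.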

Page 31: Cassandra and Solid State Drives

Wait... GARBAGE?

Page 32: Cassandra and Solid State Drives

THAT MEANS...

Page 33: Cassandra and Solid State Drives

... fragmentation, WHICH MEANS...

Page 34: Cassandra and Solid State Drives

Garbage Collection!

Page 35: Cassandra and Solid State Drives

GARBAGE COLLECTION

• Compacts fragmented disk blocks

• Erase operations are a drag on performance

• Modern SSDs do this in the background... as much as possible

• If no empty blocks are available, GC must be done before ANY writes can complete

Page 36: Cassandra and Solid State Drives

WRITE AMPLIFICATION

• When only a few kilobytes are written, but fragmentation causes a whole block to be rewritten

• The smaller & more random the writes, the worse this gets

• Modern “mark and sweep” GC reduces it, but cannot eliminate it
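Write amplification is commonly expressed as the ratio of data physically written to flash to data the host asked to write. The sketch below applies that ratio to two extreme cases, using the 4KB page and ~512KB erase block sizes from the earlier slide; the function name and the two scenarios are illustrative assumptions.

```python
def write_amplification(host_pages, relocated_pages):
    """Pages physically written to flash divided by pages the host asked to write."""
    return (host_pages + relocated_pages) / host_pages

PAGES_PER_BLOCK = 512 // 4  # 4KB pages in a ~512KB erase block

# Worst case: one page is written, but reclaiming space means relocating
# the other 127 still-valid pages of the block.
worst = write_amplification(host_pages=1, relocated_pages=PAGES_PER_BLOCK - 1)

# Best case: a full block of new data replaces a block that is entirely garbage,
# so nothing has to be relocated.
best = write_amplification(host_pages=PAGES_PER_BLOCK, relocated_pages=0)

print(f"worst case: {worst:.0f}x, best case: {best:.0f}x")
```

Real workloads land somewhere between these two extremes, depending on how many valid pages are left in the blocks that garbage collection picks.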

Page 37: Cassandra and Solid State Drives

Torture test shows massive write performance drop-off for a heavily fragmented drive

Source: http://www.anandtech.com/show/4712/the-crucial-m4-ssd-update-faster-with-fw0009/6

Page 38: Cassandra and Solid State Drives

Some poorly designed drives COMPLETELY fall apart

Source: http://www.anandtech.com/show/5272/ocz-octane-128gb-ssd-review/6

Page 39: Cassandra and Solid State Drives

Even a well-behaved drive suffers significantly from the torture test

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Page 40: Cassandra and Solid State Drives

Post-torture, all disk blocks were marked empty, and the “fast” comes back...

Source: http://www.anandtech.com/show/4244/intel-ssd-320-review/11

Page 41: Cassandra and Solid State Drives
Page 42: Cassandra and Solid State Drives

“TRIM”

• Filesystems typically don’t erase data immediately when files are deleted; they just mark it as deleted and erase it later

• TRIM allows the OS to actively tell the drive when a region of disk is no longer used

• If an entire erase block is marked as unused, GC is avoided; otherwise TRIM just hastens the collection process

Page 43: Cassandra and Solid State Drives

TRIM only reduces the write amplification effect; it can’t eliminate it.

Page 44: Cassandra and Solid State Drives

THEN THERE’S LIFETIME...

Page 45: Cassandra and Solid State Drives
Page 46: Cassandra and Solid State Drives
Page 47: Cassandra and Solid State Drives

AnandTech estimates that modern MLC SSDs only last about 1.5 years under heavy MySQL load, which causes around 10x write amplification
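A rough model shows how strongly write amplification drives lifetime. The sketch below is not AnandTech's methodology: it assumes perfect wear leveling, borrows the 300GB capacity and ~5,000 erase cycles from the earlier slides, and works backwards to the sustained host write rate that would wear such a drive out in 1.5 years at 10x amplification.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def lifetime_years(capacity_bytes, erase_cycles, write_amplification, host_bytes_per_sec):
    """Years until every block reaches its erase limit, assuming perfect wear leveling."""
    host_bytes_until_worn = capacity_bytes * erase_cycles / write_amplification
    return host_bytes_until_worn / host_bytes_per_sec / SECONDS_PER_YEAR

CAPACITY = 300 * 10**9  # 300GB drive from the earlier slide
CYCLES = 5_000          # ~5,000 erase cycles (MLC) from the earlier slide

# Host write rate implied by "1.5 years at 10x" under these assumptions.
implied_rate = CAPACITY * CYCLES / 10 / (1.5 * SECONDS_PER_YEAR)

with_amplification = lifetime_years(CAPACITY, CYCLES, 10, implied_rate)
sequential_only = lifetime_years(CAPACITY, CYCLES, 1, implied_rate)

print(f"implied host write rate: ~{implied_rate / 1e6:.1f} MB/sec")
print(f"10x amplification: {with_amplification:.1f} years, 1x: {sequential_only:.1f} years")
```

In this model, lifetime scales inversely with write amplification, so the same host workload at 1x would last ten times longer.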

Page 48: Cassandra and Solid State Drives

REMEMBER THIS?

Page 49: Cassandra and Solid State Drives

TAKEAWAYS

• All disk writes are sequential, append-only operations

• On-disk tables (SSTables) are written in sorted order, so compaction is linear complexity O(N)

• SSTables are completely immutable

Page 50: Cassandra and Solid State Drives

CASSANDRA ONLY WRITES

SEQUENTIALLY

Page 51: Cassandra and Solid State Drives

“For a sequential write workload, write amplification is equal to 1,

i.e., there is no write amplification.”

Source: Hu, X.-Y., and R. Haas, “The Fundamental Limitations of Flash Random Write Performance: Understanding, Analysis, and Performance Modeling”
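The claim in the quote can be observed in a toy simulation. The sketch below is a deliberately small model, not the one from the cited paper: `simulate` is a hypothetical function, the drive has 8 blocks of 4 pages, the host addresses 24 logical pages, and garbage collection greedily picks the block with the fewest live pages.

```python
import random

def simulate(writes, num_blocks=8, pages_per_block=4):
    """Return the write amplification of a toy log-structured drive with greedy GC."""
    blocks = [[] for _ in range(num_blocks)]  # logical addresses stored in each block
    where = {}                                # logical address -> (block, slot) of live copy
    free = list(range(1, num_blocks))         # erased blocks
    active = 0                                # block currently being appended to
    flash_writes = 0

    def live(b):
        return [lpn for slot, lpn in enumerate(blocks[b]) if where[lpn] == (b, slot)]

    def program(lpn):
        nonlocal active, flash_writes
        if len(blocks[active]) == pages_per_block:
            active = free.pop()               # active block is full: open an erased one
        blocks[active].append(lpn)
        where[lpn] = (active, len(blocks[active]) - 1)
        flash_writes += 1

    def collect():
        candidates = [b for b in range(num_blocks) if b != active and b not in free]
        victim = min(candidates, key=lambda b: len(live(b)))
        survivors = live(victim)
        blocks[victim] = []                   # erase the victim block
        free.append(victim)
        for lpn in survivors:
            program(lpn)                      # relocating live pages costs extra writes

    for lpn in writes:
        if len(blocks[active]) == pages_per_block and len(free) == 1:
            collect()                         # running out of erased blocks
        program(lpn)
    return flash_writes / len(writes)

LOGICAL_PAGES = 24
sequential = [i % LOGICAL_PAGES for i in range(2400)]
rng = random.Random(0)
scattered = [rng.randrange(LOGICAL_PAGES) for _ in range(2400)]

seq_wa = simulate(sequential)
rand_wa = simulate(scattered)
print(f"sequential: {seq_wa:.2f}x, random: {rand_wa:.2f}x")
```

With sequential overwrites, each block is entirely garbage by the time it is collected, so nothing is relocated. Random overwrites leave live pages scattered across blocks, and those have to be copied before a block can be erased.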

Page 52: Cassandra and Solid State Drives

THANK YOU.

~ @rbranson