HighLoad++ 2013
MySQL versus something else
Evaluating alternative databases
Mark Callaghan
Small Data Engineer
October, 2013
Friday, October 25, 13
What metric is important?
▪ Throughput
▪ Throughput while minimizing response time variance
▪ Efficiency - reduce cost while meeting response time goals
My focus is storage efficiency
▪ Use flash to get IOPs
▪ Use spinning disks to get capacity
▪ Use both to reduce cost while improving quality of service
device          | frequent reads | frequent writes | read IOPs | write IOPs
flash           | yes            | yes             | yes       | maybe
flash           | yes            | no              | yes       | no
SATA, /dev/null | no             | yes             | no        | maybe
SATA, /dev/null | no             | no              | no        | no
What technology would you choose today?
▪ How do you value flexibility?
▪ Newer & faster hardware arrives each year
▪ Servers you buy today will be in production for a few years
▪ Software can last even longer in production
▪ We have several generations of HW on the small data tiers
▪ Pure-disk (SAS array + HW RAID)
▪ Flashcache (SATA array + HW RAID, flash)
▪ Pure-flash
Common definitions
▪ Sorted run - rows stored in key order
▪ may be stored using many range-partitioned files
▪ Memtable - sorted run in memory
▪ L0 - 1 or more sorted runs on disk
▪ L1, L2, ... Lmax - each is 1 sorted run on disk
▪ Lmax is the largest level
▪ by size L1 < L2 ... < Lmax
▪ live% - percentage of live data in the database
Amplification factors
▪ Framework for describing efficiency of database algorithms
▪ How much is done physically in response to a logical change?
▪ Read amplification
▪ Write amplification
▪ Space amplification
▪ Can determine
▪ How many disks or how much flash you must buy
▪ How long your flash might last
▪ Whether you can buy lower endurance flash
Read amplification
▪ Read-amp == disk reads per query
▪ Separate results for point query versus short range scan
▪ Assume some data is in cache
▪ Assume the index is covering for the query
▪ Example: b-tree with all non-leaf levels in cache
▪ Point read-amp - 1 disk read to get the leaf block
▪ Short range read-amp - 1 or 2 disk reads to get the leaf blocks
Read amplification and bloom filters
▪ Bloom filter summary
▪ f(key) -> { no, maybe }
▪ Use ~10 bits/row to get a reasonable false positive rate
▪ Great for avoiding disk reads on point queries
▪ Bonus - prevent disk reads for keys that don't exist
▪ Useless for general range scans like "select x where y < 100"
▪ Can be useful for equality prefix like "select x where q = 10 and y < 100"
▪ use bloom filter on q
▪ Too many bloom filter checks can hurt response time
▪ each sorted run on disk needs a bloom filter check
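As a back-of-envelope check on the ~10 bits/row figure, the standard bloom filter false-positive formula can be evaluated directly (a sketch, not any engine's real filter; the helper name is illustrative):

```python
import math

def bloom_false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Approximate false-positive rate of a bloom filter:
    (1 - e^(-k / (m/n)))^k, with m/n bits per key and k hash functions."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes

# With ~10 bits/key the optimal k is about 10 * ln(2), i.e. ~7 hashes,
# which lands the false positive rate near 1%.
rate = bloom_false_positive_rate(10, 7)
```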
Write amplification
▪ Write-amp == bytes written per byte changed
▪ Insert #"" bytes with write-amp=% and %"" bytes will be written
▪ For now ignore penalty from small random writes
▪ Some writes done immediately, others are deferred
▪ Immediate -> redo log
▪ Deferred -> b-tree dirty pages not forced on commit, LSM compaction
Write amplification, part 2
▪ HW can increase write-amp
▪ Read live pages and write them elsewhere when cleaning flash blocks
▪ Only a cost for algorithms that do small random writes
▪ Redo log writes can increase write-amp
▪ Writes must be done in multiples of 512 bytes or larger
▪ Insert 100 byte row, force 512 byte sector for redo: has write-amp=5
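The sector-alignment example works out like this (a sketch of the arithmetic; the helper name is illustrative):

```python
import math

def redo_write_amp(row_bytes: int, sector_bytes: int = 512) -> float:
    """Write-amp when each commit forces whole redo-log sectors:
    bytes physically written / bytes logically changed."""
    sectors = math.ceil(row_bytes / sector_bytes)
    return sectors * sector_bytes / row_bytes

# A 100-byte row forces one 512-byte sector: write-amp is ~5.
wa = redo_write_amp(100)
```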
Why write amplification matters
▪ Write endurance for flash device
▪ The wrong algorithm can wear out the device too soon
▪ The right algorithm might let you buy a lower cost/endurance device
▪ Write-amp can predict peak performance
▪ If storage can sustain 400 MB/second of writes
▪ And write-amp is 10
▪ Then database can sustain 40 MB/second of changes
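The peak-rate arithmetic above as a one-line sketch (the helper name is illustrative):

```python
def sustainable_change_rate(device_write_mb: float, write_amp: float) -> float:
    """Peak rate of logical changes = device write bandwidth / write-amp."""
    return device_write_mb / write_amp

# 400 MB/s of device writes with write-amp 10 -> 40 MB/s of changes.
rate = sustainable_change_rate(400, 10)
```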
Simple request - make counting faster
▪ Some web-scale workloads need to maintain counts
▪ Database is IO-bound
▪ Workload should be write-heavy, counters might not be read
▪ update foo set count = count + 1 where key = 'bar'
▪ Read-modify-write
▪ Write-only: write delta, merge deltas later when queried/compacted
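The write-only variant can be sketched as a toy in-memory structure (a hypothetical class for illustration, not any real engine's merge-operator API):

```python
from collections import defaultdict

class DeltaCounter:
    """Toy write-optimized counter: incr() blindly appends a delta
    (no read-modify-write); deltas are merged only when read."""
    def __init__(self):
        self._deltas = defaultdict(list)

    def incr(self, key, delta=1):
        self._deltas[key].append(delta)  # write-only path, no read needed

    def get(self, key):
        total = sum(self._deltas[key])   # merge deltas on read
        self._deltas[key] = [total]      # collapse them, like compaction
        return total

c = DeltaCounter()
for _ in range(3):
    c.incr("bar")
```

In a real LSM store the merge would happen during compaction or on read, so the hot write path never pays for a disk read.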
Space amplification
▪ Space-amp == sizeof(database files) / sizeof(data)
▪ Ignore secondary indexes
▪ Assume database files are in steady state (fragmented & compacted)
▪ Space-amp == 100 / %live
▪ Things that change space amplification
▪ B-tree fragmentation
▪ Old versions of rows that are yet to be collected
▪ Compression
▪ Per row/page metadata (rollback pointer, transaction ID, ...)
Space versus write amplification
▪ Sorry for the confusion
▪ Databases store N blocks in 1 extent
▪ Flash devices store N pages in 1 block
▪ Copy out
▪ Read live data from the cleaned extent, write it elsewhere
▪ Cost is a function of the percentage of live data
▪ Larger live% means less space and more write amplification
▪ Smaller live% means more space and less write amplification
Space versus write amplification
[Diagram: cleaning one flash block. Old flash block: 75% dead pages, 25% live pages (assume all blocks have 25% live pages). Block cleaning copies the 25 live pages to a new block, leaving 75 pages ready for new writes.]
▪ Write 100 pages total per 75 new page writes:
▪ %live is 25%
▪ write-amp is 100 / (100 - %live) == 100 / 75
▪ space-amp is 100 / %live == 4
Disclaimer
▪ There are many assumptions in the rest of the slides.
▪ Assumption #1: workloads have no skew.
▪ Most real workloads have skew.
▪ Let's save skew for a much longer discussion
▪ Assumption #2: workload is update-only
▪ I am trying to start a discussion rather than solve everything.
▪ This won't be confused with a lecture on algorithm analysis.
▪ We might disagree on technology, but we can agree on terminology
Database algorithms
▪ B-tree
▪ Update-in-place (UIP)
▪ Copy-on-write using sequential (COW-S) and random (COW-R) writes
▪ Log structured merge tree (LSM)
▪ LevelDB-style compaction (leveled)
▪ HBase-style compaction (n-files, size-tiered)
▪ Other
▪ Log-only - Bitcask
▪ Memtable + L1 - Sophia via sphia.org
▪ Memtable, L0, L1 - MaSM
▪ TokuDB/TokuMX - fractional cascading
B-tree
algorithm | fixed-page (fragments) | in-place write-back | needs garbage collection (block or extent cleaning) | example
UIP       | yes                    | yes                 | single-block HW GC if flash                         | InnoDB
COW-R     | yes                    | no                  | single-block HW GC if flash                         | LMDB
COW-S     | no                     | no                  | multi-block SW GC                                   | ?
B-tree: UIP and COW-R
▪ When non-leaf levels are in cache
▪ Point read-amp is 1, range read-amp is 1 or 2
▪ When dirty pages are forced after each row change
▪ Write-amp is sizeof(page) / sizeof(row)
▪ More write-amp from torn-page protection
▪ Add +1 for redo log
▪ Include HW write-amp when using flash
▪ Forcing data pages too soon increases write-amp
B-tree: UIP and COW-R, space amplification
▪ Fragmentation because b-tree pages are not full on average
▪ After a page split 1 full page becomes 2 half-full pages
▪ With InnoDB we have many indexes with pages that are ~60% full
▪ Fixed page size reduces compression - with InnoDB, 2X compression
▪ Default fixed page size is 8kb
▪ Compress 16kb to 6kb, still write out 8kb
▪ It is hard to use a compression window larger than one page
▪ Per-row metadata uses 13+ bytes in InnoDB
B-tree: COW-S
▪ Read amplification is the same as for UIP and COW-R
▪ Write amplification
▪ Smaller page size from better compression and no fragmentation
▪ Has SW write-amp, the cost of cleaning previously written extents
▪ No HW write-amp on flash
▪ Space amplification
▪ Compresses better than UIP/COW-R because page size is not fixed
▪ Almost no fragmentation
▪ Space-amp from old versions of pages that have yet to be cleaned
▪ More (less) space-amp means less (more) write-amp
LSM with leveled compaction
▪ Implemented by LevelDB and Cassandra
▪ Database is memtable, L0, L1, ..., Lmax
▪ Less read-amp and space-amp, more write-amp
▪ Similar to the original LSM design from the paper by O'Neil
▪ Difference is the use of many range-partitioned files per level
▪ Increases write-amp by a small amount
▪ Prevents temporary doubling of Lmax during compaction
▪ Compaction from L" to L#▪ reads N bytes from L#▪ reads #"*N bytes from L!▪ writes #"*N + N bytes back to L!
LSM with leveled compaction
[Diagram: memtable; Level 0 (1GB) holds 3 overlapping files, each covering keys 0..99; Level 1 (1GB) holds range-partitioned files (keys 00..01, 11..19, ..., 90..99); Level 2 (10GB, 10X more data) holds range-partitioned files (keys 000..001, 002..003, ..., 90..99)]
LSM with leveled compaction
▪ Point read amplification
▪ 1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read
▪ Range read amplification
▪ 1 disk read per level and per L0 file, bloom filters don't help
▪ Write amplification▪ #" per level starting with L! + # for redo + # for L" + ~# for L#
▪ Space amplification
▪ 1.1 assuming 90% of data is on the maximum level
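The write-amp accounting above can be sketched as a function of level count (an illustrative helper that assumes a 10X fanout, following the slide's tally):

```python
def leveled_write_amp(levels_at_fanout: int, fanout: int = 10) -> int:
    """Rough leveled-compaction write-amp: fanout per level starting
    with L2, plus 1 for redo, plus 1 for L0, plus ~1 for L1."""
    return fanout * levels_at_fanout + 1 + 1 + 1

# A database with levels L0..L4 (L2, L3, L4 at the 10X fanout):
wa = leveled_write_amp(3)
```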
LSM with n-files compaction
▪ Implemented by HBase, WiredTiger and Cassandra
▪ Database is memtable, L0, L1
▪ Files in L" have varying sizes
▪ Less write-amp, more read-amp and space-amp
▪ Compaction cost determined by:
▪ #files merged at a time
▪ sizeof(L1) / sizeof(file created by memtable flush)
▪ If memtable is " GB, L" is &' GB, # files are merged at a time▪ then a row is written to files of size #, !, &, *, #), $! and )& GB▪ write-amp is (
LSM with n-files compaction, L"=#$ GB
[Diagram: memtable (1 GB) flushes into L0 files of sizes 1, 2, 2, 4, 4, 8, 8, 16, 16, 32, 32 GB; merging pairs eventually produces L1 (64 GB)]
LSM with n-files compaction
▪ Point read amplification
▪ 1 bloom filter check per file + 1 disk read
▪ Range read amplification
▪ 1 disk read per file, bloom filters don't help with range scans
▪ Write amplification
▪ Usually much less than leveled compaction
▪ Trade write for space amplification
▪ Add 1 for redo
▪ Space amplification
▪ Usually greater than 2
Log-only
▪ Bitcask (part of Riak, from Basho) is an example of this
▪ Data is written 1+ times
▪ Write data once to a log
▪ Write again when a row is live during log cleaning
▪ Copy data from tail to head of log when out of disk space
Log-only
[Diagram: Log 4 (newest log file) ... Log 1 (oldest log file). The cleaner reads the oldest log, copies live data to the head of the log as new data, and sends dead data to /dev/null.]
Log-only
▪ Point read amplification is 1
▪ Range read amplification is one per value in the range
▪ Write and space amplification are related
▪ Write amplification is #"" / (#"" - 'live)▪ Space amplification is #"" / 'live
▪ When 66% of the data in the logs is live
▪ Space-amp is 1.5
▪ Write-amp is 3
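The same formulas as a sketch (illustrative helper; %live is a percentage as on the slide):

```python
def log_only_amp(pct_live: float):
    """Write-amp and space-amp of a log-only store (Bitcask-style)
    as a function of the percentage of live data in the logs."""
    return 100.0 / (100.0 - pct_live), 100.0 / pct_live

# ~66% live data: write-amp 3, space-amp 1.5.
wa, sa = log_only_amp(100.0 * 2 / 3)
```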
Memtable + L"▪ I think Sophia (sphia.org) is an example of this▪ Database is memtable, L"▪ Do compaction between memtable & L" when memtable is full▪ Great when database on disk not too much bigger than RAM
Memtable + L"
L#
memtablecompact
new L#
Memtable + L"▪ Point read amplification is "▪ Range read amplification is "▪ Write amplification
▪ The ratio sizeof(database) / sizeof(memtable)
▪ +1 for redo log
▪ Space amplification is "
Memtable + L% + L"▪ MaSM is an example of this▪ Database is memtable, L!, L"
▪ sizeof(L") == sizeof(L#)▪ Looks like file structures from !-pass external sort
▪ Tradeoffs
▪ Minimize write-amp
▪ Maximize read-amp
Memtable + L% + L"
memtable
L#
L" L" L" L" L"
Merge all on compaction
Memtable + L% + L"▪ Point read amplification is " disk read + many bloom filter checks▪ Range read amplification " disk read per L! file + "▪ Write amplification is %
▪ Write to redo log, L0 and L1
▪ Space amplification is 2
TokuDB, TokuMX
▪ Read amplification
▪ 1 disk read for point queries
▪ 1 or 2 disk reads for range read queries
▪ Write amplification▪ #" per level + # for redo▪ Won’t use as many levels as LevelDB
▪ Space amplification
▪ No internal fragmentation, variable size pages are written
▪ Similar to LevelDB
Database algorithms
algorithm      | point read-amp | range read-amp | write-amp        | space-amp
UIP b-tree     | 1              | 1 or 2         | page/row * HW GC | 1.5 to 2
COW-R b-tree   | 1              | 1 or 2         | page/row * HW GC | 1.5 to 2
COW-S b-tree   | 1              | 1 or 2         | page/row * SW GC | 1
LSM leveled    | 1 + N*bloom    | N              | 10 per level     | 1.1X
LSM n-files    | 1 + N*bloom    | N              | can be < 10      | can be > 2
log-only       | 1              | N              | 1 / (1 - %live)  | 1 / %live
memtable+L1    | 1              | 1              | database/mem     | 1
memtable+L0+L1 | 1 + N*bloom    | N              | 3                | 2
tokudb         | 1              | 2              | 10 per level     | 1.1X
Two things to remember
▪ You can trade space/read versus write amplification
▪ Switch database algorithms or tune the existing algorithm
▪ Hard to minimize read, write & space amplification
▪ One size doesn’t fit all▪ The workload I care about has different types of indexes
▪ Some indexes should be optimized for short range scans
▪ Other indexes can be optimized for write amplification
▪ would be nice to support both in one database engine
Thank you
facebook.com/MySQLatFacebook
Mark Callaghan
Small Data Engineer
October, 2013