HighLoad++ 2013
MySQL versus something else
Evaluating alternative databases
Mark Callaghan
Small Data Engineer
October, 2013
Friday, October 25, 13
What metric is important?
▪ Throughput
▪ Throughput while minimizing response time variance
▪ Efficiency - reduce cost while meeting response time goals
My focus is storage efficiency
▪ Use flash to get IOPs
▪ Use spinning disks to get capacity
▪ Use both to reduce cost while improving quality of service
device          | frequent reads | frequent writes | read IOPs | write IOPs
flash           | yes            | yes             | yes       | maybe
flash           | yes            | no              | yes       | no
SATA, /dev/null | no             | yes             | no        | maybe
SATA, /dev/null | no             | no              | no        | no
What technology would you choose today?
▪ How do you value flexibility?
▪ Newer & faster hardware arrives each year
▪ Servers you buy today will be in production for a few years
▪ Software can last even longer in production
▪ We have several generations of HW on the small data tiers
▪ Pure-disk (SAS array + HW RAID)
▪ Flashcache (SATA array + HW RAID, flash)
▪ Pure-flash
Common definitions
▪ Sorted run - rows stored in key order
▪ may be stored using many range-partitioned files
▪ Memtable - sorted run in memory
▪ L0 - 1 or more sorted runs on disk
▪ L1, L2, ... Lmax - each is 1 sorted run on disk
▪ Lmax is the largest level
▪ by size L1 < L2 ... < Lmax
▪ live% - percentage of live data in the database
Amplification factors
▪ Framework for describing efficiency of database algorithms
▪ How much is done physically in response to a logical change?
▪ Read amplification
▪ Write amplification
▪ Space amplification
▪ Can determine
▪ How many disks or how much flash you must buy
▪ How long your flash might last
▪ Whether you can buy lower endurance flash
Read amplification
▪ Read-amp == disk reads per query
▪ Separate results for point query versus short range scan
▪ Assume some data is in cache
▪ Assume the index is covering for the query
▪ Example: b-tree with all non-leaf levels in cache
▪ Point read-amp - 1 disk read to get the leaf block
▪ Short range read-amp - 1 or 2 disk reads to get the leaf blocks
Read amplification and bloom filters
▪ Bloom filter summary
▪ f(key) -> { no, maybe }
▪ Use ~10 bits/row to get a reasonable false positive rate
▪ Great for avoiding disk reads on point queries
▪ Bonus - prevent disk reads for keys that don't exist
▪ Useless for general range scans like "select x where y < 100"
▪ Can be useful for equality prefix like "select x where q = 10 and y < 100"
▪ use bloom filter on q
▪ Too many bloom filter checks can hurt response time
▪ each sorted run on disk needs a bloom filter check
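As a back-of-envelope check on the ~10 bits/row figure, the standard bloom filter false-positive formula can be evaluated directly (a sketch, not any engine's real filter; the helper name is illustrative):

```python
import math

def bloom_false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Approximate false-positive rate of a bloom filter:
    (1 - e^(-k / (m/n)))^k, with m/n bits per key and k hash functions."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes

# With ~10 bits/key the optimal k is about 10 * ln(2), i.e. ~7 hashes,
# which lands the false positive rate near 1%.
rate = bloom_false_positive_rate(10, 7)
```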
Write amplification
▪ Write-amp == bytes written per byte changed
▪ Insert #"" bytes with write-amp=% and %"" bytes will be written
▪ For now ignore penalty from small random writes
▪ Some writes done immediately, others are deferred
▪ Immediate -> redo log
▪ Deferred -> b-tree dirty pages not forced on commit, LSM compaction
Write amplification, part 2
▪ HW can increase write-amp
▪ Read live pages and write them elsewhere when cleaning flash blocks
▪ Only a cost for algorithms that do small random writes
▪ Redo log writes can increase write-amp
▪ Writes must be done in multiples of 512 bytes or larger
▪ Insert 100 byte row, force 512 byte sector for redo: has write-amp=5
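The sector-alignment example works out like this (a sketch of the arithmetic; the helper name is illustrative):

```python
import math

def redo_write_amp(row_bytes: int, sector_bytes: int = 512) -> float:
    """Write-amp when each commit forces whole redo-log sectors:
    bytes physically written / bytes logically changed."""
    sectors = math.ceil(row_bytes / sector_bytes)
    return sectors * sector_bytes / row_bytes

# A 100-byte row forces one 512-byte sector: write-amp is ~5.
wa = redo_write_amp(100)
```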
Why write amplification matters
▪ Write endurance for flash device
▪ The wrong algorithm can wear out the device too soon
▪ The right algorithm might let you buy a lower cost/endurance device
▪ Write-amp can predict peak performance
▪ If storage can sustain 400 MB/second of writes
▪ And write-amp is 10
▪ Then database can sustain 40 MB/second of changes
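The peak-rate arithmetic above as a one-line sketch (the helper name is illustrative):

```python
def sustainable_change_rate(device_write_mb: float, write_amp: float) -> float:
    """Peak rate of logical changes = device write bandwidth / write-amp."""
    return device_write_mb / write_amp

# 400 MB/s of device writes with write-amp 10 -> 40 MB/s of changes.
rate = sustainable_change_rate(400, 10)
```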
Simple request - make counting faster
▪ Some web-scale workloads need to maintain counts
▪ Database is IO-bound
▪ Workload should be write-heavy, counters might not be read
▪ update foo set count = count + 1 where key = 'bar'
▪ Read-modify-write
▪ Write-only: write delta, merge deltas later when queried/compacted
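The write-only variant can be sketched as a toy in-memory structure (a hypothetical class for illustration, not any real engine's merge-operator API):

```python
from collections import defaultdict

class DeltaCounter:
    """Toy write-optimized counter: incr() blindly appends a delta
    (no read-modify-write); deltas are merged only when read."""
    def __init__(self):
        self._deltas = defaultdict(list)

    def incr(self, key, delta=1):
        self._deltas[key].append(delta)  # write-only path, no read needed

    def get(self, key):
        total = sum(self._deltas[key])   # merge deltas on read
        self._deltas[key] = [total]      # collapse them, like compaction
        return total

c = DeltaCounter()
for _ in range(3):
    c.incr("bar")
```

In a real LSM store the merge would happen during compaction or on read, so the hot write path never pays for a disk read.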
Space amplification
▪ Space-amp == sizeof(database files) / sizeof(data)
▪ Ignore secondary indexes
▪ Assume database files are in steady state (fragmented & compacted)
▪ Space-amp == 100 / %live
▪ Things that change space amplification
▪ B-tree fragmentation
▪ Old versions of rows that are yet to be collected
▪ Compression
▪ Per row/page metadata (rollback pointer, transaction ID, ...)
Space versus write amplification
▪ Sorry for the confusion
▪ Databases store N blocks in 1 extent
▪ Flash devices store N pages in 1 block
▪ Copy out
▪ Read live data from the cleaned extent, write it elsewhere
▪ Cost is a function of the percentage of live data
▪ Larger live% means less space and more write amplification
▪ Smaller live% means more space and less write amplification
Space versus write amplification
[Diagram: cleaning one flash block. Old flash block: 75% dead pages, 25% live pages (assume all blocks have 25% live pages). Block cleaning copies the 25 live pages to a new block, leaving 75 pages ready for new writes.]
▪ Write 100 pages total per 75 new page writes:
▪ %live is 25%
▪ write-amp is 100 / (100 - %live) == 100 / 75
▪ space-amp is 100 / %live == 4
Disclaimer
▪ There are many assumptions in the rest of the slides.
▪ Assumption #1: workloads have no skew.
▪ Most real workloads have skew.
▪ Let's save skew for a much longer discussion
▪ Assumption #2: workload is update-only
▪ I am trying to start a discussion rather than solve everything.
▪ This won't be confused with a lecture on algorithm analysis.
▪ We might disagree on technology, but we can agree on terminology
Database algorithms
▪ B-tree
▪ Update-in-place (UIP)
▪ Copy-on-write using sequential (COW-S) and random (COW-R) writes
▪ Log structured merge tree (LSM)
▪ LevelDB-style compaction (leveled)
▪ HBase-style compaction (n-files, size-tiered)
▪ Other
▪ Log-only - Bitcask
▪ Memtable + L1 - Sophia via sphia.org
▪ Memtable, L0, L1 - MaSM
▪ TokuDB/TokuMX - fractional cascading
B-tree
algorithm | fixed-page (fragments) | in-place write-back | needs garbage collection (block or extent cleaning) | example
UIP       | yes                    | yes                 | single-block HW GC if flash                         | InnoDB
COW-R     | yes                    | no                  | single-block HW GC if flash                         | LMDB
COW-S     | no                     | no                  | multi-block SW GC                                   | ?
B-tree: UIP and COW-R
▪ When non-leaf levels are in cache
▪ Point read-amp is 1, range read-amp is 1 or 2
▪ When dirty pages are forced after each row change
▪ Write-amp is sizeof(page) / sizeof(row)
▪ More write-amp from torn-page protection
▪ Add +1 for redo log
▪ Include HW write-amp when using flash
▪ Forcing data pages too soon increases write-amp
B-tree: UIP and COW-R, space amplification
▪ Fragmentation because b-tree pages are not full on average
▪ After a page split 1 full page becomes 2 half-full pages
▪ With InnoDB we have many indexes with pages that are ~60% full
▪ Fixed page size reduces compression - with InnoDB, 2X compression
▪ Default fixed page size is 8kb
▪ Compress 16kb to 6kb, still write out 8kb
▪ It is hard to use a compression window larger than one page
▪ Per-row metadata uses 13+ bytes in InnoDB
B-tree: COW-S
▪ Read amplification is the same as for UIP and COW-R
▪ Write amplification
▪ Smaller page size from better compression and no fragmentation
▪ Has SW write-amp, the cost of cleaning previously written extents
▪ No HW write-amp on flash
▪ Space amplification
▪ Compresses better than UIP/COW-R because page size is not fixed
▪ Almost no fragmentation
▪ Space-amp from old versions of pages that have yet to be cleaned
▪ More (less) space-amp means less (more) write-amp
LSM with leveled compaction
▪ Implemented by LevelDB and Cassandra
▪ Database is memtable, L0, L1, ..., Lmax
▪ Less read-amp and space-amp, more write-amp
▪ Similar to the original LSM design from the paper by O'Neil
▪ Difference is the use of many range-partitioned files per level
▪ Increases write-amp by a small amount
▪ Prevents temporary doubling of Lmax during compaction
▪ Compaction from L" to L#▪ reads N bytes from L#▪ reads #"*N bytes from L!▪ writes #"*N + N bytes back to L!
LSM with leveled compaction
[Diagram: memtable; Level 0 (1GB) holds 3 overlapping files, each covering keys 0..99; Level 1 (1GB) holds range-partitioned files (keys 00..01, 11..19, ..., 90..99); Level 2 (10GB, 10X more data) holds range-partitioned files (keys 000..001, 002..003, ..., 90..99)]
LSM with leveled compaction
▪ Point read amplification
▪ 1 bloom filter check per L0 file and per level for L1->Lmax + 1 disk read
▪ Range read amplification
▪ 1 disk read per level and per L0 file, bloom filters don't help
▪ Write amplification▪ #" per level starting with L! + # for redo + # for L" + ~# for L#
▪ Space amplification
▪ 1.1 assuming 90% of data is on the maximum level
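The write-amp accounting above can be sketched as a function of level count (an illustrative helper that assumes a 10X fanout, following the slide's tally):

```python
def leveled_write_amp(levels_at_fanout: int, fanout: int = 10) -> int:
    """Rough leveled-compaction write-amp: fanout per level starting
    with L2, plus 1 for redo, plus 1 for L0, plus ~1 for L1."""
    return fanout * levels_at_fanout + 1 + 1 + 1

# A database with levels L0..L4 (L2, L3, L4 at the 10X fanout):
wa = leveled_write_amp(3)
```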
LSM with n-files compaction
▪ Implemented by HBase, WiredTiger and Cassandra
▪ Database is memtable, L0, L1
▪ Files in L" have varying sizes
▪ Less write-amp, more read-amp and space-amp
▪ Compaction cost determined by:
▪ #files merged at a time
▪ sizeof(L1) / sizeof(file created by memtable flush)
▪ If memtable is " GB, L" is &' GB, # files are merged at a time▪ then a row is written to files of size #, !, &, *, #), $! and )& GB▪ write-amp is (
LSM with n-files compaction, L"=#$ GB
[Diagram: memtable (1 GB) flushes into L0 files of sizes 1, 2, 2, 4, 4, 8, 8, 16, 16, 32, 32 GB; merging pairs eventually produces L1 (64 GB)]
LSM with n-files compaction
▪ Point read amplification
▪ 1 bloom filter check per file + 1 disk read
▪ Range read amplification
▪ 1 disk read per file, bloom filters don't help with range scans
▪ Write amplification
▪ Usually much less than leveled compaction
▪ Trade write for space amplification
▪ Add 1 for redo
▪ Space amplification
▪ Usually greater than 2
Log-only
▪ Bitcask (part of Riak, from Basho) is an example of this
▪ Data is written 1+ times
▪ Write data once to a log
▪ Write again when a row is live during log cleaning
▪ Copy data from tail to head of log when out of disk space
Log-only
[Diagram: Log 4 (newest log file) ... Log 1 (oldest log file). The cleaner reads the oldest log, copies live data to the head of the log as new data, and sends dead data to /dev/null.]
Log-only
▪ Point read amplification is 1
▪ Range read amplification is one per value in the range
▪ Write and space amplification are related
▪ Write amplification is #"" / (#"" - 'live)▪ Space amplification is #"" / 'live
▪ When 66% of the data in the logs is live
▪ Space-amp is 1.5
▪ Write-amp is 3
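The same formulas as a sketch (illustrative helper; %live is a percentage as on the slide):

```python
def log_only_amp(pct_live: float):
    """Write-amp and space-amp of a log-only store (Bitcask-style)
    as a function of the percentage of live data in the logs."""
    return 100.0 / (100.0 - pct_live), 100.0 / pct_live

# ~66% live data: write-amp 3, space-amp 1.5.
wa, sa = log_only_amp(100.0 * 2 / 3)
```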
Memtable + L"▪ I think Sophia (sphia.org) is an example of this▪ Database is memtable, L"▪ Do compaction between memtable & L" when memtable is full▪ Great when database on disk not too much bigger than RAM
Memtable + L"
L#
memtablecompact
new L#
Memtable + L"▪ Point read amplification is "▪ Range read amplification is "▪ Write amplification
▪ The ratio sizeof(database) / sizeof(memtable)
▪ +1 for redo log
▪ Space amplification is "
Memtable + L% + L"▪ MaSM is an example of this▪ Database is memtable, L!, L"
▪ sizeof(L") == sizeof(L#)▪ Looks like file structures from !-pass external sort
▪ Tradeoffs
▪ Minimize write-amp
▪ Maximize read-amp
Memtable + L% + L"
memtable
L#
L" L" L" L" L"
Merge all on compaction
Memtable + L% + L"▪ Point read amplification is " disk read + many bloom filter checks▪ Range read amplification " disk read per L! file + "▪ Write amplification is %
▪ Write to redo log, L0 and L1
▪ Space amplification is 2
TokuDB, TokuMX
▪ Read amplification
▪ 1 disk read for point queries
▪ 1 or 2 disk reads for range read queries
▪ Write amplification▪ #" per level + # for redo▪ Won’t use as many levels as LevelDB
▪ Space amplification
▪ No internal fragmentation, variable size pages are written
▪ Similar to LevelDB
Database algorithms
algorithm      | point read-amp | range read-amp | write-amp        | space-amp
UIP b-tree     | 1              | 1 or 2         | page/row * HW GC | 1.5 to 2
COW-R b-tree   | 1              | 1 or 2         | page/row * HW GC | 1.5 to 2
COW-S b-tree   | 1              | 1 or 2         | page/row * SW GC | 1
LSM leveled    | 1 + N*bloom    | N              | 10 per level     | 1.1X
LSM n-files    | 1 + N*bloom    | N              | can be < 10      | can be > 2
log-only       | 1              | N              | 1 / (1 - %live)  | 1 / %live
memtable+L1    | 1              | 1              | database/mem     | 1
memtable+L0+L1 | 1 + N*bloom    | N              | 3                | 2
tokudb         | 1              | 2              | 10 per level     | 1.1X
Two things to remember
▪ You can trade space/read versus write amplification
▪ Switch database algorithms or tune the existing algorithm
▪ Hard to minimize read, write & space amplification
▪ One size doesn’t fit all▪ The workload I care about has different types of indexes
▪ Some indexes should be optimized for short range scans
▪ Other indexes can be optimized for write amplification
▪ would be nice to support both in one database engine
Thank you
facebook.com/MySQLatFacebook
Mark Callaghan
Small Data Engineer
October, 2013