A Next Generation Storage Engine for NoSQL Database Systems
Chiyoung Seo, Software Architect, Couchbase Inc.
Chin Hong, VP Product Management, Couchbase Inc.
Contents
- Why a new KV storage engine?
- ForestDB Overview
- Compact Index Structures
- WAL (Write-Ahead Logging)
- Optimizations for SSDs (Solid-State Drives)
- Performance Evaluations: LevelDB, RocksDB, WiredTiger (B+Tree, LSM Tree)
- Summary
Modern Web / Mobile Applications
- Operate on huge volumes of unstructured data
- A significant amount of new data is constantly generated by hundreds of millions of users or devices
- Still require high performance and scalability in managing their ever-growing databases
- The underlying storage engine is one of the most critical parts of a database system for providing high performance and scalability
B+Tree
- The main storage index structure in the database field: SQLite, Couchstore, WiredTiger
- A generalization of the binary search tree
- Each node consists of two or more {key, value (or pointer)} pairs
- Fanout (or branching) degree: the number of KV pairs in a node
- Node size is generally fitted to a multiple of the page size
B+Tree Limitations
- Not suitable for indexing variable-length or fixed-length long keys
  - Significant space overhead, as entire key strings are indexed in non-leaf nodes
  - Tree depth grows quickly as more data is loaded
- In-place updates lead to database fragmentation
  - I/O performance degrades significantly as the data size grows and the database becomes fragmented
- Several variants of the B+Tree have been proposed; the most popular is the LSM Tree
[Diagram: B+Tree node layout of {key, value (or pointer)} pairs; longer keys reduce the fanout of each node]
LSM Tree (Log Structured Merge Tree)
- The main file organization for many products: HBase, Cassandra, LevelDB, MongoDB (WiredTiger-LSM)
- Improves write performance by (see the sketch below):
  - Appending all updated and new data to a sequential log
  - Deferring and batching index changes efficiently in sorted runs
[Diagram: an in-memory component is flushed/merged into on-disk C1, C2, ... trees over a sequential log; component capacity increases exponentially]
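To make the flush/merge idea concrete, here is a minimal, self-contained sketch (not LevelDB or ForestDB code; names and the flush threshold are illustrative) of the LSM write path: updates land in an in-memory sorted buffer, which is flushed as an immutable sorted run, and reads consult the runs newest-first.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Minimal LSM sketch: an in-memory memtable plus immutable sorted runs.
class TinyLSM {
    std::map<std::string, std::string> memtable;           // in-memory component (C0)
    std::vector<std::map<std::string, std::string>> runs;  // flushed sorted runs (newest last)
    static constexpr size_t kFlushThreshold = 4;           // illustrative value

public:
    void put(const std::string& key, const std::string& value) {
        memtable[key] = value;                    // buffered, batched write
        if (memtable.size() >= kFlushThreshold) {
            runs.push_back(std::move(memtable));  // flush as an immutable sorted run
            memtable.clear();
        }
    }

    // Reads check the memtable first, then runs from newest to oldest,
    // which is why point reads slow down as the number of runs grows.
    std::optional<std::string> get(const std::string& key) const {
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        for (auto r = runs.rbegin(); r != runs.rend(); ++r)
            if (auto it = r->find(key); it != r->end()) return it->second;
        return std::nullopt;
    }
};
```

A real LSM engine additionally merges runs in the background, which is exactly the read/merge cost discussed on the next slide.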
LSM Limitations
- Still not suitable for indexing variable-length or fixed-length long keys
  - Significant space overhead, as entire key strings are indexed in non-leaf nodes
  - Tree depth grows quickly as more data is loaded, so merge operations between trees occur more frequently
- Reads are generally slower, as the system may need to traverse multiple trees to find a record
Goals for a Next-Generation Storage Engine
- Fast and scalable index structure for variable-length or fixed-length long keys
  - Targeting block I/O storage devices: not only SSDs but also legacy HDDs
- Less storage space overhead; reduced write amplification
- Efficient for different key patterns: keys with or without common prefixes
- Efficient for mixed workloads
ForestDB
- Key-value storage engine developed by the Couchbase caching/storage team
- Its main index structure is HB+-Trie, a Hierarchical B+-Tree based Trie
  - HB+-Trie was originally presented at the ACM SIGMOD 2011 Programming Contest by Jung-Sang Ahn, who works at Couchbase (http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)
- Significantly better read and write performance with less storage overhead
- Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)
- 1.0 beta was released in October 2014
Main Features
- Multi-Version Concurrency Control (MVCC) with an append-only storage model
- Write-Ahead Logging (WAL)
- A value can be retrieved by its sequence number or disk offset, in addition to its key
- Custom compare function to support a customized key order
- Snapshot support to provide different views of the database
- Rollback to revert the database to a specific point
- Ranged iteration by keys or sequence numbers
- Transactional support with read-committed or read-uncommitted isolation levels
- Manual or auto compaction, configured per KV instance (a minimal API usage sketch follows)
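A minimal usage sketch against the ForestDB 1.x C API, as best recalled from the public header (libforestdb/forestdb.h). Treat the exact names and signatures (fdb_kvs_open_default, fdb_set_kv, fdb_free_block, etc.) as assumptions to verify against your ForestDB version, not as authoritative documentation.

```cpp
#include <libforestdb/forestdb.h>  // ForestDB public C API (1.x); verify names against your version
#include <cstdio>
#include <cstring>

int main() {
    fdb_file_handle *fhandle;
    fdb_kvs_handle *db;
    fdb_config config = fdb_get_default_config();
    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();

    // Open a database file and the default KV store instance inside it.
    fdb_open(&fhandle, "./data.fdb", &config);
    fdb_kvs_open_default(fhandle, &db, &kvs_config);

    // Append a document; the main indexes are updated lazily via the WAL.
    const char *key = "doc1", *value = "hello";
    fdb_set_kv(db, key, strlen(key), value, strlen(value) + 1);

    // A commit appends a new DB header (append-only, MVCC-friendly).
    fdb_commit(fhandle, FDB_COMMIT_NORMAL);

    void *rvalue; size_t rvalue_len;
    if (fdb_get_kv(db, key, strlen(key), &rvalue, &rvalue_len) == FDB_RESULT_SUCCESS) {
        printf("%s -> %s\n", key, (const char *)rvalue);
        fdb_free_block(rvalue);  // assumed free helper from the 1.x header; verify
    }

    fdb_kvs_close(db);
    fdb_close(fhandle);
    fdb_shutdown();
    return 0;
}
```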
HB+Trie (Hierarchical B+Tree based Trie)
- A trie (prefix tree) whose nodes are B+Trees
- A key is split into a list of fixed-size chunks (sub-strings of the key)
  - Example: a variable-length key 'a83jgls83jgo29a...' is split into fixed-size (e.g., 4-byte) chunks: 'a83j', 'gls8', '3jgo', ...
- A lookup searches one B+Tree per chunk: the first chunk in the root B+Tree, the second chunk in the next-level B+Tree, and so on, until a document is reached (see the sketch below)
- Supports lexicographically ordered traversal
[Diagram: a root B+Tree searched by chunk 1, whose entries point to lower-level B+Trees searched by chunks 2, 3, ..., whose leaves point to documents]
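A toy sketch of that per-chunk descent (assumptions: 4-byte chunks, and std::map standing in for each B+Tree node; this is not ForestDB's implementation):

```cpp
#include <map>
#include <memory>
#include <string>

// Toy HB+Trie: each trie node is a sorted dictionary (a stand-in for a B+Tree)
// keyed by one fixed-size chunk of the search key.
struct TrieNode {
    std::map<std::string, std::unique_ptr<TrieNode>> children;  // chunk -> next-level B+Tree
    std::map<std::string, std::string> docs;                    // chunk -> document (leaf entry)
};

constexpr size_t kChunkSize = 4;

const std::string* lookup(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (size_t off = 0; off < key.size(); off += kChunkSize) {
        std::string chunk = key.substr(off, kChunkSize);
        if (auto d = node->docs.find(chunk); d != node->docs.end())
            return &d->second;                  // document found at this level
        auto c = node->children.find(chunk);
        if (c == node->children.end()) return nullptr;
        node = c->second.get();                 // descend to the next B+Tree
    }
    return nullptr;
}
```

Note that each level compares only one chunk, not the entire key string, which is where the space and comparison savings over a plain B+Tree come from.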
Prefix Compression
- As in the original trie, each node (B+Tree) is created on demand (except for the root node)
- Example walkthrough (chunk size = 1 byte; the insert-time rule is sketched below):
  1. Insert 'aaaa': the root B+Tree (keyed by the 1st chunk) gets the entry 'a' -> 'aaaa'; the key is distinguishable by its first chunk 'a'
  2. Insert 'bbbb': the root gets the entry 'b' -> 'bbbb'; distinguishable by its first chunk 'b'
  3. Insert 'aaab': cannot be distinguished from 'aaaa' by the first chunk 'a'; the first distinguishable chunk is the 4th, so a new B+Tree keyed by the 4th chunk is created under 'a', storing the skipped common prefix 'aa' and the entries for 'aaaa' and 'aaab'
  4. Insert 'bbcd': cannot be distinguished from 'bbbb' by the first chunk 'b'; the first distinguishable chunk is the 3rd, so a new B+Tree keyed by the 3rd chunk is created under 'b', storing the skipped common prefix 'b' and the entries for 'bbbb' and 'bbcd'
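A sketch of the insert-time decision from the walkthrough (assuming chunk-aligned keys; helper name is illustrative): find the first chunk where the new key differs from the existing one, then create the subtree at that chunk index, recording the skipped chunks as the stored common prefix.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns the 0-based index of the first chunk at which two keys differ,
// i.e., the level at which a new B+Tree must be created on a conflicting
// insert. The chunks between the current level and that index form the
// skipped common prefix stored alongside the new subtree.
size_t first_distinguishable_chunk(const std::string& a, const std::string& b,
                                   size_t chunk_size) {
    size_t n = std::min(a.size(), b.size()) / chunk_size;
    for (size_t i = 0; i < n; ++i)
        if (a.compare(i * chunk_size, chunk_size, b, i * chunk_size, chunk_size) != 0)
            return i;
    return n;  // one key is a chunk-aligned prefix of the other
}

// Example: first_distinguishable_chunk("aaaa", "aaab", 1) == 3 (the 4th chunk),
// with skipped common prefix "aa" (chunks 2-3), matching step 3 above.
```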
Compact Index Structure
- When keys have common prefixes (e.g., secondary index keys), the shared prefixes are skipped and stored only once
- When keys are sufficiently long and uniformly random (e.g., UUIDs or hash values), the majority of keys can be indexed by their first chunk alone
  - Example (chunk size = 2 bytes): inserting 'a83jfl2iejzm302k', 'dpwk3gjrieorigje', 'z9382h3igor8eh4k', '283hgoeir8goerha', '023o8f9o8zufisue' yields first chunks 'a8', 'dp', 'z9', '28', '02'; all distinct, so there is only one B+Tree in the HB+Trie
- We don't need to store and compare entire key strings
Compact Index Structure - Benefits
- Suppose: node size 4 KB, key length 64 bytes, pointer (or value) size 8 bytes, indexing 1 billion keys

                                Original B+Tree                  HB+Trie (4-byte chunk)
  Fanout                        4096 / (64+8) ~= 56              4096 / (4+8) ~= 341
  Height                        ceil(log_56(10^9)) = 6           ceil(log_341(10^9)) = 4
  Space needed for the index    4KB * sum(56^i, i=1..5)          4KB * sum(341^i, i=1..3)
                                ~= 2139.07 GB                    ~= 151.70 GB

- The HB+Trie index is ~14.1 times smaller (see the back-of-the-envelope check below)
- Compaction overhead can be reduced significantly
- The buffer cache can accommodate more pages and manage them more efficiently
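A quick sanity check of those numbers, under the slide's simplifying model (every tree level i below the root is full, holding fanout^i nodes of 4 KB each):

```cpp
#include <cmath>
#include <cstdio>

// Back-of-the-envelope index size for 1 billion keys: fanout from node/key/pointer
// sizes, height from log_fanout(#keys), space from summing full levels of 4 KB nodes.
static void index_size(const char* name, double key_len) {
    const double node = 4096.0, ptr = 8.0, nkeys = 1e9;
    double fanout = std::floor(node / (key_len + ptr));
    int height = (int)std::ceil(std::log(nkeys) / std::log(fanout));
    double nodes = 0, level = 1;
    for (int i = 1; i < height; ++i) { level *= fanout; nodes += level; }
    printf("%s: fanout %.0f, height %d, index ~= %.2f GB\n",
           name, fanout, height, nodes * node / (1024.0 * 1024 * 1024));
}

int main() {
    index_size("Original B+Tree (64-byte keys)", 64);  // prints ~2139.07 GB
    index_size("HB+Trie (4-byte chunks)", 4);          // prints ~151.70 GB
    return 0;
}
```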
ForestDB Index Structures
- ForestDB maintains two index structures:
  - HB+Trie: key index
  - Sequence B+Tree: sequence number (8-byte integer) index
- Either index retrieves the file offset of a value, using a key or a sequence number (sketched below)
[Diagram: both the HB+Trie (by key) and the sequence B+Tree (by sequence number) point to document offsets in the append-only DB file]
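Conceptually (a sketch of the idea, not ForestDB internals), both indexes resolve to the same append-only file offsets:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Two logical indexes over one append-only file: by key and by sequence number.
// Both resolve to the byte offset of the document in the DB file.
struct DualIndex {
    std::map<std::string, uint64_t> by_key;  // HB+Trie role: key -> file offset
    std::map<uint64_t, uint64_t> by_seq;     // sequence B+Tree role: seqno -> file offset
    uint64_t next_seq = 1;

    void index(const std::string& key, uint64_t file_offset) {
        by_key[key] = file_offset;             // latest version wins; the old offset
        by_seq[next_seq++] = file_offset;      // stays valid in the file until compaction
    }
};
```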
Write-Ahead Logging
- Append updates first, and update the main indexes later
- Main purposes:
  - Maximize write throughput via sequential writes (append-only updates)
  - Reduce the number of index nodes to be written, via batched updates
- The DB file thus contains appended documents, index nodes (HB+Trie nodes and the sequence B+Tree), and DB headers
- Write path:
  1. Append the documents to the DB file
  2. Update the WAL indexes: in-memory hash tables mapping h(key) -> offset (ID index) and h(seq no) -> offset (sequence number index)
  3. Append a DB header (1 block) for every commit
  4. Later, flush the batched WAL entries into the main indexes
- Key query path (see the sketch below):
  1. Retrieve from the WAL index first
  2. If hit, return immediately
  3. If miss, retrieve from the HB+Trie (or sequence B+Tree)
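A sketch of that read path (assumptions: std::unordered_map standing in for the in-memory WAL hash table, and a stubbed-out main-index lookup):

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// In-memory WAL index: key -> file offset of the most recently appended document.
using WalIndex = std::unordered_map<std::string, uint64_t>;

// Stand-in for the on-disk HB+Trie lookup (stubbed out for this sketch).
std::optional<uint64_t> hbtrie_find(const std::string& /*key*/) {
    return std::nullopt;
}

// Key query path: check the WAL index first; on a miss, fall back to the HB+Trie.
std::optional<uint64_t> find_offset(const WalIndex& wal, const std::string& key) {
    if (auto it = wal.find(key); it != wal.end())
        return it->second;        // hit: return immediately
    return hbtrie_find(key);      // miss: consult the main index
}
```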
Current Limitations
- OS file system stack overhead
  - Metadata updates
  - Page cache shared among processes
- Compaction overhead
  - Need to read the entire database file and write all valid pages into a new file
  - Uses too much disk I/O bandwidth
- Lack of utilization of the parallel channels inside an SSD
  - Fetching multiple blocks at the same time when they are stored in different channels
  - E.g., by using an async I/O library such as libaio
OS File System Stack Overhead
[Diagram: the typical database storage stack (database storage engine -> OS file system with page cache, metadata management, and volume manager -> block I/O interface (SATA, PCI) -> SSDs) vs. an advanced stack (database storage engine with its own buffer cache directly on the block I/O interface -> SSDs)]
Database Compaction
- Required for the append-only storage model: garbage-collects stale data blocks
- Uses significant disk I/O bandwidth: reads the entire database file and writes all valid blocks into a new file
- Affects other performance metrics: regular read/write performance drops significantly during compaction
SWAT-Based Compaction Optimization
- A logical page can change its physical address in flash memory whenever it is overwritten
- For this reason, a mapping table between logical block addresses (LBA) and physical block addresses (PBA) is maintained by the Flash Translation Layer (FTL)
[Diagram: file-system LBAs A, B, C, D, E, F mapped by the FTL to PBAs in flash memory; overwriting A produces a new physical copy A']
SWAT-Based Compaction Optimization (cont.)
- A new compacted file can be created simply by creating new LBA-to-PBA mappings that reference only the valid pages of the current DB file, without copying data
- This requires extending the FTL with a new interface: SWAT (Swap and Trim); a conceptual sketch follows
[Diagram: B+Tree nodes of the current DB file, including stale old versions (C, F, H, I), remapped so that the new compacted file references only the latest versions (C', F', H', I') plus the still-valid nodes]
- We implemented the SWAT interface on the OpenSSD development platform by adapting its FTL code:
  - Total time taken for compactions was reduced by 17x
  - The number of compactions triggered was reduced by 4x
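SWAT is a research extension to the FTL, so the following is purely conceptual (types and method names are invented for illustration): compaction becomes a metadata operation that points the new file's LBAs at the PBAs already holding valid pages, then trims the stale mappings.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using LBA = uint64_t;
using PBA = uint64_t;

// Conceptual FTL mapping table with a SWAT (Swap and Trim) operation.
struct FTL {
    std::unordered_map<LBA, PBA> map;

    // Swap: point each new-file LBA at the PBA of an already-written valid page
    // (no flash data is copied). Trim: drop the mappings of stale pages so the
    // FTL's garbage collector can reclaim them.
    void swat(const std::vector<std::pair<LBA, PBA>>& remap,
              const std::vector<LBA>& stale) {
        for (LBA l : stale) map.erase(l);
        for (const auto& [l, p] : remap) map[l] = p;
    }
};
```

The win is that compaction no longer reads and rewrites every valid block through the host; it only edits the address mapping, which explains the 17x reduction in compaction time reported above.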
Utilizing Parallel Channels on SSDs
- Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs (example below)
- Quite useful when querying secondary indexes, where the items satisfying a query predicate are located in multiple blocks on different channels
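A minimal libaio sketch (Linux; link with -laio) that submits several reads at once so the SSD can serve them from different channels in parallel. The file name and the scattered offsets are placeholders, not anything ForestDB-specific.

```cpp
#include <fcntl.h>
#include <libaio.h>   // Linux async I/O; link with -laio
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int kReads = 4;
    const size_t kBlock = 4096;
    int fd = open("./data.fdb", O_RDONLY | O_DIRECT);  // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    io_setup(kReads, &ctx);

    struct iocb cbs[kReads], *cbp[kReads];
    void *bufs[kReads];
    for (int i = 0; i < kReads; ++i) {
        posix_memalign(&bufs[i], kBlock, kBlock);  // O_DIRECT needs aligned buffers
        // Blocks at widely scattered offsets are likely to sit on different channels.
        io_prep_pread(&cbs[i], fd, bufs[i], kBlock, (long long)i * 1024 * kBlock);
        cbp[i] = &cbs[i];
    }
    io_submit(ctx, kReads, cbp);  // issue all reads in one syscall

    struct io_event events[kReads];
    int done = io_getevents(ctx, kReads, kReads, events, NULL);
    printf("completed %d parallel reads\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```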
ForestDB Evaluation – LevelDB, RocksDB
- Evaluation environment:
  - 64-bit machine running CentOS 6.5
  - Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
  - 32 GB RAM and Crucial M4 SSD
- Data:
  - Key size 32 bytes, value size 1 KB
  - Load 100M items
  - Logical data size 100 GB total
KV Storage Engine Configurations
- LevelDB
  - Compression disabled
  - Write buffer size: 256 MB (initial load), 4 MB (otherwise)
  - Buffer cache size: 8 GB
- RocksDB
  - Compression disabled
  - Write buffer size: 256 MB (initial load), 4 MB (otherwise)
  - Maximum number of background compaction threads: 8
  - Maximum number of background memtable flushes: 8
  - Maximum number of write buffers: 8
  - Buffer cache size: 8 GB (uncompressed)
- ForestDB
  - Compression disabled
  - WAL size: 4,096 documents
  - Buffer cache size: 8 GB
Read-Only Performance
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; ForestDB is 2x ~ 5x faster]
Write-Only Performance
[Chart: operations per second vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB; ForestDB is 3x ~ 5x faster]
- Note: small batch sizes (e.g., < 10) are uncommon in practice
Write Amplification
[Chart: write amplification (normalized to a single document size) vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB]
- ForestDB shows 4x ~ 20x less write amplification
Mixed Workload Performance
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB under a mixed (unrestricted) workload; ForestDB is 2x ~ 5x faster]
Evaluation Environment
- CPU: Intel Core i7-3770 @ 3.40 GHz (8 virtual cores)
- RAM: 32 GB (DDR3, 1600 MHz)
- OS: Ubuntu 12.04.5 LTS (Linux 3.8.0-29-generic)
- Disk: Samsung SSD 840 EVO (formatted with ext4)
- Benchmark: ForestDB-Benchmark
- WiredTiger version: 2.5.0
Initial Load: Insertions / Sec
- Key size: 32 bytes on average
- Document size: 128 bytes on average
- Number of documents: 200,000,000 (more than 40 GB DB size)
- Cache size: 16 GB
- Asynchronous writes
[Chart: insertions per second over elapsed time (0 to 8000 seconds) during bulk load for ForestDB and WiredTiger LSM, with running averages; ForestDB is 2.5x faster]
- Note: WiredTiger B+Tree is excluded due to its slow load speed
Read-Only Performance (small document + no DGM)
- Key size: 32 bytes on average
- Document size: 128 bytes on average
- Number of documents: 10,000,000 (1.2 GB DB size)
- Cache size: 16 GB
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8, 16) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 1.5x – 3x slower]
Read-Only Performance (large document + DGM + long key)
- Key size: 256 bytes or 1024 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB

Read-Only Throughput (Key: 256 bytes), operations per second:

  # reader threads    1          2          4          8          16
  ForestDB            9382.94    14291.1    20673.55   27054.92   29730.66
  WT B-tree           4132.66    7778.51    13924.61   22880.54   29805.06
  WT LSM              3160.17    4351.8     7452.94    12130.47   15675.62

- ForestDB is 2x - 3x faster; note that the disk is fully utilized with 16 threads
Read-Only Throughput (Key: 1024 bytes), operations per second:

  # reader threads    1          2          4          8          16
  ForestDB            5626.42    9604.27    14774.53   19178.67   19858.83
  WT B-tree           190        391.05     607.21     698.01     729.8
  WT LSM              589.4      793.14     851.37     857.18     858.91

- ForestDB is 24x - 27x faster
Write-Only Performance
- Key size: 48 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB
Write-Only Throughput (Synchronous)
[Chart: operations per second vs. batch size per commit (1, 4, 16, 64, 256) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 3x – 6x faster]
Write Amplification (Synchronous)

  Batch size per commit    1        4        16       64       256
  ForestDB                 12.1     6.4      4.8      4.4      4.4
  WT B-tree                13.1     11.9     11.7     11.7     11.7
  WT LSM                   126.6    124.2    62.8     31.5     32.4

- ForestDB shows 3x - 20x less write amplification
Mixed Workload Performance
- Key size: 48 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB
- Single writer thread and multiple reader threads
- Writer batch size: 16 documents on average
Mixed Workload (Read: 80%, Write: 20%)
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8, 16) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 2x - 8x faster]
ForestDB - Summary
- Compact and efficient storage for a variety of data: HB+Trie
- In-memory WAL indexes to improve write/read performance
- Optimized for new SSD storage technology: bypassing the OS file system, SWAT-based compaction, parallel I/O channels
- A unified storage engine that performs well for various workloads
- A unified storage engine that scales from small devices to large servers:
  - Couchbase Server secondary index
  - Couchbase Lite
  - Couchbase Server KV engine