A Next Generation Storage Engine for NoSQL Database Systems
Chiyoung Seo, Software Architect, Couchbase Inc.
Chin Hong, VP Product Management, Couchbase Inc.
Contents
- Why a new KV storage engine?
- ForestDB Overview
- Compact Index Structures
- WAL (Write-Ahead Logging)
- Optimizations for SSDs (Solid-State Drives)
- Performance Evaluations: LevelDB, RocksDB, WiredTiger (B+Tree, LSM Tree)
- Summary
Modern Web / Mobile Applications
- Operate on huge volumes of unstructured data
- A significant amount of new data is constantly generated by hundreds of millions of users or devices
- Still require high performance and scalability in managing their ever-growing databases
- The underlying storage engine is one of the most critical parts of a database system for providing high performance and scalability
B+Tree
- The main storage index structure in the database field: SQLite, Couchstore, WiredTiger
- A generalization of the binary search tree
- Each node consists of two or more {key, value (or pointer)} pairs
- Fanout (or branching) degree: the number of KV pairs in a node
- Node size is generally fitted to a multiple of the page size
B+Tree Limitations
- Not suitable for indexing variable-length or fixed-length long keys
  - Significant space overhead, as entire key strings are indexed in non-leaf nodes
  - Tree depth grows quickly as more data is loaded
- In-place updates lead to database fragmentation
  - I/O performance degrades significantly as the data size grows and the database becomes fragmented
- Several variants of the B+Tree have been proposed; the most popular is the LSM Tree
[Diagram: B+Tree node layout of {key, value (or pointer)} pairs; longer keys reduce the fanout of each node]
LSM Tree (Log Structured Merge Tree)
- The main file organization for many products: HBase, Cassandra, LevelDB, MongoDB (WiredTiger-LSM)
- Improves write performance by (see the sketch below):
  - Appending all updated and new data to a sequential log
  - Deferring and batching index changes efficiently in sorted runs
[Diagram: an in-memory component is flushed/merged into on-disk C1, C2, ... trees over a sequential log; component capacity increases exponentially]
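To make the flush/merge idea concrete, here is a minimal, self-contained sketch (not LevelDB or ForestDB code; names and the flush threshold are illustrative) of the LSM write path: updates land in an in-memory sorted buffer, which is flushed as an immutable sorted run, and reads consult the runs newest-first.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Minimal LSM sketch: an in-memory memtable plus immutable sorted runs.
class TinyLSM {
    std::map<std::string, std::string> memtable;           // in-memory component (C0)
    std::vector<std::map<std::string, std::string>> runs;  // flushed sorted runs (newest last)
    static constexpr size_t kFlushThreshold = 4;           // illustrative value

public:
    void put(const std::string& key, const std::string& value) {
        memtable[key] = value;                    // buffered, batched write
        if (memtable.size() >= kFlushThreshold) {
            runs.push_back(std::move(memtable));  // flush as an immutable sorted run
            memtable.clear();
        }
    }

    // Reads check the memtable first, then runs from newest to oldest,
    // which is why point reads slow down as the number of runs grows.
    std::optional<std::string> get(const std::string& key) const {
        if (auto it = memtable.find(key); it != memtable.end()) return it->second;
        for (auto r = runs.rbegin(); r != runs.rend(); ++r)
            if (auto it = r->find(key); it != r->end()) return it->second;
        return std::nullopt;
    }
};
```

A real LSM engine additionally merges runs in the background, which is exactly the read/merge cost discussed on the next slide.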
LSM Limitations
- Still not suitable for indexing variable-length or fixed-length long keys
  - Significant space overhead, as entire key strings are indexed in non-leaf nodes
  - Tree depth grows quickly as more data is loaded, so merge operations between trees occur more frequently
- Reads are generally slower, as the system may need to traverse multiple trees to find a record
Goals for a Next-Generation Storage Engine
- Fast and scalable index structure for variable-length or fixed-length long keys
  - Targeting block I/O storage devices: not only SSDs but also legacy HDDs
- Less storage space overhead; reduced write amplification
- Efficient for different key patterns: keys with or without common prefixes
- Efficient for mixed workloads
ForestDB
- Key-value storage engine developed by the Couchbase caching/storage team
- Its main index structure is HB+-Trie, a Hierarchical B+-Tree based Trie
  - HB+-Trie was originally presented at the ACM SIGMOD 2011 Programming Contest by Jung-Sang Ahn, who works at Couchbase (http://db.csail.mit.edu/sigmod11contest/sigmod_2011_contest_poster_jungsang_ahn.pdf)
- Significantly better read and write performance with less storage overhead
- Supports various server OSs (CentOS, Ubuntu, Debian, Mac OS X, Windows) and mobile OSs (iOS, Android)
- 1.0 beta was released in October 2014
Main Features
- Multi-Version Concurrency Control (MVCC) with an append-only storage model
- Write-Ahead Logging (WAL)
- A value can be retrieved by its sequence number or disk offset, in addition to its key
- Custom compare function to support a customized key order
- Snapshot support to provide different views of the database
- Rollback to revert the database to a specific point
- Ranged iteration by keys or sequence numbers
- Transactional support with read-committed or read-uncommitted isolation levels
- Manual or auto compaction, configured per KV instance (a minimal API usage sketch follows)
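A minimal usage sketch against the ForestDB 1.x C API, as best recalled from the public header (libforestdb/forestdb.h). Treat the exact names and signatures (fdb_kvs_open_default, fdb_set_kv, fdb_free_block, etc.) as assumptions to verify against your ForestDB version, not as authoritative documentation.

```cpp
#include <libforestdb/forestdb.h>  // ForestDB public C API (1.x); verify names against your version
#include <cstdio>
#include <cstring>

int main() {
    fdb_file_handle *fhandle;
    fdb_kvs_handle *db;
    fdb_config config = fdb_get_default_config();
    fdb_kvs_config kvs_config = fdb_get_default_kvs_config();

    // Open a database file and the default KV store instance inside it.
    fdb_open(&fhandle, "./data.fdb", &config);
    fdb_kvs_open_default(fhandle, &db, &kvs_config);

    // Append a document; the main indexes are updated lazily via the WAL.
    const char *key = "doc1", *value = "hello";
    fdb_set_kv(db, key, strlen(key), value, strlen(value) + 1);

    // A commit appends a new DB header (append-only, MVCC-friendly).
    fdb_commit(fhandle, FDB_COMMIT_NORMAL);

    void *rvalue; size_t rvalue_len;
    if (fdb_get_kv(db, key, strlen(key), &rvalue, &rvalue_len) == FDB_RESULT_SUCCESS) {
        printf("%s -> %s\n", key, (const char *)rvalue);
        fdb_free_block(rvalue);  // assumed free helper from the 1.x header; verify
    }

    fdb_kvs_close(db);
    fdb_close(fhandle);
    fdb_shutdown();
    return 0;
}
```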
HB+Trie (Hierarchical B+Tree based Trie)
- A trie (prefix tree) whose nodes are B+Trees
- A key is split into a list of fixed-size chunks (sub-strings of the key)
  - Example: a variable-length key 'a83jgls83jgo29a...' is split into fixed-size (e.g., 4-byte) chunks: 'a83j', 'gls8', '3jgo', ...
- A lookup searches one B+Tree per chunk: the first chunk in the root B+Tree, the second chunk in the next-level B+Tree, and so on, until a document is reached (see the sketch below)
- Supports lexicographically ordered traversal
[Diagram: a root B+Tree searched by chunk 1, whose entries point to lower-level B+Trees searched by chunks 2, 3, ..., whose leaves point to documents]
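A toy sketch of that per-chunk descent (assumptions: 4-byte chunks, and std::map standing in for each B+Tree node; this is not ForestDB's implementation):

```cpp
#include <map>
#include <memory>
#include <string>

// Toy HB+Trie: each trie node is a sorted dictionary (a stand-in for a B+Tree)
// keyed by one fixed-size chunk of the search key.
struct TrieNode {
    std::map<std::string, std::unique_ptr<TrieNode>> children;  // chunk -> next-level B+Tree
    std::map<std::string, std::string> docs;                    // chunk -> document (leaf entry)
};

constexpr size_t kChunkSize = 4;

const std::string* lookup(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (size_t off = 0; off < key.size(); off += kChunkSize) {
        std::string chunk = key.substr(off, kChunkSize);
        if (auto d = node->docs.find(chunk); d != node->docs.end())
            return &d->second;                  // document found at this level
        auto c = node->children.find(chunk);
        if (c == node->children.end()) return nullptr;
        node = c->second.get();                 // descend to the next B+Tree
    }
    return nullptr;
}
```

Note that each level compares only one chunk, not the entire key string, which is where the space and comparison savings over a plain B+Tree come from.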
Prefix Compression
- As in the original trie, each node (B+Tree) is created on demand (except for the root node)
- Example walkthrough (chunk size = 1 byte; the insert-time rule is sketched below):
  1. Insert 'aaaa': the root B+Tree (keyed by the 1st chunk) gets the entry 'a' -> 'aaaa'; the key is distinguishable by its first chunk 'a'
  2. Insert 'bbbb': the root gets the entry 'b' -> 'bbbb'; distinguishable by its first chunk 'b'
  3. Insert 'aaab': cannot be distinguished from 'aaaa' by the first chunk 'a'; the first distinguishable chunk is the 4th, so a new B+Tree keyed by the 4th chunk is created under 'a', storing the skipped common prefix 'aa' and the entries for 'aaaa' and 'aaab'
  4. Insert 'bbcd': cannot be distinguished from 'bbbb' by the first chunk 'b'; the first distinguishable chunk is the 3rd, so a new B+Tree keyed by the 3rd chunk is created under 'b', storing the skipped common prefix 'b' and the entries for 'bbbb' and 'bbcd'
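A sketch of the insert-time decision from the walkthrough (assuming chunk-aligned keys; helper name is illustrative): find the first chunk where the new key differs from the existing one, then create the subtree at that chunk index, recording the skipped chunks as the stored common prefix.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Returns the 0-based index of the first chunk at which two keys differ,
// i.e., the level at which a new B+Tree must be created on a conflicting
// insert. The chunks between the current level and that index form the
// skipped common prefix stored alongside the new subtree.
size_t first_distinguishable_chunk(const std::string& a, const std::string& b,
                                   size_t chunk_size) {
    size_t n = std::min(a.size(), b.size()) / chunk_size;
    for (size_t i = 0; i < n; ++i)
        if (a.compare(i * chunk_size, chunk_size, b, i * chunk_size, chunk_size) != 0)
            return i;
    return n;  // one key is a chunk-aligned prefix of the other
}

// Example: first_distinguishable_chunk("aaaa", "aaab", 1) == 3 (the 4th chunk),
// with skipped common prefix "aa" (chunks 2-3), matching step 3 above.
```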
Compact Index Structure
- When keys have common prefixes (e.g., secondary index keys), the shared prefixes are skipped and stored only once
- When keys are sufficiently long and uniformly random (e.g., UUIDs or hash values), the majority of keys can be indexed by their first chunk alone
  - Example (chunk size = 2 bytes): inserting 'a83jfl2iejzm302k', 'dpwk3gjrieorigje', 'z9382h3igor8eh4k', '283hgoeir8goerha', '023o8f9o8zufisue' yields first chunks 'a8', 'dp', 'z9', '28', '02'; all distinct, so there is only one B+Tree in the HB+Trie
- We don't need to store and compare entire key strings
Compact Index Structure - Benefits
- Suppose: node size 4 KB, key length 64 bytes, pointer (or value) size 8 bytes, indexing 1 billion keys

                                Original B+Tree                  HB+Trie (4-byte chunk)
  Fanout                        4096 / (64+8) ~= 56              4096 / (4+8) ~= 341
  Height                        ceil(log_56(10^9)) = 6           ceil(log_341(10^9)) = 4
  Space needed for the index    4KB * sum(56^i, i=1..5)          4KB * sum(341^i, i=1..3)
                                ~= 2139.07 GB                    ~= 151.70 GB

- The HB+Trie index is ~14.1 times smaller (see the back-of-the-envelope check below)
- Compaction overhead can be reduced significantly
- The buffer cache can accommodate more pages and manage them more efficiently
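A quick sanity check of those numbers, under the slide's simplifying model (every tree level i below the root is full, holding fanout^i nodes of 4 KB each):

```cpp
#include <cmath>
#include <cstdio>

// Back-of-the-envelope index size for 1 billion keys: fanout from node/key/pointer
// sizes, height from log_fanout(#keys), space from summing full levels of 4 KB nodes.
static void index_size(const char* name, double key_len) {
    const double node = 4096.0, ptr = 8.0, nkeys = 1e9;
    double fanout = std::floor(node / (key_len + ptr));
    int height = (int)std::ceil(std::log(nkeys) / std::log(fanout));
    double nodes = 0, level = 1;
    for (int i = 1; i < height; ++i) { level *= fanout; nodes += level; }
    printf("%s: fanout %.0f, height %d, index ~= %.2f GB\n",
           name, fanout, height, nodes * node / (1024.0 * 1024 * 1024));
}

int main() {
    index_size("Original B+Tree (64-byte keys)", 64);  // prints ~2139.07 GB
    index_size("HB+Trie (4-byte chunks)", 4);          // prints ~151.70 GB
    return 0;
}
```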
ForestDB Index Structures
- ForestDB maintains two index structures:
  - HB+Trie: key index
  - Sequence B+Tree: sequence number (8-byte integer) index
- Either index retrieves the file offset of a value, using a key or a sequence number (sketched below)
[Diagram: both the HB+Trie (by key) and the sequence B+Tree (by sequence number) point to document offsets in the append-only DB file]
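Conceptually (a sketch of the idea, not ForestDB internals), both indexes resolve to the same append-only file offsets:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Two logical indexes over one append-only file: by key and by sequence number.
// Both resolve to the byte offset of the document in the DB file.
struct DualIndex {
    std::map<std::string, uint64_t> by_key;  // HB+Trie role: key -> file offset
    std::map<uint64_t, uint64_t> by_seq;     // sequence B+Tree role: seqno -> file offset
    uint64_t next_seq = 1;

    void index(const std::string& key, uint64_t file_offset) {
        by_key[key] = file_offset;             // latest version wins; the old offset
        by_seq[next_seq++] = file_offset;      // stays valid in the file until compaction
    }
};
```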
Write-Ahead Logging
- Append updates first, and update the main indexes later
- Main purposes:
  - Maximize write throughput via sequential writes (append-only updates)
  - Reduce the number of index nodes to be written, via batched updates
- The DB file thus contains appended documents, index nodes (HB+Trie nodes and the sequence B+Tree), and DB headers
- Write path:
  1. Append the documents to the DB file
  2. Update the WAL indexes: in-memory hash tables mapping h(key) -> offset (ID index) and h(seq no) -> offset (sequence number index)
  3. Append a DB header (1 block) for every commit
  4. Later, flush the batched WAL entries into the main indexes
- Key query path (see the sketch below):
  1. Retrieve from the WAL index first
  2. If hit, return immediately
  3. If miss, retrieve from the HB+Trie (or sequence B+Tree)
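A sketch of that read path (assumptions: std::unordered_map standing in for the in-memory WAL hash table, and a stubbed-out main-index lookup):

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// In-memory WAL index: key -> file offset of the most recently appended document.
using WalIndex = std::unordered_map<std::string, uint64_t>;

// Stand-in for the on-disk HB+Trie lookup (stubbed out for this sketch).
std::optional<uint64_t> hbtrie_find(const std::string& /*key*/) {
    return std::nullopt;
}

// Key query path: check the WAL index first; on a miss, fall back to the HB+Trie.
std::optional<uint64_t> find_offset(const WalIndex& wal, const std::string& key) {
    if (auto it = wal.find(key); it != wal.end())
        return it->second;        // hit: return immediately
    return hbtrie_find(key);      // miss: consult the main index
}
```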
Current Limitations
- OS file system stack overhead
  - Metadata updates
  - Page cache shared among processes
- Compaction overhead
  - Need to read the entire database file and write all valid pages into a new file
  - Uses too much disk I/O bandwidth
- Lack of utilization of the parallel channels inside an SSD
  - Fetching multiple blocks at the same time when they are stored in different channels
  - E.g., by using an async I/O library such as libaio
OS File System Stack Overhead
[Diagram: the typical database storage stack (database storage engine -> OS file system with page cache, metadata management, and volume manager -> block I/O interface (SATA, PCI) -> SSDs) vs. an advanced stack (database storage engine with its own buffer cache directly on the block I/O interface -> SSDs)]
Database Compaction
- Required for the append-only storage model: garbage-collects stale data blocks
- Uses significant disk I/O bandwidth: reads the entire database file and writes all valid blocks into a new file
- Affects other performance metrics: regular read/write performance drops significantly during compaction
SWAT-Based Compaction Optimization
- A logical page can change its physical address in flash memory whenever it is overwritten
- For this reason, a mapping table between logical block addresses (LBA) and physical block addresses (PBA) is maintained by the Flash Translation Layer (FTL)
[Diagram: file-system LBAs A, B, C, D, E, F mapped by the FTL to PBAs in flash memory; overwriting A produces a new physical copy A']
SWAT-Based Compaction Optimization (cont.)
- A new compacted file can be created simply by creating new LBA-to-PBA mappings that reference only the valid pages of the current DB file, without copying data
- This requires extending the FTL with a new interface: SWAT (Swap and Trim); a conceptual sketch follows
[Diagram: B+Tree nodes of the current DB file, including stale old versions (C, F, H, I), remapped so that the new compacted file references only the latest versions (C', F', H', I') plus the still-valid nodes]
- We implemented the SWAT interface on the OpenSSD development platform by adapting its FTL code:
  - Total time taken for compactions was reduced by 17x
  - The number of compactions triggered was reduced by 4x
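SWAT is a research extension to the FTL, so the following is purely conceptual (types and method names are invented for illustration): compaction becomes a metadata operation that points the new file's LBAs at the PBAs already holding valid pages, then trims the stale mappings.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using LBA = uint64_t;
using PBA = uint64_t;

// Conceptual FTL mapping table with a SWAT (Swap and Trim) operation.
struct FTL {
    std::unordered_map<LBA, PBA> map;

    // Swap: point each new-file LBA at the PBA of an already-written valid page
    // (no flash data is copied). Trim: drop the mappings of stale pages so the
    // FTL's garbage collector can reclaim them.
    void swat(const std::vector<std::pair<LBA, PBA>>& remap,
              const std::vector<LBA>& stale) {
        for (LBA l : stale) map.erase(l);
        for (const auto& [l, p] : remap) map[l] = p;
    }
};
```

The win is that compaction no longer reads and rewrites every valid block through the host; it only edits the address mapping, which explains the 17x reduction in compaction time reported above.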
Utilizing Parallel Channels on SSDs
- Exploit an async I/O library (e.g., libaio) to better utilize the parallel I/O capabilities of SSDs (example below)
- Quite useful when querying secondary indexes, where the items satisfying a query predicate are located in multiple blocks on different channels
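A minimal libaio sketch (Linux; link with -laio) that submits several reads at once so the SSD can serve them from different channels in parallel. The file name and the scattered offsets are placeholders, not anything ForestDB-specific.

```cpp
#include <fcntl.h>
#include <libaio.h>   // Linux async I/O; link with -laio
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int kReads = 4;
    const size_t kBlock = 4096;
    int fd = open("./data.fdb", O_RDONLY | O_DIRECT);  // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    io_setup(kReads, &ctx);

    struct iocb cbs[kReads], *cbp[kReads];
    void *bufs[kReads];
    for (int i = 0; i < kReads; ++i) {
        posix_memalign(&bufs[i], kBlock, kBlock);  // O_DIRECT needs aligned buffers
        // Blocks at widely scattered offsets are likely to sit on different channels.
        io_prep_pread(&cbs[i], fd, bufs[i], kBlock, (long long)i * 1024 * kBlock);
        cbp[i] = &cbs[i];
    }
    io_submit(ctx, kReads, cbp);  // issue all reads in one syscall

    struct io_event events[kReads];
    int done = io_getevents(ctx, kReads, kReads, events, NULL);
    printf("completed %d parallel reads\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```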
ForestDB Evaluation – LevelDB, RocksDB
- Evaluation environment:
  - 64-bit machine running CentOS 6.5
  - Intel Xeon 2.00 GHz CPU (6 cores, 12 threads)
  - 32 GB RAM and Crucial M4 SSD
- Data:
  - Key size 32 bytes, value size 1 KB
  - Load 100M items
  - Logical data size 100 GB total
KV Storage Engine Configurations
- LevelDB
  - Compression disabled
  - Write buffer size: 256 MB (initial load), 4 MB (otherwise)
  - Buffer cache size: 8 GB
- RocksDB
  - Compression disabled
  - Write buffer size: 256 MB (initial load), 4 MB (otherwise)
  - Maximum number of background compaction threads: 8
  - Maximum number of background memtable flushes: 8
  - Maximum number of write buffers: 8
  - Buffer cache size: 8 GB (uncompressed)
- ForestDB
  - Compression disabled
  - WAL size: 4,096 documents
  - Buffer cache size: 8 GB
Read-Only Performance
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB; ForestDB is 2x ~ 5x faster]
Write-Only Performance
[Chart: operations per second vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB; ForestDB is 3x ~ 5x faster]
- Note: small batch sizes (e.g., < 10) are uncommon in practice
Write Amplification
[Chart: write amplification (normalized to a single document size) vs. write batch size (1, 4, 16, 64, 256 documents) for ForestDB, LevelDB, and RocksDB]
- ForestDB shows 4x ~ 20x less write amplification
Mixed Workload Performance
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB under a mixed (unrestricted) workload; ForestDB is 2x ~ 5x faster]
Evaluation Environment
- CPU: Intel Core i7-3770 @ 3.40 GHz (8 virtual cores)
- RAM: 32 GB (DDR3, 1600 MHz)
- OS: Ubuntu 12.04.5 LTS (Linux 3.8.0-29-generic)
- Disk: Samsung SSD 840 EVO (formatted with ext4)
- Benchmark: ForestDB-Benchmark
- WiredTiger version: 2.5.0
Initial Load: Insertions / Sec
- Key size: 32 bytes on average
- Document size: 128 bytes on average
- Number of documents: 200,000,000 (more than 40 GB DB size)
- Cache size: 16 GB
- Asynchronous writes
[Chart: insertions per second over elapsed time (0 to 8000 seconds) during bulk load for ForestDB and WiredTiger LSM, with running averages; ForestDB is 2.5x faster]
- Note: WiredTiger B+Tree is excluded due to its slow load speed
Read-Only Performance (small document + no DGM)
- Key size: 32 bytes on average
- Document size: 128 bytes on average
- Number of documents: 10,000,000 (1.2 GB DB size)
- Cache size: 16 GB
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8, 16) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 1.5x – 3x slower]
Read-Only Performance (large document + DGM + long key)
- Key size: 256 bytes or 1024 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB

Read-Only Throughput (Key: 256 bytes), operations per second:

  # reader threads    1          2          4          8          16
  ForestDB            9382.94    14291.1    20673.55   27054.92   29730.66
  WT B-tree           4132.66    7778.51    13924.61   22880.54   29805.06
  WT LSM              3160.17    4351.8     7452.94    12130.47   15675.62

- ForestDB is 2x - 3x faster; note that the disk is fully utilized with 16 threads
Read-Only Throughput (Key: 1024 bytes), operations per second:

  # reader threads    1          2          4          8          16
  ForestDB            5626.42    9604.27    14774.53   19178.67   19858.83
  WT B-tree           190        391.05     607.21     698.01     729.8
  WT LSM              589.4      793.14     851.37     857.18     858.91

- ForestDB is 24x - 27x faster
Write-Only Performance
- Key size: 48 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB
Write-Only Throughput (Synchronous)
[Chart: operations per second vs. batch size per commit (1, 4, 16, 64, 256) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 3x – 6x faster]
Write Amplification (Synchronous)

  Batch size per commit    1        4        16       64       256
  ForestDB                 12.1     6.4      4.8      4.4      4.4
  WT B-tree                13.1     11.9     11.7     11.7     11.7
  WT LSM                   126.6    124.2    62.8     31.5     32.4

- ForestDB shows 3x - 20x less write amplification
Mixed Workload Performance
- Key size: 48 bytes on average
- Document size: 1 KB on average
- Number of documents: 10,000,000 (11 GB DB size)
- RAM size: 2 GB
- Cache size: 512 MB
- Single writer thread and multiple reader threads
- Writer batch size: 16 documents on average
Mixed Workload (Read: 80%, Write: 20%)
[Chart: operations per second vs. number of reader threads (1, 2, 4, 8, 16) for ForestDB, WiredTiger B-tree, and WiredTiger LSM; ForestDB is 2x - 8x faster]
ForestDB - Summary
- Compact and efficient storage for a variety of data: HB+Trie
- In-memory WAL indexes to improve write/read performance
- Optimized for new SSD storage technology: bypassing the OS file system, SWAT-based compaction, parallel I/O channels
- A unified storage engine that performs well for various workloads
- A unified storage engine that scales from small devices to large servers:
  - Couchbase Server secondary index
  - Couchbase Lite
  - Couchbase Server KV engine