A Deep Dive Into Understanding Apache Cassandra

Description:

Inside Cassandra – C* is an interesting piece of software for many reasons, but it is especially interesting in its use of elegant data structures and algorithms. This talk will focus on the data structures and algorithms that make C* such a scalable and performant database. We will walk along the write, read and delete paths exploring the low-level details of how each of these operations work. We will also explore some of the background processes that maintain availability and performance. The goal of this talk is to gain a deeper understanding of C* by exploring the low-level details of its implementation.


Inside Cassandra

Michael Penick

Overview

• To disk and back again
• See also: "Cassandra Internals" by Aaron Morton
• Goals
  – RDBMS comparison to C*
  – Make educated decisions

[Diagram: four-node cluster, Node 0–Node 3]

Distributed Hashing

[Diagram: keys A–P distributed across Node 0–Node 3]

Location = Hash(Key) % # Nodes

Distributed Hashing

[Diagram: Node 4 added; most of the keys A–P are reassigned to different nodes]

% Data Moved = 100 * N / (N + 1)
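The cost of naive modulo placement can be checked directly. This is a sketch (not Cassandra code), using `zlib.crc32` as a stand-in hash and the keys A–P from the diagram; with N = 4 existing nodes the formula predicts that adding a fifth moves about 80% of the keys.

```python
import zlib

keys = [chr(c) for c in range(ord('A'), ord('P') + 1)]

def place(keys, num_nodes):
    # Location = Hash(Key) % # Nodes
    return {k: zlib.crc32(k.encode()) % num_nodes for k in keys}

before = place(keys, 4)   # Node 0 .. Node 3
after = place(keys, 5)    # Node 4 added
moved = [k for k in keys if before[k] != after[k]]
print(f"{100 * len(moved) / len(keys):.0f}% of keys moved")
```

A key keeps its location only when its hash gives the same remainder mod 4 and mod 5, which is rare; almost the whole dataset reshuffles for a single added node.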

Consistent Hashing

[Diagram: ring of Node 1–Node 4 with keys A–P placed around it by token]

[Diagram: Node 0 added to the ring; only the keys falling in its new range move]

% Data Moved = 100 * 1 / N
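The ring behavior can be sketched in a few lines. This is an illustration only (Cassandra's partitioners and token ranges are more involved), with `zlib.crc32` standing in for the token hash: when Node 0 joins, the only keys that change owner are the ones landing in its new arc.

```python
import bisect
import zlib

def token(x):
    # Stand-in for a real partitioner's hash (illustration only).
    return zlib.crc32(x.encode())

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        toks = [t for t, _ in self.ring]
        # First node token clockwise from the key's token (wrapping).
        i = bisect.bisect_right(toks, token(key)) % len(self.ring)
        return self.ring[i][1]

before_ring = Ring(["Node 1", "Node 2", "Node 3", "Node 4"])
after_ring = Ring(["Node 0", "Node 1", "Node 2", "Node 3", "Node 4"])
keys = list("ABCDEFGHIJKLMNOP")
moved = [k for k in keys if before_ring.owner(k) != after_ring.owner(k)]
# Only keys falling into Node 0's new arc move: ~100/N percent on average.
print(f"{len(moved)} of {len(keys)} keys moved, all to the new node")
```

Every moved key ends up on the new node; the rest of the ring is untouched, which is the whole point of consistent hashing.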

Tunable Consistency

[Diagram: seven-node ring (Node 0–Node 6); the client's write goes to replicas R1–R3, and one ack satisfies the request]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Tunable Consistency

[Diagram: same seven-node ring; the write now waits for a majority of the replicas R1–R3 to ack]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
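The consistency levels above boil down to replica-ack counts. A minimal sketch of the arithmetic (names and numbers follow the common definitions, not Cassandra source):

```python
# How many replica acks each consistency level requires, given a
# replication factor (rf). Sketch for illustration only.
def required_acks(level, rf):
    return {
        "ANY": 1,        # a stored hint alone can satisfy it
        "ONE": 1,
        "QUORUM": rf // 2 + 1,
        "ALL": rf,
    }[level]

rf = 3
print(required_acks("QUORUM", rf))
# Reads and writes are guaranteed to overlap (strong consistency)
# when R + W > RF; QUORUM reads + QUORUM writes satisfy this:
assert required_acks("QUORUM", rf) * 2 > rf
```

This is why QUORUM/QUORUM is the usual recipe for read-your-writes behavior with `replication_factor = 3`: 2 + 2 > 3.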

Hinted Handoff

[Diagram: seven-node ring; one of the replicas R1–R3 is down, so the coordinator stores a hint for it locally]

replication_factor = 3 and hinted_handoff_enabled = true

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY

Write locally: system.hints

Note: Hints don't count toward the consistency level (except ANY)

Tunable Consistency

[Diagram: two data centers, each a seven-node ring with replicas R1–R3; a quorum is required in every data center. The coordinator forwards the write to one node in the remote data center, appending a FWD_TO parameter to the message so that node relays it to the other local replicas]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM

Read Repair

[Diagram: seven-node ring; the client reads from one replica while the other replicas R2–R3 are checked in the background and repaired if stale]

replication_factor = 3 and read_repair_chance > 0

SELECT * FROM table USING CONSISTENCY ONE

Write

[Diagram: a write of K1 C1:V1 C2:V2 appends to the Commit Log (disk) and updates the Memtable (memory); flushing the memtable produces SSTable #1 holding the same row]

Flush when:
• > commitlog_total_space_in_mb, or
• > memtable_total_space_in_mb

Write

[Diagram: a second write K1 C3:V3 appends to the Commit Log and lands in a fresh Memtable; SSTable #2 joins SSTable #1 on disk, each holding part of row K1]

Note: All writes are sequential!

[Diagram: the Commit Log and the data files are kept on separate physical volumes, so sequential log appends are not disturbed by data I/O]
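The two-step write path above can be sketched in a few lines. This is a toy model, assuming a file-backed log and a plain dict in place of Cassandra's Java commit log and memtable:

```python
# Minimal write-path sketch: append to a commit log for durability,
# then apply to an in-memory memtable. Illustration only.
import json
import os
import tempfile

class Table:
    def __init__(self, log_path):
        self.log = open(log_path, "a")   # append-only: sequential I/O
        self.memtable = {}               # key -> {column: value}

    def write(self, key, columns):
        # 1. Append the mutation to the commit log first (durability).
        self.log.write(json.dumps([key, columns]) + "\n")
        self.log.flush()
        # 2. Apply it to the memtable (serves reads of recent data).
        self.memtable.setdefault(key, {}).update(columns)

path = os.path.join(tempfile.mkdtemp(), "commitlog")
t = Table(path)
t.write("K1", {"C1": "V1", "C2": "V2"})
t.write("K1", {"C3": "V3"})
print(t.memtable["K1"])
```

Both steps are appends or in-memory updates, which is why every write is sequential: no read-modify-write of data files happens on the write path.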

Commit Log

[Diagram: mutations queue through the Commit Log Executor into the active segment; the Commit Log Allocator recycles Segment #1–#3, each backed by an on-disk commit log file of commitlog_segment_size_in_mb. Full segments are flushed while new writes continue on the next segment]

Commit Log

• commitlog_sync
  1. periodic (default)
     • commitlog_sync_period_in_ms (default: 10 seconds)
  2. batch
     • commitlog_batch_window_in_ms

Memtable

• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
  – DeletionInfo deletionInfo; // tombstone
  – SnapTreeMap<ByteBuffer, Column> map;
• Goals
  – Fast operations
  – Fast concurrent access
  – Fast in-order iteration
  – Atomic/isolated operations within a column family

Skip List

[Diagram: skip list of keys 1–7; every key sits in the bottom-level linked list, with sparser "express lanes" above, each lane terminating in NIL]

Skip List

Get 7

[Diagram: the search starts in the top lane and drops down a level whenever the next node would overshoot, reaching 7 in a few hops]

Skip List

Delete 4

[Diagram: the predecessors of 4 at each level are located, their forward pointers are swung past 4, and 4 is unlinked]

Skip List

Insert 4

[Diagram: the predecessors of 4 are located at each level, then 4 is spliced in with a randomly chosen height]

Skip List

ConcurrentSkipListMap uses: p = 0.5
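The search/insert walk shown in the frames above can be written compactly. A single-threaded toy sketch (no CAS here; Cassandra's actual structure is Java's `ConcurrentSkipListMap`), using the same p = 0.5 coin flip to pick node heights:

```python
# Toy skip list: insert and membership test with p = 0.5 level promotion.
import random

MAX_LEVEL = 8

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)

    def _random_level(self):
        # Flip a fair coin: each extra level is half as likely (p = 0.5).
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # Record the rightmost node before `key` at every level.
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in reversed(range(MAX_LEVEL)):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = Node(key, self._random_level())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def contains(self, key):
        node = self.head
        for lvl in reversed(range(MAX_LEVEL)):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        return node is not None and node.key == key

s = SkipList()
for k in [3, 1, 7, 5, 2, 6, 4]:
    s.insert(k)
print([s.contains(k) for k in (4, 8)])
```

The expected search cost is O(log n) because each coin flip halves the number of nodes visible one level up, giving the tree-like descent the frames illustrate.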


Skip List

[Diagram: inserting 2 between 1 and 3 in the bottom-level list with a single CAS on 1's next pointer]

Skip List

while true:
    next = current.next
    new_node.next = next
    if CompareAndSwap(current.next, next, new_node):
        break
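The retry loop above can be made runnable by simulating compare-and-swap. Real CAS is a single atomic hardware instruction; the check-then-set below is only an illustration of the retry logic, not a thread-safe implementation:

```python
# CAS-style insertion into a singly linked list (single-threaded sketch).
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

def compare_and_swap(node, attr, expected, new):
    # Atomic in hardware; here just check-then-set for illustration.
    if getattr(node, attr) is expected:
        setattr(node, attr, new)
        return True
    return False

def insert_after(current, new_node):
    while True:
        nxt = current.next
        new_node.next = nxt
        # If another thread changed current.next since we read it,
        # the CAS fails and we retry with the fresh value.
        if compare_and_swap(current, "next", nxt, new_node):
            break

def values(head):
    out, n = [], head
    while n:
        out.append(n.value)
        n = n.next
    return out

head = Node(1)
head.next = Node(3)
insert_after(head, Node(2))
print(values(head))
```

The key property: the swap only succeeds if the pointer still holds the value read at the top of the loop, so a concurrent modification forces a retry instead of a lost update.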

Skip List

[Diagram: a concurrent delete and insert CAS adjacent pointers; a node inserted after a node that is simultaneously being unlinked can become unreachable ("I'm lost!"). ConcurrentSkipListMap avoids this by marking nodes before unlinking and retrying failed CAS operations]

Skip List

Insert 4

[Diagram sequence: the new node is linked into the list of 1 2 3 5 6 7 8 level by level, each splice done with a CAS; a failed CAS re-reads the predecessor and retries]

Skip List

Delete 4

[Diagram sequence: the node's value is first CAS'd to NIL (logical delete), a marker is appended so no new node can be linked behind it, then each predecessor's pointer is CAS'd past it (physical unlink)]

SnapTree

[Diagram: balanced AVL tree — 3 at the root, children 2 and 5, leaves 1, 4, 6]

Node  Balance Factor
1     0
2     1
3     0
4     0
5     0
6     0

Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
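The balance-factor definition can be computed directly on the slide's tree. A small sketch (a plain binary-tree class for illustration, not SnapTree's code):

```python
# Balance Factor = Height(Left-Subtree) - Height(Right-Subtree)
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def balance(node):
    return height(node.left) - height(node.right)

# The balanced tree from the slide: 3 over (2, 5), leaves 1, 4, 6.
tree = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
print([balance(n) for n in (tree, tree.left, tree.right)])
```

Node 2 has factor 1 (a left child but no right child), matching the table; every factor stays within {-1, 0, +1}, so no rotation is needed.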

SnapTree

[Diagram: unbalanced tree — 5 at the root, children 2 and 6; 2 has children 1 and 3; 3 has right child 4]

Node  Balance Factor
1     0
2     -1
3     -1
4     0
5     2
6     0

Balance Factor must be -1, 0 or +1

SnapTree

[Diagram: Left-Left case — a single right rotation of the root brings 4 up, leaving 4 over children 3 and 5 with subtrees A–D reattached in order. Left-Right case — first rotate the left child left to reduce it to the Left-Left case, then rotate the root right]

SnapTree

[Diagram: Right-Right case — a single left rotation of the root brings 4 up, mirroring the Left-Left case. Right-Left case — first rotate the right child right to reduce it to the Right-Right case, then rotate the root left]

SnapTree

[Diagram sequence: rebalancing walkthrough — the tree rooted at 5 (balance factor 2) is rotated step by step until the root is 3 with children 2 and 5 and leaves 1, 4, 6; every balance factor returns to {-1, 0, +1}]

Epoch

SnapTree

[Diagram sequence: operations descend from the root optimistically, recording each node's version and re-validating it before trusting the step (e.g. "Version(5) is 0 … Does Version(5) == 0?"); if a version changed under a concurrent rotation, the reader backs up and retries ("NO! Go back to 5"). Writers briefly lock the node they modify, so a concurrent Delete(3) waits for the lock, then runs SetValue(3, null) — interior nodes are logically deleted by nulling their value rather than unlinking them]

SnapTree

[Diagram sequence: clone() closes the current epoch (in-flight operations stop, new ones wait), marks the root shared ("I'm shared!"), and starts a new epoch whose root points at the same nodes. A later Insert(7) in one epoch copies only the nodes on its path (copy-on-write); the two epochs continue to share every untouched subtree]

SnapTree

C* 2.0.0 – File: db/AtomicSortedColumns.java, Line 307

SSTable

[Diagram: on-disk components — Data.db holds rows K1–K3 and their columns; Filter.db is the bloom filter; Index.db maps keys K1–K3 to row offsets; without compression, CRC.db holds per-chunk checksums; with compression, CompressionInfo.db maps compressed chunk offsets]

• CASSANDRA-2319: Promote row index
• CASSANDRA-4885: Remove … per-row bloom filters

Delete

• Essentially a write (mutation)
• Data is not removed immediately; a tombstone record is added
• tombstone time > gc_grace ⇒ data removed (compaction)
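The tombstone rule reduces to one comparison. A sketch of the check compaction applies (parameter names follow the slide; the 10-day default matches `gc_grace_seconds`; this is not Cassandra source):

```python
# A deleted cell may only be physically purged once its tombstone is
# older than the grace period, giving repairs time to propagate it.
def purgeable(tombstone_time, now, gc_grace=864000):  # default: 10 days
    return now - tombstone_time > gc_grace

print(purgeable(tombstone_time=1000, now=1000 + 3600))    # too fresh
print(purgeable(tombstone_time=1000, now=1000 + 864001))  # past grace
```

Purging too early risks deleted data "resurrecting" from a replica that never saw the tombstone, which is why the grace period exists at all.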

Bloom Filter

[Diagram: inserting K1 — each of the hash functions sets one bit in a 16-bit array, turning 0000000000000000 into a pattern like 1000101000000000]

hash = murmur3(key)  # produces two hashes
for i in range(num_hashes):
    result[i] = abs(hash[0] + i * hash[1]) % num_bits
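The double-hashing scheme above can be run end to end. A sketch using `crc32` and `md5` in place of murmur3's two 64-bit halves (an assumption purely for illustration; the derivation of the bit positions is the same):

```python
# Bloom filter with double hashing: position i = |h0 + i*h1| % num_bits.
import hashlib
import zlib

NUM_BITS = 64
NUM_HASHES = 3

def positions(key):
    h0 = zlib.crc32(key)
    h1 = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return [abs(h0 + i * h1) % NUM_BITS for i in range(NUM_HASHES)]

bits = [0] * NUM_BITS

def insert(key):
    for p in positions(key):
        bits[p] = 1

def might_contain(key):
    # No false negatives are possible; false positives are.
    return all(bits[p] for p in positions(key))

insert(b"K1")
print(might_contain(b"K1"), might_contain(b"K2"))
```

A present key always answers True; an absent key answers False unless all of its positions happen to collide with set bits, which is the tunable false-positive chance.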

Bloom Filter

Probability calculation:
• Inputs: bloom_filter_fp_chance (config) and the SSTable's number of rows
• Outputs: number of hashes and number of bits per entry

Read

[Diagram: row K1 is assembled column by column from the current Memtable (C5:V5), an earlier Memtable (C4:V4), SSTable #2 (C3:V3) and SSTable #1 (C1:V1, C2:V2); the merged row C1–C5 is kept in the off-heap Row Cache]

row_cache_size_in_mb > 0
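The merge shown above can be sketched with dicts. An illustration of the idea only: real Cassandra resolves each column by write timestamp, whereas here "newer source wins" is approximated by merge order.

```python
# Read-time merge of a row scattered across memtables and SSTables.
memtable = {"K1": {"C5": "V5"}}
sstable2 = {"K1": {"C3": "V3", "C4": "V4"}}  # flushed after sstable1
sstable1 = {"K1": {"C1": "V1", "C2": "V2"}}

def read_row(key):
    merged = {}
    # Oldest source first, so newer sources overwrite per column.
    for source in (sstable1, sstable2, memtable):
        merged.update(source.get(key, {}))
    return merged

print(read_row("K1"))
```

This is why reads get slower as a row fragments across more SSTables, and why the row cache (which stores the already merged row) helps hot keys.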

Read

[Diagram: a read first consults the Bloom Filter, then the Key Cache — a cache hit goes straight to the data; a miss walks the Partition Summary, Partition Index and Compression Offsets before reaching the data. The bloom filter, compression offsets and partition summary live off-heap]

key_cache_size_in_mb > 0

index_interval = 128 (default)

Compaction (Size-tiered)

[Diagram: each memtable flush adds an SSTable; once min_compaction_threshold = 4 similarly sized SSTables accumulate, they are merged into one larger SSTable, which later joins the next size tier]

Compaction (Leveled)

[Diagram: flushed SSTables land in L0; each level holds fixed-size SSTables (sstable_size_in_mb = 160), with L1 holding 10 and L2 holding 100; overlapping SSTables are compacted into the next level]

L0: 160 MB    L1: 160 MB x 10    L2: 160 MB x 100
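The level sizes follow a simple geometric rule. A sketch of the arithmetic implied by the slide (each level holds 10x the previous, in fixed-size SSTables; this is the sizing math only, not the compactor):

```python
# Capacity of each leveled-compaction level, in MB.
sstable_size_in_mb = 160

def level_capacity_mb(level):
    # L1 = 10 sstables, L2 = 100, L3 = 1000, ...
    return sstable_size_in_mb * 10 ** level

print([level_capacity_mb(n) for n in (1, 2, 3)])
```

Because each level is 10x larger, the vast majority of data sits in the highest level, and a read touches at most one SSTable per level.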

Topics

• CAS (Paxos)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)

Thanks
