The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016

Wei Deng, Ryan Svihla

The Missing Manual: Leveled Compaction Strategy

2© DataStax, All Rights Reserved.

1 Why

2 Background

3 How it affects you

4 Myth Busters

5 What’s coming

About us

© DataStax, All Rights Reserved. 3

Wei Deng, Ryan Svihla

• Both been working for DataStax for 3 years

• Helped a lot of customers on production deployments and shared their struggles with LCS

• Active on LCS related JIRAs

Why?Are we lazy, or crazy to talk about a five-year-old technology

Why?• Motivations

• Not a lot of technical content dedicated to LCS• Many sharp edges for non-suspecting users that can cause operation pains

• Intended Audience• You’ve been managing C* cluster using LCS for a few months and have been burned by it a

few times• You just started using C* and are wondering what are the pitfalls to watch out

• What you can expect to get out of this talk• A better understanding about inner-workings of LCS and some counter-intuitive behaviors• Understand some concepts frequently used in the code (but not adequately discussed in

existing documentation) before diving into the LCS source code and JIRAs


A brief history on LCS• One of the three main Compaction Strategies in C*: STCS, LCS, DTCS/TWCS• Original idea comes from LevelDB of Google’s Chromium project• Introduced in C* 1.0 almost 5 years ago• A lot of people adopted it in production starting from C* 1.2 days• Been going through many changes and is still evolving• At various points people from community would like to change it to the default

compaction strategy to replace STCS but didn’t succeed (see CASSANDRA-7987 and CASSANDRA-12209)

• Interesting divide in the community: committers and users• A more detailed history can be found by looking at labels = “lcs” on ASF JIRA site.


https://issues.apache.org/jira/browse/CASSANDRA-7987

https://issues.apache.org/jira/browse/CASSANDRA-12209

A Quick Diversion to LevelDB• LevelDB was introduced by Google, designed as an embedded key/value store in

browsers and mobile OS• It was designed for low concurrency embedded purposes• Spawned many additional key/value store use cases (Bitcoin, Akka, ActiveMQ,

RocksDB, etal)• NoSQL database that adopted it for a multiuser database (riak, hyperdex, etal) had to

do some serious surgery similar to LCS (Don't block writes for L0, concurrent compactions, etc.)

• As Basho’s Matthew Von-Maszewski pointed out in his Riak LevelDB talk, there are some common problems that many LevelDB forks suffer, e.g. L0 overwhelmed (pause), write amplification, repair speed


https://www.youtube.com/watch?v=vo88IdglU_8

BackgroundDesign Principles and Basic Concepts

What is overlap?• It’s not a LCS-only concept, but it’s quite important for understanding LCS• How the code decides overlap between two SSTableReader objects (‘current’ and ‘previous’):

• current.first.compareTo(previous.last) <= 0

• Assume you’ve done the following operations to insert a few partitions and incur two flushes:• Insert P1, P2, P3, P4, flush

• Insert P5, P6, P7, P8, flush


cqlsh:lcstest> select token(key), key from t1;

system.token(key) | key

----------------------+-----

-8293482207053059348 | P3

-3394362371136168911 | P7

-2426286626330648673 | P8

2250124551998075523 | P1

3019394044321026632 | P6

5060311682392199866 | P4

5805828953249703115 | P5

9074316327295047501 | P2

-3394362371136168911 <= 9074316327295047501

-8293482207053059348 | P3 2250124551998075523 | P1 5060311682392199866 | P4 9074316327295047501 | P2pr

evio

us

-3394362371136168911 | P7-2426286626330648673 | P8 3019394044321026632 | P6 5805828953249703115 | P5cu

rren

t

current.first <= previous.last ?

https://github.com/apache/cassandra/blob/8a3f0e11fecde964730cd67d6376660ae562b208/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java%23L248

Basic Design Principles• LCS arranges data in SSTables on multiple levels

• Newly appeared SSTables (not from compaction) almost always show up in L0

• Only L0 is allowed to have SSTables of any size, L1+ levels can only have 160MB SSTables (with minor exceptions)

• Only L0 can allow SSTables to overlap, for all other levels, it’s guaranteed to not overlap within the level

• When LCS is able to catch up (i.e. no SSTable in L0), a read in the worst case just need to read from N SSTables (N = the highest level that has data), i.e. at most one SSTable from each L(N>=1) level; ideally 90% of the read will hit the highest level and be satisfied by one SSTable

• Each L(N) level (N>=1) is supposed to accommodate 10N SSTables. So going one level higher, you will be able to accommodate 10x more SSTables

• When a L0 SSTable is picked as a compaction candidate, it will find all other overlapping L0 SSTables (maximum 32), as well as all of the overlapping L1 SSTables (most likely all 10 L1s), to perform a compaction, and the result will be written into multiple SSTables in L1

• When the total size of all SSTables in L1 is greater than 10 x 160MB, the first L1 SSTable since the last compacted key could be picked as the next compaction candidate, and it will be compacted together with all overlapping L2 SSTables, and the result will be written into multiple SSTables in L2

• For higher levels, it will follow the same fashion, adding candidates from highest level to the lowest


Basic Design Principles• LCS arranges data in SSTables on multiple levels

• Newly appeared SSTables (not from compaction) almost always show up in L0

• Only L0 is allowed to have SSTables of any size, L1+ levels can only have 160MB SSTables (with minor exceptions)

• Only L0 can allow SSTables to overlap, for all other levels, it’s guaranteed to not overlap within the level

• When LCS is able to catch up (i.e. no SSTable in L0), a read in the worst case just need to read from N SSTables (N = the highest level that has data), i.e. at most one SSTable from each L(N>=1) level; ideally 90% of the read will hit the highest level and be satisfied by one SSTable

• Each L(N) level (N>=1) is supposed to accommodate 10N SSTables. So going one level higher, you will be able to accommodate 10x more SSTables

• When a L0 SSTable is picked as a compaction candidate, it will find all other overlapping L0 SSTables (maximum 32), as well as all of the overlapping L1 SSTables (most likely all 10 L1s), to perform a compaction, and the result will be written into multiple SSTables in L1

• When the total size of all SSTables in L1 is greater than 10 x 160MB, the first L1 SSTable since the last compacted key could be picked as the next compaction candidate, and it will be compacted together with all overlapping L2 SSTables, and the result will be written into multiple SSTables in L2

• For higher levels, it will follow the same fashion, adding candidates from highest level to the lowest


Basic Design Principles (cont.)• On all levels, a Compaction Score will be periodically calculated, based on the total size of all non-

compacting SSTables in each level, divided by this level’s capacity (for example, L1 is supposed to accommodate 1.6GB, L2 is supposed to accommodate 16GB). Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread

• Before any one of the higher-level compaction starts, check if L0 is too far-behind (i.e. having > 32 SSTables that are not yet in a compaction), if yes, do STCS in L0 first before doing higher level LCS compaction

• After going through compactions at all higher levels, move onto doing regular LCS compaction in L0• The number of compaction threads is capped by concurrent_compactors, and throughput throttled

• For each compaction session, the level for the added sstables is the max of the removed ones, plus one if the removed were all on the same level

• If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions

• In addition to the DataStax blog post on LCS, more details are explained in the code comments• http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

• org.apache.cassandra.db.compaction.LeveledManifest.java

• org.apache.cassandra.db.compaction.LeveledCompactionStrategy.java© DataStax, All Rights Reserved. 12

http://www.datastax.com/dev/blog/when-to-use-leveled-compaction



https://github.com/apache/cassandra/blob/8a3f0e11fecde964730cd67d6376660ae562b208/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java

https://github.com/apache/cassandra/blob/8a3f0e11fecde964730cd67d6376660ae562b208/src/java/org/apache/cassandra/db/compaction/LeveledCompactionStrategy.java

Basic Design Principles (cont.)• On all levels, a Compaction Score will be periodically calculated, based on the total size of all non-

compacting SSTables in each level, divided by this level’s capacity (for example, L1 is supposed to accommodate 1.6GB, L2 is supposed to accommodate 16GB). Going from highest level, any level having score higher than 1.001 can be picked by a compaction thread

• Before any one of the higher-level compaction starts, check if L0 is too far-behind (i.e. having > 32 SSTables that are not yet in a compaction), if yes, do STCS in L0 first before doing higher level LCS compaction

• After going through compactions at all higher levels, move onto doing regular LCS compaction in L0• The number of compaction threads is capped by concurrent_compactors, and throughput throttled

• For each compaction session, the level for the added sstables is the max of the removed ones, plus one if the removed were all on the same level

• If we go 25 rounds without compacting in the highest level, we start bringing in sstables from that level into lower level compactions

• In addition to the DataStax blog post on LCS, more details are explained in the code comments• http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

• org.apache.cassandra.db.compaction.LeveledManifest.java

• org.apache.cassandra.db.compaction.LeveledCompactionStrategy.java© DataStax, All Rights Reserved. 13




https://github.com/apache/cassandra/blob/8a3f0e11fecde964730cd67d6376660ae562b208/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java

https://github.com/apache/cassandra/blob/8a3f0e11fecde964730cd67d6376660ae562b208/src/java/org/apache/cassandra/db/compaction/LeveledCompactionStrategy.java

How up-level is happening• Use a very simple key/value table in our demo• For each partition, the key is an integer, and the

value is a 40MB blob, so each partition is about 40MB and 4 partitions can fill up a 160MB SSTable

• Insert partitions with key 1-8, trigger a flush, then insert 9-16, another flush, then insert 17-24, flush, insert 25-32, and so on

• Each L(N>0) level has capacity of 160MB x 4N , so L1 has max capacity of 640MB, L2 has max capacity of 2.56GB and so on, for easier illustration of levels


CREATE TABLE kvstore (

key int PRIMARY KEY,

value text

);

How up-level is happening


system.token(key) | key | value (40MB)----------------------+-----+------------- -9157060164899361011 | 23 | aaa -7509452495886106294 | 5 | aaa -6715243485458697746 | 10 | aaa -5477287129830487822 | 16 | aaa -5034495173465742853 | 13 | aaa -4156302194539278891 | 11 | aaa -4069959284402364209 | 1 | aaa -3974532302236993209 | 19 | aaa -3799847372828181882 | 8 | aaa -3248873570005575792 | 2 | aaa -2729420104000364805 | 4 | aaa -2695747960476065067 | 18 | aaa -1191135763843456182 | 15 | aaa -1117083337304738213 | 22 | aaa 1388667306199997068 | 20 | aaa 1634052884888577606 | 7 | aaa 2705480034054113608 | 6 | aaa 3728482343045213994 | 9 | aaa 4279681877540623768 | 14 | aaa 5176205029172940157 | 21 | aaa 5467144456125416399 | 17 | aaa 6462532412986175674 | 24 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa

-7509452495886106294 | 5 -4069959284402364209 | 1 -3799847372828181882 | 8 -3248873570005575792 | 2 -2729420104000364805 | 4 1634052884888577606 | 7 2705480034054113608 | 6 9010454139840013625 | 3

-6715243485458697746 | 10 -5477287129830487822 | 16 -5034495173465742853 | 13 -4156302194539278891 | 11 -1191135763843456182 | 15 3728482343045213994 | 9 4279681877540623768 | 14 8582886034424406875 | 12

L0Can overlapNo size limit

-7509452495886106294 | 5

-6715243485458697746 | 10

-5477287129830487822 | 16

-5034495173465742853 | 13 -4156302194539278891 | 11

-4069959284402364209 | 1

-3799847372828181882 | 8

-3248873570005575792 | 2 -2729420104000364805 | 4

-1191135763843456182 | 15

1634052884888577606 | 7

2705480034054113608 | 6 3728482343045213994 | 9

4279681877540623768 | 14

8582886034424406875 | 12

9010454139840013625 | 3

L1No overlap41 x 160MB


…

flush

flush



system.token(key) | key | value (40MB)----------------------+-----+------------- -9157060164899361011 | 23 | aaa -7509452495886106294 | 5 | aaa -6715243485458697746 | 10 | aaa -5477287129830487822 | 16 | aaa -5034495173465742853 | 13 | aaa -4156302194539278891 | 11 | aaa -4069959284402364209 | 1 | aaa -3974532302236993209 | 19 | aaa -3799847372828181882 | 8 | aaa -3248873570005575792 | 2 | aaa -2729420104000364805 | 4 | aaa -2695747960476065067 | 18 | aaa -1191135763843456182 | 15 | aaa -1117083337304738213 | 22 | aaa 1388667306199997068 | 20 | aaa 1634052884888577606 | 7 | aaa 2705480034054113608 | 6 | aaa 3728482343045213994 | 9 | aaa 4279681877540623768 | 14 | aaa 5176205029172940157 | 21 | aaa 5467144456125416399 | 17 | aaa 6462532412986175674 | 24 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa L0

Can overlapNo size limit

-7509452495886106294 | 5

-6715243485458697746 | 10

-5477287129830487822 | 16

-5034495173465742853 | 13 -4156302194539278891 | 11

-4069959284402364209 | 1

-3799847372828181882 | 8

-3248873570005575792 | 2 -2729420104000364805 | 4

-1191135763843456182 | 15

1634052884888577606 | 7

2705480034054113608 | 6 3728482343045213994 | 9

4279681877540623768 | 14

8582886034424406875 | 12

9010454139840013625 | 3



…





-7509452495886106294 | 5

-6715243485458697746 | 10

-5477287129830487822 | 16

-5034495173465742853 | 13 -4156302194539278891 | 11

-4069959284402364209 | 1

-3799847372828181882 | 8

-3248873570005575792 | 2 -2729420104000364805 | 4

-1191135763843456182 | 15

1634052884888577606 | 7

2705480034054113608 | 6 3728482343045213994 | 9

4279681877540623768 | 14

8582886034424406875 | 12

9010454139840013625 | 3



…

-9157060164899361011 | 23 -3974532302236993209 | 19 -2695747960476065067 | 18 -1117083337304738213 | 22 1388667306199997068 | 20 5176205029172940157 | 21 5467144456125416399 | 17 6462532412986175674 | 24





-7509452495886106294 | 5

-6715243485458697746 | 10

-5477287129830487822 | 16

-5034495173465742853 | 13 -4156302194539278891 | 11

-4069959284402364209 | 1

-3799847372828181882 | 8

-3248873570005575792 | 2 -2729420104000364805 | 4

-1191135763843456182 | 15

1634052884888577606 | 7

2705480034054113608 | 6 3728482343045213994 | 9

4279681877540623768 | 14

8582886034424406875 | 12

9010454139840013625 | 3



…

-9157060164899361011 | 23 -3974532302236993209 | 19 -2695747960476065067 | 18 -1117083337304738213 | 22 1388667306199997068 | 20 5176205029172940157 | 21 5467144456125416399 | 17 6462532412986175674 | 24

90°







…

23 19 18 22 20 21 17 24

5 10 16 13 11 1 8 2 4 15 7 6 9 14 12 3







…

23 5 10 16 13 11 1 19 8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3

Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB / (41 x 160MB) > 1.001







…

23 5 10 16 13 11 1 19

8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3

Total size of all L1 SSTables ≈ 40MB/partition x 16 partitions = 640MB / (41 x 160MB) ≤ 1.001



system.token(key) | key | value (40MB)----------------------+-----+------------- -9157060164899361011 | 23 | aaa -7509452495886106294 | 5 | aaa -7492342649816923291 | 28 | aaa -6715243485458697746 | 10 | aaa -5477287129830487822 | 16 | aaa -5034495173465742853 | 13 | aaa -4944298994521308347 | 30 | aaa -4156302194539278891 | 11 | aaa -4069959284402364209 | 1 | aaa -3974532302236993209 | 19 | aaa -3799847372828181882 | 8 | aaa -3248873570005575792 | 2 | aaa -2729420104000364805 | 4 | aaa -2695747960476065067 | 18 | aaa -1191135763843456182 | 15 | aaa -1117083337304738213 | 22 | aaa 945780720898208171 | 27 | aaa 1388667306199997068 | 20 | aaa 1634052884888577606 | 7 | aaa 2705480034054113608 | 6 | aaa

... 8041825692956475789 | 25 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa




…

23 5 10 16 13 11 1 19

8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3

28 30 27 29 26 31 32 25




... 8041825692956475789 | 25 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa




…

23 5 10 16 13 11 1 19

8 2 4 18 15 22 20 7 6 9 14 21 17 24 12 3

28 30 27 29 26 31 32 25




... 8041825692956475789 | 25 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa




…

23 5 10 16 13 11 1 19

27 20 7 6 29 9 14 26 21 17 31 24

32 25 12 328 30 8 2 4 18 15 22

Total size of all L1 SSTables ≈ 40MB/partition x 24 partitions = 960MB / (41 x 160MB) > 1.001

lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;




... 8041825692956475789 | 25 | aaa 8582886034424406875 | 12 | aaa 9010454139840013625 | 3 | aaa




…

23 5 10 16 13 11 1 19 27 20 7 6

29 9 14 26 21 17 31 24

32 25 12 328 30 8 2

4 18 15 22

lastCompactedKeys[minLevel] = SSTableReader.sstableOrdering.max(added).last;

Fully Expired SSTable


• If an SSTable contains only tombstones and it is guaranteed that the SSTable is not shadowing data in any other SSTable, compaction can drop that SSTable

• If you see SSTables with only tombstones (note that TTL’ed data is considered tombstones once the time to live has expired) but it is not being dropped by compaction, it is likely that other SSTables contain older data

• In LCS, these other SSTables containing older data are often in higher levels• This directly impacts the effectiveness of tombstone compaction, and hence why you

always want tombstones to go to the highest level as quickly as possible

Write Amplification


• All fresh data enters L0, and will have to move up level one by one, no skipping level possible

• In the ideal situation, highest level will have 90% of data, hence more data -> more Write Amplification, because any data that eventually shows up on the highest level N had to be written N+1 times

• This is why LCS is more demanding on CPU and IO capabilities compared to STCS

JBOD and CASSANDRA-6696• JBOD is a primary use case for choosing LCS• However, it used to create “zombie resurrection” problem when JBOD disk fails• CASSANDRA-6696 (C* 3.2+) will create flush writer and compaction strategy per disk,

so all flush and compaction activities for each disk will only be responsible for one set of token ranges

• More parallelism for compaction (no worries about overlap at higher levels between different compaction threads)

• This also makes LCS to be able to handle more data per node, as each JBOD volume is doing compaction on its own


How it affects you?Best Practices

Where LCS fits the best• Use cases needing very consistent read performance with much higher read to write

ratio• Wide-partition data model with limited (or slow-growing) number of total partitions but

a lot of updates and deletes, or fully TTL’ed dataset


Typical anti-pattern use cases for LCS• Heavy write with all unique partitions• Too frequent memtable flushes• Large vnode nodes• Frequent bulk load with the goal of finishing the load as quickly as possible• Any other condition that could create too many L0 SSTables• High density node (> 1TB per node or > 0.5TB per CQL table per node)• Node without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU

resources• Partitions that are too wide (> 160MB)• Use LCS simply for saving disk spaces (because “LCS requires less free space”)• Have too many SSTables without enough file handles and using snapshot repair


Troubleshooting• Debug.log

• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledManifest TRACE

• nodetool setlogginglevel org.apache.cassandra.db.compaction.LeveledCompactionStrategy TRACE

• log_all=true (CASSANDRA-10805, C* 3.6+) or JMX• enable in table schema

• Will create an individual compaction.log file outside of system.log or debug.log

• Can be very useful for a visualizer

• The offline compaction-stress tool automatically enable it


Visualization demo• Developed by Carl Yeksigian, still Work In Progress, 0.1 version• Leverage the compaction.log generated by CASSANDRA-10805, which was also

authored by Carl


Myth BustersObservations that often confuse people

Myth #1

• Using LCS, I can avoid STCS’s 50% free space requirement and use up as much as 90% of my disk space?

WRONG

• You should not size your hardware based on 90% disk utilization• STCS in L0 can easily throw off this assumption, also when L0 is overloaded, L0->L1 can easily

break the “up to 90%” assumption


Myth #2

• Why so many small SSTables show up in L0 when I start Repair Service?

• Every repair session will force affected nodes to flush the corresponding memtable• This is the only way to guarantee merkle tree comparison won’t miss any data from memtable

• If something is wrong with repair scheduling, two repair sessions could be scheduled back-to-back, causing unnecessary memtable flush to happen

• If you use vnode, one range of data tends to spread across many nodes and naturally many SSTables, hence when streaming the replica difference, it’s more likely to source from many nodes and many SSTables


Myth #3

• Why compaction is no longer runnning and I’m still seeing something like [1, 12/10, 104/100, …] ?

• Even when not doing STCS, L0 SSTables compacted don’t always end up in L1. If the combined size of L0 SSTables involved in a compaction is smaller than max_sstable_size, then the output will stay in L0

• Upper level compaction is triggered by compaction scores (watch debug.log to see it in action), and compaction scores is completely based on size, instead of number of SSTables in each level. For brevity of display, “nodetool cfstats” chooses to use # of SSTables


Myth #4

• Why it takes so long to free up disk space after delete?

• Because any newly inserted data (including delete/tombstone) will have to bubble up

• A partition can have fragments in all different levels• To allow the data belonging to a partition to be cleared, the tombstone has to

bubble up to the top and is also expired• See fully expired sstable slide.


What’s comingJIRA references to LCS’s past, present and future

What has happened since C* 2.1• Partition SSTables by token ranges (CASSANDRA-6696, C* 3.2+)• Try to include overlapping sstables in tombstone compactions to be able to actually drop the tombstones

(CASSANDRA-7019, C* 3.10+) See talk this afternoon at 4:10pm in 210C• Add major compaction to LCS (CASSANDRA-7272, C* 2.2+), however note CASSANDRA-11817• Send source sstable level when bootstrapping or replacing nodes (CASSANDRA-7460, C* 2.2+)• Adjusting LCS to work with incremental repair (CASSANDRA-8004, C* 2.1.2+)• A tool called sstableofflinerelevel to create a "decent" sstable leveling from a bunch of given SSTables (CASSANDRA-

8301, C* 2.1.5+)• Use a new JBOD volume every time when a newly compacted SSTable is written (CASSANDRA-8329, C* 2.1.3+)• Compaction throughput under stress improved by 2x with various optimizations (CASSANDRA-8915, C* 3.0+,

CASSANDRA-8920, C* 2.2+, CASSANDRA-8988, C* 2.2+, CASSANDRA-10099, C* 3.6+)• Compact only certain token range (CASSANDRA-10643, C* 3.10+)• Allow L0 STCS and L0->L1 LCS to happen at the same time if there’s no overlap (CASSANDRA-10979, C* 2.1.14+)• Improve Cassandra startup speed with LCS tables (CASSANDRA-12114, C* 3.0.9+, C* 3.8+)• Compaction Throttling is not working as expected (CASSANDRA-12366, C* 3.10+)• Compute last compacted keys for each level at startup (CASSANDRA-6216, C* 3.0.9+, C* 3.10+)


What has happened since C* 2.1• Partition SSTables by token ranges (CASSANDRA-6696, C* 3.2+)• Try to include overlapping sstables in tombstone compactions to be able to actually drop the tombstones

(CASSANDRA-7019, C* 3.10+) See talk this afternoon at 4:10pm in 210C• Add major compaction to LCS (CASSANDRA-7272, C* 2.2+), however note CASSANDRA-11817• Send source sstable level when bootstrapping or replacing nodes (CASSANDRA-7460, C* 2.2+)• Adjusting LCS to work with incremental repair (CASSANDRA-8004, C* 2.1.2+)• A tool called sstableofflinerelevel to create a "decent" sstable leveling from a bunch of given SSTables (CASSANDRA-

8301, C* 2.1.5+)• Use a new JBOD volume every time when a newly compacted SSTable is written (CASSANDRA-8329, C* 2.1.3+)• Compaction throughput under stress improved by 2x with various optimizations (CASSANDRA-8915, C* 3.0+,

CASSANDRA-8920, C* 2.2+, CASSANDRA-8988, C* 2.2+, CASSANDRA-10099, C* 3.6+)• Compact only certain token range (CASSANDRA-10643, C* 3.10+)• Allow L0 STCS and L0->L1 LCS to happen at the same time if there’s no overlap (CASSANDRA-10979, C* 2.1.14+)• Improve Cassandra startup speed with LCS tables (CASSANDRA-12114, C* 3.0.9+, C* 3.8+)• Compaction Throttling is not working as expected (CASSANDRA-12366, C* 3.10+)• Compute last compacted keys for each level at startup (CASSANDRA-6216, C* 3.0.9+, C* 3.10+)


JIRAs comingA few JIRAs worth waiting for (already patch available) or watching (WIP) or contributing/upvoting (not yet started):

• Different optimizations to pick compaction candidates to avoid write amplification and merge/eliminate tombstones (CASSANDRA-11106)

• Range Aware compaction (CASSANDRA-10540)

• Create new sstables in the highest possible level (CASSANDRA-6323)

• Fractual cascading (CASSANDRA-7065)

• Allow multiple overlapping SSTables in L1 or above (CASSANDRA-7409)

• L0 should have a separate configurable bloom filter false positive ratio (CASSANDRA-8434)

• Dynamically adjusting the ideal top level size to how much data is actually in it (CASSANDRA-9829)

• Option to disable bloom filter in highest level of LCS sstables (CASSANDRA-9830)

• LCS repair: compact tables before making available in L0 (CASSANDRA-10862)

• Make the fanout size for LeveledCompactionStrategy to be configurable (CASSANDRA-11150)

• Improve parallelism on higher level compactions in LCS (CASSANDRA-12464)


Some parting thoughts• Try out the latest C* trunk/3.x and play with the logging level• Test out bigger sstable_size_in_mb with your environment (CASSANDRA-12591)• Leverage the new compaction-stress tool to evaluate your deployment environment


Q & A

Appendix / Deleted ScenesThe stuff we didn’t have time to cover

Some more comments on LevelDB design from Cassandra source code

“LevelDB gives each level a score of how much data it contains vs its ideal amount, and compacts the level with the highest score. But this falls apart spectacularly once you get behind. Consider this set of levels:

L0: 988 [ideal: 4]

L1: 117 [ideal: 10]

L2: 12 [ideal: 100]

The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll do is compact a batch of MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the result (say, 120 sstables) in L1. Then we'll compact the next batch of MAX_COMPACTING_L0, and so forth. So we spend most of our i/o rewriting the L1 data with each batch.

If we could just do *all* L0 a single time with L1, that would be ideal. But we can’t. LevelDB's way around this is to simply block writes if L0 compaction falls behind. We don't have that luxury. So instead, we 1) force compacting higher levels first, which minimizes the i/o needed to compact optimally which gives us a long term win, and 2) if L0 falls behind, we will size-tiered compact it to reduce read overhead until we can catch up on the higher levels.

This isn't a magic wand -- if you are consistently writing too fast for LCS to keep up, you're still screwed. But if instead

you have intermittent bursts of activity, it can help a lot.”© DataStax, All Rights Reserved. 46

Basics on compaction in general• Why you need compaction

• Log Structured Merge Tree• Immutable data files (SSTables)• Super fast writes, but read will need to read as little amount of data files as possible

• What it usually does• Normally combine *multiple* SSTables and generate another set of SSTables to merge

partition fragments from the source SSTables• In this process, remove tombstones or change expired data to tombstones• Tombstone compaction can also be automatically triggered on individual SSTables.

• How it is triggered• Background / minor compaction• Major compaction• User defined compaction


Impact from Incremental Repair• Anti-compaction at the end of incremental repair session will create repaired and

unrepaired SSTables• Maintaining separate pools of repaired and unrepaired SSTables causes extra

complexity for LCS• Two instances of LeveledCompactionStrategy running, one for each pool• When moving a SSTable (at L1+) from unrepaired to repaired, it can cause overlap

and we will have to send this SSTable back to L0


What is suspectness and why it matters?• A suspected SSTable is marked when any IO exception is encountered when reading any

component of this SSTable. For example, a SSTable component corruption can lead compaction to mark this SSTable as suspected.

“Note that we ignore suspect-ness of L1 sstables here, since if an L1 sstable is suspect we're

basically screwed, since we expect all or most L0 sstables to overlap with each L1 sstable.

So if an L1 sstable is suspect we can't do much besides try anyway and hope for the best. ”

• This is actually one of the reasons why you may see a huge backup whenever there is sstable corruption. Because if the corruption happens in L1, no L0 compaction can happen any more, so no matter how many compaction threads you have, you may be blocked in L0->L1 compaction.


Subrange compaction• For partitions with “Justin Bieber” effect• Only compact a given sub range• Could be useful if you know a partition/token has been misbehaving - either gathering

many updates or many deletes• “nodetool compact -st x -et y” will pick all SSTables containing the range between x

and y and issue a compaction for those sstables• With LCS it can issue the compaction for a subset of the SSTables, the resulting

SSTables will end up in L0


Typical anti-pattern use cases for LCS• Heavy write with all unique partitions• Too frequent memtable flushes• Large vnode nodes• Any other condition that could create too many L0 SSTables• High density node (> 1.5TB per node or > 1TB per CQL table per node)• Node without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources• Frequent bulk load with the goal of finishing the load as quickly as possible• Partitions that are too wide (> 160MB)• Use LCS simply for saving disk spaces (because “LCS requires less free space”)• Have too many SSTables without enough file handles and using snapshot repair


How is Flush triggered?• memtable reaches threshold• commitlog reaches threshold• memtable_flush_period_in_ms• manual flush• repair triggering flush• snapshot/backup• nodetool drain• drop table• when you change compaction strategy• When the node is requested to transfer ranges (streaming-in)


Typical anti-pattern use cases for LCS• Heavy write with all unique partitions• Too frequent memtable flushes• Large vnode nodes• Any other condition that could create too many L0 SSTables• High density node (> 1.5TB per node or > 1TB per CQL table per node)• Node without enough IO (i.e. non-SSD, such as SAN, NFS or spindles) and CPU resources• Frequent bulk load with the goal of finishing the load as quickly as possible• Partitions that are too wide (> 160MB)• Use LCS simply for saving disk spaces (because “LCS requires less free space”)• Have too many SSTables without enough file handles and using snapshot repair


Other reasons massive SSTables can show up in L0

• Large vnode with repair streaming (CASSANDRA-10862)• Bootstrap without sending source level (CASSANDRA-7460)• Bulk load utility sstableloader from cluster-wide snapshots could be another bad

source of small L0 sstables (RF x RF) (CASSANDRA-4756)


Pitfalls to avoid• Fast write that exceeds the capability of compaction• Really large partitions (usually LCS is known to handle wide partitions well, but you

cannot go too high, especially not exceeding the max_sstable_size)• A single CQL table holding too much data with LCS compaction• Large num_token• Bugs or data corruption that can block LCS compaction and create a lot of L0 overflow


Switch between STCS and LCS• How do you properly do that?

• Simply change table schema can create a lot of impact on your production workload• Least intrusive

• Use JMX to change Compaction parameters – set CompactionParametersJson• Finish compaction on a couple of nodes before starting the next• Be careful: don’t trigger schema update on this node before compaction around all

nodes finishes• At the end, make schema change permanent by ALTER TABLE


Tuning Knobs to be aware of• concurrent_compactors in yaml (be mindful of heap pressure)• compaction_throughput_mb_per_sec in yaml• All other yaml parameters impacting flush frequency

• memtable_heap_space_in_mb

• memtable_offheap_space_in_mb

• memtable_cleanup_threshold

• commitlog_total_space_in_mb

• JVM flag to disable stcs_in_l0 –Dcassandra.disable_stcs_in_l0• sstable_size_in_mb in table schema• Tombstone compaction related parameters in table schema• For TTL’ed data, gc_grace_seconds in table schema• Fan out size (CASSANDRA-7871, CASSANDRA-11550)• Major compaction (caution)• You can always throttle your write throughput• Surprise! Switching to STCS can be an option too


Useful tools• Passive

• nodetool cfstats• nodetool compactionstats• nodetool compactionhistory• Transaction log in the data directory (CASSANDRA-7066)• JMX for monitoring (check out the two MBeans: o.a.c.db.ColumnFamilyStoreMBean,

o.a.c.db.compaction.CompactionManagerMBean)

• Proactive• sstableofflinerelevel (great tool, but don’t expect it to do magic) (CASSANDRA-8301)• Major compaction (performance caveat!) (CASSANDRA-7272)• sstablelevelreset (Weapon of Massive Destruction) (CASSANDRA-5271)• User Defined Compaction - JMX or nodetool (CASSANDRA-10660)


Signs of a struggling LCS compaction• 100s-1000s (or more) of L0 SSTables in cfstats• 1000s (or more) of pending compactions in compactionstats• Heavy GC activities that are not driven by read/write workload• Much more disk space utilized than the real data volume• Running out of file handles• OOM with snapshot repair


Typical debug.log EntriesTRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:119 - Adding BigTableReader(path='/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big-Data.db') to L3DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,236 CompactionTask.java:249 - Compacted (7f458a20-6976-11e6-9ee3-dd591a930c3b) 1 sstables to [/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-1178-big,] to level=3. 160.000MiB to 160.000MiB (~100% of original) in 12,865ms. Read Throughput = 12.436MiB/s, Write Throughput = 12.436MiB/s, Row Throughput = ~55,379/s. 719,931 total partitions merged to 719,931. Partition merge counts were {1:719931, }TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 0, 0, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:748 - Estimating [0, 225, 2, 0, 0, 0, 0, 0, 0] compactions to do for keyspace1.standard1TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,236 LeveledManifest.java:319 - Compaction score for level 3 is 0.0010000005662441254TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 2 is 1.0000003707408904TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:319 - Compaction score for level 1 is 23.493476694822313TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:573 - Choosing candidates for L1TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 0: 0TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 1: 107TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 2: 0TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,237 LeveledManifest.java:406 - CompactionCounter: 3: 1TRACE [CompactionExecutor:39] 2016-08-23 21:14:07,238 LeveledManifest.java:335 - Compaction candidates for L1 are standard1-913(L1),DEBUG [CompactionExecutor:39] 2016-08-23 21:14:07,238 CompactionTask.java:153 - Compacting (86f10a60-6976-11e6-9ee3-dd591a930c3b) [/home/automaton/cassandra-trunk/data/data/keyspace1/standard1-37b1059068e011e68a2a31a1f422f58b/mc-913-big-Data.db:level=1, ]TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,450 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:149 - Replacing [standard1-674(L2), ]TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:166 - Adding [standard1-1179(L3), ]TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L1 contains 235 SSTables (36.709GiB) in Manifest@546534055TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L2 contains 101 SSTables (15.781GiB) in Manifest@546534055TRACE [CompactionExecutor:71] 2016-08-23 21:14:09,451 LeveledManifest.java:473 - L3 contains 1 SSTables (160.000MiB) in Manifest@546534055© DataStax, All Rights Reserved. 60

Typical compaction.log Entries{"type":"pending","keyspace":"stresscql","table":"blogposts","time":1472851796649,"strategyId":"1","pending":5}{"type":"compaction","keyspace":"stresscql","table":"blogposts","time":1472851799027,"start":"1472851795595","end":"1472851799027","input":[{"strategyId":"1","table":{"generation":3244,"version":"mc","size":169275615,"details":{"level":2,"min_token":"-5663415854288007738","max_token":"-5630042604729221974"}}},{"strategyId":"1","table":{"generation":3970,"version":"mc","size":156101287,"details":{"level":3,"min_token":"-5681368161876797914","max_token":"-5523713966014233820"}}}],"output":[{"strategyId":"1","table":{"generation":4003,"version":"mc","size":170059193,"details":{"level":3,"min_token":"-5681368161876797914","max_token":"-5654579422012458443"}}},{"strategyId":"1","table":{"generation":4008,"version":"mc","size":155296723,"details":{"level":3,"min_token":"-5654551486308333779","max_token":"-5523713966014233820"}}}]}© DataStax, All Rights Reserved. 61

Myth #5

• Why I cannot use up all of my concurrent compactor threads?

• It’s all about overlap• L0 -> L1 can only happen in one thread (most likely)

• L1+ up-level can only happen when L(N) sstable overlaps with a small subset of L(N+1) SSTables

• Some of your SSTables are marked as suspected so cannot participate in compaction


Myth #6

• Why memtable flushes at a much smaller size than the allocated memtable space, causing way more L0 SSTables than expected?

• The threshold for regular memtable flush is calculated based not only on allocated onheap and offheap memtable space, but also on memtable_cleanup_threshold

• memtable_cleanup_threshold by default is calculated based on memtable_flush_writers (inverse proportion)

• it’s recommended to always manually set this value, especially if you’ve manually configured memtable_flush_writers

• Your commitlog total space is too low, which leads commitlog filled up too quickly under heavy writes


Myth #7

• Why pending compaction numbers is often not accurate for LCS?

• It’s a known limitation of the current implementation• Look at this code on how the pending compaction task numbers are calculated (it is an

*estimate*):estimated[i] = (long)Math.ceil((double)Math.max(0L, SSTableReader.getTotalBytes(sstables) - (long)(maxBytesForLevel(i, maxSSTableSizeInBytes) * 1.001)) / (double)maxSSTableSizeInBytes);

• In other words, it’s (total data in the current level – current level’s capacity) / 160MB, then add up individual estimates from all levels

• But it never consider the additional writes needed beyond the current level


https://github.com/apache/cassandra/blob/a7926b33968b9b1497fe281547e0cd0c656c7364/src/java/org/apache/cassandra/db/compaction/LeveledManifest.java%23L770-L776

Myth #8

• Why I am seeing a very low compaction throughput number (MB/s) when I have super fast SSD and plenty of idle CPU?

• L0->L1 can only be single-threaded, so faster CPU helps but more CPU cores won’t for this phase• Part of reporting problem, especially if you’re using compression, see CASSANDRA-11697 (WIP) • Part of throttling problem, see CASSANDRA-12366 (C* 3.10+)• Between 2.1 and 3.0, we’ve improved compaction throughput under stress by 2x, because of

CASSANDRA-8988 (C* 2.2+), CASSANDRA-8920 (C* 2.2+), CASSANDRA-8915 (C* 3.0+).• There are still remaining issues, but are currently being actively worked on, so you should watch

CASSANDRA-11697 closely


Software

The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, DataStax) | Cassandra Summit 2016