SCYLLA: NoSQL at Ludicrous Speedreport.idx365.com/TalkingData/【T112017-数据... · On-heap /...

SCYLLA:

NoSQL at Ludicrous Speed

主讲人：ScyllaDB软件工程师贺俊

Today we will cover:

+ Intro: Who we are, what we do, who uses it

+ Why we started ScyllaDB

+ Why should you care

+ How we made design decisions to achieve no-compromise performance and availability

+ Founded by KVM hypervisor creators+ Q2 2014 - Pivot to the database world+ Q3 2015 - Decloak during Cassandra Summit 2015, Beta+ Q1 2016 - General Availability+ Q3 2016 - First Scylla Summit: 100+ Attendees+ Q1 2017 - Completed B round+ $25MM in funding+ HQs: Palo Alto, CA; Herzelia, Israel+ 42+ employees, hiring!

Introduction

Why?@#$%$$%^?

Scylla benchmark by Samsung

What we do: Scylla, towards the best NoSQL

+ > 1 million OPS per node

+ < 1ms 99% latency

+ Auto tuned

+ Scale up and out

+ Open source

+ Large community (piggyback on Cassandra)

+ Blends in the ecosystem- Spark, Presto, time series, search, ..

Where Scylla is deployed?

Why we started Scylla?

+ Originally it was about performance/efficiency only+ Over time, we understood we can deliver more:

+ SLA between background and foreground tasks+ Work well on any given hardware {back pressure}+ Deliver consistent, low 99th percentile latency+ Reduction in admin effort+ Low latency under the face of failures (hot cache load balancing)+ High observability

Cassandra Scylla

Throughput: Cannot utilize multi-core efficiently Scales linearly - shard-per-core

Latency: High due to Java and JVM’s GC Low and consistent - own cache

Complexity: Intricate tuning and configuration Auto tuned, dynamic scheduling

Admin: Maintenance impacts performance SLA guarantee for admin vs serving

Case study: Document column family

• Outbrain is the world’s largest content discovery platform.

• Over 557 million unique visitors from across the globe.

• 250 billion personalized content recommendations every month.

Outbrain: Cassandra plus Memcache

• First read from memcached, go to Cassandra on misses.

• Pain: 1) Stale data from cache 2) Complexity 3) Cold cache -> C* gets full volume

ReadMicroservice

Write Process

Scylla/Cassandra side by side deployment

• Writes are written in parallel to C* and Scylla

• Reads are done in parallel:

1) Memcached + Cassandra 2) Scylla (no cache at all)

ReadMicroservice

Write Process

Scylla (w/o cache) vs Cassandra + Memcached

Scylla Cassandra Diff%

Requests/Minute

12,000,000 500,000(memcache

handles 11,500,000)

AVG Latency

4 ms 8 ms 2X

Max Latency

8 ms 35 ms 3.5X

Hardware 9 machines 30+9 machines

Cassandra’s latency

Scylla’s latency

What does it mean for a non Cassandra user?

+ Throughput, latency and scale benefits+ Wide range of big data integration: {Kariosdb, Spark,

JanusGraph, Presto, Kafka, Elastic}+ Best HA/DR in the industry. + Stop using caches in front of the database+ Consolidate HBase, Redis, MySQL, Mongo and others

Assorted Quotes

Design decisions: #1 The trivials

• SSTable file format• Configuration file format• CQL language• CQL native protocol• JMX management protocol• Management command line

Design decisions: #2 Compatibility

Double cluster - Migration w/o downtime

AppCassandra

Scylla

CQLproxy

Design decisions: #3 All things async

Design decisions: #4 Shard per core

Threads Shards

SCYLLA DB: Network Comparison

Kernel

Cassandra

TCP/IPScheduler

queuequeuequeuequeuequeuethreads

NICQueues

Kernel

Traditional stack Scylla sharded stack

Memory

Application

TCP/IP

Task Schedulerqueuequeuequeuequeuequeuesmp queue

NICQueue

Kernel (isn’t

involved)

Userspace

Application

TCP/IP

NICQueue

Kernel (isn’t

involved)

Userspace

Application

TCP/IP

NICQueue

Kernel (isn’t

involved)

Userspace

CoreDatabase

Task Scheduler

queuequeuequeuequeuequeuesmp queue

NICQueue

Userspace

Scylla has its own task schedulerTraditional stack Scylla’s stack

Promise

Promise is a pointer to eventually computed value

Task is a pointer to a lambda function

Scheduler

Thread

Thread is a function pointer

Stack is a byte array from 64k to megabytes

SCYLLA IS DIFFERENT

p DMAp Log structured

merge treep DBaware cachep Userspace I/O

scheduler

p NUMA friendlyp Log structured

allocatorp Zero copy

p Thread per corep Lock-freep Task schedulerp Reactor programingp C++14

p Multi queue p Poll modep Userspace TCP/IP

Scylla vs C* latency by Kenshoo

Design Decision: #5 Unified cacheCassandra Scylla

Key cache

Row cache

On-heap /Off-heap

Linux page cache

SSTables

Unified cache

SSTables

Complex Tuning

Cassandra Streaming configuration

Design decisions: #6 I/O scheduler

Scylla I/O Scheduling

Commitlog

Compaction

UserspaceI/O

SchedulerDisk

Max useful disk concurrency

I/O queued in FS/deviceNo queues

I/O scheduler result by Kenshoo

Memtable

Seastar SchedulerCompaction

Repair

Commitlog

Compaction Backlog Monitor

Memory Monitor

Adjust priority

Design Decision: #7 Workload conditioning

Workload Conditioning in practice

Disk can’t keep up:workload conditioning will figure out the right request rate

Upcoming releases+ Enterprise release, based on 1.6+ 1.7 - May 2017

▪ Counters

▪ New intra-node sharding algorithm

▪ SStableloader from 2.2/3.x

▪ Debian

+ 2.0 – Sep 2017▪ Materialized views

▪ Execution blocks (cpu cache optimization which boost performance)

▪ Partial row cache (for wide row streaming)

▪ Heat Weighted Load Balancing

Vertical HorizontalCoredatabase

Scylla Beyond Cassandra

Resources

slideshare.net/ScyllaDB

dor@scylladb.com (@DorLaor)

avi@scylladb.com (@AviKivity)

@scylladb

http://bit.ly/2oHAfok

youtube.com/c/scylladbgithub.com/scylladb/scylla

scylladb.com/blog

THANKSSCYLLA: NoSQL at Ludicrous Speed

SCYLLA: NoSQL at Ludicrous Speedreport.idx365.com/TalkingData/【T112017-数据... · On-heap /...

Documents

binary heap, d-ary heap, binomial heap, amortized analysis

More sorting algorithms: Heap sort & Radix sort. Heap Data Structure and Heap Sort (Chapter 7.6)

Statically Inferring Complex Heap, Array, and Numeric ...@R data structures already discussed, the variable maps I 6next Fig.1. thttpd’s cache data structure. The thttpd cache maps

heap binomiali

Binomial Heap

Heap Off Memory WTF ?? Reducing Heap memory stress · PDF fileAbstract * Java memory fundamental * heap off memory principles * heap-off cache with Apache DirectMemory

skew heap, leftist heap - Department of Computer Science and

Procedures on Heap - WordPress.com … · 28/06/2016 · Heap Sort Algorithm The heap sort algorithm starts by using procedure BUILD-HEAP to build a heap on the input array A[1

Windows Heap Overflows - orkspace.net · Windows Heap Overflows Introduction This presentation will examine how to exploit heap based buffer overflows on the Windows platform. Heap

Heap concetti ed applicazioni. maggio 2002ASD - Heap2 heap heap = catasta condizione di heap 1.albero binario perfettamente bilanciato 2.tutte le foglie

Heap Management

Heap Spray

Imogen Heap

S HEAP SIMON HEAP SIMON HEAP AGE: 21 EYE COLOR: … · 2017-01-18 · HEAP SIMON HEAP SIMON HEAP AGE: 21 EYE COLOR: Greenish-black HOME ADDRESS: The Observatory, the Badlands DISTINGUISHING

Windows Kernel Internals User-mode Heap Manager Heap Stats 0:006> !heap -s The process has the following heap extended settings 00000008: - Low Fragmentation Heap activated for all

HEAP Chart

Fibonacci Heap. p2. Procedure Binary heap (worst-case) Binomial heap (worst-case) Fibonacci heap (amortized) MAKE-HEAP (1) INSERT (lg n) O(lg n)

Heap Cache Exploitation - White Paper by IBM Internet ...the-eye.eu/.../index-of/index-of.es/EBooks/HeapCacheExploitation.pdf · • Advanced Windows Debugging Sample Chapter (Hewardt

HEAP DATA ALLOCATION TO SCRATCH-PAD … › ~barua › AngelDominguez-PhD-Thesis.pdf10.9 Details on cache experiments with original JEC benchmarks . . . . . 278 10.10Details on cache

CMSC 341 Binomial Queues and Fibonacci Heaps. Basic Heap Operations OpBinary Heap Leftist Heap Binomial Queue Fibonacci Heap insertO(lgN) deleteMinO(lgN)