Intuitions for Scaling Data-Centric Architectures Ben Stopford Confluent Inc

Intuitions for Scale

Intuition does not come to the unprepared mind

Locality & Sequential Addressing

Computers work best with sequential workloads

Disk buffer

Page cache

L3 cache

L2 cache

L1 cache

Pre-fetch is your friend

Random vs. Sequential Addressing

Random: ~300 reads/sec. Sequential: ~200 MB/s

e.g. sequential is ~7000x faster for 100-byte rows
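The ~7000x figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes the slide's "100B rows" means 100-byte rows and that each random read fetches exactly one row; the constant names are made up for illustration.

```python
# Figures from the slide; the row size and one-row-per-read model are assumptions.
RANDOM_READS_PER_SEC = 300           # random access: one row per read
SEQUENTIAL_BYTES_PER_SEC = 200e6     # sequential access: 200 MB/s
ROW_BYTES = 100

sequential_rows_per_sec = SEQUENTIAL_BYTES_PER_SEC / ROW_BYTES   # 2,000,000 rows/s
speedup = sequential_rows_per_sec / RANDOM_READS_PER_SEC          # roughly 6,667x

print(f"sequential: {sequential_rows_per_sec:,.0f} rows/s, speedup ~{speedup:,.0f}x")
```

Rounded up, that is the "~7000x" quoted on the slide.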

This isn’t just Disk

Random RAM ~ Sequential Disk

10-100x

We can write sequentially to a file quickly

Reading Efficiently

Position & Scan (pages)

Avoid Random Reads
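As a rough illustration of sequential writes plus "position and scan" reads, the sketch below appends length-prefixed records to a journal and reads them back by seeking once and then scanning forward. The record format and the `append`/`scan` helpers are assumptions made for the example, and an in-memory `io.BytesIO` stands in for a real file.

```python
import io
import struct

def append(journal, payload: bytes) -> int:
    """Append one length-prefixed record and return the offset it was written at."""
    journal.seek(0, io.SEEK_END)
    offset = journal.tell()
    journal.write(struct.pack(">I", len(payload)) + payload)
    return offset

def scan(journal, start: int = 0):
    """Position once at `start`, then read records strictly in order."""
    journal.seek(start)
    while True:
        header = journal.read(4)
        if len(header) < 4:          # end of journal
            return
        (length,) = struct.unpack(">I", header)
        yield journal.read(length)

journal = io.BytesIO()               # stand-in for a file opened in binary mode
offsets = [append(journal, name.encode()) for name in ["dave", "fred", "mike"]]
print(list(scan(journal, offsets[1])))   # [b'fred', b'mike']
```

Every write lands at the end of the journal, and a read costs one seek followed by a purely sequential pass.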

Writing Tradeoffs

Append-Only Journal (Sequential IO)

Update-in-Place Ordered File (Random IO)

Supporting Lookups

Add Indexes for Selectivity

dave fred hary mike steve vince

Heap file

Goodbye Sequential Write Performance

dave fred hary mike steve vince

Random IO

Sequential IO

Option A: Put Index in Memory
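One way to picture Option A: keep the data in an append-only log and hold only a small key-to-offset map in memory. The `IndexedLog` class below is a hypothetical sketch of that idea, not any particular product's design, and again uses `io.BytesIO` in place of a file.

```python
import io

class IndexedLog:
    def __init__(self):
        self.log = io.BytesIO()    # stand-in for an append-only file on disk
        self.index = {}            # in memory: key -> (offset, length) of the latest value

    def put(self, key: str, value: bytes) -> None:
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(value)                  # sequential write; nothing is overwritten
        self.index[key] = (offset, len(value))

    def get(self, key: str):
        if key not in self.index:
            return None
        offset, length = self.index[key]
        self.log.seek(offset)                  # one positioned read per lookup
        return self.log.read(length)

store = IndexedLog()
store.put("dave", b"v1")
store.put("fred", b"v1")
store.put("dave", b"v2")       # appended; the index now points at the newer copy
print(store.get("dave"))       # b'v2'
```

Writes stay sequential because the only random update happens in the in-memory map, at the cost of needing enough RAM for every key.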

Option B: Use a chronology of small index files

Writes

batch up

write to disk

older files

small index file

…with tricks to optimise out the need for random IO

file metadata & bloom filter
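A Bloom filter kept alongside each small index file lets a reader skip files that definitely do not contain a key. The following is a minimal sketch; the bit-array size, the number of hashes, and the use of SHA-256 to derive positions are all assumptions chosen for brevity rather than tuned values.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bloom = BloomFilter()
bloom.add("dave")
print(bloom.might_contain("dave"))   # True
```

A `False` answer means the file can be skipped without any IO; a `True` answer may occasionally be a false positive, so the file still has to be checked.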

Log Structured Merge Trees

• A collection of small, immutable indexes

• Append only, de-duplicate by merging files

• Low memory index structures increase read performance

Shift problem of Random Access from “write” to “read” concern
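The bullets above can be sketched as a toy log-structured merge tree: writes are batched in memory, flushed as immutable sorted runs, read newest-first, and de-duplicated by merging. `TinyLSM` and its tiny `memtable_limit` are hypothetical; real implementations add a write-ahead log, Bloom filters, and levelled compaction.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}       # recent writes, batched up in memory
        self.segments = []       # immutable (keys, values) runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        items = sorted(self.memtable.items())   # on disk: one sequential write
        self.segments.append((tuple(k for k, _ in items), tuple(v for _, v in items)))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.segments):   # newest run wins
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def compact(self):
        merged = {}
        for keys, values in self.segments:       # oldest first, newer overwrite older
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.segments = [(tuple(k for k, _ in items), tuple(v for _, v in items))] if items else []

db = TinyLSM(memtable_limit=2)
for k, v in [("dave", 1), ("fred", 2), ("dave", 3), ("mike", 4)]:
    db.put(k, v)
print(db.get("dave"), len(db.segments))   # 3 2
db.compact()
print(db.segments)   # [(('dave', 'fred', 'mike'), (3, 2, 4))]
```

Note how a read may have to consult several runs, which is the sense in which the random-access problem moves from the write path to the read path.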

Option C: Brute Force

A1 A2 A3 A4

‘column per file’ arrangement

same order for each file

Option C: Columnar

Merge Join

compressed columns

A2 A3 A4

Brute Force, by Column

• Less IO, by column, compressed

• Held in row order => merge joins via rowid

• Predicates can operate on compressed data

• Late materialisation
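To make the columnar points concrete, the sketch below stores a made-up table one column at a time, run-length encodes one column, evaluates a predicate directly on the compressed runs, and only materialises the requested column at the end. The table contents, the choice of run-length encoding, and the helper names are assumptions for illustration.

```python
# Hypothetical table stored column by column; the list position is the row id.
columns = {
    "name":    ["dave", "fred", "hary", "mike", "steve", "vince"],
    "country": ["UK", "UK", "UK", "US", "US", "US"],
    "balance": [10, 250, 40, 300, 5, 120],
}

def run_length_encode(values):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def matching_row_ids(runs, wanted):
    """Evaluate an equality predicate on the compressed runs, without decompressing."""
    row_id, matches = 0, []
    for value, length in runs:
        if value == wanted:
            matches.extend(range(row_id, row_id + length))
        row_id += length
    return matches

country_rle = run_length_encode(columns["country"])     # [['UK', 3], ['US', 3]]
us_rows = matching_row_ids(country_rle, "US")            # [3, 4, 5]
rich_rows = [r for r in us_rows if columns["balance"][r] > 100]   # touches one more column
result = [columns["name"][r] for r in rich_rows]         # late materialisation by row id
print(result)   # ['mike', 'vince']
```

Because every column is held in the same row order, the row id alone is enough to join the columns back together.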

Many of the most scalable technologies play to one of these core efficiencies

Riak, Mongo etc

(Queues are Databases - 1995 Jim Gray)

HBase, Cassandra, RocksDB etc

Redshift etc, Parquet (Hadoop)

A2 A3 A4

Parallelism

Partitioning & Replication

Partitioning - KV

K-V stores

single endpoint query routing

Partitioning - Batch

Divide and conquer

Partitioning: Concurrency Limits

Use of secondary indexes can limit concurrency at
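The difference between a routed key lookup and a secondary-attribute query can be sketched as follows. The partition count, the CRC32-based routing, and the sample records are all hypothetical; the point is only that a key lookup touches one partition while a query on another field has to ask every partition.

```python
import zlib

NUM_PARTITIONS = 4    # hypothetical cluster size

def partition_for(key: str) -> int:
    """Stable hash routing: the same key always maps to the same partition."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = [dict() for _ in range(NUM_PARTITIONS)]

def put(key: str, record: dict) -> None:
    partitions[partition_for(key)][key] = record

def get_by_key(key: str):
    """Primary-key lookup: exactly one partition is consulted."""
    return partitions[partition_for(key)].get(key)

def find_by_field(field: str, value):
    """Secondary-attribute query: no routing information, so all partitions are scanned."""
    return sorted(k for p in partitions for k, rec in p.items() if rec.get(field) == value)

put("dave", {"city": "London"})
put("fred", {"city": "Paris"})
put("mike", {"city": "London"})
print(get_by_key("fred"))                  # {'city': 'Paris'}
print(find_by_field("city", "London"))     # ['dave', 'mike']
```

Since every secondary-attribute query occupies all partitions at once, adding partitions does not add capacity for that kind of query.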

Replication

• Replication provides one route out of this.

• Replicas isolate load -> scales out concurrency for general workloads.

• Obviously provides redundancy etc too.

• If async, trades off against consistency (CAP)

Atomicity & Ordering

These can be expensive

Solution: Avoid, Isolate or embrace disorder (Bloom etc)

Atomic (Mutable)

Immutable

Circling Synchronous, Mutable State

Trapped in the Persist & Query pattern… in a fully ACID world

Separating Paradigms - CQRS

Client

Command

DB

DB

Denormalise

Precompute
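A minimal sketch of the CQRS split, assuming an order-tracking example that is not from the slides: commands are validated and recorded on the write side, a denormalised view is precomputed from them, and queries read only that view. All function and field names are hypothetical.

```python
from collections import defaultdict

events = []                               # write side: append-only record of accepted commands
totals_by_customer = defaultdict(int)     # read side: denormalised, precomputed view

def project(event: dict) -> None:
    """Keep the read view in step with the events (this could run asynchronously)."""
    totals_by_customer[event["customer"]] += event["amount"]

def handle_place_order(customer: str, amount: int) -> None:
    """Command path: validate, record the fact, then update the view."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    event = {"type": "OrderPlaced", "customer": customer, "amount": amount}
    events.append(event)
    project(event)

def query_total(customer: str) -> int:
    """Query path: reads the precomputed view only, never the write model."""
    return totals_by_customer.get(customer, 0)

handle_place_order("dave", 10)
handle_place_order("dave", 5)
print(query_total("dave"), len(events))   # 15 2
```

Because the read view is derived, it can be reshaped or rebuilt without touching the command path.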

realtime node

history node

Query hits both

Operational / Analytic Bridge

Client

Client

Mutable

Search

NoSQL

Stream

Immutable Views

denormalise

Lambda Architecture: Separating Stream & Batch

Data

Stream layer (fast)

Batch Layer

Serving Layer

Query
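The Lambda arrangement can be sketched with a page-view counter: a batch layer recomputes a view from the full history, a stream layer counts only what has arrived since, and the serving layer merges the two at query time. The event shape and the names `batch_view`, `SpeedLayer`, and `query` are assumptions for the example.

```python
from collections import Counter

def batch_view(events):
    """Batch layer: recompute counts from the full, immutable history."""
    return Counter(e["page"] for e in events)

class SpeedLayer:
    """Stream layer: incrementally counts events that arrived after the last batch run."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["page"]] += 1

def query(page, batch, speed):
    """Serving layer: merge the batch view with the recent stream view."""
    return batch[page] + speed.counts[page]

history = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
batch = batch_view(history)

speed = SpeedLayer()
speed.on_event({"page": "/home"})      # arrived after the batch job ran

print(query("/home", batch, speed))    # 3
print(query("/docs", batch, speed))    # 1
```

When the next batch run absorbs the recent events, the stream layer's counts for that period would be discarded so they are not counted twice.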

Stream Data Platforms

Views

Client

Search

Columnar

Hadoop

Stream processor

Isolate consistency concerns, Leverage in-flight data, Promote immutable replicas

Stream

Things we Like

Treating state as an immutable chronology

Listening and reacting to things as they are written

Replaying things that happened before

history

Regenerate state

Enrich views
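Replaying history to regenerate state can be sketched as a fold over an immutable event log. The account events and the `apply_event`/`replay` helpers below are hypothetical; the idea is that current state, or state as of any earlier point, is just the result of re-applying the events in order.

```python
def apply_event(state: dict, event: dict) -> dict:
    """Pure function: returns a new state and leaves the old one untouched."""
    new_state = dict(state)
    balance = new_state.get(event["account"], 0)
    if event["type"] == "deposit":
        new_state[event["account"]] = balance + event["amount"]
    elif event["type"] == "withdraw":
        new_state[event["account"]] = balance - event["amount"]
    return new_state

def replay(log, upto=None):
    """Rebuild state from the start of the log, optionally stopping after `upto` events."""
    state = {}
    for event in log[:upto]:
        state = apply_event(state, event)
    return state

log = [
    {"type": "deposit", "account": "dave", "amount": 100},
    {"type": "withdraw", "account": "dave", "amount": 30},
    {"type": "deposit", "account": "fred", "amount": 50},
]

print(replay(log))            # {'dave': 70, 'fred': 50}
print(replay(log, upto=1))    # {'dave': 100}
```

A new or enriched view is built the same way: write a different `apply_event` and replay the same log.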

Avoiding (or Isolating) the need to mutate

Mutable Immutable

Read-optimising the immutable

Denormalise

Primitive operations for Shards and Replicas (sync/async)

Being able to reason about time in an asynchronous world

Blending the utility of different tools in a single data platform

Stream

Thanks

slides available @ benstopford.com
