Intuitions for Scaling Data-Centric Architectures Ben Stopford Confluent Inc




Page 1

Intuitions for Scaling Data-Centric Architectures

Ben Stopford

Confluent Inc

Page 2

Intuitions for Scale

Intuition does not come to the unprepared mind

A.E.

Page 3

Locality & Sequential Addressing

Page 4

Computers work best with sequential workloads

Disk buffer

Page cache

L3 cache

L2 cache

L1 cache

Pre-fetch is your friend

Page 5

Random vs. Sequential Addressing

Random: 300 reads/sec vs. Sequential: 200 MB/s

e.g. sequential is ~7000x faster for 100-byte rows
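
The ~7000x figure can be checked with a little arithmetic. The sketch below assumes that "100B rows" means rows of 100 bytes each and that every random read fetches exactly one row; both are assumptions, not statements from the slide.

```python
# Back-of-envelope check of the slide's numbers.
# Assumptions: rows are 100 bytes; one random read returns one row.
RANDOM_READS_PER_SEC = 300            # random access: rows per second
SEQUENTIAL_BYTES_PER_SEC = 200e6      # sequential access: 200 MB/s
ROW_SIZE_BYTES = 100


def sequential_rows_per_sec(bytes_per_sec: float, row_size: int) -> float:
    """Rows streamed per second when reading sequentially."""
    return bytes_per_sec / row_size


def speedup(bytes_per_sec: float, row_size: int, random_reads: float) -> float:
    """How many times more rows per second sequential access delivers."""
    return sequential_rows_per_sec(bytes_per_sec, row_size) / random_reads


ratio = speedup(SEQUENTIAL_BYTES_PER_SEC, ROW_SIZE_BYTES, RANDOM_READS_PER_SEC)
print(f"sequential is ~{ratio:,.0f}x faster")
```

Under these assumptions sequential access streams 2 million rows per second, which is roughly 6,700 times the random rate and consistent with the rounded ~7000x on the slide.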

Page 6

This isn’t just Disk

L3

L2

L1

Random RAM ~ Sequential Disk

10-100x

Page 7

Files

Page 8

We can write sequentially to a file quickly

Page 9

Reading Efficiently

Scan

Position & Scan (pages)

Page 10

Avoid Random Reads

Page 11

Writing Tradeoffs

Append-Only Journal (Sequential IO)

Update in Place, Ordered File (Random IO)

[Figure: versions v1 and v2 of a record under each approach]
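
The two write strategies can be contrasted with a small sketch. It uses in-memory `io.BytesIO` buffers as stand-ins for files, and the fixed 16-byte record layout and helper names are hypothetical, chosen only for illustration.

```python
import io

KEY_BYTES = 8
VALUE_BYTES = 8
RECORD_SIZE = KEY_BYTES + VALUE_BYTES  # hypothetical fixed-width record


def encode(key: str, value: str) -> bytes:
    """Pad key and value to fixed widths so records are addressable by slot."""
    return key.encode().ljust(KEY_BYTES)[:KEY_BYTES] + \
        value.encode().ljust(VALUE_BYTES)[:VALUE_BYTES]


def append_to_journal(journal: io.BytesIO, key: str, value: str) -> None:
    """Sequential IO: always write at the end; older versions are retained."""
    journal.seek(0, io.SEEK_END)
    journal.write(encode(key, value))


def update_in_place(ordered: io.BytesIO, slot: int, key: str, value: str) -> None:
    """Random IO: seek to the record's slot and overwrite what is there."""
    ordered.seek(slot * RECORD_SIZE)
    ordered.write(encode(key, value))


def read_records(f: io.BytesIO):
    """Decode every fixed-width record in the buffer."""
    data = f.getvalue()
    return [
        (data[i:i + KEY_BYTES].decode().strip(),
         data[i + KEY_BYTES:i + RECORD_SIZE].decode().strip())
        for i in range(0, len(data), RECORD_SIZE)
    ]


journal = io.BytesIO()
append_to_journal(journal, "bob", "v1")
append_to_journal(journal, "bob", "v2")    # both versions remain in the journal

ordered = io.BytesIO()
update_in_place(ordered, 0, "bob", "v1")
update_in_place(ordered, 0, "bob", "v2")   # v1 is overwritten by v2
```

The journal never seeks backwards, which keeps writes sequential, but a reader has to work out which version is current. The ordered file always holds the latest version, at the price of a seek per write.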

Page 12

Supporting Lookups

Page 13

Add Indexes for Selectivity

bob

dave fred hary mike steve vince

Index

Heap file
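
A minimal sketch of an index over a heap file, assuming Python lists stand in for both on-disk structures. The names `heap_file`, `index_keys` and `index_offsets` are illustrative only.

```python
import bisect

heap_file = []       # rows in arrival order (stand-in for an append-only heap file)
index_keys = []      # keys kept in sorted order
index_offsets = []   # heap-file positions, parallel to index_keys


def insert(key: str, row: dict) -> None:
    """Append the row to the heap, then place the key into the ordered index."""
    heap_file.append(row)                        # sequential write
    pos = bisect.bisect_left(index_keys, key)
    index_keys.insert(pos, key)                  # lands wherever the key sorts
    index_offsets.insert(pos, len(heap_file) - 1)


def lookup(key: str):
    """Binary-search the index, then fetch the single matching row."""
    pos = bisect.bisect_left(index_keys, key)
    if pos < len(index_keys) and index_keys[pos] == key:
        return heap_file[index_offsets[pos]]
    return None


for name in ["fred", "bob", "vince", "dave"]:
    insert(name, {"name": name})

print(index_keys)        # keys come out sorted regardless of arrival order
print(lookup("dave"))
```

The heap append is sequential, but each index insert lands at an arbitrary position in the ordered structure. On disk that corresponds to random IO on the write path.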

Page 14

Goodbye Sequential Write Performance

bob

dave fred hary mike steve vince

Random IO

Sequential IO

Page 15

Option A: Put Index in Memory

RAM

Disk

Page 16

Option B: Use a chronology of small index files

Writes

batch up

sort

write to disk

older files

small index file
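
The batch, sort and write steps can be sketched as follows. This is an illustrative toy, not any particular database's implementation: sorted tuples stand in for immutable index files, and the class name and `batch_size` parameter are made up for the example.

```python
import bisect


class ChronologyOfIndexFiles:
    """Buffer writes in memory, then flush each batch as a small sorted 'file'."""

    def __init__(self, batch_size: int = 3):
        self.batch_size = batch_size
        self.buffer = {}    # recent writes, batched up in memory
        self.files = []     # oldest first; each entry is an immutable sorted tuple

    def put(self, key, value) -> None:
        self.buffer[key] = value
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Sort the batch and write it out as one new file (one sequential write)."""
        if self.buffer:
            self.files.append(tuple(sorted(self.buffer.items())))
            self.buffer = {}

    def get(self, key):
        """Check memory first, then files from newest to oldest."""
        if key in self.buffer:
            return self.buffer[key]
        for f in reversed(self.files):
            i = bisect.bisect_left(f, (key,))   # (key,) sorts before (key, value)
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None


store = ChronologyOfIndexFiles(batch_size=3)
for k, v in [("bob", 1), ("dave", 2), ("fred", 3)]:
    store.put(k, v)          # third put triggers a flush
store.put("bob", 9)          # newer version, still in the memory buffer
```

Writes only ever create new files, so they stay sequential. Reads may have to consult several files, newest first, before finding a key.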

Page 17

…with tricks to optimise out the need for random IO

RAM

Disk

file metadata & bloom filter
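
One such trick is a per-file Bloom filter held in RAM, which lets a reader skip files that cannot contain a key. The sketch below is a minimal illustration; the bit-array size, the number of hashes and the use of SHA-256 are arbitrary choices made for the example.

```python
import hashlib


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0    # a Python int used as a bit set

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


def files_worth_reading(key: str, filters: dict) -> list:
    """Return only the file names whose in-memory filter admits the key."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


old_file, empty_file = BloomFilter(), BloomFilter()
for k in ["bob", "dave", "fred"]:
    old_file.add(k)
filters = {"file-001": old_file, "file-002": empty_file}
candidates = files_worth_reading("bob", filters)
```

A "no" from the filter is definitive, so the disk read for that file is skipped entirely. A "yes" only means the file has to be checked.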

Page 18

Log Structured Merge Trees

• A collection of small, immutable indexes

• Append only, de-duplicate by merging files

• Low memory index structures increase read performance

Shifts the problem of random access from a “write” concern to a “read” concern
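
De-duplication by merging can be sketched as a two-way merge of sorted files in which the newer file wins on duplicate keys. Lists of `(key, value)` pairs stand in for files here, and `merge_files` is a hypothetical helper name.

```python
def merge_files(older, newer):
    """Merge two key-sorted (key, value) lists; the newer value wins on ties."""
    merged = []
    i = j = 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            merged.append(older[i])
            i += 1
        elif older[i][0] > newer[j][0]:
            merged.append(newer[j])
            j += 1
        else:
            merged.append(newer[j])   # duplicate key: drop the older version
            i += 1
            j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return merged


older = [("bob", 1), ("fred", 2)]
newer = [("bob", 9), ("dave", 3)]
compacted = merge_files(older, newer)
print(compacted)
```

Both inputs are read front to back and the output is written front to back, so the merge itself needs only sequential IO.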

Page 19

Option C: Brute Force

Columns: A, B, C

File A: A1 A2 A3 A4

File B: B1 B2 B3 B4

File C: C1 C2 C3 C4

‘column per file’ arrangement

same order for each file

Page 20

Option C: Columnar

Merge Join

compressed columns

A: A1 A2 A3 A4

B: B1 B2 B3 B4

C: C1 C2 C3 C4

Page 21

Brute Force, by Column

• Less IO, by column, compressed

• Held in row order => merge joins via rowid

• Predicates can operate on compressed data

• Late materialisation
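
These points can be illustrated with a toy column store. The sketch assumes run-length encoding as the compression scheme (real columnar engines use a range of encodings), and the column names and data are invented for the example.

```python
def rle_encode(values):
    """Run-length encode a column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def matching_rowids(runs, predicate):
    """Evaluate the predicate once per run, without decompressing the column."""
    rowids, start = [], 0
    for value, length in runs:
        if predicate(value):
            rowids.extend(range(start, start + length))
        start += length
    return rowids


# One list per column; every column holds its values in the same row order.
columns = {
    "country": ["UK", "UK", "UK", "US", "US"],
    "amount": [10, 20, 30, 40, 50],
}

country_rle = rle_encode(columns["country"])
rowids = matching_rowids(country_rle, lambda c: c == "US")

# Late materialisation: touch the amount column only for the surviving rowids.
amounts = [columns["amount"][r] for r in rowids]
```

Because every column shares the same row order, a rowid found in one column addresses the matching value in any other column directly.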

Page 22

Many of the most scalable technologies play to one of these core efficiencies

Page 23

Riak, Mongo etc

RAM

Disk

Page 24

Kafka

(Queues are Databases - 1995 Jim Gray)

Page 25

HBase, Cassandra, RocksDB etc

LSM

Page 26

Redshift etc, Parquet (Hadoop)

Columns: A, B, C

A: A1 A2 A3 A4

B: B1 B2 B3 B4

C: C1 C2 C3 C4

Page 27

Parallelism

Partitioning & Replication

Page 28

Partitioning - KV

K-V stores: single-endpoint query routing
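
A minimal sketch of key-based routing, assuming hash partitioning with CRC32. The class and method names are hypothetical; real stores typically use consistent hashing or range partitioning instead.

```python
import zlib


class PartitionedStore:
    """Routes each key to exactly one partition by hashing the key."""

    def __init__(self, num_partitions: int):
        self.partitions = [dict() for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # crc32 is deterministic, so every client routes a key the same way.
        return zlib.crc32(key.encode()) % len(self.partitions)

    def put(self, key: str, value) -> None:
        self.partitions[self.partition_for(key)][key] = value

    def get(self, key: str):
        # A lookup by key touches a single partition (a single endpoint).
        return self.partitions[self.partition_for(key)].get(key)


kv = PartitionedStore(num_partitions=4)
for name in ["bob", "dave", "fred", "vince"]:
    kv.put(name, name.upper())
```

Since a key lookup lands on one partition only, adding partitions adds capacity without any one request getting more expensive.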

Page 29

Partitioning - Batch

Divide and conquer
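
A sketch of the divide-and-conquer idea for batch work: split the input into partitions, compute a partial result per partition in parallel, then combine. A thread pool stands in for a cluster of workers, and word counting is just a convenient example workload.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Partial result for one partition of the input."""
    return Counter(word for line in chunk for word in line.split())


def parallel_word_count(lines, num_partitions: int = 3) -> Counter:
    # Divide: deal the lines out round-robin into num_partitions chunks.
    chunks = [lines[i::num_partitions] for i in range(num_partitions)]
    # Conquer: process every chunk independently.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(count_words, chunks))
    # Combine: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["a b a", "b c", "a", "c c c"]
totals = parallel_word_count(lines)
```

This works because the partial counts are independent and can be merged in any order.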

Page 30

Partitioning: Concurrency Limits

Use of secondary indexes can limit concurrency at scale
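
One way to see the limit: when data is partitioned by primary key, a query on any other attribute has to consult every partition. The sketch below counts partitions touched per query; the four-partition layout and the user/city data are invented for illustration.

```python
import zlib

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]   # user -> city


def partition_for(user: str) -> int:
    return zlib.crc32(user.encode()) % NUM_PARTITIONS


def put(user: str, city: str) -> None:
    partitions[partition_for(user)][user] = city


def get_by_user(user: str):
    """Primary-key lookup: exactly one partition does any work."""
    touched = 1
    return partitions[partition_for(user)].get(user), touched


def find_by_city(city: str):
    """Secondary-attribute query: every partition must be asked."""
    hits, touched = [], 0
    for p in partitions:
        touched += 1
        hits.extend(user for user, c in p.items() if c == city)
    return sorted(hits), touched


for user, city in [("bob", "London"), ("dave", "Paris"),
                   ("fred", "London"), ("vince", "Rome")]:
    put(user, city)
```

Each secondary-index query occupies all partitions at once, so adding partitions does not raise the number of such queries the system can serve concurrently.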

Page 31

Replication

Page 32

Replication

• Replication provides one route out of this.

• Replicas isolate load -> scales out concurrency for general workloads.

• Obviously provides redundancy etc too.

• If async, trades off against consistency (CAP)
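
The asynchronous trade-off can be shown with a toy leader and follower, assuming replication is an explicit step that runs some time after the write. The class and method names are hypothetical.

```python
class Follower:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Accepts writes immediately; ships them to followers later (async)."""

    def __init__(self, followers):
        self.data = {}
        self.followers = followers
        self.pending = []    # writes acknowledged but not yet replicated

    def write(self, key, value) -> None:
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self) -> None:
        """Stand-in for the background replication step."""
        for key, value in self.pending:
            for follower in self.followers:
                follower.data[key] = value
        self.pending.clear()


replica = Follower()
leader = Leader([replica])
leader.write("bob", "v1")
stale_read = replica.read("bob")    # replica has not caught up yet
leader.replicate()
fresh_read = replica.read("bob")    # replica now reflects the write
```

Reads served by the replica take load off the leader, but until replication catches up they can return stale data.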

Page 33

Atomicity & Ordering

These can be expensive

Page 34

Solution: Avoid, Isolate or embrace disorder (Bloom etc)

Atomic (Mutable)

Immutable

Page 35

Circling Synchronous, Mutable State

Trapped in the Persist & Query pattern… in a fully ACID world

Page 36

Separating Paradigms - CQRS

Client

Command

Query

DB

DB

Denormalise / Precompute
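
A minimal CQRS sketch, assuming the command side publishes each accepted change as an event and the query side keeps a precomputed view. The order and total example, and all class and method names, are invented for illustration.

```python
class CommandSide:
    """Validates and records changes; knows nothing about how data is queried."""

    def __init__(self):
        self.log = []
        self.subscribers = []

    def place_order(self, customer: str, amount: int) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        event = {"customer": customer, "amount": amount}
        self.log.append(event)
        for notify in self.subscribers:
            notify(event)


class QuerySide:
    """Maintains a denormalised, precomputed view shaped for reads."""

    def __init__(self):
        self.total_by_customer = {}

    def on_event(self, event: dict) -> None:
        customer = event["customer"]
        self.total_by_customer[customer] = (
            self.total_by_customer.get(customer, 0) + event["amount"]
        )

    def total_for(self, customer: str) -> int:
        return self.total_by_customer.get(customer, 0)


commands, queries = CommandSide(), QuerySide()
commands.subscribers.append(queries.on_event)
commands.place_order("bob", 10)
commands.place_order("bob", 5)
commands.place_order("dave", 7)
```

Queries never scan the command-side log. They read a total that was computed when the event arrived.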

Page 37

DRUID

realtime node

history node

Query hits both

Page 38

Operational / Analytic Bridge

[Diagram labels: DATA; Client (x3); Mutable; Search; SQL; NoSQL; Stream; Immutable Views; denormalise]

Page 39

Lambda Architecture: Separating Stream & Batch

[Diagram labels: All your data; Stream layer (fast); Batch Layer; Serving Layer; Query (x2)]
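
A toy sketch of the Lambda idea, assuming a batch view recomputed over all events up to a cutoff, a stream view covering events after the cutoff, and a query that merges the two. The event shape and function names are hypothetical.

```python
from collections import Counter


def batch_view(events, cutoff: int) -> Counter:
    """Batch layer: recomputed from the full history up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def stream_view(events, cutoff: int) -> Counter:
    """Stream layer: covers only the recent events the batch has not seen."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def query(page: str, batch: Counter, stream: Counter) -> int:
    """Serving side: a query combines the batch and stream results."""
    return batch[page] + stream[page]


events = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "home"},
    {"ts": 3, "page": "about"},
    {"ts": 4, "page": "home"},
]
cutoff = 2
batch = batch_view(events, cutoff)
stream = stream_view(events, cutoff)
```

The two views partition the events by timestamp, so their sum equals a single pass over all the data.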

Page 40

Stream Data Platforms

[Diagram labels: All your data; Kafka; Stream processor; Views; Search; Columnar; Hadoop; Client (x2)]

Page 41

Isolate consistency concerns, Leverage in-flight data, Promote immutable replicas

Sys 1

Sys 2

Sys 3

Stream

Page 42

Things we Like

Page 43

Treating state as an immutable chronology

time

Page 44

Listening and reacting to things as they are written

Page 45

Replaying things that happened before

history

Regenerate state

Enrich views
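
Regenerating state by replay can be sketched as a fold over an immutable event log. The event tuples and the `upto` parameter are illustrative assumptions.

```python
def apply_event(state: dict, event: tuple) -> dict:
    """Return a new state; neither the log nor earlier states are mutated."""
    kind, key, value = event
    new_state = dict(state)
    if kind == "set":
        new_state[key] = value
    elif kind == "delete":
        new_state.pop(key, None)
    return new_state


def replay(log, upto=None) -> dict:
    """Rebuild state from the first `upto` events (all events if None)."""
    state = {}
    for event in log[:upto]:
        state = apply_event(state, event)
    return state


log = [
    ("set", "bob", 1),
    ("set", "dave", 2),
    ("set", "bob", 3),
    ("delete", "dave", None),
]
current = replay(log)
as_of_second_event = replay(log, upto=2)
```

The same log can be replayed into a differently shaped view later, or stopped at an earlier point to see the state as it was then.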

Page 46

Avoiding (or Isolating) the need to mutate

Mutable Immutable

Page 47

Read-optimising the immutable

Denormalise

Page 48

Primitive operations for Shards and Replicas (sync/async)

Page 49

Being able to reason about time in an asynchronous world

Page 50

Blending the utility of different tools in a single data platform

Sys 1

Sys 2

Sys 3

Stream

Page 51

Thanks

slides available @ benstopford.com