A Guide to the Post Relational Revolution

A GUIDE TO THE POST RELATIONAL REVOLUTION

@iconara

speakerdeck.com/u/iconara (real time!)

Theo / @iconara

Chief Architect at

Co-organizer of the local Ruby, Scala and JavaScript user groups

More rep on StackOverflow than both Jeff & Joel

THE WORLD ISN’T FLAT

OUT IS THE NEW UP

when scaling up you’re constrained by Moore’s Law

DISTRIBUTED SYSTEMS ARE ABOUT TRADEOFFS

WHO NEEDS ACID, ANYWAY?

banks, perhaps

JOINS ARE A CRUTCH

why split up your data, if all you’re going to do is assemble it over and over again?

OBJECTS DON’T FIT IN TABLES

can you say “impedance mismatch”?

40 YEARS IS A LONG TIME

you didn’t have 256 gigabytes of RAM in 1970

THE RELATIONAL MODEL ISN’T A GOLDEN HAMMER

the existence of object relational mappers should be proof enough

WELCOME TO THE POST RELATIONAL REVOLUTION

POST RELATIONAL STORAGE

KEY/VALUE STORES

the simplest possible database, not exactly a new idea

[diagram: KEY -> VALUE (OPAQUE)]

Riak, Voldemort, LevelDB, Tokyo Cabinet, Berkeley DB

STRUCTURED KEY/VALUE STORES

sometimes you need just a little bit more

the Bigtable model, “column oriented”, “sparse tables” found in Cassandra and HBase

[diagram: ROW KEY -> SORTED pairs of COLUMN KEY and VALUE + TIMESTAMP]
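The row key / sorted column key / timestamped value layout can be sketched in a few lines. This is a toy in-memory model, not the API of Cassandra or HBase; the class name `SparseTable`, its methods, and the newest-timestamp-wins rule are assumptions made for illustration.

```python
from bisect import bisect_left, insort


class SparseTable:
    """Toy model: row key -> sorted column keys -> (value, timestamp)."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column_key, value, timestamp):
        row = self.rows.setdefault(row_key, {"columns": [], "cells": {}})
        if column_key not in row["cells"]:
            insort(row["columns"], column_key)  # keep column keys sorted
        current = row["cells"].get(column_key)
        # Assumed conflict rule: the cell with the newest timestamp wins.
        if current is None or timestamp >= current[1]:
            row["cells"][column_key] = (value, timestamp)

    def slice(self, row_key, start, end):
        """Return (column key, value) pairs with start <= column key < end."""
        row = self.rows.get(row_key, {"columns": [], "cells": {}})
        lo = bisect_left(row["columns"], start)
        hi = bisect_left(row["columns"], end)
        return [(c, row["cells"][c][0]) for c in row["columns"][lo:hi]]


table = SparseTable()
table.put("user:1", "name", "John", timestamp=3)
table.put("user:1", "visit:2012-02-28", "/pricing", timestamp=2)
table.put("user:1", "visit:2012-02-27", "/home", timestamp=1)
# ";" sorts right after ":" so this range covers every "visit:" column.
visits = table.slice("user:1", "visit:", "visit;")
```

Because columns are kept sorted, a row can be sparse (each row has only the columns it needs) and still answer range scans over column keys cheaply.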

“datastructure server”, e.g. Redis

[diagram: KEY -> LIST OR SET of VALUEs; KEY -> SORTED SET OR HASH of KEY/VALUE pairs; KEY -> VALUE]

INCREMENT, APPEND, SLICE, CAS
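What sets a data structure server apart from an opaque key/value store is that the operations run on the server, next to the data. A rough in-memory stand-in for the four operations named above; `ToyStructureServer` and its method names are hypothetical and are not Redis commands.

```python
class ToyStructureServer:
    """In-memory stand-in: values are typed, operations run server-side."""

    def __init__(self):
        self.data = {}

    def increment(self, key, by=1):
        # Counter update without a read-modify-write round trip.
        self.data[key] = self.data.get(key, 0) + by
        return self.data[key]

    def append(self, key, value):
        # Push onto a list stored under the key; returns the new length.
        self.data.setdefault(key, []).append(value)
        return len(self.data[key])

    def slice(self, key, start, stop):
        # Read only part of a list instead of the whole value.
        return self.data.get(key, [])[start:stop]

    def compare_and_set(self, key, expected, new):
        # CAS: write only if nobody changed the value in the meantime.
        if self.data.get(key) != expected:
            return False
        self.data[key] = new
        return True


server = ToyStructureServer()
server.increment("pageviews")
server.append("recent", "/home")
server.compare_and_set("leader", None, "node-a")
```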

DOCUMENT DATABASES

object databases, but for hipsters

complex objects with lists, numbers, strings,

secondary indexes* and partial updates,

MongoDB, CouchDB, RavenDB, Lotus Notes

* subject to availability

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "cell", "number": "646 555-4567" }
  ]
}
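A partial update changes one field inside a document like this one without rewriting the whole object. A minimal sketch on a plain Python dict; `set_path` and its dotted-path syntax are hypothetical helpers for illustration, not the API of any of the databases listed.

```python
def set_path(document, path, value):
    """Set one nested field, addressed by a dotted path, in place."""
    *parents, leaf = path.split(".")
    node = document
    for part in parents:
        # Numeric path parts index into lists, all others into dicts.
        node = node[int(part)] if isinstance(node, list) else node[part]
    if isinstance(node, list):
        node[int(leaf)] = value
    else:
        node[leaf] = value


person = {
    "firstName": "John",
    "address": {"city": "New York", "state": "NY"},
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "cell", "number": "646 555-4567"},
    ],
}
set_path(person, "address.city", "Boston")
set_path(person, "phoneNumber.1.number", "646 555-0000")
```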

GRAPH DATABASES

relational, for real

traversal algorithms, extreme data complexity, Neo4j, AllegroGraph, FlockDB

[diagram: a graph of five NODEs; labels: NAME + PROPERTIES, NAME]
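A traversal can be sketched as a breadth-first walk over named edges. This assumes that nodes carry properties and edges carry a name; the sample data and the function `reachable` are made up for illustration and do not reflect any graph database's query API.

```python
from collections import deque

# Hypothetical data: node name -> properties, node name -> named edges.
nodes = {
    "alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41},
    "dave": {"age": 35}, "erin": {"age": 28},
}
edges = {
    "alice": [("KNOWS", "bob"), ("KNOWS", "carol")],
    "bob": [("KNOWS", "dave")],
    "carol": [("WORKS_WITH", "erin")],
    "dave": [("KNOWS", "erin")],
}


def reachable(start, edge_name, max_depth):
    """Breadth-first traversal along edges with the given name."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for name, target in edges.get(node, []):
            if name == edge_name and target not in seen:
                seen.add(target)
                found.append(target)
                queue.append((target, depth + 1))
    return found


friends_of_friends = reachable("alice", "KNOWS", max_depth=2)
```

In a table-based schema each hop would be another self-join; here each hop is a lookup of a node's own edge list.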

DIVERSITY

I haven’t even mentioned search & indexing systems like Solr and Elastic Search, or distributed filesystems

SOMETIMES TABLES ARE GREAT, TOO

but mostly when you rely heavily on GROUP BY, SUM, AVG, etc. and can’t precompute

POST RELATIONAL SCALING

CAP

CONSISTENCY

AVAILABILITY

PARTITION TOLERANCE

(choose any two)

OK?

PARTITION TOLERANCE ISN’T OPTIONAL

CONSISTENCY VS. AVAILABILITY

(but in reality, it’s not even that simple)

CONSISTENCY

you can always read what you just wrote, but keys may become unavailable

AVAILABILITY

you can always read and write, but you may not always get the latest value

NOT EITHER OR

most databases let you choose on a query-by-query basis
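One common way to make this per-query choice is with quorums: with N replicas, a write waits for W acknowledgements and a read asks R replicas, and whenever R + W > N every read overlaps the latest write. The sketch below is a toy simulation under that assumption; the functions `write` and `read` are hypothetical, and a real system would also bring the lagging replicas up to date in the background.

```python
def write(replicas, value, version, w):
    """Return as soon as w replicas hold the new version."""
    for i in range(w):
        replicas[i] = (version, value)


def read(replicas, r):
    """Ask r replicas (r >= 1) and return the value with the highest version.

    Worst case for freshness: the r replicas contacted are the ones
    least likely to have seen the latest write (the tail of the list).
    """
    contacted = replicas[len(replicas) - r:]
    return max(contacted)[1]


replicas = [(0, "old")] * 3                # N = 3
write(replicas, "new", version=1, w=2)     # W = 2

consistent = read(replicas, r=2)  # R + W = 4 > N: must see "new"
fast = read(replicas, r=1)        # R + W = 3 = N: may see "old"
```

Small R and W keep reads and writes available and fast when replicas are down or slow; larger values trade that away for reading what was just written.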

SHARDING

scaling writes in a consistent system

divide the keyspace into shards, or regions (and store each one redundantly)

[diagram: KEYSPACE from A to Z, DIVIDED BY DATA SIZE into three SHARDs, each stored as three REPLICAs]

split a shard when it grows too big, move one of the new shards onto a new node

[diagram: KEYSPACE from A to Z; one of three SHARDs is SPLIT, giving a fourth SHARD with its own three REPLICAs]
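The split rule can be sketched as follows. This is a single-process toy with no replicas and no data movement between machines; the class `RangeShards`, the `max_keys` threshold and the split-at-the-middle policy are assumptions, and keys are assumed to be unique.

```python
from bisect import bisect_right, insort


class RangeShards:
    """Keyspace divided into contiguous ranges; a shard splits when too big."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.starts = [""]   # lower bound of each shard, kept sorted
        self.shards = [[]]   # sorted keys held by each shard

    def shard_index(self, key):
        # The owning shard is the last one whose lower bound is <= key.
        return bisect_right(self.starts, key) - 1

    def insert(self, key):
        i = self.shard_index(key)
        shard = self.shards[i]
        insort(shard, key)
        if len(shard) > self.max_keys:
            mid = len(shard) // 2
            # Split: the upper half becomes a new shard, which could
            # then be moved onto a new node.
            self.starts.insert(i + 1, shard[mid])
            self.shards.insert(i + 1, shard[mid:])
            self.shards[i] = shard[:mid]


cluster = RangeShards(max_keys=4)
for key in ["a", "b", "c", "d", "e"]:
    cluster.insert(key)
# The fifth insert overflows the only shard and splits it at "c".
```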

in reality there are chunks, tablets or “virtual shards” that are distributed over physical shards

[diagram: KEYSPACE from A to Z over four SHARDs, each with three REPLICAs]

HBASE, MONGODB

sharding is easy in theory, hard in practice, lots of data needs to be moved when adding nodes

CONSISTENT HASHING

scaling writes in an available system

each node is responsible for a range of the keyspace, keys are hashed and mapped to the first following node, (optionally) replicated to subsequent nodes

[diagram: KEYSPACE ring from 0 to 2^n with four NODEs; hash(key) lands on the ring, with replication to the following nodes]

[diagram: KEYSPACE ring from 0 to 2^n with five NODEs and a NEW NODE]

when a new node is added, only part of the keyspace needs to be moved
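A minimal sketch of such a ring, assuming MD5 as the hash function and node names as the input for node positions (both are illustrative choices, and the class `Ring` is hypothetical). It maps each key to the first following node, lists the subsequent nodes for replication, and shows which keys change owner when a node joins.

```python
import hashlib
from bisect import bisect_left, insort


def ring_position(name):
    """Hash a key or node name onto the ring (here 0 to 2**128 - 1)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self.positions = sorted((ring_position(n), n) for n in nodes)

    def add(self, node):
        insort(self.positions, (ring_position(node), node))

    def replicas_for(self, key, count):
        """The first node at or after hash(key), then the subsequent ones."""
        start = bisect_left(self.positions, (ring_position(key), ""))
        size = len(self.positions)
        return [self.positions[(start + i) % size][1]
                for i in range(min(count, size))]

    def node_for(self, key):
        return self.replicas_for(key, 1)[0]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add("node-e")
moved = [k for k in keys if ring.node_for(k) != before[k]]
# Every key that moved now belongs to the new node; all others stay put.
```

Only the arc between the new node and its predecessor changes hands, which is why `moved` is a subset of one old node's keys rather than a reshuffle of everything.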

[diagram: KEYSPACE ring from 0 to 2^n with five NODEs]

in practice, “virtual nodes” are evenly distributed over the keyspace, and then mapped onto physical nodes
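A sketch of the virtual node idea: the keyspace is cut into a fixed number of equal slices, and only the slice-to-machine table changes when the cluster grows. The constants, the SHA-1 based position, the round-robin assignment and the rebalancing rule in `add_node` are all assumptions for illustration, not how Cassandra or Riak implement it.

```python
import hashlib
from collections import Counter

KEYSPACE_BITS = 32  # assumed: positions run from 0 to 2**32 - 1
VNODES = 64         # assumed: number of equal-sized virtual nodes


def key_position(key):
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big")


def vnode_for(key):
    """Index of the equal-sized slice of the keyspace the key falls in."""
    return (key_position(key) * VNODES) >> KEYSPACE_BITS


def assign(physical_nodes):
    """Map virtual nodes onto physical nodes round-robin."""
    return [physical_nodes[i % len(physical_nodes)] for i in range(VNODES)]


def add_node(assignment, new_node):
    """Give the new node a fair share, taken from nodes above that share."""
    share = VNODES // (len(set(assignment)) + 1)
    counts = Counter(assignment)
    updated = list(assignment)
    taken = 0
    for i, owner in enumerate(assignment):
        if taken == share:
            break
        if counts[owner] > share:
            updated[i] = new_node
            counts[owner] -= 1
            taken += 1
    return updated


assignment = assign(["node-a", "node-b", "node-c", "node-d"])
rebalanced = add_node(assignment, "node-e")
owner = rebalanced[vnode_for("some-key")]
```

With four nodes holding 16 virtual nodes each, adding a fifth moves 12 of the 64 virtual nodes, three from each existing machine, and a key never changes virtual node.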

CASSANDRA, RIAK

perfect balance, in theory, but rings may still need rebalancing

GOSSIP, HINTED HANDOFF, LOG STRUCTURED STORAGE, COMPACTION, VECTOR CLOCKS, READ REPAIR, JOURNALING, QUORUMS, EVENTUAL CONSISTENCY, DYNAMO, MAP/REDUCE, 2PC

a few of the things I haven’t mentioned, look them up

LESSONS LEARNED

EVERYTHING THEY TAUGHT YOU ABOUT DATABASES AT UNIVERSITY IS WRONG

almost

THINK ABOUT YOUR QUERIES FIRST

don’t optimize for insertion, denormalize heavily, disk is cheap, this ain’t 1970

GIVE A LOT OF THOUGHT TO YOUR PRIMARY KEYS

range queries over cleverly designed primary keys can be very powerful, good keys are required for efficient sharding

M04L7NOC5NQS M04L7O05MIU2 M04NX42YFUCR M04NYR7VWKJC M04NZA8MJOOA
M04NZB88CT14 M04NZPOCE8DM M04NZQ9G2T0S M04NZQE7E5VX M04NZSK4V3JN
M04NZTRG661R M04NZTSUITJ7 M04NZUAILUS5 M04NZUG4DTXN M04NZWB9VV0C
M04NZWW52T8N M04NZX2JEVO9 M04NZX7WD77W M04NZXGOLDEX M04NZXKNQWB3
M04NZXLGJ3M6 M04NZY7GO39G M04NZZ2SQF1I M04O013HN9L9 M04O014DASE6
M04O02PE8AD3 M04O02PGJBR1 M04O03UPTRWG M04O04833ZTL M04O04GH21JF
M04O04JQ8B57 M04O04UHK3U4 M04O056QBNBH M04O05E8XO8N M04O069O8CDK
M04O06MG47WK M04O07BHELVD M04O07F30WYX M04O0B39DGEA

M04NZW B9VV0C

timestamp: 2012-02-28 23:59:56 UTC (M04NZW), random number: 681 731 004 (B9VV0C)

B9VV0C M04NZW

random number: 681 731 004 (B9VV0C), timestamp: 2012-02-28 23:59:56 UTC (M04NZW)
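The two halves of the example key decode as base 36 numbers: M04NZW is the Unix timestamp 1330473596 and B9VV0C is 681731004. A sketch of how such keys could be built; the function names, the fixed width of six digits and the use of `random.randrange` are assumptions, not the original generator.

```python
import random
import time

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def base36(number, width):
    """Encode a non-negative integer in base 36, zero-padded to width."""
    digits = ""
    while number:
        number, remainder = divmod(number, 36)
        digits = ALPHABET[remainder] + digits
    return digits.rjust(width, "0")


def make_key(timestamp=None, rand=None, time_first=True):
    """Build a 12 character key from a Unix timestamp and a random number."""
    timestamp = int(time.time()) if timestamp is None else timestamp
    rand = random.randrange(36 ** 6) if rand is None else rand
    parts = [base36(timestamp, 6), base36(rand, 6)]
    return "".join(parts if time_first else reversed(parts))


time_ordered = make_key(1330473596, 681731004)                  # "M04NZWB9VV0C"
spread = make_key(1330473596, 681731004, time_first=False)      # "B9VV0CM04NZW"
```

Time-first keys sort chronologically, so a range query over the key prefix selects a time window. The reversed form presumably gives up that ordering so that consecutive writes spread across the keyspace instead of all landing at one end; that reading of the second slide is an interpretation.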

CONSISTENCY IS OVERRATED

when you need it you need it, but most of the time you don’t

DELETING DATA IS NOT TRIVIAL

sometimes delete operations can be more costly than inserts, design your cleaning process early

REDIS

MONGODB

CASSANDRA

our current toolbox

REDIS

swiss army knife, we use it for “virtual memory”, counters and even messaging

REDIS

not distributed (yet), no automatic failover

MONGODB

a very good replacement for MySQL, replication and automatic failover are fantastic

MONGODB

global write lock kills performance, easily fragmented, sharding is complex and (has been) very buggy

MONGODB

we use it for precomputing and storing metrics for our reporting app

MONGODB

we’re currently pushing around 5K updates/s over three replica sets, each update incrementing up to 20 numbers
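Precomputing metrics means each incoming event bumps a handful of counters in a document shaped like the report, so the report itself becomes a single lookup. A sketch with an in-memory dict standing in for the database; the field names and the functions `record_visit` and `visits_per_hour` are hypothetical and say nothing about the actual schema.

```python
from collections import defaultdict

# One counter document per (site, day); a stand-in for a stored document.
metrics = defaultdict(lambda: defaultdict(int))


def record_visit(site, day, hour, browser):
    """One update that increments several precomputed numbers at once."""
    counters = metrics[(site, day)]
    for field in ("visits", f"hours.{hour}", f"browsers.{browser}"):
        counters[field] += 1


def visits_per_hour(site, day):
    """The report is a read of precomputed counters, not an aggregation."""
    counters = metrics[(site, day)]
    return {h: counters[f"hours.{h}"]
            for h in range(24) if f"hours.{h}" in counters}


record_visit("example.com", "2012-02-28", 23, "firefox")
record_visit("example.com", "2012-02-28", 23, "chrome")
report = visits_per_hour("example.com", "2012-02-28")
```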

CASSANDRA

low level building blocks, no single point of failure, great horizontal scalability, TTL on values

CASSANDRA

we use it to store data about website visits, indexing it to support complex queries

CASSANDRA

millions of rows, some with millions of columns, adding ~1K new every second

one million writes per second

LEARN SOMETHING NEW TODAY

nosql.mypopescu.com

highscalability.com

nosqltapes.com

KTHXBAI

twitter.com/iconara

speakerdeck.com/u/iconara

architecturalatrocities.com

burtcorp.com
