Presentation held at Scandinavian Developer Conference, April 2012
A GUIDE TO THE POST RELATIONAL
REVOLUTION
@iconara
speakerdeck.com/u/iconara (real time!)
Theo / @iconara
Chief Architect at
Co-organizer of the local Ruby, Scala and JavaScript user groups
More rep on StackOverflow than both Jeff & Joel
THE WORLD ISN’T FLAT
OUT IS THE NEW UP
when scaling up you’re constrained by Moore’s Law
DISTRIBUTED SYSTEMS ARE
ABOUT TRADEOFFS
WHO NEEDS ACID, ANYWAY?
banks, perhaps
JOINS ARE A CRUTCH
why split up your data, if all you’re going to do is assemble it over and over again?
OBJECTS DON’T FIT IN TABLES
can you say “impedance mismatch”?
40 YEARS IS A LONG TIME
you didn’t have 256 gigabytes of RAM in 1970
THE RELATIONAL MODEL ISN’T A GOLDEN HAMMER
the existence of object relational mappers should be proof enough
WELCOME TO THE POST RELATIONAL
REVOLUTION
POST RELATIONAL STORAGE
KEY/VALUE STORES
the simplest possible database, not exactly a new idea
[Diagram: KEY → VALUE (OPAQUE)]
Riak, Voldemort, LevelDB, Tokyo Cabinet, Berkeley DB
STRUCTURED KEY/VALUE STORES
sometimes you need just a little bit more
the Bigtable model, “column oriented”, “sparse tables” found in Cassandra and HBase
[Diagram: ROW KEY → SORTED COLUMN KEYs, each COLUMN KEY → VALUE + TIMESTAMP]
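The sparse-table layout above can be sketched as a toy in-memory structure. This is only an illustration of the data model, not how Cassandra or HBase store data; `SparseTable`, `put` and `slice` are made-up names.

```python
import bisect
import time


class SparseTable:
    """Toy model: row key -> sorted column keys -> (value, timestamp)."""

    def __init__(self):
        # row key -> (sorted list of column keys, dict of column key -> cell)
        self._rows = {}

    def put(self, row_key, column_key, value, timestamp=None):
        columns, cells = self._rows.setdefault(row_key, ([], {}))
        if column_key not in cells:
            bisect.insort(columns, column_key)  # keep column keys sorted
        ts = time.time() if timestamp is None else timestamp
        current = cells.get(column_key)
        # Last write wins: only a newer (or equal) timestamp replaces the cell.
        if current is None or ts >= current[1]:
            cells[column_key] = (value, ts)

    def slice(self, row_key, start, end):
        """Return (column key, value) pairs with start <= column key < end."""
        columns, cells = self._rows.get(row_key, ([], {}))
        lo = bisect.bisect_left(columns, start)
        hi = bisect.bisect_left(columns, end)
        return [(c, cells[c][0]) for c in columns[lo:hi]]
```

Because the column keys are kept sorted, a range of columns within one row can be read with two binary searches, which is why rows with very many columns stay usable.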
“datastructure server”, e.g. Redis
[Diagram: KEY → VALUE, VALUE, VALUE (LIST OR SET)]
[Diagram: KEY → KEY/VALUE, KEY/VALUE, KEY/VALUE (SORTED SET OR HASH)]
[Diagram: KEY → VALUE (INCREMENT, APPEND, SLICE, CAS)]
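What sets a datastructure server apart from a plain key/value store is that operations such as INCREMENT and CAS (compare-and-set) run atomically on the server. A minimal sketch of the idea in plain Python follows; `ToyStore` is a hypothetical class, not the Redis API, and a lock stands in for the server processing one command at a time.

```python
import threading


class ToyStore:
    """In-memory store with atomic increment and compare-and-set."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def increment(self, key, by=1):
        # Read-modify-write happens under the lock, so no update is lost.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + by
            return self._data[key]

    def compare_and_set(self, key, expected, new):
        # Only write if nobody changed the value since the caller read it.
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True
```

A client that reads a value, computes a new one and calls `compare_and_set` can simply retry when it gets `False` back.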
DOCUMENT DATABASES
object databases, but for hipsters
complex objects with lists, numbers, strings
secondary indexes* and partial updates,
MongoDB, CouchDB, RavenDB, Lotus Notes
* subject to availability
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "cell", "number": "646 555-4567" }
  ]
}
GRAPH DATABASES
relational, for real
traversal algorithms, extreme data complexity, Neo4j, AllegroGraph, FlockDB
[Diagram: a graph of five NODEs; labels: NAME + PROPERTIES, NAME]
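A traversal over named relationships can be sketched as a breadth-first search. The edge list and the names `EDGES`, `neighbours` and `traverse` are hypothetical; a real graph database would index relationships rather than scan a list.

```python
from collections import deque

# Hypothetical data: (from node, relationship name, to node)
EDGES = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "KNOWS", "dave"),
    ("alice", "WORKS_AT", "acme"),
]


def neighbours(node, relationship):
    """Nodes reachable from `node` over one edge with the given name."""
    return [dst for src, rel, dst in EDGES if src == node and rel == relationship]


def traverse(start, relationship, max_depth):
    """Breadth-first traversal; returns {node: depth} for nodes within max_depth."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours(node, relationship):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen
```

"Friends of friends" is then `traverse("alice", "KNOWS", 2)`, a query that would take a self-join per level in a relational database.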
DIVERSITY
I haven’t even mentioned search & indexing systems like Solr and Elastic Search, or distributed filesystems
SOMETIMES TABLES ARE GREAT, TOO
but mostly when you rely heavily on GROUP BY, SUM, AVG, etc. and can’t precompute
POST RELATIONAL SCALING
CAP
CONSISTENCY
AVAILABILITY
PARTITION TOLERANCE
(choose any two)
OK?
PARTITION TOLERANCE ISN’T
OPTIONAL
CONSISTENCY VS. AVAILABILITY
(but in reality, it’s not even that simple)
CONSISTENCY
you can always read what you just wrote, but keys may become unavailable
AVAILABILITY
you can always read and write, but you may not always get the latest value
NOT EITHER OR
most databases let you choose on a query-by-query basis
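One common way this per-query choice is exposed is through read and write quorum sizes. The slides do not say which mechanism is meant, so treat this as an assumption. With N replicas, a read that waits for R replicas and a write that waits for W replicas are guaranteed to overlap when R + W > N. The function names below are made up for the sketch.

```python
def quorums_overlap(replicas, read_quorum, write_quorum):
    """True when every read set shares at least one replica with every write set."""
    return read_quorum + write_quorum > replicas


def describe(replicas, read_quorum, write_quorum):
    """Say which side of the tradeoff a given setting leans towards."""
    if quorums_overlap(replicas, read_quorum, write_quorum):
        return "consistent: a read reaches at least one replica that took the last write"
    return "available: a read may return a stale value"
```

With three replicas, `describe(3, 2, 2)` leans towards consistency, while `describe(3, 1, 1)` keeps working when two replicas are unreachable but may return old data.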
SHARDING
scaling writes in a consistent system
divide the keyspace into shards, or regions (and store each one redundantly)
[Diagram: the KEYSPACE from A to Z, DIVIDED BY DATA SIZE into three SHARDs, each stored on three REPLICAs]
split a shard when it grows too big, move one of the new shards onto a new node
[Diagram: one SHARD is SPLIT, giving four SHARDs over the KEYSPACE from A to Z, each stored on three REPLICAs]
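Range sharding with splits can be sketched in a few lines. This toy version keeps everything in one process and ignores replicas; `RangeShardedStore` and `max_keys_per_shard` are hypothetical names, and a real system would split on data size rather than key count.

```python
import bisect


class RangeShardedStore:
    """Each shard owns the keys from its lower bound up to the next shard's bound."""

    def __init__(self, max_keys_per_shard=4):
        self.max_keys = max_keys_per_shard
        self.bounds = [""]   # sorted lower bounds, one per shard
        self.shards = [{}]   # shards[i] holds keys in [bounds[i], bounds[i + 1])

    def _index(self, key):
        # The owning shard is the last one whose lower bound is <= key.
        return bisect.bisect_right(self.bounds, key) - 1

    def get(self, key):
        return self.shards[self._index(key)].get(key)

    def put(self, key, value):
        i = self._index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) > self.max_keys:
            self._split(i)

    def _split(self, i):
        keys = sorted(self.shards[i])
        middle = keys[len(keys) // 2]  # split point: the median key
        upper = {k: v for k, v in self.shards[i].items() if k >= middle}
        self.shards[i] = {k: v for k, v in self.shards[i].items() if k < middle}
        # In a real system the new shard would now be moved onto a new node.
        self.bounds.insert(i + 1, middle)
        self.shards.insert(i + 1, upper)
```

Note that sequential keys always land in the last shard, so only that shard ever grows and splits. This is one reason key design matters for sharding.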
in reality there are chunks, tablets or “virtual shards” that are distributed over physical shards
[Diagram: the KEYSPACE from A to Z mapped onto four SHARDs, each stored on three REPLICAs]
HBASE, MONGODB
sharding is easy in theory, hard in practice, lots of data needs to be moved when adding nodes
CONSISTENT HASHING
scaling writes in an available system
each node is responsible for a range of the keyspace,
keys are hashed and mapped to the first following node,
(optionally) replicated to subsequent nodes
[Diagram: the KEYSPACE as a ring from 0 to 2^n with four NODEs; hash(key) lands on the ring, with replication to the following nodes]
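The ring lookup described above can be sketched as follows. The choice of MD5 and a 2^32 keyspace are assumptions made for the example, and `Ring` and `nodes_for` are hypothetical names.

```python
import bisect
import hashlib

RING_BITS = 32  # assumption: the keyspace runs from 0 to 2^32


def ring_hash(value):
    """Map a string to a position in [0, 2^RING_BITS)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:RING_BITS // 8], "big")


class Ring:
    def __init__(self, nodes):
        # Each node sits at the position given by hashing its name.
        self._points = sorted((ring_hash(n), n) for n in nodes)

    def nodes_for(self, key, replicas=1):
        """Return the first node at or after hash(key), then the nodes following it."""
        positions = [p for p, _ in self._points]
        start = bisect.bisect_left(positions, ring_hash(key))
        count = min(replicas, len(self._points))
        # The modulo makes the lookup wrap around from 2^n back to 0.
        return [self._points[(start + i) % len(self._points)][1] for i in range(count)]
```

Adding a node only changes the owner of the keys that fall between the new node and its predecessor on the ring; every other key stays where it was.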
[Diagram: the KEYSPACE ring from 0 to 2^n with five NODEs and a NEW NODE added between two of them]
when a new node is added, only part of the keyspace needs to be moved
[Diagram: the KEYSPACE ring from 0 to 2^n with five NODEs]
in practice, “virtual nodes” are evenly distributed over the keyspace, and then mapped onto physical nodes
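A minimal sketch of virtual nodes, assuming a fixed number of equally sized partitions that are handed out to physical nodes (the partition count, the hash function and all names here are made up for the example):

```python
import hashlib
from collections import Counter

PARTITIONS = 64       # assumption: number of virtual nodes, fixed up front
RING_SIZE = 2 ** 32   # assumption: size of the keyspace


def ring_hash(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:4], "big")


def partition_of(key):
    """Virtual nodes are evenly spaced, so the partition is a simple division."""
    return ring_hash(key) * PARTITIONS // RING_SIZE


def assign(nodes):
    """Map each virtual node to a physical node, round-robin."""
    return [nodes[i % len(nodes)] for i in range(PARTITIONS)]


def add_node(assignment, new_node):
    """Give the new node its fair share of virtual nodes, leaving the rest in place."""
    share = len(assignment) // (len(set(assignment)) + 1)
    result = list(assignment)
    counts = Counter(result)
    for i, current in enumerate(result):
        if counts[new_node] >= share:
            break
        if counts[current] > share:  # only take from nodes above the fair share
            result[i] = new_node
            counts[current] -= 1
            counts[new_node] += 1
    return result


def owner(assignment, key):
    return assignment[partition_of(key)]
```

Going from four to five nodes with 64 partitions moves 12 of them. For simplicity this sketch hands over the first partitions it finds; a real system would pick them so that replicas of the same data end up on different machines.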
CASSANDRA, RIAK
perfect balance, in theory, but rings may still need rebalancing
GOSSIP, HINTED HANDOFF, LOG STRUCTURED STORAGE, COMPACTION, VECTOR CLOCKS, READ REPAIR, JOURNALING, QUORUMS, EVENTUAL CONSISTENCY, DYNAMO, MAP/REDUCE, 2PC
a few of the things I haven’t mentioned, look them up
LESSONS LEARNED
EVERYTHING THEY TAUGHT YOU
ABOUT DATABASES AT UNIVERSITY
IS WRONG
almost
THINK ABOUT YOUR QUERIES FIRST
don’t optimize for insertion, denormalize heavily, disk is cheap, this ain’t 1970
GIVE A LOT OF THOUGHT TO YOUR
PRIMARY KEYS
range queries over cleverly designed primary keys can be very powerful,
good keys required for efficient sharding
M04L7NOC5NQS M04L7O05MIU2 M04NX42YFUCR M04NYR7VWKJC M04NZA8MJOOA M04NZB88CT14
M04NZPOCE8DM M04NZQ9G2T0S M04NZQE7E5VX M04NZSK4V3JN M04NZTRG661R M04NZTSUITJ7
M04NZUAILUS5 M04NZUG4DTXN M04NZWB9VV0C M04NZWW52T8N M04NZX2JEVO9 M04NZX7WD77W
M04NZXGOLDEX M04NZXKNQWB3 M04NZXLGJ3M6 M04NZY7GO39G M04NZZ2SQF1I M04O013HN9L9
M04O014DASE6 M04O02PE8AD3 M04O02PGJBR1 M04O03UPTRWG M04O04833ZTL M04O04GH21JF
M04O04JQ8B57 M04O04UHK3U4 M04O056QBNBH M04O05E8XO8N M04O069O8CDK M04O06MG47WK
M04O07BHELVD M04O07F30WYX M04O0B39DGEA
M04NZW B9VV0C
M04NZW = timestamp, 2012-02-28 23:59:56 UTC
B9VV0C = random number, 681 731 004
B9VV0C M04NZW
B9VV0C = random number, 681 731 004
M04NZW = timestamp, 2012-02-28 23:59:56 UTC
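The two halves of the example key are base 36 numbers: M04NZW is the Unix timestamp 1330473596 and B9VV0C is 681731004. A sketch of how such keys could be built follows; the function names are hypothetical, and the reading of the swapped variant (random part first spreads writes, timestamp first keeps keys sorted by time) is an interpretation of the slides, not something they state.

```python
import random
import string
import time

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 followed by A-Z
PART_LENGTH = 6                                    # each half is six characters


def base36(number):
    """Encode a non-negative integer in base 36."""
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def time_first_key(timestamp, nonce):
    """Keys sort by time, so a range query over keys is a query over a time span."""
    return base36(timestamp).rjust(PART_LENGTH, "0") + base36(nonce).rjust(PART_LENGTH, "0")


def random_first_key(timestamp, nonce):
    """Keys created at the same moment end up far apart in the keyspace."""
    return base36(nonce).rjust(PART_LENGTH, "0") + base36(timestamp).rjust(PART_LENGTH, "0")


def new_key():
    # The nonce must stay below 36^6 to fit in six characters.
    return time_first_key(int(time.time()), random.randrange(36 ** PART_LENGTH))
```

`time_first_key(1330473596, 681731004)` gives back the example key `M04NZWB9VV0C`.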
CONSISTENCY IS OVERRATED
when you need it you need it, but most of the time you don’t
DELETING DATA IS NOT TRIVIAL
sometimes delete operations can be more costly than inserts, design your cleaning process early
REDIS, MONGODB, CASSANDRA
our current toolbox
REDIS
swiss army knife, we use it for “virtual memory”, counters and even messaging
REDIS
not distributed (yet), no automatic failover
MONGODB
a very good replacement for MySQL, replication and automatic failover are fantastic
MONGODB
global write lock kills performance, easily fragmented, sharding is complex and (has been) very buggy
MONGODB
we use it for precomputing and storing metrics for our reporting app
MONGODB
we’re currently pushing around 5K updates/s over three replica sets, each update incrementing up to 20 numbers
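An update that increments many numbers at once is a natural fit for MongoDB's `$inc` operator, which accepts dotted field paths. The sketch below only builds such an update document and applies it to a plain dict to show the effect; it does not talk to a database, and the event fields and counter names are invented, not taken from the reporting app.

```python
def build_update(event):
    """Build an update document that bumps every counter the event touches."""
    increments = {
        "views": 1,
        "views_by_hour.%s" % event["hour"]: 1,
        "views_by_country.%s" % event["country"]: 1,
    }
    return {"$inc": increments}


def apply_inc(document, update):
    """Apply the $inc part of an update to a plain dict, creating missing fields."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        target = document
        for name in parents:
            target = target.setdefault(name, {})
        target[leaf] = target.get(leaf, 0) + amount
    return document
```

Each incoming event turns into one update against the document for its reporting period, so the report itself is a single read with nothing left to aggregate.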
CASSANDRA
low level building blocks, no single point of failure, great horizontal scalability, TTL on values
CASSANDRA
we use it to store data about website visits, indexing it to support complex queries
CASSANDRA
millions of rows, some with millions of columns, adding ~1K new every second
one million writes per second
LEARN SOMETHING NEW TODAY
nosql.mypopescu.com
highscalability.com
nosqltapes.com
KTHXBAI
twitter.com/iconara
speakerdeck.com/u/iconara
architecturalatrocities.com
burtcorp.com