Lightning talk: highly scalable databases and the PACELC theorem


Constraints of Highly Scalable Databases

1. Traditional Databases

Recap of the ACID constraints

“Traditional” databases operate with the Transaction paradigm, which guarantees certain properties:

● (A) Atomicity
● (C) Consistency
● (I) Isolation
● (D) Durability

The ACID Guarantees

1. Atomicity

Each transaction must be “all or nothing” - if any part fails, the whole transaction must be rolled back as if it never happened.

2. Consistency

The end-state of a transaction must follow all the rules defined in the database: data constraints, cascades, triggers, etc.

3. Isolation

The result of two concurrent transactions should be the same as if they had run sequentially.

4. Durability

A transaction, once committed, will survive permanently even if the system fails.

This includes disk crashes, power outages, etc.
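To make the transaction idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the simulated failure are invented for illustration, not taken from the talk:

```python
import sqlite3

# Hypothetical accounts table in an in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated failure before the credit runs")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so both balances are unchanged.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]
```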

How do they do this?

Locking
● Read / write / range locks

Concurrency Control
● 2-phase commit (2PC) and 3PC protocols (a toy 2PC sketch follows below)
● Distributed locks
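As a rough illustration of how two-phase commit coordinates distributed writers, here is a toy sketch; the Participant class, the vote flags, and the in-memory "network" are all invented, and a real implementation would also need timeouts, write-ahead logging, and crash recovery:

```python
# Toy two-phase commit coordinator: illustrative only, no real networking or recovery.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "idle"

    def prepare(self):   # phase 1: "can you commit?"
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):    # phase 2a: everyone voted yes
        self.state = "committed"

    def abort(self):     # phase 2b: at least one "no" vote
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted yes -> tell everyone to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote -> tell everyone to abort.
    for p in participants:
        p.abort()
    return "aborted"


nodes = [Participant("db1"), Participant("db2"), Participant("db3", will_vote_yes=False)]
print(two_phase_commit(nodes))  # -> "aborted", because db3 voted no
```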

But then came the 2000s

And Scale Happened

Traditional RDBMSs were not designed for the needs of modern web applications

Global Scale

Netflix knows which movies you watched, when, at what point(s) you paused and for how long, etc.

It then replicates that data across 3 global data centers.

Volume

In 2008, Facebook had only 100 million users and needed 8,000 shards of MySQL.

Today it has ~1.86 billion users.

Speed

In 2013, Twitter was handling peaks of roughly 150,000 new tweets per second.

What to do?

Scale up! (?)
● Increase memory, cores, CPU
● Cache reads with memcached
● Master-slave replication
● Sharding (caching and sharding are sketched below)
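A rough sketch of two of these techniques, cache-aside reads and hash-based sharding; the in-memory cache and shard dictionaries are stand-ins for memcached and MySQL servers, not any real client API:

```python
import hashlib

# Hypothetical stand-ins for a memcached client and a pool of MySQL shards.
cache = {}                           # pretend this dict is memcached
shards = [dict() for _ in range(8)]  # pretend each dict is a separate MySQL server

def shard_for(key: str) -> dict:
    # Hash-based sharding: the key deterministically picks one of N databases.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

def get_user(user_id: str):
    # Cache-aside: try the cache first, fall back to the owning shard, then populate the cache.
    if user_id in cache:
        return cache[user_id]
    row = shard_for(user_id).get(user_id)
    if row is not None:
        cache[user_id] = row
    return row

def put_user(user_id: str, row: dict):
    shard_for(user_id)[user_id] = row
    cache.pop(user_id, None)         # invalidate so the next read repopulates the cache

put_user("user:42", {"name": "Ada"})
print(get_user("user:42"))           # -> {'name': 'Ada'}
```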

NOT ENOUGH

2. Redefining Constraints

Replacing ACID with BASE

“DBMS research is about ACID (mostly). But we forfeit ‘C’ and ‘I’ for availability, graceful degradation, and performance. This tradeoff is fundamental.”

- Eric Brewer, 2000

Eric Brewer proposed a new set of properties: BASE

● Basically Available: the system is always available for clients (but may not be consistent)

● Soft State: the database is no longer in charge of a “valid” data state; the app is now responsible

● Eventual Consistency: if all goes well, all clients will eventually see the same thing. Probably.
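As a toy illustration of eventual consistency, here is a sketch of two replicas that accept writes independently and later converge via last-write-wins reconciliation; the replica structure and timestamps are invented, and real systems typically use vector clocks or similar:

```python
# Two replicas accept writes independently; a background sync later reconciles them
# by keeping the value with the newest timestamp (last-write-wins).

replica_a = {}  # key -> (timestamp, value)
replica_b = {}

def write(replica, key, value, ts):
    replica[key] = (ts, value)

def anti_entropy(r1, r2):
    # Merge both replicas so each ends up with the newest version of every key.
    for key in set(r1) | set(r2):
        newest = max([r[key] for r in (r1, r2) if key in r], key=lambda tv: tv[0])
        r1[key] = newest
        r2[key] = newest

write(replica_a, "cart", ["book"], ts=1)          # client 1 writes to replica A
write(replica_b, "cart", ["book", "pen"], ts=2)   # client 2 writes to replica B

anti_entropy(replica_a, replica_b)                # "eventually" the replicas converge
print(replica_a["cart"] == replica_b["cart"])     # -> True: both hold (2, ['book', 'pen'])
```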

In the world of BASE parameters, a different set of priorities rules:

● Availability is most important
● Weak consistency (i.e. stale data) is okay
● Approximate answers are okay
● Aggressive (optimistic) algorithms are okay
● Simple, fast, easy evolution of the schema is important

A new set of constraints:

the CAP Theorem

It is impossible for a distributed computer system to simultaneously provide more than 2 of these 3 guarantees:
● Consistency
● Availability
● Partition tolerance

(Eric Brewer, 1998-2000)

The CAP Parameters

1. Consistency*

All clients get the same view of the data, or they get an error

(i.e. every read receives the most recent write)

2. Availability

All clients can always read and always write

(i.e. every request receives a non-error response)

3. Partition tolerance

The system continues to function even if communication between some nodes is lost

(i.e. system operates despite an arbitrary number of messages being dropped by the network between nodes)

All NoSQL databases live somewhere on this spectrum, based on how they’re tuned

ACID ←──────→ BASE

● What levels of availability do you choose to provide?
● What levels of consistency do you choose to provide?
● What do you do when a partition is detected?
● How do you recover from a partition event?
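One concrete knob behind these questions is quorum tuning in Dynamo-style stores: with N replicas, a write waits for W acknowledgements and a read consults R replicas, and choosing R + W > N trades latency and availability for stronger consistency. A small illustrative sketch (the numbers are examples, not any particular product's defaults):

```python
# Dynamo-style quorum tuning: N replicas, a write must reach W of them, a read
# must ask R of them. R + W > N guarantees the read set overlaps the latest write
# (at the cost of availability and latency when replicas are slow or unreachable).

def quorum_overlaps(n: int, r: int, w: int) -> bool:
    return r + w > n

for n, r, w in [(3, 1, 1), (3, 2, 2), (3, 1, 3), (3, 3, 1)]:
    style = "consistent reads" if quorum_overlaps(n, r, w) else "possibly stale reads"
    print(f"N={n} R={r} W={w}: {style}")

# N=3 R=1 W=1: possibly stale reads  (fast and highly available)
# N=3 R=2 W=2: consistent reads      (each request waits on 2 replicas)
# N=3 R=1 W=3: consistent reads      (writes block if any replica is down)
# N=3 R=3 W=1: consistent reads      (reads block if any replica is down)
```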

But wait… we’re not through yet

2010: Daniel Abadi (Yale) says CAP is misleading

The trade-offs defined by CAP’s “pick any 2” are misleading:

● The only time you need to make a trade-off is when there is a partition event (P)

● Systems that sacrifice C must do so all the time
● But systems that sacrifice A only need to do so when there’s a partition

Most importantly, you don’t give up C to gain A

You give up C to get another missing ingredient: L

LATENCY

Latency = how long a client request must wait for a response

Imagine replicating data across global data centers

[Diagram: data replicated across Data Center 1 … Data Center n]

“A high availability requirement implies that the system must replicate data. But as soon as a distributed system replicates data, a tradeoff between consistency and latency arises.”

- Abadi, 2010

The PACELC theorem (Abadi, 2010)

In a system that replicates data:
● If a partition (P) is detected, how does the system trade off
  ○ (A) Availability or
  ○ (C) Consistency
● Else (E), how does the system trade off
  ○ (L) Latency or
  ○ (C) Consistency
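A toy sketch of the choices PACELC describes, written as illustrative Python; the Replica class and the preference flags are hypothetical, not any real database's API:

```python
# Illustrative only: how a replicated read path might embody the PACELC choices.

class Replica:
    def __init__(self, data):
        self.data = data              # key -> (version, value)

    def get(self, key):
        return self.data.get(key)

def read(key, replicas, partitioned, prefer_availability, prefer_latency):
    if partitioned:
        # "P": a partition is detected -> trade Availability vs Consistency.
        if prefer_availability:
            return replicas[0].get(key)      # PA: answer from a reachable replica, possibly stale
        raise RuntimeError("unavailable")    # PC: refuse rather than risk a stale answer
    # "E"(lse): no partition -> trade Latency vs Consistency.
    if prefer_latency:
        return replicas[0].get(key)          # EL: one round trip to the nearest replica
    versions = [r.get(key) for r in replicas]
    return max(versions, key=lambda v: v[0]) # EC: wait for every replica, return the newest version

replicas = [Replica({"x": (1, "old")}), Replica({"x": (2, "new")})]
print(read("x", replicas, partitioned=False, prefer_availability=True, prefer_latency=False))
# -> (2, 'new'): the consistent answer, at the cost of contacting every replica
```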

DDBS classified by PACELC trade-offs:
● Dynamo, Cassandra, Riak: P+A, E+L
● Mongo, H-Store, VoltDB: P+C, E+C
● Yahoo! PNUTS: P+C, E+L

Comparing NoSQL databases using PACELC

References

● Images and title ideas from:
  ○ http://blog.nahurst.com/visual-guide-to-nosql-systems
  ○ http://digbigdata.com/know-thy-cap-theorem-for-nosql/

● Detailed references at:
  ○ http://www.bardoloi.com/blog/2017/03/06/pacelc-theorem/

thanks!

Any questions?

You can find me at @bardoloi
