Constraints of
Highly Scalable
Databases
1.
Traditional Databases
Recap of the ACID constraints
“Traditional” databases operate with the Transaction
paradigm that guarantees certain properties
● (A) Atomicity
● (C) Consistency
● (I) Isolation
● (D) Durability
The ACID Guarantees
1. Atomicity
Each transaction must be “all or nothing” - if any part fails, the whole transaction must be rolled back as if it never happened.
2. Consistency
The end-state of a transaction must follow all the rules defined in the database: data constraints, cascades, triggers etc.
3. Isolation
The result of two concurrent transactions should be the same as if they had run in sequential order.
4. Durability
A transaction, once committed, will survive permanently even if the system fails.
This includes disk crashes, power outages, etc.
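The atomicity guarantee above can be sketched with Python's built-in sqlite3 module. The table, account names, and amounts are made up for illustration; the point is that a failed transfer rolls back completely:

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
    except sqlite3.IntegrityError:
        pass  # CHECK constraint fired -> whole transaction rolled back

transfer(conn, "alice", "bob", 500)  # would overdraw alice; rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice still has 100, bob still has 50
```

Note how consistency (the CHECK constraint) and atomicity work together: violating the rule anywhere in the transaction undoes all of it.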
How do they do this?
Locking
● Read / write / range locks
Concurrency Control
● 2-phase commit (2PC), 3-phase commit (3PC) protocols
● Distributed locks
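The 2PC protocol mentioned above can be sketched in a few lines. This is a toy simulation (no network, no logging, participant names are hypothetical), but it shows the two phases: collect votes, then broadcast one global decision:

```python
# Minimal sketch of two-phase commit (2PC).

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote. A real node would force a prepare record to its log.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's global decision.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)  # a single "no" vote aborts the whole transaction
    # Phase 2: broadcast commit/abort to everyone.
    for p in participants:
        p.finish(decision)
    return decision

nodes = [Participant("n1"), Participant("n2"), Participant("n3", will_commit=False)]
print(two_phase_commit(nodes))       # False: n3 voted no, so everyone aborts
print([p.state for p in nodes])      # all 'aborted'
```

The blocking nature of this protocol (every participant waits on the coordinator) is exactly why it becomes painful at global scale, which motivates the next section.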
But then came the 2000s
And Scale Happened
Traditional RDBMSs were not designed for
the needs of modern web applications
Global Scale
Netflix knows which movies you watched, when, at what point(s) you paused and for how long, etc.
It then replicates that data across 3 global data centers.
Volume
In 2008, Facebook had only 100 million users and needed 8,000 shards of MySQL.
Today it has ~1.86 billion users.
Speed
In 2013, Twitter was recording peaks of 150,000 new tweets per second.
What to do?
Scale up! (?)
● Increase memory, cores, CPU
● Cache reads with memcached
● Master-slave replication
● Sharding
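Sharding, the last item above, is conceptually simple: route each key to one of N database shards. A minimal sketch (shard count and key format are made up):

```python
import hashlib

N_SHARDS = 8  # illustrative; Facebook's 2008 setup used thousands

def shard_for(key: str, n_shards: int = N_SHARDS) -> int:
    """Map a key to a shard index with a stable hash
    (not Python's built-in hash(), which is salted per process)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

print(shard_for("user:42"))  # the same key always lands on the same shard
print(shard_for("user:43"))  # different keys spread across shards
```

The catch, and one reason sharding alone was "not enough": with naive modulo placement, changing `n_shards` remaps almost every key, forcing massive data movement (this is what consistent hashing was invented to avoid).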
NOT
ENOUGH
2.
Redefining Constraints
Replacing ACID with BASE
“DBMS research is about ACID (mostly). But we forfeit ‘C’ and ‘I’ for availability, graceful degradation, and performance. This tradeoff is fundamental.”
- Eric Brewer, 2000
Eric Brewer proposed a new set of properties: BASE
Basically Available
System is always available for clients (but may not be consistent)
Soft State
Database is no longer in charge of “valid” data state. The app is now responsible.
Eventual consistency
If all goes well, all clients will eventually see the same thing. Probably.
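One common way eventual consistency is implemented is last-write-wins (LWW) replication with periodic anti-entropy syncs. A toy sketch, with made-up timestamps and keys:

```python
# Toy last-write-wins (LWW) replicas converging via anti-entropy.

class Replica:
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        # Accept a write only if it is newer than what we already hold.
        if key not in self.data or ts > self.data[key][0]:
            self.data[key] = (ts, value)

    def merge(self, other):
        # Anti-entropy: pull the other replica's newer entries.
        for key, (ts, value) in other.data.items():
            self.write(key, value, ts)

a, b = Replica(), Replica()
a.write("cart", ["book"], ts=1)          # client writes to data center A
b.write("cart", ["book", "pen"], ts=2)   # later write lands on data center B
# Before syncing, readers of A and B see different carts: stale data, which BASE allows.
a.merge(b)
b.merge(a)
print(a.data == b.data)  # True: replicas have converged on the latest write
```

The "Probably." above is earned: LWW silently discards the older concurrent write, which is why the application, not the database, must now reason about valid state.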
In the world of BASE parameters,
a different set of priorities rules:
● Availability is most important
● Weak consistency (i.e. stale data) is okay
● Approximate answers are okay
● Aggressive (optimistic) algorithms are okay
● Simple, fast, easy evolution of the schema is important
A new set of constraints:
the CAP Theorem
It is impossible for a distributed computer system to simultaneously provide more than 2 of these 3 guarantees:
● Consistency
● Availability
● Partition tolerance
(Eric Brewer, 1998-2000)
The CAP Parameters
1. Consistency*
All clients get the same view of the data, or they get an error
(i.e. every read receives the most recent write)
2. Availability
All clients can always read and always write
(i.e. every request receives a non-error response)
3. Partition tolerance
The system functions even if some nodes are unavailable
(i.e. system operates despite an arbitrary number of messages being dropped by the network between nodes)
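How a database is "tuned" along the C/A spectrum is often expressed with quorums. With N replicas, a read of R replicas is guaranteed to overlap the latest write of W replicas only when R + W > N. A sketch (N, R, W values are illustrative):

```python
# Quorum tuning: with N replicas, reads of size R always intersect
# writes of size W (so reads see the latest write) only when R + W > N.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

N = 3
print(is_strongly_consistent(N, r=2, w=2))  # True: CP-leaning tuning
print(is_strongly_consistent(N, r=1, w=1))  # False: AP-leaning, fast but stale reads possible
```

The partition-tolerance trade-off follows directly: if a network partition leaves fewer than W replicas reachable, the system must either reject writes (choosing C) or accept them on the reachable side (choosing A).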
All NoSQL databases live somewhere on this
spectrum, based on how they’re tuned
ACID ←→ BASE
● What levels of availability do you choose to provide?
● What levels of consistency do you choose to provide?
● What do you do when a partition is detected?
● How do you recover from a partition event?
But wait…
we’re not
through yet
2010: Daniel Abadi (Yale) says CAP is misleading
The trade-offs defined by CAP’s “pick any 2” are misleading:
● The only time you need to make a trade-off is when there is a partition event (P)
● Systems that sacrifice C must do so all the time
● But systems that sacrifice A only need to do so when there’s a partition
Most importantly, you don’t give up C to gain A
You give up C to get another missing ingredient: L
LATENCY
Latency = how long must a client request wait for your response?
Imagine replicating data across global data centers
(Diagram: data replicated across Data Center 1 through Data Center n)
“A high availability requirement implies that the system must replicate data. But as soon as a distributed system replicates data, a tradeoff between consistency and latency arises.”
- Abadi, 2010
The PACELC theorem (Abadi, 2010)
In a system that replicates data:
● If a partition (P) occurs, how does the system trade off
○ (A) Availability or
○ (C) Consistency?
● Else (E), how does the system trade off
○ (L) Latency or
○ (C) Consistency?
Comparing NoSQL databases using PACELC:

DDBS                       | P+A | P+C | E+L | E+C
Dynamo, Cassandra, Riak    |  ✓  |     |  ✓  |
MongoDB, H-Store, VoltDB   |     |  ✓  |     |  ✓
Yahoo! PNUTS               |     |  ✓  |  ✓  |
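PACELC's "else" branch can be made concrete with a toy read path: with no partition, a system still chooses between answering from the fastest replica (EL, possibly stale) and waiting for a quorum (EC). The replica latencies below are made up for illustration:

```python
# Toy illustration of PACELC's E branch: latency (EL) vs consistency (EC).
# Latencies in milliseconds are invented for the example.

REPLICA_LATENCY_MS = {"local": 2, "us-east": 80, "eu-west": 120}

def read_latency(mode: str) -> int:
    latencies = sorted(REPLICA_LATENCY_MS.values())
    if mode == "EL":    # latency-favoring: answer from the fastest replica
        return latencies[0]
    elif mode == "EC":  # consistency-favoring: wait for a majority (2 of 3),
        return latencies[1]  # so the second-fastest replica sets the latency
    raise ValueError(mode)

print(read_latency("EL"))  # 2  - fast, possibly stale
print(read_latency("EC"))  # 80 - slower, but reflects the latest quorum write
```

This is why PA/EL systems like Dynamo and Cassandra default to local, stale-tolerant reads, while PC/EC systems pay the round trips for freshness.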
References
● Images and title ideas from:
○ http://blog.nahurst.com/visual-guide-to-nosql-systems
○ http://digbigdata.com/know-thy-cap-theorem-for-nosql/
● Detailed references at:
○ http://www.bardoloi.com/blog/2017/03/06/pacelc-theorem/
thanks!
Any questions?
You can find me at @bardoloi