Copyright 2015 – Noah Mendelsohn
Consistency and Scalability
Noah Mendelsohn
Tufts University
Email: [email protected]
http://www.cs.tufts.edu/~noah
COMP 150-IDS: Internet Scale Distributed Systems (Spring 2015)
© 2010 Noah Mendelsohn
What you should get from today’s session
You will explore challenges relating to maintaining data consistency in a computing system
You will learn about techniques used to make storage systems more reliable
You will learn about transactions and their implementation using logs
You will learn about the CAP theorem and why scaling and consistency tend not to come together
A note about scope
The challenges & principles we cover today reappear at every level of system design
– CPU instruction set and memory
– Parallel programming languages
– Single machine databases
– Distributed applications and databases
Today we will focus mainly on larger scale systems
Why Worry About Consistency?
Duplicate information in computing systems
Why complicate things?
– Mirrored disks for reliability
– Parallel processing for higher throughput
– Geographic distribution reduces network delay (one copy each in Europe, Asia, US)
– Higher availability: if the network crashes, each “partition” may still have a copy
Inter-dependent data
– Bank account records have a total for each account
– Bank record keeps a total for all accounts
Memory hierarchies
– CPU caches, file system caches, Web proxies, etc.
If we allow updates, then maintaining consistency is tricky
Simple Examples: Parallel Disk Systems
Mirrored disks
[Figure: a logical disk holding block X, implemented as two mirrored disks that each hold a copy of X]
Everything written twice
Better performance on reads (slower on writes)
Duplicate data and crash recovery
[Figure: one of the two mirrored disks crashes; the other still holds its copy of X]
After a crash, data survives
Mirrored disks
[Figure: the failed disk is replaced and X is copied onto the new drive from the surviving mirror]
Replacement drive can be reconstructed in the background
REVIEW: How is the disk used in Unix / Linux?
[Figure: the components between an application and the disk sectors; everything below the application is in the Unix kernel]
– Application
– Filesystem: files/dirs, security, etc.
– In-memory block cache: buffered block r/w; hides timing
– Block device driver: direct read/write of filesystem “blocks” (hides sector size and device geometry)
– Raw device driver: access by cylinder/track/sector
– Disk sectors
We can use mirrored disks with Unix
[Figure: the path from an application through the Unix kernel to a mirrored disk]
– Application
– Filesystem: files/dirs, security, etc.
– In-memory block cache: buffered block r/w; hides timing
– Block device driver
– MIRRORED device driver
– Mirrored implementation
Abstraction: The mirrored disk provides the same service as a single disk…just faster and more reliable!
Atomicity and update synchronization
[Figure: a logical disk holding block X, implemented as two mirrored disks that each hold a copy of X]
Mirrored writes DO NOT happen at quite the same time
Question: when is the update committed?
RAID – Redundant Arrays of Inexpensive Disks
[Figure: a logical disk holding block X and its RAID implementation across several physical disks]
RAID – Redundant Arrays of Inexpensive Disks
[Figure: the logical disk holds blocks X and Y; the RAID implementation stores X on one disk, Y on a second, and XOR(X,Y) on a third]
RAID – Redundant Arrays of Inexpensive Disks
[Figure: the logical disk holds blocks X, Y and Z; the RAID implementation stores each on its own disk and XOR(X,Y,Z) on a fourth disk]
Much less space overhead than mirroring…but typically slower
RAID – Redundant Arrays of Inexpensive Disks
[Figure: the same layout with X, Y, Z and XOR(X,Y,Z) on four disks; the disk holding Z crashes]
If any disk is lost…you can reconstruct from information on the others!
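The reconstruction step can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-sized blocks and a single parity disk; the helper name xor_blocks and the sample block contents are made up for the example and do not come from any real RAID implementation.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of bytes per position across all blocks.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data disk, plus a parity block on a fourth disk.
x, y, z = b"XXXX", b"YYYY", b"ZZZZ"
parity = xor_blocks(x, y, z)

# Suppose the disk holding z is lost: XOR the survivors with the parity block.
recovered_z = xor_blocks(x, y, parity)
```

Because XOR is its own inverse, the same call recovers whichever single block is missing. Losing two disks at once is not recoverable with one parity block.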
Why Consistency is Hard
Synchronization problem
Some code to add money to my account:

    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance

Let’s run code for two deposits in parallel (two copies of the code above)
Can you see the problem?
There’s a risk that both copies will read the same balance before either writes its update. If that happens, I only get $1000, not $2000!
Solution - locking
Some code to add money to my account:

    Lock Noah’s Bank Account
    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance
    Unlock Noah’s Bank Account

Now the two copies can’t run at once on the same account…but if each locks a different bank account they can.
Only one transaction or thread can hold the lock at a time
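A minimal Python sketch of the same idea, using threads in one process rather than a real bank system. The Account class and the function names are hypothetical; the point is only that the read, the add and the write happen while the per-account lock is held.

```python
import threading


class Account:
    """Hypothetical in-memory account with one lock per account."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()


def deposit(account: Account, amount: int) -> None:
    # Only one thread at a time can hold this account's lock, so no other
    # deposit can read the balance between our read and our write.
    with account.lock:
        bal = account.balance
        new_balance = bal + amount
        account.balance = new_balance


def run_parallel_deposits(account: Account, amount: int, count: int) -> int:
    """Run `count` deposits in parallel threads and return the final balance."""
    threads = [
        threading.Thread(target=deposit, args=(account, amount))
        for _ in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance
```

Deposits to different accounts use different locks, so they can still run at the same time.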
Consistency and Crash Recovery
Some code to transfer money:

    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    NBal = NA.Balance;
    YBal = YA.Balance;
    NBal += $1000;
    YBal -= $1000;
    NA.Balance.Write NBal
    YA.Balance.Write YBal      (this write gets lost during the crash)

Can you see the problem?
If the system crashes just after writing my balance, the bank loses $1000 (it’s still in your account too)
Transactions
Transactions: automated consistency & crash recovery!
Some code to transfer money:

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    NBal = NA.Balance;
    YBal = YA.Balance;
    NBal += $1000;
    YBal -= $1000;
    NA.Balance.Write NBal
    YA.Balance.Write YBal
    END_TRANSACTION

The system guarantees that either everything in the transaction happens, or nothing…and it guarantees more!
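For a concrete feel, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table layout, the function names and the fail_midway flag are assumptions made for this example, and the "failure" is a raised exception rather than a real machine crash, so it only illustrates the all-or-nothing behaviour.

```python
import sqlite3


def make_bank() -> sqlite3.Connection:
    """Create an in-memory database with the two example accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("noah", 100), ("you", 1300)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    # Using the connection as a context manager commits if the block
    # finishes and rolls back if an exception escapes from it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
            (amount, dst),
        )
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
            (amount, src),
        )


def balances(conn) -> dict:
    return dict(conn.execute("SELECT owner, balance FROM accounts"))
```

If transfer is called with fail_midway=True, the first update is rolled back and both balances are unchanged; without the flag, both updates take effect together.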
ACID Properties of a Transaction
Atomicity
– Everything happens or nothing
Consistency
– If the database has rules they are obeyed at transaction end (e.g. balance must be < $1,000,000)
Isolation
– Any two parallel transactions act as if serial
– Most transaction systems do the locking automatically!
Durability
– Once committed, never lost
That seems almost magic…how can we achieve all this?
How to implement transactions - logging
The key idea: a shared log records information needed to undo any change made by any transaction
When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”
After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After the replay, only committed operations have happened!
When combined with transaction driven locking, we can automatically support ACID properties with almost no application code complexity
This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server
Logging and transaction processing are two of the most important and beautiful data processing technologies
Logging in Action
Some code to transfer money:

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    NBal = NA.Balance;
    YBal = YA.Balance;
    NBal += $1000;
    YBal -= $1000;
    NA.Balance.Write NBal
    YA.Balance.Write YBal
    END_TRANSACTION

The data and the log after each step:
– Before the transaction. Data: Noah.Bal = $100, Your.Bal = $1300. Log: empty.
– BEGIN_TRANSACTION. Data: unchanged. Log: Begin Trans 1.
– NA.Balance.Write NBal. Data: Noah.Bal = $1100, Your.Bal = $1300. Log: Begin Trans 1; Old Noah Bal = $100.
– YA.Balance.Write YBal. Data: Noah.Bal = $1100, Your.Bal = $300. Log: Begin Trans 1; Old Noah Bal = $100; Old Your Bal = $1300.
– END_TRANSACTION. Data: unchanged. Log: Begin Trans 1; Old Noah Bal = $100; Old Your Bal = $1300; Commit Tr 1.

What if we crash while the data is inconsistent?
Logging in Action
The same transfer code runs again from the same starting balances, but this time the system crashes part way through:
– Before the transaction. Data: Noah.Bal = $100, Your.Bal = $1300. Log: empty.
– BEGIN_TRANSACTION. Data: unchanged. Log: Begin Trans 1.
– NA.Balance.Write NBal. Data: Noah.Bal = $1100, Your.Bal = $1300. Log: Begin Trans 1; Old Noah Bal = $100.
– Crash!
Recovery!
State at restart. Data: Noah.Bal = $1100, Your.Bal = $1300. Log: Begin Trans 1; Old Noah Bal = $100.
When system restarts, data is inconsistent…
…but we can play the log to restore consistency!
We notice that Transaction 1 never committed, so we apply all of its undo entries: Noah.Bal is set back to $100
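The walk-through above can be sketched as a toy undo log in Python. Everything here is an illustrative assumption: the class UndoLoggedStore and its method names are invented for the example, a dict stands in for the main data store, a list stands in for the log on disk, and a "crash" is simply stopping before commit is called.

```python
class UndoLoggedStore:
    """Toy data store with an undo log (hypothetical, in-memory only)."""

    def __init__(self, data):
        self.data = dict(data)  # stands in for the main data store
        self.log = []           # stands in for the durable log

    def begin(self, txid):
        self.log.append(("begin", txid))

    def write(self, txid, key, value):
        # Record the old value first, so the change can be undone later.
        self.log.append(("undo", txid, key, self.data[key]))
        self.data[key] = value

    def commit(self, txid):
        # The commit record is the point at which the transaction "happens".
        self.log.append(("commit", txid))

    def recover(self):
        """After a restart, undo every write of an uncommitted transaction."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        # Walk the log backwards so the oldest saved value is restored last.
        for rec in reversed(self.log):
            if rec[0] == "undo" and rec[1] not in committed:
                _, _, key, old_value = rec
                self.data[key] = old_value


# Transaction 1 writes Noah's balance, then the system "crashes".
store = UndoLoggedStore({"noah": 100, "you": 1300})
store.begin(1)
store.write(1, "noah", 1100)
store.recover()  # restart: transaction 1 never committed, so it is undone
```

A real system would also force log records to disk before the data they describe and would record that the undo has been done; both are left out of this sketch.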
Logging – keeping consistency after crashes
The key idea: a shared log records information on how to undo any change to the main data
When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”
After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After the replay, only committed operations have happened!
When combined with locking, we can automatically support ACID properties with almost no application code complexity
This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server
Logging and transaction processing are two of the most important and beautiful data processing technologies
Full Disclosure
This explanation is highly simplified but the spirit is exactly right.
Examples of things not covered:
• Some databases use redo logging instead of undo logging, or log both old and new values
• Transactions can abort (a ROLLBACK record is logged instead of COMMIT)
  – Useful if programmer wants to give up
  – The system can abort a transaction if there is an error
  – The system can abort a transaction if locking has caused deadlock
• The same logs, if carefully designed, can be used to help with backup, recovery from disk drive failure, and synchronization of distributed systems.
Atomicity and hardware
Important: transactions are committed by an atomic hardware write to the log
– Before the commit is written, the transaction has not happened
– After it’s written all of its work is committed
– It all happens at once: atomically
Principle: Almost any computing activity that is to be done atomically must be achieved in a single atomic hardware operation!
– Store, Test_and_set or compare_and_swap CPU instructions
– Write a disk block
When designing systems that require consistency, start by studying what your hardware can do atomically
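One level up from the hardware, the same principle shows up in file handling: prepare everything off to the side, then make it visible in one atomic step. The sketch below is an analogy rather than the disk-block write described above. It relies on os.replace, which Python documents as an atomic rename on POSIX systems; the function name atomic_write is made up for the example.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new content to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to the device first
        # The single atomic step: before it the old file is in place,
        # after it the new one is.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

All of the slow work (writing and syncing) happens before the rename, so the only moment that matters for consistency is that one call.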
Consistency in Distributed Systems
Problem
In a distributed system, we want to do work in lots of places
To get consistency, we need to do an atomic update to the system state
Challenge: can we get consistency in a distributed system?
Can we get distributed consensus and consistency?
Yes! (but with some limitations)
First we need to think about how distributed systems fail…
…individual nodes can fail
…what if the network partitions?
In general, implementing transactions or other consistency guarantees in distributed systems is hard!
Network Partition
This network is fully connected
Network Partition
[Figure: the same network with the links between two groups of computers broken]
If these links break, the network is partitioned
All computers are still up! Updates in one partition can’t be sent to the other.
Questions about failures in distributed systems
Can we support replicated data and maintain consistency?
Can we run distributed transactions in which work (updating accounts) is spread through the network and achieve consistency?
How can we do crash recovery?
How do we continue running when the network partitions?
Voting: a simple approach to replicated data
Copies of the same data can be kept at any or all nodes…but when reading you must use the value stored at a majority of nodes!
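A minimal sketch of the read rule, assuming each reachable node simply replies with the value it holds. The function name majority_read and the idea of returning None when no value has a majority are choices made for this example; real voting schemes usually add version numbers and a matching rule for writes.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_read(replies: Sequence[Hashable], total_nodes: int) -> Optional[Hashable]:
    """Return the value held by a strict majority of ALL nodes, else None.

    `replies` holds the values returned by the nodes we could reach;
    `total_nodes` counts every node in the system, reachable or not.
    """
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count > total_nodes // 2 else None
```

With 5 nodes, three matching replies are enough to answer the read, while two replies never are, whatever they contain.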
Network Partition
All computers are still up! Updates in one partition can’t be sent to the other.
During partition, only one group of nodes can be a majority…the other can’t proceed!
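The majority test itself is a one-line check. The sketch below uses a hypothetical function name, can_proceed, and assumes each group of nodes knows how many nodes exist in total.

```python
def can_proceed(partition_size: int, total_nodes: int) -> bool:
    """A group of nodes may keep working only if it is a strict majority."""
    return 2 * partition_size > total_nodes


# Five nodes split 3 / 2: only the group of three may continue.
larger_ok = can_proceed(3, 5)
smaller_ok = can_proceed(2, 5)

# Four nodes split 2 / 2: neither side is a majority, so both must stop.
even_split_ok = can_proceed(2, 4)
```

Two disjoint groups can never both pass this test, which is what keeps the copies consistent.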
The Famous CAP Theorem
The CAP Theorem
When designing a system with distributed data you would like to have:
Consistency: everyone agrees on the data
Availability: nobody ever has to stop processing
Partition tolerance: keep going even when the network partitions
The CAP theorem says: you can have any two simultaneously, but not all three!
If your network can partition, then either some nodes will have to stop working (no availability) or data may become inconsistent (other partition doesn’t see the updates)
Network Partition
[Figure: a partitioned network; the majority partition is shown in orange]
With the voting algorithm, only the orange partition can do work.
The CAP theorem explains why we can never build a system that does better, unless we are willing to sacrifice consistency.
Distributed Transactions
Distributed transactions: the challenge
What if our computation is distributed?
We still want ACID properties
– Atomicity
– Consistency
– Isolation
– Durability
Per the CAP theorem: let’s ignore partition for now
Amazingly, there are ways to do this:
– Isolation and Consistency: distributed lock managers
– Atomicity and Durability: Distributed Two Phase Commit (DTPC)
Distributed two phase commit
Allows a single transaction to be spread across multiple nodes
Logging is done at each node as for traditional transactions
Special protocol ensures atomic commit of distributed work
One of the great achievements of 20th century distributed computing research
Distributed Two Phase Commit

Node 1 logic:

    BEGIN_DISTRIBUTED_TRANSACTION
    NA = Access Noah’s Bank account
    NBal = NA.Balance;
    NBal += $1000;
    NA.Balance.Write NBal
    COMMIT

Node 2 logic:

    JOIN_DISTRIBUTED_TRANSACTION
    YA = Access Your Bank account
    YBal = YA.Balance;
    YBal -= $1000;
    YA.Balance.Write YBal

The data and the logs at each step:
– Start. Node 1: Noah.Bal = $100; log: Begin Trans 1. Node 2: Your.Bal = $1300; log: Join Trans 1.
– After the writes. Node 1: Noah.Bal = $1100; log adds Old Noah Balance = $100. Node 2: Your.Bal = $300; log adds Old Your Balance = $1300.
– Prepare. Node 1 asks “Are you prepared to commit?” and Node 2 answers “Yes, I am prepared”. Each node’s log adds Prepared.
– Commit. Node 1 says “Commit!” and Node 2 answers “Done”. Each node’s log adds Commit.

Prepared means: if you ask me later to commit or abort I will be able to do either!
What happens if there is a crash?
If a node goes down before the commit, the master node writes an abort record and tells other nodes to abort
When any node comes up after a crash or after partition, it checks with master what has happened to any prepared transactions
Because prepared means it can go either way, that node can either record a commit or execute a rollback using data from the log
We can see the CAP theorem in action again: the algorithm stalls while the network is partitioned
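The protocol can be sketched as a small in-process simulation. This is a teaching sketch under stated assumptions, not a real implementation: the Participant class, the can_prepare flag and the two_phase_commit function are invented here, method calls stand in for network messages, the coordinator is modelled as a separate function rather than as Node 1 itself, and there are no timeouts or restarts.

```python
class Participant:
    """Hypothetical node holding one balance and a local log."""

    def __init__(self, name, balance, can_prepare=True):
        self.name = name
        self.balance = balance
        self.can_prepare = can_prepare  # False simulates a node that must vote no
        self.log = []

    def do_work(self, txid, delta):
        # Log the old value before changing it, as in single-node logging.
        self.log.append(("undo", txid, self.balance))
        self.balance += delta

    def prepare(self, txid) -> bool:
        if not self.can_prepare:
            return False
        # After this record the node can go either way when told to.
        self.log.append(("prepared", txid))
        return True

    def commit(self, txid):
        self.log.append(("commit", txid))

    def abort(self, txid):
        # Roll back using the undo entries for this transaction.
        for rec in reversed(self.log):
            if rec[0] == "undo" and rec[1] == txid:
                self.balance = rec[2]
        self.log.append(("abort", txid))


def two_phase_commit(txid, participants, coordinator_log) -> str:
    # Phase 1: ask every participant whether it is prepared.
    votes = [p.prepare(txid) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # The coordinator's log record is the atomic decision point.
    coordinator_log.append((decision, txid))
    # Phase 2: tell every participant the outcome.
    for p in participants:
        if decision == "commit":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision
```

A participant that has logged Prepared but has not yet heard the outcome must wait for the coordinator, which is where the stall during a partition comes from.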
Does Everyone use Distributed 2 Phase Commit?
In the late 1990s everyone thought DTPC would be the key to distributed data
In practice, systems like Amazon can’t stop in case of network partition or master node crash
Today:
– Massive but non-critical data stores do not even attempt perfect consistency: once in a while your Amazon shopping cart may lose things you’ve parked there
– Critical transactions (e.g. when you place your order and charge your credit card) are often recorded in less scalable but fully consistent (usually relational) databases
Summary
Keeping data consistent is important
Techniques like ACID transactions implemented with logs have been spectacularly successful
Consistency and scalability tend not to come together
Atomicity in software tends to require reduction to a single atomic operation in hardware
The CAP theorem says we can’t have Consistency, Availability and Partition tolerance all at once
Techniques like Voting and Distributed Two Phase Commit can achieve distributed consistency at the cost of availability
Many modern systems sacrifice consistency to achieve availability at massive scale