
Copyright 2015 – Noah Mendelsohn

Consistency and Scalability

Noah Mendelsohn
Tufts University
Email: [email protected]
Web: http://www.cs.tufts.edu/~noah

COMP 150-IDS: Internet Scale Distributed Systems (Spring 2015)


What you should get from today’s session

You will explore challenges relating to maintaining data consistency in a computing system

You will learn about techniques used to make storage systems more reliable

You will learn about transactions and their implementation using logs

You will learn about the CAP theorem and why scaling and consistency tend not to come together


A note about scope

The challenges & principles we cover today reappear at every level of system design
– CPU instruction set and memory
– Parallel programming languages
– Single machine databases
– Distributed applications and databases

Today we will focus mainly on larger scale systems


Why Worry About Consistency?


Duplicate information in computing systems

Why complicate things?
– Mirrored disks for reliability
– Parallel processing for higher throughput
– Geographic distribution reduces network delay (one copy each in Europe, Asia, US)
– Higher availability: if the network crashes, each “partition” may still have a copy

Inter-dependent data
– Bank account records have a total for each account
– Bank record keeps a total for all accounts

Memory hierarchies
– CPU caches, file system caches, Web proxies, etc.

If we allow updates, then maintaining consistency is tricky


Simple Examples: Parallel Disk Systems


Mirrored disks

[Figure: a logical disk holding block X, and its mirrored implementation: two physical disks, each holding a copy of X]

Everything written twice

Better performance on reads (slower on writes)


Duplicate data and crash recovery

[Figure: the logical disk holding X and its two mirrored disks; one mirror is marked “Crash!”]

After a crash, data survives


Mirrored disks

[Figure: the logical disk holding X and its two mirrored disks, one of them a replacement drive]

Replacement drive can be reconstructed in the background

REVIEW: How is the disk used in Unix / Linux?

[Figure: layers between an application and the disk, top to bottom]

Application
Unix Kernel:
– Filesystem: files/dirs, security, etc.
– In-memory Block Cache: buffered block r/w, hides timing
– Block Device Driver: direct read/write of filesystem “blocks” (hides sector size and device geometry)
– Raw Device Driver: access by cylinder/track/sector
Disk: sectors


We can use mirrored disks with Unix

[Figure: the same layers (Application; Unix Kernel with Filesystem, In-memory Block Cache and Block Device Driver), now with a MIRRORED Device Driver and the mirrored implementation beneath it]

Abstraction: The mirrored disk provides the same service as a single disk…just faster and more reliable!


Atomicity and update synchronization

[Figure: the logical disk holding X and its two mirrored disks]

Mirrored writes DO NOT happen at quite the same time

Question: when is the update committed?


RAID – Redundant Arrays of Inexpensive Disks

[Figure: a logical disk holding X, and a RAID implementation spread over several physical disks]



[Figure: RAID implementation of a logical disk holding X and Y: one disk holds X, one holds Y, and a third holds XOR(X,Y)]



[Figure: RAID implementation of a logical disk holding X, Y and Z: one disk each for X, Y and Z, plus a disk holding XOR(X,Y,Z)]

Much less space overhead than mirroring…but typically slower



[Figure: the same RAID implementation with one disk marked “Crash!”]

If any disk is lost…you can reconstruct from information on the others!
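Reconstruction works because XOR is its own inverse: XOR-ing the surviving blocks with the XOR(X,Y,Z) block yields the missing block. A minimal Python sketch, assuming three equal-sized data blocks; the block contents and the helper name `xor_blocks` are illustrative and not taken from the slides:

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    assert len({len(b) for b in blocks}) == 1, "blocks must be the same size"
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data disks plus one disk holding XOR(X, Y, Z) (made-up contents).
x, y, z = b"XXXX", b"YYYY", b"ZZZZ"
parity = xor_blocks(x, y, z)

# Suppose the disk holding Y is lost: XOR the survivors with the parity block.
recovered_y = xor_blocks(x, z, parity)
```

The same call recovers any single lost block, including the XOR block itself, which is why one extra disk is enough to survive one failure.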


Why Consistency is Hard


Synchronization problem

Some code to add money to my account:

    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance


Let’s run code for two deposits in parallel

Can you see the problem?

There’s a risk that both copies will read the old balance before either writes its update. If that happens, I only get $1000, not $2000!
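The bad interleaving can be replayed by hand. The sketch below is a deterministic Python walk-through of one unlucky schedule, not real parallel code; the starting balance of 100 and the variable names are assumptions for illustration:

```python
account = {"balance": 100}  # assumed starting balance

# Deposit A and deposit B each read the balance before either writes.
bal_a = account["balance"]          # A reads 100
bal_b = account["balance"]          # B reads 100 (A has not written yet)

account["balance"] = bal_a + 1000   # A writes 1100
account["balance"] = bal_b + 1000   # B overwrites with 1100: A's deposit is lost

lost_update_balance = account["balance"]
```

Two deposits of $1000 ran, but the balance rose by only $1000.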


Solution - locking

    Lock Noah’s Bank Account
    NA = Access Noah’s Bank account
    Bal = NA.Balance;
    NewBalance = Bal + $1000
    NA.Balance.Write NewBalance
    Unlock Noah’s Bank Account


Now the two copies can’t run at once on the same account…but if each locks a different bank account they can.

Only one transaction or thread can hold the lock at a time
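A minimal Python sketch of the locked deposit, assuming one `threading.Lock` per account; the `Account` class and `deposit` function are hypothetical names chosen for this example:

```python
import threading

class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()  # one lock per account

def deposit(account: Account, amount: int) -> None:
    with account.lock:                  # Lock the account
        bal = account.balance           # Bal = NA.Balance
        account.balance = bal + amount  # NA.Balance.Write NewBalance
    # Leaving the "with" block unlocks the account.

noah = Account(100)  # assumed starting balance
threads = [threading.Thread(target=deposit, args=(noah, 1000)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the lock belongs to the account rather than to the code, deposits to two different accounts can still run at the same time.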


Consistency and Crash Recovery

Some code to transfer money:

    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal      <- This gets lost during crash

Can you see the problem?

If the system crashes just after writing my balance, the bank loses $1000 (it’s still in your account too)


Transactions


Transactions: automated consistency & crash recovery!

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal
    END_TRANSACTION


The system guarantees that either everything in the transaction happens, or nothing…and it guarantees more!
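The all-or-nothing behavior can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table layout, the `transfer` helper and its `fail_midway` flag are invented for illustration, and a raised exception stands in for a failure between the two writes (a real crash would be handled by the database's own recovery at restart).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("noah", 100), ("you", 1300)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE owner = ?", (amount, dst))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE owner = ?", (amount, src))

def balances(conn):
    return dict(conn.execute("SELECT owner, balance FROM accounts"))

try:
    transfer(conn, "you", "noah", 1000, fail_midway=True)
except RuntimeError:
    pass
after_failure = balances(conn)   # the half-finished transfer was rolled back

transfer(conn, "you", "noah", 1000)
after_success = balances(conn)   # both writes happened together
```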


ACID Properties of a Transaction

Atomicity
– Everything happens or nothing

Consistency
– If the database has rules they are obeyed at transaction end (e.g. balance must be < $1,000,000)

Isolation
– Any two parallel transactions act as if serial
– Most transaction systems do the locking automatically!

Durability
– Once committed, never lost

That seems almost magic…how can we achieve all this?


How to implement transactions - logging

The key idea: a shared log records information needed to undo any change made by any transaction

When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”

After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After the crash, only committed operations have happened!

When combined with transaction driven locking, we can automatically support ACID properties with almost no application code complexity

This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server

Logging and transaction processing are two of the most important and beautiful data processing technologies


Logging in Action

    BEGIN_TRANSACTION
    NA = Access Noah’s Bank account
    YA = Access Your Bank account
    Nbal = NA.Balance;
    Ybal = YA.Balance;
    Nbal += $1000;
    Ybal -= $1000;
    NA.Balance.Write Nbal
    YA.Balance.Write Ybal
    END_TRANSACTION

Data at start: Noah.Bal = $100, Your.Bal = $1300

[Figure sequence: the log grows as the transaction runs]

Log: Begin Trans 1
Log: Begin Trans 1; Old Noah Bal = $100
Log: Begin Trans 1; Old Noah Bal = $100; Old Your Bal = $1300
Log: Begin Trans 1; Old Noah Bal = $100; Old Your Bal = $1300; Commit Tr 1

What if we crash while the data is inconsistent?









[Figure sequence: crash and recovery]

Log: Begin Trans 1
Log: Begin Trans 1; Old Noah Bal = $100

Crash!

Recovery!

When system restarts, data is inconsistent…

…but we can play the log to restore consistency!

We notice that Transaction 1 never committed, so we apply all of its undo entries: Noah’s balance is restored to $100
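The crash and recovery above can be sketched in a few lines of Python. This is a simplified illustration, not a real database log: plain in-memory structures stand in for the disk, the function names are hypothetical, and the sketch assumes each old value is logged before the data is overwritten.

```python
data = {"Noah": 100, "Your": 1300}   # main data store
log = []                             # append-only log

def begin(tx):
    log.append(("begin", tx))

def write(tx, key, new_value):
    log.append(("undo", tx, key, data[key]))  # assumption: old value is logged first
    data[key] = new_value

def commit(tx):
    log.append(("commit", tx))

def recover():
    """Undo, newest entry first, every write of a transaction with no commit record."""
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    for entry in reversed(log):
        if entry[0] == "undo" and entry[1] not in committed:
            _, _, key, old_value = entry
            data[key] = old_value

begin(1)
write(1, "Noah", 1100)
# Crash here: your balance was never written and no commit record exists.
recover()
```

After `recover()` runs, Noah's balance is back to 100, matching the last frame of the figure sequence.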


Logging – keeping consistency after crashes

The key idea: a shared log records information on how to undo any change to the main data

When a transaction commits:
– All data is written to the main data store
– A commit record is written to the log. This is the atomic point at which the transaction “happens”

After a crash, the log is “replayed”
– For any transactions that did not commit, the undo operations are performed
– After the crash, only committed operations have happened!

When combined with locking, we can automatically support ACID properties with almost no application code complexity

This is all built into SQL databases like Oracle, Postgres, DB2, and SQL Server

Logging and transaction processing are two of the most important and beautiful data processing technologies

Full Disclosure

This explanation is highly simplified but the spirit is exactly right.

Examples of things not covered:

• Some databases use redo vs. undo logging or log both old and new values

• Transactions can abort (a ROLLBACK record is logged instead of COMMIT)
  • Useful if programmer wants to give up
  • The system can abort a transaction if there is an error
  • The system can abort a transaction if locking has caused deadlock

• The same logs, if carefully designed, can be used to help with backup, recovery from disk drive failure, and synchronization of distributed systems.


Atomicity and hardware

Important: transactions are committed by an atomic hardware write to the log
– Before the commit is written, the transaction has not happened
– After it’s written all of its work is committed
– It all happens at once: atomically

Principle: Almost any computing activity that is to be done atomically must be achieved in a single atomic hardware operation!
– Store, test_and_set or compare_and_swap CPU instructions
– Write a disk block

When designing systems that require consistency, start by studying what your hardware can do atomically
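To show how a larger update can be built on one atomic primitive, here is a Python sketch of a compare_and_swap retry loop. Python does not expose the CPU instruction, so the `AtomicCell` class below is a hypothetical emulation in which a lock stands in for the hardware's atomicity; only the retry pattern in `deposit` is the point.

```python
import threading

class AtomicCell:
    """Emulates a hardware compare_and_swap; the lock stands in for the CPU's atomicity."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: store `new` only if the cell still holds `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def deposit(cell, amount):
    while True:                       # retry until our update wins
        old = cell.load()
        if cell.compare_and_swap(old, old + amount):
            return

balance = AtomicCell(100)  # assumed starting balance
workers = [threading.Thread(target=deposit, args=(balance, 1000)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

If another thread changes the balance between the `load` and the swap, the swap fails and the loop retries with the fresh value, so no deposit is lost.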


Consistency in Distributed Systems


Problem

In a distributed system, we want to do work in lots of places

To get consistency, we need to do an atomic update to the system state

Challenge: can we get consistency in a distributed system?


Can we get distributed consensus and consistency?

Yes! (but with some limitations)

First we need to think about how distributed systems fail…

…individual nodes can fail

…what if the network partitions?

In general, implementing transactions or other consistency guarantees in distributed systems is hard!


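Network Partition

[Figure: a network of connected computers]

This network is fully connected

Network Partition

[Figure: the same network with some links broken, splitting the computers into two groups]

If these links break the network is partitioned

All computers are still up! Updates in one partition can’t be sent to the other.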


Questions about failures in distributed systems

Can we support replicated data and maintain consistency?

Can we run distributed transactions in which work (updating accounts) is spread through the network and achieve consistency?

How can we do crash recovery?

How do we continue running when the network partitions?


Voting: a simple approach to replicated data

Copies of the same data can be kept at any or all nodes…but when reading you must use the value stored at a majority of nodes!


Network Partition

All computers are still up! Updates in one partition can’t be sent to the other.

During partition, only one group of nodes can be a majority…the other can’t proceed!
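The majority rule can be sketched in Python. This is an illustration under assumptions, not a complete replication protocol: the five node names, the stored values and the two helper functions are invented for the example.

```python
from collections import Counter

ALL_NODES = {"n1", "n2", "n3", "n4", "n5"}  # hypothetical five-node system

def has_majority(reachable):
    """True if this group of nodes is more than half of the whole system."""
    return len(reachable) > len(ALL_NODES) // 2

def majority_read(replicas, reachable):
    """Return the value stored at a majority of all nodes, or None if this group cannot tell."""
    if not has_majority(reachable):
        return None
    value, count = Counter(replicas[n] for n in reachable).most_common(1)[0]
    return value if count > len(ALL_NODES) // 2 else None

# n1..n3 received the latest update ("v2"); n4 and n5 still hold the old value.
replicas = {"n1": "v2", "n2": "v2", "n3": "v2", "n4": "v1", "n5": "v1"}
big, small = {"n1", "n2", "n3"}, {"n4", "n5"}

big_result = majority_read(replicas, big)      # the majority side can read
small_result = majority_read(replicas, small)  # the minority side must stop
```

Since two disjoint groups cannot both hold more than half of the nodes, at most one side of a partition can ever proceed.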


The Famous CAP Theorem


The CAP Theorem

When designing a system with distributed data you would like to have:

Consistency: everyone agrees on the data
Availability: nobody ever has to stop processing
Partition tolerance: keep going even when the network partitions

The CAP theorem says: you can have any two simultaneously, but not all three!

If your network can partition, then either some nodes will have to stop working (no availability) or data may become inconsistent (other partition doesn’t see the updates)


Network Partition

[Figure: the partitioned network, with one partition shown in orange]

With the voting algorithm, only the orange partition can do work.

The CAP theorem explains why we can never build a system that does better, unless we are willing to sacrifice consistency.


Distributed Transactions


Distributed transactions: the challenge

What if our computation is distributed?

We still want ACID properties
– Atomicity
– Consistency
– Isolation
– Durability

Per the CAP theorem: let’s ignore partition for now

Amazingly, there are ways to do this:
– Isolation and Consistency: distributed lock managers
– Atomicity and Durability: Distributed Two Phase Commit (DTPC)


Distributed two phase commit

Allows a single transaction to be spread across multiple nodes

Logging is done at each node as for traditional transactions

Special protocol ensures atomic commit of distributed work

One of the great achievements of 20th century distributed computing research


Distributed Two Phase Commit

Node 1 logic:

    BEGIN_DISTRIBUTED_TRANSACTION
    NA = Access Noah’s Bank account
    Nbal = NA.Balance;
    Nbal += $1000;
    NA.Balance.Write Nbal
    COMMIT

Node 1 data: Noah.Bal = $100
Node 1 Log: Begin Trans 1

Node 2 logic:

    JOIN_DISTRIBUTED_TRANSACTION
    YA = Access Your Bank account
    Ybal = YA.Balance;
    Ybal -= $1000;
    YA.Balance.Write Ybal

Node 2 data: Your.Bal = $1300
Node 2 Log: Join Trans 1




[Figure: both nodes after doing their work]

Node 1 data: Noah.Bal = $1100
Node 1 Log: Begin Trans 1; Old Noah Balance = $100

Node 2 data: Your.Bal = $300
Node 2 Log: Join Trans 1; Old Your Balance = $1300




[Figure: prepare phase; data unchanged at Noah.Bal = $1100 and Your.Bal = $300]

Message: Are you prepared to commit?
Reply: Yes, I am prepared

Each node’s log (Node 1: Begin Trans 1; Node 2: Join Trans 1) now also holds a Prepared record.

Prepared means: if you ask me later to commit or abort I will be able to do either!




[Figure: commit phase; data unchanged at Noah.Bal = $1100 and Your.Bal = $300]

Message: Commit!
Reply: Done

Each node’s log now holds a Commit record after its Prepared record.


What happens if there is a crash?

If a node goes down before the commit, the master node writes an abort record and tells other nodes to abort

When any node comes up after a crash or after partition, it checks with master what has happened to any prepared transactions

Because prepared means it can go either way, that node can either record a commit or execute a rollback using data from the log

We can see the CAP theorem in action again: the algorithm stalls while the network is partitioned
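The two phases can be sketched in Python. This is a toy model under assumptions: the `Participant` class, its `can_prepare` flag and the separate coordinator log are invented for illustration, there is no real network, and timeouts and restart recovery are left out.

```python
class Participant:
    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # False models a node that is down or cannot commit
        self.log = []

    def prepare(self):
        # Voting yes is a promise: this node can later either commit or abort.
        if self.can_prepare:
            self.log.append("Prepared")
        return self.can_prepare

    def commit(self):
        self.log.append("Commit")

    def abort(self):
        self.log.append("Abort")

def two_phase_commit(coordinator_log, participants):
    # Phase 1: ask every participant whether it is prepared.
    votes = [p.prepare() for p in participants]
    # Phase 2: the coordinator's log record is the atomic decision point.
    decision = "Commit" if all(votes) else "Abort"
    coordinator_log.append(decision)
    for p in participants:
        if decision == "Commit":
            p.commit()
        else:
            p.abort()
    return decision

node1, node2 = Participant("node1"), Participant("node2")
ok = two_phase_commit([], [node1, node2])

node1b, down = Participant("node1"), Participant("node2", can_prepare=False)
failed = two_phase_commit([], [node1b, down])
```

In the second run one node cannot prepare, so the coordinator records an abort and the prepared node rolls back instead of committing.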


Does Everyone use Distributed 2 Phase Commit?

In the late 1990s everyone thought DTPC would be the key to distributed data

In practice, systems like Amazon can’t stop in case of network partition or master node crashes

Today:
– Massive but non-critical data stores do not even attempt perfect consistency: once in a while your Amazon shopping cart may lose things you’ve parked there
– Critical transactions (e.g. when you place your order and charge your credit card) are often recorded in less scalable but fully consistent (usually relational) databases


Summary



Keeping data consistent is important

Techniques like ACID transactions implemented with logs have been spectacularly successful

Consistency and scalability tend not to come together

Atomicity in software tends to require reduction to a single atomic operation in hardware

The CAP theorem says we can’t have Consistency, Availability and Partition tolerance all at once

Techniques like Voting and Distributed Two Phase Commit can achieve distributed consistency at the cost of availability

Many modern systems sacrifice consistency to achieve availability at massive scale
