Distributed Systems Jeff Chase Duke University


Page 1

Distributed Systems

Jeff Chase, Duke University

Page 2

Challenge: Coordination

• The solution to availability and scalability is to decentralize and replicate functions and data... but how do we coordinate the nodes?
– data consistency
– update propagation
– mutual exclusion
– consistent global states
– group membership
– group communication
– event ordering
– distributed consensus
– quorum consensus

Page 3

Overview

• The problem of failures
• The fundamental challenge of consensus
• General impossibility of consensus
– in asynchronous networks (e.g., the Internet)
– safety vs. liveness: the "CAP theorem"
• Manifestations
– NFS cache consistency
– Transaction commit
– Master election in the Google File System

Page 4

Properties of nodes

• Essential properties typically assumed by the model:
– Private state
• Distributed memory: model sharing as messages
– Executes a sequence of state transitions
• Some transitions are reactions to messages
• May have internal concurrency, but hides it
– Unique identity
– Local clocks with bounded drift

Page 5

Node failures

• Fail-stop: nodes/actors may fail by stopping.
• Byzantine: nodes/actors may fail without stopping.
– Arbitrary, erratic, unexpected behavior
– May be malicious and disruptive
• Unfaithful behavior
– Actors may behave unfaithfully out of self-interest.
– If it is rational, is it Byzantine?
– If it is rational, then it is expected.
– If it is expected, then we can control it.
– Design in incentives for faithful behavior, or disincentives for unfaithful behavior.

Page 6

Example: DNS

[Figure: the DNS name hierarchy — the DNS roots; top-level domains (TLDs), including generic TLDs (edu, com, gov, org, net, firm, shop, arts, web) and country-code TLDs (us, fr); and subdomains such as duke.edu, cs.duke.edu, unc.edu, and washington.edu, down to hosts like www (prophet) and vmm01.]

What can go wrong if a DNS server fails?
How much damage can result from a server failure?
Can it compromise the availability or integrity of the entire DNS system?

Page 7

DNS Service 101

[Figure: a client's local DNS server resolves "lookup www.nhc.noaa.gov" by querying the DNS server for nhc.noaa.gov, which answers "www.nhc.noaa.gov is 140.90.176.22" — the address of the WWW server for nhc.noaa.gov.]

– client-side resolvers
• typically in a library
• gethostbyname, gethostbyaddr
– cooperating servers
• query-answer-referral model
• forward queries among servers
• server-to-server may use TCP ("zone transfers")

Page 8

Node recovery

• Fail-stopped nodes may revive/restart.
– Retain identity
– Lose messages sent to them while failed
– Arbitrary time to restart... or maybe never
• A restarted node may recover its state at the time of failure.
– Lose state in volatile (primary) memory.
– Restore state from non-volatile (secondary) memory.
– Writes to non-volatile memory are expensive.
– Design problem: recover complete states reliably, with minimal write cost.

Page 9

Distributed System Models

• Synchronous model
– Message delay is bounded, and the bound is known.
– E.g., delivery before the next tick of a global clock.
– Simplifies distributed algorithms:
• "learn just by watching the clock"
• absence of a message conveys information.
• Asynchronous model
– Message delays are finite, but unbounded/unknown.
– More realistic/general than the synchronous model.
• "Beware of any model with stronger assumptions." – Burrows
– Strictly weaker assumptions, hence strictly harder than the synchronous model.
• Consensus is not always possible.

Page 10

Messaging properties

• Other possible properties of the messaging model:
– Messages may be lost.
– Messages may be delivered out of order.
– Messages may be duplicated.
• Do we need to consider these in our distributed system model?
• Or can we solve them within the asynchronous model, without affecting its foundational properties?

Page 11

Network File System (NFS)

[Figure: NFS structure — on the client, user programs call through the syscall layer into VFS, which routes to the local UFS or to the NFS client; the NFS client sends requests over the network to the NFS server, which calls through the server's VFS into UFS.]

Page 12

NFS Protocol

• NFS is a network protocol layered above TCP/IP.
– Original implementations (and some today) use UDP datagram transport.
• The maximum IP datagram size was increased to match the FS block size, to allow send/receive of entire file blocks.
• Newer implementations use TCP as a transport.
– The NFS protocol is a set of message formats and types for request/response (RPC) messaging.

Page 13

NFS: From Concept to Implementation

• Now that we understand the basics, how do we make it work in a real system?
– How do we make it fast?
• Answer: caching, read-ahead, and write-behind.
– How do we make it reliable? What if a message is dropped? What if the server crashes?
• Answer: the client retransmits the request until it receives a response.
– How do we preserve the failure/atomicity model?
• Answer: well...

Page 14

NFS as a “Stateless” Service

• The NFS server maintains no transient information about its clients.
– The only "hard state" is the FS data on disk.
– Hard state: must be preserved for correctness.
– Soft state: an optimization, can be discarded safely.
• "Statelessness makes failure recovery simple and efficient."
– If the server fails, the client retransmits pending requests until it gets a response (or gives up).
– "A failed server is indistinguishable from a slow server or a slow network."

Page 15

Drawbacks of a Stateless Service

• Classical NFS has some key drawbacks:
– ONC RPC has execute-mostly-once semantics ("send and pray").
– So operations must be designed to be idempotent.
• Executing an operation again does not change the effect.
• Does not support file appends, exclusive name creation, etc.
– The server writes updates to disk synchronously.
• Slowww...
– Does not preserve local single-copy semantics.
• Open files may be removed.
• The server cannot help in client cache consistency.

Page 16

File Cache Consistency

• Caching is a key technique in distributed systems.
– The cache consistency problem: cached data may become stale if the data is updated elsewhere in the network.
• Solutions:
– Timestamp invalidation (NFS-Classic).
• Timestamp each cache entry, and periodically query the server: "has this file changed since time t?"; invalidate the cache if stale.
– Callback invalidation (AFS).
• Request notification (a callback) from the server if the file changes; invalidate the cache on callback.
– Delegation "leases" [Gray & Cheriton 1989, NQ-NFS, NFSv4]

Page 17

Timestamp Validation in NFS [1985]

• NFSv2/v3 uses a form of timestamp validation.
– Timestamp cached data at file grain.
– Maintain a per-file expiration time (TTL).
– Refresh/revalidate if the TTL expires.
• NFS Get Attributes (getattr)
• A similar approach is used in HTTP.
– Piggyback file attributes on each response.
• What happens on server failure? Client failure?
• What TTL to use?
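A minimal sketch of this TTL-based revalidation (illustrative names, not the real NFS client code — the "server" is just a dict of name → (mtime, data)): within the TTL the client trusts its cache; after expiry it re-fetches attributes, renews the TTL if the file is unchanged, and refills otherwise.

```python
import time

class CachedFile:
    def __init__(self, data, mtime, ttl):
        self.data, self.mtime = data, mtime
        self.expires = time.monotonic() + ttl

def read(cache, name, server, ttl=3.0):
    entry = cache.get(name)
    if entry and time.monotonic() < entry.expires:
        return entry.data                       # fresh enough: trust the cache
    mtime, data = server[name]                  # "getattr" (+ refetch)
    if entry and entry.mtime == mtime:
        entry.expires = time.monotonic() + ttl  # unchanged: just renew the TTL
        return entry.data
    cache[name] = CachedFile(data, mtime, ttl)  # changed or absent: refill
    return data

cache, server = {}, {"f": (1, b"v1")}
assert read(cache, "f", server) == b"v1"   # first read fills the cache
server["f"] = (2, b"v2")                   # file updated elsewhere
assert read(cache, "f", server) == b"v1"   # stale read within the TTL!
cache["f"].expires = 0                     # force TTL expiry
assert read(cache, "f", server) == b"v2"   # revalidation catches the change
```

The stale read in the middle is the point: timestamp validation bounds staleness by the TTL, it does not eliminate it.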

Page 18

Delegations or "Leases"

• In NQ-NFS, a client obtains a lease on the file that permits the client's desired read/write activity.
• "A lease is a ticket permitting an activity; the lease is valid until some expiration time."
– A read-caching lease allows the client to cache clean data.
• Guarantee: no other client is modifying the file.
– A write-caching lease allows the client to buffer modified data for the file.
• Guarantee: no other client has the file cached.
• Allows delayed writes: the client may delay issuing writes to improve write performance (i.e., the client has a writeback cache).

Page 19

Using Delegations/Leases

1. The client NFS piggybacks lease requests for a given file on I/O operation requests (e.g., read/write).
– NQ-NFS leases are implicit and distinct from file locking.
2. The server determines if it can safely grant the request, i.e., whether it conflicts with a lease held by another client.
– Read leases may be granted simultaneously to multiple clients.
– Write leases are granted exclusively to a single client.
3. If a conflict exists, the server may send an eviction notice to the holder.
– Evicted from a write lease? Write back.
– Grace period: the server grants extensions while the client writes.
– The client sends a vacated notice when all writes are complete.

Page 20

Failures

• Challenge:
– File accesses complete if the server is up.
– All clients "agree" on "current" file contents.
– Leases are not broken (mutual exclusion).
• What if a client fails while holding a lease?
• What if the server fails?
• How can any node know that any other node has failed?

Page 21

Consensus

[Figure: consensus among three processes. Step 1 (Propose): P1, P2, and P3 each propose a value v1, v2, v3 via unreliable multicast. Step 2 (Decide): the consensus algorithm runs, and each Pi decides a value di.]

Generalizes to N nodes/processes.

Page 22

Properties for Correct Consensus

• Termination: all correct processes eventually decide.
• Agreement: all correct processes select the same di.
• Or (stronger): all processes that do decide select the same di, even if they later fail.
• This is called uniform consensus: "Uniform consensus is harder than consensus."
• Integrity: all deciding processes select the "right" value.
– As specified for the variants of the consensus problem.

Page 23

Properties of Distributed Algorithms

• Agreement is a safety property.
– Every possible state of the system has this property in all possible executions.
– I.e., either they have not agreed yet, or they all agreed on the same value.
• Termination is a liveness property.
– Some state of the system has this property in all possible executions.
– The property is stable: once some state of an execution has the property, all subsequent states also have it.

Page 24

Variant I: Consensus (C)

Each Pi selects di from {v0, ..., vN-1}.
All Pi select di as the same vk (di = vk).
If all Pi propose the same v, then di = v; else di is arbitrary.

[Coulouris and Dollimore]
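For intuition, in a synchronous, failure-free round this spec is trivially satisfiable (a sketch, not a real consensus protocol): if every process sees all proposals, each can apply the same deterministic rule, e.g. "decide the minimum."

```python
def decide(proposals):
    # proposals: {process_id: value}, as seen by every correct process
    # after a full exchange; the shared deterministic rule gives Agreement.
    return min(proposals.values())

proposals = {"P1": 7, "P2": 3, "P3": 9}
decisions = [decide(proposals) for _ in range(3)]  # each Pi decides locally
assert decisions == [3, 3, 3]                      # Agreement
assert decide({"P1": 5, "P2": 5}) == 5             # all propose v => decide v
```

The hard part, developed in the following slides, is achieving this when processes can fail and messages can be delayed.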

Page 25

Variant II: Command Consensus (BG)

Pi selects di = vleader, the value proposed by the designated leader node Pleader, if the leader is correct; else the selected value is arbitrary.

As used in the Byzantine generals problem. Also called "attacking armies."

[Figure: the leader (commander) sends vleader to the subordinates (lieutenants); each subordinate decides di = vleader.]

[Coulouris and Dollimore]

Page 26

Variant III: Interactive Consistency (IC)

Each Pi selects di = [v0, ..., vN-1], a vector reflecting the values proposed by all correct participants.

[Coulouris and Dollimore]

Page 27

Equivalence of Consensus Variants

• If any of the consensus variants has a solution, then all of them have a solution.
• Proof is by reduction.
– IC from BG: run BG N times, once with each Pi as leader.
– C from IC: run IC, then select from the vector.
– BG from C:
• Step 1: the leader proposes to all subordinates.
• Step 2: the subordinates run C to agree on the proposed value.
– IC from C? BG from IC? Etc.

Page 28

Fischer-Lynch-Paterson (1985)

• No consensus can be guaranteed in an asynchronous communication system in the presence of any failures.
• Intuition: a "failed" process may just be slow, and can rise from the dead at exactly the wrong time.
• Consensus may occur recognizably, rarely or often.
• E.g., if there are no inconveniently delayed messages.
• FLP implies that no agreement can be guaranteed in an asynchronous system with Byzantine failures either. (More on that later.)

Page 29

Consensus in Practice I

• What do these results mean in an asynchronous world?
– Unfortunately, the Internet is asynchronous, even if we believe that all faults are eventually repaired.
– Synchronized clocks and predictable execution times don't change this essential fact.
• Even a single faulty process can prevent consensus.
• Consensus is a practical necessity, so what are we to do?

Page 30

Consensus in Practice II

• We can use some tricks to apply synchronous algorithms:
– Fault masking: assume that failed processes always recover, and reintegrate them into the group.
• If you haven't heard from a process, wait longer...
• A round terminates when every expected message is received.
– Failure detectors: construct a failure detector that can determine if a process has failed.
• Use a failure detector that is live but not accurate.
• Assume bounded delay.
• Declare slow nodes dead and fence them off.
• But: protocols may block in pathological scenarios, and they may misbehave if a failure detector is wrong.

Page 31

Fault Masking with a Session Verifier

[Figure: a client sends "Do A for me" to server instance S, which replies "OK, my verifier is x"; the client sends "B" and gets "x" back. The server then fails and restarts as S′ with a new verifier y ("oops..."). When the client sends "C", the reply "OK, my verifier is y" reveals the restart, so the client re-executes "A and B" and receives "y".]

What if y == x? How do we guarantee that y != x?
What is the implication of re-executing A and B, and after C?

Some uses: NFSv3 write commitment, RPC sessions, NFSv4 and DAFS (client).

Page 32

Delegation/Lease Recovery

• Key point: the bounded lease term simplifies recovery.
– Before a lease expires, the client must renew the lease.
• Else the client is deemed to have "failed".
– What if a client fails while holding a lease?
• The server waits until the lease expires, then unilaterally reclaims the lease; the client forgets all about it.
• If a client fails while writing on an eviction, the server waits for write slack time before granting a conflicting lease.
– What if the server fails while there are outstanding leases?
• Wait for the lease period + clock skew before issuing new leases.
– A recovering server must absorb lease renewal requests and/or writes for vacated leases.

Page 33

Failure Detectors

• How to detect that a member has failed?
– pings, timeouts, beacons, heartbeats
– recovery notifications
• "I was gone for a while, but now I'm back."
• Is the failure detector accurate?
• Is the failure detector live (complete)?
• In an asynchronous system, it is possible for a failure detector to be accurate or live, but not both.
– FLP tells us that it is impossible for an asynchronous system to agree on anything with accuracy and liveness!

Page 34

Failure Detectors in Real Systems

• Use a detector that is live but not accurate.
– Assume bounded processing delays and delivery times.
– Timeout with multiple retries detects failure accurately with high probability. Tune it to observed latencies.
– If a "failed" site turns out to be alive, then restore it or kill it (fencing, fail-silent).
– Example: leases and leased locks
• What do we assume about communication failures? How much pinging is enough? What about network partitions?
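A live-but-not-accurate detector of the kind described above can be sketched in a few lines (illustrative, with explicit `now` timestamps so the behavior is deterministic): suspect any node whose last heartbeat is older than the timeout. It is complete — a dead node is eventually suspected — but a merely slow node can be suspected wrongly.

```python
import time

class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}        # node -> time of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        now = time.monotonic() if now is None else now
        # Never-seen nodes are suspected too: the detector stays complete.
        return now - self.last_seen.get(node, float("-inf")) > self.timeout

d = HeartbeatDetector(timeout=2.0)
d.heartbeat("n1", now=100.0)
assert not d.suspected("n1", now=101.0)   # recent heartbeat: assume alive
assert d.suspected("n1", now=103.5)       # silence > timeout: suspect
```

Note the false-positive risk: at now=103.5 the node may simply be slow, which is why real systems pair such detectors with fencing.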

Page 35

Variant II: Command Consensus (BG)

Pi selects di = vleader, the value proposed by the designated leader node Pleader, if the leader is correct; else the selected value is arbitrary.

As used in the Byzantine generals problem. Also called "attacking armies."

[Figure: the leader (commander) sends vleader to the subordinates (lieutenants); each subordinate decides di = vleader.]

[Coulouris and Dollimore]

Page 36

A network partition

Page 37


Fox & Brewer "CAP Theorem": C-A-P, choose two.

[Figure: a triangle with vertices C (consistency), A (availability), and P (partition-resilience).]

Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent if there is a failure.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).

Page 38

Two Generals in practice

[Figure: one bank site records "Deduct $300" while another must record "Issue $300" — the two sites must agree over an unreliable link.]

How do banks solve this problem?

[Keith Marzullo]

Page 39

Careful ordering is limited

Transfer $100 from Melissa's account to mine:
1. Deduct $100 from Melissa's account.
2. Add $100 to my account.
Crash between 1 and 2: we lose $100.

Could reverse the ordering:
1. Add $100 to my account.
2. Deduct $100 from Melissa's account.
Crash between 1 and 2: we gain $100.

What does this remind you of?

Page 40

Transactions

Fundamental to databases (except MySQL, until recently).

Several important properties: "ACID" (atomic, consistent, isolated, durable).

We only care about atomicity (all or nothing):

BEGIN
  disk write 1
  ...
  disk write n
END

Called "committing" the transaction.

Page 41

Transactions: logging

1. Begin transaction.
2. Append info about modifications to a log.
3. Append "commit" to the log to end the x-action.
4. Write new data to the normal database.

A single-sector write commits the x-action (step 3).

Invariant: append new data to the log before applying it to the DB. This is called "write-ahead logging."

Page 42

Transactions: logging

1. Begin transaction.
2. Append info about modifications to a log.
3. Append "commit" to the log to end the x-action.
4. Write new data to the normal database.

What if we crash here (between 3 and 4)?
On reboot, reapply committed updates in log order.

Page 43

Transactions: logging

1. Begin transaction.
2. Append info about modifications to a log.
3. Append "commit" to the log to end the x-action.
4. Write new data to the normal database.

What if we crash here (before the commit record is logged)?
On reboot, discard uncommitted updates.
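The two crash cases above amount to one recovery rule, sketched here with a toy log format (not a real DBMS): replay updates from transactions whose "commit" record reached the log, and drop everything else.

```python
def recover(log, db):
    pending = {}                        # txid -> [(key, value), ...]
    for record in log:
        kind, txid, *rest = record
        if kind == "update":
            pending.setdefault(txid, []).append(tuple(rest))
        elif kind == "commit":          # the single commit record decides
            for key, value in pending.pop(txid, []):
                db[key] = value         # reapply committed updates in log order
    return db                           # uncommitted updates left in `pending` are discarded

log = [("update", 1, "a", 10), ("commit", 1),
       ("update", 2, "b", 20)]         # tx 2 crashed before its commit record
assert recover(log, {}) == {"a": 10}   # tx 1 replayed, tx 2 discarded
```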

Page 44

Committing Distributed Transactions

• Transactions may touch data at more than one site.
• Problem: any site may fail or disconnect while a commit for transaction T is in progress.
– Atomicity says that T does not "partly commit", i.e., commit at some site and abort at another.
– Individual sites cannot unilaterally choose to abort T without the agreement of the other sites.
– If T holds locks at a site S, then S cannot release them until it knows whether T committed or aborted.
– If T has pending updates to data at a site S, then S cannot expose the data until T commits/aborts.

Page 45

Commit is a Consensus Problem

• If there is more than one site, then the sites must agree to commit or abort.
• Sites (Resource Managers, or RMs) manage their own data, but coordinate commit/abort with other sites.
– "Log locally, commit globally."
• We need a protocol for distributed commit.
– It must be safe, even if FLP tells us it might not terminate.
• Each transaction commit is led by a coordinator (Transaction Manager, or TM).

Page 46

Two-Phase Commit (2PC)

[Figure: 2PC message flow between the coordinator (TM/C) and participants (RM/P): "commit or abort?" (precommit or prepare), "here's my vote" (vote), "commit/abort!" (decide, notify). The RMs validate the transaction and prepare by logging their local updates and decisions; the TM logs commit/abort (the commit point). If the vote is unanimous to commit, decide to commit; else decide to abort.]
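The decision rule above can be sketched as follows (toy participants, not a real TM/RM protocol stack): commit only on a unanimous "yes" vote, then notify everyone.

```python
def two_phase_commit(participants):
    # Phase 1: prepare -- each RM logs its updates and votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    # The TM logs the decision here: that log write is the commit point.
    # Phase 2: notify every participant of the outcome.
    for p in participants:
        p.finish(decision)
    return decision

class RM:
    def __init__(self, ok): self.ok, self.outcome = ok, None
    def prepare(self): return "yes" if self.ok else "no"
    def finish(self, decision): self.outcome = decision

rms = [RM(True), RM(True)]
assert two_phase_commit(rms) == "commit"
assert rms[0].outcome == "commit"
assert two_phase_commit([RM(True), RM(False)]) == "abort"  # one "no" aborts all
```

What this sketch hides is exactly what the next slides address: the coordinator or a participant may crash between the phases.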

Page 47

Handling Failures in 2PC

How do we ensure consensus if a site fails during the 2PC protocol?

1. A participant P fails before preparing.
– Either P recovers and votes to abort, or C times out and aborts.

2. Each P votes to commit, but C fails before committing.
– Participants wait until C recovers and notifies them of the decision to abort. The outcome is uncertain until C recovers.

Page 48

Handling Failures in 2PC, continued

3. P or C fails during phase 2, after the outcome is determined.
– Carry out the decision by reinitiating the protocol on recovery.
– Again, if C fails, the outcome is uncertain until C recovers.

Page 49

Fox & Brewer "CAP Theorem": C-A-P, choose two.

[Figure: a triangle with vertices C (consistency), A (availability), and P (partition-resilience).]

Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).

Page 50

Parallel File Systems 101

Manage data sharing in large data stores

[Renu Tewari, IBM]

• Asymmetric — e.g., PVFS2, Lustre, High Road
• Symmetric — e.g., GPFS, Polyserve

Page 51

Parallel NFS (pNFS)

[Figure: pNFS clients exchange metadata control with an NFSv4+ server and move data directly to Block (FC) / Object (OSD) / File (NFS) storage.]

[David Black, SNIA]

Page 52

pNFS architecture

[Figure: the same pNFS diagram — clients, NFSv4+ server, and Block (FC) / Object (OSD) / File (NFS) storage — with only the client-to-server path highlighted.]

• Only this (the client-to-server path) is covered by the pNFS protocol.
• The client-to-storage data path and the server-to-storage control path are specified elsewhere, e.g.:
– SCSI Block Commands (SBC) over Fibre Channel (FC)
– SCSI Object-based Storage Device (OSD) over iSCSI
– Network File System (NFS)

[David Black, SNIA]

Page 53

pNFS basic operation

• The client gets a layout from the NFS server.
• The layout maps the file onto storage devices and addresses.
• The client uses the layout to perform direct I/O to storage.
• At any time the server can recall the layout (leases/delegations).
• The client commits changes and returns the layout when it's done.
• pNFS is optional; the client can always use regular NFSv4 I/O.

[Figure: clients obtain a layout from the NFSv4+ server and perform direct I/O to storage.]

[David Black, SNIA]

Page 54

Google GFS: Assumptions

• Design a Google FS for Google's distinct needs.
• High component failure rates.
– Inexpensive commodity components fail often.
• A "modest" number of HUGE files.
– Just a few million.
– Each is 100 MB or larger; multi-GB files are typical.
• Files are write-once, mostly appended to.
– Perhaps concurrently.
• Large streaming reads.
• High sustained throughput favored over low latency.

[Alex Moschuk]

Page 55

GFS Design Decisions

• Files stored as chunks.
– Fixed size (64 MB).
• Reliability through replication.
– Each chunk replicated across 3+ chunkservers.
• Single master to coordinate access and keep metadata.
– Simple centralized management.
• No data caching.
– Little benefit due to large data sets, streaming reads.
• Familiar interface, but customize the API.
– Simplify the problem; focus on Google apps.
– Add snapshot and record-append operations.

[Alex Moschuk]

Page 56

GFS Architecture

• Single master
• Multiple chunkservers

... Can anyone see a potential weakness in this design?

[Alex Moschuk]

Page 57

Single master

• From distributed systems we know this is a:
– Single point of failure
– Scalability bottleneck
• GFS solutions:
– Shadow masters
– Minimize master involvement
• never move data through it; use it only for metadata
– and cache metadata at clients
• large chunk size
• the master delegates authority to primary replicas in data mutations (chunk leases)
• Simple, and good enough!

Page 58

Fault Tolerance

• High availability
– fast recovery
• master and chunkservers restartable in a few seconds
– chunk replication
• default: 3 replicas
– shadow masters
• Data integrity
– checksum every 64 KB block in each chunk

What is the consensus problem here?
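The per-block checksumming can be sketched as follows (illustrative: GFS checksums each 64 KB region of a chunk; the use of CRC32 here is an assumption for the sketch, not a claim about GFS's actual checksum function).

```python
import zlib

BLOCK = 64 * 1024  # 64 KB checksum granularity, as in the slide

def checksums(chunk: bytes):
    # One CRC32 per 64 KB block of the chunk.
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums) -> bool:
    return checksums(chunk) == sums

data = bytes(200 * 1024)                       # a ~200 KB chunk -> 4 blocks
sums = checksums(data)
assert verify(data, sums)
corrupted = data[:100] + b"\x01" + data[101:]  # flip one byte in block 0
assert not verify(corrupted, sums)             # corruption is detected
```

Block-level granularity means a read only needs to verify the blocks it touches, not the whole 64 MB chunk.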

Page 59

Google Ecosystem

• Google builds and runs services at massive scale.
– More than half a million servers.
• Services at massive scale must be robust and adaptive.
– To complement a robust, adaptive infrastructure.
• Writing robust, adaptive distributed services is hard.
• Google Labs works on tools, methodologies, and infrastructures to make it easier.
– Conceive, design, build
– Promote and transition to practice
– Evaluate under real use

Page 60

Google Systems

• Google File System (GFS) [SOSP 2003]
– Common foundational storage layer
• MapReduce for data-intensive cluster computing [OSDI 2004]
– Used for hundreds of Google apps
– Open-source: Hadoop (Yahoo)
• BigTable [OSDI 2006]
– A spreadsheet-like data/index model layered on GFS
• Sawzall
– Execute filter and aggregation scripts on BigTable servers
• Chubby [OSDI 2006]
– Foundational lock/consensus/name service for all of the above
– Distributed locks
– The "root" of distributed coordination in the Google tool set

Page 61

What Good is “Chubby”?

• Claim: with a good lock service, lots of distributed system problems become "easy".
– Where have we seen this before?
• Chubby encapsulates the algorithms for consensus.
– Where does consensus appear in Chubby?
• Consensus in the real world is imperfect and messy.
– How much of the mess can Chubby hide?
– How is "the rest of the mess" exposed?
• What new problems does such a service create?

Page 62

Chubby Structure

• A cell has multiple participants (replicas and a master).
– replicated membership list
– common DNS name (e.g., DNS-RR)
• Replicas elect one participant to serve as master.
– the master renews its Master Lease periodically
– elect a new master if the master fails
– all writes propagate to secondary replicas
• Clients send "master location requests" to any replica.
– returns the identity of the master
• Replace a replica after a long-term failure (hours).

Page 63

Master Election/Fail-over

Page 64

Fox & Brewer "CAP Theorem": C-A-P, choose two.

[Figure: a triangle with vertices C (consistency), A (availability), and P (partition-resilience).]

Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.
• AP: a reachable replica provides service even in a partition, but may be inconsistent.
• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).