24
SPANNER: GOOGLE’S GLOBALLYDISTRIBUTE D DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford Google, Inc

SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

Embed Size (px)

Citation preview

Page 1: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

SPANNER: GOOGLE’S GLOBALLYDISTRIBUTE

D DATABASE

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford Google, Inc

Page 2: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

OUTLINE

Introduction

Implementation

Data model

True Time

Concurrency Control

Conclusion

Page 3: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

INTRODUCTION

•First multi version, globally distributed, synchronously replicated database

•Big table like versioned key-value store

•Megastore like schematized semi-relational tables

•Interesting Features: Data replication can be configured at a fine grain by applications

•Externally consistent read and writes

•Globally consistent reads across the database at a timestamp

•Assigns timestamps to txn’s even though distributed

Page 4: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

IMPLEMENTATION

Page 5: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

•A Spanner deployment is called a universe.

•Spanner is organized as a set of zones

•Zones are the unit of administrative deployment.

•The set of zones is also the set of locations across which data can be replicated

•Zone has zone master and between 100 to 1000 span servers.

•Zone master assigns data to span server, span server serves data to user

•Universe masters displays status. (for debugging and more)

•The placement driver handles automated movement of data across zones by communicating with span servers

Page 6: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

STATE MACHINE REPLICATION AND PAXOS State Machine Approach:

1. Place copies of the State Machine on multiple, independent servers.

2. Receive client requests, interpreted as Inputs to the State Machine.

3. Choose an ordering for the Inputs.

4. Execute Inputs in the chosen order on each server.

5. Respond to clients with the Output from the State Machine.

6. Monitor replicas for differences in State or Output.

Page 7: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

An order of Inputs may be defined using a voting protocol

The problem of voting for a single value by a group of independent entities is called Consensus

Inputs may be ordered by their position in the series of consensus instances (Consensus Order)

Paxos[6] is a protocol for solving consensus, and may be used as the protocol for implementing Consensus Order.

Paxos requires a single leader to ensure liveness

That is, one of the replicas must remain leader long enough to achieve consensus on the next operation of the state machine

requirement is that one replica remain leader long enough to move the system forward.

Page 8: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

SPAN SERVER SOFTWARE STACK

Page 9: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

•Tablet implements a bag of mappings similar to big table’s mappings

•(key:string, timestamp:int64) -> string

•Tablet’s state is stored in form of b-tree files stored in colossus

•To support replication, each span server implements a paxos machine on top of each tablet.

•Paxos state machines are used to implement a consistently replicated bag of mappings

•Writes must initiate paxos protocol at the leader. Reads access state directly from tablet at any replica that is up to date

•Set of replicas is collectively a paxos group

Page 10: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

At each replica that is a leader, span server implements a lock table to implement concurrency control

Long lived paxos leader are critical to implement concurrency control

Transaction managers are used to support distributed transaction

Replica that is used to implenet txn manager is participant leader. Other replicas are participant slaves.

If a txn involves more than one paxos group, the group leaders must coordinate to perform two phase lock.

One group leader is chosen as coordinate leader and the others are slaves.

Page 11: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

DIRECTORIES AND PLACEMENT

Page 12: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

Directory is a bucketing abstraction implemented on top of key-value mappings.

Allows application to control locality of their data

Directory is a unit of data placement. Data in a directory has same replication configuration

Movedir is the background task used to move data from a directory to directory.

For data to be transfere from a paxos group to another, the transfer occurs through directories.

Page 13: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

DATA MODEL

• Support schematized semi-relational tables

•Data model is layered on top of the directory-bucketed key-value mappings

•An application creates one or more databases in a universe.

• Each database can contain an unlimited number of schematized tablesdirectory-bucketed key-value mappings

•Spanner’s data model is not purely relational, in that rows must have names

•Spanner’s data model is not purely relational, in that rows must have names

•Spanner database must be partitioned by clients into one or more hierarchiesof tables

Page 14: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,
Page 15: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

TRUE TIME

Notations:

TTinterval – Time interval with bounded time uncertainity. [earliest,latest]

TTstamp- Endpoints of TTinterval

TT.now() – Returns a TTinterval

TT.after(t)– true if t has definitely passed

TT.before(t) – true if has definitely not passed

Tabs(e) – absolute time of an event e

Page 16: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

WHAT GOOD DOES TRUE TIME DO? Guarantees:

For an invocation tt=TT.now(), tt.earliest <= tabs(enow) <= tt.latest

•It makes sure that an event occurs during or after earliest time and ends before the end of the time interval

•Guarantees correctness properties around concurrency control

•Implements features like:

I. Externally consistent transactions

II. Lock-free read only transactions

III. Non-blocking reads in the past

These features guarantee that a audit read at timestamp t will see the effects of every transaction committed as of t.

Page 17: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

CONCURRENCY CONTROL

TimeStamp Management:

Spanner supports read-write, read-only transcations and snapshot reads

•Read-Only:

Read-only transactions should declare that it has not writes

Reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked.

Reads in read-only can execute on any replica that is up-to-date

•SnapShot reads:

A read in the past that executes without locking.

Client can specify the timestamp or let the spanner choose one given a upper bound

Execution takes place on any replica that is up to date

Page 18: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

A commit is mandatory once a timestamp is chosen for either read-only or snapshot txns.

When a server fails, clients can internally continue the query on a different server by repeating the timestamp and the current read

Page 19: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

DETAILS

Spanner requires a scope expression for every read-only transaction, which is an expression that summarizes the keys that will be read by the entire transaction

If the scope’s values are served by a single Paxos group:

•The client issues the read-only transaction to that group’s leader.

•leader assigns sread and executes the read

•LastTS(): timestamp of the last committed write at a Paxos group

•the assignment sread = LastTS() trivially satisfies external consistency

If Scope values are served multiple paxos group:

•Spanner has its reads execute at sread = TT.now().latest

Page 20: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

oAssigning timestamps to Read-Write transactions:

Uses two-Phase locking. Timestamps can be assigned after acquiring locks and before releasing.

For a txn, spanner assigns the timestamp as the same which paxos assigns to paxos write that represents the txn commit

Depends on Monotonicity In-variant:

Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders

A single leader replica can trivially assign timestamps in monotonically increasing order.

leader must only assign timestamps within the interval of its leader lease

Page 21: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

Enforcing External-consistency invariant:

If the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.

tabs(e1commit ) < tabs(e2

start) ) ->s1 < s2

The protocol for executing transactions and assigning timestamps obeys two rules:

Start: Coordinate leader assigns timestamp si Txn Ti no less than value TT.now()

Commit Wait: coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(si) is true

Page 22: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

DETAILS

Reads within RW Txn uses Wound-Wait to avoid dead locks

Page 23: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

CONCLUSION

Distributed transaction consistency

Eventual consistency replication support across data centers

Closing the gap between big table and Megastore while preserving

their postives

Page 24: SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,

THANK YOU!