C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

C-Store: Updates

Jianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYMay. 15, 2009

Architecture of C-Store (Vertica)On a Single Node (or Site)

Write Store (WS)

WS is also a Column Store, and implements the identical physical DBMS design as RS.

WS is horizontally partitioned in the same way as RS. There is a 1:1 mapping between RS segments and WS seg

ments. Note such mapping only exists for “HOT” RS segments.

A tuple is identified by a (sid, storage_key) pair in either RS or WS sid: segment ID Storage_key: the Storage Key of the tuple

Join Indexes

Every projection is represented as a collection of pairs of segments. one in WS and one in RS.

For each tuple in the “sender”, we must store the sid and storage key of a corresponding tuple in the “receiver”.

Storage Key in WS

The Storage Key, SK, for each tuple is explicitly stored in each WS segment. Columns in WS only keep a logical SORT KEY or

der via SKs.

A unique SK is given to each insert of a logical tuple r in a table T. The SK of r must be recorded in each projection t

hat stores data for r. This SK is an interger.

Storage Representation of Columnsin WS Every column in a Projection

Represented as a collection of (v, sk) pairs v : a data value in the column sk : the storage key (explicitly stored)

Build a B-Tree over the (v, sk) pairs Use the second field of each pair, sk, as the KEY

Sort Keys of Each Projection in WS Represented as a collection of (s, sk) pairs

s : a sort key value sk : the storage key describing where s first appea

rs.

Build a B-Tree over the (s, sk) pairs Use the first field of each pair, s, as the KEY

Storage Management

This issue is the allocation of segments to nodes in a grid (or cloud computing) system. C-Store uses a storage allocator.

Some guidelines All columns in a single segment of a projection should be c

o-located, i.e., put at the same node. Join indexes should be co-located with their “sender” segm

ents. Each WS segment should be co-located with the RS segm

ents that contain the same (sort) key range.

Updates

An update is either an insert or a delete Insert a (new) tuple Delete an (existing) tuple Modify an existing tuple

Delete the existing version of the tuple. Insert the new version of the tuple.

Allocating a Storage Key in a Grid Background

All inserts corresponding to a single logical tuple have the same storage key.

Where to allocate a SK The node at which the insert is received.

Globally Unique Storage Key Each node maintains a locally unique counter. The initial value of the counter = 1 + the largest key in RS. Global SK = Local SK + Node ID.

Realizing Inserts in WS

WS is built on top of BerkeleyDB Using B-Tree in the package to support inserts.

Every insert to a projection results in a collection of physical inserts on different disk pages. One insert per column per projection. Accessing disk pages is expensive.

The solution is using a very large memory buffer to hold “HOT” WS part.

Transaction Framework in C-Store Large number of read-only transactions,

interspersed with a small number of update transactions covering few tuples.

To avoid substantial lock contention, use snapshot isolation to isolate read-only transactions.

Update transactions continue to set read and write locks and obey strict two-phase locking.

Snapshot Isolation

Basic idea Allowing read-only transactions to read the snapshots of the

database as of some time t in the recent past, provided before which we can guarantee that there are no u

ncommitted transactions. t: called the effective time.

The Key Problem Determining which of the tuples in WS and RS should be visi

ble to a read-only transaction running at effective time ET. A tuple is visible if it was inserted before ET and deleted afte

r ET.

Water Marks of Effective Time High Water Mark (HWM)

The most recent effective time in the past at which snapshot isolation can run.

Low Water Mark (LWM) The earliest effective time at which snapshot

isolation can run.

LWM <= Any Effective Time <= HWM

Insertion Vector (IV)

Maintain an insertion vector for each segment in WS For each tuple in the segment , the insertion vecto

r contains the epoch in which the tuple was inserted.

Use Tuple Mover to assure that no tuples in RS were inserted after the LWM. RS does not have insertion vectors.

Deleted Record Vector (DRV)

Maintain also a deleted record vector for each segment in WS For each tuple, the DRV has one entry, containing 0, if the tuple has not been deleted; otherwise, the epoch in which the tuple was deleted.

DRV is very sparse (mostly 0s) Can be compressed BY Run-Length Encoding.

The runtime system can consult IV and DRV to make the visibility calculation for each query on a tuple-by-tuple basis.

Maintaining the High Water Mark :Some Defintions the timestamp authority (TA)

one node designated with the responsibility of allocating timestamps to other nodes.

Time is divided into a number of epochs, each epoch is relatively long (e.g., many seconds each).

Epoch number: The number of epochs that have elapsed since the

beginning of time.

HWM Selection Algorithm

1. Define the initial HWM to be epoch 0; and start current epoch at 1.

2. Periodically, the TA decides to move the system to the next epoch: The TA sends a end of epoch message to each node; Each node increments current epoch from e to e+1, thus causing new transactions that arrive to be run with a

timestamp e+1.

3. Each node waits for all the transactions that began in epoch e (or an earlier epoch) to complete; and then sends an epoch complete message to the TA.

4. Once the TA has received epoch complete messages from all nodes for epoch e, it sets the HWM to be e, and sends this value to each node.

LWM “chases” HWM

Periodically, the timestamp authority (TA) sends out to each node a new LWM epoch number. By fixing a delta between LWM and HWM.

The delta is chosen to mediate between the needs of users who want historical access and the WS space constraint.

Tuple Mover

The job of the tuple mover to move blocks of tuples in a WS segment to the c

orresponding RS segment, Updating any join indexes in the process.

It operates as a background task looking for worthy segment pairs. When it finds one, it performs a merge-out proces

s, MOP on this (RS, WS) segment pair.

The Merge-Out Process (MOP)

In the chosen WS segment, MOP will find all tuples with an insertion time at or before the LWM. When the LWM moves on, tuples become “old” enough.

then divides the “old” enough tuples into two groups: Ones deleted at or before LWM.

These are discarded, because the user cannot run queries as of a time when they existed.

Ones that were not deleted, or deleted after LWM. These are moved to RS.

Detailed Steps of MOP First

MOP will create a new RS segment that we name RS'.

Then it reads in blocks from columns of the RS segment, deletes any RS tuples with a value in the DRV less than or equal

to the LWM, and merges in column values from WS. The merged data is then written out to the new segment RS'. Tuples receive new storage keys in RS', thereby requiring join in

dexes maintenance.

Once RS' contains all the WS data and join indexes are modified on RS' , the system cuts over from RS to RS'.

References

Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran and Stan Zdonik. C-Store: A Column Oriented DBMS VLDB, pages 553-564, 2005.

VERTICA DATABASE TECHNICAL OVERVIEW WHITE PAPER. http://www.vertica.com/php/pdfgateway?file=VerticaArchitectureWhitePaper.pdf

Documents

C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009