24
C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Embed Size (px)

Citation preview

Page 1: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

C-Store: Updates

Jianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYMay. 15, 2009

Page 2: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Architecture of C-Store (Vertica)On a Single Node (or Site)

Page 3: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Write Store (WS)

WS is also a Column Store, and implements the identical physical DBMS design as RS.

WS is horizontally partitioned in the same way as RS. There is a 1:1 mapping between RS segments and WS seg

ments. Note such mapping only exists for “HOT” RS segments.

A tuple is identified by a (sid, storage_key) pair in either RS or WS sid: segment ID Storage_key: the Storage Key of the tuple

Page 4: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Join Indexes

Every projection is represented as a collection of pairs of segments. one in WS and one in RS.

For each tuple in the “sender”, we must store the sid and storage key of a corresponding tuple in the “receiver”.

Page 5: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Storage Key in WS

The Storage Key, SK, for each tuple is explicitly stored in each WS segment. Columns in WS only keep a logical SORT KEY or

der via SKs.

A unique SK is given to each insert of a logical tuple r in a table T. The SK of r must be recorded in each projection t

hat stores data for r. This SK is an interger.

Page 6: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Storage Representation of Columnsin WS Every column in a Projection

Represented as a collection of (v, sk) pairs v : a data value in the column sk : the storage key (explicitly stored)

Build a B-Tree over the (v, sk) pairs Use the second field of each pair, sk, as the KEY

Page 7: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Sort Keys of Each Projection in WS Represented as a collection of (s, sk) pairs

s : a sort key value sk : the storage key describing where s first appea

rs.

Build a B-Tree over the (s, sk) pairs Use the first field of each pair, s, as the KEY

Page 8: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Storage Management

This issue is the allocation of segments to nodes in a grid (or cloud computing) system. C-Store uses a storage allocator.

Some guidelines All columns in a single segment of a projection should be c

o-located, i.e., put at the same node. Join indexes should be co-located with their “sender” segm

ents. Each WS segment should be co-located with the RS segm

ents that contain the same (sort) key range.

Page 9: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Updates

An update is either an insert or a delete Insert a (new) tuple Delete an (existing) tuple Modify an existing tuple

Delete the existing version of the tuple. Insert the new version of the tuple.

Page 10: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Allocating a Storage Key in a Grid Background

All inserts corresponding to a single logical tuple have the same storage key.

Where to allocate a SK The node at which the insert is received.

Globally Unique Storage Key Each node maintains a locally unique counter. The initial value of the counter = 1 + the largest key in RS. Global SK = Local SK + Node ID.

Page 11: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Realizing Inserts in WS

WS is built on top of BerkeleyDB Using B-Tree in the package to support inserts.

Every insert to a projection results in a collection of physical inserts on different disk pages. One insert per column per projection. Accessing disk pages is expensive.

The solution is using a very large memory buffer to hold “HOT” WS part.

Page 12: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Transaction Framework in C-Store Large number of read-only transactions,

interspersed with a small number of update transactions covering few tuples.

To avoid substantial lock contention, use snapshot isolation to isolate read-only transactions.

Update transactions continue to set read and write locks and obey strict two-phase locking.

Page 13: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Snapshot Isolation

Basic idea Allowing read-only transactions to read the snapshots of the

database as of some time t in the recent past, provided before which we can guarantee that there are no u

ncommitted transactions. t: called the effective time.

The Key Problem Determining which of the tuples in WS and RS should be visi

ble to a read-only transaction running at effective time ET. A tuple is visible if it was inserted before ET and deleted afte

r ET.

Page 14: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Water Marks of Effective Time High Water Mark (HWM)

The most recent effective time in the past at which snapshot isolation can run.

Low Water Mark (LWM) The earliest effective time at which snapshot

isolation can run.

LWM <= Any Effective Time <= HWM

Page 15: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Insertion Vector (IV)

Maintain an insertion vector for each segment in WS For each tuple in the segment , the insertion vecto

r contains the epoch in which the tuple was inserted.

Use Tuple Mover to assure that no tuples in RS were inserted after the LWM. RS does not have insertion vectors.

Page 16: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Deleted Record Vector (DRV)

Maintain also a deleted record vector for each segment in WS For each tuple, the DRV has one entry, containing 0, if the tuple has not been deleted; otherwise, the epoch in which the tuple was deleted.

DRV is very sparse (mostly 0s) Can be compressed BY Run-Length Encoding.

The runtime system can consult IV and DRV to make the visibility calculation for each query on a tuple-by-tuple basis.

Page 17: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Maintaining the High Water Mark :Some Defintions the timestamp authority (TA)

one node designated with the responsibility of allocating timestamps to other nodes.

Time is divided into a number of epochs, each epoch is relatively long (e.g., many seconds each).

Epoch number: The number of epochs that have elapsed since the

beginning of time.

Page 18: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

HWM Selection Algorithm

1. Define the initial HWM to be epoch 0; and start current epoch at 1.

2. Periodically, the TA decides to move the system to the next epoch: The TA sends a end of epoch message to each node; Each node increments current epoch from e to e+1, thus causing new transactions that arrive to be run with a

timestamp e+1.

3. Each node waits for all the transactions that began in epoch e (or an earlier epoch) to complete; and then sends an epoch complete message to the TA.

4. Once the TA has received epoch complete messages from all nodes for epoch e, it sets the HWM to be e, and sends this value to each node.

Page 19: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009
Page 20: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

LWM “chases” HWM

Periodically, the timestamp authority (TA) sends out to each node a new LWM epoch number. By fixing a delta between LWM and HWM.

The delta is chosen to mediate between the needs of users who want historical access and the WS space constraint.

Page 21: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Tuple Mover

The job of the tuple mover to move blocks of tuples in a WS segment to the c

orresponding RS segment, Updating any join indexes in the process.

It operates as a background task looking for worthy segment pairs. When it finds one, it performs a merge-out proces

s, MOP on this (RS, WS) segment pair.

Page 22: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

The Merge-Out Process (MOP)

In the chosen WS segment, MOP will find all tuples with an insertion time at or before the LWM. When the LWM moves on, tuples become “old” enough.

then divides the “old” enough tuples into two groups: Ones deleted at or before LWM.

These are discarded, because the user cannot run queries as of a time when they existed.

Ones that were not deleted, or deleted after LWM. These are moved to RS.

Page 23: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

Detailed Steps of MOP First

MOP will create a new RS segment that we name RS'.

Then it reads in blocks from columns of the RS segment, deletes any RS tuples with a value in the DRV less than or equal

to the LWM, and merges in column values from WS. The merged data is then written out to the new segment RS'. Tuples receive new storage keys in RS', thereby requiring join in

dexes maintenance.

Once RS' contains all the WS data and join indexes are modified on RS' , the system cuts over from RS to RS'.

Page 24: C-Store: Updates Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 15, 2009

References

Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran and Stan Zdonik. C-Store: A Column Oriented DBMS VLDB, pages 553-564, 2005.

VERTICA DATABASE TECHNICAL OVERVIEW WHITE PAPER. http://www.vertica.com/php/pdfgateway?file=VerticaArchitectureWhitePaper.pdf