A Framework for Access Methods for Versioned Data

A Framework for Access Methods for

Versioned Data

B. Salzberg, L. Jiang, D. Lomet, M. Barrena, J. Shan & E. Kanoulas

Outline• Motivation• Introducing versions through the

examples• Versions and version ranges• Data pages

– Page splitting and consolidation– Efficiency guarantee

• Index pages• Conclusions

Motivation

• Historical archives need to be retained– Medical, banking, …

• Different historical versions created along different branches must be reconstructed– Software libraries, design, …

• Access methods for versioned data have been proposed

Motivation

• We present a framework for constructing and understanding versioned access methods

• Central point: the study of version splitting of units of data storage (disk pages)

• Main goal: to make the stabbing query efficient:

“Find all data alive at this version”

The two-version example

• Records format: <vi,kj,dl>

<v1,k1,d1><v1,k2,d2><v1,k3,d3>

version v1

<v2,k1,d’1><v2,k2,d2><v2,k3,d3>

version v2

Redundancy

<v1,k1,d1> <v2,k1,d’1><{v1,v2},k2,d2><{v1,v2},k3,d3>

What if k2 does not change for a large number of versions?

Could we use start and end versions labels?

Set of versions for which record k2 does not change

branch b2

branch b1

The three-version example

v1

v3

v2

<v1,k1,d1><v1,k2,d2><v1,k3,d3>

version v1

<v2,k1,d’1><v2,k2,d2><v2,k3,d3>

version v2

<v3,k1,d1><v3,k2,d’2><v3,k3,d3>

version v3

branch b2

branch b1

time

Key space

v1 v3

k1

k2

k3

now

d1

d’2d2

d3

branch b2

time

Key space

v1 v2

k1

k2

k3

now

d1 d’1d2

d3

branch b1

<{v1, v3},k1,d1> <v2,k1,d’1><{v1, v2},k2,d2> <v3,k2,d’2><{v1, v2, v3},k3,d3>

The three-version example

time

Key space

v1 v3

k1

k2

k3

now

d1

d’2d2

d3

branch b2

time

Key space

v1 v2

k1

k2

k3

now

d1 d’1d2

d3

branch b1

When there is branching, we cannot express a unique end version for a set of versions

We might keep the start and the end version on each branch

What if k3 is never updated in branch b1? Should we keep updating the end version for k3 as new versions appear on b1?

Versions

• The initial version set: V = {v1}• New versions are obtained by updating,

inserting or deleting records from old versions of V

• V can be represented by a tree: the version tree

• There is a partial order on the nodes of the version tree– Ancestors: anc(v)={a V/ a < v}– Descendents: desc(v) = {d V/ d > v}

Version Ranges• Records correspond to sets of versions over

which they do not change• Such a set forms a subtree called the version

range. We have:– One start version: the root of the subtree– A set of end versions: the leaves of the subtree

(one on each branch)

• The main objection:– To have to update end versions for every new

version for which the record does not change

• The solution: – To take apart end versions from the version range

Consider a record R inserted at v1. Suppose that R remains unchanged at v2 and v3.

Version Ranges

v1

v3

v2

v4Assume now that R is updated at version v4. We could say v3 is an end version for R

Assume that a new version v5 appears but R is not touched by v5

v5

Version Ranges

v1

v3

v2

v6

v4Now version v3 is no longer an end version for R. R remains unchanged at {v1, v2, v3, v5 }

v5

We choose the end version for R to be a “stop sign” along a branch. The end version v4, does not belong to the version range for R

Later, any number of descendent in VR could be created. If these versions do not change R, the VR expands automatically

Version Ranges

v1

v3

v2

v4

v5

v6

Later, any number of descendent in VR could be created. If this versions do not change R, the VR expands automatically

Version Ranges

• The version range vr = (start(vr), end(vr)), where:

– start(vr) is an individual version– end(vr) is the minimal set of versions ev

with the property:

v vr iif 1. start(vr) v

2. ev end(vr) (ev v)

<{v1, v3},k1,d1> <v2,k1,d’1><{v1, v2},k2,d2> <v3,k2,d’2><{v1, v2, v3},k3,d3>

The three-version example revisited

<(v1, {v2}),k1,d1> <(v2, { }),k1,d’1><(v1, {v3}),k2,d2> <(v3, { }),k2,d’2><(v1, { }),k3,d3>

Data pages

• Data pages (P) delimit one version range (vr) and one key range (kr)

• We define KVR(P) = (kr(P),vr(P))– A data page with KVR(P) = (kr,vr) stores

all records <vr’,k,d> such that:

1. k kr and

2. vr vr’

Compact record representation

• To store records in data pages we use the compact record representation

<(v1, {v2}),k1,d1> <(v2, { }),k1,d’1><(v1, {v3}),k2,d2> <(v3, { }),k2,d’2><(v1, { }),k3,d3>

<v1,k1,d1> <v1,k2,d2> <v1,k3,d3><v2,k1,d’1><v3,k2,d’2><v4,k2,null>

Deletion events do not cause lose of content, they are stated by means of compact null records

Looking for the efficiency

• To make the stabbing query efficient, a substantial percentage of the records in an accessed page must be alive for a version v

• The splitting page policy– When a page P gets full, a version splitting of P

must be done (here current version vn is used)

– A new page P’ is allocated with VR(P’) = (vn,)– Records from P can be moved or copied to P’

Page splitting policy

• Records created by vn which are not null are moved from P to P’

• Records whose version range lie in VR(P) VR(P’) which are not null are copied to P’

Page splitting policy

• Some kind of key splits are allowed in our framework (similar to B-tree page splits)– After a version split if the new page has more than

a certain threshold value Tk (we call version-and-key split)

– When a full page has version range (current_version, ) (we call restricted-key split)

• Pure key splits cannot guarantee a minimun number of records alive for a given version

Consolidation• Delete operations may damage the stabbing

query efficiency• When the number of records alive in P at vn

fall below a threshold Tc, a consolidation process is triggered

• A sparse page and a proper sibling are current-version split, and the results are combined in one page

• Transactions with a large number of deletions may generate ghost pages

Efficiency guarantee

• We start with a page D at version v1 having n alive records

• Our framework guarantees a minimum number of records in a data page D in answering a stabbing query (v VR(D)) under different scenarios

Efficiency guaranteeAssertions

1. No deletes and only version splits:

at least n

2. No deletes and only current-version or version-and-key or restricted-key splits:

at least min(n,Tk/2)

3. Any kind of transactions and version splits, version-and-key splits, restricted-key splits and node consolidation:

at least min(Tc,n)

Index pages• Index pages + data pages form a DAG• Index pages also correspond to key-version

ranges• Index page entries contain for every child C:

<VR(C), KR(C), Disk_address(C)>• Index page splits and consolidations follow the

same policy as for data pages• Additional details about properties and

treatment of index pages can be seen in the paper

Conclusions

• Version data are not trivial to deal with

• Our framework – contributes to understand the implications

of managing and retrieving version data– gives clear cues to represent in a compact

and robust way this kind of data– supports realistic assumptions on

transactions

A Framework for Access Methods for

Versioned Data

B. Salzberg, L. Jiang, D. Lomet, M. Barrena, J. Shan & E. Kanoulas

Documents

A Framework for Access Methods for Versioned Data