ORDB Implementation Discussion. From RDB to ORDB Issues to address when adding OO extensions to DBMS system

ORDB Implementation Discussion

From RDB to ORDB

Issues to address when adding OO extensions to DBMS system

Layout of Data

Deal with large data types: ADTs/blobs
– special-purpose file space for such data, with special access methods

Large fields in one tuple:
– One single tuple may not even fit on one disk page
– Must break into sub-tuples and link via disk pointers

Flexible layout:
– constructed types may have flexible-sized sets, e.g., one attribute can be a set of strings.
– Need to provide meta-data inside each type concerning layout of fields within the tuple
– Insertion/deletion will cause problems when contiguous layout of ‘tuples’ is assumed
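The "break into sub-tuples and link via disk pointers" idea can be sketched as follows. This is only an illustration: the page size, the in-memory list standing in for a disk file, and the function names are all assumptions, not part of any particular DBMS.

```python
PAGE_SIZE = 16  # hypothetical, tiny page size so the example stays readable


def store_large_field(pages, data, page_size=PAGE_SIZE):
    """Split data into page-sized sub-tuples, append them to pages, and
    link each one to the next by page number. Returns the first page number."""
    chunks = [data[i:i + page_size] for i in range(0, len(data), page_size)] or [b""]
    first = len(pages)
    for i, chunk in enumerate(chunks):
        # The "disk pointer" is simply the number of the next page, or None at the end.
        next_page = first + i + 1 if i + 1 < len(chunks) else None
        pages.append((chunk, next_page))
    return first


def read_large_field(pages, first):
    """Follow the chain of pointers and reassemble the original field."""
    parts = []
    page_no = first
    while page_no is not None:
        payload, page_no = pages[page_no]
        parts.append(payload)
    return b"".join(parts)


pages = []                       # stands in for the special-purpose file space
blob = b"x" * 40                 # a field larger than one page
head = store_large_field(pages, blob)
restored = read_large_field(pages, head)
```

Reading the field costs one page access per sub-tuple, which is one reason such data usually gets its own file space and access methods.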

Layout of Data

More layout design choices (clustering on disk):

– Lay out complex object nested and clustered on disk (if nested and not pointer based)

– Where to store objects that are referenced (shared) by possibly several other and different structures

– Many design options for objects that are in a type hierarchy with inheritance

– Constructed types such as arrays require novel methods, like array chunking into (4x4) subarrays for non-continuous access
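A minimal sketch of array chunking, assuming a 2-D array stored as 4x4 tiles keyed by tile coordinates (the function names and the dictionary-of-tiles layout are illustrative choices):

```python
def chunk_array(rows, chunk=4):
    """Split a 2-D list into chunk x chunk tiles keyed by (tile_row, tile_col)."""
    tiles = {}
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            key = (r // chunk, c // chunk)
            # Tiles at the array border are padded with None.
            tile = tiles.setdefault(key, [[None] * chunk for _ in range(chunk)])
            tile[r % chunk][c % chunk] = value
    return tiles


def read_cell(tiles, r, c, chunk=4):
    """Fetch one cell by locating its tile first."""
    return tiles[(r // chunk, c // chunk)][r % chunk][c % chunk]


def tiles_for_column(n_rows, c, chunk=4):
    """Tiles that must be read to scan one full column."""
    return sorted({(r // chunk, c // chunk) for r in range(n_rows)})


grid = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = chunk_array(grid)
column_tiles = tiles_for_column(8, 6)
```

With row-major storage a column scan of this 8x8 array touches every row; with 4x4 tiles it touches only the two tiles listed in `column_tiles`.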

Objects/OIDs

OID generation: uniqueness across time and system

Object reference handling:
– must avoid dangling references
– semantics for object manipulation for shared objects

ADTs

– Type representation: size/storage
– Type access: import/export
– Type manipulation: special methods to serve as filter predicates and join predicates
– Special-purpose index structures: efficiency

ADTs

Mechanism to add index support along with ADT:
– External storage of index file outside DBMS
– Provide “access method interface” a la:
• Open(), close(), search(x), retrieve-next()
• Plus, statistics on external index
– Or, generic ‘template’ index structure
• Generalized Search Tree (GiST): user-extensible
• Concurrency/recovery provided
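The access method interface above might look roughly like this. It is a sketch only: the class name, the sorted-list storage, and the shape of the statistics are assumptions made for illustration, not the API of a real system.

```python
import bisect


class SortedListIndex:
    """Hypothetical external index exposing open/close/search/retrieve_next."""

    def __init__(self, entries):
        self._entries = sorted(entries)  # (key, record_id) pairs
        self._is_open = False
        self._key = None
        self._pos = None

    def open(self):
        self._is_open = True

    def close(self):
        self._is_open = False
        self._pos = None

    def search(self, x):
        """Position the scan on the first entry whose key equals x."""
        if not self._is_open:
            raise RuntimeError("index is not open")
        self._key = x
        # (x,) sorts before every (x, record_id), so this finds the first match.
        self._pos = bisect.bisect_left(self._entries, (x,))

    def retrieve_next(self):
        """Return the next matching record id, or None when the scan is done."""
        if self._pos is None or self._pos >= len(self._entries):
            return None
        key, record_id = self._entries[self._pos]
        if key != self._key:
            return None
        self._pos += 1
        return record_id

    def statistics(self):
        """Numbers an optimizer could use for costing."""
        return {
            "entries": len(self._entries),
            "distinct_keys": len({key for key, _ in self._entries}),
        }


index = SortedListIndex([(5, "r1"), (3, "r2"), (5, "r3"), (9, "r4")])
index.open()
index.search(5)
matches = []
while (rid := index.retrieve_next()) is not None:
    matches.append(rid)
stats = index.statistics()
index.close()
```

The DBMS only sees the four calls plus the statistics, so the index file itself can live outside the system.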

Query Processing

Query Parsing:
– Type checking for methods
– Subtyping/Overriding

Query Rewriting:
– May translate path expressions into join operators
– Deal with collection hierarchies (UNION?)
– Indices or extraction out of collection hierarchy

Query Optimization Core

– New algebra operators must be designed:
• such as nest, unnest, array-ops, values/objects, etc.
– Query optimizer must integrate them into optimization process:
• New Rewrite rules
• New Costing
• New Heuristics
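To make two of the new operators concrete, here is a sketch of nest and unnest over rows represented as Python dictionaries. The row representation and the sample data are assumptions for illustration.

```python
def unnest(rows, attr):
    """Produce one flat row per element of the set-valued attribute attr."""
    flat = []
    for row in rows:
        for item in row[attr]:
            copy = dict(row)
            copy[attr] = item
            flat.append(copy)
    return flat


def nest(rows, attr):
    """Group rows that agree on all other attributes and collect attr into a list."""
    groups = {}
    for row in rows:
        key = tuple((k, v) for k, v in row.items() if k != attr)
        groups.setdefault(key, []).append(row[attr])
    return [{**dict(key), attr: values} for key, values in groups.items()]


authors = [
    {"name": "Ann", "phones": ["555-1", "555-2"]},
    {"name": "Bob", "phones": ["555-3"]},
]
flat_rows = unnest(authors, "phones")
renested = nest(flat_rows, "phones")
```

Note that a row with an empty set disappears under unnest and cannot be recovered by nest, so the two operators are not exact inverses. A rewrite rule that cancels a nest/unnest pair has to account for that.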

Query Optimization Revisited

– Existing algebra operators revisited: SELECT
– Where clause expressions can be expensive
– So SELECT pushdown may be bad heuristic

Selection Condition Rewriting

EXAMPLE:
(tuple.attribute < 50)
– Only CPU time (on the fly)
(tuple.location OVERLAPS lake-object)
– Possibly complex CPU-heavy computations
– May involve both IO and CPU costs

State-of-art:
– consider reduction factor only

Now, we must consider both factors:
– Cost factor: dramatic variations
– Reduction factor: unrelated to cost factor

Operator Ordering

[Figure: ordering of two operators, op1 and op2]

Ordering of SELECT Operators

– Cost factor: dramatic variations
– Reduction factor: orthogonal to cost factor
– We want: maximal reduction and minimal cost
– Rank(operator) = (reduction) * (1/cost)
– Order operators by increasing ‘rank’
– High rank (good) -> low in cost, and large reduction
– Low rank (bad) -> high in cost, and small reduction
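A small sketch of rank-based ordering. It assumes independent predicates and reads "reduction" as the fraction of tuples a predicate eliminates; the predicate names and numbers are illustrative. Under that reading the highest-rank predicate is applied first, and the code checks this against a brute-force search over all orderings.

```python
from itertools import permutations


def rank(pred):
    """Rank = reduction * (1 / cost)."""
    return pred["reduction"] / pred["cost"]


def expected_cost_per_tuple(preds):
    """Cost of running the predicates in order; later ones only see survivors."""
    total, surviving = 0.0, 1.0
    for pred in preds:
        total += surviving * pred["cost"]
        surviving *= 1.0 - pred["reduction"]
    return total


predicates = [
    {"name": "location OVERLAPS lake", "reduction": 0.9, "cost": 100.0},
    {"name": "cheap_check", "reduction": 0.2, "cost": 2.0},
    {"name": "attribute < 50", "reduction": 0.5, "cost": 1.0},
]

# Apply the highest-rank predicate first.
by_rank = sorted(predicates, key=rank, reverse=True)
rank_cost = expected_cost_per_tuple(by_rank)

# Brute force over all orderings for comparison.
best_cost = min(expected_cost_per_tuple(order) for order in permutations(predicates))
```

Here the expensive OVERLAPS predicate runs last even though it removes the most tuples, because its cost dominates. Considering the reduction factor alone would have placed it first.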

Access Methods ( on what ?)

Indexes that are ADT specific
Indexes on navigation path
Indexes on methods, not just on columns
Indexes over collection hierarchies (trade-offs)
Indexes for new WHERE clause expressions: not just =, <, >, but also “overlaps”

Registering New Index (to Optimizer)

What WHERE conditions it supports
Estimated cost for “matching tuple”
– Given by index designer (user?)
– Monitor statistics; even construct test plans

Estimation of reduction factors/join factors:
Register auxiliary function to estimate factor
Provide simple defaults
Estimation of method costs (~IO/CPU)
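One way to picture index registration is a small catalog that the optimizer consults. Everything here (class name, fields, the default factor of 0.1) is a hypothetical sketch, not the catalog of an actual DBMS.

```python
class IndexRegistry:
    """Hypothetical optimizer catalog of user-registered index types."""

    def __init__(self, default_reduction_factor=0.1):
        self._entries = {}
        self._default = default_reduction_factor  # simple default, assumed value

    def register(self, name, operators, cost_per_match, estimator=None):
        """Record which WHERE operators the index supports and how to cost it."""
        self._entries[name] = {
            "operators": set(operators),
            "cost_per_match": cost_per_match,  # supplied by the index designer
            "estimator": estimator,            # optional auxiliary function
        }

    def candidates(self, operator):
        """Indexes usable for a WHERE condition with this operator."""
        return sorted(n for n, e in self._entries.items() if operator in e["operators"])

    def reduction_factor(self, name, operator, constant):
        """Use the registered estimator if there is one, else the default."""
        estimator = self._entries[name]["estimator"]
        if estimator is None:
            return self._default
        return estimator(operator, constant)


registry = IndexRegistry()
registry.register("rtree_on_location", ["overlaps", "contains"], 3.0,
                  estimator=lambda op, const: 0.05)
registry.register("btree_on_age", ["=", "<", ">"], 1.0)

usable = registry.candidates("overlaps")
spatial_factor = registry.reduction_factor("rtree_on_location", "overlaps", "lake")
default_factor = registry.reduction_factor("btree_on_age", "<", 50)
```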

Methods

Dynamic linking of methods (outside DB)
Overriding methods for type hierarchy
Use of “methods” with implied semantics
Incorporation of methods into query process: termination?
“Untrusted” methods: methods may corrupt server or modify DB content (side effects)
Handling of “untrusted” methods:
– restrict language; interpret vs compile; separate address space from DB server

Query Optimization with Methods

Estimation of “costs” of method predicates

Optimization of Method execution:
– Similar idea as handling correlated nested subqueries; must recognize repetition and rewrite physical plan.
– Provide some level of precomputation and reuse

Optimization of Method execution:
– 1. If called on same input, cache that one result
– 2. If on full column, presort column first (groupby)
– 3. Or, precompute results of methods for each possible value in domain and put them in a hash table: val → fct(val). Look up in the hash table during query processing, or even join with it, instead of recomputing fct(val).
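The caching and precomputation options can be sketched as below, assuming the method is deterministic and free of side effects. The helper names and the squaring function standing in for an expensive method are illustrative.

```python
def counting(method):
    """Wrap a method so we can observe how often it is really invoked."""
    counter = {"calls": 0}

    def wrapped(val):
        counter["calls"] += 1
        return method(val)

    return wrapped, counter


def apply_with_cache(column, method):
    """Option 1: call the method once per distinct input and reuse the result."""
    cache = {}
    results = []
    for val in column:
        if val not in cache:
            cache[val] = method(val)
        results.append(cache[val])
    return results


def precompute(domain, method):
    """Option 3: build the val -> fct(val) hash table ahead of query time."""
    return {val: method(val) for val in domain}


expensive, counter = counting(lambda v: v * v)  # stand-in for a costly method
column = [3, 1, 3, 3, 1]
results = apply_with_cache(column, expensive)

table = precompute(range(4), lambda v: v * v)
looked_up = [table[v] for v in column]  # pure lookups, no method calls
```

Precomputation only pays off when the domain is small relative to the number of tuples. Caching works for any domain but needs memory proportional to the number of distinct inputs seen.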

Query Processing

User-defined aggregate functions:
– E.g., “second largest” or “second yellowest”

Distributive aggregates: incremental computation

Provide:
– Initialize(): set up state space
– Iterate(): per tuple update the state
– Terminate(): compute final result based on state, and clean up state

For example: “second largest”
– Initialize(): 2 fields
– Iterate(): per tuple compare numbers
– Terminate(): return the second-largest value and remove the 2 fields
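The three-call protocol for "second largest" could be written like this. The sketch assumes duplicates count as one value and that the aggregate returns None when fewer than two distinct values were seen; both are choices made here, not stated on the slide.

```python
class SecondLargest:
    """User-defined aggregate following Initialize / Iterate / Terminate."""

    def initialize(self):
        # State space: the 2 fields.
        self.largest = None
        self.second = None

    def iterate(self, value):
        # Per tuple: compare against the two stored numbers.
        if self.largest is None or value > self.largest:
            self.second = self.largest
            self.largest = value
        elif value != self.largest and (self.second is None or value > self.second):
            self.second = value

    def terminate(self):
        # Compute the final result, then clean up the state.
        result = self.second
        self.largest = None
        self.second = None
        return result


def run_aggregate(aggregate, values):
    """What the query executor would do for one group of tuples."""
    aggregate.initialize()
    for value in values:
        aggregate.iterate(value)
    return aggregate.terminate()


answer = run_aggregate(SecondLargest(), [7, 3, 9, 9, 4])
```

The state stays at two fields no matter how many tuples are processed, which is what makes the incremental evaluation cheap.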

Following Disk Pointers?

Complex object structures with object pointers may exist (~ disk pointers)
Navigate complex object into memory for a long-running transaction, as in CAD design
What to do about “pointers” between subobjects or related objects?
– Swizzle = replace OID references by in-memory pointers, and unswizzle back at end.
– Issues: in-memory table of OIDs and their state; indicate in each object pointer via a bit whether it is swizzled.
– Different policies for swizzling: on access, attached to object brought in, etc.
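A sketch of swizzling on access. A dictionary plays the role of the disk, and the class and field names are assumptions for illustration only.

```python
class Ref:
    """An object pointer: an OID plus a bit saying whether it is swizzled."""

    def __init__(self, oid):
        self.oid = oid
        self.target = None     # in-memory pointer once swizzled
        self.swizzled = False  # the per-pointer bit


class ObjectCache:
    """In-memory table of OIDs for objects already brought in from disk."""

    def __init__(self, disk):
        self.disk = disk       # hypothetical stand-in: dict of OID -> object
        self.resident = {}
        self.loads = 0

    def load(self, oid):
        if oid not in self.resident:
            self.loads += 1    # counts simulated disk reads
            self.resident[oid] = self.disk[oid]
        return self.resident[oid]

    def deref(self, ref):
        """Swizzle on first access; later accesses follow the memory pointer."""
        if not ref.swizzled:
            ref.target = self.load(ref.oid)
            ref.swizzled = True
        return ref.target

    def unswizzle(self, ref):
        """Turn the pointer back into a plain OID before writing back."""
        ref.target = None
        ref.swizzled = False
        return ref.oid


disk = {"oid:1": {"name": "wheel"}}
cache = ObjectCache(disk)
ref = Ref("oid:1")
first = cache.deref(ref)
second = cache.deref(ref)
was_swizzled = ref.swizzled
stored_oid = cache.unswizzle(ref)
```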

Models of Persistence

Different models of persistence for OODB implementations:

Parallel type systems:
– E.g., int and dbint
– User must make decision at object creation time
– Allow for user control by “casting” types

Persistence by container management:
– Objects must be placed into “persistent containers” such as relations in order to stay around
– E.g., Insert o into Collection MyBooks;
– Could be rather dynamic control without casting

Persistence by reachability:
– Use global variable names to refer to objects and structures
– Objects referenced by other objects that are reachable by the application are, by transitivity, also persistent.
– Need garbage collection
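Persistence by reachability amounts to a graph traversal from the named roots. In this sketch the reference graph is a plain dictionary and the object names are made up for illustration.

```python
def reachable(roots, references):
    """All OIDs reachable from the persistent roots by following references."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(references.get(oid, ()))
    return seen


def collect_garbage(all_oids, roots, references):
    """Objects that are not persistent and can be reclaimed."""
    return set(all_oids) - reachable(roots, references)


# Hypothetical object graph: OID -> OIDs it references.
references = {
    "MyBooks": ["b1", "b2"],
    "b1": ["author1"],
    "tmp": ["b2"],
}
all_oids = {"MyBooks", "b1", "b2", "author1", "tmp", "orphan"}

persistent = reachable(["MyBooks"], references)
garbage = collect_garbage(all_oids, ["MyBooks"], references)
```

The object `tmp` references a persistent object but is not itself reachable from a root, so it is collected. Persistence flows along references from the roots, not toward them.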

Summary

A lot of work to get there: from physical database design/layout issues up to logical query optimizer extensions

ORDB: reuses existing implementation base and incrementally adds new features on top (but relation is first-class citizen)
