By: Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang

Presented by: Calvin R Noronha (1000578539), Deepak Anand (1000603813)


AGENDA

Motivation
Online Aggregation
Basic Approach
Goals
Building an Online Aggregation system
Optimization
Running confidence intervals
Conclusion
Future work

Motivation

Aggregation in traditional databases:

Query execution involves long delays, and the user is forced to wait without feedback until the query completes.

Users want to see the aggregation information right away.

Aggregation queries are typically used to get a “rough picture”, but they are computed with painstaking precision.

This paper suggests the following change: perform aggregation online so that:

Progress can be observed.
Execution of the queries can be controlled on the fly.

An Example

Consider the following example:

SELECT AVG(final_grade)
FROM grades
WHERE course_name = 'CS186';

If there is no index on the course_name attribute, then this query scans the entire grades table before returning the result.

AVG
--------------
| 2.631046 |
--------------

An alternative approach

Running aggregate: an estimate of the final result based on the records retrieved so far.

Running confidence interval: for example, 2.6336 +/- 0.0652539 with 95% probability.

Progress Bar

Online Aggregation Interface with Groups

If the records are retrieved in random order, a good approximate result can be obtained.

We can stop sampling once the length of the confidence interval becomes sufficiently small.
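The stopping rule above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it assumes a large-sample (normal-approximation) confidence interval, and the names `online_avg`, `target_half_width` and `min_n` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def online_avg(values, confidence=0.95, target_half_width=0.05, min_n=30, seed=0):
    """Return (estimate, half_width, n_read), stopping early when possible.

    Assumption: a normal-approximation interval, only trusted once at least
    `min_n` records have been read.
    """
    order = list(values)
    if not order:
        raise ValueError("values must not be empty")
    random.Random(seed).shuffle(order)  # simulate random-order retrieval
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%

    n, total, total_sq = 0, 0.0, 0.0
    half_width = math.inf
    for x in order:
        n += 1
        total += x
        total_sq += x * x
        if n >= min_n:
            mean = total / n
            # Sample variance from running sums (clamped against rounding error).
            var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
            half_width = z * math.sqrt(var / n)
            if half_width <= target_half_width:
                break  # the interval is short enough: stop sampling
    return total / n, half_width, n
```

Called on a large list of grades, the function would typically return after reading only a fraction of the records, with an interval no wider than the requested half-width.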

Consider a GROUP BY query with 6 groups in the output

The user is presented with 6 outputs and 6 “stop-sign” buttons.

The stopping condition can be set on the fly.

Easy to understand for non-statistical users.

Stop Button

Usability goals

Continuous observation: Users can observe the processing in the GUI and get a sense of the current level of precision.

Control of Time/Precision: Users can terminate processing at any time, at a fine granularity (a trade-off between time and precision).

Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.

Performance goals

Minimum time to accuracy: Minimize the time required to produce a useful estimate of the final answer.

Minimum time to completion: Minimize the time required to produce the final answer.

Pacing: The running aggregates are updated at a regular rate, to guarantee a smooth and continuously improving display.

Building an Online Aggregation System

There are two approaches that can be taken:

1. A naive approach:

• Trivial implementation without modification to POSTGRES.
• User-defined functions can be written in C.
• Cannot be used with the GROUP BY clause.

2. Modifying the DBMS:

• Online aggregation is difficult to implement as a user-level addition.
• The database engine is modified to support online aggregation.

Example query for the naive approach, using user-defined aggregate functions:

SELECT running_avg(final_grade),
       running_confidence(final_grade),
       running_interval(final_grade)
FROM grades;

Random Access to Data

Estimates of the running aggregates are accurate when records are retrieved in random order.

1. Heap scans
• Simple heap scans can be effective with traditional heap file access methods, where records are stored in unspecified order.
• A different access method is needed when the aggregate attributes are correlated with the order in which the heap was formed.

2. Index scans
• Can be used if the aggregate attributes are not the ones used for indexing.

3. Sampling from indices
• Techniques for pseudo-random sampling from various index structures can be used [Olken’s work].
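To see why retrieval order matters, the following Python sketch compares the first running average taken from a heap whose physical order is correlated with the aggregated attribute against one taken from the same data in random order. The data is synthetic and the helper `running_averages` is a hypothetical name used only for illustration.

```python
import random


def running_averages(values, every=1000):
    """Yield (n, running average) after every `every` values read."""
    total = 0.0
    for n, x in enumerate(values, start=1):
        total += x
        if n % every == 0:
            yield n, total / n


rng = random.Random(42)
# Assumption: a heap stored in ascending grade order, i.e. the physical order
# is fully correlated with the attribute being aggregated.
clustered = sorted(rng.uniform(0.0, 4.0) for _ in range(10_000))
true_avg = sum(clustered) / len(clustered)

shuffled = clustered[:]
rng.shuffle(shuffled)  # stand-in for random-order retrieval

first_clustered = next(running_averages(clustered))[1]
first_random = next(running_averages(shuffled))[1]
print(f"true={true_avg:.3f} clustered={first_clustered:.3f} random={first_random:.3f}")
```

After the first 1,000 records, the estimate from the clustered scan reflects only the lowest grades, while the estimate from the shuffled scan is already close to the true average.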

Non-blocking/Fair access GROUP BY and DISTINCT

Groups should receive updates in a fair manner.

Solution: sorting?

No, because sorting blocks. Hash-based techniques must be used.

Pros: non-blocking.
Cons: does not perform well as the number of groups grows.

Solution: hybrid hashing.

Optimized version: Hybrid Cache.

For DISTINCT columns, a similar hashing technique can be used.
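A hash-based running GROUP BY can be sketched as follows. This is a simplified in-memory illustration that assumes all groups fit in one Python dictionary; it does not model hybrid hashing or the Hybrid Cache, and `running_group_avg` is a hypothetical name.

```python
from collections import defaultdict


def running_group_avg(rows, every=2):
    """Consume (group_key, value) pairs and yield running per-group averages.

    A snapshot is produced after every `every` input rows, so output starts
    flowing long before the input is exhausted (no blocking sort).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, (key, value) in enumerate(rows, start=1):
        sums[key] += value
        counts[key] += 1
        if i % every == 0:
            yield {k: sums[k] / counts[k] for k in sums}


rows = [("CS186", 3.0), ("CS162", 2.0), ("CS186", 4.0), ("CS162", 4.0)]
snapshots = list(running_group_avg(rows, every=2))
```

Each snapshot holds the current estimate for every group seen so far, which is what a per-group display would refresh from.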

Index Striding

Updates for groups with few members will be very infrequent.

For a fair GROUP BY:
Read tuples in round-robin fashion (a tuple from group 1, a tuple from group 2, …).
This is supported by a technique called index striding.

What is Index Striding?

Additional advantages:
The updating rate of each group can be controlled.
Processing of a particular group can be stopped.
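The round-robin idea can be sketched in Python. This is an illustration only: a dictionary of lists stands in for per-key scans over an index on the grouping column, and the function name `index_stride` and its `weights` and `stopped` parameters are hypothetical.

```python
from collections import deque


def index_stride(index, weights=None, stopped=None):
    """Yield (group, value) pairs round-robin across groups.

    index:   dict mapping group key -> list of values (stand-in for an index)
    weights: optional dict, group -> tuples fetched per round (default 1)
    stopped: optional set of groups to drop; the caller may add to it
             between fetches to stop a group on the fly
    """
    weights = weights or {}
    stopped = stopped if stopped is not None else set()
    cursors = deque((group, iter(values)) for group, values in index.items())
    while cursors:
        group, cursor = cursors.popleft()
        if group in stopped:
            continue  # this group's scan is abandoned
        exhausted = False
        for _ in range(weights.get(group, 1)):
            try:
                yield group, next(cursor)
            except StopIteration:
                exhausted = True
                break
        if not exhausted:
            cursors.append((group, cursor))  # revisit the group next round


index = {"A": [1, 2, 3, 4], "B": [10]}
fair = list(index_stride(index))
faster_a = list(index_stride(index, weights={"A": 2}))
```

With equal weights, the small group `B` is served in the first round instead of waiting behind all of `A`; raising a weight speeds up that group relative to the others.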

POSTGRES with index striding

Speed control

Non-blocking Join Algorithms

For interactive display of online aggregation, avoid algorithms that block.

Sort-merge join: unacceptable, as sorting is a blocking operation.

Merge join: OK, but produces sorted output.

Hybrid hash join: not good if the inner relation is large.

Nested loops join: always non-blocking, but too slow when the inner relation is large and un-indexed.

An optimizer must be used to choose between these strategies.

Optimization

Avoid sorting unless explicitly requested by the user.

Blocking sub-operations have costs and appropriate costs should be considered.

Cost function = f(t_o) + g(t_d)

There are 2 components in the cost function:
dead time (t_d): time spent doing “invisible” work
output time (t_o): time spent producing output

Preference is given to plans that maximize user control (such as index striding).

Extended aggregate functions

The standard set of aggregate functions must be extended.

Aggregate functions that provide running estimates must be written.

Running computation:
SUM, COUNT, AVG: straightforward.
VAR, STD DEV: can be implemented using incremental (single-pass) algorithms.

Aggregate functions returning running confidence intervals must be defined.
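The slides do not name a specific algorithm for running VAR and STD DEV. One well-known single-pass option is Welford's algorithm, sketched below in Python; the class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Running COUNT, AVG, VAR and STD DEV using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std_dev(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for grade in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(grade)  # estimates are available after every update
```

Because each update touches only three numbers, the running estimates can be read at any point during the scan without revisiting earlier records.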

API

The current API uses built-in methods, e.g.:

StopGroup(cursor, groupval)
speedUpGroup(cursor, groupval)
slowDownGroup(cursor, groupval)
setSkipFactor(cursor_name, integer)

Skip Factor

Statistical Issues

Running confidence interval

Given an estimate, with probability p we are within ε of the right answer μ.

A large value of ε means that the records seen so far may not be sufficiently representative of the entire database, and the current estimate of the result may be far from the final result.

Types of running confidence intervals:

Conservative confidence intervals: valid for any n >= 1 (n = number of tuples retrieved); the interval is guaranteed to contain the final answer with probability >= p [based on Hoeffding’s inequality].
Large-sample confidence intervals
Deterministic confidence intervals

The running confidence interval can be dynamically adjusted depending on the value of n.
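For a running average, a conservative half-width can be derived from Hoeffding's inequality: if every value lies in a known range [lo, hi], then P(|estimate - μ| >= ε) <= 2 exp(-2 n ε² / (hi - lo)²), and solving for ε at confidence p gives the formula in the sketch below. The function name and parameters are hypothetical, and the known-bounds requirement is an assumption of this illustration.

```python
import math


def hoeffding_half_width(n, p, lo, hi):
    """Half-width eps with P(|running_avg - mu| <= eps) >= p.

    Assumes every aggregated value lies in [lo, hi]; valid for any n >= 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return (hi - lo) * math.sqrt(math.log(2.0 / (1.0 - p)) / (2.0 * n))


# Grades bounded in [0, 4], 95% confidence, after 1,000 tuples.
eps_1000 = hoeffding_half_width(1000, 0.95, 0.0, 4.0)
```

The half-width shrinks in proportion to 1/sqrt(n), so reading four times as many tuples halves the interval. The bound uses only the range of the data, which is why it tends to be wider than a large-sample interval at the same n.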

Performance Issues

Conclusion

An interactive, intuitive and user-controllable approach to aggregation is needed.

This can be achieved by significant extensions to the database engine.

These extensions satisfy the usability and performance goals.

Ability to produce statistical confidence intervals for running aggregates.

Future work

Better UI

Nested Queries

Control without Indices

Checkpointing / Continuation

TIME TO ASK QUESTIONS
