PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees

Shimin Chen* Phillip B. Gibbons* Suman Nath+

*Intel Labs Pittsburgh +Microsoft Research

Online Aggregation

• Data warehouse and business intelligence

– Fast-growing multi-billion dollar market

• Interactive ad-hoc queries on big data

– Important for detecting new trends

– Fast response time hard to achieve

• One promising approach: Online Aggregation (OLA)

– Provides early representative results for aggregate queries (sum, count, avg, etc.)

– For example, “average is 123.4 ± 5.6 with 95% confidence”

• Essential to OLA: non-blocking join algorithm


[Hellerstein et al. 97]
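The kind of early result quoted above ("average is 123.4 ± 5.6 with 95% confidence") can be illustrated with a minimal sketch for a single-table average. It assumes the rows arrive in random order, uses a textbook normal-approximation interval, and ignores finite-population corrections. The function name `running_average` and the synthetic data are hypothetical and not taken from the slides or the paper; estimating over a join needs the more involved formulas discussed later in the talk.

```python
import math
import random
from statistics import NormalDist


def running_average(values, confidence=0.95):
    """Yield (rows_seen, mean, half_width) after each value.

    Assumes `values` arrive in random order, so every prefix is a random sample.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        if count > 1:
            std_err = math.sqrt(m2 / (count - 1) / count)
            yield count, mean, z * std_err
        else:
            yield count, mean, math.inf  # no interval from a single row


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(123.4, 50.0) for _ in range(100_000)]  # synthetic column
    for count, mean, half in running_average(data):
        if count in (100, 10_000, 100_000):
            print(f"after {count} rows: average is {mean:.1f} ± {half:.1f}")
```

The interval narrows as more rows are seen, which is what lets a user stop a query early once the estimate is accurate enough.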


Non-Blocking Join for OLA

• OLA assumption: relations are in random order

[Figure: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back; estimates are based on current results.]

Design Goals of Non-Blocking Joins

Design goals:

• Fast, representative early results

• Good end-to-end performance

User may find:

• Wrong query: stop early

• Accurate enough: stop early

• Slow convergence: wait longer

– High variance, high selectivity, high group counts, data skews …

• Need the full, accurate result: finish query

Two Metrics in Algorithm Analysis

• Good end-to-end performance: Total I/Os

• Fast early results: Result Rate = (Newly covered area × selectivity) / (I/Os for covering the new area)

– Join: check all pairs of records from A and B

– Early: before completely reading A and B

[Figure: join space with axes "records from A" and "records from B"; the region labeled "new" is the newly covered area.]

Design Space

[Figure: design space. One axis: Total I/O Cost (High, Low). Other axis: Early Representative Result Rate (High, Low). Algorithms plotted: Ripple [Haas & Hellerstein’99], Hash Ripple [Luo, et al’02], SMS [Jermaine, et al’05], DBO [Jermaine, et al’07], GRACE [Kitsuregawa, et al ’83]. Additional labels: "Ideal" and "PR-Join targets".]

Performance Result Preview


Near-optimal total I/O cost

Higher early result rate


Outline

• Introduction

• PR-Join (Partitioned expanding Ripple Join) Algorithm

• Evaluation

• Conclusion


Background: Ripple Join

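[Figure: join space with axes "records from A" and "records from B"; along each axis, earlier records are labeled "spilled" and the latest records "new".]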

For each ripple:

• Read new records from A and B; check for matches

• Read spilled records; check for matches with new records

• Spill new to disk


[Haas & Hellerstein’99]
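The three per-ripple steps above can be sketched in a few lines. This is a simplified in-memory illustration, not the authors' implementation: plain Python lists stand in for the on-disk spill files, both relations advance by the same hypothetical `width` per ripple, rows are tuples whose first field is the join key, and matches are found by nested loops rather than hash tables.

```python
def ripple_join(a_rows, b_rows, width, key=lambda row: row[0]):
    """Yield matching (a, b) pairs ripple by ripple (width must be >= 1)."""
    spilled_a, spilled_b = [], []  # stand-ins for records already spilled to disk
    pos = 0
    while pos < max(len(a_rows), len(b_rows)):
        # Step 1: read new records from A and B.
        new_a = a_rows[pos:pos + width]
        new_b = b_rows[pos:pos + width]
        # New A records are checked against new B and against spilled B.
        for a in new_a:
            for b in new_b + spilled_b:
                if key(a) == key(b):
                    yield a, b
        # Step 2: spilled A records are checked against new B records.
        for a in spilled_a:
            for b in new_b:
                if key(a) == key(b):
                    yield a, b
        # Step 3: spill the new records.
        spilled_a.extend(new_a)
        spilled_b.extend(new_b)
        pos += width
```

Every pair of records is compared exactly once: in the ripple where the later of the two arrives. That is why the results produced so far always cover a growing square of the join space.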



Observations of Ripple Join

• Total I/Os: O(N^2)

– N = total # of input pages in A and B

– I/Os of ripples form an arithmetic series

• Result rate of a ripple is higher if the ripple is wider

– Increase ripple width

• But ripple width is limited by the memory size

Result Rate = (Newly covered area × selectivity) / (I/Os for covering the new area)

– Newly covered area: super-linear growth

– I/Os for covering the new area: grows linearly
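The O(N^2) total can be checked with a short derivation under a simplified cost model. The model is an assumption made here for illustration, not the analysis in the paper: every ripple covers a fixed w pages in total, there are k = N/w ripples, and ripple i reads its w new pages, re-reads the (i-1)w pages spilled so far, and writes its w new pages.

```latex
\begin{align*}
\text{I/O}_i &= \underbrace{w}_{\text{read new}}
              + \underbrace{(i-1)\,w}_{\text{read spilled}}
              + \underbrace{w}_{\text{spill new}}
              = (i+1)\,w \\
\text{Total I/Os} &= \sum_{i=1}^{k} (i+1)\,w
              = w \cdot \frac{k(k+3)}{2}
              = \frac{N^2}{2w} + \frac{3N}{2}
              \qquad \text{with } k = N/w
\end{align*}
```

The per-ripple costs (i+1)w form an arithmetic series, so for a fixed width w the total grows as O(N^2). A larger w shrinks the quadratic term, but w is bounded by the memory size.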


PR-Join Idea 1: Multiplicatively Expanding Ripples

• Total I/Os: O(N), linear

– I/Os of ripples form a geometric series

• Higher result rate:

– Wider ripple leads to higher result rate

• But must overcome memory size limitation
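The difference between fixed-width and multiplicatively expanding ripples can be seen in a small cost-model sketch. The model is an assumption for illustration only: each ripple reads its new pages, re-reads every page spilled so far, then spills the new pages, and memory limits are ignored. The helper names `total_ios`, `fixed_widths` and `expanding_widths` are hypothetical.

```python
def total_ios(widths):
    """Total page I/Os: read new + re-read all spilled + spill new, per ripple."""
    spilled = 0
    total = 0
    for w in widths:
        total += w + spilled + w
        spilled += w
    return total


def fixed_widths(n_pages, width):
    """Ripple widths when every ripple covers the same number of pages."""
    widths = []
    remaining = n_pages
    while remaining > 0:
        w = min(width, remaining)
        widths.append(w)
        remaining -= w
    return widths


def expanding_widths(n_pages, first_width, factor=2):
    """Ripple widths chosen so the covered prefix grows by `factor` each ripple."""
    if factor <= 1 or first_width < 1:
        raise ValueError("need factor > 1 and first_width >= 1")
    widths = []
    covered = 0
    target = first_width
    while covered < n_pages:
        target = min(target, n_pages)
        widths.append(target - covered)
        covered = target
        target *= factor
    return widths


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        fixed = total_ios(fixed_widths(n, 10))
        expanding = total_ios(expanding_widths(n, 10, 2))
        print(f"N={n}: fixed-width {fixed} I/Os, expanding {expanding} I/Os")
```

Under this model, doubling N roughly quadruples the fixed-width cost but only doubles the expanding cost, because the re-read costs form a geometric series dominated by the last ripple.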



PR-Join Idea 2: Hash Partitioning

• Each partition < memory

• Every join invocation performs a ripple on a partition

– Estimation is updated after every join invocation

– Much faster user responses

– Statistically sound

[Figure: join space partitioned on the join key; two regions are labeled "empty".]
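Why partitioning on the join key lets each join invocation work on one partition at a time can be shown with a small sketch. It is an illustration under assumptions, not the PR-Join implementation: rows are tuples whose first field is the join key, the helper names are hypothetical, and each per-partition join is a plain nested loop rather than a ripple.

```python
from collections import defaultdict


def hash_partition(rows, num_partitions, key=lambda row: row[0]):
    """Group rows by hash(join key) modulo the number of partitions."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def partitioned_join(a_rows, b_rows, num_partitions, key=lambda row: row[0]):
    """Yield (partition_id, matches), one join invocation per partition."""
    a_parts = hash_partition(a_rows, num_partitions, key)
    b_parts = hash_partition(b_rows, num_partitions, key)
    for p in range(num_partitions):
        # Records with equal keys hash to the same partition, so partition p
        # of A only needs to be checked against partition p of B.
        matches = [
            (a, b)
            for a in a_parts.get(p, [])
            for b in b_parts.get(p, [])
            if key(a) == key(b)
        ]
        yield p, matches
```

Pairs drawn from different partitions can never match, so those regions of the join space contribute nothing and never need to be read together. Results become available after every per-partition invocation.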



Statistical Guarantees

• Idea: hash partitioning yields disjoint sub-spaces

– Stratified sampling in statistics

• Statistical estimate:

1) Ripple join formula for every partition

2) Stratified sampling formula to combine estimates from partitioned ripples

[Figure: join space partitioned on the join key; two regions are labeled "empty".]
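The second step, combining per-partition estimates, can be sketched with the textbook stratified-sampling rule for a SUM- or COUNT-style aggregate: the strata totals add, and so do their variances. The sketch assumes the per-partition estimators are independent and approximately normal, and that each partition's estimate and variance have already been obtained from the ripple join formula, which is not reproduced here. The function name `combine_strata` is hypothetical; the exact formulas used by PR-Join are in the paper.

```python
import math
from statistics import NormalDist


def combine_strata(partition_estimates, confidence=0.95):
    """Combine per-partition (estimate, variance) pairs into (total, half_width).

    Assumes independent, approximately normal per-partition estimators of a
    SUM- or COUNT-style aggregate over disjoint partitions.
    """
    estimates = list(partition_estimates)
    total = sum(est for est, _ in estimates)
    variance = sum(var for _, var in estimates)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return total, z * math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical per-partition (estimate, variance) pairs.
    total, half = combine_strata([(10.0, 4.0), (20.0, 5.0)])
    print(f"estimated total is {total:.1f} ± {half:.1f} with 95% confidence")
```

Because the estimate is refreshed from whatever partitions have been processed so far, it can be updated after every join invocation.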



Comparing Analytical Performance

Algorithm        Early Result Rate
Symmetric Hash   1 (when data fit in memory)
Hash Ripple      0.5
SMS              0.6
Two-Way DBO      1.2
Ripple           1, 1.25, 1.40, 1.50, …, 2
PR-Join          1, 1.7, 3.2, 6.2, 12.2, … …

(Parameter setting details in paper)

Outline

• Introduction

• PR-Join Algorithm

• Evaluation

• Conclusion



Non-Blocking Join for OLA

[Figure: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back; estimates are based on current results. Storage labels: "Hard disks" and "Hard disk or SSD".]

Disk as Temp Storage

• 10GB joins 10GB

• 500MB memory

PR-Join achieves much better end-to-end performance than Ripple Join



Marginal Result Rate

PR-Join achieves an order of magnitude higher result rate than Ripple Join


Disk as temp storage

SSD as Temp Storage

Using SSD, PR-Join achieves near-optimal I/O costs

• 10GB joins 10GB

• 500MB memory

Temp I/Os are almost completely overlapped with I/Os to read input



More Details in Paper

• Joining finite data streams:

– PR-Join can be easily used for joining finite data streams

– Compared with state-of-the-art algorithm (RPJ [Tao et al.’05])

– PR-Join achieves better performance

• Analysis of non-blocking join algorithms for OLA

• PR-Join parameter choices

• Handling skews

• More experimental results

(see us at the plenary session)



Conclusions

• In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)

• PR-Join for Online Aggregation:

– Provides statistical guarantee

– An order of magnitude higher result rate than prior approach

– Near-optimal total I/O cost

• PR-Join for finite data streams:

– Better performance than state-of-the-art algorithm

Thank you!

[email protected]

