PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees
Shimin Chen* Phillip B. Gibbons* Suman Nath+
*Intel Labs Pittsburgh +Microsoft Research
Online Aggregation
• Data warehouse and business intelligence
– Fast growing multi-billion dollar market
• Interactive ad-hoc queries on big data
– Important for detecting new trends
– Fast response time hard to achieve
• One promising approach: Online Aggregation (OLA) [Hellerstein et al. 97]
– Provides early representative results for aggregate queries (sum, count, avg, etc.)
– For example, “average is 123.4 ± 5.6 with 95% confidence”
• Essential to OLA: non-blocking join algorithm
PR-Join: A Non-Blocking Join Achieving Higher Early Result Rate. Shimin Chen, Phillip B. Gibbons, Suman Nath
Non-Blocking Join for OLA
• OLA assumption: relations are in random order
[Figure: Relation A and Relation B are read into main memory; records are spilled to temporary storage and read back; estimates are based on current results.]
Design Goals of Non-Blocking Joins
Design goals:
• Fast, representative early results
• Good end-to-end performance
User may find:
• Wrong query: stop early
• Accurate enough: stop early
• Slow convergence: wait longer
– High variance, high selectivity, high group counts, data skews …
• Need the full, accurate result: finish query
Two Metrics in Algorithm Analysis
• Good end-to-end performance: Total I/Os
• Fast early results: Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
– Join: check all pairs of records from A and B
– Early: before completely reading A and B
[Figure: join space with records from A on one axis and records from B on the other; the newly covered ("new") area is marked.]
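As a minimal sketch, the result-rate metric on this slide can be written as a one-line function. The function name, the parameter names, and the choice to measure area in record pairs are assumptions made for illustration; the slide only gives the ratio itself.

```python
def result_rate(new_area: float, selectivity: float, ios: float) -> float:
    """Expected join results produced per I/O while covering a new area.

    new_area    -- number of (A, B) record pairs newly checked (assumed unit)
    selectivity -- fraction of checked pairs that actually join
    ios         -- I/Os spent to cover that new area
    """
    if ios <= 0:
        raise ValueError("ios must be positive")
    return new_area * selectivity / ios
```

For instance, covering 1,000,000 new record pairs at selectivity 0.0001 with 50 I/Os gives about 2 results per I/O under this definition.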
Design Space
[Figure: design space of non-blocking joins. Axes: Total I/O Cost (High/Low) and Early Representative Result Rate (High/Low).]
Algorithms plotted:
• Ripple [Haas & Hellerstein’99]
• Hash Ripple [Luo, et al’02]
• GRACE [Kitsuregawa, et al ’83]
• SMS [Jermaine, et al’05]
• DBO [Jermaine, et al’07]
Also marked: Ideal; PR-Join targets
Performance Result Preview
Near-optimal total I/O cost
Higher early result rate
Outline
• Introduction
• PR-Join (Partitioned expanding Ripple Join) Algorithm
• Evaluation
• Conclusion
Background: Ripple Join
[Figure: join space with records from A and records from B on the two axes; each ripple adds a "new" region alongside the previously "spilled" regions.]
For each ripple:
• Read new records from A and B; check for matches
• Read spilled records; check for matches with new records
• Spill new records to disk
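The three per-ripple steps above can be sketched in memory as follows. This is an illustrative sketch, not the authors' implementation: the `Record` layout, the function name, and the `width` parameter are assumptions, and plain Python lists stand in for the records spilled to disk.

```python
from typing import Iterator, List, Tuple

Record = Tuple[int, str]  # (join key, payload); hypothetical record layout


def ripple_join(a: List[Record], b: List[Record],
                width: int) -> Iterator[Tuple[Record, Record]]:
    """Yield matching (a, b) pairs one ripple at a time.

    `width` is the number of new records read from each relation per ripple.
    `spilled_a` and `spilled_b` stand in for records spilled to disk.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    spilled_a: List[Record] = []
    spilled_b: List[Record] = []
    pos = 0
    while pos < max(len(a), len(b)):
        # Step 1: read new records from A and B; check for matches.
        new_a = a[pos:pos + width]
        new_b = b[pos:pos + width]
        for ra in new_a:
            for rb in new_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        # Step 2: read spilled records; check for matches with new records.
        for ra in spilled_a:
            for rb in new_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        for ra in new_a:
            for rb in spilled_b:
                if ra[0] == rb[0]:
                    yield ra, rb
        # Step 3: spill the new records.
        spilled_a.extend(new_a)
        spilled_b.extend(new_b)
        pos += width
```

Every pair of records is checked exactly once: in the ripple where the later of the two records is read. The nested loops keep the sketch short; a real operator would typically use an in-memory hash table for the matching step.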
[Haas & Hellerstein’99]
Observations of Ripple Join
• Total I/Os: O(N^2)
– N = total # of input pages in A and B
– I/Os of ripples form an arithmetic series
• Result rate of a ripple is higher if the ripple is wider
– Increase ripple width
Result Rate = (newly covered area × selectivity) / (I/Os for covering the new area)
– Newly covered area: super-linear growth
– I/Os for covering the new area: grow linearly
• But ripple width is limited by the memory size
11
PR-Join Idea 1: Multiplicatively Expanding Ripples
• Total I/Os: O(N) linear
– I/Os of ripples form a geometric series
• Higher result rate:
– Wider ripple leads to higher result rate
But must overcome memory size limitation
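To make the arithmetic-versus-geometric contrast concrete, here is a small sketch under a deliberately simplified cost model, which is an assumption and not the paper's analysis: each ripple reads its new pages once and re-reads every page covered by earlier ripples, while write costs and memory-resident pages are ignored. The function name and parameters are hypothetical.

```python
from typing import List


def ripple_ios(total_pages: int, first_width: int,
               growth: float = 1.0) -> List[int]:
    """Per-ripple I/O counts under the simplified model described above.

    growth == 1.0 keeps the ripple width fixed (arithmetic series);
    growth > 1.0 multiplies the width after every ripple (geometric series).
    """
    if total_pages <= 0 or first_width <= 0 or growth < 1.0:
        raise ValueError("invalid parameters")
    ios: List[int] = []
    covered = 0                  # pages covered by earlier ripples
    width = float(first_width)
    while covered < total_pages:
        new = min(max(1, int(width)), total_pages - covered)
        ios.append(new + covered)  # read new pages + re-read earlier pages
        covered += new
        width *= growth
    return ios
```

With `total_pages=1024` and `first_width=1`, this model gives 524,800 I/Os in total for fixed-width ripples and 3,060 when the width doubles after each ripple, which illustrates the quadratic and linear totals stated on the slides.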
PR-Join Idea 2: Hash Partitioning
• Each partition < memory
• Every join invocation performs a ripple on a partition
– Estimation is updated after every join invocation
– Much faster user responses
• Statistically sound
[Figure: join space partitioned on join key; two regions are labeled "empty".]
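The partitioning idea can be sketched as follows: records with the same join key land in the same partition, so matches only occur between same-numbered partitions of A and B. The helper names and record layout are assumptions for illustration. For brevity the sketch joins each partition pair with a nested loop, where PR-Join would perform a ripple, and it does not enforce that a partition fits in memory.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (join key, payload); hypothetical record layout


def hash_partition(records: List[Record],
                   num_partitions: int) -> List[List[Record]]:
    """Split records into partitions by hashing the join key."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[hash(rec[0]) % num_partitions].append(rec)
    return parts


def join_pairs(a: List[Record], b: List[Record]) -> List[Tuple[Record, Record]]:
    """Nested-loop equi-join on the key; a stand-in for one ripple."""
    return [(ra, rb) for ra in a for rb in b if ra[0] == rb[0]]


def partitioned_join(a: List[Record], b: List[Record],
                     num_partitions: int) -> List[Tuple[Record, Record]]:
    """Join partition i of A only with partition i of B."""
    out: List[Tuple[Record, Record]] = []
    for pa, pb in zip(hash_partition(a, num_partitions),
                      hash_partition(b, num_partitions)):
        out.extend(join_pairs(pa, pb))
    return out
```

Because different partitions never share a join key, the partition-wise results together equal the result of joining the full relations.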
Statistical Guarantees
• Idea: hash partitioning creates disjoint sub-spaces
– Stratified sampling in statistics
• Statistical estimate:
1) Ripple join formula for every partition
2) Stratified sampling formula to combine estimates from partitioned ripples
[Figure: join space partitioned on join key; two regions are labeled "empty".]
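A rough sketch of the two-step estimate for a SUM (or COUNT) aggregate is given below. It assumes the standard ripple-join point estimator, which scales the partial aggregate by the ratio of the full cross-product size to the sampled cross-product size, and then adds the per-partition estimates as stratum totals. The class and function names are hypothetical, confidence intervals are omitted, and the exact formulas used by PR-Join are in the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PartitionSample:
    seen_a: int         # records of A read so far in this partition
    total_a: int        # records of A in this partition
    seen_b: int         # records of B read so far in this partition
    total_b: int        # records of B in this partition
    partial_sum: float  # aggregate over the join results found so far


def estimate_partition_sum(p: PartitionSample) -> float:
    """Step 1: scale the partial sum up to the partition's full join space."""
    if p.total_a == 0 or p.total_b == 0:
        return 0.0  # an empty side means the partition has no join results
    if not (0 < p.seen_a <= p.total_a and 0 < p.seen_b <= p.total_b):
        raise ValueError("each non-empty partition needs a valid sample")
    scale = (p.total_a * p.total_b) / (p.seen_a * p.seen_b)
    return scale * p.partial_sum


def estimate_total_sum(parts: Iterable[PartitionSample]) -> float:
    """Step 2: add the per-partition (per-stratum) estimates."""
    return sum(estimate_partition_sum(p) for p in parts)
```

For example, a partition with 10 of 100 records of A and 20 of 200 records of B read, and a partial sum of 5.0, is scaled by 100 to an estimate of 500.0.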
Comparing Analytical Performance
Algorithm        Early Result Rate
Symmetric Hash   1 (when data fit in memory)
Hash Ripple      0.5
SMS              0.6
Two-Way DBO      1.2
Ripple           1, 1.25, 1.40, 1.50, …, 2
PR-Join          1, 1.7, 3.2, 6.2, 12.2, … …
(Parameter setting details in paper)
Outline
• Introduction
• PR-Join Algorithm
• Evaluation
• Conclusion
Non-Blocking Join for OLA
[Figure: Relation A and Relation B (on hard disks) are read into main memory; records are spilled to temporary storage (hard disk or SSD) and read back; estimates are based on current results.]
Disk as Temp Storage
• 10GB joins 10GB
• 500MB memory
PR-Join achieves much better end-to-end performance than Ripple Join
Marginal Result Rate
PR-Join achieves an order of magnitude higher result rate than Ripple Join
Disk as temp storage
SSD as Temp Storage
Using SSD, PR-Join achieves near optimal I/O costs
• 10GB joins 10GB
• 500MB memory
Temp I/Os are almost completely overlapped with I/Os to read input
More Details in Paper
• Joining finite data streams:
– PR-Join can be easily used for joining finite data streams
– Compared with state-of-the-art algorithm (RPJ [Tao et al.’05])
– PR-Join achieves better performance
• Analysis of non-blocking join algorithms for OLA
• PR-Join parameter choices
• Handling skews
• More experimental results
(see us at the plenary session)
Conclusions
• In this paper, we propose a new non-blocking join algorithm: PR-Join (Partitioned expanding Ripple Join)
• PR-Join for Online Aggregation:
– Provides statistical guarantee
– An order of magnitude higher result rate than prior approach
– Near optimal total I/O cost
• PR-Join for finite data streams:
– Better performance than state-of-the-art algorithm
Thank you!