1
Continuously Adaptive Continuous Queries (CACQ) over Streams
Samuel Madden, Mehul Shah, Joseph Hellerstein, and Vijayshankar Raman
Presented by: Bhuvan Urgaonkar
2
CACQ Introduction
- Proposed continuous query (CQ) systems are based on static plans
  - But CQs are long running
  - Initially valid assumptions become less so over time
  - Static optimizers at their worst!
- CACQ insight: apply the continuous adaptivity of eddies to continuous queries
  - Dynamic operator ordering avoids the static-optimizer danger
  - Process multiple queries simultaneously
  - Interestingly, enables sharing of work & storage
3
Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
- Results & Experiments
4
Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
- Results & Experiments
5
Motivating Applications
- “Monitoring” queries look for recent events in data streams
  - Sensor data processing
  - Stock analysis
  - Router, web, or phone events
- In CACQ, we confine our view to queries over ‘recent history’
  - Only tuples currently entering the system
  - Stored in in-memory data tables for time-windowed joins between streams
6
Continuous Queries
- Long-running “standing queries”, similar to trigger systems
- Installed; continuously produce results until removed
- Lots of queries over the same data sources
  - Opportunity for work sharing!
  - Idea: adaptive heuristics
7
Eddies & Adaptivity
- Eddies (Avnur & Hellerstein, SIGMOD 2000): continuous adaptivity
- No static ordering of operators
- A routing policy dynamically orders operators on a per-tuple basis
- done and ready bits encode where a tuple has been and where it can go
8
Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
- Results & Experiments
9
CACQ Contributions
- Adaptivity
  - Policies for continuous queries
  - A single eddy for multiple queries
- Tuple lineage
  - In addition to ready and done bits, each tuple's output history is encoded in its queriesCompleted bits
  - Enables flexible sharing of operators between queries
- Grouped filter
  - Efficiently computes selections over multiple queries
- Join sharing through State Modules (SteMs)
10
Explication By Example
- First, an example with just one query and only selections
- Then, add multiple queries
- Then, (briefly) discuss joins
11
Eddies & CACQ: Single Query, Single Source

- Use ready bits to track what to do next (all 1's in the single-source case)
- Use done bits to track what has been done
  - A tuple can be output when all done bits are set
- The routing policy dynamically orders tuples

SELECT *
FROM R
WHERE R.a > 10 AND R.b < 15

[Diagram: an eddy routes tuples R1 (a=5, b=25) and R2 (a=15, b=0) between the selections (R.a > 10) and (R.b < 15), updating each tuple's Ready and Done bits as it goes.]
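The ready/done bookkeeping on this slide can be sketched in a few lines of Python (a hypothetical simplification, not the Telegraph/CACQ implementation; the operator table and the random routing choice are illustrative):

```python
# Minimal sketch of an eddy routing tuples through two selections,
# tracking per-tuple ready/done bitmaps. Names are illustrative.
import random

OPERATORS = [
    ("R.a > 10", lambda t: t["a"] > 10),
    ("R.b < 15", lambda t: t["b"] < 15),
]
ALL_DONE = (1 << len(OPERATORS)) - 1

def eddy(tuples):
    results = []
    for t in tuples:
        ready = ALL_DONE   # single source: all operators ready
        done = 0
        alive = True
        while alive and done != ALL_DONE:
            # routing policy: pick any ready, not-yet-done operator
            choices = [i for i in range(len(OPERATORS))
                       if ready & (1 << i) and not done & (1 << i)]
            i = random.choice(choices)
            _, pred = OPERATORS[i]
            if pred(t):
                done |= 1 << i      # predicate passed: set done bit
            else:
                alive = False       # predicate failed: discard tuple
        if alive:
            results.append(t)       # all done bits set: output
    return results
```

Running the two slide tuples through it, R1 (a=5, b=25) is discarded and R2 (a=15, b=0) is output, regardless of the order the policy picks.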
12
Multiple Queries

Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15
Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25
Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50

[Diagram: grouped filters over R.a (R.a > 10, R.a > 20, R.a = 0) and R.b (R.b < 15, R.b = 25, R.b <> 50); tuple R1 (a=5, b=25) flows through them, accumulating Done bits for the a and b filters and QueriesCompleted bits for Q1, Q2, and Q3.]
13
Multiple Queries

Q1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15
Q2: SELECT * FROM R WHERE R.a > 20 AND R.b = 25
Q3: SELECT * FROM R WHERE R.a = 0 AND R.b <> 50

[Diagram: the same grouped filters processing tuple R2 (a=15, b=0). Key point: the eddy can reorder the operators between tuples R1 and R2.]
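Evaluating one tuple against all three slide queries can be sketched as follows (an illustrative simplification; the predicate table and function name are made up, and the real system shares work across per-attribute grouped filters rather than evaluating each query independently):

```python
# Sketch: one tuple checked against the three queries from the slide.
# A query matches only if all of its predicates pass.
PREDICATES = {
    "Q1": [lambda t: t["a"] > 10, lambda t: t["b"] < 15],
    "Q2": [lambda t: t["a"] > 20, lambda t: t["b"] == 25],
    "Q3": [lambda t: t["a"] == 0, lambda t: t["b"] != 50],
}

def matching_queries(t):
    # returns the queries this tuple should be output to
    return [q for q, preds in PREDICATES.items()
            if all(p(t) for p in preds)]
```

For the slide's tuples: R1 (a=5, b=25) matches no query, while R2 (a=15, b=0) matches only Q1.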
14
Outputting Tuples
- Store a completionMask bitmap for each query
  - One bit per operator; set if the operator appears in the query
- To determine if a tuple t can be output to query q:
  - The eddy ANDs q's completionMask with t's done bits; the result must equal the mask
  - Output only if q's bit is not set in t's queriesCompleted bits
- This check runs every time a tuple returns from an operator

Example:

Query 1: SELECT * FROM R WHERE R.a > 10 AND R.b < 15
Query 2: SELECT * FROM R WHERE R.b < 15 AND R.c <> 5 AND R.d = 10

completionMasks (operators a, b, c, d):
  Q1: 1 1 0 0
  Q2: 0 1 1 1

A tuple with Done = 1100 satisfies Q1 (1100 & 1100 == 1100) but not Q2 (1100 & 0111 == 0100), and is output to Q1 only if Q1's bit in its QueriesCompleted is still 0.
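The output test above can be sketched directly with bitwise operations (names are illustrative; here bit i stands for operator i, with a, b, c, d mapped to bits 0-3):

```python
# Sketch of the output test: a tuple can be output to query q when
# (done & completionMask[q]) == completionMask[q] and q's bit in the
# tuple's queriesCompleted is still unset.
def try_output(done, queries_completed, completion_masks):
    """Return (query ids to output to, updated queriesCompleted)."""
    out = []
    for q, mask in completion_masks.items():
        if done & mask == mask and not queries_completed & (1 << q):
            out.append(q)
            queries_completed |= 1 << q   # never output twice to q
    return out, queries_completed

# Operators a, b, c, d = bits 0, 1, 2, 3
MASKS = {1: 0b0011,   # Q1: a AND b
         2: 0b1110}   # Q2: b AND c AND d
```

With Done covering a and b only, the tuple goes to Q1; once all four bits are set it additionally goes to Q2, and QueriesCompleted prevents a second output to Q1.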
15
Grouped Filter

- Use binary trees to efficiently index range predicates
  - Two trees (LT & GT) per attribute
  - Insert each query's constant into the appropriate tree
- When a tuple arrives:
  - Scan everything to the right (for GT) or to the left (for LT) of the tuple's attribute value in the tree
  - Those are the queries that the tuple does not pass
- Use hash tables to index equality and inequality predicates

[Diagram: a greater-than tree over S.a holding the constants 1, 7, and 11 for the predicates S.a > 1 (Q1), S.a > 7 (Q2), and S.a > 11 (Q3); for a tuple with S.a = 8, only S.a > 11 lies to its right and fails.]
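A sorted list can stand in for the greater-than tree to illustrate the scan (a sketch; the paper uses balanced binary trees, and the class and method names here are hypothetical):

```python
# Grouped filter for '>' predicates over one attribute: a sorted list
# of constants (with parallel query ids) stands in for the GT tree.
import bisect

class GTFilter:
    def __init__(self):
        self.consts = []   # sorted predicate constants
        self.qids = []     # parallel query ids

    def add(self, const, qid):
        # install the predicate "attr > const" for query qid
        i = bisect.bisect_left(self.consts, const)
        self.consts.insert(i, const)
        self.qids.insert(i, qid)

    def passing(self, value):
        # constants at or to the right of `value` belong to queries
        # the tuple does NOT pass (as on the slide); everything to
        # the left passes.
        i = bisect.bisect_left(self.consts, value)
        return set(self.qids[:i])

f = GTFilter()
f.add(1, "Q1"); f.add(7, "Q2"); f.add(11, "Q3")
```

For the slide's tuple with S.a = 8, a single scan finds that Q1 and Q2 pass while Q3 fails, instead of evaluating three predicates separately.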
16
Work Sharing via Tuple Lineage

Q1: SELECT * FROM s WHERE A, B, C
Q2: SELECT * FROM s WHERE A, B, D

[Diagram: plans for Q1 and Q2 over data stream S under three schemes: conventional per-query plans; shared subexpressions, where AB must be applied first and the intersection of C and D goes through AB an extra time; and CACQ, where tuple lineage (the QueriesCompleted bits) enables any operator ordering.]
17
Tradeoff: Overhead vs. Shared Work
- Overhead: additional bits per tuple
  - Experiments studying performance and size are in the paper
  - The bit-per-query-per-tuple cost is the most significant
- Trading accounting overhead for work sharing
  - 100 bits per tuple allows a tuple to be processed once, not 100 times
- Reduce overhead by not keeping state about operators a tuple will never pass through
18
Joins in CACQ
- Use symmetric hash join to avoid blocking
- Use State Modules (SteMs) to share storage between joins with a common base relation
- Details of the implementation impact and benefits are in the paper
- See Raman, UC Berkeley Ph.D. thesis, 2002.
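A symmetric hash join keeps a hash table per input and, on each arrival, builds into its own table and probes the other, so neither input blocks waiting for the other to finish. A minimal sketch (illustrative names; real SteMs also handle time windows and sharing across joins):

```python
# Sketch of a symmetric (pipelined) hash join on one key: every input
# tuple is inserted into its side's hash table, then probed against
# the opposite side's table, emitting matches immediately.
from collections import defaultdict

def symmetric_hash_join(stream):
    """stream yields (side, key, tuple) with side in {'L', 'R'}."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, tup in stream:
        other = "R" if side == "L" else "L"
        tables[side][key].append(tup)        # build into own table
        for match in tables[other][key]:     # probe the other side
            yield (tup, match) if side == "L" else (match, tup)
```

Interleaving arrivals from both sides produces join results as soon as both partners have been seen, which is what makes the operator suitable for streams.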
19
Routing Policies
- The mechanisms above provide correctness; the routing policy is responsible for performance
- The policy is consulted to determine where to route every tuple that:
  - Enters the system
  - Returns from an operator
- Basic ticket policy
  - Give operators tickets for consuming tuples; take tickets away for producing them
  - To choose the next operator to route to, run a lottery
  - Result: more selective operators are scheduled earlier
- Modification for CACQ
  - Give more tickets to operators shared by multiple queries (e.g., grouped filters)
  - When a shared operator outputs a tuple, charge it multiple tickets
  - Intuition: cardinality-reducing shared operators reduce global work more than unshared operators
  - Not optimizing for the throughput of a single query!
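The ticket policy can be sketched as a weighted lottery (a hypothetical simplification of the policy described above; the starting ticket bank and class name are made up):

```python
# Sketch of the ticket-based lottery routing policy: an operator
# gains a ticket when it consumes a tuple and is charged when it
# produces one, so selective operators accumulate tickets and win
# the lottery more often.
import random

class TicketPolicy:
    def __init__(self, operators, bank=100):
        self.tickets = {op: bank for op in operators}

    def choose(self, eligible):
        # lottery among eligible operators, weighted by tickets
        weights = [max(self.tickets[op], 1) for op in eligible]
        return random.choices(eligible, weights=weights)[0]

    def consumed(self, op):
        self.tickets[op] += 1   # credit one ticket per input tuple

    def produced(self, op, shared_queries=1):
        # a shared operator is charged once per query it serves,
        # reflecting the CACQ modification on the slide
        self.tickets[op] -= shared_queries
```

Over time, operators that drop most of their input keep a large ticket balance and are routed to first, approximating the selectivity ordering a static optimizer would pick.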
20
Outline
- Background
  - Motivation
  - Continuous Queries
  - Eddies
- CACQ
  - Contributions
  - Example-driven explanation
- Results & Experiments
21
Evaluation
- Real Java implementation on top of the Telegraph QP
  - 4,000 new lines of code in a 75,000-line codebase
- Server platform
  - Linux 2.4.10
  - Pentium III 733 MHz, 756 MB RAM
- Queries posed from a separate workstation; output suppressed
- Lots of experiments in the paper; just a few here
22
Results: Routing Policy
Five queries over stream S (all attributes uniformly distributed over [0, 100]):

Q1: SELECT index FROM S WHERE a > 90
Q2: SELECT index FROM S WHERE a > 90 AND b > 70
Q3: SELECT index FROM S WHERE a > 90 AND b > 70 AND c > 50
Q4: SELECT index FROM S WHERE a > 90 AND b > 70 AND c > 50 AND d > 30
Q5: SELECT index FROM S WHERE a > 90 AND b > 70 AND c > 50 AND d > 30 AND e > 10

Average filters applied per tuple, by routing scheme:

  Optimal      1.15
  Worst case   5
  Random       3.1
  Tickets      1.37

The ticket policy comes close to the optimal ordering.
23
CACQ vs. NiagaraCQ*
- Performance is competitive on the workload from the NiagaraCQ paper
- A different workload where CACQ outperforms NiagaraCQ (the UDF is expensive, and |result| > |stocks|):

SELECT stocks.sym, articles.text
FROM stocks, articles
WHERE stocks.sym = articles.sym AND UDF(stocks)

[Diagram: plans for three such queries, each joining Stocks and Articles on sym and applying an expensive UDF(stocks).]

*See Chen et al., SIGMOD 2000, ICDE 2002
24
CACQ vs. NiagaraCQ #2
- Niagara Option #1: share the join (Stocks joined with Articles on sym) across Queries 1-3 and apply UDF1, UDF2, UDF3 above it; the expensive join result (|result| > |stocks|) feeds every UDF
- Niagara Option #2: apply each UDF below its own copy of the join; no shared subexpressions, so no shared work!
- CACQ: tuple lineage allows the join to be applied just once, shared across all three queries

[Diagram: the three plan alternatives for Queries 1-3 over the Stocks and Articles streams.]
25
CACQ vs. NiagaraCQ Graph
[Graph: NiagaraCQ (Option 1) vs. CACQ, 100 queries, 2250 stocks; running time in seconds (1 to 10,000, log scale) vs. |Result| / |Stocks| (1 to 100).]
26
Conclusion
- CACQ: sharing and adaptivity for high-performance monitoring queries over data streams
- Features
  - Adaptivity: adapt to a changing query workload without costly multi-query reoptimization
  - Work sharing via tuple lineage, without constraining the available plans
  - Computation sharing via grouped filters
  - Storage sharing via SteMs
- Future work
  - More sophisticated routing policies
  - Batching & query grouping
  - Better integration with historical results (Chandrasekaran, VLDB 2002)