23/4/19 CIKM'04 1
Evaluating Window Joins over Punctuated Streams
Many slides taken from a talk by Luping Ding and Elke A. Rundensteiner, CIKM04
Database Systems Research Group, Worcester Polytechnic Institute
Stream Data Processing
[Figure: continuous queries are registered with a Stream Query Engine, which takes streaming data as input and produces streaming results.]
• Network Usage Analysis
• Online Transaction Management
• Sensor Network Monitoring
• Online Auction
New Challenges in Stream Context
• Potentially infinite data streams vs. stateful operators, e.g., join, distinct, …
• Problem: potentially unbounded state
• Reason: no hint on which data is no longer useful
Example - Symmetric Hash Join [WA93]
[Figure: streams A and B with join states SA and SB; labels: insert, probe, Memory, Memory Overflow.]
• Memory overflow resolution – state relocation
• Example: XJoin [UF00], Hash-Merge Join [MLA04]
• Problems
  • Join state still grows with no bound
  • Delivery of some join results may be highly deferred
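A minimal Python sketch of the symmetric hash join idea, assuming a single join key and purely in-memory state. The class and method names are illustrative and do not come from the slides. Each arriving tuple is inserted into its own side's state and probes the opposite state; nothing is ever removed, which is why the state grows without bound.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Illustrative in-memory symmetric hash join on one join key."""

    def __init__(self):
        # One hash table per input stream: join key -> list of payloads.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def on_tuple(self, side, key, payload):
        """Insert the tuple into its own state, then probe the other state."""
        other = "B" if side == "A" else "A"
        self.state[side][key].append(payload)
        matches = self.state[other].get(key, [])
        # Results are always emitted as (key, payload_from_A, payload_from_B).
        if side == "A":
            return [(key, payload, m) for m in matches]
        return [(key, m, payload) for m in matches]

    def state_size(self):
        """Total number of tuples held in both states."""
        return sum(len(v) for table in self.state.values() for v in table.values())
```

Without a constraint such as a window or a punctuation, `state_size()` can only increase as tuples arrive.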
Avoiding Unbounded State
Solution: exploit constraints to detect no-longer-useful data
• Sliding window [MWA+03]: identifies a bounded set of input data based on time
• K-constraint [BW04]: models clustered or ordered data arrival patterns
• Punctuation [TMS+03]: dynamically announces the termination of certain values
Sliding Window [KNV03]
[Figure: Stream A and Stream B on a shared timeline, with sliding windows Wa and Wb.]
Punctuation
• Meta-knowledge embedded inside data streams
• An ordered set of patterns corresponding to attributes of tuples
  • Wildcard (*), constant (9), list ({1,2,3}), range ([1, 20]), empty ()
• Semantics: tuples after a punctuation p will NOT match p
Example on the Bid stream (the punctuation "180 * * *" means that no more tuples will contain Item_id 180):
180 Marlie 820.00 Nov-13-03 11:02:00
182 Ultrasale 1000.00 Nov-13-03 11:05:00
180 Jocelyn 850.00 Nov-13-03 11:14:00
180 * * *
…
181 pcfan 50.00 Nov-13-03 11:36:00
…
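The pattern kinds listed on this slide can be sketched in Python as follows. The encoding is an assumption made for illustration: "*" is the wildcard, a set is a list pattern, a 2-tuple is an inclusive range, None is the empty pattern, and anything else is a constant. A punctuation is written as a list of patterns, one per attribute.

```python
WILDCARD = "*"


def pattern_matches(pattern, value):
    """Check one attribute value against one pattern (hypothetical encoding)."""
    if pattern is None:                      # empty pattern matches nothing
        return False
    if isinstance(pattern, (set, frozenset)):  # list pattern, e.g. {1, 2, 3}
        return value in pattern
    if isinstance(pattern, tuple):           # inclusive range, e.g. (1, 20)
        low, high = pattern
        return low <= value <= high
    if pattern == WILDCARD:                  # wildcard matches any value
        return True
    return pattern == value                  # constant pattern


def punctuation_matches(punctuation, tup):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(punctuation) == len(tup) and all(
        pattern_matches(p, v) for p, v in zip(punctuation, tup)
    )
```

With the Bid example, `punctuation_matches([180, "*", "*", "*"], (180, "Jocelyn", 850.00, "Nov-13-03 11:14:00"))` holds, so no such tuple may arrive after the punctuation.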
Punctuation-Aware Join [DMR+04]
[Figure: join on item_id between Stream A and Stream B, with join states SA and SB holding tuples such as (175, 80.00), (175, 100.00) and (175, 20.00). A punctuation "175 *" announces: no more tuples will have A = 175.]
Features of Punctuation
Purge rule. For any tuple ta from stream A, if there exists a punctuation Pb that has already been received from stream B such that match(ta, Pb), ta will not join with any future arriving tuples from stream B, so ta doesn’t need to be maintained in the A state after being processed.
Propagation rule. The join operator can also propagate punctuations to the output stream in order to help downstream operators.
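A small Python sketch of the purge rule, under the simplifying assumption that punctuations are constant patterns on the join attribute, so the punctuations received from stream B can be kept as a set of join values. The function names and sample values are hypothetical.

```python
def purge_state(state, punctuated_values):
    """Keep only the stored tuples whose join value (first field) has not
    been punctuated by the opposite stream."""
    return [t for t in state if t[0] not in punctuated_values]


def should_store(tup, punctuated_values):
    """A newly processed tuple needs to stay in state only if the opposite
    stream may still deliver matching tuples."""
    return tup[0] not in punctuated_values
```

For example, with an A state of `[(175, 80.0), (180, 135.0), (175, 100.0)]` and a punctuation on 175 from stream B, only `(180, 135.0)` has to be kept.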
Based on punctuation semantics, we derive the following theorem as the foundation of our punctuation propagation algorithm.
Theorem 3.1. Let pa and pb be punctuations retrieved from streams A and B at time TSa and TSb respectively, specifying the same punctuated value val of join attribute att. Then no output tuples with val being the value of attribute att will be generated after time max(TSa, TSb).
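Theorem 3.1 can be read as a rule for when a punctuation on a join value is safe to emit. The sketch below assumes each side's punctuations are recorded as a dictionary from punctuated join value to arrival timestamp; this bookkeeping is an illustration, not the data structure used by the authors.

```python
def propagation_time(punct_a, punct_b, value):
    """Return the earliest time after which no output tuple with this join
    value can be produced, or None if one side has not punctuated it yet.

    punct_a / punct_b: dict mapping punctuated join value -> arrival timestamp.
    """
    if value in punct_a and value in punct_b:
        return max(punct_a[value], punct_b[value])
    return None
```

If stream A punctuates value 175 at time 5 and stream B at time 8, the output punctuation on 175 can be emitted from time 8 onwards.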
Sliding Window Join
Suppose Ta and Tb are time windows for streams A and B respectively. We define the invalidation rule for the join state based on the sliding window:
Let tuple ta be the latest tuple with timestamp TSa from stream A that has been processed. A tuple in the B state with timestamp TSb such that TSb + Tb < TSa is called a time-expired tuple and can be invalidated. The same invalidation rule applies to tuples in the A state.
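The invalidation rule can be sketched in Python, assuming the B state is a deque of (timestamp, payload) pairs kept in arrival order with non-decreasing timestamps. The function name and the sample data are illustrative.

```python
from collections import deque


def invalidate(state, latest_ts, window):
    """Remove and return time-expired tuples from the head of the state.

    A stored tuple with timestamp ts is expired when ts + window < latest_ts,
    where latest_ts is the timestamp of the newest tuple of the other stream.
    """
    expired = []
    while state and state[0][0] + window < latest_ts:
        expired.append(state.popleft())
    return expired
```

With a B state holding timestamps 1, 4 and 9, a window Tb = 5 and a new A tuple at TSa = 10, the tuples at 1 and 4 are expired (1 + 5 < 10 and 4 + 5 < 10), while the tuple at 9 stays.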
Basic Window Join
[Figure: timeline for Stream A and Stream B showing the windows Ta and Tb, the timestamps TSa and TSb, and the window boundaries TSa-Tb and TSb-Ta.]
Optimization Opportunities
• Maintain smaller state than either pure window join or pure punctuation-exploiting join
  • Bid tuples that have been joined don’t need to be maintained in state (Punctuation)
• Drop tuples without affecting precision of result
  • Bid tuples out of 24-hour window of corresponding Auction tuple don’t need to be processed
• Aggregate result for some Auction tuples can be produced in less than 24 hours
Features of PWJoin algorithm
Punctuation-exploiting Window Join is composed of three operations:
• Probing state to find matching tuples for producing join results.
• Purging no-longer-joining tuples by punctuations.
• Invalidating expired tuples by windows.
Among these operations.
Window and Punctuation Occur Simultaneously
SELECT A.item_id, Count(*)
FROM Auction [Range 24 Hours] A, Bid B
WHERE A.item_id = B.item_id
GROUP BY A.item_id
[Figure: query plan. Auction Stream and Bid Stream feed Join (item_id), producing Out1 (item_id); Group-by item_id (count(*)) then produces Out2 (item_id, count). Annotations: "Contains punctuations on item_id" and "Applies a 24-hour window on Auction stream".]
PWJoin Basics and Issue
• On a new tuple ta from stream A: invalidate tuples from B state, probe B state, insert ta into A state
• On a new punctuation pa from stream A: purge tuples from B state, insert pa into A state
Issue: how to design PWJoin state to facilitate all search-based operations?
• Invalidate conducts time-based search
• Probe and Purge need value-based search
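PWJoin State with Two-dimensional Index
[Figure: PWJoin state. The I-Node Index (hash table) holds I-Nodes with the fields Key, Head, Tail and PunctFlag (example entries: key 8 with flag "none", key 10 with flag "punctuated"). Each T-Node holds a tuple plus the pointers NextTimeListTNode and NextValueListTNode. The Time List runs from WindowBegin to WindowEnd. The Punctuation Time List holds (Punctuation, Timestamp) pairs: (p1, T1), (p2, T2), …]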
PWJoin Algorithm
Invalidate: Once a new tuple t is retrieved from stream A, its timestamp is used to invalidate expired tuples from the head of the time list of stream B.
Probe: probe I-Node index and join with tuples in value list of matching I-Node. After invalidation is done, the join value of t is used to probe the I-Node index of the B state. If the matching I-Node iNode is found, the corresponding value list is located by following the Head pointer of iNode. Tuple t then joins with all tuples in this value list by following the NextValueListTNode pointer of each T-Node. Finally, the PunctFlag of iNode is checked. If it is “punctuated”, t is discarded. If it is “none”, t is inserted into the A state.
PWJoin Algorithm
Purge: probe I-Node index and delete tuples in value list of matching I-Node. When a new punctuation p is retrieved from stream A, p is used to probe the I-Node index of the B state. If the matching I-Node iNode is found, all tuples in the corresponding value list are deleted. iNode is removed from the I-Node index as well. If the PunctFlag of iNode is “punctuated”, p is discarded. If iNode is not found or iNode’s PunctFlag is “none”, p is used to probe the I-Node index of the A state and set the PunctFlag of the matching I-Node iNodea to “punctuated”. If iNodea does not exist, a new I-Node is created with its PunctFlag set to “punctuated” and inserted into the I-Node index of the A state.
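The invalidate, probe and purge steps can be put together in a simplified Python sketch. It is not the authors' implementation: the I-Node index is approximated by a dictionary, the PunctFlag by a set of punctuated join values, and the pointer-linked time list by a deque, so purge rebuilds the time list instead of unlinking T-Nodes. Punctuations are assumed to be constants on the join attribute, and all names are hypothetical.

```python
from collections import deque


class SideState:
    """State of one input stream: value index, time list, punctuation flags."""

    def __init__(self, window):
        self.window = window      # time window applied to this stream
        self.by_value = {}        # join value -> [(ts, payload)], oldest first
        self.time_list = deque()  # (ts, join value, payload), oldest first
        self.punctuated = set()   # join values punctuated by this stream

    def insert(self, ts, key, payload):
        self.by_value.setdefault(key, []).append((ts, payload))
        self.time_list.append((ts, key, payload))

    def invalidate(self, latest_ts):
        """Drop time-expired tuples (ts + window < latest_ts) from the head."""
        while self.time_list and self.time_list[0][0] + self.window < latest_ts:
            _, key, _ = self.time_list.popleft()
            bucket = self.by_value[key]
            bucket.pop(0)  # the oldest tuple of this value is the same tuple
            if not bucket:
                del self.by_value[key]

    def purge(self, key):
        """Delete every stored tuple with the given join value."""
        if self.by_value.pop(key, None) is not None:
            self.time_list = deque(e for e in self.time_list if e[1] != key)


class PWJoin:
    def __init__(self, window_a, window_b):
        self.states = {"A": SideState(window_a), "B": SideState(window_b)}

    def _pair(self, side):
        other = "B" if side == "A" else "A"
        return self.states[side], self.states[other]

    def on_tuple(self, side, ts, key, payload):
        own, other = self._pair(side)
        other.invalidate(ts)                      # 1. invalidate
        matches = other.by_value.get(key, [])     # 2. probe
        results = [(key, payload, m) if side == "A" else (key, m, payload)
                   for _, m in matches]
        if key not in other.punctuated:           # 3. insert unless punctuated
            own.insert(ts, key, payload)
        return results                            # (key, A payload, B payload)

    def on_punctuation(self, side, key):
        own, other = self._pair(side)
        other.purge(key)                          # purge the opposite state
        if key not in other.punctuated:           # otherwise p is discarded
            own.punctuated.add(key)
```

In this sketch a punctuation from A on value 175 empties the matching value list in the B state, and later B tuples with value 175 still join with stored A tuples but are not stored themselves.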
Punctuation Propagation [CIKM04]
An operator may propagate punctuations to benefit downstream operators
[Figure: Auction Stream and Bid Stream feed Join (item_id), which propagates punctuations on item_id, e.g. (Item_id, Bidder_id, Bid_price) = (180, *, *). The downstream Group-by item_id (count(*)) can be unblocked by punctuations propagated by the join operator.]
Optimizations Enabled by Combined Constraints
[Figure, panel "Early Punctuation Propagation": tuples of Stream S1 and Stream S2 with join values a1 … a10, with a3 highlighted and two marked positions, propagation point 1 and propagation point 2.]
[Figure, panel "Tuple Dropping": the same Stream S1 and Stream S2 tuples, with a3 highlighted.]
Achieving Optimizations by Combined Constraints
• Early propagation
  • Invalidate punctuations in the punctuation time list in the same way as tuples are invalidated
  • Expired punctuations can be propagated
• Tuple dropping
  • When early propagation happens, set PunctFlag of matching I-Node as “propagated”
  • Drop new tuples that match an I-Node whose PunctFlag is “propagated”
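Memory Cost Analysis
$|S_b|_T = |S_b|_T^{\text{insert}} - |S_b|_T^{\text{purge}} = |S_b|_T^{\text{arrive}} - |S_b|_T^{\text{purge}} = b\,T_b - b\,T_b\,(p_a T / NK_{b,T})$
The first term, $b\,T_b$, is the window join state; the second term, $b\,T_b\,(p_a T / NK_{b,T})$, is the saving by punctuation.
$b$ – tuple input rate of stream B
$p_a$ – punctuation input rate of stream A
$NK_{b,T}$ – number of distinct join values that occurred in stream B up to the T’th time unit
$T_b$ – time window on stream B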
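The cost formula above can be evaluated numerically with a short Python helper. The function name and the input values are made up for illustration; they are not measurements from the talk.

```python
def pwjoin_state_size(b, t_b, p_a, t, nk_bt):
    """Estimated B-state size at time unit t, following the slide's formula.

    b      tuple input rate of stream B
    t_b    time window on stream B
    p_a    punctuation input rate of stream A
    t      current time unit T
    nk_bt  number of distinct join values seen in stream B up to T
    """
    window_join_state = b * t_b
    saving_by_punctuation = window_join_state * (p_a * t / nk_bt)
    return window_join_state - saving_by_punctuation


# Hypothetical inputs: 100 tuples per time unit, window of 10 time units,
# 2 punctuations per time unit, 80 distinct join values after 20 time units.
size = pwjoin_state_size(b=100, t_b=10, p_a=2, t=20, nk_bt=80)
```

With these assumed inputs the window join alone would hold 1000 tuples, and half of the 80 join values have been punctuated, so the estimate drops to 500 tuples.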
PWJoin vs. WJoin – Memory and Tuple Output Rate
[Figure: number of tuples in join state (0 to 2500) vs. sampling step (per 2 seconds), for WJoin-1, PWJoin-1, WJoin-5, PWJoin-5, WJoin-15 and PWJoin-15.]
[Figure: number of tuples output (0 to 600000) vs. sampling step (per 2 seconds), for WJoin-5, PWJoin-5, WJoin-15 and PWJoin-15.]
Stream A, B: punct-asc-100-40
PWJoin vs. PJoin – Punctuation Output Rate
[Figure: number of punctuations output (0 to 500) vs. sampling step (per 1 second), for PJoin and PWJoin.]
Stream A: punct-asc-100-40, Stream B: punct-random-30-40. Window: 1 second
Conclusion
• PWJoin algorithm
• Designed storage structure for PWJoin state
• Memory cost analysis of PWJoin
Thanks
WPI Database Research Group
many slides are from davis.wpi.edu/~dsrg/CAPE/slides
References
[CIKM04] L. Ding and E. A. Rundensteiner. Evaluating Window Joins over Punctuated Streams. CIKM’04.
[KNV03] J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE’03.
[UF00] T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
[HH99] P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD’99.
[GO03] L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB’03.
[GGO04] L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT’04.
[RDS+04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
[BW04] S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
[TMS+03] P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
[DMR+04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT’04.
[MWA+03] R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR’03.