BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos
SIGMOD 2005
Outline
Introduction
Proposed method
Experiments
Conclusions
Introduction
Data streams
Lag correlations:
For example: higher amounts of fluoride in the water → fewer dental cavities some years later
Goal: monitor multiple numerical streams and determine which pairs are correlated with a lag, and the value of that lag
Introduction
Given k numerical sequences X1, …, Xk, report all pairs Xi and Xj such that Xi follows Xj with lag l
Introduction
This paper proposes BRAID, which handles data streams and is:
Any-time and fast
Nimble
Accurate
Small in resource consumption
Proposed method
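Data stream X: {x1, …, xt, …, xn}, where xn is the most recent value
R(0): correlation coefficient of X and Y, which have the same length n, at zero lag
Pearson ρ coefficient:
$$\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the means of X and Y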
Proposed method
For lag l, consider the common (overlapping) part of X and the shifted Y
Proposed method
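R(l): correlation coefficient when X is delayed by l, computed over the common part:
$$R(l) = \frac{\sum_{t=l+1}^{n} (x_t - \bar{x})(y_{t-l} - \bar{y})}{\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ are the means over the overlapping parts
Score at lag l: score(l) = |R(l)|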
Proposed method
For large lags l ≈ n, the original and the shifted time sequences have too few overlapping points
Restrict the maximum lag m to n/2
Proposed method
Naive solution: at time n, access all values of X and Y,
compute R(l) for every lag l (= 0, 1, …), and choose the earliest maximum score above the threshold r, or
report no lag
The proposed solution is based on three major steps
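The naive solution above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, the score at lag l is taken to be |R(l)|, the maximum lag is n/2, and a constant overlap is treated as uncorrelated (an assumption).

```python
import math

def lag_correlation(x, y, lag):
    """Pearson correlation between x[lag:] and y[:n-lag] (X delayed by `lag`)."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]          # common part of X and the shifted Y
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0                         # assumption: constant overlap counts as uncorrelated
    return cov / math.sqrt(vx * vy)

def naive_lag(x, y, r):
    """Return the earliest lag whose score is maximal and above r, or None."""
    m = len(x) // 2                        # maximum lag restricted to n/2
    best_lag, best_score = None, r
    for lag in range(m + 1):
        score = abs(lag_correlation(x, y, lag))
        if score > best_score:             # strict '>' keeps the earliest maximum
            best_lag, best_score = lag, score
    return best_lag
```

Every call rescans both sequences for every lag, which is the per-tick cost that the sufficient statistics on the following slides are meant to avoid.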
Proposed method
Some sufficient statistics are needed so that R can be computed easily:
Sx(1,n) = $\sum_{t=1}^{n} x_t$: sum of X of length n
Sxx(1,n) = $\sum_{t=1}^{n} x_t^2$: sum of squares of X of length n
Sxy(l) = $\sum_{t=l+1}^{n} x_t y_{t-l}$: inner product of X and the shifted Y
Proposed method
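R(l) is obtained from the sufficient statistics:
$$R(l) = \frac{S_{xy}(l) - \frac{S_x(l+1,n)\,S_y(1,n-l)}{n-l}}{\sqrt{S_{xx}(l+1,n) - \frac{S_x(l+1,n)^2}{n-l}}\;\sqrt{S_{yy}(1,n-l) - \frac{S_y(1,n-l)^2}{n-l}}}$$
where $S_x(a,b)$ denotes the sum of $x_t$ for t = a, …, b (likewise for $S_y$, $S_{xx}$ and $S_{yy}$)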
Proposed method
R(l) can be estimated at any point in time; only five sufficient statistics need to be tracked
However, it still takes linear time to compute the cross-correlation function between two sequences
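As a rough sketch of how the five statistics can be maintained incrementally for one fixed lag, the hypothetical class below updates the sums for X, Y, their squares and their lagged product at each tick and derives R(l) from them. The class name and the delay buffer are assumptions made for illustration; the buffer holds the last l values of Y, which is exactly the space cost the later slides set out to reduce.

```python
import math
from collections import deque

class LagStats:
    """Running sufficient statistics for one fixed lag (X delayed by `lag`)."""

    def __init__(self, lag):
        self.lag = lag
        self.delay = deque()   # values of Y that have not been paired yet
        self.k = 0             # number of overlapping pairs, n - lag
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        """Consume the values x_n and y_n of the current tick."""
        self.delay.append(y)
        if len(self.delay) > self.lag:
            y_old = self.delay.popleft()   # this is y_{n-lag}
            self.k += 1
            self.sx += x
            self.sxx += x * x
            self.sy += y_old
            self.syy += y_old * y_old
            self.sxy += x * y_old

    def correlation(self):
        """R(lag) from the five sums, or None if it is not defined yet."""
        k = self.k
        if k < 2:
            return None
        vx = self.sxx - self.sx ** 2 / k
        vy = self.syy - self.sy ** 2 / k
        if vx <= 0 or vy <= 0:
            return None
        return (self.sxy - self.sx * self.sy / k) / math.sqrt(vx * vy)
```

Each update costs constant time per tracked lag, so the coefficient for that lag is available at any tick without rescanning the sequences.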
Proposed method
Proposal: keep track of only a geometric progression of lag values: l = 0, 1, 2, 4, …, 2^i, …
Only O(log n) lags need to be tracked, instead of the O(n) that the naive solution requires
However, the space required still grows linearly with the length n
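A minimal illustration of the geometric progression of lags; the function name is hypothetical, and m is the maximum lag.

```python
def geometric_lags(m):
    """Lags 0, 1, 2, 4, ... that do not exceed the maximum lag m."""
    lags = [0]
    step = 1
    while step <= m:
        lags.append(step)
        step *= 2
    return lags

print(geometric_lags(20))  # [0, 1, 2, 4, 8, 16]
```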
Proposed method
To compute R(l) at any time, a sliding window of size l must be kept; with m = n/2 this needs O(n) space
Instead of operating only on the original time sequences, we also compute smoothed versions of them, by computing the means of non-overlapping windows
Proposed method
Window size: powers of g = 2
X: original time sequence
A_x^h: smoothed version of X with windows of length 2^h
A_x^0: the original sequence; A_x^1: consists of n/2 ticks; etc.
The sufficient statistics of A_x^h need to be computed only every 2^h time ticks
At time n, O(log n) levels are needed; the sufficient statistics are computed for each level
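The multi-level smoothing can be sketched as a streaming routine in which a level-(h+1) window mean is the average of two consecutive level-h means, so level h produces a new value only every 2^h ticks. The function name and the `pending` list are hypothetical, chosen for illustration.

```python
def push_tick(pending, value):
    """Feed one raw value; return the (level, window_mean) pairs completed at this tick.

    pending[h] holds a level-h mean that is still waiting for its neighbour.
    """
    completed = [(0, value)]                 # level 0 is the original sequence
    h = 0
    while h < len(pending):
        if pending[h] is None:               # first half of the next window: wait
            pending[h] = value
            break
        value = (pending[h] + value) / 2.0   # merge two level-h means
        pending[h] = None
        h += 1
        completed.append((h, value))
    return completed

pending = [None, None]                       # supports levels 0, 1 and 2
for v in [1.0, 3.0, 5.0, 7.0]:
    print(push_tick(pending, v))
# the fourth tick completes levels 0, 1 and 2: [(0, 7.0), (1, 6.0), (2, 4.0)]
```

Only one pending value per level is stored, so the state for the averages themselves is logarithmic in the number of ticks covered.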
Proposed method
In contrast with small lags, the larger lags are sparse
Use a cubic spline to interpolate the missing correlation coefficients
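One way to picture the interpolation step is a natural cubic spline through the coefficients known at the tracked lags. The slides say only "cubic spline", so the natural boundary conditions (zero second derivative at both ends) and the function name are assumptions of this sketch.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function interpolating (xs, ys); xs must be strictly increasing."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    m2 = [0.0] * n                  # second derivatives at the knots; the ends stay 0
    cp = [0.0] * n                  # work arrays for the tridiagonal solve
    dp = [0.0] * n
    for i in range(1, n - 1):       # forward sweep over the interior knots
        rhs = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
        denom = 2.0 * (h[i - 1] + h[i]) - h[i - 1] * cp[i - 1]
        cp[i] = h[i] / denom
        dp[i] = (rhs - h[i - 1] * dp[i - 1]) / denom
    for i in range(n - 2, 0, -1):   # back substitution
        m2[i] = dp[i] - cp[i] * m2[i + 1]

    def evaluate(x):
        # locate the interval [xs[i], xs[i+1]] containing x (clamped at the ends)
        i = min(max(bisect_right(xs, x) - 1, 0), n - 2)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m2[i] * a ** 3 + m2[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m2[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m2[i + 1] * h[i] / 6.0) * b)

    return evaluate
```

For example, with coefficients known at lags 0, 1, 2, 4, 8 and 16, calling the returned function at lag 3 or lag 6 gives an estimate for a lag that is not tracked directly.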
Proposed method
A_x^h(t): window average at time tick t for level h
A_x^0(t) ≡ x_t
Proposed method
Sufficient statistics:
Experiments
Conclusion
BRAID is proposed to detect lag correlations on streaming data:
At any time
Low resource consumption
High accuracy