Upload
christopher-herring
View
230
Download
0
Embed Size (px)
Citation preview
7/23/2019 linear regression.pdf
1/4
International Conference on Methods and Models in Computer Science, 2009
A LINEAR Regression-Based Frequent Itemset
Forecast Algorithm for Stream DATA
Sonali Shukla , Sushil Kumar , Bhupendra Verma
3
TIT, Bhopal,
e-mail:}[email protected]
b k _ v e r m a @ r e d i f f m a i l c o m
Abstract-
Data mining deals with extracting or mining
knowledge from large and infinite amount of stream data.
It also handles the data quality with limited volume of disk
or memory. In such traditional transaction environment it
is impossible to perform frequent items mining because it
requires analyzing which item is a frequent one to
continuously incoming stream data and which is probable
to become a frequent item. This paper proposes a way to
predict frequent items using regression model to the
continuously incoming real time stream data. By
establishing the regression model from the stream data, it
may be used for prediction of uncertain items.
After gathering real-time stream data through sliding
window, algorithm FIM-2DS computes support for
appointed sequence and describes linear equation to
forecast sequence trends in the future.
Keywords: Frequent items; stream data; linear
regression; sliding window.
I. INTRODUCTION
S
ream data is generated continuously in a dynamic
environment, with huge volume, infmite flow, and
fast changing behavior.
In recent years, many time-series mining problems
have been explored for a data stream environment.
Because sequence forecast is an import subject
of
those,
some methods were designed to develop sequence trend
analysis in a dynamic environment. Yixin Chen etc. [9]
investigated methods for multi-dimensional regression
analysis of
time-series stream data; they built up a
partially materialized data cube model and took an
exception-guided regression approach to analysis stream
data. Wei-Guang Teng etc. [10] devised a FTP-DS
algorithm to mine frequent temporal patterns of data
stream. The FTP-DS algorithm uses linear regression to
perform trend detection, it s an effective method, but it
always omits some exceptions, which are too pivotal to
lead to failure.
In this paper, we present a FIM-2DS algorithm that
use plan regression approach to ameliorate FTP-DS
algorithm, so the sequence trends can cover those key
exceptions. Our experiment on comparing between the
FTP-DS algorithm and the FIM-2DS algorithm shows
that coverage ofFIM-2DS algorithm ismuch wider.
The rest
of
the paper is organized as follows. In next
Section, we introduce the basic concepts of sequence
forecast and concerned problem, Section 1 describes
FIPM algorithm with example, and Section 2 describes
the FTP-DS algorithm.
In
Section 3, we present the
algorithm
of
FIM-2DS. Section 4 shows our
experiments and performance analysis and our study
and future research are concluded in Section 5 and 6
respectively.
II .
RELATED CONCEPTS
In this section, we introduce the basic concepts
of
sequence forecast in data stream and give a brief look at
FIPM algorithm and FTP-DS algorithm.
Sequence in data stream is essentially a sequence of
sets of events, which conform to the given timing
constraints [11]. As an example, the sequential pattern
(A)(C, B)(D), encodes an interesting fact that event D
occurs after an event-set(C, B), which in tum occurs
after event A. If the support
of
sequence is no less than
the threshold, this sequence is called frequent sequence.
Sequence forecast plays an important role in gene
analysis, fmance forecast, network security, web
information extraction, communication statistic, and so
on.
1.FREQUENT ITEMS PREDICTION METHOD
We explain the frequent items prediction method [8],
called
FIPM. The method uses a regression analysis model
in one dimensional stream data and predicts frequent
items by using the estimated regression model. First,
this method preprocesses one dimensional stream data
and transforms it to a form of sampling value for further
regression. When a sampling value is generated, a
regression model is found using the method of least
squares, a way to estimate regression model.
A. Data Preprocessing
The purpose
of
regression analysis is to fmd an
estimator in the future from observed values as samples.
In other words,
if
n number
of
sample values were
obtained, an estimator, yn+1 would be predicted based
on the input, xn+1 upon the analysis of the samples.
7/23/2019 linear regression.pdf
2/4
Therefore, it is necessary to transform the incoming
stream data into a data pair, such as (xi, yi). Stream data
has properties of time series data. As seen in the Figure
1, stream data comes in continuously as inputs with time,
t.
The stream data preprocessing is as follows: first,
each inputted value in stream data is reorganized into
stream data at the incoming time of each inputted value.
We first reorganize times of each value to
perform
preprocessing operation.
From the reorganized stream data we can find the
difference in time for each item. For example in the
value A, the value, A is entered in the stream data at t
IS, t-13, t-IO, t-S, t-4 , and t respectively. From the
example of stream data in the Figure I , it is unknown
when the value, A entered before the time
oft-IS
, but it
is possible to see the differences in time for the input
of
A from the time oft-IS to the present time, t: {2, 3, 2 , 4,
4}. Such differences in time can be sequentially paired
as follows : (2, 3), (3, 2), (2, 4), (4, 4). These pairs
indicate the differences in time before and after the
value , A is entered in the stream data . ( As given in
Table 2).
U A
B
A r
E c
A
E D
r
D E
A
I I I I
I
I I I I
I
ri
1
16
1ll
I J 11) Ig
>1
1-6 ,
l-l I)
r.[
r
time. In other words, previously appeared time intervals
can be set as an independent variable and the time
appeared intervals later as a dependent variable, we may
be able to estimate the linear regression model that we
want.
Putting the values of Dependent and Independent
variable in linear regression model, we will get the
regression fit line.
2. A BRIEF LOOK AT FPT-DS ALGORITHM
Algorithm FTP-DS [10] was presented by Wei-Guang
Teng on 29
th
VLDB conferences at 2003. There were
two key defmitions for FTP-DS algorithm. One is the
support of frequent sequence; the other is the pattern
representation, which called a compact ATF. We will
give a brieflook at those in below.
The support of frequent sequence X at a specific time
t is denoted by the ratio
of
the number
of
events having
sequence
X
in the current sliding window to total
number of events. Supposing the sliding window size is
3, there are 3 sliding windows in Figurel , i.e., [0,3],
[1,4], [2,5], the support of sequence is computed
as: in [0,3], the is in event {ID:2, ID:5}, so its
support is 2/5=0.4; in [1,4], the is in event {ID:I ,
ID:4}, so its support is 2/5=0.4; in [2,5], the is in
event {ID:I}, so its support is 1/5=0.2.
TO
St r{ am Sequence
J
1
(b)
(d ) rJ 1
2
(u , b. c (C. dl (d, f
1
3
(b)
(f.')
J
4
a
(b . eJ Cd)
I
5
(b) c . tI)
CI )
.t
aap
l 2 J
,I
Fig. 1. Exampleof onedimensionstreamdata
Calculate the values
of
Dependent variable and
Independent variable depend on the following model
using pairs of Data Sets:
TABLE I. EXAMPLEOFTHEORDERED PAIR,(X,V),
FORTHEVALUE, A
t
i mcs
input time
1 2 3 4 5
6
7 8 9
sequence
independent
2
3
2
4 4 3
5 7
6
variable(x)
dependent
3 2 4 4 3 5 7 6 5
variable(y)
TABLE2. PROCESS OFESTIMATINGREGRESSION
FORMULA BETWEEN ANINDEPENDENTVARIABLE
Fig. 2.Anexampleof supportcomputation
The ATP
form
of time series was defmed as
(ts, Df, f,
It2
and FPT-DS algorithm extracted a regression fit line:
f = a ~ t
Whereo, has to be calculated using some formulae.
Seq.
V....>
7/23/2019 linear regression.pdf
3/4
A
b
o
= j b
l
i
A
Ix
,r-
.1\ I (x -
i f
V
- \ .)
l;
= t J I .,-- = I
J .
,
-- (
-) '
L ;; - X z: .\ - x -
Step5: Fit the regression model ofFlM-2DS:
y =b
[
Step6: Calculate the error
- -
erroron
FTPDS
e r ro ron
FIPM2D
345678
Fig. 4. Graph to show the mean residual comparison
o
0.3
0.2
0.1
5. CONCLUSION
In this method , the least square method is used to
estimate a regression line based on the linear regression
analysis. The regression in this method is a simple linear
regression model estimating a regression model with
changes in two different variables . Since stream data has
a characteristic
of being continuously transmitted in
time, they defme the changes in time with two variables.
These two variables indicate an independent variable
and a dependent variable respectively.
Here, in this method I have presented an overview
of
some vulnerabilities of algorithm FIPM and designed
linear regression based algorithm FIM-2DS to predict
frequent sequence for real time stream data, in which I
reviewed the concept of preprocessing from the
algorithm FTP-DS. Our experimental result shows the
merit of our algorithm in terms
of
errors.
6. FUTURE ENHANCEMENT
REFERENCES
[I] B. Babcock, S. Babu, M. Datar, R. Motwani, and 1. Widom,
Models and Issues in Data Stream Systems, In Proc. of PODS,
March 2002.
[2] N. Davey, S. P. Hunt, and R. 1. Frank, Time Series Prediction
and Neural Networks, In Journal of Intelligent and Robotic
Systems, 200
I.
[3] L. Golab, M. Tamer Ozsu, Issues in Data StreamManagement,
In SIGMODRecord, Volume 32, Number 2,2003.
[4] X. Hao, D. Xu, Time Series Prediction based on Non
Parametric, In SIGMODRecord, Volume 32, Number 2,2003 .
[5] D. F. Specht, A General Regression Neural Network, IEEE
Trans. on Neural Networks, Vol.2, No.6, pp.568-576, Nov. 1991.
[6] D. Wacherly, W. Mendenhall,
R.
L. Scheaffer, Mathematical
Statistics with Applications , 5th edition, 2005.
[7] O. B. Yaik, C. H. Yong, and F. Haron, Time Series Prediction
using Adaptive Association Rules, In Proc. Of DFMA05,
pp.3IO-3I4,2005.
[8] Duck Jin Chai, Eun Hee Kin, Long Jin, Frequent Items
Prediction Method to one dimensional stream data , in Fifth
International Conference on Computational Science And
Applications, IEEE -2007.
Let us analyze a few open problems of FIM-2DS
algorithm from data stream background. Stream data is
so huge , infinite, and fast changing that it always
contains lots of noisy, overflowing, incomplete data
items. As my work is concerned, it is not dealing with
this problem that too at the cost of execution time, my
future work is to resolve this problem.
--- - e rro r o n FTPDS
____ e rro r o n F IPM2D
- 1
.- L . r
Tl r-J
Error
comparison o n F TP DS a nd
FIPM2D
Fig. 3. Graph to show the residual comparison
II .J>
II
I I
7/23/2019 linear regression.pdf
4/4
[9] Y.Chen, G.Dong, J.Han et, al, Multi-Dimensional Regression
Analysis of Time-Series Data Stream , Proceedings of the 28
th
International Conference on Very Large Data Base, Berlin, pp.
323-334,August 2002.
[10] Wei-Guang Teng, Ming-Syan Chen, Philip S.Yu, A
Regression-Based Temporal Pattern Mining Scheme for Data
Stream , Proceedings of the 29th International Conference on
Very Large Data Base, Berlin, pp.607-617, August 2003.
[11] Joshi M, Karypis G, A universal formulation of sequential
patterns , Research report No.99-021, Department of Computer
Science, University
of
Minnesota, Minnesota, 1999.