The Journal of Systems and Software 69 (2004) 105–113
www.elsevier.com/locate/jss
Minimum distance queries for time series data q
Sangjun Lee *, Dongseop Kwon, Sukho Lee
School of Electrical Engineering and Computer Science, Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, Seoul 151-742, South Korea
Received 24 October 2002; received in revised form 7 March 2003; accepted 17 March 2003
Abstract
In this paper, we propose an indexing scheme for time sequences which supports the minimum distance under arbitrary Lp norms as a
similarity measurement. In many applications where the shape of the time sequence is a major consideration, the minimum distance
is a more suitable similarity measurement than the simple Lp norm. To support minimum distance queries, most of the previous
work requires a preprocessing step of vertical shifting, which normalizes each sequence by its mean. Vertical shifting, however, has
the additional overhead of computing the mean of a sequence and subtracting it from each element. The proposed method
can match time series of similar shape without vertical shifting and guarantees no false dismissals. In addition, the proposed method
needs only one index structure to support minimum distance queries under any Lp norm. Experiments are performed on
real data (stock price movements) to verify the performance of the proposed method.
© 2003 Elsevier Inc. All rights reserved.
Keywords: Database; Time series; Similarity search; Minimum distance
1. Introduction
Time sequences are of growing importance in many
database applications, such as data mining and data
warehousing (Agrawal et al., 1993a; Fayyad et al., 1996).
A time sequence is a sequence of real numbers and each
number represents a value at a time point. Typical ex-
amples include stock price movement, exchange rate,
weather data, biomedical measurements, etc. Recently,
the problem of similarity search in time series databases has received much attention,
because similarity search helps with prediction and hypothesis testing in data mining and
knowledge discovery (Agrawal et al., 1993a; Fayyad et al., 1996).
The main issue of similarity search in time series
databases is to improve the search performance when a
particular similarity model is given. In general, indexing
is used to support fast retrieval and matching of similar
sequences. Most approaches map sequences of length n into points in an n-dimensional space, and a
spatial access method such as R-tree (Guttman, 1984) or
R*-tree (Beckmann et al., 1990) can be used for fast
q This work was supported in part by the Brain Korea 21 project and the ITRC program.
* Corresponding author. Tel.: +82-2-880-7299; fax: +82-2-883-8387.
E-mail address: [email protected] (S. Lee).
0164-1212/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/S0164-1212(03)00078-5
retrieval of those points. However, sequences are usually
long. A straightforward indexing of sequences using a spatial access method suffers from performance
deterioration due to the dimensionality curse of the index
structure. To solve this problem, several dimensionality
reduction methods have been proposed to map se-
quences into a new feature space of a lower dimen-
sionality. Typical dimensionality reduction methods
include the discrete Fourier transform (DFT) and the
discrete wavelet transform (DWT). These methods guarantee no false dismissals, and the false alarms
they admit can be removed in a post-processing step. However, dimensionality
reduction methods such as the DFT or the
DWT are known to be efficient only when the given similarity
model is the Euclidean distance. The effectiveness
of the DFT and the DWT is uncertain for other similarity
models such as L_1 and L_\infty.

Another important issue in similarity search in
time series databases is to choose the similarity model.
According to the application, several similarity models, from
L_1 to L_\infty, can be required for the same sequence database.
To support this situation, several indexing methods would have
to be implemented to support multi-modal queries in the
same system, which is difficult and inefficient.
The distance of simple Lp norm has been widely used
to measure the similarity of two sequences A and B of
the same length. Given two sequences A = (a_1, a_2, \ldots, a_n) and B = (b_1, b_2, \ldots, b_n) of equal length n, their
simple distance is calculated by

L_p^{sim}(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}, \quad 1 \le p \le \infty

L_1 is the Manhattan distance and L_2 is the Euclidean
distance. L_\infty is the maximum distance over all pairs of
elements.

Many techniques have been proposed to support the
fast retrieval of similar sequences based on the simple L_p norm. However, the simple L_p norm as a
similarity model has the following problem: it is sensitive
to the absolute offsets of the sequences, so two sequences that have similar shapes but
different vertical positions may be classified as dissimilar.
Consider a query sequence Q = (4, 9, 4, 9, 4) and two data sequences A = (7, 5, 6, 7, 6) and B = (14, 19, 14, 19, 14) in Fig. 1. Note that shifting Q upward by 10
units generates B. Using the similarity definition of the
simple Lp norm, A is a more similar sequence to Q than
B. However, B is more similar to Q in shape. From this
example, the simple Lp norm is not a good measurement of
similarity when shape is the major consideration.
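As an illustration, the example above can be checked with a minimal Python sketch of the simple Lp distance (p = float('inf') handled as the element-wise maximum):

```python
def lp_distance(a, b, p=2.0):
    """Simple Lp distance between two equal-length sequences.

    p = float('inf') gives the maximum element-wise difference (L_inf).
    """
    if p == float('inf'):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

Q = [4, 9, 4, 9, 4]
A = [7, 5, 6, 7, 6]
B = [14, 19, 14, 19, 14]           # Q shifted upward by 10 units

print(lp_distance(Q, A))            # ~6.08: A looks "closer" to Q ...
print(lp_distance(Q, B))            # ~22.36: ... than B, despite B's identical shape
```

Under the simple L2 norm, the shifted copy B is penalized for its offset, which is exactly the shortcoming the minimum distance addresses.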
In order to overcome the above-mentioned shortcoming of the simple Lp norm, the minimum distance is often
used for sequence matching (Lam and Wong, 1998; Chu
et al., 1998; Chan and Fu, 1999). The minimum distance
is the Lp distance between two sequences, neglecting their offsets. It gives a better estimate of
similarity in shape between two sequences, irrespective of
their vertical positions. To support minimum distance
queries, most of the previous work applies a preprocessing step, called vertical shifting, which normalizes each sequence by its mean before indexing (Goldin and Kanellakis, 1995; Keogh and Pazzani, 2000; Keogh et al.,
2001). This has the effect of shifting the sequence along the
value axis so that its mean is zero, removing its offset.
Fig. 1. Shortcoming of the simple Lp norm (sequences Q, A, and B plotted as value over time).
However, vertical shifting has the additional overhead of computing the mean of a sequence and subtracting the
mean from each element of the sequence. The general
approach used in previous work to process minimum
distance queries is as follows.
Building Index
1. S_i^{norm} ← normalization(S_i, mean(S_i)) // normalize each sequence S_i by its mean
2. F_{S_i} ← transformation(S_i^{norm}) // apply a transformation to the normalized sequence S_i^{norm} for feature extraction; F_{S_i} is the dimensionality-reduced feature of S_i^{norm}
3. Insert F_{S_i} into the spatial access method

Query Processing
1. F_Q ← transformation(normalization(Q, mean(Q))) // project the query sequence Q into the feature space
2. Select candidates within the error bound ε using the index structure
3. Compute the actual minimum distance between the query sequence and each candidate sequence, and filter out false alarms
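The build-index steps above can be sketched in plain Python. This is only an illustration of the prior-work pipeline, with a truncated DFT standing in for the transformation step and illustrative helper names:

```python
import cmath

def normalize(seq):
    """Vertical shifting: subtract the mean so the sequence's offset is removed."""
    mu = sum(seq) / len(seq)
    return [x - mu for x in seq]

def dft_features(seq, k=2):
    """Feature extraction: keep the first k DFT coefficients
    (real and imaginary parts) as a low-dimensional feature vector."""
    n = len(seq)
    feats = []
    for f in range(k):
        c = sum(x * cmath.exp(-2j * cmath.pi * f * i / n)
                for i, x in enumerate(seq))
        feats.extend([c.real, c.imag])
    return feats

def build_feature(seq, k=2):
    # Step 1 (normalize by the mean) followed by Step 2 (transform);
    # the result would then be inserted into a spatial access method.
    return dft_features(normalize(seq), k)

# Vertically shifted sequences map to (numerically) the same feature vector:
print(build_feature([4, 9, 4, 9, 4]))
print(build_feature([14, 19, 14, 19, 14]))
```

Because normalization zeroes the mean, a constant vertical shift only affects the (discarded) offset, so both sequences land on the same point in the feature space.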
The preprocessing of vertical shifting can be negligible if
data sequences and query sequences are of the same
length. However, in time series databases, subsequence
matching, and sequence matching over different lengths, is more practical and important. When we would
like to search for subsequences similar to a query sequence
in a time series database, the preprocessing of vertical
shifting is not easy, since the length of query sequences
can be arbitrary. To handle this situation, all possible
subsequences would have to be preprocessed; consequently, the preprocessing cost can become prohibitively
large. Moreover, it is hard to maintain all possible cases in the index structure.
In this paper, we propose a novel and fast time series
indexing scheme, called segmented mean variation
indexing (SMV-indexing). The method is motivated by the
autocorrelation of a sequence: the variation between two adjacent elements of a sequence is invariant
under vertical shifting. This property of autocorrelation
motivates the dimensionality reduction technique introduced in this paper. The proposed method can match
time series of similar shape without vertical shifting and
guarantees no false dismissals. In addition, the proposed
method needs only one index structure to process minimum distance queries under any Lp norm. Our
approach can also handle subsequence matching based on the minimum distance, without the
preprocessing of vertical shifting, by directly adopting the ST-index method
(Faloutsos et al., 1994), which was proposed for subsequence matching by Christos Faloutsos et al.
The remainder of this paper is organized as follows.
Section 2 provides a survey of related work. We will give
similarity models used in sequence matching in Section
3, and our proposed approach is described in Section 4.
We will present the overall process of minimum distance
queries in Section 5. Section 6 presents the experimental results. Finally, several concluding remarks are given in
Section 7.
2. Related work
Various methods have been proposed for fast
matching and retrieval of time series. The main focus is
to speed up the search process. The most popular
methods perform feature extraction as a dimensionality reduction of the time series data, and then use a
spatial access method such as the R-tree to index the time series data in the feature space.
An indexing scheme called the F-index (Agrawal
et al., 1993b) is proposed to handle sequences of the
same length. The idea is to use the DFT to transform a
sequence from time domain to frequency domain, drop
all but the first few frequencies, and then use the re-
maining ones to index the sequence using a spatial
access method. The results of Agrawal et al. (1993b) are extended in Faloutsos et al. (1994) and the ST-index is
proposed for subsequence matching in time series
databases. The methods proposed in Agrawal et al.
(1993b) and Faloutsos et al. (1994) use the Euclidean
distance as a similarity measurement without consider-
ing any transformation.
In Goldin and Kanellakis (1995), the authors show
that similarity retrieval will be invariant to simple shifting and scaling if sequences are normalized before
indexing. In Das et al. (1997) and Bollobas et al. (1997),
the authors present an intuitive similarity model for time
series data. They argue that a similarity model with
scaling and shifting is better than the Euclidean distance. However, they do not present any indexing
method. In Agrawal et al. (1995), the authors give a method
to retrieve similar sequences in the presence of noise, scaling, and translation in time series databases. In
Chu and Wong (1999), the authors propose the defini-
tion of similarity based on scaling and shifting trans-
formations. In Li et al. (1996), the authors present a
hierarchical algorithm called HierarchyScan. The
idea of this method is to perform correlation between
the stored sequences and the template hierarchically in the transformed
domain. In Lam and Wong (1998) and Chu et al. (1998), a definition of sequence similarity based on the slope of a sequence's segments is
discussed.
In Korn et al. (1997), the singular value decomposition (SVD) is used as the dimensionality reduction
technique. The SVD is a global transformation which
maps the entire dataset into a much smaller one, and it
can be used to support ad hoc queries on large
datasets of sequences. The problem with the SVD is that
inserting a sequence into the database requires recomputing the entire index: the SVD has to examine the
entire dataset again to update the index.

In Rafiei and Mendelzon (1997), the authors propose a
set of linear transformations such as moving average,
time warping, and reversing. These transformations can
be used as the basis of similarity queries for time series
data. The results of Rafiei and Mendelzon (1997) are
extended in Rafiei (1999), where the authors propose a
method for processing queries that express similarity
in terms of multiple transformations instead of a single
one. In Yi et al. (1998) and Park et al. (2000), the authors
use time warping as the distance function and present
algorithms for retrieving similar sequences under this
function. However, the time warping distance does not
satisfy the triangle inequality and can cause false dismissals.
In Chan and Fu (1999), the authors propose to use the
DWT for dimensionality reduction instead of the DFT. They argue that the Haar wavelet transform performs better than the DFT. However, the DWT has the
limitation that it is only defined for sequences whose
length is a power of two.
In Perng et al. (2000), the authors propose the Landmark
Model for similarity-based pattern queries in time series
databases. The Landmark Model integrates similarity measurement, data representation, and smoothing
techniques in a single framework. The model is based on
the fact that people recognize patterns by identifying
important points.
In Keogh and Pazzani (2000) and Keogh et al.
(2001), the authors introduce a dimensionality reduction technique based on segmented
mean features. In this method, the sequence is divided
into non-overlapping segments, and the feature of a segment is the segment's mean value.
The sequence is then indexed in the feature space by a
spatial access method. The same concept was proposed in
independent research by Yi and Faloutsos
(2000), who show that segmented mean features
can be used under arbitrary Lp norms. However, since the
segmented mean features cannot support minimum
distance queries directly, vertical shifting of the raw data is required in
preprocessing to process minimum distance queries.
3. Similarity models for sequence matching
In this section, we give the two similarity models used in sequence matching. The first definition is based on the
simple Lp norm between two sequences.
Definition 1 (Simple Lp norm). Given a threshold ε, two sequences A = {a_1, a_2, \ldots, a_n} and B = {b_1, b_2, \ldots, b_n} of equal length n are said to be similar if

L_p^{sim}(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p} \le \epsilon
A shortcoming of Definition 1 is demonstrated in Fig. 1.
The sequences B and Q are the same in shape, because B can be obtained by shifting Q upward by 10 units. However, they can be classified as dissimilar by Definition 1.
From this example, the simple Lp norm is not a suitable
measurement of similarity when the shape of a sequence
is the major consideration.
Definition 2 (Minimum distance). Given a threshold ε, two sequences A = {a_1, a_2, \ldots, a_n} and B = {b_1, b_2, \ldots, b_n} of equal length n are said to be similar in shape if

L_p^{min}(A, B) = \left( \sum_{i=1}^{n} |a_i - b_i - m|^p \right)^{1/p} \le \epsilon

where

m = \frac{1}{n} \sum_{i=1}^{n} (a_i - b_i)
From Definition 2, two sequences are said to be similar
in shape if their minimum distance is less than or equal to a
threshold ε. This definition is a more flexible similarity
model when we would like to retrieve sequences of similar shape
from databases irrespective of their vertical
positions.
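A minimal Python sketch of Definition 2, applied to the Fig. 1 example, shows how removing the mean offset m makes Q and its shifted copy B identical in shape:

```python
def minimum_distance(a, b, p=2.0):
    """Minimum distance of Definition 2: the Lp distance after removing
    the mean offset m between the two sequences."""
    n = len(a)
    m = sum(x - y for x, y in zip(a, b)) / n
    return sum(abs(x - y - m) ** p for x, y in zip(a, b)) ** (1.0 / p)

Q = [4, 9, 4, 9, 4]
A = [7, 5, 6, 7, 6]
B = [14, 19, 14, 19, 14]        # Q shifted upward by 10 units

print(minimum_distance(Q, B))    # 0.0: identical in shape
print(minimum_distance(Q, A))    # ~6.07: a genuinely different shape
```

Compare with the simple L2 distances (about 6.08 to A and 22.36 to B): the ranking flips once vertical offsets are neglected.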
4. Proposed approach
The problem we focus on is the fast search and
retrieval of similar sequences in databases
based on the minimum distance. To the best of our
knowledge, this is the first work that examines indexing
methods for minimum distance queries, without vertical
shifting, for time series of arbitrary length under arbitrary
Lp norms. We now introduce segmented mean
variation indexing (SMV-indexing) and show that it
guarantees no false dismissals.
4.1. Dimensionality reduction
Our goal is to extract features that capture information about the shape of a sequence, and that lead to a
feature distance satisfying the lower-bound
condition of the minimum distance. Suppose we have a
set of sequences of length n. Our proposed
feature extraction method consists of two steps. First,
we divide each sequence into m segments of equal length
l. Next, we extract a simple feature from each segment.
We propose to use the segmented mean variation as the
feature of a segment in a sequence. F_{Aj} denotes the feature of the j-th segment of a sequence A. We define the
feature vector of a sequence A as follows.
Definition 3 (Segmented mean variation feature). Given a sequence A = {a_1, a_2, \ldots, a_n} and the number of segments m > 0, define the feature vector \vec{F_A} of a sequence A by

\vec{F_A} = (F_{A1}, F_{A2}, \ldots, F_{Am}) = \frac{1}{l-1} \left( \sum_{i=1}^{l-1} |a_{i+1} - a_i|, \; \sum_{i=l}^{2(l-1)} |a_{i+1} - a_i|, \; \ldots, \; \sum_{i=(m-1)l+(2-m)}^{m(l-1)} |a_{i+1} - a_i| \right)
Fig. 2 illustrates the dimensionality reduction method
used in this paper. A sequence of length 11 is projected
into two dimensions: the sequence is divided into two
segments and the mean variation of each segment is obtained.
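Definition 3 amounts to averaging the adjacent absolute differences within each of the m chunks; reading the index bounds, segment j covers the l-1 differences (j-1)(l-1)+1, ..., j(l-1), so m must divide n-1. A minimal sketch:

```python
def smv_features(seq, m):
    """Segmented mean variation feature vector (Definition 3): split the
    n-1 adjacent differences |a[i+1] - a[i]| into m equal chunks of
    l-1 = (n-1)/m differences and return the mean of each chunk."""
    n = len(seq)
    assert (n - 1) % m == 0, "m must divide n - 1"
    w = (n - 1) // m
    diffs = [abs(seq[i + 1] - seq[i]) for i in range(n - 1)]
    return [sum(diffs[j * w:(j + 1) * w]) / w for j in range(m)]

A = [5, 4, 5, 6, 8, 7, 7, 6, 7, 6, 5]     # the length-11 sequence of Fig. 2
print(smv_features(A, 2))                  # [1.2, 0.8]

# The features are invariant under vertical shifting:
shifted = [a + 10 for a in A]
print(smv_features(shifted, 2))            # [1.2, 0.8] again
```

The shift-invariance demonstrated at the end is exactly the property the indexing scheme relies on.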
The algorithm to extract the feature vector is very
simple. Since the segmented mean variation is invariant under vertical shifting, the segmented mean
variation features can be indexed to support minimum
distance queries. The proposed indexing scheme is motivated by the following observation.
Observation 1 (Autocorrelation of sequence). The segmented mean variation is invariant under vertical shifting.

Its correctness is obvious from the following equality:

MeanVariation(A) = \frac{1}{n-1} \sum_{i=1}^{n-1} |a_{i+1} - a_i| = \frac{1}{n-1} \sum_{i=1}^{n-1} |(a_{i+1} - m_a) - (a_i - m_a)|

where

m_a = \frac{1}{n} \sum_{i=1}^{n} a_i
4.2. Lower bounding of the minimum distance
In order to guarantee no false dismissals, we must
show that the distance between feature vectors is the
lower bound of the minimum distance between original
sequences.
Fig. 2. Dimensionality reduction technique used in this paper (sequence A = (5, 4, 5, 6, 8, 7, 7, 6, 7, 6, 5)).
It is not hard to show that the feature distance is a
lower bound of the minimum distance:

L_p^{sim}(\vec{F_A}) \le L_p^{min}(A)
In practice, however, the segmented mean variation
feature \vec{F_A} is a poor approximation of a sequence A, since it is essentially a down-sampling of the sequence.
Much of the information is lost and, consequently, too
many false alarms occur. It is necessary to
compensate for this loss of information so that the
number of false alarms can be reduced. More specifically, we find a factor \alpha_p > 1 such that

\alpha_p \cdot L_p^{sim}(\vec{F_A}) \le L_p^{min}(A)

Since F(\cdot) = |\cdot|^p is a convex function on \mathbb{R} for
1 \le p \le \infty, we can use the properties of convex functions to find \alpha_p. There is a well-known mathematical result on
convex functions; we use the following theorem
from Yi and Faloutsos (2000) and Protter and Morrey
(1977).
Theorem 1. Suppose that x_1, x_2, \ldots, x_n \in \mathbb{R} and \lambda_1, \lambda_2, \ldots, \lambda_n \in \mathbb{R} such that \lambda_i \ge 0 and \sum_{i=1}^{n} \lambda_i = 1. If F(\cdot) is a convex function on \mathbb{R}, then

F(\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_n x_n) \le \lambda_1 F(x_1) + \lambda_2 F(x_2) + \cdots + \lambda_n F(x_n)

where \mathbb{R} is the set of real numbers.

Proof. The proof is given in Protter and Morrey (1977). □
Using the result of Theorem 1 by taking \lambda_i = 1/n, we
obtain the following corollary.
Corollary 1. For any sequence A = (a_1, a_2, \ldots, a_n) and 1 \le p \le \infty, the following holds:

(n-1) \cdot |MeanVariation(A)|^p \le \sum_{i=1}^{n-1} |a_{i+1} - a_i|^p \le 2^{p-1} \left( \sum_{i=1}^{n-1} |a_i - m_a|^p + \sum_{i=2}^{n} |a_i - m_a|^p \right) \le 2^p \sum_{i=1}^{n} |a_i - m_a|^p

where

m_a = \frac{1}{n} \sum_{i=1}^{n} a_i
or, equivalently, for each segment S_j of a sequence A, 1 \le j \le m, we have

(l-1) \cdot |MeanVariation(S_j)|^p \le \sum_{i=(j-1)l+(2-j)}^{j(l-1)} |a_{i+1} - a_i|^p \le 2^{p-1} \left( \sum_{i=(j-1)l+(2-j)}^{j(l-1)} |a_i - m_a|^p + \sum_{i=(j-1)l+(3-j)}^{j(l-1)+1} |a_i - m_a|^p \right)

where

m_a = \frac{1}{n} \sum_{i=1}^{n} a_i
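Corollary 1 can be spot-checked numerically. The sketch below tests the whole-sequence chain of inequalities over random sequences (tolerances guard floating-point rounding):

```python
import random

# Check: (n-1)*|MeanVariation(A)|^p <= sum |a[i+1]-a[i]|^p
#                                   <= 2^p * sum |a[i]-m_a|^p
random.seed(1)
p = 3
for _ in range(1000):
    n = random.randint(2, 30)
    A = [random.uniform(-50, 50) for _ in range(n)]
    ma = sum(A) / n
    diffs = [abs(A[i + 1] - A[i]) for i in range(n - 1)]
    mean_variation = sum(diffs) / (n - 1)
    lhs = (n - 1) * mean_variation ** p          # (n-1) * |MeanVariation(A)|^p
    mid = sum(d ** p for d in diffs)             # sum of adjacent differences^p
    rhs = 2 ** p * sum(abs(a - ma) ** p for a in A)
    assert lhs <= mid + 1e-6 and mid <= rhs + 1e-6
print("Corollary 1 held on 1000 random sequences")
```

The first inequality is the convexity step (Theorem 1); the second follows from |x - y|^p ≤ 2^{p-1}(|x|^p + |y|^p) applied term by term.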
Now, we present our main theorem as follows.

Theorem 2. For any sequence A = (a_1, a_2, \ldots, a_n) and 1 \le p \le \infty, the following holds:

\sqrt[p]{l-1} \cdot L_p^{sim}(\vec{F_A}) \le 2 \cdot \left( \sum_{i=1}^{n} |a_i - m_a|^p \right)^{1/p} = 2 \cdot L_p^{min}(A)

where

m_a = \frac{1}{n} \sum_{i=1}^{n} a_i
Proof. Based on the definitions of L_p and \vec{F_A},

(l-1) \cdot L_p^{sim}(\vec{F_A})^p = (l-1) \cdot \sum_{j=1}^{m} |MeanVariation(S_j)|^p

By Corollary 1, applied segment by segment, we get the following inequality:

(l-1) \cdot \sum_{j=1}^{m} |MeanVariation(S_j)|^p \le 2^{p-1} \left( \sum_{i=1}^{l-1} |a_i - m_a|^p + \sum_{i=2}^{l} |a_i - m_a|^p \right) + 2^{p-1} \left( \sum_{i=l}^{2(l-1)} |a_i - m_a|^p + \sum_{i=l+1}^{2(l-1)+1} |a_i - m_a|^p \right) + \cdots + 2^{p-1} \left( \sum_{i=(m-1)l+(2-m)}^{m(l-1)} |a_i - m_a|^p + \sum_{i=(m-1)l+(3-m)}^{m(l-1)+1} |a_i - m_a|^p \right) \le 2^p \cdot \sum_{i=1}^{n} |a_i - m_a|^p = 2^p \cdot L_p^{min}(A)^p

Taking the p-th root of both sides completes the proof, where m_a = \frac{1}{n} \sum_{i=1}^{n} a_i. □
Owing to Theorem 2, we can efficiently handle minimum distance queries with the segmented mean variation features. Suppose we compare two sequences A and
B. By Theorem 2, L_p^{min}(A, B) \le \epsilon implies
L_p^{sim}(\vec{F_A}, \vec{F_B}) \le 2\epsilon / \sqrt[p]{l-1}. Consequently, minimum
distance queries under any Lp norm can be correctly converted to simple Lp distance queries in the feature
space without false dismissals. We can reduce the search
space by shrinking the search radius by a factor of 2/\sqrt[p]{l-1}. Note that if \alpha_p < 1, \alpha_p
cannot be used to reduce the search space, because it
would produce more false alarms; in the case \alpha_p < 1, we use the original feature distance.
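The two-sequence form of the bound can be checked empirically. A self-contained sketch in plain Python (function names are illustrative):

```python
import random

def smv_features(seq, m):
    """Segmented mean variation features: mean |a[i+1]-a[i]| per segment."""
    n = len(seq)
    w = (n - 1) // m
    d = [abs(seq[i + 1] - seq[i]) for i in range(n - 1)]
    return [sum(d[j * w:(j + 1) * w]) / w for j in range(m)]

def lp(u, v, p):
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)

def min_dist(a, b, p):
    m = sum(x - y for x, y in zip(a, b)) / len(a)
    return sum(abs(x - y - m) ** p for x, y in zip(a, b)) ** (1.0 / p)

random.seed(0)
m, l, p = 8, 17, 2                   # 8 segments of length 17: n = m*(l-1) + 1
n = m * (l - 1) + 1
for _ in range(200):
    A = [random.uniform(0, 100) for _ in range(n)]
    B = [random.uniform(0, 100) for _ in range(n)]
    lhs = (l - 1) ** (1.0 / p) * lp(smv_features(A, m), smv_features(B, m), p)
    rhs = 2 * min_dist(A, B, p)      # Theorem 2: lhs never exceeds rhs
    assert lhs <= rhs + 1e-9
print("bound held on all random pairs")
```

Any pair whose minimum distance is within ε therefore has a feature distance within 2ε/(l-1)^{1/p}, which is exactly the enlarged radius used when querying the index.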
5. Query processing
In this section, we present the overall process of our
time series indexing scheme. Before a query is performed, we do some preprocessing to extract feature vectors from the sequences and build an index
structure. After the index is built, a similarity search
can be performed to select candidate sequences from
the database.
5.1. Preprocessing
Step 1. Similarity model selection: According to their application, users may choose a distance
model from L_1^{min} to L_\infty^{min} as their similarity
model. The same index structure can be reused
to process minimum distance queries regardless
of the chosen distance model L_p^{min}.
Step 2. Index construction: Each sequence in the database is divided
into m segments, and its feature vector is extracted
using the method described
in Section 4.1. Then, we build a multidimensional index structure, such as an R-tree, over the feature vectors
extracted from the sequences.
5.2. Index searching
After the index structure has been built, we can perform a similarity search against a given query sequence. The search algorithm consists of two main
parts: candidate selection, and post-processing to remove false alarms. Some non-qualifying sequences may be included in the result of
candidate selection. The actual minimum distance between the query sequence and each candidate sequence is
computed, and only those within the error bound are
reported as the query result. The implementation of
segmented mean variation indexing is described in Algorithm 1.
Algorithm 1: SMV-Indexing
Input: query sequence Q, error bound ε
Output: data sequences within error bound ε
begin
  Result ← ∅;
  Candidate ← ∅;
  // project the query sequence Q into the index space
  F_Q ← FeatureExtraction(Q);
  // candidate selection using a spatial access method
  Candidate ← IndexSearching(F_Q, SAM, 2ε/\sqrt[p]{l-1});
  // post-processing to remove false alarms
  for all C_i ∈ Candidate do
    if ComputeMinimumDistance(Q, C_i) ≤ ε then
      Result ← Result ∪ {C_i};
    else
      reject C_i;
  end
  return Result;
end
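A runnable sketch of Algorithm 1 in Python, with a linear scan over feature vectors standing in for the spatial access method (helper names are illustrative, not the paper's implementation):

```python
def smv_features(seq, m):
    """Segmented mean variation features (Definition 3)."""
    n = len(seq)
    w = (n - 1) // m
    d = [abs(seq[i + 1] - seq[i]) for i in range(n - 1)]
    return [sum(d[j * w:(j + 1) * w]) / w for j in range(m)]

def min_dist(a, b, p):
    off = sum(x - y for x, y in zip(a, b)) / len(a)
    return sum(abs(x - y - off) ** p for x, y in zip(a, b)) ** (1.0 / p)

def smv_search(query, database, m, p, eps):
    """Candidate selection in feature space, then exact post-processing."""
    n = len(query)
    l = (n - 1) // m + 1                        # segment length
    radius = 2 * eps / (l - 1) ** (1.0 / p)     # enlarged bound from Theorem 2
    fq = smv_features(query, m)
    result = []
    for seq in database:
        fs = smv_features(seq, m)
        feat_d = sum(abs(x - y) ** p for x, y in zip(fq, fs)) ** (1.0 / p)
        if feat_d <= radius:                    # candidate (no false dismissals)
            if min_dist(query, seq, p) <= eps:  # remove false alarms
                result.append(seq)
    return result

data = [[5, 4, 5, 6, 8, 7, 7, 6, 7, 6, 5],
        [15, 14, 15, 16, 18, 17, 17, 16, 17, 16, 15],  # first sequence + 10
        [5, 9, 2, 8, 1, 9, 3, 7, 2, 8, 1]]
print(len(smv_search(data[0], data, m=2, p=2, eps=1.0)))  # 2: the two shifted twins
```

The third sequence has a far larger mean variation per segment, so it is pruned in feature space and its exact minimum distance is never computed.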
6. Performance evaluation
In this section, we present the results of
experiments performed to analyze the performance of
SMV-indexing. To verify the effectiveness of the
proposed method, we compared SMV-indexing with
sequential scanning with respect to the search space
ratio and the average response time for minimum
distance queries. The experimental settings are described
first, and the results of the experiments are given next.
Fig. 3. Search space ratio by varying the error bound (L_1^{min}, L_2^{min}, L_3^{min}, and L_\infty^{min}).
Fig. 4. Search space ratio by varying the feature dimensionality (L_1^{min}, L_2^{min}, L_3^{min}, and L_\infty^{min}).
6.1. Experimental setup
We implemented both SMV-indexing and
sequential scanning in C++ on a Linux machine (RedHat 7.1, kernel version 2.4.2) with dual Pentium III 500
MHz CPUs, 512 MB of memory, and a 40 GB hard disk. The
size of a disk page is set to 4 KB. For the spatial access
method, we used Katayama's R*-tree source
code. 1

The real sequence data were obtained from the Seoul
Stock Market, Korea. 2 The stock data are based on
daily closing prices. We collected 2000 stock sequences of average length 128. We ran 25
random queries over the real dataset to find similar sequences based on the minimum distance. The query
sequences were randomly selected from the database.
6.2. Experimental results
First, we compared SMV-indexing with
sequential scanning in terms of the number of actual
distance computations required. We evaluated the search space
ratio to test the filtering effect of removing irrelevant
sequences during index searching. The search
space ratio is defined as follows:

search space ratio = (number of candidate sequences) / (number of sequences in the database)
Fig. 3 shows the search space ratio with respect to the
error bound under L_1^{min}, L_2^{min}, L_3^{min}, and L_\infty^{min}. The dimensionality of the feature vectors for SMV-indexing is set to 8
in this experiment. The results show that there is a significant gain in reducing the search space by SMV-indexing.
Fig. 4 shows the search space ratio with respect to the
feature dimensionality under L_1^{min}, L_2^{min}, L_3^{min}, and L_\infty^{min}. We set the error bound to 5000 in this experiment. As the
dimensionality of the feature vectors increases, the filtering
effect increases.
A good filtering effect does not always result in good
performance, since the spatial access method is
affected by the dimensionality of the feature vectors. We
compared SMV-indexing with 8 and 16
dimensions against sequential scanning with respect to the search space ratio and the average execution time by
varying the error bound under the L2 norm. Figs. 5 and 6 show that SMV-indexing with 16 dimensions is more efficient than SMV-indexing with 8
dimensions in reducing the search space. However, the
1 Katayama's R*-tree source code can be retrieved from http://research.nii.ac.jp/~katayama/homepage/research/srtree/English.html
2 The stock data can be retrieved from http://www.kse.or.kr/kor/stat/stat_data.htm

Fig. 5. Search space ratio: SMV-indexing vs. sequential scanning in L_2^{min}.
performance of SMV-indexing with 16 dimensions is not better than that of SMV-indexing with 8 dimensions. The
Fig. 6. Execution time: SMV-indexing vs. sequential scanning in L_2^{min}.
SMV-indexing is more efficient than sequential
scanning in performance. However, as shown in Fig. 6,
the average execution time increases as the given error
bound increases; eventually, the performance of
SMV-indexing becomes worse than that of sequential scanning. This is because the number of retrieved
sequences for a larger error bound increases more
rapidly than the number of relevant sequences (the dimensionality curse of the R*-tree). Consequently, the post-processing time to remove false alarms increases with
a larger error bound.
7. Conclusion
In this paper, we considered the problem of fast
similarity search in large time series databases when the
similarity measurement is the minimum distance under an arbitrary Lp norm. Previous work on sequence searching
using the simple Lp distance as a similarity measurement
suffers from some drawbacks. The simple Lp distance is
sensitive to the absolute offsets of the sequences, so
two sequences that have similar shapes but
different vertical positions may be classified as
dissimilar. Therefore, the simple Lp distance between
two sequences is not a good measurement of similarity
in terms of their shapes.

The minimum distance is a more suitable similarity
measurement than the simple Lp norm when we would
like to retrieve sequences of similar shape from databases irrespective of their vertical positions. To process
minimum distance queries, most of the previous work
requires a preprocessing step of vertical shifting that
normalizes each sequence by its mean. However, vertical
shifting of a sequence adds overhead to index building.
In this paper, we have proposed a novel time series
indexing scheme that supports minimum distance queries, without vertical shifting, under arbitrary Lp norms. Our
indexing scheme is motivated by the autocorrelation of a
sequence: the segmented mean variation of a
sequence is invariant under vertical shifting. Using this
property, our indexing scheme can handle minimum distance queries
without vertical shifting. In addition, the segmented
mean variation feature can be used to process minimum
distance queries under any Lp norm. Our approach
can also handle subsequence matching based on the minimum distance, without the preprocessing of vertical shifting, by directly adopting the ST-index method
(Faloutsos et al., 1994).
The major contributions of this work can be summarized as follows.
• We introduced an efficient indexing scheme that
processes minimum distance queries without vertical
shifting, guaranteeing no false dismissals.
• We showed that the same index structure can be used
to process minimum distance queries under arbitrary Lp
norms.
We have performed experiments on real stock data
and examined the pruning power and performance of
our proposed scheme compared with sequential
scanning. The experiments show that our indexing
scheme is more efficient than sequential scanning.
References
Agrawal, R., Imielinski, T., Swami, A., 1993a. Database mining: A
performance perspective, special issue on learning and discovery in
knowledge-based databases. IEEE TKDE 5 (6), 914–925 (special
issue).
Agrawal Rakesh, Faloutsos Christos, Swami Arun N., 1993b. Efficient
similarity search in sequence databases. In: FODO, pp. 69–84.
Agrawal Rakesh, Lin King-Ip, Sawhney Harpreet S., Shim Kyuseok,
1995. Fast similarity search in the presence of noise, scaling, and
translation in time-series databases. In: VLDB, pp. 490–501.
Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B., 1990. The
R*-tree: An efficient and robust access method for points and
rectangles. In: SIGMOD Conference, pp. 322–331.
Bollobas Bela, Das Gautam, Gunopulos Dimitrios, Mannila Heikki,
1997. Time-series similarity problems and well-separated geo-
metric sets. In: Symposium on Computational Geometry, pp. 454–
456.
Chan Kin-pong, Fu Ada Wai-chee, 1999. Efficient time series matching
by wavelets. In: ICDE, pp. 126–133.
Chu Kelvin Kam Wing, Wong Man Hon, 1999. Fast time-series
searching with scaling and shifting. In: PODS, pp. 237–248.
Chu Kelvin Kam Wing, Lam Sze Kin, Wong Man Hon, 1998. An
efficient hash-based algorithm for sequence data searching. The
Computer Journal 41 (6), 402–415.
Das Gautam, Gunopulos Dimitrios, Mannila Heikki, 1997. Finding
similar time series. In: PKDD, pp. 88–100.
Faloutsos Christos, Ranganathan M., Manolopoulos Yannis, 1994.
Fast subsequence matching in time-series databases. In: SIGMOD
Conference, pp. 419–429.
Fayyad Usama M., Piatetsky-Shapiro Gregory, Smyth Padhraic, 1996.
Knowledge discovery and data mining: towards a unifying frame-
work. In: KDD, pp. 82–88.
Goldin Dina Q., Kanellakis Paris C., 1995. On similarity queries for
time-series data: Constraint specification and implementation. In:
CP, pp. 137–153.
Guttman, A., 1984. R-trees: A dynamic index structure for spatial
searching. In: SIGMOD Conference, pp. 47–57.
Keogh Eamonn J., Chakrabarti Kaushik, Mehrotra Sharad, Pazzani
Michael J., 2001. Locally adaptive dimensionality reduction for
indexing large time series databases. In: SIGMOD conference, pp.
151–162.
Keogh Eamonn J., Pazzani Michael J., 2000. A simple dimensionality
reduction technique for fast similarity search in large time series
databases. In: PAKDD, pp. 122–133.
Korn Flip, Jagadish H.V., Faloutsos Christos, 1997. Efficiently
supporting ad hoc queries in large datasets of time sequences. In:
SIGMOD Conference, pp. 289–300.
Lam, Sze Kin, Wong, Man Hon, 1998. A fast projection algorithm for
sequence data searching. DKE 28 (3), 321–339.
Li Chung-Sheng, Yu Philip S., Castelli Vittorio, 1996. HierarchyScan:
A Hierarchical similarity search algorithm for databases of long
sequences. In: ICDE, pp. 546–553.
Park Sanghyun, Chu Wesley W., Yoon Jeehee, Hsu Chihcheng, 2000.
Efficient searches for similar subsequences of different lengths in
sequence databases. In: ICDE, pp. 23–32.
Perng Chang-Shing, Wang Haixun, Zhang Sylvia R., Parker D. Stott,
2000. Landmarks: A new model for similarity-based pattern
querying in time series databases. In: ICDE, pp. 33–42.
Protter, M.H., Morrey, C.B., 1977. A First Course in Real Analysis.
Springer-Verlag.
Rafiei Davood, 1999. On similarity-based queries for time series data.
In: ICDE, pp. 410–417.
Rafiei Davood, Mendelzon Alberto O., 1997. Similarity-based queries
for time series data. In: SIGMOD Conference, pp. 13–25.
Yi Byoung-Kee, Faloutsos Christos, 2000. Fast time sequence indexing
for arbitrary Lp norms. In: VLDB, pp. 385–394.
Yi Byoung-Kee, Jagadish H.V., Faloutsos Christos, 1998. Efficient
retrieval of similar time sequences under time warping. In: ICDE,
pp. 201–208.