
The Journal of Systems and Software 69 (2004) 105–113

www.elsevier.com/locate/jss

Minimum distance queries for time series data

Sangjun Lee *, Dongseop Kwon, Sukho Lee

School of Electrical Engineering and Computer Science, Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, Seoul 151-742, South Korea

Received 24 October 2002; received in revised form 7 March 2003; accepted 17 March 2003

This work was supported in part by the Brain Korea 21 project and the ITRC program.
* Corresponding author. Tel.: +82-2-880-7299; fax: +82-2-883-8387. E-mail address: [email protected] (S. Lee).

Abstract

In this paper, we propose an indexing scheme for time sequences which supports the minimum distance of arbitrary Lp norms as a similarity measurement. In many applications where the shape of the time sequence is a major consideration, the minimum distance is a more suitable similarity measurement than the simple Lp norm. To support minimum distance queries, most of the previous work requires a preprocessing step, vertical shifting, which normalizes each sequence by its mean. Vertical shifting, however, incurs the additional overhead of computing the mean of a sequence and subtracting it from each element of the sequence. The proposed method can match time series of similar shape without vertical shifting and guarantees no false dismissals. In addition, the proposed method needs only one index structure to support minimum distance queries in any arbitrary Lp norm. Experiments were performed on real data (stock price movements) to verify the performance of the proposed method.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Database; Time series; Similarity search; Minimum distance

1. Introduction

Time sequences are of growing importance in many database applications, such as data mining and data warehousing (Agrawal et al., 1993a; Fayyad et al., 1996). A time sequence is a sequence of real numbers, each representing a value at a time point. Typical examples include stock price movements, exchange rates, weather data, biomedical measurements, etc. Recently, the problem of similarity search in time series databases has received a lot of attention, because it supports prediction and hypothesis testing in data mining and knowledge discovery (Agrawal et al., 1993a; Fayyad et al., 1996).

The main issue of similarity search in time series databases is to improve the search performance when a particular similarity model is given. In general, indexing is used to support fast retrieval and matching of similar sequences. Most approaches map sequences of length n into points in an n-dimensional space, and a spatial access method such as the R-tree (Guttman, 1984) or the R*-tree (Beckmann et al., 1990) can then be used for fast retrieval of those points. However, sequences are usually long, and a straightforward indexing of sequences using a spatial access method suffers from performance deterioration due to the dimensionality curse of the index structure. To solve this problem, several dimensionality reduction methods, such as the discrete Fourier transform (DFT) and the discrete wavelet transform (DWT), have been proposed to map sequences into a new feature space of lower dimensionality. These methods guarantee no false dismissals, and any false alarms can be removed in a post-processing step. However, dimensionality reduction methods such as the DFT or the DWT are known to be efficient only when the given similarity model is the Euclidean distance; their effectiveness is uncertain for other similarity models such as L1 or L∞.

Another important issue of similarity search in time series databases is the choice of the similarity model. Depending on the application, several similarity models from L1 to L∞ can be required for the same sequence database. To support this situation, several indexing methods have to be implemented for multi-modal queries in the same system, which is neither easy nor efficient.

The distance of the simple Lp norm has been widely used to measure the similarity of two sequences A and B of the same length. Given two sequences A = (a1, a2, ..., an) and B = (b1, b2, ..., bn) of equal length n, their simple distance is calculated by

$$L^{\mathrm{simp}}_p(A,B) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}, \qquad 1 \le p \le \infty$$

L1 is the Manhattan distance and L2 is the Euclidean distance; L∞ is the maximum distance over any pair of corresponding elements.

elements.Many techniques have been proposed to support the

fast retrieval of similar sequences based on the simple Lpnorm. However, the distance of simple Lp norm as a

similarity model has the following problem: it is sensi-

tive to the absolute offsets of the sequences, so two se-

quences that have similar shapes but with different

vertical positions may be classified as dissimilar.

Consider a query sequence Q ¼ ð4; 9; 4; 9; 4Þ and twodata sequences A ¼ ð7; 5; 6; 7; 6Þ and B ¼ ð14; 19; 14;19; 14Þ in Fig. 1. Note that shifting Q upward for 10

units generates B. Using the similarity definition of

simple Lp norm, A is a more similar sequence of Q than

B. However, B is more similar to Q in shape. From this

example, simple Lp norm is not a good measurement of

similarity when shape is the major consideration.
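
To make the example concrete, the following minimal sketch (assuming NumPy; the function name lp_dist is ours, not from the paper) computes the simple Lp distance for the sequences of Fig. 1 and confirms that A appears closer to Q than B does:

import numpy as np

def lp_dist(a, b, p=2.0):
    # Simple L_p distance between two equal-length sequences (1 <= p <= infinity).
    d = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return d.max() if np.isinf(p) else (d ** p).sum() ** (1.0 / p)

Q = [4, 9, 4, 9, 4]
A = [7, 5, 6, 7, 6]
B = [14, 19, 14, 19, 14]              # Q shifted upward by 10 units

print(lp_dist(Q, A), lp_dist(Q, B))   # about 6.08 vs. 22.36 in L_2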

In order to overcome the above-mentioned shortcoming of the simple Lp norm, the minimum distance is often used for sequence matching (Lam and Wong, 1998; Chu et al., 1998; Chan and Fu, 1999). The minimum distance is the distance of the Lp norm between two sequences, neglecting their offsets. This gives a better estimation of similarity in shape between two sequences irrespective of their vertical positions. To support minimum distance queries, most of the previous work has a preprocessing step, called vertical shifting, which normalizes each sequence by its mean before indexing (Goldin and Kanellakis, 1995; Keogh and Pazzani, 2000; Keogh et al., 2001). This has the effect of shifting the sequence along the value axis such that its mean is zero, removing its offset.

Fig. 1. Shortcoming of the simple Lp norm: value versus time for the sequences Q, A, and B.

However, vertical shifting has the additional overhead of computing the mean of a sequence and subtracting the mean from each element of the sequence. The general approach used in previous work to process minimum distance queries is as follows (a brief sketch follows the steps).

Building index
1. S_i^norm ← normalization(S_i, mean(S_i))  // normalize each sequence S_i by its mean
2. FS_i ← transformation(S_i^norm)  // apply a transformation to the normalized sequence S_i^norm for feature extraction; FS_i is the dimensionality-reduced feature of the normalized sequence S_i^norm
3. Insert FS_i into the spatial access method.

Query processing
1. FQ ← transformation(normalization(Q, mean(Q)))  // project the query sequence Q into the feature space
2. Select candidate sequences within the error bound ε using the index structure.
3. Compute the actual minimum distance between the query sequence and the candidate sequences, and filter out false alarms.
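
As an illustration of this conventional pipeline only, the sketch below (assuming NumPy; the choice of the DFT and of the number of retained coefficients is ours) normalizes each sequence by its mean and then keeps a few DFT coefficients as the feature that would be inserted into a spatial access method:

import numpy as np

def normalize(s):
    # Vertical shifting: subtract the mean so the sequence has zero offset.
    s = np.asarray(s, dtype=float)
    return s - s.mean()

def dft_features(s, k=4):
    # Feature extraction: keep the first k DFT coefficients of the normalized sequence.
    coeffs = np.fft.rfft(normalize(s))[:k]
    return np.concatenate([coeffs.real, coeffs.imag])

# Each feature vector would then be inserted into a spatial access method such as an R-tree.
features = [dft_features(s) for s in ([4, 9, 4, 9, 4], [14, 19, 14, 19, 14])]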

The preprocessing of vertical shifting can be negligible if data sequences and query sequences are of the same length. However, in time series databases, subsequence matching or matching sequences of different lengths is more practical and important. When we would like to search for subsequences similar to a query sequence, the preprocessing of vertical shifting is not easy, since the length of query sequences can be arbitrary. To handle this, all possible subsequence cases would have to be preprocessed. Consequently, the preprocessing cost can be exponentially large. Moreover, it is hard to maintain all possible cases in the index structure.

In this paper, we propose a novel and fast time series indexing scheme, called segmented mean variation indexing (SMV-indexing). The method is motivated by the autocorrelation of a sequence, that is, the variation between two adjacent elements in a sequence is invariant under vertical shifting. This property motivated the dimensionality reduction technique introduced in this paper. The proposed method can match time series of similar shape without vertical shifting and guarantees no false dismissals. In addition, the proposed method needs only one index structure to process minimum distance queries in any arbitrary Lp norm. Our approach can handle subsequence matching based on the minimum distance without preprocessing for vertical shifting by directly adopting the ST-index method proposed for subsequence matching by Faloutsos et al. (1994).


The remainder of this paper is organized as follows. Section 2 provides a survey of related work. Section 3 presents the similarity models used in sequence matching, and Section 4 describes our proposed approach. Section 5 presents the overall process of minimum distance queries, and Section 6 presents the experimental results. Finally, several concluding remarks are given in Section 7.

2. Related work

Various methods have been proposed for fast matching and retrieval of time series, the main focus being to speed up the search process. The most popular methods perform feature extraction as a dimensionality reduction of the time series data, and then use a spatial access method such as the R-tree to index the time series data in the feature space.

An indexing scheme called the F-index (Agrawal et al., 1993b) was proposed to handle sequences of the same length. The idea is to use the DFT to transform a sequence from the time domain to the frequency domain, drop all but the first few frequencies, and then use the remaining ones to index the sequence with a spatial access method. The results of Agrawal et al. (1993b) were extended in Faloutsos et al. (1994), where the ST-index was proposed for subsequence matching in time series databases. The methods proposed in Agrawal et al. (1993b) and Faloutsos et al. (1994) use the Euclidean distance as a similarity measurement without considering any transformation.

In Goldin and Kanellakis (1995), the authors show that similarity retrieval is invariant to simple shifting and scaling if sequences are normalized before indexing. In Das et al. (1997) and Bollobas et al. (1997), the authors present an intuitive similarity model for time series data. They argue that a similarity model with scaling and shifting is better than the Euclidean distance; however, they do not present any indexing method. In Agrawal et al. (1995), the authors give a method to retrieve similar sequences in the presence of noise, scaling, and translation in time series databases. In Chu and Wong (1999), the authors propose a definition of similarity based on scaling and shifting transformations. In Li et al. (1996), the authors present a hierarchical algorithm called HierarchyScan, whose idea is to perform correlation between the stored sequences and the template in the transformed domain hierarchically. In Lam and Wong (1998) and Chu et al. (1998), a definition of sequence similarity based on the slope of a sequence's segments is discussed.

In Korn et al. (1997), the singular value decomposition (SVD) is used as the dimensionality reduction technique. The SVD is a global transformation which maps the entire dataset into a much smaller one, and it can be used to support ad-hoc queries on large datasets of sequences. The problem with the SVD is that an insertion of sequence data into the database requires recomputing the entire index: the SVD has to examine the entire dataset again to update the index.

In Rafiei and Mendelzon (1997), the authors propose a set of linear transformations, such as moving average, time warping, and reversing, which can be used as the basis of similarity queries for time series data. The results of Rafiei and Mendelzon (1997) were extended in Rafiei (1999), where a method is proposed for processing queries that express similarity in terms of multiple transformations instead of a single one. In Yi et al. (1998) and Park et al. (2000), the authors use time warping as the distance function and present algorithms for retrieving similar sequences under this function. However, the time warping distance does not satisfy the triangle inequality and can cause false dismissals.

In Chan and Fu (1999), the authors propose to use the DWT for dimensionality reduction instead of the DFT. They argue that the Haar wavelet transform performs better than the DFT. However, the DWT has the limitation that it is only defined for sequences whose length is a power of two.

In Perng et al. (2000), the authors propose the Landmark Model for similarity-based pattern queries in time series databases. The Landmark Model integrates the similarity measurement, data representations, and smoothing techniques in a single framework. The model is based on the fact that people recognize patterns by identifying important points.

In Keogh and Pazzani (2000) and Keogh et al. (2001), the authors introduce a dimensionality reduction technique for a sequence using segmented mean features. In this method, the sequence is divided into non-overlapping segments, and the feature of a segment is represented by the segment's mean value. The sequence is then indexed in the feature space by a spatial access method. The same concept was proposed independently in Yi and Faloutsos (2000), where the authors show that the segmented mean features can be used with arbitrary Lp norms. However, since the segmented mean features cannot support minimum distance queries directly, vertical shifting of the raw data is required as preprocessing to process minimum distance queries.

3. Similarity models for sequence matching

In this section, we give the two similarity models used in sequence matching. The first definition is based on the simple Lp norm between two sequences.

Definition 1 (Simple Lp norm). Given a threshold ε, two sequences A = {a1, a2, ..., an} and B = {b1, b2, ..., bn} of equal length n are said to be similar if

$$L^{\mathrm{simp}}_p(A,B) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p} \le \epsilon$$

A shortcoming of Definition 1 is demonstrated in Fig. 1. The two sequences B and Q are the same in shape, because B can be obtained by shifting Q upward by 10 units; however, they can be classified as dissimilar by Definition 1. From this example, the simple Lp norm is not a suitable measurement of similarity when the shape of a sequence is the major consideration.

Definition 2 (Minimum distance). Given a threshold ε, two sequences A = {a1, a2, ..., an} and B = {b1, b2, ..., bn} of equal length n are said to be similar in shape if

$$L^{\min}_p(A,B) = \left( \sum_{i=1}^{n} |a_i - b_i - m|^p \right)^{1/p} \le \epsilon,
\qquad \text{where} \quad m = \frac{1}{n}\sum_{i=1}^{n} (a_i - b_i)$$

By Definition 2, two sequences are said to be similar in shape if their minimum distance is less than or equal to a threshold ε. This definition is a more flexible similarity model when we would like to retrieve sequences of similar shape from databases irrespective of their vertical positions.
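
A minimal sketch of Definition 2 (assuming NumPy; min_dist is our name, not from the paper) applied to the sequences of Fig. 1 shows that B, which is Q shifted by 10 units, has minimum distance zero from Q, whereas A does not:

import numpy as np

def min_dist(a, b, p=2.0):
    # Minimum distance of Definition 2: L_p distance after removing the mean offset m.
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    r = np.abs(d - d.mean())          # m = mean(a_i - b_i)
    return r.max() if np.isinf(p) else (r ** p).sum() ** (1.0 / p)

Q = [4, 9, 4, 9, 4]
A = [7, 5, 6, 7, 6]
B = [14, 19, 14, 19, 14]
print(min_dist(Q, A), min_dist(Q, B))   # about 6.07 vs. 0.0 in L_2: B matches Q in shape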

4. Proposed approach

The problem we focus on is the design of fast searching and retrieval of similar sequences in databases based on the minimum distance. To the best of our knowledge, this is the first work that examines indexing methods for minimum distance queries without vertical shifting for time series of arbitrary length in arbitrary Lp norm. We now introduce the segmented mean variation indexing (SMV-indexing) and show that it guarantees no false dismissals.

4.1. Dimensionality reduction

Our goal is to extract features that capture information about the shape of a sequence and that lead to a feature distance satisfying the lower-bound condition of the minimum distance. Suppose we have a set of sequences of length n. Our feature extraction method consists of two steps. First, we divide each sequence into m segments of equal length l. Next, we extract a simple feature from each segment. We propose to use the segmented mean variation as the feature of a segment in a sequence. F_{A_j} denotes the feature of the j-th segment of a sequence A. We define the feature vector of a sequence A as follows.

Definition 3 (Segmented mean variation feature). Given a sequence A = {a1, a2, ..., an} and the number of segments m > 0, define the feature vector $\overrightarrow{F_A}$ of a sequence A by

$$\overrightarrow{F_A} = (F_{A_1}, F_{A_2}, \ldots, F_{A_m})
= \frac{1}{l-1}\left( \sum_{i=1}^{l-1} |a_{i+1}-a_i|,\ \sum_{i=l}^{2(l-1)} |a_{i+1}-a_i|,\ \ldots,\ \sum_{i=(m-1)l+(2-m)}^{m(l-1)} |a_{i+1}-a_i| \right)$$

Fig. 2 illustrates the dimensionality reduction method used in this paper. A sequence of length 11 is projected into two dimensions: the sequence is divided into two segments and the mean variation of each segment is obtained.

The algorithm to extract the feature vector is straightforward. Since the segmented mean variation is invariant under vertical shifting, the segmented mean variation features can be indexed to support minimum distance queries. The proposed indexing scheme is motivated by the following observation.

Observation 1 (Autocorrelation of sequence). The segmented mean variation is invariant under vertical shifting.

Its correctness is easy to verify; the following equality shows it:

$$\mathrm{MeanVariation}(A) = \frac{1}{n-1}\sum_{i=1}^{n-1} |a_{i+1}-a_i|
= \frac{1}{n-1}\sum_{i=1}^{n-1} |(a_{i+1}-m_a)-(a_i-m_a)|,
\qquad \text{where} \quad m_a = \frac{1}{n}\sum_{i=1}^{n} a_i$$
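
The following sketch (assuming NumPy; the helper name smv_features is ours) extracts the segmented mean variation features of Definition 3 for the length-11 sequence of Fig. 2 and illustrates Observation 1 by shifting the sequence vertically. It assumes that the n − 1 successive differences split evenly into m blocks, i.e., m(l − 1) = n − 1, as in the index ranges of Definition 3:

import numpy as np

def smv_features(s, m):
    # Segmented mean variation: mean of |a_{i+1} - a_i| over each of m consecutive blocks.
    diffs = np.abs(np.diff(np.asarray(s, dtype=float)))   # the n - 1 successive variations
    blocks = np.split(diffs, m)                           # assumes (n - 1) is divisible by m
    return np.array([b.mean() for b in blocks])

A = [5, 4, 5, 6, 8, 7, 7, 6, 7, 6, 5]                     # the sequence of Fig. 2
print(smv_features(A, 2))                                 # [1.2  0.8]
print(smv_features(np.asarray(A) + 10.0, 2))              # identical: invariant under vertical shifting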

4.2. Lower bounding of the minimum distance

In order to guarantee no false dismissals, we must show that the distance between feature vectors is a lower bound of the minimum distance between the original sequences.

Fig. 2. Dimensionality reduction technique used in this paper: the sequence A = (5, 4, 5, 6, 8, 7, 7, 6, 7, 6, 5) is divided into two segments.

It is not hard to show that the feature distance is a lower bound of the minimum distance:

$$L^{\mathrm{simp}}_p(\overrightarrow{F_A}) \le L^{\min}_p(A)$$

In practice, however, the segmented mean variation feature $\overrightarrow{F_A}$ is a poor approximation of a sequence A, since it is essentially a down-sampling of the sequence. Much of the information is lost and, consequently, too many false alarms occur. It is necessary to find a way to compensate for the loss of information so that we can reduce the number of false alarms. More specifically, we look for a factor a_p > 1 such that

$$a_p \cdot L^{\mathrm{simp}}_p(\overrightarrow{F_A}) \le L^{\min}_p(A)$$

Since F(·) = |·|^p is a convex function on R for 1 ≤ p ≤ ∞, we can use the properties of convex functions to find a_p. There is a well-known mathematical result on convex functions; we use the following theorem from Yi and Faloutsos (2000) and Protter and Morrey (1977).

Theorem 1. Suppose that x_1, x_2, ..., x_n ∈ R and λ_1, λ_2, ..., λ_n ∈ R such that λ_i ≥ 0 and Σ_{i=1}^{n} λ_i = 1. If F(·) is a convex function on R, then

$$F(\lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_n x_n) \le \lambda_1 F(x_1) + \lambda_2 F(x_2) + \cdots + \lambda_n F(x_n)$$

where R is the set of real numbers.

Proof. The proof is given in Protter and Morrey (1977). □

Using the result of Theorem 1 with λ_i = 1/n, we obtain the following corollary.

Corollary 1. For any sequence A = (a_1, a_2, ..., a_n) and 1 ≤ p ≤ ∞, the following holds:

$$(n-1)\,|\mathrm{MeanVariation}(A)|^p \;\le\; \sum_{i=1}^{n-1} |a_{i+1}-a_i|^p
\;\le\; 2^{p-1}\left( \sum_{i=1}^{n-1} |a_i - m_a|^p + \sum_{i=2}^{n} |a_i - m_a|^p \right)
\;\le\; 2^{p}\sum_{i=1}^{n} |a_i - m_a|^p$$

where $m_a = \frac{1}{n}\sum_{i=1}^{n} a_i$, or, equivalently, for each segment S_j of a sequence A, 1 ≤ j ≤ m,

$$(l-1)\,|\mathrm{MeanVariation}(S_j)|^p \;\le\; \sum_{i=(j-1)l+(2-j)}^{j(l-1)} |a_{i+1}-a_i|^p
\;\le\; 2^{p-1}\left( \sum_{i=(j-1)l+(2-j)}^{j(l-1)} |a_i - m_a|^p + \sum_{i=(j-1)l+(3-j)}^{j(l-1)+1} |a_i - m_a|^p \right)$$

Now we present our main theorem.

Theorem 2. For any sequence A = (a_1, a_2, ..., a_n) and 1 ≤ p ≤ ∞, the following holds:

$$\sqrt[p]{l-1}\,\cdot\, L^{\mathrm{simp}}_p(\overrightarrow{F_A}) \;\le\; 2\left( \sum_{i=1}^{n} |a_i - m_a|^p \right)^{1/p} = 2\, L^{\min}_p(A)$$

where $m_a = \frac{1}{n}\sum_{i=1}^{n} a_i$.


Proof. Based on the definitions of L_p and $\overrightarrow{F_A}$,

$$(l-1)\, L^{\mathrm{simp}}_p(\overrightarrow{F_A})^p = (l-1)\sum_{j=1}^{m} |\mathrm{MeanVariation}(S_j)|^p$$

By Corollary 1, we obtain the following inequality:

$$\le 2^{p-1}\left\{ \sum_{i=1}^{l-1}|a_i-m_a|^p + \sum_{i=2}^{l}|a_i-m_a|^p \right\}
+ 2^{p-1}\left\{ \sum_{i=l}^{2(l-1)}|a_i-m_a|^p + \sum_{i=l+1}^{2(l-1)+1}|a_i-m_a|^p \right\}
+ \cdots
+ 2^{p-1}\left\{ \sum_{i=(m-1)l+(2-m)}^{m(l-1)}|a_i-m_a|^p + \sum_{i=(m-1)l+(3-m)}^{m(l-1)+1}|a_i-m_a|^p \right\}
\;\le\; 2^{p}\sum_{i=1}^{n} |a_i-m_a|^p = 2^{p}\, L^{\min}_p(A)^p$$

where $m_a = \frac{1}{n}\sum_{i=1}^{n} a_i$. Taking the p-th root of both sides completes the proof. □

Owing to Theorem 2, we can efficiently handle minimum distance queries with the segmented mean variation features. Suppose we compare two sequences A and B. By Theorem 2,

$$L^{\min}_p(A,B) \le \epsilon \quad \text{implies} \quad L^{\mathrm{simp}}_p(\overrightarrow{F_A}, \overrightarrow{F_B}) \le \frac{2\epsilon}{\sqrt[p]{l-1}}$$

Consequently, minimum distance queries in any Lp norm can be correctly converted to simple Lp-based distance queries in the feature space without false dismissals, and we can reduce the search range by a factor of $2/\sqrt[p]{l-1}$. Note that if a_p < 1, a_p cannot be used to reduce the search space because it produces more false alarms; in that case, we use the original feature distance.

5. Query processing

In this section, we present the overall process of our time series indexing scheme. Before a query is performed, we do some preprocessing to extract feature vectors from the sequences and then build an index structure. After the index is built, the similarity search can be performed to select candidate sequences from the database.

5.1. Preprocessing

Step 1. Similarity model selection: According to their applications, users may choose the distance model, from the minimum distance in L1 to that in L∞, as their similarity model. The same index structure can be reused to process minimum distance queries regardless of the chosen distance model.

Step 2. Index construction: Each sequence is divided into m segments, and the feature vector is extracted from every sequence in the database using the method described in Section 4.1. Then, we build a multidimensional index structure such as an R-tree on the extracted feature vectors (a brief sketch follows).
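
A minimal construction sketch under stated assumptions: NumPy and SciPy are assumed, a k-d tree from scipy.spatial stands in for the R-tree/R*-tree used in the paper, and the synthetic random-walk data merely illustrates the shape of the pipeline:

import numpy as np
from scipy.spatial import cKDTree                      # k-d tree as a stand-in for an R*-tree

def smv_features(s, m):
    # Segmented mean variation features (Definition 3); assumes (n - 1) divisible by m.
    d = np.abs(np.diff(np.asarray(s, dtype=float)))
    return np.array([b.mean() for b in np.split(d, m)])

m = 8                                                  # number of segments (feature dimensionality)
rng = np.random.default_rng(1)
database = [rng.normal(size=129).cumsum() for _ in range(2000)]   # synthetic stand-in sequences
features = np.vstack([smv_features(s, m) for s in database])
index = cKDTree(features)                              # one index serves all L_p minimum distance queries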

5.2. Index searching

After the index structure has been built, we can perform the similarity search against a given query sequence. The searching algorithm consists of two main parts: the first is candidate selection, and the other is post-processing to remove false alarms. Some non-qualifying sequences may be included in the result of candidate selection, so the actual minimum distances between the query sequence and the candidate sequences are computed, and only those within the error bound are reported as the query result. The implementation of the segmented mean variation indexing is described in Algorithm 1 (an executable sketch follows the algorithm).

Algorithm 1: SMV-Indexing
Input: query sequence Q, error bound ε
Output: data sequences within error bound ε
begin
  Result ← NULL;
  Candidate ← NULL;
  // project the query sequence Q into the index space
  FQ ← FeatureExtraction(Q);
  // candidate selection using a spatial access method
  Candidate ← IndexSearching(FQ, SAM, 2·ε / (l − 1)^(1/p));
  // post-processing to remove false alarms
  for all C_i ∈ Candidate do
    if ComputeMinimumDistance(Q, C_i) ≤ ε then
      Result ← {C_i} ∪ Result;
    else
      reject C_i;
    end
  end
  return Result;
end
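
Continuing the construction sketch in Section 5.1, the following is a minimal version of Algorithm 1 (our own sketch, not the authors' implementation, which used C++ and an R*-tree): query_ball_point performs the candidate selection in the chosen Lp norm, and the exact minimum distance removes the false alarms:

def min_dist(q, s, p=2.0):
    # Minimum distance of Definition 2.
    d = np.asarray(q, dtype=float) - np.asarray(s, dtype=float)
    r = np.abs(d - d.mean())
    return r.max() if np.isinf(p) else (r ** p).sum() ** (1.0 / p)

def smv_query(query, eps, p=2.0):
    # SMV-indexing: candidate selection in the feature space, then exact post-filtering.
    n = len(query)
    l = (n - 1) // m + 1                                # segment length used at build time
    fq = smv_features(query, m)                         # project the query into the index space
    radius = 2.0 * eps / (l - 1) ** (1.0 / p)           # enlarged search radius from Theorem 2
    candidates = index.query_ball_point(fq, r=radius, p=p)
    return [i for i in candidates if min_dist(query, database[i], p) <= eps]

result = smv_query(database[0], eps=5.0, p=2.0)         # the query sequence itself is always returned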

6. Performance evaluation

In this section, we present the results of experiments performed to analyze the performance of the SMV-indexing. To verify the effectiveness of the proposed method, we compared the SMV-indexing with sequential scanning with respect to the search space ratio and the average response time for processing minimum distance queries. The experimental settings are described first, and the results of the experiments are given next.

Fig. 3. Search space ratio by varying the error bound (L1, L2, L3, and L∞ minimum distances).

Fig. 4. Search space ratio by varying the feature dimensionality (L1, L2, L3, and L∞ minimum distances).

6.1. Experimental setup

We implemented both the SMV-indexing and the sequential scanning in C++ on a Linux machine (RedHat 7.1, kernel version 2.4.2) with dual Pentium III 500 MHz CPUs, 512 MB of memory, and a 40 GB hard disk. The size of a disk page is set to 4 KB. For the spatial access method, we used Katayama's R*-tree source code (available at http://research.nii.ac.jp/~katayama/homepage/research/srtree/English.html).

The real sequence data were obtained from the Seoul Stock Market, Korea (http://www.kse.or.kr/kor/stat/stat_data.htm). The stock data are based on daily closing prices. We collected 2000 stock sequences of average length 128 and ran 25 random queries over this dataset to find similar sequences based on the minimum distance. The query sequences were randomly selected from the database.

6.2. Experimental results

First, we compared the SMV-indexing with sequential scanning in terms of the number of actual distance computations required. We evaluated the search space ratio to test the filtering effect, that is, how well irrelevant sequences are removed during index searching. The search space ratio is defined as follows:

search space ratio = (number of candidate sequences) / (number of sequences in the database)

Fig. 3 shows the search space ratio with respect to the error bound for the L1, L2, L3, and L∞ minimum distances. The dimensionality of the feature vectors for the SMV-indexing is set to 8 in this experiment. The results show that the SMV-indexing provides a significant gain in reducing the search space.

Fig. 4 shows the search space ratio with respect to the feature dimensionality for the L1, L2, L3, and L∞ minimum distances. We set the error bound to 5000 in this experiment. As the dimensionality of the feature vectors increases, the filtering effect increases.

A good filtering effect does not always result in good performance, since the spatial access method can be affected by the dimensionality of the feature vectors. We compared the SMV-indexing with 8 dimensions and with 16 dimensions against sequential scanning with respect to the search space ratio and the average execution time, varying the error bound in the L2 norm. Figs. 5 and 6 show that the SMV-indexing with 16 dimensions is more efficient than the SMV-indexing with 8 dimensions in reducing the search space; however, the overall performance of the 16-dimensional SMV-indexing is not better than that of the 8-dimensional SMV-indexing. The SMV-indexing is more efficient than sequential scanning, but, as shown in Fig. 6, the average execution time increases as the given error bound increases, and eventually the performance of the SMV-indexing becomes worse than that of sequential scanning. This is because the number of retrieved sequences for a larger error bound grows more rapidly than the number of relevant sequences (the dimensionality curse of the R*-tree). Consequently, the post-processing time to remove false alarms increases with a larger error bound.

Fig. 5. Search space ratio: SMV-indexing vs. sequential scanning in the L2 minimum distance.

Fig. 6. Execution time: SMV-indexing vs. sequential scanning in the L2 minimum distance.

7. Conclusion

In this paper, we considered the problem of fast similarity search in large time series databases when the similarity measurement is the minimum distance in arbitrary Lp. Previous work on sequence searching using the simple Lp distance as a similarity measurement suffers from some drawbacks. The simple Lp distance is sensitive to the absolute offsets of the sequences, so two sequences that have similar shapes but different vertical positions may be classified as dissimilar. Therefore, the simple Lp distance between two sequences is not a good measurement of similarity in terms of their shapes.

The minimum distance is a more suitable similarity measurement than the simple Lp norm when we would like to retrieve sequences of similar shape from databases irrespective of their vertical positions. To process minimum distance queries, most of the previous work has a preprocessing step for vertical shifting that normalizes each sequence by its mean. However, vertical shifting of a sequence adds overhead to index building.

In this paper, we have proposed a novel time series indexing scheme that supports minimum distance queries without vertical shifting in arbitrary Lp norm. Our indexing scheme is motivated by the autocorrelation of a sequence, that is, the segmented mean variation of a sequence is invariant under vertical shifting. Using this property, our indexing scheme can handle minimum distance queries without vertical shifting. In addition, the segmented mean variation feature can be used to process minimum distance queries in any arbitrary Lp norm. Our approach can handle subsequence matching based on the minimum distance without preprocessing for vertical shifting by directly adopting the ST-index method (Faloutsos et al., 1994).

The major contributions of this work can be summarized as follows.

• We introduced an efficient indexing scheme that processes minimum distance queries without vertical shifting, guaranteeing no false dismissals.
• We showed that the same index structure can be used to process minimum distance queries in arbitrary Lp norm.

We have performed experiments on real stock data and examined the pruning power and performance of our proposed scheme compared with sequential scanning. The experiments show that our indexing scheme is more efficient than sequential scanning.

References

Agrawal, R., Imielinski, T., Swami, A., 1993a. Database mining: A performance perspective. IEEE TKDE 5 (6), 914–925 (special issue on learning and discovery in knowledge-based databases).

Agrawal, R., Faloutsos, C., Swami, A.N., 1993b. Efficient similarity search in sequence databases. In: FODO, pp. 69–84.

Agrawal, R., Lin, K.-I., Sawhney, H.S., Shim, K., 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In: VLDB, pp. 490–501.

Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B., 1990. The R*-tree: An efficient and robust access method for points and rectangles. In: SIGMOD Conference, pp. 322–331.

Bollobas, B., Das, G., Gunopulos, D., Mannila, H., 1997. Time-series similarity problems and well-separated geometric sets. In: Symposium on Computational Geometry, pp. 454–456.

Chan, K.-P., Fu, A.W.-C., 1999. Efficient time series matching by wavelets. In: ICDE, pp. 126–133.

Chu, K.K.W., Wong, M.H., 1999. Fast time-series searching with scaling and shifting. In: PODS, pp. 237–248.

Chu, K.K.W., Lam, S.K., Wong, M.H., 1998. An efficient hash-based algorithm for sequence data searching. The Computer Journal 41 (6), 402–415.

Das, G., Gunopulos, D., Mannila, H., 1997. Finding similar time series. In: PKDD, pp. 88–100.

Faloutsos, C., Ranganathan, M., Manolopoulos, Y., 1994. Fast subsequence matching in time-series databases. In: SIGMOD Conference, pp. 419–429.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., 1996. Knowledge discovery and data mining: Towards a unifying framework. In: KDD, pp. 82–88.

Goldin, D.Q., Kanellakis, P.C., 1995. On similarity queries for time-series data: Constraint specification and implementation. In: CP, pp. 137–153.

Guttman, A., 1984. R-trees: A dynamic index structure for spatial searching. In: SIGMOD Conference, pp. 47–57.

Keogh, E.J., Chakrabarti, K., Mehrotra, S., Pazzani, M.J., 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In: SIGMOD Conference, pp. 151–162.

Keogh, E.J., Pazzani, M.J., 2000. A simple dimensionality reduction technique for fast similarity search in large time series databases. In: PAKDD, pp. 122–133.

Korn, F., Jagadish, H.V., Faloutsos, C., 1997. Efficiently supporting ad hoc queries in large datasets of time sequences. In: SIGMOD Conference, pp. 289–300.

Lam, S.K., Wong, M.H., 1998. A fast projection algorithm for sequence data searching. DKE 28 (3), 321–339.

Li, C.-S., Yu, P.S., Castelli, V., 1996. HierarchyScan: A hierarchical similarity search algorithm for databases of long sequences. In: ICDE, pp. 546–553.

Park, S., Chu, W.W., Yoon, J., Hsu, C., 2000. Efficient searches for similar subsequences of different lengths in sequence databases. In: ICDE, pp. 23–32.

Perng, C.-S., Wang, H., Zhang, S.R., Parker, D.S., 2000. Landmarks: A new model for similarity-based pattern querying in time series databases. In: ICDE, pp. 33–42.

Protter, M.H., Morrey, C.B., 1977. A First Course in Real Analysis. Springer-Verlag.

Rafiei, D., 1999. On similarity-based queries for time series data. In: ICDE, pp. 410–417.

Rafiei, D., Mendelzon, A.O., 1997. Similarity-based queries for time series data. In: SIGMOD Conference, pp. 13–25.

Yi, B.-K., Faloutsos, C., 2000. Fast time sequence indexing for arbitrary Lp norms. In: VLDB, pp. 385–394.

Yi, B.-K., Jagadish, H.V., Faloutsos, C., 1998. Efficient retrieval of similar time sequences under time warping. In: ICDE, pp. 201–208.