26
1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

Embed Size (px)

DESCRIPTION

3 Problem definition Time series S = D 1, D 2,..., D n, where D i is a set of features for time instant i. Partial pattern s = s 1... s p. Here, s i is defined over (2 L - {  }  {*}) where L is the underlying set of features and * refers to the “don’t care” character. – |s|: pattern length – L-length of s: number of s i which contains letters from L. – subpattern of a pattern s: is a pattern s’ = s’ 1... s’ p such that |s| = |s’| and s’ i  s i for every position i where s’ i  *. – E.g.: s = a*{a,c}de |s|=5, L-length is 4(also called 4-pattern) a*{a,c}** and **cde are all its subpatterns.

Citation preview

Page 1: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

1

Finding Periodic Partial Patterns in Time Series Database

Huiping CaoApr. 30, 2003

Page 2: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

2

Outline

Problem Definition Mining partial periodicity for some given

period(s)– single period– multi periods

Mining partial periodicity when no period length is given in advance

Conclusion & Future work

Page 3: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

3

Problem definition

Time series S = D1, D2, ..., Dn, where Di is a set of features for time instant i.

Partial pattern s = s1 ... sp . Here, si is defined over (2L-{}{*}) where L is the underlying set of features and * refers to the “don’t care” character.

– |s|: pattern length– L-length of s: number of si which contains letters from L.– subpattern of a pattern s: is a pattern s’ = s’1 ... s’p such that |s| = |s’| and s’i si

for every position i where s’i *.– E.g.: s = a*{a,c}de

|s|=5, L-length is 4(also called 4-pattern) a*{a,c}** and **cde are all its subpatterns.

Page 4: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

4

Problem definition

frequency_count(s) in sequence S=D1, D2, ..., Dn

– frequency_count(s) = |{i|0i<m, and string s is true in Di|s|+1, Di|s|+s, ..., Di|s|+|s|}|. confidence(s) = frequency_count(s)/m

– m: maximum number of periods of length |s| contained in the time series.(m|s| n<(m+1)|s|).

– E.g.: In a{b,c}baebaced, freq_count(a*b) =2, conf(a*b) =2/3 period segment: segment in form of Di|s|+1, Di|s|+s, ..., Di|s|+|s| where 0i<m.

– A patterns s = s1 ... sp is true in some period segment means: for each position i, either si is * or all the letters in si occur in the ith set of the features in the segment. Pattern “a*b” is true in segment “acb”, but not true in “bcb”

frequent partial periodic pattern s: – confidence(s) min_conf, which is a user specified threshold

Page 5: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

5

Problem definition

Input:– A time series S– Specified period(s)– m: indicating the ratio of the lengths of S and the

patterns must be at least m– min_conf

Goal:– Discover all the frequent patterns for one period or

some periods

Page 6: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

6

Mining partial periodicity for some given period(s)

For single period For multi periods Deviation of all partial patterns

– Max-subpattern tree– Deviation of frequent patterns from max-subpattern

tree

Page 7: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

7

Mining partial periodicity for single period

Notation:– F1: the set of frequent 1-patterns of period p. For example,

p=3, a**, *{b,c}*, **g are all in F1. Single-period Apriori

– Find frequent F1. Accumulate the frequency count for each 1-pattern in each whole

period segment; select those F1 whose frequency count min_conf*m

– Find all frequent i-patterns of period p(2ip) using the Apriori property. Terminate when the candidate i-pattern set is empty.

Step 1 scan source data once, and step 2 need scan source data up to p-1 times in the worst case.

Page 8: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

8

Mining partial periodicity for single period

The advantage of Apriori property in mining partial patterns is not as obvious as that in mining association rule.– In mining association rule, the number of frequent i-

itemsets shrinks quickly as i increase because of the sparsity of frequent i-itemsets.

– In mining periodic patterns, the number of frequent i-patterns does not shrink quickly as i increase because of strong correlation between frequencies of patterns and their subpatterns.

Page 9: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

9

Max-subpattern hit set method -single period

Candidate max-pattern, Cmax, is the maximal pattern generated from F1.

– E.g., F1={a****,*b***,**c**}, Cmax =abc**

A subpattern of Cmax is hit in a period segment Si of S if it is the maximal subpattern of Cmax.

– E.g., Cmax =abc**, Si = abdef, then ab*** is its hit subpattern. The complete set of partial periodic patterns can be

derived from the frequency counts of all the hit maximal subpatterns of Cmax.

Page 10: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

10

Max-subpattern hit set method-single period(algorithm2)

Scan S once to find frequent F1. Form the candidate Cmax.

Scan S once again. For each period segment, if it’s nonempty, add it to the max-subpattern tree(introduced later).

Derive frequent patterns from the max-subpattern tree(introduced later).

Only two scans on source data.

Page 11: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

11

Max-subpattern tree

Max-subpattern tree is used to facilitate the process of deriving the set of frequent patterns, which is the step 2 of the algorithm 2.

Rooted at Cmax

Each subpattern of Cmax with one non-* letter missing is a direct child node of the root.

A node w with one more non-* letters may have a set of children, each of which is a subpattern of w with one more non-* letter missing.

Each node has a “count” field.

Page 12: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

12

Max-subpattern tree10

*b2*d* *b1*d* *{b1,b2}*** a**d* ab2*** ab1***8 18 5 19 0 2

~a

~d

~a~a~b1 ~b1 ~b2~b1 ~b2

~d~d~b2

*{b1,b2}*d* ab2*d* ab1*d* a{b1,b2}***0 50 40 32

a{b1,b2}*d*~a ~b1 ~d~b2

One node is linked to only one parent, e.g., a**d* is linked to ab2*d*,but not linked to ab1*d*(missing link is marked by a green dash line.)

Page 13: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

13

Insertion in the max-subpattern tree

Insert a max-subpattern w found during the scan of S into the max-subpattern tree T.

Step1: Starting from the root of the tree, find the corresponding node by checking the missing non-* letter in order. – E.g., To max-pattern node *b1*d* , it has two letters,

a and b2 missing from the Cmax =a{b1,b2}*d*. The node can be found follong ~a link to *{b1,b2}*d*, then following ~b2 link to *b1*d*.

Page 14: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

14

Insertion in the max-subpattern tree

Step2: If the node W is found, increase its count by 1. Otherwise, create a new node w with count 1 and its missing ancestor nodes(only those on the path to w, with count 0), if any, and insert it/them into the corresponding place(s) of the tree.

– E.g., If the first max-subpattern node found is *b1*d*, we will create the node *b1*d* with count 1 and create two ancestor nodes with count 0: w1=a{b1,b2}*d*(root) and w2= *{b1,b2}*d* following ~a link of w1. The node *b1*d* is w2’s child, following the ~b2 link.

Page 15: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

15

Derivation of frequent patterns from max-subpattern tree

The set frequent F1 is derived in the first scan. The set of frequent k-pattern(k>1) is derived

from the max-subpattern tree as follows:– for i=2 to |F1|

derive candidate patterns with L-length i from frequent (L-1)-length patterns by “(i+1)-way join”

scan tree T to find frequency counts of these candidate patterns and eliminate the non-frequent once.(note: the frequency count of a node is the sum of count(itself) and count(all of its reachable ancestors)). IF the derived frequent i-pattern set is empty, return.

Page 16: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

16

Example:

Set of reachable ancestors of a node w in a max-subpattern tree T is the set of all the nodes in T, which are proper super-patterns of w.

Let min_conf*m=45. We check level-2 node, w = *b2*d*

Page 17: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

17

Example:

– reachable ancestor set(w) = {root, *{b1,b2}*d*,ab2*d*}– frequency_count(w) = 10+0+50+8 =68 >45– it’s frequent!

10

*b2*d* *b1*d* *{b1,b2}*** a**d* ab2*** ab1***8 18 5 19 0 2

~a

~d

~a~a~b1 ~b1 ~b2~b1 ~b2

~d~d~b2

*{b1,b2}*d* ab2*d* ab1*d* a{b1,b2}***0 50 40 32

a{b1,b2}*d*~a ~b1 ~d~b2

Page 18: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

18

Max-subpattern hit set method -multi period

Scan S once, for all periods pj. Find F1(pj) and form Cmax(pj).

Scan S again, for all periods pj do the same as step2 in the former algorithm.

Only two scans on source data.

Page 19: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

19

Experiment

Synthetic data– 100k, 550k

|F1|=12 p=50 HitSet method

outperforms Apriori method

Performance gain when L-length increase

Page 20: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

20

Mining partial periodicity when no period is given in advance

In the algorithms provided before, the period length is given in advance. – How to find frequent partial patterns when periods

are unknown? PPD(Partial Periodic Detection) algorithm

– Filter step: scan the data once to find those possible periods automatically(introduce later).

– Mining step:Apply Han’s algorithm to discover frequent partial patterns for the periods gotten in the first step.

Page 21: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

21

Filter step

Assume the size of a time series is N Step1: Create a binary vector of size N for

every letter in the alphabet.– 1 for every occurrence of the corresponding letter– 0 for every other letter

e.g.: abcdabebadfcacdcfcaa, N=20, binary vector(a) = 10001000100010000011 binary vector(b) = 01000101000000000000 ...

Page 22: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

22

Filter step

Step2: calculate the Circular Autocorrelation Function for every binary vector.

– Autocorrelation means self-correlation. I.e., discovering correlations among the elements of the same vector v for every possible period length(1,...,N).

– The function value is the dot products between v and v shifted circularly by a lag k. We denote v(k) as vector shifted by k.

Circular means that if one point will be out of N after shifting k, it will be moved to the beginning of the vector.

E.g., assume v =1001, then v shifted by 2 is 0110– For each k from 1 to N, compute autocorrelation function r(k)

r(k) = v.v(k) E.g., r(2) =(1001). = 0 for v=1001

0110

Page 23: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

23

– Example:abcdabebadfcacdcfcaa, N=20, binary vector(a) = 10001000100010000011 first value of the autocorrelation vector is the dot product of the binary vector with itself. It’s 6. The peak identified at position 5 implies that there is probably a period of length 4 and the

value of 3 at this position is an estimate of the frequency count of this period. Identifying all those peaks get a set of candidate periods. Given min-conf c, extract frequent periods. If the value for period p is greater than or equal to

cN/p, then p is frequent period.

Page 24: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

24

Exprement:

Test data: – Real data: Wal-Mart stores and power consumption data.– Synthetic data: Machine Learning Repository.

Different runs over different portions of the data sets showed that the execution time is linearly proportional to the size of the time series as well as the size of the alphabet.

Page 25: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

25

Conclusion & future work

Introduce algorithms used to mine frequent partial patterns in time series database– given periods– periods are unknown

How to solve the similar problem in data stream setting considering some constraint:– one-pass– memory limitation

Page 26: 1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

26

References:

J. Han, G. Dong, Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In ICDE99.

C.Berberidis, I. Vlahavas, W. G. Aref. etc. On the Discovery of Weak Periodicities in Large Time Series. In PKDD02.