1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003

1

Finding Periodic Partial Patterns in Time Series Database

Huiping CaoApr. 30, 2003

2

Outline

Problem Definition Mining partial periodicity for some given

period(s)– single period– multi periods

Mining partial periodicity when no period length is given in advance

Conclusion & Future work

3

Problem definition

Time series S = D1, D2, ..., Dn, where Di is a set of features for time instant i.

Partial pattern s = s1 ... sp . Here, si is defined over (2L-{}{*}) where L is the underlying set of features and * refers to the “don’t care” character.

– |s|: pattern length– L-length of s: number of si which contains letters from L.– subpattern of a pattern s: is a pattern s’ = s’1 ... s’p such that |s| = |s’| and s’i si

for every position i where s’i *.– E.g.: s = a*{a,c}de

|s|=5, L-length is 4(also called 4-pattern) a*{a,c}** and **cde are all its subpatterns.

4

Problem definition

frequency_count(s) in sequence S=D1, D2, ..., Dn

– frequency_count(s) = |{i|0i<m, and string s is true in Di|s|+1, Di|s|+s, ..., Di|s|+|s|}|. confidence(s) = frequency_count(s)/m

– m: maximum number of periods of length |s| contained in the time series.(m|s| n<(m+1)|s|).

– E.g.: In a{b,c}baebaced, freq_count(a*b) =2, conf(a*b) =2/3 period segment: segment in form of Di|s|+1, Di|s|+s, ..., Di|s|+|s| where 0i<m.

– A patterns s = s1 ... sp is true in some period segment means: for each position i, either si is * or all the letters in si occur in the ith set of the features in the segment. Pattern “a*b” is true in segment “acb”, but not true in “bcb”

frequent partial periodic pattern s: – confidence(s) min_conf, which is a user specified threshold

5

Problem definition

Input:– A time series S– Specified period(s)– m: indicating the ratio of the lengths of S and the

patterns must be at least m– min_conf

Goal:– Discover all the frequent patterns for one period or

some periods

6

Mining partial periodicity for some given period(s)

For single period For multi periods Deviation of all partial patterns

– Max-subpattern tree– Deviation of frequent patterns from max-subpattern

tree

7

Mining partial periodicity for single period

Notation:– F1: the set of frequent 1-patterns of period p. For example,

p=3, a**, *{b,c}*, **g are all in F1. Single-period Apriori

– Find frequent F1. Accumulate the frequency count for each 1-pattern in each whole

period segment; select those F1 whose frequency count min_conf*m

– Find all frequent i-patterns of period p(2ip) using the Apriori property. Terminate when the candidate i-pattern set is empty.

Step 1 scan source data once, and step 2 need scan source data up to p-1 times in the worst case.

8

Mining partial periodicity for single period

The advantage of Apriori property in mining partial patterns is not as obvious as that in mining association rule.– In mining association rule, the number of frequent i-

itemsets shrinks quickly as i increase because of the sparsity of frequent i-itemsets.

– In mining periodic patterns, the number of frequent i-patterns does not shrink quickly as i increase because of strong correlation between frequencies of patterns and their subpatterns.

9

Max-subpattern hit set method -single period

Candidate max-pattern, Cmax, is the maximal pattern generated from F1.

– E.g., F1={a****,*b***,**c**}, Cmax =abc**

A subpattern of Cmax is hit in a period segment Si of S if it is the maximal subpattern of Cmax.

– E.g., Cmax =abc**, Si = abdef, then ab*** is its hit subpattern. The complete set of partial periodic patterns can be

derived from the frequency counts of all the hit maximal subpatterns of Cmax.

10

Max-subpattern hit set method-single period(algorithm2)

Scan S once to find frequent F1. Form the candidate Cmax.

Scan S once again. For each period segment, if it’s nonempty, add it to the max-subpattern tree(introduced later).

Derive frequent patterns from the max-subpattern tree(introduced later).

Only two scans on source data.

11

Max-subpattern tree

Max-subpattern tree is used to facilitate the process of deriving the set of frequent patterns, which is the step 2 of the algorithm 2.

Rooted at Cmax

Each subpattern of Cmax with one non-* letter missing is a direct child node of the root.

A node w with one more non-* letters may have a set of children, each of which is a subpattern of w with one more non-* letter missing.

Each node has a “count” field.

12

Max-subpattern tree10

*b2*d* *b1*d* *{b1,b2}*** a**d* ab2*** ab1***8 18 5 19 0 2

~a

~d

~a~a~b1 ~b1 ~b2~b1 ~b2

~d~d~b2

*{b1,b2}*d* ab2*d* ab1*d* a{b1,b2}***0 50 40 32

a{b1,b2}*d*~a ~b1 ~d~b2

One node is linked to only one parent, e.g., a**d* is linked to ab2*d*,but not linked to ab1*d*(missing link is marked by a green dash line.)

13

Insertion in the max-subpattern tree

Insert a max-subpattern w found during the scan of S into the max-subpattern tree T.

Step1: Starting from the root of the tree, find the corresponding node by checking the missing non-* letter in order. – E.g., To max-pattern node *b1*d* , it has two letters,

a and b2 missing from the Cmax =a{b1,b2}*d*. The node can be found follong ~a link to *{b1,b2}*d*, then following ~b2 link to *b1*d*.

14

Insertion in the max-subpattern tree

Step2: If the node W is found, increase its count by 1. Otherwise, create a new node w with count 1 and its missing ancestor nodes(only those on the path to w, with count 0), if any, and insert it/them into the corresponding place(s) of the tree.

– E.g., If the first max-subpattern node found is *b1*d*, we will create the node *b1*d* with count 1 and create two ancestor nodes with count 0: w1=a{b1,b2}*d*(root) and w2= *{b1,b2}*d* following ~a link of w1. The node *b1*d* is w2’s child, following the ~b2 link.

15

Derivation of frequent patterns from max-subpattern tree

The set frequent F1 is derived in the first scan. The set of frequent k-pattern(k>1) is derived

from the max-subpattern tree as follows:– for i=2 to |F1|

derive candidate patterns with L-length i from frequent (L-1)-length patterns by “(i+1)-way join”

scan tree T to find frequency counts of these candidate patterns and eliminate the non-frequent once.(note: the frequency count of a node is the sum of count(itself) and count(all of its reachable ancestors)). IF the derived frequent i-pattern set is empty, return.

16

Example:

Set of reachable ancestors of a node w in a max-subpattern tree T is the set of all the nodes in T, which are proper super-patterns of w.

Let min_conf*m=45. We check level-2 node, w = *b2*d*

17

Example:

– reachable ancestor set(w) = {root, *{b1,b2}*d*,ab2*d*}– frequency_count(w) = 10+0+50+8 =68 >45– it’s frequent!

10

*b2*d* *b1*d* *{b1,b2}*** a**d* ab2*** ab1***8 18 5 19 0 2

~a

~d

~a~a~b1 ~b1 ~b2~b1 ~b2

~d~d~b2

*{b1,b2}*d* ab2*d* ab1*d* a{b1,b2}***0 50 40 32

a{b1,b2}*d*~a ~b1 ~d~b2

18

Max-subpattern hit set method -multi period

Scan S once, for all periods pj. Find F1(pj) and form Cmax(pj).

Scan S again, for all periods pj do the same as step2 in the former algorithm.

Only two scans on source data.

19

Experiment

Synthetic data– 100k, 550k

|F1|=12 p=50 HitSet method

outperforms Apriori method

Performance gain when L-length increase

20

Mining partial periodicity when no period is given in advance

In the algorithms provided before, the period length is given in advance. – How to find frequent partial patterns when periods

are unknown? PPD(Partial Periodic Detection) algorithm

– Filter step: scan the data once to find those possible periods automatically(introduce later).

– Mining step:Apply Han’s algorithm to discover frequent partial patterns for the periods gotten in the first step.

21

Filter step

Assume the size of a time series is N Step1: Create a binary vector of size N for

every letter in the alphabet.– 1 for every occurrence of the corresponding letter– 0 for every other letter

e.g.: abcdabebadfcacdcfcaa, N=20, binary vector(a) = 10001000100010000011 binary vector(b) = 01000101000000000000 ...

22

Filter step

Step2: calculate the Circular Autocorrelation Function for every binary vector.

– Autocorrelation means self-correlation. I.e., discovering correlations among the elements of the same vector v for every possible period length(1,...,N).

– The function value is the dot products between v and v shifted circularly by a lag k. We denote v(k) as vector shifted by k.

Circular means that if one point will be out of N after shifting k, it will be moved to the beginning of the vector.

E.g., assume v =1001, then v shifted by 2 is 0110– For each k from 1 to N, compute autocorrelation function r(k)

r(k) = v.v(k) E.g., r(2) =(1001). = 0 for v=1001

0110

23

– Example:abcdabebadfcacdcfcaa, N=20, binary vector(a) = 10001000100010000011 first value of the autocorrelation vector is the dot product of the binary vector with itself. It’s 6. The peak identified at position 5 implies that there is probably a period of length 4 and the

value of 3 at this position is an estimate of the frequency count of this period. Identifying all those peaks get a set of candidate periods. Given min-conf c, extract frequent periods. If the value for period p is greater than or equal to

cN/p, then p is frequent period.

24

Exprement:

Test data: – Real data: Wal-Mart stores and power consumption data.– Synthetic data: Machine Learning Repository.

Different runs over different portions of the data sets showed that the execution time is linearly proportional to the size of the time series as well as the size of the alphabet.

25

Conclusion & future work

Introduce algorithms used to mine frequent partial patterns in time series database– given periods– periods are unknown

How to solve the similar problem in data stream setting considering some constraint:– one-pass– memory limitation

26

References:

J. Han, G. Dong, Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. In ICDE99.

C.Berberidis, I. Vlahavas, W. G. Aref. etc. On the Discovery of Weak Periodicities in Large Time Series. In PKDD02.

Documents

1 Finding Periodic Partial Patterns in Time Series Database Huiping Cao Apr. 30, 2003