Discovering periodic frequent patterns in transactional databases

7/25/2019 Discovering periodic frequent patterns in transactional databases

http://slidepdf.com/reader/full/discovering-periodic-frequent-patterns-in-transactional-databases 1/12

Efficient Discovery of Periodic-FrequentPatterns using Periodic-Condence

Submitted for Blind Review

Abstract. Knowledge pertaining to rare items is often of great interestand high value. However, nding this knowledge in big data is chal-lenging due to sparse appearances of rare items. This problem is knownas the rare item problem . This paper makes an effort to address therare item problem in periodic-frequent pattern mining. This task is non-trivial, because the problem needs to be tackled not only with respect tofrequency dimension, but also with respect to time dimension. A novelmodel to discover periodic-frequent patterns involving both frequent andrare items has been proposed in this paper. A new interestingness mea-sure, periodic-condence , is proposed to address the rare item problemin time dimension. This measure determines the condence with whichevery item within a pattern is appearing together in time dimension.This model also employs all-condence to address the same problem infrequency dimension. A pattern-growth algorithm to extract all periodic-frequent patterns in a database has also been proposed. Experimentalresults show that the proposed model is efficient.

Keywords: Data mining, rare item problem, periodic patterns

1 Introduction

1.1 Background and related work

Frequency and time are two most important dimensions to determine the in-terestingness of a pattern in a given database. Periodic-frequent patterns (oritemsets) are an important class of regularities that exist in a database with re-spect to these two dimensions. Finding periodic-frequent patterns is a signicanttask with many real-world applications [4, 11]. A classic application is market-basket analysis. It analyzes how regularly the sets of items are being purchasedby the customers. An example of a periodic-frequent pattern is as follows:

{Bat, Ball } [support = 10% , periodicity = 1 hour ].

The above pattern says that 10% of customers have purchased the items ‘ Bat ’and ‘Ball ’ at least once in every hour.

The problem of nding periodic patterns has been widely studied in time se-

ries data [4, 13]. These studies consider time series data as a symbolic sequence,and therefore, do not take into account the actual temporal information of theitems within the data. Ozden et al. [9] have enhanced the transactional databasewith a time attribute that indicates the time when a transaction has appeared,and discussed a model to nd periodically appearing frequent patterns (or cyclicassociation rules). The main limitation of this model is that it suffers from anopen problem of methodology to fragment data with respect to time. Tanbeer et



2 Submitted for Blind Review

al. [11] have proposed a simplied periodic-frequent model, which does not re-quire data fragmentation. Briey, this model discovers all patterns in a database

that satisfy the user-specied minimum support (minSup ) and maximum peri-odicity (maxPer ) constraints. The minSup controls the minimum number of transactions in which a pattern has to appear, and the maxPer controls themaximum inter-arrival time of a pattern in the entire database.

1.2 Motivation

The constraints, minSup and maxPer, play a key role in the practical appli-cability of periodic-frequent patterns. They are used to prune the search spaceand limit the number of patterns generated. Since only a single minSup andmaxPer is used for the whole data, this model implicitly assumes that all itemsin the data have uniform frequency and same periodic behavior. However, this isoften not the case in many real-world data sets. In many data sets, some items

appear very frequently in the data, while others rarely appear. Moreover, rareitems typically have long periods (or inter-arrival times) as compared againstthe frequent items. Consider the following applications.

– In an accident data set, reports related to the completely destroyed vehiclesdo not appear as frequently as the reports related to the minor damages to avehicle. As a result, former type of accidents generally has less frequency andlonger inter-arrival times as compared against the latter type of accidents.Figure 1(a) and (b) show the frequency and median of inter-arrival timesof three different damage types reported in the Federal Aviation Adminis-tration (FAA) data set [1], respectively. It can be observed from these twogures that rarely appearing damage types have longer inter-arrival time ascompared against the relatively frequent damage types.

– In a computer network, high severity events (e.g. cascading failure) do notappear as frequently as the regular routine events (e.g. data backup). Asa result, former events have relatively low frequency and long inter-arrivaltimes as compared against the latter events.

Henceforth, nding periodic-frequent patterns with a single minSup and maxPercauses the rare item problem [12]. Briey, this problem involves either missing theperiodic-frequent patterns involving rare items or generating too many patterns(due to combinatorial explosion of frequent items).

Uday and Reddy [7] have tried to address the rare item problem by ndingperiodic-frequent patterns using item specic minSup and maxPer values. Thisapproach requires too many input parameters, and moreover, suffers from anopen problem of determining the item specic minSup and maxPer values.

In the literature, researchers have discussed several interestingness measures(e.g. all-condence [8] and χ 2 [3,10]) to address the rare item problem in fre-quent pattern mining. We cannot simply use these measures to address the sameproblem in periodic-frequent pattern mining. The reason is that these measurestake into account only the frequency dimension and do not consider the timedimension of a pattern. Henceforth, addressing this problem in periodic-frequentpattern mining is a non-trivial and challenging task.



Efficient Discovery of Periodic-Frequent Patterns using Periodic-Condence 3

Fig. 1. Statistics on different damage types inFAA data set.

0 10 20 30 40 50 60 70 80 90

100

0 10000 20000 30000 40000 50000 60000

70000

Minor Substantial Destroyed

Type of accident(a) Frequency (b) Median of inter-arrival times

Type of accident

M e

d i a n o

f i n t e r - a r r

i v a

l

t i m e s

( d a y s

)

Minor Substantial Destroyed N u m

b e r o

f t r a n s a c t

i o n s

Table 1. Running example: a trans-actional database

ts Items1 a, b2 a , b, d3 c , d , g , h4 c , e , f 5 a, b

ts Items7 a , b, c , e8 c, d10 a , b, d, e , f 11 c , d, g12 a , e , f

1.3 Contributions of this paperThis paper proposes a model to address the rare item problem in periodic-frequent pattern mining. A new interestingness measure, periodic-condence , hasbeen introduced to address the above problem in time dimension. This measuredetermines the interestingness of a pattern by assessing how regularly all of its

items are appearing together in a database. The proposed model also employsall-condence to address the same problem in frequency dimension. The usageof periodic-condence and all-condence facilitate us to achieve the objective of generating periodic-frequent patterns containing both frequent and rare itemsyet without causing frequent items to generate too many uninteresting patterns.A pattern-growth algorithm, Extended Periodic-Frequent pattern-growth (EPF-growth), has been described to nd all periodic-frequent patterns. Experimentalresults demonstrate that the proposed model can discover useful informationand EPF-growth is efficient.

1.4 Paper organizationSection 2 describes the basic model of periodic-frequent patterns. Section 3 dis-cusses the rare item problem, and introduces the proposed model of periodic-frequent patterns. Section 4 describes the EPF-growth algorithm. Section 5 re-ports on the experimental results. Finally, section 6 concludes the paper withfuture research directions.

2 Basic Model of Periodic-Frequent PatternsLet I be the set of items, and X ⊆ I be a pattern (or an itemset). A patterncontaining β number of items is called a β -pattern . A transaction , tk = ( ts,Y )is a tuple, where ts ∈ R represents the timestamp at which the pattern Y hasoccurred. A transactional database T DB over I is a set of transactions,T DB = {t 1 , · · · , t m }, m = |T DB |, where |T DB | can be dened as the num-ber of transactions in TDB. For a transaction tk = ( ts, Y ), k ≥ 1, such thatX ⊆ Y , it is said that X occurs in tk and such timestamp is denoted as ts X .Let T S X = {ts X

j , · · · , ts X

k }, j, k ∈ [1, m ] and j ≤ k, be an ordered set of

timestamps where X has occurred in T DB . In this paper, we call this listof timestamps of X as ts-list of X . The number of transactions containingX in T DB is dened as the support of X and denoted as sup (X ). That is,sup (X ) = |T S X |. Let ts X

q and ts Xr , j ≤ q < r ≤ k, be the two consecutive

timestamps in T S X . The time difference (or an inter-arrival time) between ts Xr

and ts Xq is dened as a period of X , say pX

a . That is, pXa = ts X

r − ts Xq . Let




P X = ( pX1 , pX

2 , · · · , pXr ) be the set of all periods for pattern X . The period-

icity of X , denoted as per (X ) = max ( pX1 , pX

2 , · · · , pXr ). The pattern X is a

frequent pattern if sup (X ) ≥ minSup , where minSup refers to the user-specied minimum support constraint. The frequent pattern X is said to beperiodic-frequent if per (X ) ≤ maxPer , where maxPer refers to the user-specied maximum periodicity constraint. The problem denition of periodic-frequent pattern mining involves discovering all patterns in T DB that satisfythe user-specied minSup and maxPer constraints.

Example 1. Table 1 shows the database with the set of items I = {a,b,c,d,e,f,g,h}. The set of items ‘a ’ and ‘b’, i.e., ‘ab’ is a pattern. This pattern contains onlytwo items. Therefore, this is a 2-pattern. In the rst transaction, t1 = {1 : ab}, 1denotes the timestamp at which the pattern ‘ ab’ has appeared in the database.In the entire database, the pattern ‘ ab’ appears at the timestamps of 1 , 2, 5, 7and 10. Therefore, T S ab = {1, 2, 5, 7, 10}. The support of ‘ ab’, i.e., sup (ab) =|T S ab | = |1, 2, 5, 7, 10| = 5. If the user-specied minSup = 5, then ab is afrequent pattern because sup (ab) ≥ minSup . The periods for this pattern are: pab

1 = 1 (= 1 − ts ini ), pab2 = 1 (= 2 − 1), pab

3 = 3 (= 5 − 2), pab4 = 2 (= 7 − 5),

pab5 = 3 (10 − 7) and pab

6 = 2 (= ts f in − 10), where ts ini = 0 represents thetimestamp of initial transaction and ts f in = 12 represents the timestamp of naltransaction in the database. Therefore, P ab = (1 , 1, 3, 2, 3, 2). The periodicityof ‘ab ,’ i.e., per (ab) = max (1, 1, 3, 2, 3, 2)=3. If the user-dened maxPer = 3,then the frequent pattern ‘ ab’ is a periodic-frequent pattern because per (ab) ≤maxPer .

3 Proposed ModelIn this section, we rst describe the rare item problem in the basic model. Next,we introduce an enhanced model that address this problem effectively.

3.1 The rare item problemFinding periodic-frequent patterns with a single minSup and maxPer leads tothe following problems:

– If the minSup is set too high and/or the maxPer is set too short, we willmiss the periodic-frequent patterns involving rare items.

– In order to nd the periodic-frequent patterns involving both frequent andrare items, we have to set a low minSup and a long maxPer . However,this may result in combinatorial explosion, producing too many patterns,because frequent items can combine with one another in all possible waysand many of them are meaningless.

This dilemma is known as the rare item problem .Example 2. In Table 1, the items e and f are appearing less frequently (or rarely)than the items a,b,c and d. If we set a high minSup and a short maxPer ,say minSup = 5 and maxPer = 3, we will miss the periodic-frequent patterncontaining these two rare items. In order to discover the periodic-frequent patterncontaining e and f , we have to set a low minSup and a long maxPer , sayminSup = 2 and maxPer = 6. All periodic-frequent patterns discovered at




these threshold values are shown in the column titled I in Table 2. It can beobserved from this table that setting a low minSup and a long maxPer has not

only resulted in nding ‘ ef ’ as a periodic-frequent pattern, but also resulted ingenerating the uninteresting patterns ‘ ce’ and ‘cd’ as periodic-frequent patterns.The pattern ‘ ce’ is uninteresting, because the rare item e is randomly appearingwith a frequent item c in very few transactions. The pattern ‘ cd’ is uninteresting,because it contains the frequent items c and d appearing together at very longinter-arrival times (or periodicity ). This pattern may be considered interesting if its periodicity is same as that of frequently occurring pattern ‘ ab’ as in Example1.Table 2. Periodic-frequent patterns discovered from Table 1. The termsPat, s, allConf, p and perConf refer to pattern , support , all -confidence , periodicityand periodic -confidence , respectively. The columns titled I , II and I II represent theperiodic-frequent patterns generated using basic model, extending all-condence to thebasic model and the proposed model, respectively.

Pat s allConf p perConf I II IIIa 6 1 3 1

b 5 1 3 1

c 5 1 3 1

d 5 1 5 1

e 4 1 4 1

Pat s allConf p perConf I II IIIf 3 1 6 1

ab 5 0.833 3 1

ef 3 0.75 6 1.5

ce 2 0.4 5 1.67 × ×cd 3 0.6 5 1.67 ×

3.2 Extended model of periodic-frequent patternsTo address the rare item problem, the periodic-frequent model should be capa-ble of determining the interestingness of a pattern by taking into account theoccurrence of all of its items. We consider a pattern as interesting, if the supportof the pattern is close to the support of its individual items and the periodicityof the pattern is close to the periodicity of its individual items. So, we needtwo measures to select interesting patterns based on support and periodicity.We employ all-condence measure [8] to select the interesting patterns based onsupport, as the all-condence value is the ratio of the support of the pattern tothe support of the maximum frequent item. The all-condence measure satisesboth the null-invariant and anti-monotonic properties. The null-invariantproperty discloses genuine correlation relationships without being inuenced bythe object co-absence in a database [6]. The anti-monotonic property facilitatesin the reduction of search space as all non-empty subsets of a frequent patternmust also be frequent patterns. As there exists no measure in the literature thatdetermines the interestingness of a pattern with respect to the periodicities of all of its items, we propose a new measure, periodic-condence , to address the

rare item problem in time dimension.We give the denition of all-condence measure, and redene the notion of

frequent pattern using this measure.

Denition 1. ( All-condence of X ) The all-condence of X , denoted as allConf (X ), is the ratio of support of X to the support of maximum frequent item in X . That is, allConf (X ) = sup (X )

max ( sup ( i j ) |∀ i j ∈X ) .




Example 3. The all-condence of ab, i.e., allConf (ab) = sup (ab )max ( sup (a ) ,sup (b)) =

5max (6 ,5) = 5

6 = 0 .83.

Denition 2. ( Frequent pattern X ) The pattern X is a frequent pattern if sup (X ) ≥ minSup and allConf (X ) ≥ minAllConf , where minSup and minAll-Conf represents user-specied minimum support and minimum all-condence.

Example 4. If the user-specied minSup = 2 and minAllConf = 0 .6, then ab isa frequent pattern because sup (ab) ≥ minSup and allConf (ab) ≥ minAllConf .

The all-condence determines the interestingness of a pattern only in fre-quency dimension. As a result, extending only this measure to the basic modelof periodic-frequent patterns does not address the rare item problem effectively.Example 5. The column titled II in Table 2 shows the periodic-frequent pat-terns discovered when all-condence is used along with support and periodicitymeasures. The minSup , minAllConf and maxPer values used to nd thesepatterns are 2, 0.6 and 6, respectively. It can be observed from the discoveredperiodic-frequent patterns that though all-condence is able to prune the unin-teresting pattern ‘ ce,’ it has failed to prune the uninteresting pattern ‘ cd’ fromthe list of periodic-frequent patterns discovered by the basic model.

We now introduce a new measure periodic-condence and dene the notionof periodic-frequent pattern.Denition 3. ( Periodic-condence of X ) The periodic-condence of X , de-noted as perConf (X ), is the ratio of periodicity of X to the minimal periodicity of an item ij ∈ X . That is, perConf (X ) = per (X )

min ( per ( i j ) |∀ i j ∈X ) .

For a pattern X , perConf (X ) ∈ [1, ∞ ). If perConf (X ) = 1, all items in X are occurring together, which means whenever ij has appeared, all other itemsin X (i.e., X − i j ) have also appeared in T DB .

Example 6. In Table 1, the periodicity of the patterns a, b and ab are 3, 3and 3, respectively. Therefore, the periodic-condence of ab, i.e., perConf (ab) =

per (ab )min ( per (a ) ,per (b)) = 3

min (3 ,3) = 1.

Denition 4. ( Periodic-frequent pattern X ) The frequent pattern X is said to be periodic-frequent if per (X ) ≤ maxPer and perConf (X ) ≤ maxPerConf .The terms, maxPer and maxPerConf, respectively represent the user-specied maximum periodicity and maximum periodic-condence.

Example 7. If the user-specied maxPer = 6 and maxPerConf = 1 .5, then thefrequent pattern ab is said to be a periodic-frequent pattern, because per (ab) ≤maxPer

and perConf

(ab

) ≤ maxPerConf

. The column titled III in Table 2shows the complete set of periodic-frequent patterns discovered from Table 1. Itcan be observed that the proposed model has not only discovered the periodic-frequent patterns containing rare items but also pruned the uninteresting pat-terns ‘cd’ and ‘ce.’ This clearly demonstrates that proposed model discoversperiodic-frequent patterns containing rare items without generating too manyuninteresting patterns involving the frequent items.




Denition 5. Problem Denition: Given the database ( T DB ) and the user-specied minimum support (minSup), minimum all-condence (minAllConf),

maximum periodicity (maxPer) and maximum periodic-condence (maxPerConf ),the problem of nding periodic-frequent patterns involves discovering all patterns that satisfy the minSup, minAllConf, maxPer and maxPerConf. Please note that the support and periodicity of a pattern can also be expressed in percentage of |T DB |.

4 Proposed AlgorithmTanbeer et al. [11] have proposed Periodic-Frequent pattern-growth (PF-growth)to discover periodic-frequent patterns using support and periodicity . We ex-tend this algorithm to discover the patterns using all-condence and periodic-condence measures. We call the derived algorithm as Extended Periodic-Frequentpattern-growth (EPF-growth). The proposed algorithm involves two steps: ( i)construction of Extended Periodic-Frequent pattern-tree (EPF-tree) and ( ii ) re-cursively mining EPF-tree to discover all periodic-frequent patterns. Before wedescribe these two steps, we explain the structure of EPF-tree.4.1 Structure of EPF-treeThe structure of EPF-tree consists of a prex-tree and a EPF-list. The EPF-list consists of three elds: item name ( i ), support (s ) and periodicity (p). Thestructure of prex-tree in EPF-tree is similar to that of the prex-tree in FP-tree[5]. However, to obtain both support and periodicity of the patterns, the nodesin EPF-tree explicitly maintain the occurrence information for each transactionby maintaining an occurrence timestamp list, called ts -list , only at the last nodeof every transaction. Complete details on prex-tree are available in [11].4.2 Construction of EPF-treeThe discovered periodic-frequent patterns satisfy the anti-monotonic property (see Property 1). Henceforth, periodic-frequent items (or 1-patterns) play a keyrole in efficient discovery of higher order periodic-frequent patterns. Periodic-frequent items are discovered by populating the EPF-list (lines 1 to 13 in Algo-rithm 1). Figure 2 (a), (b), (c), (d) and (e) show the steps involved in ndingperiodic-frequent items from EPF-list. The user-specied minSup , minAllConf ,maxPer and maxPerConf values are 2, 0.6, 6 and 1.5, respectively.Property 1. If X ⊂ Y , then T S X ⊇ T S Y . Therefore, sup (X ) ≥ sup (Y ), all -Conf (X ) ≥ allConf (Y ), per (X ) ≤ per (Y ) and perConf (X ) ≤ perConf (Y ).

After nding periodic-frequent items, prex-tree is constructed by performinganother scan on the database (lines 14 to 16 in Algorithm 1). The constructionof prex-tree in EPF-tree is similar to the construction of prex-tree in FP-

tree [5]. However, it has to be noted that leaf nodes in EPF-tree maintain thetimestamps of the transactions. Figure 3 (a), (b) and (c) show the construction of EPF-tree after scanning rst, second and every transaction in the transactionaldatabase, respectively. In EPF-tree, an item header table is built so that eachitem points to its occurrences in the tree via a chain of node-links, to facilitatetree traversal. For simplicity, we do not show these node-links in trees; however,they are maintained as in FP-tree.




Algorithm 1 Construction of EPF-tree( T DB : Transactional database,minSup : minimum support, minAllConf : minimum all-condence, maxPer :

maximum periodicity , maxPerConf : maximum periodic-condence)1: Let ts l be a temporary array that records the timestamp of the last appearance

of each item in the T DB . Let t = {ts cur , X } denote the current transaction withts cur and X representing the timestamp of the current transaction and pattern,respectively.

2: for each transaction t ∈ T DB do3: if an item i occurs for the rst time then4: Insert i into the EPF-list with sup i = 1, per i = ts cur and ts i

l = 1.5: else6: sup i = sup i + 1.7: if (ts cur − ts i

l ) > per i then8: per i = ts cur − ts i

l .9: for each item i in EPF-list do

10: if (|T DB | − tsi

l ) > peri

then11: per i = |T DB | − ts il .

12: Remove items from the EPF-list that do not satisfy minSup and maxPer .13: Sort the remaining items in EPF-list in descending order of their support . Let this

sorted list of items be E P F .14: Create a root node in EPF-tree, T , and label it “ null .”15: for each transaction tr ∈ T DB do16: Sort the items in t in EP F order. Let this list of sorted periodic-frequent

items in t be [ p|P ], where p is the rst item and P is the remaining list. Callinsert tree ([ p|P ], ts cur , T ), which is same as in [5].

p

11

11 1

1

d 1 2 2

a

b22

11 2

2

d 5 5 11

ab

65

33 10

12

h 1 3 3

cg

52

38 11

11

e 4

3

4

6 12

12

d 5 5

ab

65

33

h 1 9

cg

52

38

e

f

4

3

4

6

d 5 5

ab

65

33

c 5 3e

f 43

46

(a) (b) (c) (d) (e)

ab

f

ts l ts l ts lsi psi psi psipsi

Fig. 2. Construction of EPF-list for Table 1. (a) After scanning the rst transaction(b) After scanning the second transaction (c) After scanning the last transaction (d)Updated EPF-list (e) Final EPF-list with sorted list of items with respect to support .

4.3 Mining EPF-tree

Algorithm 2 describes the procedure for mining periodic-frequent patterns fromEPF-tree. The EPF-tree is mined by calling EPF-growth as (EPF-tree, null ).This algorithm resembles FP-growth. However, the key difference is that oncethe pattern-growth is achieved for a suffix 1-pattern (or item), it is completelypruned from the EPF-tree by pushing its ts-list to respective parent nodes.

Table 3 summarizes the working of this algorithm. First, we consider item‘f ,’ which is the bottom-most item in the EPF-list, as a suffix pattern. This item




appears in three branches of the EPF-tree (see Figure 3(c)). The paths formedby these branches are {cef : 4}, {abdef : 10} and {aef : 12} (format of these

branches is {nodes : time -stamps }). Therefore, considering f as a suffix item,its corresponding three prex paths are {ce : 4}, {abde : 10} and {ae : 12},which form its conditional pattern base (see Figure 4(a)). Its conditional EPF-tree contains only a single path, e : 4, 10, 12 ; ‘a’,‘b’,‘c’ and ‘d’ are not includedbecause their all-condence and periodic-condence do not satisfy minAllConf and maxPerConf respectively. Figure 4(b) shows the conditional EPF-tree of f . The single path generates the pattern {ef : 3, 0.75, 6, 1.5} (format is {pattern:support , all-condence , periodicity , periodic-condence }). The same process of creating prex-tree and its corresponding conditional tree is repeated for furtherextensions of ‘ef ’. Next, ‘f ’ is pruned from the original EPF-tree and its ts -listsare pushed to its parent nodes, as shown in Figure 4(c). All the above processesare once again repeated until EPF-list = ∅.

{} null

a

b:1

{} null

a

b:1

d:2

{} null

a

b:1,5

d:2

e

f:10

e

f:12c

e:7

c

e

f:4

d

c:3, 8, 11

(b)(a)

i s p

d 5 5

a

b

6

5

3

3

c 5 3

e

f

4

3

4

6

(c)

Fig. 3. Construction of EPF-tree for Table 1. (a) After scanning rst transaction (b)After scanning second transaction (c) After scanning every transaction

Algorithm 2 EPF-growth( Tree , α )1: for each a i in the header of T ree do2: Generate pattern β = a i ∪ α . Construct an array T S β , which represents the

set of timestamps at which β has appeared in T DB . Next, compute from T S β ,sup (β ), allConf (β ), per (β ) and perConf (β ) and compare them with minSup ,minAllConf , maxPer and maxPerConf , respectively.

3: if sup (β ) ≥ minSup , allConf (β ) ≥ minAllConf , per (β ) ≤ maxPer and perConf (β ) ≤ maxPerConf then

4: Output β as a periodic frequent pattern as {β : sup, allConf, per, perConf }.5: Traverse T ree using the node-links of β , and construct β ’s conditional pattern

base and β ’s conditional EPF-tree T ree β .6: if Tree β = ∅ then7: call EPF-growth( Tree β , β );8: Remove a i from the T ree and push a i ’s ts-list to its parent nodes.

5 Experimental Results

This section evaluates the proposed model against the basic model of periodic-frequent patterns [11]. We show that the proposed model discovers interesting




Table 3. Mining the EPF-tree by creating conditional (sub-)pattern bases

Item sup per Conditional Pattern Base Conditional EPF-tree Per. Freq. Patternsf 3 6 {ce : 4}, {abde : 10}, e : 4, 10, 12 {ef : 3,0.75,6,1.5 }

{ae : 12}e 4 4 {c : 4}, {abc : 7}, {abd : 10}, − −

{a : 12}c 5 3 {d : 3, 8, 11}, {ab : 7} − −d 5 5 {ab : 2, 10} − −b 5 3 {a : 1, 2, 5, 7, 10} a : 1, 2, 5, 7, 10 {ab: 5,0.833,3,1 }

{} null

a

b

d

e:10

e:12

c

e:4

(a)

i s pe

a

b

d

c

3

2

1

1

1

6

10

10

10

8

{} null

e:4, 10, 12

i s pe 3 6

(b)

i s p

d 5 5

a

b

6

5

3

3

c 5 3

e 4 4

{} null

a

e:12

c

e:4

d

c:3, 8, 11

(c)

b:1,5

d:2

e:10

c

e:7

Fig. 4. Mining of EPF-tree for Table 1. (a) Prex-tree of suffix item ‘ f ’, i.e., P T f (b)Conditional tree of suffix item ‘ f ’, i.e., C T f (c) EPF-tree after pruning item ‘ f ’.

patterns pertaining to both frequent and rare items by pruning uninterestingpatterns. We also show that EPF-growth is efficient.

All algorithms are written in C++ and run with Fedora 22 on a 2.66 GHzmachine with 8 GB of memory. We have conducted experiments using both syn-thetic ( T10I4D100K ) and real-world ( Retail and FAA-accidents ) databases.The T10I4D100K database is generated using the IBM data generator [2]. This

database contains 878 items with 100,000 transactions. The Retail databasecontains the market basket data from an Belgian retail store. This databasecontains 16,471 items with 88,162 transactions. The FAA-accidents databaseis constructed from the accidents data recorded by FAA from 1-January-1978 to31-December-2014. This database contains 9,290 items with 98,864 transactions.

Figure 5(a), (b) and (c) show the graph of maxPerConf versus number of periodic-frequent patterns generated for different values of minAllConf gen-erated in T10I4D100K, Retail and FAA-accidents databases, respectively. TheminSup and maxPer are set at 0 .01% and 40%, respectively. The following ob-servations can be drawn from these gures: ( i) The increase in maxPerConf results in increase of periodic-frequent patterns. The reason is that increasein maxPerConf increases the maximum inter-arrival time of a pattern in adatabase. ( ii ) The increase in minAllConf results in decrease of periodic-frequentpatterns. The reason is that increase in minAllConf increases the supportthreshold value of a pattern.

The usage of a low minSup and a high maxPer value has caused the basicmodel [11] to generate too many patterns (401,519 in T10I4D100K, 162,652in Retail and 1,173,777 in FAA-accidents). For brevity in gures, we are notplotting these results in Figure 5.




0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

11000

2 4 6 8 10

(a) T10I4D100KmaxPerConf

6000

7000

8000

9000

10000

11000

12000

13000

2 4 6 8 10

(b) RetailmaxPerConf

3000

4000

5000

6000

7000

8000

9000

10000

1 6 11 16 21

(c) FAA-accidentsmaxPerConf

P a t t e r n s

P a t t e r n s

P a t t e r n s

minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

Fig. 5. Periodic-frequent patterns discovered in various databases

Figure 6 (a), (b) and (c) show the graph of maxPerConf versus runtime of EPF-growth algorithm for different values of minAllConf on T10I4D100K, Re-tail and FAA-accidents databases respectively. The changes in minAllConf andmaxPerConf threshold values shows a similar effect on runtime requirement asin the generation of periodic-frequent patterns.

30

32

34

36

38

40

42

44

46

48

2 4 6 8 10

(a) T10I4D100KmaxPerConf

R u n t i m e

( s e c )

52

53

54

55

56

57

58

59

60

61

2 4 6 8 10

minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

R u n t i m e

( s e c )

(b) RetailmaxPerConf

28

29

30

31

32

33

34

35

36

37

1 6 11 16 21

(c) FAA-accidentsmaxPerConf

R u n t i m e

( s e c )minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

minAllConf=0.01

minAllConf=0.06

minAllConf=0.11

Fig. 6. Runtime requirements of EPF -growth in various databases

Some of the interesting periodic-frequent patterns discovered by the pro-posed model are presented in Table 4. The minSup , minAllConf , maxPerand maxPerConf values used to discover these patterns are 0.01%, 0.01, 40%and 30, respectively. Please note that periodicity (per ) of a pattern isexpressed in days . The rst three patterns in this table provide useful in-formation pertaining to rare events (i.e., substantial damages to an aircraft ordestroyed aircrafts), while the last two patterns provide useful information per-taining to frequent events (i.e., minor damages to an aircraft).

Table 4. Some of the interesting patterns discovered in FAA-accidents database

S. No. Patterns sup allConf per perConf 1 {Pilot Not Certicated, Destroyed } 13 0.06 4756 7.772 {Student, Substantial } 136 0.02 893 29.773 {Boeing, Substantial } 214 0.02 214 26.354 {Private-Pilot, Cessna, CE-172, Minor } 1,661 0.03 117 23.45 {General Operating Rules, Commercial 10,399 0.15 32 6.4

Pilot, Minor }




The rst pattern in this table reveals interesting information that 13 aircraftshave been ‘destroyed’ when piloted by a non-certied pilot. The periodicity of

this event is 4756 days ( ≈ 13 years). The second pattern indicates 136 aircraftsdriven by student pilots have suffered substantial damages at least once in ev-ery ≈ 2.5 years. The third pattern indicates that Boeing aircrafts have sufferedsubstantial damages at least once in every ≈ 7 months. The fourth pattern re-veals the information that Cessna airlines CE-172 driven by private pilots haveencountered minor damages at least once in every ≈ 4 months. The last patternreveals the information that at least once in every 32 days, an aircraft driven bycommercial pilots has witnessed minor damages during general operating rules.

6 Conclusions and Future WorkThis paper introduces a model to address the rare item problem in both fre-quency and time dimensions. A new interestingness measure, periodic-condence ,is proposed to address the problem in time dimension. An efficient pattern-growth algorithm has been proposed to discover all periodic-frequent patternsin a database. Experimental results demonstrate that the proposed model is ef-cient. As a part of future work, we would like to study the change in periodicbehavior of rare items due to noise.

References

1. Faa accidents dataset, http://www.asias.faa.gov/pls/apex/f ?p=100:1:0::NO2. Agrawal, R., Imieli´nski, T., Swami, A.: Mining association rules between sets of

items in large databases. In: SIGMOD. pp. 207–216 (1993)3. Brin, S., Motwani, R., Silverstein, C.: Beyond market baskets: generalizing associ-

ation rules to correlations. In: SIGMOD. pp. 265–276 (1997)4. Han, J., Gong, W., Yin, Y.: Mining segment-wise periodic patterns in time-related

databases. In: KDD. pp. 214–218 (1998)5. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate

generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 8(1), pp.53–87 (2004)

6. Kim, S., Barsky, M., Han, J.: Efficient mining of top correlated patterns based onnull-invariant measures. In: PKDD. pp. 177–192 (2011)

7. Kiran, R.U., Reddy, P.K.: Towards efficient mining of periodic-frequent patternsin transactional databases. In: DEXA (2). pp. 194–208 (2010)

8. Omiecinski, E.R.: Alternative interest measures for mining associations indatabases. IEEE Trans. on Knowl. and Data Eng. 15(1), pp. 57–69 (2003)

9. Ozden, B., Ramaswamy, S., Silberschatz, A.: Cyclic association rules. In: ICDE.pp. 412–421 (1998)

10. Tan, P.N., Kumar, V., Srivastava, J.: Selecting the right interestingness measurefor association patterns. In: Knowledge Discovery and Data Mining. pp. 32–41(2002)

11. Tanbeer, S.K., Ahmed, C.F., Jeong, B.S., Lee, Y.K.: Discovering periodic-frequentpatterns in transactional databases. In: PAKDD. pp. 242–253 (2009)

12. Weiss, G.M.: Mining with rarity: A unifying framework. In: SIGKDD Explorations6(1), pp. 7–19 (2004)

13. Yang, R., Wang, W., Yu, P.: Infominer+: mining partial periodic patterns withgap penalties. In: ICDM. pp. 725–728 (2002)

Documents

Discovering periodic frequent patterns in transactional databases