IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

A Transaction Mapping Algorithm for Frequent Itemsets Mining

Mingjun Song and Sanguthevar Rajasekaran, Member, IEEE

Abstract— In this paper, we present a novel algorithm for mining complete frequent itemsets. This algorithm is referred to as the TM (Transaction Mapping) algorithm from hereon. In this algorithm, transaction ids of each itemset are mapped and compressed to continuous transaction intervals in a different space, and the counting of itemsets is performed by intersecting these interval lists in a depth-first order along the lexicographic tree. When the compression coefficient becomes smaller than the average number of comparisons for interval intersection at a certain level, the algorithm switches to transaction id intersection. We have evaluated the algorithm against two popular frequent itemset mining algorithms, FP-growth and dEclat, using a variety of data sets with short and long frequent patterns. Experimental data show that the TM algorithm outperforms these two algorithms.

Index Terms— Algorithms, Association Rule Mining, Data Mining, Frequent Itemsets.

I. INTRODUCTION

ASSOCIATION rules mining is a very popular data mining technique that finds relationships among the entities of records (for example, transaction records). Since the introduction of frequent itemsets in 1993 by Agrawal et al. [1], the topic has received a great deal of attention in the field of knowledge discovery and data mining.

One of the first algorithms proposed for association rules mining was the AIS algorithm [1]; the problem of association rules mining was introduced in [1] as well. This algorithm was later improved to obtain the Apriori algorithm [2]. The Apriori algorithm employs the downward closure property: if an itemset is not frequent, no superset of it can be frequent either. The Apriori algorithm performs a breadth-first search in the search space

M. Song is with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected].

S. Rajasekaran is with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected].

Manuscript received December 14, 2004; revised October 5, 2005.

by generating candidate (k+1)-itemsets from frequent k-itemsets. The frequency of an itemset is computed by counting its occurrences in the transactions. Many variants of the Apriori algorithm have been developed, such as AprioriTid, AprioriHybrid, direct hashing and pruning (DHP), dynamic itemset counting (DIC), the Partition algorithm, etc. For a survey of association rules mining algorithms, please see [3].

FP-growth [4] is a well-known algorithm that uses the FP-tree data structure to achieve a condensed representation of the database transactions and employs a divide-and-conquer approach to decompose the mining problem into a set of smaller problems. In essence, it mines all the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base that is efficiently constructed with the help of a node link structure. A variant of FP-growth is the H-mine algorithm [5]. It uses array-based and trie-based data structures to deal with sparse and dense datasets, respectively. PatriciaMine [6] employs a compressed Patricia trie to store the datasets. FPgrowth* [7] uses an array technique to reduce the FP-tree traversal time. In FP-growth based algorithms, recursive construction of the FP-tree affects the algorithm's performance.

Eclat [8] is the first algorithm to find frequent patterns by a depth-first search and it has been shown to perform well. It uses a vertical database representation and counts the itemset supports using the intersection of tids. However, because of the depth-first search, the pruning used in the Apriori algorithm is not applicable during candidate itemset generation. VIPER [9] and Mafia [10] also use the vertical database layout and intersection to achieve good performance. The only difference is that they use compressed bitmaps to represent the transaction list of each itemset. However, their compression scheme has limitations, especially when tids are uniformly distributed. Zaki and Gouda [11] developed a new approach called dEclat using the vertical database representation. They store the difference of tids, called the diffset, between a candidate k-itemset and its prefix (k−1)-frequent itemset, instead of the tids intersection set, denoted here as the tidset. They compute the support by subtracting the cardinality of the diffset from the support of its prefix (k−1)-frequent itemset. This algorithm has been shown to gain significant performance improvements over Eclat. However, when the database is sparse, diffset will lose its advantage over tidset.

In this paper, we present a novel approach that maps and compresses the transaction id list of each itemset into an interval list using a transaction tree, and counts the support of each itemset by intersecting these interval lists. The frequent itemsets are found in a depth-first order along a lexicographic tree, as done in the Eclat algorithm. The basic idea is to save the intersection time in Eclat by mapping transaction ids into continuous transaction intervals. When these intervals become scattered, we switch to transaction ids as in Eclat. We call the new algorithm the TM (transaction mapping) algorithm. The rest of the paper is arranged as follows: Section II introduces the basic concepts of association rules mining, two types of data representation, and the lexicographic tree used in our algorithm; Section III addresses how the transaction id list of each itemset is compressed to a continuous interval list, and the details of the TM algorithm; Section IV gives an analysis of the compression efficiency of transaction mapping; Section V experimentally compares the TM algorithm with two popular algorithms, FP-Growth and dEclat; in Section VI, we provide some general comments; Section VII concludes the paper.

II. BASIC PRINCIPLES

A. Association Rules Mining

Let I = {i1, i2, . . . , im} be a set of items and let D be a database having a set of transactions, where each transaction T is a subset of I. An association rule is an association relationship of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The support of the rule X ⇒ Y is defined as the percentage of transactions in D containing both X and Y. The confidence of X ⇒ Y is defined as the percentage of transactions in D containing X that also contain Y. The task of association rules mining is to find all strong association rules that satisfy a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf). Mining

TABLE I

HORIZONTAL REPRESENTATION

tid items

1 2, 1, 5, 3

2 2, 3

3 1, 4

4 3, 1, 5

5 2, 1 ,3

6 2, 4

TABLE II

VERTICAL TIDSET REPRESENTATION

item tidset

1 1, 3, 4, 5

2 1, 2, 5, 6

3 1, 2, 4, 5

4 3

5 1, 4

association rules consists of two phases. In the first phase, all frequent itemsets that satisfy min_sup are found. In the second phase, strong association rules are generated from the frequent itemsets found in the first phase. Most research considers only the first phase because, once the frequent itemsets are found, mining association rules is trivial.
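To make the two measures concrete, here is a minimal sketch (function names are ours, not the paper's) computing support and confidence over the example database of Table I:

```python
# Support and confidence over the example database of Table I.
# A minimal illustrative sketch; function names are ours.
D = [{2, 1, 5, 3}, {2, 3}, {1, 4}, {3, 1, 5}, {2, 1, 3}, {2, 4}]

def support(itemset, database):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for T in database if itemset <= T) / len(database)

def confidence(X, Y, database):
    """Fraction of the transactions containing X that also contain Y."""
    return support(X | Y, database) / support(X, database)

# Rule {2} => {3}: items 2 and 3 occur together in transactions 1, 2, and 5.
print(support({2, 3}, D))       # 0.5
print(confidence({2}, {3}, D))  # -> 0.75 (up to float rounding)
```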

B. Data Representation

Two types of database layouts are employed in association rules mining: horizontal and vertical. In the traditional horizontal database layout, each transaction consists of a set of items and the database contains a set of transactions. Most Apriori-like algorithms use this type of layout. In the vertical database layout, each item maintains a set of transaction ids (denoted by tidset) of the transactions in which the item is contained. This layout could also be maintained as a bitvector. Eclat uses tidsets while VIPER and Mafia use compressed bitvectors. It has been shown that the vertical layout generally performs better than the horizontal format [8], [9]. Tables I through III show examples of the different types of layouts.
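The conversion from the horizontal layout of Table I to the vertical tidset layout of Table II can be sketched as follows (a toy illustration, not the paper's C++ implementation):

```python
# Building the vertical tidset layout (Table II) from the
# horizontal layout (Table I). Illustrative sketch only.
horizontal = {1: {2, 1, 5, 3}, 2: {2, 3}, 3: {1, 4},
              4: {3, 1, 5}, 5: {2, 1, 3}, 6: {2, 4}}

def to_vertical(horizontal):
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

vertical = to_vertical(horizontal)
print(sorted(vertical[1]))  # [1, 3, 4, 5], the tidset of item 1 in Table II
```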

C. Lexicographic Prefix Tree

In this paper, we employ a lexicographic prefix tree data structure to efficiently generate candidate itemsets and count their frequency, which is very


TABLE III

VERTICAL BITVECTOR REPRESENTATION

item bitvector

1 1 0 1 1 1 0

2 1 1 0 0 1 1

3 1 1 0 1 1 0

4 0 0 1 0 0 0

5 1 0 0 1 0 0

similar to the lexicographic tree used in the TreeProjection algorithm [12]. This tree structure is also used in many other algorithms such as Eclat [8]. An example of this tree is shown in Fig. 1. Each node in the tree stores a collection of frequent itemsets together with the support of these itemsets. The root contains all frequent 1-itemsets. Itemsets in level l (for any l) are frequent l-itemsets. Each edge in the tree is labeled with an item. Itemsets in any node are stored as singleton sets, with the understanding that the actual itemset also contains all the items found on the edges from this node to the root. For example, consider the leftmost node in level 2 of the tree in Fig. 1. There are four 2-itemsets in this node, namely, {1,2}, {1,3}, {1,4}, and {1,5}. The singleton sets in each node of the tree are stored in lexicographic order. If the root contains {1}, {2}, . . . , {n}, then the nodes in level 2 will contain {2}, {3}, . . . , {n}; {3}, {4}, . . . , {n}; . . . ; {n}, and so on. For each candidate itemset, we also store a list of transaction ids (i.e., ids of transactions in which all the items of the itemset occur). This tree is not generated in full. The tree is generated in a depth-first order and, at any given time, we only store the minimum information needed to continue the search. In particular, this means that at any instant at most a path of the tree will be stored. As the search progresses, if the expansion of a node cannot possibly lead to the discovery of itemsets that have minimum support, then the node will not be expanded and the search will backtrack. As a frequent itemset that meets the minimum support requirement is found, it is output. Candidate itemsets generated by depth-first search are the same as those generated by the joining step (without pruning) of the Apriori algorithm.
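The depth-first traversal just described can be sketched as follows. This toy version (all names are ours) extends a prefix only with items to its right, exactly the join step without pruning, and counts supports with tidset intersection, as Eclat does, to keep the example self-contained; the tidsets are those of items 1, 2, 3, and 5 from Table II with a minimum support count of 2:

```python
# Depth-first mining along the lexicographic prefix tree.
# Illustrative sketch; counting here uses tidset intersection.
def dfs_mine(prefix, prefix_tids, candidates, tidsets, min_count, out):
    for i, item in enumerate(candidates):
        tids = prefix_tids & tidsets[item] if prefix else tidsets[item]
        if len(tids) >= min_count:
            itemset = prefix + [item]
            out.append(itemset)
            # items to the right of `item` form the child node
            dfs_mine(itemset, tids, candidates[i + 1:], tidsets, min_count, out)

tidsets = {1: {1, 3, 4, 5}, 2: {1, 2, 5, 6}, 3: {1, 2, 4, 5}, 5: {1, 4}}
found = []
dfs_mine([], set(), [1, 2, 3, 5], tidsets, 2, found)
```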

Fig. 1. Illustration of lexicographic tree

III. TM ALGORITHM

Our contribution is that we compress the tids (transaction ids) of each itemset to continuous intervals by mapping transaction ids into a different space with the help of a transaction tree. Frequent itemsets are found by intersecting these interval lists instead of intersecting the transaction id lists (as in the Eclat algorithm). We will begin with the construction of a transaction tree.

A. Transaction tree

The transaction tree is similar to the FP-tree except that there is no header table or node link. The transaction tree can be thought of as a compact representation of all the transactions in the database. Each node in the tree has an id corresponding to an item, and a counter that keeps the number of transactions that contain this item in this path. Adapted from [4], the construction of the transaction tree (called construcTransactionTree) is as follows:

1) Scan through the database once, identify all the frequent 1-itemsets, and sort them in descending order of frequency. At the beginning, the transaction tree consists of just a single node (a dummy root).

2) Scan through the database a second time. For each transaction, select the items that are in frequent 1-itemsets, sort them according to the order of the frequent 1-itemsets, and insert them into the transaction tree. When inserting an item, start from the root. At the beginning the root is the current node. In


TABLE IV

A SAMPLE TRANSACTION DATABASE

TID Items Ordered frequent items

1 2,1,5,3,19,20 1,2,3

2 2,6,3 2,3

3 1,7,8 1

4 3,1,9,10 1,3

5 2,1,11,3,17,18 1,2,3

6 2,4,12 2,4

7 1,13,14 1

8 2,15,4,16 2,4

Fig. 2. A transaction tree for the above database

general, if the current node has a child node whose id is equal to this item, then just increment the count of this child by 1; otherwise create a new child node and set its counter to 1.

Table IV and Fig. 2 illustrate the construction of a transaction tree. Table IV shows an example of a transaction database and Fig. 2 displays the constructed transaction tree, assuming that the minimum support count is 2. The number before the colon in each node is the item id and the number after the colon is the count of this item in this path.
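The two scans above can be sketched compactly for the database of Table IV with a minimum support count of 2 (class and function names are ours; ties in frequency are broken by item id, matching the ordering of Table IV):

```python
# A sketch of construcTransactionTree. Items are filtered to the
# frequent 1-itemsets, reordered by descending frequency, and
# inserted into a trie whose counters count transactions per path.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_transaction_tree(transactions, min_count):
    freq = {}
    for t in transactions:                       # first database scan
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    frequent = [i for i in freq if freq[i] >= min_count]
    # descending frequency; ties broken by item id for determinism
    order = {i: r for r, i in
             enumerate(sorted(frequent, key=lambda i: (-freq[i], i)))}
    root = Node(None)
    for t in transactions:                       # second database scan
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item)
            child.count += 1
            node = child
    return root

db = [[2, 1, 5, 3, 19, 20], [2, 6, 3], [1, 7, 8], [3, 1, 9, 10],
      [2, 1, 11, 3, 17, 18], [2, 4, 12], [1, 13, 14], [2, 15, 4, 16]]
root = build_transaction_tree(db, 2)
```

The resulting counts match Fig. 2: the root's child 1 has count 5, its child 2 has count 2, and so on.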

B. Transaction mapping and the construction of interval lists

After the transaction tree is constructed, all the transactions that contain an item are represented with an interval list. Each interval corresponds to a contiguous sequence of relabeled ids. Each node in the transaction tree will be associated with an interval. The construction of the interval lists for each item

TABLE V

EXAMPLE OF TRANSACTION MAPPING

Item Mapped transaction interval list

1 [1,500]

2 [1,200], [501,800]

3 [1,300], [501,600]

4 [601,800]

is done recursively, starting from the root, in a depth-first order. The process is described as follows. Consider a node u whose number of transactions is c and whose associated interval is [s, e]. Here s is the relabeled start id and e is the relabeled end id, with e − s + 1 = c. Assume that u has m children, with child i having c_i transactions, for i = 1, 2, . . . , m. It is obvious that Σ_{i=1}^{m} c_i ≤ c. If the intervals associated with the children of u are [s_1, e_1], [s_2, e_2], . . . , [s_m, e_m], these intervals are constructed as follows:

s_1 = s (1)

e_1 = s_1 + c_1 − 1 (2)

s_i = e_{i−1} + 1, for i = 2, 3, . . . , m (3)

e_i = s_i + c_i − 1, for i = 2, 3, . . . , m (4)

For the root, s = 1. For example, in Fig. 2, the root has two children. For the first child, s_1 = 1, e_1 = 1 + 5 − 1 = 5, so the interval is [1,5]; for the second child, s_2 = 5 + 1 = 6, e_2 = 6 + 3 − 1 = 8, so the interval is [6,8]. The compressed transaction id list of each item is ordered by the start id of each associated interval. In addition, if two intervals are contiguous, they will be merged and replaced with a single interval. For example, the interval associated with each node is shown in Fig. 2. The two intervals of item 3, [1,2] and [3,3], will be merged to [1,3].

To illustrate the efficiency of this mapping process more clearly, assume that the eight transactions of the example database shown in Table IV repeat 100 times each. In this case the transaction tree becomes the one shown in Fig. 3.

The mapped transaction interval lists for each item are shown in Table V, where interval 1-300 of item 3 results from the merging of 1-200 and 201-300.

We now summarize a procedure (called mapTransactionIntervals) that computes the interval lists for each item as follows. Using depth-first order,


Fig. 3. Transaction tree for illustration

traverse the transaction tree. For each node, create an interval composed of a start id and an end id. If the node is the first child of its parent, then the start id of the interval is equal to the start id of the parent (equation (1)) and the end id is computed by equation (2). If not, the start id is computed by equation (3) and the end id by equation (4). Insert this interval into the interval list of the corresponding item.
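A sketch of mapTransactionIntervals over the tree of Fig. 2, applying equations (1)-(4); the tree is hard-coded here as nested dicts of the form {item: (count, children)}, and all names are ours:

```python
# Depth-first assignment of intervals to tree nodes, appending each
# interval to its item's list and merging contiguous intervals.
fig2_tree = {
    1: (5, {2: (2, {3: (2, {})}),
            3: (1, {})}),
    2: (3, {3: (1, {}),
            4: (2, {})}),
}

def map_intervals(children, start, intervals):
    s = start
    for item, (count, grandchildren) in children.items():
        e = s + count - 1                        # eqs. (2) and (4)
        lst = intervals.setdefault(item, [])
        if lst and lst[-1][1] + 1 == s:          # merge contiguous intervals
            lst[-1] = (lst[-1][0], e)
        else:
            lst.append((s, e))
        map_intervals(grandchildren, s, intervals)
        s = e + 1                                # eq. (3)
    return intervals

intervals = map_intervals(fig2_tree, 1, {})      # root has s = 1
print(intervals)
# {1: [(1, 5)], 2: [(1, 2), (6, 8)], 3: [(1, 3), (6, 6)], 4: [(7, 8)]}
```

Note how item 3's intervals [1,2] and [3,3] are merged into [1,3], exactly as in the text; scaling every count by 100 reproduces Table V.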

Once the interval lists for the frequent 1-itemsets are constructed, frequent i-itemsets (for any i) are found by intersecting interval lists along the lexicographic tree. Details are provided in the next subsection.

C. Interval lists intersection

In addition to the items described above, each element of a node in the lexicographic tree also stores a transaction interval list (corresponding to the itemset denoted by the element). By constructing the lexicographic tree in a depth-first order, the support count of a candidate itemset is computed by intersecting the interval lists of the two elements. For example, element 2 in the second level of the lexicographic tree in Fig. 1 represents the itemset {1,2}, whose support count is computed by intersecting the interval lists of itemset {1} and itemset {2}. In contrast, Eclat uses a tid list intersection. Interval list intersection is more efficient. Note that since the intervals are constructed from the transaction tree, an interval cannot partially contain or be partially contained in another interval. There are only three possible relationships between any two intervals A = [s1, e1] and B = [s2, e2]:

1) A ∩ B = ∅. In this case, interval A and interval B come from different paths of the transaction tree.

For instance, interval [1,500] and interval [501,800] in Table V.

2) A ⊇ B. In this case, interval A comes from an ancestor node of interval B in the transaction tree. For instance, interval [1,500] and interval [1,300] in Table V.

3) A ⊆ B. In this case, interval A comes from a descendant node of interval B in the transaction tree. For instance, interval [1,300] and interval [1,500] in Table V.

Considering the above three cases, the average number of comparisons for two intervals is 2.
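Because only these three cases can arise, a single merging pass over two sorted interval lists suffices; on overlap the contained interval is kept. A sketch (names are ours), checked against the interval lists of Table V:

```python
# Interval-list intersection. Intervals from the transaction tree
# are either disjoint or one contains the other, so each overlap
# contributes the contained interval, and the list whose current
# interval ends first advances.
def intersect_intervals(a, b):
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s1, e1 = a[i]
        s2, e2 = b[j]
        if e1 < s2:        # case 1: disjoint, A's interval first
            i += 1
        elif e2 < s1:      # case 1: disjoint, B's interval first
            j += 1
        else:              # case 2 or 3: one contains the other
            result.append((max(s1, s2), min(e1, e2)))
            if e1 <= e2:
                i += 1
            else:
                j += 1
    return result

# Items 1 and 2 from Table V: the support of {1,2} is 200.
print(intersect_intervals([(1, 500)], [(1, 200), (501, 800)]))  # [(1, 200)]
```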

D. Switching

After a certain level of the lexicographic tree, the transaction interval lists of the elements in any node can be expected to become scattered: there could be many transaction intervals that contain only single tids. At this point, the interval representation loses its advantage over the single tid representation, because the intersection of two intervals uses three comparisons in the worst case while the intersection of two single tids needs only one comparison. Therefore, we need to switch to the single tid representation at some point. Here, we define a coefficient of compression for one node in the lexicographic tree, denoted by coeff, as follows. Assume that a node has m elements, and let s_i represent the support of the ith element and l_i the size of the transaction list of the ith element. Then,

coeff = (1/m) Σ_{i=1}^{m} s_i / l_i

For the intersection of two interval lists, the average number of comparisons is 2, so we will switch to tid set intersection when coeff is less than 2.
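The switching test can be sketched as follows (names are ours; the scattered node below is a contrived example, not taken from the paper):

```python
# Compression coefficient of a node and the switching decision.
# `elements` is a list of (support, interval_list) pairs.
def compression_coeff(elements):
    return sum(s / len(iv) for s, iv in elements) / len(elements)

def should_switch(elements, threshold=2.0):
    return compression_coeff(elements) < threshold

# Items 1 and 2 of Table V: intervals are still highly compressed.
compact = [(500, [(1, 500)]),
           (500, [(1, 200), (501, 800)])]
# A contrived scattered node: every interval holds a single tid.
scattered = [(3, [(1, 1), (5, 5), (9, 9)]),
             (2, [(2, 2), (7, 7)])]
print(should_switch(compact))    # False
print(should_switch(scattered))  # True
```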

E. Details of the TM Algorithm

Now we provide details on the steps involved in the TM algorithm. There are four steps:

1) Scan through the database and identify all frequent 1-itemsets.

2) Construct the transaction tree with counts for each node.

3) Construct the transaction interval lists. Merge intervals if they are mergeable (i.e., if the intervals are contiguous).


Fig. 4. Full transaction tree

4) Construct the lexicographic tree in a depth-first order, keeping only the minimum amount of information necessary to complete the search. In particular, this means that no more than a path in the lexicographic tree will ever be stored. While at any node, if further expansion of that node will not be fruitful, the search backtracks. When processing a node in the tree, for every element in the node, the corresponding interval lists are computed by interval intersections. As the search progresses, itemsets with enough support are output. When the compression coefficient of a node becomes less than 2, switch to tid list intersection.

In the next section we provide an analysis to indicate how TM can provide computational efficiency.

IV. COMPRESSION AND TIME ANALYSIS OF TRANSACTION MAPPING

Suppose, in the worst case, that the transaction tree is fully filled, as illustrated in Fig. 4, where the subscript of C denotes the corresponding itemset and C represents the count for this itemset.

Assume that there are n frequent 1-itemsets with supports S_1, S_2, . . . , S_n, respectively. Then we have the following relationships:

S_1 = C_1 = |T_1|

S_2 = C_2 + C_{1,2} = |T_2| + |T_{1,2}|

S_3 = C_3 + C_{1,3} + C_{1,2,3} + C_{2,3} = |T_3| + |T_{1,3}| + |T_{1,2,3}| + |T_{2,3}|

. . .

S_n = C_n + C_{1,n} + C_{2,n} + . . . + C_{n−1,n} + C_{1,2,n} + C_{1,3,n} + . . . = |T_n| + |T_{1,n}| + |T_{2,n}| + . . . + |T_{n−1,n}| + |T_{1,2,n}| + |T_{1,3,n}| + . . .

Here each T represents the interval for a node, and |T| represents the length of T, which is equal to the corresponding C. The maximum number of intervals possible for frequent 1-itemset i is 2^{i−1}.

The average compression ratio is

Avg_ratio ≥ S_1 + S_2/2^1 + S_3/2^2 + . . . + S_i/2^{i−1} + . . . + S_n/2^{n−1}
         ≥ S_n (1 + 1/2^1 + 1/2^2 + . . . + 1/2^{n−1})
         = 2 S_n (1 − 2^{−n})

When S_n, which is equal to min_sup, is high, the compression ratio will be large and thus the intersection time will be small. On the other hand, because the compression ratio for any itemset cannot be less than 1, assume that for frequent 1-itemset i the compression ratio is equal to 1, i.e., S_i / 2^{i−1} = 1. Then for all frequent 1-itemsets (in the first level of the lexicographic tree) whose ID number is less than i, the compression ratio is greater than 1, and for all frequent 1-itemsets whose ID number is larger than i, the compression ratio is equal to 1. Therefore, we have:

Avg_ratio ≥ S_1 + S_2/2^1 + S_3/2^2 + . . . + S_i/2^{i−1} + n − i
         ≥ 2 S_i (1 − 2^{−i}) + n − i
         = 2^i − 1 + n − i

Since 2^i > i, when i is large, i.e., when fewer of the frequent 1-itemsets have compression ratio equal to 1, the transaction tree is 'narrow'. In the worst case, when the transaction tree is fully filled, the compression ratio reaches its minimum value. Intuitively, when the size of the dataset is large and there are more repetitive patterns, the transaction tree will be narrow. In general, market data has these characteristics. In summary, when the minimum support is large, or the items are sparsely associated and there are more repetitive patterns (as in the case of market data), the algorithm runs faster.

V. EXPERIMENTS AND PERFORMANCE EVALUATION

A. Comparison with dEclat and FP-growth

We used five sets of data in our experiments. Three of these sets are synthetic data (T10I4D100K, T25I10D10K, and T40I10D100K).


TABLE VI

CHARACTERISTICS OF EXPERIMENT DATA SETS

Data #items avg. trans. length #transactions

T10I4D100k 1000 10 100,000

T25I10D10K 1000 25 9,219

T40I10D100k 942 39 100,000

mushroom 120 23 8,124

Connect-4 130 43 67,557

These synthetic data resemble market basket data with short frequent patterns. The other two datasets are real data (the Mushroom and Connect-4 data), which are dense in long frequent patterns. These data sets were often used in previous studies of association rules mining and were downloaded from http://fimi.cs.helsinki.fi/testdata.html and http://miles.cnuce.cnr.it/ palmeri/datam/DCI/datasets.php. Some characteristics of these datasets are shown in Table VI.

We have compared the TM algorithm mainly with two popular algorithms, dEclat and FP-growth, the implementations of which were downloaded from http://www.cs.helsinki.fi/u/goethals/software, implemented by B. Goethals using std libraries. They were compiled in Visual C++. The TM algorithm was implemented based on these two codes. Small modifications were made to implement the transaction tree and interval list construction, interval list intersection, and switching. The same std libraries were used to make the comparison fair. Implementations that employ other libraries and data structures might be faster than Goethals' implementation; comparing such implementations with the TM implementation would be unfair. The FP-growth code was modified slightly to read the whole database into memory at the beginning so that the comparison of all three algorithms is fair. We did not compare with Eclat because it was shown in [11] that dEclat outperforms Eclat. Both TM and dEclat use the same optimization techniques, described below:

1) Early stopping: This technique was used earlier in Eclat [8]. The intersection between two tid sets can be stopped if the number of mismatches in one set is greater than the support of this set minus the minimum support threshold. For instance, assume that the minimum support threshold is 50 and the supports of two itemsets AB and AC are 60 and 80, respectively. If the number of mismatches in AB has reached 11, then itemset ABC cannot be frequent. For interval list intersection, the number of mismatches is somewhat hard to record because of the complicated set relationships. Thus we have used the following rule: if the number of transactions not yet intersected is less than the minimum support threshold minus the number of matches, the intersection is stopped.

2) Dynamic ordering: Reordering all the items in every node at each level of the lexicographic tree in ascending order of support can reduce the number of generated candidate itemsets and hence the number of needed intersections. This property was first used by Bayardo [13].

3) Save intersection with combination: This technique comes from the following corollary [3]: if the support of the itemset X ∪ Y is equal to the support of X, then the support of the itemset X ∪ Y ∪ Z is equal to the support of the itemset X ∪ Z. For example, if the support of itemset {1,2} is equal to the support of {1}, then the support of the itemset {1,2,3} is equal to the support of itemset {1,3}, so we do not need to conduct the intersection between {1,2} and {1,3}. Correspondingly, if the supports of several itemsets are all equal to the support of their common prefix itemset (subset) that is frequent, then any combination of these itemsets will be frequent. For example, if the supports of itemsets {1,2}, {1,3} and {1,4} are all equal to the support of the frequent itemset {1}, then {1,2}, {1,3}, {1,4}, {1,2,3}, {1,2,4}, {1,3,4}, and {1,2,3,4} are all frequent itemsets. This optimization is similar to the single path solution in the FP-growth algorithm.
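The first of these optimizations, early stopping for plain tid-set intersection, can be sketched as follows (names are ours):

```python
# Early stopping during tid-set intersection: abort once the
# mismatches seen in set `a` exceed |a| minus the minimum support
# threshold, since the result can no longer be frequent.
def intersect_with_early_stop(a, b, min_support):
    """a, b: sorted tid lists; returns None when aborted."""
    budget = len(a) - min_support        # allowed mismatches in a
    result, j = [], 0
    for tid in a:
        while j < len(b) and b[j] < tid:
            j += 1
        if j < len(b) and b[j] == tid:
            result.append(tid)
        else:
            budget -= 1
            if budget < 0:
                return None              # cannot reach min_support
    return result
```

In the example from the text (min_sup 50, |AB| = 60), the budget is 10 mismatches, so the 11th mismatch aborts the intersection.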

All experiments were performed on a DELL 2.4GHz Pentium PC with 1G of memory, running Windows 2000. All times shown include the time for outputting all the frequent itemsets. The results are presented in Tables VII through XI and Figures 5 through 10.

Table VII shows the running time of the compared algorithms on the T10I4D100K data with different minimum supports, expressed as a percentage of the total number of transactions. Under large minimum supports, dEclat runs faster than FP-Growth, while it runs slower than FP-Growth under small minimum supports. The TM algorithm runs faster than both algorithms under almost all minimum support values. On average, the TM algorithm runs almost 2 times


TABLE VII

RUN TIME (S) FOR T10I4D100K DATA

support(%) FP-growth dEclat TM dTM MAFIA FP*

5 0.671 0.39 0.328 0.5 0.625 0.156

2 2.375 1.734 0.984 1.687 1.796 0.484

1 5.812 5.562 2.406 5.656 6.375 0.89

0.5 7.359 9.078 4.515 9.421 18.359 1.187

0.2 7.484 11.796 7.359 12.75 24.671 1.64

0.1 8.5 12.875 8.906 14.796 33.234 1.828

0.05 11.359 15.656 10.453 19.859 56.031 2.078

0.02 20.609 33.468 14.421 63.187 146.437 2.64

0.01 33.781 73.093 21.671 168.906 396.453 3.937

Fig. 5. Run time for T10I4D100k data (1)

Fig. 6. Run time for T10I4D100k data (2)

faster than the faster of FP-Growth and dEclat. Two graphs (Fig. 5 and Fig. 6) display the performance comparison under large and small minimum supports, respectively.

Table VIII and Fig. 7 show the performance comparison of the compared algorithms on the T25I10D10K data. dEclat runs, in general, faster than FP-Growth

Fig. 7. Run time for T25I10D10K data

Fig. 8. Run time for T40I10D100K data

with some exceptions at certain minimum support values. The TM algorithm runs about twice as fast as dEclat on average.

Table IX and Fig. 8 show the performance comparison of the compared algorithms on the T40I10D100K data. The TM algorithm runs faster when the minimum support is large and slower when the minimum support is small.


TABLE VIII

RUN TIME (S) FOR T25I10D10K DATA

support(%) FP-growth dEclat TM dTM MAFIA FP*

5 0.25 0.14 0.093 0.171 0.140 0.046

2 3.093 3.203 1.109 3.937 2.359 0.25

1 4.406 4.921 2.718 5.859 4.015 0.437

0.5 5.187 5.296 3.953 6.578 5.828 0.64

0.2 10.328 6.937 5.656 10.968 17.406 1.14

0.1 31.219 20.953 10.906 51.484 54.078 2.125

TABLE IX

RUN TIME (S) FOR T40I10D100K DATA

support(%) FP-growth dEclat TM dTM MAFIA FP*

5 93.156 14.266 7.687 20.265 8.515 1.39

2 240.281 36.437 23.281 49.578 23.562 4.859

1 568.671 52.734 46.343 85.421 45.921 10.453

0.5 1531.92 121.718 178.078 260.328 262.937 23.031

0.2 4437.03 483.843 853.515 1374.86 1451.83 117.015

TABLE X

RUN TIME (S) FOR MUSHROOM DATA

support(%) FP-growth dEclat TM dTM MAFIA FP*

5 32.203 29.828 28.125 30.515 24.687 15.687

2 208.078 196.156 187.672 207.062 141.5 104.906

1 839.797 788.781 751.89 835.89 569.828 424.859

0.5 2822.11 2668.83 2640.83 2766.47 1938.25 1478.98

Table X and Fig. 9 compare the algorithms of interest on the mushroom data. dEclat is better than FP-growth, while TM is better than dEclat.

Table XI and Fig. 10 show the relative performance of the algorithms on the Connect-4 data. The Connect-4 data is very dense, and hence the smallest minimum support in this experiment is 40 percent. Similar to the results on the mushroom data, dEclat is faster than FP-growth while TM is faster than dEclat, though the difference is not significant.

B. Experiments with dTM

We have combined the TM algorithm with the dEclat algorithm in the following way: we represent the diffset [11] in dEclat between a candidate k-itemset and its prefix (k-1)-frequent itemset using mapped transaction intervals, and compute the support by subtracting the cardinality of the diffset from the support of the prefix (k-1)-frequent itemset. We name the resulting algorithm the dTM algorithm. We ran the dTM algorithm on the five data sets, and the run times are shown in Tables VII through XI. Unexpectedly, the performance of dTM is worse than that of TM. The reason is that computing the difference of interval sets between two itemsets is more complicated than computing the intersection and incurs more overhead. For instance, consider interval set 1 = [s1, e1] and interval set 2 = [s2,


TABLE XI

RUN TIME (S) FOR CONNECT-4 DATA

support(%) FP-growth dEclat TM dTM MAFIA FP*

90 2.171 0.891 0.781 1.765 1.703 0.828

80 9.078 5.406 4.734 4.796 15.109 2.968

70 56.609 40.296 35.484 37.000 107.406 21.859

60 283.031 211.828 195.359 202.578 506.078 120.484

50 1204.67 935.109 871.359 928.218 2072.73 525.968

40 4814.59 3870.64 3579.38 4013.76 7764.91 2229.06

Fig. 9. Run time for Mushroom data

Fig. 10. Run time for Connect-4 data

e2], [s3, e3]. Both [s2, e2] and [s3, e3] lie within [s1, e1]. The difference between [s3, e3] and [s1, e1] depends on the difference between [s2, e2] and [s1, e1]. So there are more cases to consider here than in the computation of the intersection of two sets.
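To make the case analysis concrete, the following sketch subtracts one sorted interval list from another. It is an illustration in our own notation, not the dTM implementation: unlike intersection, a single interval of the first list can be cut into several fragments by intervals of the second (as with [s1, e1] minus [s2, e2] and [s3, e3] lying inside it), so a partially processed fragment must be carried across iterations.

```python
def subtract_intervals(a, b):
    """Return the parts of interval list `a` not covered by interval list `b`.

    Intervals are inclusive (start, end) pairs; both lists are sorted and
    disjoint.  One interval of `a` may yield several output fragments, so
    `cur` tracks the left edge of the still-uncovered part of the current
    a-interval while b-intervals carve pieces out of it.
    """
    result = []
    j = 0
    for s, e in a:
        cur = s
        while j < len(b) and b[j][1] < cur:
            j += 1  # skip b-intervals entirely to the left of [s, e]
        k = j
        while k < len(b) and b[k][0] <= e:
            if b[k][0] > cur:
                result.append((cur, b[k][0] - 1))  # fragment before b[k]
            cur = max(cur, b[k][1] + 1)
            k += 1
        if cur <= e:
            result.append((cur, e))  # remaining right fragment
    return result
```

For example, subtracting [(3, 4), (7, 8)] from [(1, 10)] yields three fragments, (1, 2), (5, 6), and (9, 10), whereas intersecting two interval lists never produces more output intervals than the shorter input; this carry-over of split fragments is the extra overhead discussed above.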

C. Experiments with MAFIA and FP-growth*

In this experiment, we tested two other algorithms mentioned in the introduction: MAFIA and FP-growth*. The comparison, however, is just for reference, because the implementations of MAFIA and FP-growth* use different libraries and data structures, which makes the comparison unfair. The implementation of MAFIA was downloaded from http://himalaya-tools.sourceforge.net/Mafia/#download, and the implementation of FP-growth* was downloaded from http://www.cs.concordia.ca/db/dbdm/dm.html. The run times for these two algorithms are also shown in Tables VII through XI. TM is faster than MAFIA on four data sets and slower than MAFIA only on the mushroom data set. FP-growth* is the fastest among all the algorithms tested. The comparison, however, is unfair. For example, FP-tree construction should be slower than transaction tree construction, but in FP-growth* the implementation of FP-tree construction is faster than our implementation of transaction tree construction. For a minimum support of 0.5%, FP-growth* runs in 1.187 s, while the construction of the transaction tree alone in the TM algorithm takes 1.281 s. The run-time difference between FP-growth and FP-growth* is not as large in the FP-growth* paper [7] as in this experiment ([7] uses a different implementation of FP-growth), which indicates that the implementation plays a great role.

VI. DISCUSSION

A. Overhead of constructing interval lists and interval comparison

One may be concerned that it takes extra effort to relabel transactions while constructing interval lists. Fortunately, the transaction tree is constructed just once, and the relabeling of transactions


is done just once by traversing the transaction tree in depth-first order. The relabeling time is negligible compared to the intersection time. For example, for the Connect-4 data with a support of 0.5, constructing the transaction tree takes 0.734 s, constructing the interval lists takes less than 0.001 s, and generating the frequent itemsets takes 870.609 s. In the FP-growth algorithm, constructing the first FP-tree takes 2.844 s, which is longer than the time to construct the transaction tree because of building the header table and node links. There is an overhead in interval comparison: since there are three possible relationships between two intervals, the average number of interval comparisons is 2, which is greater than the single comparison per id used in the Eclat algorithm. During the first several levels, however, the interval compression ratio is greater than 2. We therefore keep track of this compression ratio (coefficient), and when it becomes less than 2 we switch to single transaction ids as in the Eclat algorithm. Our algorithm thus combines the advantages of both FP-growth and Eclat: when the data can be compressed well by the transaction tree (one advantage of FP-growth is using the FP-tree to compress the data), we use interval-list intersection; when it cannot, we switch to id-list intersection as in Eclat.
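The interval-list intersection and the compression coefficient that triggers the switch can be sketched as follows. This is a minimal illustration under our own naming, not the paper's implementation; intervals are inclusive (start, end) pairs of mapped transaction ids, as in the TM representation.

```python
def intersect_intervals(a, b):
    """Intersect two sorted interval lists; each interval is an inclusive
    (start, end) pair of mapped transaction ids.

    Two pointers walk the lists: an overlapping pair contributes
    [max(starts), min(ends)], and the list whose interval ends first is
    advanced, so the cost is linear in the number of intervals.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            result.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result

def support(intervals):
    """Support of an itemset = total number of transaction ids covered."""
    return sum(e - s + 1 for s, e in intervals)

def compression_coefficient(interval_lists):
    """Average number of transaction ids per interval across the lists.

    When this average drops below 2 (the average cost of one interval
    comparison), interval lists no longer pay off and the algorithm
    switches to plain id-list intersection as in Eclat.
    """
    n_ids = sum(support(lst) for lst in interval_lists)
    n_intervals = sum(len(lst) for lst in interval_lists)
    return n_ids / n_intervals if n_intervals else 0.0
```

For instance, intersecting [(1, 10)] with [(4, 6), (9, 12)] gives [(4, 6), (9, 10)], a support of 5 computed from just two interval comparisons rather than five id comparisons.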

B. Run time

The data sets we have used in our experiments have often been used in previous research, and the times shown include the time needed for all the steps. Our algorithm outperforms FP-growth and dEclat. It is also much faster than Eclat. We did not show the comparison with Eclat because dEclat was shown to outperform Eclat in [11]. We believe that our algorithm will also be faster than the Apriori algorithm; we did not compare TM and Apriori since FP-growth, Eclat, and dEclat have all been shown to outperform Apriori [4] [8] [11].

C. Storage cost

The storage cost of maintaining the intervals of itemsets is less than that of maintaining id lists in the Eclat algorithm, because once an interval is generated, its corresponding node in the transaction tree is deleted. Once all the interval lists are generated, the transaction tree is removed, so we only need to store the interval lists. The storage is also less than that of the FP-tree (the FP-tree has a header table and node links). We use the lexicographic tree only to illustrate the depth-first search procedure, as in the Eclat algorithm. This tree is built on the fly and never built fully, so the lexicographic tree is not stored in full.

D. About comparisons

This paper focuses on algorithmic concepts rather than on implementations. For the same algorithm, the run time differs across implementations. We downloaded the dEclat and FP-growth implementations of Goethals and implemented our algorithm based on his code. The data structures (set, vector, multiset) and libraries (std) used are the same, and only the algorithmic parts differ. This makes the comparison fair. Although the implementations of MAFIA and FP-growth* used in this experiment are all in C/C++, the data structures and libraries used are different, which makes the comparison between algorithms unfair. For example, FP-tree construction should be slower than transaction tree construction, but in FP-growth* the implementation of FP-tree construction is faster than our implementation of transaction tree construction. Another example is that [7] uses a different implementation of FP-growth, so the run-time difference between FP-growth and FP-growth* is not as large there as in this experiment. For the TM algorithm we just modified Goethals' implementation of FP-tree construction and did not use faster implementations, because we want the comparison between TM and FP-growth to be fair. Our implementation, however, could be improved to run faster. We feel that if we develop an implementation tailored to the TM algorithm instead of just modifying the downloaded code, TM will be competitive with FP-growth*.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a new algorithm, TM, that uses the vertical database representation. The transaction ids of each itemset are transformed and compressed to continuous transaction interval lists in a different space using the transaction tree, and frequent itemsets are found by intersecting these transaction intervals along a lexicographic tree in depth-first order. This compression greatly reduces the intersection time. Through experiments, the TM algorithm has been shown to achieve significant performance improvement over FP-growth and dEclat


on datasets with short frequent patterns, and also some improvement on datasets with long frequent patterns. We have also performed a compression and time analysis of transaction mapping using the transaction tree and proved that transaction mapping can greatly compress the transaction ids into continuous transaction intervals, especially when the minimum support is high. Although FP-growth* is faster than TM in this experiment, the comparison is unfair. In our future work, we plan to improve the implementation of the TM algorithm and make a fair comparison with FP-growth*.

ACKNOWLEDGMENT

This work has been supported in part by the NSF Grants CCR-9912395 and ITR-0326155.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A.N. Swami, "Mining association rules between sets of items in large databases," Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Washington, DC, pp. 207-216, May 1993.

[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proceedings of 20th International Conference on Very Large Data Bases, Morgan Kaufmann, pp. 487-499, 1994.

[3] B. Goethals, ”Survey on Frequent Pattern Mining,” Manuscript,2003.

[4] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Dallas, Texas, pp. 1-12, May 2000.

[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-mine: Hyper-structure mining of frequent patterns in large databases," Proceedings of IEEE International Conference on Data Mining, pp. 441-448, 2001.

[6] A. Pietracaprina and D. Zandolin, "Mining frequent itemsets using Patricia tries," FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December 2003.

[7] G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December 2003.

[8] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 283-286, 1997.

[9] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah, "Turbo-charging vertical mining of large databases," Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Dallas, Texas, pp. 22-23, May 2000.

[10] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: a maximal frequent itemset algorithm for transactional databases," Proceedings of International Conference on Data Engineering, Heidelberg, Germany, pp. 443-452, April 2001.

[11] M.J. Zaki and K. Gouda, "Fast vertical mining using diffsets," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., ACM Press, New York, pp. 326-335, 2003.

[12] R. Agrawal, C. Aggarwal, and V. Prasad, "A tree projection algorithm for generation of frequent item sets," Parallel and Distributed Computing, pp. 350-371, 2000.

[13] R.J. Bayardo, "Efficiently mining long patterns from databases," Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Seattle, Washington, pp. 85-93, June 1998.

Mingjun Song received his first Ph.D. degree in remote sensing from the University of Connecticut. He is working at ADE Corporation as a software research engineer and is in his second Ph.D. program in Computer Science and Engineering at the University of Connecticut. His research interests include algorithms and complexity, data mining, pattern recognition, image processing, remote sensing, and

geographical information systems.

Sanguthevar Rajasekaran is a Full Professor and UTC Chair Professor of Computer Science and Engineering (CSE) at the University of Connecticut. He is also the Director of the Booth Engineering Center for Advanced Technologies (BECAT) at UConn. He received his M.E. degree in Automation from the Indian Institute of Science (Bangalore) in 1983, and his Ph.D. degree in

Computer Science from Harvard University in 1988. Before joining UConn, he served as a faculty member in the CISE Department of the University of Florida and in the CIS Department of the University of Pennsylvania. During 2000-2002, he was the Chief Scientist for Arcot Systems. His research interests include parallel algorithms, bioinformatics, data mining, randomized computing, computer simulations, and combinatorial optimization. He has published over 130 articles in journals and conferences. He has co-authored two texts on algorithms and co-edited four books on algorithms and related topics. He is an elected member of the Connecticut Academy of Science and Engineering (CASE).