

Mining Correlated High-Utility Itemsets using the Bond Measure

Philippe Fournier-Viger1, Jerry Chun-Wei Lin2, Tai Dinh3, Hoai Bac Le4

1 School of Natural Sciences and Humanities, Harbin Institute of Technology Shenzhen Graduate School, China

2 School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, China

3 Faculty of Information Technology, Ho Chi Minh city Industry and Trade College

4 Faculty of Information Technology, University of Science, Vietnam

[email protected], [email protected], [email protected], [email protected]

Abstract. High-utility itemset mining is the task of finding the sets of items that yield a high utility (e.g. profit) in quantitative transaction databases. An important limitation of previous work on high-utility itemset mining is that utility is generally used as the sole criterion for assessing the interestingness of patterns. This leads to finding many itemsets that have a high profit but contain items that are weakly correlated. To address this issue, this paper proposes to integrate the concept of correlation in high-utility itemset mining to find profitable itemsets that are highly correlated, using the bond measure. An algorithm named FCHM (Fast Correlated high-utility itemset Miner) is proposed to efficiently discover correlated high-utility itemsets. Experimental results show that FCHM is highly efficient and can prune a huge amount of weakly correlated HUIs.

Keywords: high-utility itemset mining, correlation, bond measure, correlated high-utility itemsets

1 Introduction

High-Utility Itemset Mining (HUIM) [3, 12, 13, 15] is an important data mining task. Given a customer transaction database, HUIM consists of discovering high-utility itemsets, i.e. groups of items (itemsets) that have a high utility (e.g. yield a high profit). HUIM is a generalization of the problem of Frequent Itemset Mining (FIM) [1], where items can appear more than once in each transaction and where each item has a weight (e.g. unit profit). HUIM has a wide range of applications such as website click stream analysis, cross-marketing in retail stores and biomedical applications [12, 15]. Although numerous algorithms have been proposed for HUIM [3, 8, 11–13, 15], an important limitation of current algorithms is that utility is generally used as the sole criterion for assessing the interestingness of patterns. This leads to finding many itemsets yielding a high profit but containing items that are weakly correlated. Those itemsets are misleading or useless for making marketing decisions. For example, consider the transaction database of a retail store. Current algorithms may find that buying a 50-inch plasma television and a pen is a high-utility itemset, because these two items have globally generated a high profit when sold together. But it would be a mistake to use this pattern to promote plasma televisions to people who buy pens because, if one looks closely, these two items are rarely sold together. The reason why this pattern may be a HUI despite the very low correlation between pens and plasma televisions is that plasma televisions are very expensive, and thus almost any item combined with a plasma television may be a HUI. This limitation of current HUIM algorithms is important. In the experimental section of this paper, it will be shown that often less than 1% of the patterns found by traditional HUIM algorithms contain items that are strongly correlated. It is thus a key research problem to design algorithms for discovering HUIs having a high correlation.

This paper addresses this issue by introducing the concept of correlation in high-utility itemset mining, using the bond measure [6, 7]. The contributions of this paper are threefold. First, we combine the concept of bond correlation used in FIM with the concept of HUIs to define a new type of patterns named correlated high-utility itemsets (CHIs), and study their properties. Second, we present a novel algorithm named FCHM (Fast Correlated high-utility itemset Miner) to efficiently discover the CHIs in transaction databases. The proposed algorithm integrates several strategies to discover correlated high-utility itemsets efficiently, namely (1) Directly Outputting Single items (DOS), (2) Pruning Supersets of Non correlated itemsets (PSN), (3) Pruning using the Bond Matrix (PBM), and (4) Abandoning Utility-List construction early (AUL). Third, an extensive experimental evaluation is carried out to compare the efficiency of FCHM with the state-of-the-art FHM algorithm for HUIM. Experimental results show that FCHM can be more than two orders of magnitude faster than FHM, and in some cases discovers more than five orders of magnitude fewer patterns by only mining correlated HUIs and avoiding weakly correlated HUIs.

The rest of this paper is organized as follows. Sections 2, 3, 4 and 5 respectively present preliminaries and related work on HUIM, the FCHM algorithm, the experimental evaluation and the conclusion.

2 Preliminaries and related work

This section introduces preliminary definitions related to high-utility itemset mining and discusses related work. Let I be a set of items (symbols). A transaction database is a set of transactions D = {T1, T2, ..., Tn} such that for each transaction Tc, Tc ⊆ I and Tc has a unique identifier c called its Tid. Each item i ∈ I is associated with a positive number p(i), called its external utility (e.g. representing the unit profit of this item). For each transaction Tc such that i ∈ Tc, a positive number q(i, Tc) is called the internal utility of i (e.g. representing the purchase quantity of item i in transaction Tc). For example, consider the database of Table 1, which will be used as our running example. This database contains five transactions (T1, T2, ..., T5). For the sake of readability, internal utility values are shown as integer values beside each item in transactions. For example, transaction T4 indicates that items a, c, e and g appear in this transaction with an internal utility of respectively 2, 6, 2 and 5. Table 2 indicates that the external utilities of these items are respectively 5, 1, 3 and 1.

Table 1: A transaction database

TID   Transaction
T1    (a, 1), (b, 5), (c, 1), (d, 3), (e, 1), (f, 5)
T2    (b, 4), (c, 3), (d, 3), (e, 1)
T3    (a, 1), (c, 1), (d, 1)
T4    (a, 2), (c, 6), (e, 2), (g, 5)
T5    (b, 2), (c, 2), (e, 1), (g, 2)

Table 2: External utility values

Item          a   b   c   d   e   f   g
Unit profit   5   2   1   2   3   1   1

The utility of an item i in a transaction Tc is denoted as u(i, Tc) and defined as p(i) × q(i, Tc). The utility of an itemset X (a group of items X ⊆ I) in a transaction Tc is denoted as u(X, Tc) and defined as $u(X, T_c) = \sum_{i \in X} u(i, T_c)$. The utility of an itemset X is denoted as u(X) and defined as $u(X) = \sum_{T_c \in g(X)} u(X, T_c)$, where g(X) is the set of transactions containing X. For example, the utility of item a in T4 is u(a, T4) = 5 × 2 = 10. The utility of the itemset {a, c} in T4 is u({a, c}, T4) = u(a, T4) + u(c, T4) = 5 × 2 + 1 × 6 = 16. The utility of the itemset {a, c} is u({a, c}) = u({a, c}, T1) + u({a, c}, T3) + u({a, c}, T4) = u(a, T1) + u(a, T3) + u(a, T4) + u(c, T1) + u(c, T3) + u(c, T4) = 5 + 5 + 10 + 1 + 1 + 6 = 28. The problem of high-utility itemset mining is to discover all high-utility itemsets. An itemset X is a high-utility itemset if its utility u(X) is no less than a user-specified minimum utility threshold minutil. Otherwise, X is a low-utility itemset. For instance, if minutil = 30, the complete set of HUIs is {a, c, e} : 31, {a, b, c, d, e, f} : 30, {b, c, d} : 34, {b, c, d, e} : 40, {b, c, e} : 37, {b, d} : 30, {b, d, e} : 36, and {b, e} : 31, where each HUI is annotated with its utility.
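The definitions above can be checked on the running example with a short Python sketch. It is only an illustration: the names DB, PROFIT, utility and high_utility_itemsets are hypothetical, and the exhaustive enumeration is a naive baseline, not one of the algorithms discussed in this paper.

```python
from itertools import combinations

# Running example: purchase quantities (Table 1) and unit profits (Table 2).
DB = {
    "T1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1, "f": 5},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T3": {"a": 1, "c": 1, "d": 1},
    "T4": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def utility(itemset, db=DB, profit=PROFIT):
    """u(X): sum of u(X, Tc) over the transactions containing every item of X."""
    items = set(itemset)
    return sum(
        sum(profit[i] * t[i] for i in items)
        for t in db.values()
        if items.issubset(t)
    )


def high_utility_itemsets(minutil, db=DB, profit=PROFIT):
    """Naive baseline: test every non-empty itemset against minutil."""
    all_items = sorted(profit)
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            u = utility(candidate, db, profit)
            if u >= minutil:
                result["".join(candidate)] = u
    return result


if __name__ == "__main__":
    print(utility("ac"))              # utility of {a, c}
    print(high_utility_itemsets(30))  # HUIs of the running example
```

The enumeration visits all 2^|I| − 1 itemsets, which is only feasible for toy data; the pruning properties introduced next exist precisely to avoid it.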

HUIM algorithms such as Two-Phase [13], BAHUI [11], and UPGrowth+ [15] utilize a measure called the Transaction-Weighted Utilization (TWU) [13, 15] to prune the search space. This measure has the useful properties of being an upper bound on the utility of itemsets and being anti-monotonic (unlike the utility measure). Thus, an itemset always has a TWU no less than the TWU of its supersets. The aforementioned algorithms operate in two phases. In the first phase, they identify itemsets that may be high-utility itemsets by considering their TWUs. Then, in the second phase, they scan the database to calculate the exact utility of all candidates found in the first phase and discard those having a low utility. The TWU measure and its pruning property are defined as follows.

Definition 1 (Transaction weighted utilization). The transaction utility (TU) of a transaction Tc is the sum of the utilities of all the items in Tc, i.e. $TU(T_c) = \sum_{x \in T_c} u(x, T_c)$. The transaction-weighted utilization (TWU) of an itemset X is defined as the sum of the transaction utilities of the transactions containing X, i.e. $TWU(X) = \sum_{T_c \in g(X)} TU(T_c)$.

Example 1. The TUs of T1, T2, T3, T4 and T5 are respectively 30, 20, 8, 27 and 11. The TWU of single items a, b, c, d, e, f and g are respectively 65, 61, 96, 58, 88, 30 and 38. TWU({c, d}) = TU(T1) + TU(T2) + TU(T3) = 30 + 20 + 8 = 58.

Property 1 (Pruning search space using the TWU). Let X be an itemset. If TWU(X) < minutil, then X and its supersets are low utility [13].
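As an illustration of Definition 1 and Property 1, the following Python sketch computes TU and TWU on the running example. The function names are hypothetical and chosen for this example only.

```python
# Running example: purchase quantities (Table 1) and unit profits (Table 2).
DB = {
    "T1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1, "f": 5},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T3": {"a": 1, "c": 1, "d": 1},
    "T4": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def transaction_utility(tid):
    """TU(Tc): sum of the utilities of all items in transaction Tc."""
    return sum(PROFIT[i] * q for i, q in DB[tid].items())


def twu(itemset):
    """TWU(X): sum of TU(Tc) over the transactions Tc containing X."""
    items = set(itemset)
    return sum(transaction_utility(tid) for tid, t in DB.items() if items.issubset(t))


def can_prune(itemset, minutil):
    """Property 1: if TWU(X) < minutil, X and all its supersets are low utility."""
    return twu(itemset) < minutil


if __name__ == "__main__":
    print([transaction_utility(tid) for tid in DB])
    print({i: twu(i) for i in sorted(PROFIT)})
    print(twu("cd"), can_prune("ab", 35))
```

Because TWU(X) is an upper bound on u(X) and can only shrink when items are added, a single comparison discards a whole branch of the search space.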

A drawback of algorithms working in two phases is that they may generate a huge amount of candidates. To address this issue, a more efficient depth-first search algorithm named HUI-Miner [12] was proposed, which discovers HUIs directly using a single phase. A faster version of HUI-Miner, called FHM [3], was then proposed by introducing a technique called co-occurrence pruning. It was shown that FHM can be up to 6 times faster than HUI-Miner. The FHM algorithm associates a structure named utility-list to each itemset [3, 12]. Utility-lists allow calculating the utility of an itemset quickly by making join operations with utility-lists of shorter patterns.

Definition 2 (Utility-list). Let ≻ be any total order on items from I. The utility-list ul(X) of an itemset X in a database D is a set of tuples such that there is a tuple (tid, iutil, rutil) for each transaction Ttid containing X. The iutil element of a tuple is the utility of X in Ttid, i.e. u(X, Ttid). The rutil element of a tuple is defined as $\sum_{i \in T_{tid} \wedge i \succ x\, \forall x \in X} u(i, T_{tid})$.

Example 2. Assume that ≻ is the alphabetical order. The utility-list of {a} is {(T1, 5, 25), (T3, 5, 3), (T4, 10, 17)}. The utility-list of {d} is {(T1, 6, 8), (T2, 6, 3), (T3, 2, 0)}. The utility-list of {a, d} is {(T1, 11, 8), (T3, 7, 0)}.
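A utility-list can be built directly from Definition 2. The sketch below assumes the alphabetical order of Example 2 and scans the database for a given itemset; the function name utility_list is hypothetical, and real implementations build the lists of single items in one pass instead.

```python
# Running example: purchase quantities (Table 1) and unit profits (Table 2).
DB = {
    "T1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1, "f": 5},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T3": {"a": 1, "c": 1, "d": 1},
    "T4": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def utility_list(itemset):
    """Return ul(X) as a list of (tid, iutil, rutil) tuples (alphabetical order assumed)."""
    items = set(itemset)
    last = max(items)  # an item follows every x in X iff it follows the greatest x
    tuples = []
    for tid, t in DB.items():
        if items.issubset(t):
            iutil = sum(PROFIT[i] * t[i] for i in items)
            rutil = sum(PROFIT[i] * q for i, q in t.items() if i > last)
            tuples.append((tid, iutil, rutil))
    return tuples


if __name__ == "__main__":
    for x in ("a", "d", "ad"):
        print(x, utility_list(x))
```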

The FHM algorithm scans the database once to create the utility-lists of itemsets containing a single item. Then, the utility-lists of larger itemsets are constructed by joining the utility-lists of smaller itemsets. The join operation for single items is performed as follows. Consider two items x, y such that y ≻ x, and their utility-lists ul({x}) and ul({y}). The utility-list of {x, y} is obtained by creating a tuple (ex.tid, ex.iutil + ey.iutil, ey.rutil) for each pair of tuples ex ∈ ul({x}) and ey ∈ ul({y}) such that ex.tid = ey.tid. The join operation for two itemsets P ∪ {x} and P ∪ {y} such that y ≻ x is performed as follows. Let ul(P), ul(P ∪ {x}) and ul(P ∪ {y}) be the utility-lists of P, P ∪ {x} and P ∪ {y}. The utility-list of P ∪ {x, y} is obtained by creating a tuple (ex.tid, ex.iutil + ey.iutil − ep.iutil, ey.rutil) for each set of tuples ex ∈ ul(P ∪ {x}), ey ∈ ul(P ∪ {y}), ep ∈ ul(P) such that ex.tid = ey.tid = ep.tid. The utility-list structure makes it possible to calculate the utility of itemsets and prune the search space as follows.

Property 2 (Calculating the utility of an itemset using its utility-list). The utility of an itemset is the sum of iutil values in its utility-list [12].

Property 3 (Pruning search space using a utility-list). Let X be an itemset. Let the extensions of X be the itemsets that can be obtained by appending an item y to X such that y ≻ i, ∀i ∈ X. If the sum of iutil and rutil values in ul(X) is less than minutil, X and its extensions are low utility [12].
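The join operation and Properties 2 and 3 can be sketched in Python as follows. The utility-lists are hard-coded from the running example under the alphabetical order, and the function names (join, utility_of, may_extend) are hypothetical; passing None for ul(P) stands for the empty prefix.

```python
def join(ul_p, ul_px, ul_py):
    """Join ul(P ∪ {x}) and ul(P ∪ {y}) into ul(P ∪ {x, y}); ul_p is None if P is empty."""
    y_by_tid = {tid: (iutil, rutil) for tid, iutil, rutil in ul_py}
    p_by_tid = {tid: iutil for tid, iutil, _ in ul_p} if ul_p else {}
    joined = []
    for tid, iutil_x, _ in ul_px:
        if tid in y_by_tid:
            iutil_y, rutil_y = y_by_tid[tid]
            # The utility of P is counted in both operands, so subtract it once.
            joined.append((tid, iutil_x + iutil_y - p_by_tid.get(tid, 0), rutil_y))
    return joined


def utility_of(ul):
    """Property 2: the utility of an itemset is the sum of its iutil values."""
    return sum(iutil for _, iutil, _ in ul)


def may_extend(ul, minutil):
    """Property 3: extensions are worth exploring only if sum(iutil + rutil) >= minutil."""
    return sum(iutil + rutil for _, iutil, rutil in ul) >= minutil


# Utility-lists of single items in the running example (alphabetical order).
UL_A = [("T1", 5, 25), ("T3", 5, 3), ("T4", 10, 17)]
UL_C = [("T1", 1, 14), ("T2", 3, 9), ("T3", 1, 2), ("T4", 6, 11), ("T5", 2, 5)]
UL_E = [("T1", 3, 5), ("T2", 3, 0), ("T4", 6, 5), ("T5", 3, 2)]

if __name__ == "__main__":
    ul_ac = join(None, UL_A, UL_C)
    ul_ae = join(None, UL_A, UL_E)
    ul_ace = join(UL_A, ul_ac, ul_ae)
    print(ul_ac, utility_of(ul_ac), may_extend(ul_ac, 30))
    print(ul_ace, utility_of(ul_ace))
```

Indexing one operand by tid in a dictionary is a convenience of this sketch; since utility-lists are sorted by tid, an implementation may equally use a merge or a binary search.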

Although much work has been done on developing efficient HUIM algorithms, a key problem of current HUIM algorithms is that they may find a huge amount of itemsets containing items that are weakly correlated (as will be shown in the experimental study). A promising solution to this problem, which is explored in this paper, is to integrate the concept of correlation as an additional constraint in high-utility itemset mining. There have been several works on mining correlated patterns in the field of frequent pattern mining, using various correlation measures such as the all-confidence [14], frequency affinity [2], coherence [5], and bond [6, 7, 14]. To our knowledge, only the works of Ahmed et al. [2] and Lin et al. [10] have considered correlation in high-utility itemset mining, by employing the measure of frequency affinity. In this paper, we instead rely on the bond measure, which has recently attracted more attention [6, 7, 14].

The concept of correlated itemsets based on the bond measure is defined as follows [7]. The conjunctive support of an itemset X in a database D is denoted as consup(X) and defined as |g(X)|, where |g(X)| is the number of transactions in g(X). The disjunctive support of an itemset X in a database D is denoted as dissup(X) and defined as |{Tc ∈ D | X ∩ Tc ≠ ∅}|. The bond of itemset X is defined as bond(X) = consup(X)/dissup(X). An itemset X is said to be correlated if bond(X) ≥ minbond, for a given user-specified minbond threshold (0 ≤ minbond ≤ 1). The bond measure is anti-monotonic.

Property 4 (Anti-monotonicity of the bond measure). Let X and Y be two itemsets such that X ⊆ Y. It follows that bond(X) ≥ bond(Y) [6].

Based on the above definitions, we define the problem of mining Correlated High-utility Itemsets (CHIs) using the bond measure.

Definition 3 (Correlated high-utility itemset mining). The problem of correlated high-utility itemset mining is to discover all correlated high-utility itemsets. An itemset X is a correlated high-utility itemset if it is a high-utility itemset and its bond bond(X) is no less than a user-specified minimum bond threshold minbond.

For example, if minutil = 30 and minbond = 0.5, the complete set of CHIs is {b, d}, {b, e}, and {b, c, e}, which respectively have a bond of 0.50, 0.75, and 0.60. Thus, three itemsets are CHIs among the eight HUIs for the running example.
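The bond measure and Definition 3 can be illustrated with the short Python sketch below. It computes both supports by scanning the transactions, which is the naive approach; the list HUIS simply restates the high-utility itemsets listed earlier for minutil = 30, and all names are hypothetical.

```python
# Items of transactions T1..T5 of the running example (quantities are irrelevant here).
TRANSACTIONS = ["abcdef", "bcde", "acd", "aceg", "bceg"]

# High-utility itemsets of the running example for minutil = 30.
HUIS = ["ace", "abcdef", "bcd", "bcde", "bce", "bd", "bde", "be"]


def consup(itemset):
    """Conjunctive support: number of transactions containing every item of X."""
    return sum(1 for t in TRANSACTIONS if set(itemset).issubset(t))


def dissup(itemset):
    """Disjunctive support: number of transactions containing at least one item of X."""
    return sum(1 for t in TRANSACTIONS if not set(itemset).isdisjoint(t))


def bond(itemset):
    """bond(X) = consup(X) / dissup(X), taken as 0 when no item of X occurs."""
    d = dissup(itemset)
    return consup(itemset) / d if d else 0.0


def correlated_huis(minbond):
    """Keep the HUIs whose bond is no less than minbond."""
    return {x: bond(x) for x in HUIS if bond(x) >= minbond}


if __name__ == "__main__":
    print(correlated_huis(0.5))
    print(bond("be") >= bond("bce"))  # anti-monotonicity (Property 4) on one pair
```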

3 The FCHM algorithm

This section presents the proposed FCHM algorithm. It first describes the main procedure, which is based on the FHM [3] algorithm, and discovers all HUIs. Then, it explains how this procedure is modified to find only CHIs.

The main procedure of FCHM (Algorithm 1) takes as input a transaction database with utility values and the minutil threshold. The algorithm first scans the database to calculate the TWU of each item. Then, the algorithm identifies the set I∗ of all items having a TWU no less than minutil (other items are ignored since they cannot be part of a high-utility itemset by Property 1). The TWU values of items are then used to establish a total order ≻ on items, which is the order of ascending TWU values (as suggested in [12]). A second database scan is then performed. During this database scan, items in transactions are reordered according to the total order ≻, the utility-list of each item i ∈ I∗ is built and a structure named EUCS (Estimated Utility Co-Occurrence Structure) is built [3]. This latter structure is defined as a set of triples of the form (a, b, c) ∈ I∗ × I∗ × R. A triple (a, b, c) indicates that TWU({a, b}) = c. The EUCS can be implemented as a triangular matrix that stores these triples for all pairs of items. For example, the EUCS for the running example is shown in Fig. 1. The EUCS is very useful as it stores the TWU of all pairs of items, information that will later be used for pruning the search space. For instance, the top-left cell indicates that TWU({a, b}) = 30. Building the EUCS is very fast (it is performed with a single database scan) and occupies a small amount of memory, bounded by |I∗| × |I∗|. The reader is referred to the paper about FHM [3] for more details about the construction of this structure and how it can be implemented using hashmaps. After the construction of the EUCS, the depth-first search exploration of itemsets starts by calling the recursive procedure Search with the empty itemset ∅, the set of single items I∗, minutil and the EUCS structure.

TWU
Item    a    b    c    d    e    f
b      30
c      65   61
d      38   50   58
e      57   61   88   50
f      30   30   30   30   30
g      27   11   38    0   38    0

Fig. 1: The EUCS

Support/BOND
Item    a    b    c    d    e    f
b       1
c       3    3
d       2    2    3
e       2    3    4    2
f       1    1    1    1    1
g       1    1    2    0    2    0

Fig. 2: The Bond Matrix
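The construction of the EUCS described above can be sketched with a Python dictionary keyed by item pairs. This is a simplified illustration: it keeps all items and uses the alphabetical order, whereas FCHM restricts the structure to I∗ and uses the ascending TWU order; the name build_eucs is hypothetical.

```python
from itertools import combinations

# Running example: purchase quantities (Table 1) and unit profits (Table 2).
DB = {
    "T1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1, "f": 5},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T3": {"a": 1, "c": 1, "d": 1},
    "T4": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def build_eucs(db, profit):
    """Map each co-occurring pair (x, y), x before y, to TWU({x, y}) in one scan."""
    eucs = {}
    for t in db.values():
        tu = sum(profit[i] * q for i, q in t.items())  # transaction utility
        for pair in combinations(sorted(t), 2):
            eucs[pair] = eucs.get(pair, 0) + tu
    return eucs


if __name__ == "__main__":
    eucs = build_eucs(DB, PROFIT)
    print(eucs[("a", "b")], eucs[("c", "e")], eucs.get(("d", "g"), 0))
```

Storing only pairs that co-occur at least once is what makes a hashmap representation compact: pairs such as {d, g} never receive an entry.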

The Search procedure (Algorithm 2) takes as input (1) an itemset P, (2) extensions of P having the form Pz, meaning that Pz was previously obtained by appending an item z to P, (3) minutil and (4) the EUCS. The search procedure operates as follows. For each extension Px of P, if the sum of the iutil values of the utility-list of Px is no less than minutil, then Px is a high-utility itemset and it is output (cf. Property 2). Then, if the sum of iutil and rutil values in the utility-list of Px is no less than minutil, it means that extensions of Px should be explored. This is performed by merging Px with all extensions Py of P such that y ≻ x to form extensions of the form Pxy containing |Px| + 1 items. The utility-list of Pxy is then constructed as in FHM by calling the Construct procedure to join the utility-lists of P, Px and Py. This latter procedure is the same as in FHM [12] and is thus not detailed here. Then, a recursive call to the Search procedure with Pxy is done to calculate its utility and explore its extension(s). Since the Search procedure starts from single items, it recursively explores the search space of itemsets by appending single items and it only prunes the search space based on Property 3. It can be easily seen based on Properties 1, 2 and 3 that this procedure is correct and complete to discover all high-utility itemsets.

Algorithm 1: The FCHM algorithm

input : D: a transaction database, minutil: a user-specified threshold
output: the set of high-utility itemsets

1 Scan D to calculate the TWU of single items;
2 I∗ ← each item i such that TWU(i) ≥ minutil;
3 Let ≻ be the total order of TWU ascending values on I∗;
4 Scan D to build the utility-list of each item i ∈ I∗ and build the EUCS;
5 Output each item i ∈ I∗ such that SUM({i}.utilitylist.iutils) ≥ minutil;
6 Search (∅, I∗, minutil, EUCS);

Algorithm 2: The Search procedure

input : P: an itemset, ExtensionsOfP: a set of extensions of P, minutil: a user-specified threshold, EUCS: the EUCS structure
output: the set of high-utility itemsets

1 foreach itemset Px ∈ ExtensionsOfP do
2   if SUM(Px.utilitylist.iutils) + SUM(Px.utilitylist.rutils) ≥ minutil then
3     ExtensionsOfPx ← ∅;
4     foreach itemset Py ∈ ExtensionsOfP such that y ≻ x do
5       if ∃(x, y, c) ∈ EUCS such that c ≥ minutil then
6         Pxy ← Px ∪ Py;
7         Pxy.utilitylist ← Construct (P, Px, Py);
8         ExtensionsOfPx ← ExtensionsOfPx ∪ Pxy;
9         if SUM(Pxy.utilitylist.iutils) ≥ minutil then output Pxy;
10      end
11    end
12    Search (Px, ExtensionsOfPx, minutil, EUCS);
13  end
14 end
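To make the control flow of Algorithms 1 and 2 concrete, here is a compact Python sketch of the base procedure that mines all HUIs (without any bond-based strategy). It is an illustrative reading of the pseudocode, not the authors' implementation: the function names, the dictionary-based EUCS, the alphabetical tie-break in the TWU order and the representation of itemsets as tuples are all assumptions.

```python
from itertools import combinations

DB = {
    "T1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1, "f": 5},
    "T2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T3": {"a": 1, "c": 1, "d": 1},
    "T4": {"a": 2, "c": 6, "e": 2, "g": 5},
    "T5": {"b": 2, "c": 2, "e": 1, "g": 2},
}
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3, "f": 1, "g": 1}


def construct(ul_p, ul_px, ul_py):
    """Join ul(Px) and ul(Py) into ul(Pxy); ul_p is None when P is empty."""
    y_by_tid = {tid: (iu, ru) for tid, iu, ru in ul_py}
    p_by_tid = {tid: iu for tid, iu, _ in ul_p} if ul_p else {}
    return [(tid, iu_x + y_by_tid[tid][0] - p_by_tid.get(tid, 0), y_by_tid[tid][1])
            for tid, iu_x, _ in ul_px if tid in y_by_tid]


def search(ul_p, extensions, minutil, eucs, found):
    """extensions: list of (itemset tuple, utility-list), sorted by the total order."""
    for k, (px, ul_px) in enumerate(extensions):
        if sum(iu + ru for _, iu, ru in ul_px) < minutil:
            continue  # Property 3: Px has no high-utility extension
        extensions_of_px = []
        for py, ul_py in extensions[k + 1:]:
            if eucs.get((px[-1], py[-1]), 0) < minutil:
                continue  # EUCS pruning: TWU({x, y}) is too low
            pxy = px + (py[-1],)
            ul_pxy = construct(ul_p, ul_px, ul_py)
            extensions_of_px.append((pxy, ul_pxy))
            u = sum(iu for _, iu, _ in ul_pxy)
            if u >= minutil:
                found[pxy] = u
        search(ul_px, extensions_of_px, minutil, eucs, found)


def mine_huis(db, profit, minutil):
    # First scan: TU of each transaction and TWU of each item.
    tu = {tid: sum(profit[i] * q for i, q in t.items()) for tid, t in db.items()}
    twu = {}
    for tid, t in db.items():
        for i in t:
            twu[i] = twu.get(i, 0) + tu[tid]
    # I*: promising items in ascending TWU order (ties broken alphabetically).
    items = sorted((i for i in twu if twu[i] >= minutil), key=lambda i: (twu[i], i))
    rank = {i: r for r, i in enumerate(items)}
    # Second scan: utility-lists of single items and the EUCS.
    uls, eucs = {i: [] for i in items}, {}
    for tid, t in db.items():
        kept = sorted((i for i in t if i in rank), key=rank.get)
        remaining = sum(profit[i] * t[i] for i in kept)
        for i in kept:
            remaining -= profit[i] * t[i]
            uls[i].append((tid, profit[i] * t[i], remaining))
        for pair in combinations(kept, 2):
            eucs[pair] = eucs.get(pair, 0) + tu[tid]
    found = {}
    for i in items:
        u = sum(iu for _, iu, _ in uls[i])
        if u >= minutil:
            found[(i,)] = u
    search(None, [((i,), uls[i]) for i in items], minutil, eucs, found)
    return found


if __name__ == "__main__":
    for itemset, u in sorted(mine_huis(DB, PROFIT, 30).items()):
        print(itemset, u)
```

Itemsets are reported as tuples in the TWU-based order (for instance ('d', 'b') for {b, d}), so they should be normalized before being compared with results listed in another order.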

Algorithm 3: The Construct procedure

input : P: an itemset, Px: the extension of P with an item x, Py: the extension of P with an item y
output: the utility-list of Pxy

1 UtilityListOfPxy ← ∅;
2 foreach tuple ex ∈ Px.utilitylist do
3   if ∃ey ∈ Py.utilitylist and ex.tid = ey.tid then
4     if P.utilitylist ≠ ∅ then
5       Search element e ∈ P.utilitylist such that e.tid = ex.tid;
6       exy ← (ex.tid, ex.iutil + ey.iutil − e.iutil, ey.rutil);
7     end
8     else exy ← (ex.tid, ex.iutil + ey.iutil, ey.rutil);
9     UtilityListOfPxy ← UtilityListOfPxy ∪ {exy};
10  end
11 end
12 return UtilityListOfPxy;

We now explain how the algorithm is modified to mine only CHIs, rather than all HUIs. The modified algorithm takes minbond as an additional parameter.

Calculating the bond measure efficiently. To mine only CHIs, it is important to be able to calculate the bond of any itemset efficiently. A naive approach would be to scan the database for each HUI to calculate its disjunctive support and conjunctive support, and then calculate its bond. However, this would be inefficient. A better approach, used in FCHM, is the following. A structure called disjunctive bit vector [7] is appended to each utility-list to efficiently calculate the disjunctive support of any itemset.

Definition 4 (Disjunctive bit vector). The disjunctive bit vector of an itemset X in a database D is denoted as bv(X). It contains |D| bits, where the j-th bit is set to 1 if ∃i ∈ X such that i ∈ Tj, and is otherwise set to 0.

For example, bv(a) = 10110, bv(b) = 11001 and bv(c) = 11111. The disjunctive bit vector of each item is created during the second database scan (Line 4 of Algorithm 1). Then, the disjunctive bit vector of any larger itemset Pxy explored by the search procedure is obtained very efficiently by performing the logical OR of the bit vectors of Px and Py. This is added before Line 1 of the utility-list construction procedure (Algorithm 3). For example, bv({a, b}) = 10110 OR 11001 = 11111, and thus dissup({a, b}) = |11111| = 5.

An interesting observation is that the cardinality |bv(Pxy)| of the disjunctive bit vector of an itemset Pxy (its number of bits set to 1) is equal to its disjunctive support, and that the cardinality |ul(Pxy)| of its utility-list is equal to its conjunctive support. Thus, the bond of an itemset Pxy can be easily obtained using its utility-list and disjunctive bit vector, as follows.

Property 5 (Calculating the bond of an itemset using its utility-list). Let X be an itemset. The bond of X can be calculated as |ul(X)|/|bv(X)|, where |ul(X)| is the number of elements in the utility-list of X and |bv(X)| is the number of bits set to 1 in bv(X).
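Definition 4 and Property 5 can be illustrated with Python integers used as bit vectors. This is a sketch under assumptions: the leftmost bit stands for T1 as in the examples above, and the list of transaction identifiers returned by the hypothetical helper tids stands in for the utility-list, since only its length matters here.

```python
# Items of transactions T1..T5 of the running example.
TRANSACTIONS = ["abcdef", "bcde", "acd", "aceg", "bceg"]


def bit_vector(item):
    """bv({item}) as an integer whose leftmost bit corresponds to T1."""
    bits = 0
    for t in TRANSACTIONS:
        bits = (bits << 1) | int(item in t)
    return bits


def cardinality(bv):
    """Number of bits set to 1, i.e. the disjunctive support."""
    return bin(bv).count("1")


def tids(itemset):
    """Identifiers of the transactions containing X (one per utility-list tuple)."""
    return [k for k, t in enumerate(TRANSACTIONS, start=1) if all(i in t for i in itemset)]


def bond(itemset):
    """Property 5: bond(X) = |ul(X)| / |bv(X)|, with bv(X) the OR of the item vectors."""
    bv = 0
    for i in itemset:
        bv |= bit_vector(i)
    return len(tids(itemset)) / cardinality(bv)


if __name__ == "__main__":
    print(format(bit_vector("a"), "05b"), format(bit_vector("b"), "05b"))
    print(format(bit_vector("a") | bit_vector("b"), "05b"))
    print(bond("ab"), bond("be"))
```

The OR of two machine words covers many transactions at once, which is why this representation is much cheaper than rescanning the database for each disjunctive support.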

Using the above property, it is possible to calculate the bond of any itemset generated by the search procedure after its utility-list has been constructed. The algorithm is modified to only output HUIs that are correlated itemsets (having a bond no less than minbond). The resulting algorithm is correct and complete for mining CHIs. To improve the performance of FCHM, the next paragraphs describe four additional strategies incorporated in FCHM to discover CHIs more efficiently.

Strategy 1. Directly Outputting Single items (DOS). A first optimization is based on the observation that the bond of single items is always equal to 1. Thus, HUIs containing a single item can be directly output without calculating their bond.

Strategy 2. Pruning Supersets of Non correlated itemsets (PSN). Because the bond measure is anti-monotonic (Property 4), if the bond of an itemset Pxy is less than minbond, the extensions of Pxy should not be explored. Thus, Line 8 of Algorithm 2 is modified so that if bond(Pxy) < minbond for an itemset Pxy, Pxy is not added to the set ExtensionsOfPx.

Strategy 3. Pruning using the Bond Matrix (PBM). The third strategy is introduced to avoid constructing the utility-list of an itemset Pxy, and to prune all its extensions. During the second database scan, a novel structure named Bond Matrix is created to store the bond of all itemsets containing two items from I∗. The design of this structure is similar to the EUCS. The Bond Matrix is formally defined as follows. The Bond Matrix is a set of triples of the form (a, b, c) ∈ I∗ × I∗ × R. A triple (a, b, c) indicates that bond({a, b}) = c. Note that it only stores triples of the form (a, b, c) such that c ≠ 0. For example, the Bond Matrix for the running example is shown in Fig. 2.

Then, Line 5 of Algorithm 2 is modified to add the condition that the utility-list of Pxy should only be built if ∃(x, y, c) ∈ BondMatrix such that c ≥ minbond. It can be easily seen that this pruning condition preserves the correctness and completeness of FCHM, because any superset of an itemset {x, y} such that bond({x, y}) < minbond is not a CHI by Property 4.
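One possible way to obtain the Bond Matrix in a single pass is sketched below in Python. The text does not specify how the pair bonds are computed during the scan; this sketch assumes that pair and item supports are counted and that the disjunctive support of a pair is derived by inclusion-exclusion, dissup({a, b}) = sup(a) + sup(b) − consup({a, b}). The names build_bond_matrix and passes_pbm are hypothetical, and the alphabetical order is used instead of the TWU order.

```python
from itertools import combinations

# Items of transactions T1..T5 of the running example.
TRANSACTIONS = ["abcdef", "bcde", "acd", "aceg", "bceg"]


def build_bond_matrix(transactions):
    """Map each co-occurring pair (a, b), a before b, to bond({a, b})."""
    item_support, pair_support = {}, {}
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            item_support[i] = item_support.get(i, 0) + 1
        for pair in combinations(items, 2):
            pair_support[pair] = pair_support.get(pair, 0) + 1
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
    return {
        (a, b): c / (item_support[a] + item_support[b] - c)
        for (a, b), c in pair_support.items()
    }


def passes_pbm(matrix, x, y, minbond):
    """PBM check: build ul(Pxy) only if bond({x, y}) >= minbond (absent pairs have bond 0)."""
    return matrix.get((x, y), 0.0) >= minbond


if __name__ == "__main__":
    matrix = build_bond_matrix(TRANSACTIONS)
    print(matrix[("b", "e")], matrix[("a", "b")], passes_pbm(matrix, "d", "g", 0.1))
```

Pairs that never co-occur have a bond of 0 and are simply absent from the dictionary, which matches the remark that only triples with c ≠ 0 are stored.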

Strategy 4. Abandoning Utility-List construction early (AUL). The fourth strategy introduced in FCHM is to stop constructing the utility-list of an itemset if a specific condition is met, indicating that the itemset cannot be a CHI. The strategy is based on the following novel observation:

Property 6 (Required conjunctive support for an extension Pxy). An itemset Pxy is a correlated itemset only if its conjunctive support is no less than the value $lowerBound(Pxy) = \lceil dissup(Pxy) \times minbond \rceil$ transactions.

This property is directly derived from the definition of the bond measure (bond(Pxy) ≥ minbond holds if and only if consup(Pxy) ≥ dissup(Pxy) × minbond, and consup(Pxy) is an integer), and the proof is therefore omitted. Now, consider the construction of the utility-list of an itemset Pxy by the Construct procedure (Algorithm 3). As mentioned, a first modification to this procedure is to create the disjunctive bit vector of Pxy by performing the OR operation with the bit vectors of Px and Py before Line 1. This allows obtaining the value dissup(Pxy) = |bv(Pxy)|. A second modification is to create a variable maxSupport that is initialized to the conjunctive support of Px (consup(Px) = |ul(Px)|), before Line 1. The third modification is to the following lines, where the utility-list of Pxy is constructed by checking if each tuple in the utility-list of Px appears in the utility-list of Py (Line 3). For each tuple not appearing in Py, the variable maxSupport is decremented by 1. If maxSupport is smaller than lowerBound(Pxy), the construction of the utility-list of Pxy can be stopped because the conjunctive support of Pxy cannot reach lowerBound(Pxy). Thus Pxy is not a CHI by Property 6, and its extensions can also be ignored by Property 4. As will be shown in the experiment, this strategy is very effective, and decreases execution time and memory usage by stopping utility-list construction early. A similar pruning strategy based on the utility of itemset Pxy instead of the bond measure is also integrated in FCHM. This strategy is called the LA-prune strategy and was introduced in HUP-Miner [8], and it is thus not described here.
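The AUL strategy can be sketched in Python as a variant of the Construct procedure. This is an illustration under assumptions: bit vectors are Python integers, the function name construct_with_aul is hypothetical, and the abandon test is written as maxSupport / dissup(Pxy) < minbond, which is equivalent to comparing maxSupport with the ceiling-based lower bound but avoids rounding a floating-point product. The LA-prune strategy is not included.

```python
def construct_with_aul(ul_p, ul_px, ul_py, bv_px, bv_py, minbond):
    """Build ul(Pxy) and bv(Pxy); return (None, bv) as soon as Pxy cannot be correlated."""
    bv_pxy = bv_px | bv_py                 # disjunctive bit vector of Pxy
    dissup = bin(bv_pxy).count("1")        # dissup(Pxy) = |bv(Pxy)|
    max_support = len(ul_px)               # consup(Px) = |ul(Px)|
    y_by_tid = {tid: (iu, ru) for tid, iu, ru in ul_py}
    p_by_tid = {tid: iu for tid, iu, _ in ul_p} if ul_p else {}
    ul_pxy = []
    for tid, iu_x, _ in ul_px:
        if tid not in y_by_tid:
            # This transaction cannot contain Pxy: one less possible supporter.
            max_support -= 1
            if max_support / dissup < minbond:
                return None, bv_pxy        # abandon: bond(Pxy) < minbond is certain
            continue
        iu_y, ru_y = y_by_tid[tid]
        ul_pxy.append((tid, iu_x + iu_y - p_by_tid.get(tid, 0), ru_y))
    return ul_pxy, bv_pxy


# Utility-lists and bit vectors of single items (running example, alphabetical order).
UL_A = [("T1", 5, 25), ("T3", 5, 3), ("T4", 10, 17)]
UL_B = [("T1", 10, 15), ("T2", 8, 12), ("T5", 4, 7)]
UL_D = [("T1", 6, 8), ("T2", 6, 3), ("T3", 2, 0)]
BV_A, BV_B, BV_D = 0b10110, 0b11001, 0b11100

if __name__ == "__main__":
    print(construct_with_aul(None, UL_B, UL_D, BV_B, BV_D, 0.5))
    print(construct_with_aul(None, UL_A, UL_B, BV_A, BV_B, 0.5))
```

In the second call the construction stops at the first mismatching tuple (T3): at most two transactions can still support {a, b}, while its disjunctive support is 5, so a bond of 0.5 is out of reach.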

4 Experimental Study

We performed an experiment to assess the performance of FCHM on a computer having a third generation 64 bit Core i5 processor running Windows 7, and 5 GB of free RAM. We compared the performance of the proposed FCHM algorithm with the state-of-the-art FHM algorithm for mining HUIs. The experiment was carried out on four real-life datasets commonly used in the HUIM literature: mushroom, retail, kosarak and foodmart. These datasets have varied characteristics and represent the main types of data typically encountered in real-life scenarios (dense, sparse, and long transactions). Let |I|, |D| and A represent the number of distinct items, the number of transactions and the average transaction length. mushroom is a dense dataset (|I| = 16,470, |D| = 88,162, A = 23). kosarak is a dataset that contains many long transactions (|I| = 41,270, |D| = 990,000, A = 8.09). retail is a sparse dataset with many different items (|I| = 16,470, |D| = 88,162, A = 10.30). foodmart is a sparse dataset (|I| = 1,559, |D| = 4,141, A = 4.4). foodmart contains real external and internal utility values. For the other datasets, external utilities for items are generated between 1 and 1,000 by using a log-normal distribution and quantities of items are generated randomly between 1 and 5, as in the settings of [12, 15]. The source code of algorithms and datasets can be downloaded as part of the SPMF open source data mining library [4] at http://www.philippe-fournier-viger.com/spmf/. Memory measurements were done using the Java API. For the experiment, FCHM was run with five different minbond threshold values (0.1, 0.3, 0.5, 0.7 and 0.9). Hereafter, the notation FCHMx represents FCHM with minbond = x. Algorithms were first run on each dataset, while decreasing the minutil threshold until they became too long to execute, ran out of memory or a clear trend was observed. Fig. 3 compares the execution times of FCHM and FHM. Fig. 4 compares the number of CHIs and HUIs, respectively generated by these algorithms.

[Fig. 3: Execution times. Panels: mushroom, kosarak, foodmart, retail. x-axis: minutil; y-axis: runtime (s). Series: FCHM0.1, FCHM0.3, FCHM0.5, FCHM0.7, FCHM0.9, FHM.]

[Fig. 4: Number of patterns found. Panels: mushroom, kosarak, foodmart, retail. x-axis: minutil; y-axis: pattern count. Series: FCHM0.1, FCHM0.3, FCHM0.5, FCHM0.7, FCHM0.9, FHM.]

It can first be observed that mining CHIs using FCHM can be much faster than mining HUIs. The reason for the excellent performance of FCHM is that it prunes a large part of the search space using its strategies for pruning with the bond measure. For datasets containing a huge number of weakly correlated HUIs such as mushroom, foodmart, and kosarak, this leads to a massive performance improvement. For example, for the lowest minutil values and minbond = 0.5 on these datasets, FCHM is respectively 103, 10 and 21 times faster than FHM. In general, when minbond increases, the gap between the runtime of FHM and FCHM increases. For example, for minbond = 0.9 on mushroom, FCHM is 417 times faster than FHM.
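To make the source of this speed-up concrete, the following minimal sketch computes the bond of an itemset as the number of transactions containing all of its items divided by the number of transactions containing at least one of them, and discards extensions whose bond falls below minbond. Because the bond is anti-monotone, no superset of a discarded extension can be correlated, so the whole branch can be skipped. This is a naive database-scan illustration, not FCHM itself; the function names and the toy database are hypothetical.

```python
def bond(itemset, transactions):
    """Conjunctive support divided by disjunctive support of the itemset."""
    items = set(itemset)
    conjunctive = sum(1 for t in transactions if items <= t)  # all items present
    disjunctive = sum(1 for t in transactions if items & t)   # at least one item present
    return conjunctive / disjunctive if disjunctive else 0.0


def correlated_extensions(prefix, candidates, transactions, minbond):
    """Return the one-item extensions of prefix whose bond is at least minbond.

    Extensions that fail the test can be dropped together with all of their
    supersets, since the bond never increases when items are added.
    """
    kept = []
    for item in candidates:
        extended = set(prefix) | {item}
        if bond(extended, transactions) >= minbond:
            kept.append(item)
    return kept


# Hypothetical toy database: each transaction is a set of items.
toy_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}, {"a", "b", "c", "d"}]
print(bond({"a", "b"}, toy_db))
print(correlated_extensions({"a"}, ["b", "c", "d"], toy_db, minbond=0.5))
```

With a higher minbond, fewer extensions survive the test at each level, which is consistent with the widening runtime gap reported above.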

A second observation is that the number of CHIs can be much smaller than the number of HUIs, even when using low minbond threshold values (see Fig. 4). For example, on mushroom, 3,538,181 HUIs are found for minutil = 3,000,000. But only 3,186 HUIs are CHIs for minbond = 0.3 and only 196 for minbond = 0.5 (about 1 CHI for 18,000 HUIs). Huge reductions in the number of patterns are also observed on the foodmart and kosarak datasets. On the retail dataset, the reduction is smaller because patterns are more correlated. However, there are still 2 to 10 times fewer CHIs than HUIs on this dataset, depending on the minutil values. Overall, these results show that the proposed FCHM algorithm is useful as it can filter a huge number of weakly correlated itemsets encountered in real datasets, and can run faster.
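The reduction ratios quoted for mushroom can be checked with a few lines of arithmetic. The sketch below only reuses the counts reported above; the variable and function names are illustrative.

```python
# Pattern counts reported for mushroom at minutil = 3,000,000.
hui_count = 3_538_181
chi_counts = {0.3: 3_186, 0.5: 196}  # minbond -> number of CHIs


def reduction_factor(huis, chis):
    """Number of HUIs per correlated HUI."""
    return huis / chis


for minbond, chis in chi_counts.items():
    factor = reduction_factor(hui_count, chis)
    print(f"minbond = {minbond}: about 1 CHI per {factor:,.0f} HUIs")
```

The ratio for minbond = 0.5 comes out slightly above 18,000, in line with the figure given in the text, while minbond = 0.3 already removes all but roughly one pattern in 1,100.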

Memory consumption was also compared, although detailed results are not shown as a figure due to space limitations. It was observed that FCHM can use up to 30 times less memory than FHM depending on how the minbond threshold is set. For example, on mushroom and minutil = 5,000,000, FHM, FCHM0.9, FCHM0.7, FCHM0.5, FCHM0.3, and FCHM0.1 respectively consume 666 MB, 424 MB, 136 MB, 59 MB, 26 MB and 20 MB.

Lastly, the efficiency of the proposed PBM and AUL strategies was also assessed. Detailed results are not shown due to space limitations. These strategies were shown to prune a huge number of candidates for most datasets, and the amount of pruning greatly increases when minbond is increased. For example, on mushroom and minutil = 5,000,000, FHM, FCHM0.9, FCHM0.7, FCHM0.5, FCHM0.3, and FCHM0.1 respectively visit 1,282,377, 536,136, 5,254, 1,160, 883, and 848 candidates.

5 Conclusion

This paper proposed an algorithm named FCHM (Fast Correlated high-utility itemset Miner) to efficiently discover correlated high-utility itemsets using the bond measure. An extensive experimental study shows that FCHM is up to two orders of magnitude faster than FHM, and can discover more than five orders of magnitude fewer patterns by only mining correlated HUIs. The source code of algorithms and datasets can be downloaded as part of the SPMF open source data mining library [4] at http://www.philippe-fournier-viger.com/spmf/. For future work, ideas introduced in FCHM could be considered in incremental high-utility itemset mining [9] and high-utility sequential rule mining [16]. Besides, we will consider developing an algorithm for correlated high-utility itemset mining based on the EFIM algorithm [17], since EFIM was shown to outperform FHM for HUIM.

Acknowledgement. This research was partially supported by National Natural Science Foundation of China (NSFC) under grant No. 61503092.

References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proc. Int. Conf. Very Large Databases, pp. 487–499 (1994)

2. Ahmed, C. F., Tanbeer, S. K., Jeong, B. S., Choi, H. J.: A framework for mining interesting high utility patterns with a strong frequency affinity. Information Sciences. 181(21), pp. 4878–4894 (2011)

3. Fournier-Viger, P., Wu, C.-W., Zida, S., Tseng, V. S.: FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning. In: Proc. 21st Int. Symp. on Methodologies for Intell. Syst., pp. 83–92 (2014)


4. Fournier-Viger, P., Gomariz, A., Gueniche, T., Soltani, A., Wu, C., Tseng, V. S.: SPMF: a Java Open-Source Pattern Mining Library. Journal of Machine Learning Research (JMLR), 15, pp. 3389–3393 (2014)

5. Barsky, M., Kim, S., Weninger, T., Han, J.: Mining flipping correlations from large datasets with taxonomies. In: Proc. 38th Int. Conf. on Very Large Databases, pp. 370–381 (2012)

6. Ben Younes, N., Hamrouni, T., Ben Yahia, S.: Bridging conjunctive and disjunctive search spaces for mining a new concise and exact representation of correlated patterns. In: Proc. 13th Int. Conf. Discovery Science, pp. 189–204 (2010)

7. Bouasker, S., Ben Yahia, S.: Key correlation mining by simultaneous monotone and anti-monotone constraints checking. In: Proc. 30th Symp. on Applied Computing, pp. 851–856 (2015)

8. Krishnamoorthy, S.: Pruning strategies for mining high utility itemsets. Expert Systems with Applications, 42(5), 2371–2381 (2015)

9. Lin, J. C. W., Gan, W., Hong, T. P., Pan, J. S.: Incrementally Updating High-Utility Itemsets with Transaction Insertion. In: Proc. 10th Int. Conf. on Advanced Data Mining and Applications, pp. 44–56 (2014)

10. Lin, J. C.-W., Gan, W., Fournier-Viger, P., Hong, T.-P.: Mining Discriminative High Utility Patterns. In: Proc. 8th Asian Conference on Intelligent Information and Database Systems, Springer, 10 pages (2016)

11. Song, W., Liu, Y., Li, J.: BAHUI: Fast and memory efficient mining of high utility itemsets based on bitmap. Int. J. Data Warehousing and Mining. 10(1), 1–15 (2014)

12. Liu, M., Qu, J.: Mining high utility itemsets without candidate generation. In: Proc. 22nd ACM Int. Conf. Info. and Know. Management, pp. 55–64 (2012)

13. Liu, Y., Liao, W., Choudhary, A.: A two-phase algorithm for fast discovery of high utility itemsets. In: Proc. 9th Pacific-Asia Conf. on Knowl. Discovery and Data Mining, pp. 689–695 (2005)

14. Soulet, A., Raissi, C., Plantevit, M., Cremilleux, B.: Mining dominant patterns in the sky. In: Proc. 11th IEEE Int. Conf. on Data Mining, pp. 655–664 (2011)

15. Tseng, V. S., Shie, B.-E., Wu, C.-W., Yu, P. S.: Efficient algorithms for mining high utility itemsets from transactional databases. IEEE Trans. Knowl. Data Eng. 25(8), 1772–1786 (2013)

16. Zida, S., Fournier-Viger, P., Wu, C.-W., Lin, J. C. W., Tseng, V. S.: Efficient mining of high utility sequential rules. In: Proc. 11th Int. Conf. Machine Learning and Data Mining, pp. 1–15 (2015)

17. Zida, S., Fournier-Viger, P., Lin, J. C.-W., Wu, C.-W., Tseng, V. S.: EFIM: A Highly Efficient Algorithm for High-Utility Itemset Mining. In: Proc. 14th Mexican Int. Conf. on Artificial Intelligence, pp. 530–546 (2015)