

28 International Journal of Data Warehousing & Mining, 1(3), 28-62, July-September 2005

Copyright © 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Mining Association Rules in Data Warehouses

Haorianto Cokrowijoyo Tjioe, Monash University, Australia

David Taniar, Monash University, Australia

ABSTRACT

Data mining applications have enormously altered the strategic decision-making processes of organizations. The application of association rules algorithms is one of the well-known data mining techniques that have been developed to cope with multidimensional databases. However, most of these algorithms focus on multidimensional data models for transactional data. As data warehouses can be presented using a multidimensional model, in this paper we provide another perspective to mine association rules in data warehouses by focusing on a measurement of summarized data. We propose four algorithms (VAvg, HAvg, WMAvg, and ModusFilter) to provide efficient data initialization for mining association rules in data warehouses by concentrating on the measurement of aggregate data. Then we apply those algorithms both on a non-repeatable predicate, which is known as mining normal association rules, using GenNLI, and a repeatable predicate using ComDims and GenHLI, which is known as mining hybrid association rules.

Keywords: association rules; data mining; data preprocessing; data warehouse; multidimensional database design

INTRODUCTION

Association rules on transaction databases were first introduced by Agrawal, Imielinski, and Swami (1993). Using the Apriori algorithm (Agrawal & Srikant, 1994), large itemsets satisfying the minimum support, and association rules based on the minimum confidence, could be discovered. Since then, a large number of efficient algorithms have been introduced, using the hash-based technique (Park, Chen, & Yu, 1995), transaction reduction (Han & Fu, 1995), the partition technique (Mannila, Toivonen, & Verkamo, 1994), and sample datasets to prune the number of passes on the data (Toivonen, 1996).

Association rules traditionally use transactional data that focus on a single dimension or predicate (Agrawal & Srikant, 1994; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995). However, this is not adequate, since real-life data usually involve more than one dimension or predicate. Subsequently, traditional association rules were extended to handle the multidimensional model (Guenzel, Albrecht, & Lehner, 1999; Kamber et al., 1997). Kamber et al. (1997) introduced the idea of mining association rules in a multidimensional data model. Their algorithm focuses only on presenting association rules in a multidimensional model that involves more than one dimension in transactional data. However, it does not address the hierarchies that are also characteristic of a multidimensional model. Later, a new algorithm (Guenzel, Albrecht, & Lehner, 1999) was proposed to support mining multidimensional databases by hierarchy using an online analytical processing (OLAP) approach.

This paper appears in the journal International Journal of Data Warehousing and Mining, edited by David Taniar. Idea Group Publishing, 701 E. Chocolate Avenue, Suite 200, Hershey PA 17033-1240, USA. Tel: 717/533-8845; Fax 717/533-8661; URL: http://www.idea-group.com

However, the approaches of both Guenzel, Albrecht, and Lehner (1999) and Kamber et al. (1997) miss the most important attribute, which is the measurement of aggregate data in a Data Warehouse (DW). The data in a DW contain only summarised data such as quantity sold, amount sold, and so forth. No transaction data are stored. In this paper, we focus on providing a framework for mining association rules on both non-repeatable predicates and repeatable predicates in data warehouses by concentrating on the measurement of aggregate data.

Here, we propose four algorithms (HAvg, VAvg, ModusFilter, and WMAvg) to provide efficient data initialisation for mining association rules in data warehouses by concentrating on the measurement of aggregate data, specifically its quantity. Then we apply those algorithms both to non-repeatable predicates, using GenNLI, and to repeatable predicates, using ComDims and GenHLI, which is known as mining hybrid association rules.

As shown in Figure 1, we provide a framework for mining association rules in a data warehouse. We use quantity data in a fact table to explain our approach. There are three steps to perform in order to derive initialised data for mining association rules in a data warehouse. First, we select the data warehouse that we want to mine. Second, we use user input variables to decide which dimensions will be used, along with the single or interval data, to find the interesting patterns. Finally, we apply our HAvg, VAvg, ModusFilter, and WMAvg algorithms to produce the data initialisation based on the information derived from the two previous steps (see Figure 1). We then mine the DW, using the data initialisation from the proposed algorithms, for non-repeatable association rules and for rules with repeatable predicates, which is known as mining hybrid association rules.

The HAvg, VAvg and WMAvg algorithms work by selecting the average measurement of aggregate data in a DW with multidimensional structures, such as average quantity sold, average price, and so forth. We prune all the rows in the fact table that have less than the average quantity, since we assume that rows with quantities below the average will not form any association rules. The main differences between these algorithms are as follows: HAvg finds the average quantity of the defined dimensions horizontally, VAvg finds the average quantity of the defined dimensions vertically, and WMAvg selects the weighted moving average quantity of the defined dimensions vertically.

Using the VAvg algorithm, we find the average quantity of the selected dimension vertically. We illustrate how the VAvg algorithm works in Table 1. Here we use only two dimensions: the Times dimension, for one week's sales only, and the Products dimension, for only Beer and Bread. The Times dimension works as a grouping dimension, so the attributes of the Products dimension are grouped according to its time frames (see Figure 2 for the detail of those dimensions).


Suppose we want to find the average quantity of the Products dimension in a vertical manner. First we compute the vertical average quantity for Beer from the first row, Monday, to the last row, Sunday. The average quantity of that product is 217.14. Then we eliminate all attributes of that particular product that are less than its average quantity (e.g., 100, 120, 200). Finally, we repeat the same process for Bread, computing the vertical average quantity in the Products dimension (e.g., 142.14) and then deleting all attributes that are less than its vertical average quantity (e.g., 125, 80, 110, 100).
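As a rough illustration of the vertical averaging step described above, the following Python sketch computes the average quantity of one product over the grouping (Times) dimension and keeps only the rows at or above that average. The function name `vavg_prune`, the row layout, and the quantities 250, 270, 280 and 300 are assumptions made for illustration; only the pruned values 100, 120, 200 and the average of 217.14 are taken from the text.

```python
from statistics import mean

def vavg_prune(fact_rows, product):
    """Return (average, kept_rows) for one product.

    fact_rows: list of (day, product, quantity) tuples (hypothetical layout).
    Rows whose quantity is below the product's vertical average are dropped.
    """
    rows = [r for r in fact_rows if r[1] == product]
    avg = mean(q for (_, _, q) in rows)
    kept = [r for r in rows if r[2] >= avg]
    return avg, kept

# Hypothetical week of Beer sales; 100, 120 and 200 are the values the text prunes.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
beer_quantities = [100, 120, 200, 250, 270, 280, 300]
fact_rows = [(d, "Beer", q) for d, q in zip(days, beer_quantities)]

beer_avg, beer_kept = vavg_prune(fact_rows, "Beer")
print(round(beer_avg, 2), [q for (_, _, q) in beer_kept])
```

The same function would be called once per product (for example, once more for Bread), which mirrors the repeat step in the description.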

The HAvg algorithm finds the average quantity of the defined dimensions horizontally. As shown in Table 2, we use two dimensions, the Products dimension and the Customers dimension, to illustrate how the HAvg algorithm works (see Figure 2 for the detail of those dimensions). The Products dimension is used as the grouping dimension with three different products (sausage, milk, bread). Thus, attributes on the Customers dimension are grouped according to the Products dimension. The Customers dimension uses Location Id as its attribute, with the values Clayton (CL), Caulfield (CA), Richmond (RIC), and Chadstone (Chad). First, we compute the average quantity of Sausage horizontally from the first row across the Customers dimension. The average quantity of that product is 143. Then we delete all attributes from the Customers dimension that are less than this average quantity [e.g., CL(100), Chad(125)]. Finally, we continue until we have computed the horizontal average quantities for Milk and Bread and pruned them in the same way.
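The horizontal variant can be sketched in the same spirit. In the Python example below, each product is a row and each customer location is a column; the average is taken across the row and cells below it are dropped. The function name `havg_prune`, the nested-dictionary layout, the Milk row, and the CA and RIC quantities for Sausage are hypothetical; only CL(100), Chad(125) and the Sausage average of 143 come from the text.

```python
from statistics import mean

def havg_prune(table):
    """Prune each product row horizontally.

    table: {product: {location: quantity}} (hypothetical layout).
    Returns {product: (row_average, {location: quantity kept})}.
    """
    result = {}
    for product, cells in table.items():
        avg = mean(cells.values())
        kept = {loc: q for loc, q in cells.items() if q >= avg}
        result[product] = (avg, kept)
    return result

# CA and RIC for Sausage, and the whole Milk row, are made-up values.
sales = {
    "Sausage": {"CL": 100, "CA": 170, "RIC": 177, "Chad": 125},
    "Milk": {"CL": 200, "CA": 150, "RIC": 90, "Chad": 160},
}

pruned = havg_prune(sales)
print(pruned["Sausage"])
```

Keeping the average alongside the surviving cells makes it easy to check each row against the worked example.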

The WMAvg algorithm selects the weighted moving average measurement of aggregate data in a DW, such as quantity sold, amount sold, and so forth. The intention is to emphasise that the most recent purchase quantity of a certain product contributes most to the chance of the next purchase of that product (Lowry et al., 1992). The weighted moving average is computed using the SUM of the record positions. Based on this consideration, we are interested only in providing relevant data that satisfy the weighted moving average measurement of aggregate data in a DW. This reduces the number of rows in a DW that we are not interested in mining and, consequently, reduces the processing time needed to find the interesting rules.

Figure 1. A generic framework for mining association rules in a data warehouse

For example, suppose we want to pre-process data for mining association rules in a DW by focusing on a certain Times dimension and a specific Products dimension (e.g., Soap), with quantity as the measurement of aggregate data. The detail of the data warehouse structure can be seen in Figure 2. Table 3 shows one week of data ordered across the Times dimension from Monday to Sunday, with Soap as the only product. We find its weighted moving average quantity as follows. First, the SUM of the seven record positions is 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. Then each row is weighted in turn, from the first record (Monday), where quantity 100 becomes 100 × (1/28) = 3.571, to the last record (Sunday), where quantity 280 becomes 280 × (7/28) = 70. The sum of these weighted values gives the weighted moving average, 245.356. Finally, all records with less than the weighted moving average quantity (e.g., quantities 100, 120, 200) are pruned.
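The weighting scheme in this example (record i of n receives weight i divided by 1 + 2 + ... + n) can be sketched in Python as follows. The function name and the Thursday to Saturday quantities (250, 270, 270) are assumptions chosen for illustration, so the result only approximates the figure quoted in the text; the quantities 100, 120, 200 and 280 are the ones mentioned above.

```python
def weighted_moving_average(quantities):
    """Weight the i-th record (1-based, oldest first) by i / (1 + 2 + ... + n)."""
    n = len(quantities)
    weight_sum = n * (n + 1) // 2  # 28 for seven records
    weighted_total = sum(i * q for i, q in enumerate(quantities, start=1))
    return weighted_total / weight_sum

# Monday..Sunday quantities for Soap; the middle values are hypothetical.
soap = [100, 120, 200, 250, 270, 270, 280]

wma = weighted_moving_average(soap)
soap_kept = [q for q in soap if q >= wma]  # rows below the average are pruned
print(round(wma, 3), soap_kept)
```

Because later records carry larger weights, the threshold sits above the plain mean of the same data, so the filter leans towards recent sales levels.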

Table 1. Example of how the VAvg algorithm works

Table 2. An example of how the proposed HAvg algorithm works

Unlike the other algorithms, the ModusFilter algorithm selects the modus (mode) measurement of aggregate data in a DW, such as quantity sold, amount sold, and so forth. Based on this consideration, we are interested only in providing relevant data that satisfy the modus measurement of aggregate data in a DW. This reduces the number of rows in the DW that are not of interest for mining and, consequently, reduces the processing time needed to find the interesting rules. For example, suppose we want to pre-process data for mining association rules in a DW by focusing on a certain Times dimension and a specific Products dimension (e.g., Soap), with quantity as the measurement of aggregate data (see Figure 2 for the detail of the sales data warehouse). Table 4 shows one week of data ordered across the Times dimension from Monday to Sunday, with Soap as the only product. We find that quantity 400 is the modus measurement of quantity, since it is the quantity most frequently bought in that week. All the data whose quantity is not 400 (e.g., quantities 500, 600, 700) are then ignored.
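The mode-based filter can be illustrated with a few lines of Python. This is only a sketch under assumptions: the function name `modus_filter`, the row layout, and the day-by-day arrangement of the quantities are hypothetical; the text states only that 400 is the most frequent quantity and that 500, 600 and 700 are ignored.

```python
from collections import Counter

def modus_filter(rows):
    """Keep only rows whose quantity equals the most frequent quantity.

    rows: list of (day, quantity) tuples (hypothetical layout).
    Ties are resolved by Counter.most_common, i.e. first encountered wins.
    """
    counts = Counter(q for (_, q) in rows)
    modus, _ = counts.most_common(1)[0]
    return modus, [r for r in rows if r[1] == modus]

# Hypothetical week of Soap sales in which 400 occurs most often.
soap_week = [("Mon", 400), ("Tue", 500), ("Wed", 400), ("Thu", 600),
             ("Fri", 400), ("Sat", 700), ("Sun", 400)]

modus, modus_rows = modus_filter(soap_week)
print(modus, [d for (d, _) in modus_rows])
```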

The rest of this paper is organised as follows. The second section describes the background of data warehouse modelling, association rules and hybrid association rules both on transactional data and data warehouses. The third section explains the related work. It is followed by the description of our proposed algorithms in the fourth section. In the fifth section, we mine association rules in a DW using both non-repeatable and repeatable predicates. The sixth section presents some performance results showing the effectiveness of our methods, based on applying these methods to data from a sales data warehouse. Finally, the last section concludes the paper.

BACKGROUND

In this section we provide a short introduction to the star schema and the multidimensional model, followed by mining association rules. We will use both of these concepts in our proposed algorithms.

Table 3. Example of WMAvg concept

Table 4. Example of ModusFilter concept

Star Schema and Multidimensional Model in Data Warehouses

A data warehouse is typically built using a star schema (Chaudhuri & Dayal, 1999; Inmon, 1996; Kamber & Han, 2001), where it has more than one dimension and each dimension corresponds to one or more fact tables. Dimensions store the description of business dimensions (e.g., Products, Customers, Times, Channels and Promotions), while fact tables consist of aggregate data of measurements such as quantity sold and amount sold (Chaudhuri & Dayal, 1999; Kamber & Han, 2001). A data warehouse is an integrated database, which is used to store all information from different database formats and locations. It is separate from the transactional database and consists only of historical data. It is built for a particular purpose such as sales, marketing, and so forth. It is updated periodically, for example per week, month, or quarter, and can be updated only by a backend user such as the database administrator (DBA; Inmon, 1996). As shown in Figure 2, a data warehouse is built using a star schema. It has five dimensions and one sales fact table. The dimensions are the Products, Times, Customers, Channels and Promotions dimensions. The fact table contains the reference keys from all dimensions and two measurements: quantity and dollar_sold.

As shown in Figure 3, we use a multidimensional model to view the data warehouse (Chaudhuri & Dayal, 1999; Kamber & Han, 2001). The example shows a multidimensional model that involves three dimensions (products, customers and times). We can view the information that lies along those dimensions, such as finding information on products between numbers 10 and 100 and customers between numbers 20 and 100 in the year 2000. Based on that enquiry, we find that there are 10,000 units involved.

Data warehouses contain aggregate data, which are a summary of all the details of operational data from Online Transaction Processing (OLTP). These data are kept at the lowest granularity of data that will be stored in the data warehouse (Chaudhuri & Dayal, 1999; Kamber & Han, 2001). For example, in the Times dimension we have the date in dd-mm-yyyy format as the lowest granularity of data. If we have 100 transactions on 19-10-2000 with different times and quantities in OLTP, these data are transformed into summarised data for the data warehouse, which has only a single record for 19-10-2000 containing the aggregate data as the measurements in the fact table.
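To make the summarisation step concrete, the following Python sketch rolls individual OLTP transactions up to one fact row per date and product. The record layout, the function name `summarise`, and the sample transactions are assumptions for illustration only; the paper does not prescribe a particular loading procedure.

```python
from collections import defaultdict

def summarise(transactions):
    """Aggregate OLTP rows to the day level.

    transactions: iterable of (date 'dd-mm-yyyy', time, product, quantity).
    Returns {(date, product): total_quantity}, i.e. one fact row per key.
    """
    facts = defaultdict(int)
    for date, _time, product, quantity in transactions:
        facts[(date, product)] += quantity
    return dict(facts)

# Three hypothetical transactions on the same day collapse into one fact row.
oltp = [
    ("19-10-2000", "09:15", "Soap", 3),
    ("19-10-2000", "11:40", "Soap", 5),
    ("19-10-2000", "16:05", "Soap", 2),
    ("20-10-2000", "10:00", "Soap", 4),
]

fact_table = summarise(oltp)
print(fact_table)
```

The time of day is discarded because the date is the lowest granularity kept in this example.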

Figure 2. Sales data warehouse with a star schema

Association Rules and Hybrid Association Rules on Transactional Data

In general, data mining can be divided into predictive and descriptive task models (Dunham, 2003; Kamber & Han, 2001). A predictive mining task model is concerned with making predictions based on the currently available data, while the descriptive mining task model focuses on discovering interesting patterns in the current database. Association rules belong to the descriptive mining task model. The idea is to discover the associations between items in the large itemsets in a database.

The Apriori association rule algorithm was introduced by Agrawal and Srikant (1994). Using the Apriori algorithm, the large itemsets, which satisfy the minimum support and can generate rules based on the minimum confidence, can be discovered. The association rule concept itself was first introduced by Agrawal, Imielinski, and Swami (1993), and it has been applied widely to market basket analysis. Market basket analysis is a process to discover the associations relating to buying behaviour between items that are bought together over a period of time. An example of an association rule is as follows:

cigarettes → lighter {sup = 25%, conf = 75%}

This rule means that 25% of all transactions contain both cigarettes and a lighter (the support), while customers who buy cigarettes have a confidence, or probability, of 75% of buying a lighter at the same time. A formal model of this association rule is as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. Let $T$ be a set of transactions, where each transaction $t \subseteq I$. An association rule has the form $A \rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule has confidence $c$ if $c\%$ of the transactions in $T$ that support $A$ also support $B$, and it has support $s$ if $s\%$ of the transactions in $T$ contain $A \cup B$.
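The support and confidence measures can be checked on a small example. The Python sketch below uses the standard definitions (support as the fraction of transactions containing A ∪ B, confidence as the fraction of transactions containing A that also contain B). The twelve transactions are hypothetical and were chosen so that the cigarettes → lighter rule comes out at 25% support and 75% confidence; the function names are likewise assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing A that also contain B."""
    a = set(antecedent)
    ab = a | set(consequent)
    count_a = sum(1 for t in transactions if a <= t)
    count_ab = sum(1 for t in transactions if ab <= t)
    return count_ab / count_a if count_a else 0.0

# Hypothetical market baskets: 12 transactions in total.
baskets = (
    [{"cigarettes", "lighter"}] * 3
    + [{"cigarettes"}]
    + [{"bread"}] * 8
)

rule_sup = support(baskets, {"cigarettes", "lighter"})
rule_conf = confidence(baskets, {"cigarettes"}, {"lighter"})
print(rule_sup, rule_conf)
```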

Unlike normal association rules for transactional data, hybrid association rules are association rules that engage two or more repeated predicates and usually involve different types of predicates for transactional data. A hybrid association rule uses more than one table in the transactional data (e.g., tables customers and orders). An example of a hybrid association rule is as follows:

Age(25..35) ∧ Living(Melb) ∧ Buy(Beer) → Buy(Diaper) {sup = 30%, conf = 80%}

Figure 3. Sales data warehouse with a multidimensional model


This rule means that 30% of all transactions are by customers aged between 25 and 35 who live in Melbourne and buy beer and diapers at the same time, and that such customers in Melbourne who buy beer have a confidence, or probability, of 80% of buying diapers simultaneously. The above example uses three different types of predicate: Age and Living from table Customers, and Buy from table Orders, where the predicate Buy is repeated. In contrast, a normal association rule uses only a single predicate, which is the predicate Buy. The formal model of a hybrid association rule is similar to that of a normal association rule; however, a hybrid association rule has to show the type of each predicate as well.

Association Rules and Hybrid Association Rules in Data Warehouses

Mining association rules in data warehouses using either non-repeatable or repeatable dimensions is different from mining association rules in transactional data. Mining association rules in data warehouses uses more than one dimension, modelled using either a star schema or a multidimensional model. The key to mining association rules in data warehouses is the ability to explore the integrated information that exists in fact tables, since in a fact table all information keys are connected to each dimension, and the measurement of aggregate data is available. When mining data warehouses, we no longer use the transaction id from transactional resources, or a single table with multiple fields such as table customers (e.g., age, credit limits, number of cars), to discover the interesting patterns. Instead, we use all the information from the dimensions, together with the measurement of aggregate data (e.g., quantity), to find the interesting patterns.

We define non-repeatable association rules in data warehouses as association rules that engage two or more dimensions, where only one dimension can have multiple attributes while the others work as grouping dimensions. For example, suppose we want to discover interesting patterns limited to three dimensions, the Times, Channels, and Products dimensions, as follows (see Figure 3 for the detail of the data warehouse structure):

{Times Dimension, Channels Dimension, Products Dimension}

{$d_1$(‘January 1998’), $d_2$(‘Direct Sales’), $d_3$(‘Households’)}

Here only the Products dimension can have multiple attributes, while the rest of the dimensions work only as grouping dimensions. The Times dimension works as the first grouping dimension, and the Channels dimension works as the second grouping dimension. A possible association rule over those selected dimensions is:

Times(Jan’98) ∧ Channels(Direct Sales) ∧ Products(Bread)(≥ 250k) → Products(Milk)(≥ 500k) {sup = 25%; conf = 15%}

The rule means that in January 1998 (Times dimension), for products sold through direct sales (Channels dimension), Bread (Products dimension) selling at least 250 thousand units together with Milk selling at least 500 thousand units accounts for a 25% share of the total transactions. The confidence, that is, the probability within those selected dimensions that Bread and Milk are sold together, is 15%.

We define hybrid association rules in data warehouses as association rules that engage two or more repeated dimensions, where two or more of the dimensions can have multiple attributes. With hybrid association rules, we are also concerned with the interval, the repeated dimensions, and the measurement of summarised data in data warehouses, such as quantity. For example, suppose we want to discover interesting patterns limited to three dimensions, the Times, Customers, and Products dimensions, as follows (see Figure 2 for the detail of the data warehouse structure):

{Times Dimension, Customers Dimension, Products Dimension}

{$d_1$(‘1998-2000’), $d_2$(‘Australia’), $d_3$(‘Cars’)}

Here only the Times dimension works as the grouping dimension, while the rest of the dimensions can have multiple attributes. An example of such a hybrid association rule in data warehouses is as follows:

Times(1998…2000) ∧ (Customers(Melb) ∧ Products(Car))(≥ 500) → (Customers(Syd) ∧ Products(Car))(≥ 250) {sup = 10%; conf = 20%}

As shown by the above rule, the Customers and Products dimensions are repeatable dimensions, while the Times dimension appears only once. The rule means that in the years between 1998 and 2000, customers in Melbourne buying at least 500 car units together with customers in Sydney buying at least 250 car units has a support of 10% of total sales across those dimensions, and that in those years, when customers in Melbourne buy at least 500 car units, customers in Sydney also buy at least 250 car units with a confidence, or probability, of 20%.

Thus, a formal model for mining association rules in data warehouses is as follows. Let $Val = \{A_n \ldots A_m\}$ be an interval or classification value on a selected dimension. Let $D = \{d_1, d_2, \ldots, d_m\}$ be a set of dimensions where $d \subseteq D$. Let $FactTab = \{D, quantity\}$ be a fact table with a set of dimensions and its quantity. Let $duser = \{A_n \ldots A_m\}$ be an interval or classification of a user-selected dimension. Let $UserVar = \{duser_1, duser_2, \ldots, duser_m\}$ be a set of user input dimensions with their variables, where $duser \subseteq UserVar$.

Hybrid Association Rules:

$d_1(Val), d_2(Val), \ldots, d_m(Val)\ \{\geq quantity\} \rightarrow d_2(Val), \ldots, d_m(Val)\ \{\geq quantity\}$

Non-hybrid Association Rules:

$d_1(Val), d_2(Val), \ldots, d_m(Val)\ \{\geq quantity\} \rightarrow d_m(Val)\ \{\geq quantity\}$

Based on the explanations above, the concepts of association rules and hybrid association rules differ between transactional data and the data warehouse. In association rules for transactional data, we find interesting patterns based on individual transaction records; in a data warehouse, we find interesting patterns based on the summarised data. With our approach, the measurements in the fact table of the DW are essential for mining non-hybrid and hybrid association rules in data warehouses.

RELATED WORKS

Mining association rules are divided into seven different types: Boolean association rules (Agrawal, Imielinski, & Swami, 1993; Agrawal & Srikant, 1994; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995; Toivonen, 1996); generalised association rules (Srikant & Agrawal, 1995); multilevel association rules (Han & Fu, 1995); metarule-guided association rules (Kamber, Han, & Chiang, 1997); and multidimensional data mining (Guenzel, Albrecht, & Lehner, 1999).

Boolean association rules, also known as traditional association rules, were first introduced by Agrawal, Imielinski, and Swami (1993) and mine interesting associations between items that are bought together in a market basket. They use minimum support and confidence to discover the frequency of the items. These support and confidence thresholds, which are supplied by the user, eliminate uninteresting rules. A large number of algorithms that use the same concept as Agrawal, Imielinski, and Swami (1993) have been proposed to efficiently discover large itemsets (Agrawal & Srikant, 1994; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995; Toivonen, 1996). The hash-based technique can be applied to reduce the number of candidates when finding large itemsets by using a hash table (Park, Chen, & Yu, 1995); transaction reduction can be used to reduce the number of transactions scanned in future iterations (Han & Fu, 1995); the partition technique (Mannila, Toivonen, & Verkamo, 1994) requires two database scans to discover the frequent large itemsets; and a sampling technique (Toivonen, 1996) takes a random sample from a given database and finds the frequent itemsets from the sample rather than from the whole database.

After finding the large itemsets, the next step of mining association rules is rule generation, where we generate rules based on the large itemsets and a given minimum confidence, such as:

A → B with confidence C%, where C% = (support(A ∪ B) / support(A)) × 100%.

Mining generalised association rules is a bottom-up process that uses a taxonomy concept to combine all lower-level attributes into a specific taxonomy level for categorical data (Srikant & Agrawal, 1995). This method finds frequent large items at the lowest, or raw, level of data and then progressively moves up to the higher levels of the taxonomy. The motivation for this method is that earlier work on association rules (Agrawal, Imielinski, & Swami, 1993; Agrawal & Srikant, 1994; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1995; Savasere, Omiecinski, & Navathe, 1995; Toivonen, 1996) considers only single-level items in association rules, whereas in reality items are usually organised according to their taxonomical level. For example, Soap is placed under the bathroom taxonomy level. Moreover, this kind of association rule mining considers only one uniform minimum support across taxonomical levels. Mining generalised association rules works only on single-dimension Boolean data. It uses transactional data as the main resource for mining association rules, and it is applicable only to intra-association rule relationships with a distinct predicate.

As shown in Figure 4, there are two kinds of taxonomical levels: Computers and Computer Accessories. The Computers taxonomy consists of two different products: Desktop Computers and Notebooks. The Desktop level has two different groups of items: Built Up-Desktop (factory brand desktops) and Local-Desktop (users build their own desktops). Assume that the uniform minimum support is 20%. It commonly happens that a higher taxonomical level has a higher support. For example, Desktop from the Computers taxonomical branch has a higher support than Local-Desktop, since Desktop combines all items from Built Up-Desktop and Local-Desktop.

International Journal of Data Warehousing & Mining, 1(3), 28-62, July-September 2005
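The roll-up of support along a taxonomy can be sketched as follows. This is an illustrative sketch only: the parent map loosely follows the Computers branch described for Figure 4, while the transactions and the helper names are hypothetical.

```python
# Hypothetical taxonomy (child -> parent), loosely following the Computers branch.
PARENT = {
    "Built Up-Desktop": "Desktop",
    "Local-Desktop": "Desktop",
    "Desktop": "Computers",
    "Notebook": "Computers",
}


def ancestors(item):
    """Return the item followed by all of its ancestors in the taxonomy."""
    chain = [item]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalised_support(transactions):
    """Support of every item and ancestor, counted once per transaction."""
    counts = {}
    for t in transactions:
        extended = {a for item in t for a in ancestors(item)}
        for name in extended:
            counts[name] = counts.get(name, 0) + 1
    return {name: c / len(transactions) for name, c in counts.items()}


# Hypothetical transactions over leaf-level items.
transactions = [
    {"Built Up-Desktop"},
    {"Local-Desktop"},
    {"Notebook"},
    {"Local-Desktop", "Notebook"},
    {"Built Up-Desktop"},
]
sup = generalised_support(transactions)
```

Because every Local-Desktop transaction is also a Desktop transaction, the support of Desktop can never be lower than that of Local-Desktop.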

Mining multilevel association rules is a top-down process in which multiple levels of data abstraction are used to group the items according to the specified levels (Han & Fu, 1995). This concept broadens the capability of the Apriori algorithm (Agrawal & Srikant, 1994) from leaf-level association rules to multilevel data abstractions. The algorithm first discovers frequent large items at the highest level of data abstraction and progressively refines the process towards the lower levels of data abstraction. This approach is conceptually similar to mining generalised association rules (Srikant & Agrawal, 1995). The main difference from Srikant and Agrawal (1995) is that it uses a different minimum support for each data abstraction level and works from the highest level of data abstraction to the lowest. This approach argues that a uniform minimum support applied to all abstraction levels leads to the discovery of many non-meaningful associations at the higher levels. If the user sets the minimum support for all levels too high, interesting associations at the lowest level will be missed, since the lowest level usually has a low support (see Figure 4). Moreover, this mining concept works only on single-dimension Boolean data. It uses transactional data as the main resource for mining association rules and is applicable only to an intra-association rule relationship with a distinct predicate.

Mining quantitative association rules in large relational tables extends traditional association rules to handle non-Boolean data (Srikant & Agrawal, 1996). The main idea of this approach is to map non-Boolean attribute values, such as quantitative and categorical data, to Boolean values with a specific mapping code. For quantitative attributes, it uses interval ranges to partition the values, while for categorical attributes it uses a single integer value as the mapping code. A partial completeness measure is applied to minimise the information lost by partitioning. Under this measure, equi-depth partitioning is used to partition quantitative attributes, since it appears to work optimally (Srikant & Agrawal, 1996). An example of an association rule is as follows:

<Income: 10k…20k> and <Married: No> → <NumCars: 1> {sup = 30%; conf = 15%}

Figure 4. An example of taxonomy


The rule means that unmarried people with an income of between ten thousand and twenty thousand dollars have only one car, with a support of 30% and a confidence of 15%. The Income attribute is a quantitative attribute, where the number of partitions and the width of the partition intervals depend on the user-defined partial completeness measure. The Married attribute is a categorical attribute whose values are mapped to integer values. NumCars is a quantitative attribute that is not partitioned, since each of its values is a single integer. Moreover, this mining concept is suitable for non-Boolean data, uses transactional data as the main resource for mining association rules, and is applicable only to inter-association rule relationships with a distinct predicate.
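Equi-depth partitioning of a quantitative attribute such as Income can be sketched as below. This is a simplified illustration that assumes the number of partitions is given directly and does not exceed the number of values; it does not derive the partition count from the partial completeness measure, and the income values and function names are hypothetical.

```python
def equi_depth_intervals(values, n_parts):
    """Split the sorted values into n_parts intervals holding (nearly) equal counts.

    Assumes 1 <= n_parts <= len(values).
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_parts)
    intervals, start = [], 0
    for p in range(n_parts):
        end = start + size + (1 if p < extra else 0)
        chunk = ordered[start:end]
        intervals.append((chunk[0], chunk[-1]))
        start = end
    return intervals


def to_interval_code(value, intervals):
    """Map a quantitative value to the integer code of its interval (None if outside)."""
    for code, (lo, hi) in enumerate(intervals):
        if lo <= value <= hi:
            return code
    return None


incomes = [12, 15, 18, 22, 25, 31, 40, 55, 70]  # hypothetical, in thousands
intervals = equi_depth_intervals(incomes, 3)
code = to_interval_code(25, intervals)
```

Each interval code then plays the role of a Boolean item, so that an Apriori-style algorithm can be applied to the mapped data.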

Metarule-guided association rules for multidimensional databases use a set of rule templates to guide the mining process on multidimensional databases, in order to discover interesting patterns according to user-specified expectations (Kamber, Han, & Chiang, 1997). This idea emphasises that multidimensional databases are complex databases that involve many cubes, which increases the number of rules. Thus, in order to discover interesting rules efficiently, a metarule is important for reducing the search space. An example of such a metarule is as follows:

P1(X, A) ∧ P2(X, B) → buys(X, Games Software) {sup ≥ 40%, conf ≥ 40%}

An example of a rule found is as follows:

Lives(X, Melb) ∧ Occupation(X, Student) → buys(X, Games Software) {sup = 45%, conf = 50%}

The metarule restricts the search on a multidimensional database to rules with exactly two different predicates, P1 and P2, whose values are A and B, and whose consequent is that the customer buys Games Software, with a minimum support of 40% and a minimum confidence of 40%. The example rule states that customers who live in Melbourne and whose occupation is student buy Games Software with a support of 45% and a confidence of 50%. That example satisfies the user metarule. Moreover, this mining concept is suitable for non-Boolean data and uses multidimensional data as the main resource for mining association rules; it has no taxonomy or classification of data abstraction, so all attributes are put on the same level; and it is applicable only to an inter-association rule relationship with a distinct predicate (Kamber, Han, & Chiang, 1997).
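How a metarule constrains candidate rules can be sketched as a simple filter. This is only an illustration of the template check, not the mining procedure of Kamber, Han, and Chiang (1997); the dictionary layout and the function name are assumptions made for the example.

```python
def satisfies_metarule(rule, n_predicates, consequent, min_sup, min_conf):
    """Check a candidate rule against a template with distinct antecedent predicates."""
    predicates = [name for name, _ in rule["antecedent"]]
    return (
        len(predicates) == n_predicates
        and len(set(predicates)) == n_predicates  # no repeated predicate
        and rule["consequent"] == consequent
        and rule["sup"] >= min_sup
        and rule["conf"] >= min_conf
    )


# The rule quoted in the text, encoded in a hypothetical dictionary layout.
found = {
    "antecedent": [("Lives", "Melb"), ("Occupation", "Student")],
    "consequent": ("buys", "Games Software"),
    "sup": 0.45,
    "conf": 0.50,
}
ok = satisfies_metarule(found, 2, ("buys", "Games Software"), 0.40, 0.40)
```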

Multidimensional-guided data mining uses inter-dimensional association rules with no repeated predicates on non-Boolean data (Guenzel, Albrecht, & Lehner, 1999). This approach adds a capability that was not proposed by Kamber, Han, and Chiang (1997): classification based on the granularity levels that exist in the dimensions. The approach also differs slightly from the others in that minimum support is counted based on the quantity of the attribute, not on the frequency of the data. Attributes that do not satisfy the minimum quantity are ignored. For example, ten units of Soap sold in a specific shop during a particular time interval correspond to ten single sales. Although it uses the data warehouse to mine association rules, it still keeps the transaction id from the operational databases when the data is transformed into the data warehouse. An example of an association rule is as follows:


Min Support = 1,000 units
Products(Soap) → Customers(Melb) ∧ Times(Jan’98) {support = 1,250 units}
Confidence = Support(Soap ∧ Melb ∧ Jan’98)/Support(All ∧ Soap ∧ All).

The previous example uses 1,000 units as the minimum support for all levels. Three dimensions are involved: Products, Customers, and Times. The rule means that 1,250 units of Soap were sold to customers in Melbourne in January 1998. The confidence is the support quantity of Soap in Melbourne in January 1998 divided by the support quantity of Soap for all customers in the region and all times.
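Quantity-based support and confidence can be sketched as follows. Only the 1,250 units for Soap in Melbourne in January 1998 and the 1,000-unit minimum support come from the example above; all other rows and the function name are hypothetical.

```python
# Fact rows: (product, customer_city, month, quantity).
# Only the first row's quantity is quoted in the text; the rest are hypothetical.
facts = [
    ("Soap", "Melb", "Jan'98", 1250),
    ("Soap", "Syd", "Jan'98", 900),
    ("Soap", "Melb", "Feb'98", 350),
    ("Shampoo", "Melb", "Jan'98", 400),
]


def quantity_support(rows, product=None, city=None, month=None):
    """Sum the quantity of rows matching the given values; None means 'All'."""
    return sum(
        q for p, c, m, q in rows
        if product in (None, p) and city in (None, c) and month in (None, m)
    )


MIN_SUPPORT = 1000  # units
sup = quantity_support(facts, "Soap", "Melb", "Jan'98")
conf = sup / quantity_support(facts, product="Soap")
is_frequent = sup >= MIN_SUPPORT
```

With these hypothetical rows, Soap sells 2,500 units overall, so the confidence of the rule is 1,250/2,500.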

Mining association rules in data warehouses is our proposed method. Mining association rules in data warehouses, using either non-repeatable or repeatable dimensions, differs from mining association rules in transactional data. It uses more than one dimension, modelled using either a star schema or a multidimensional model. The key to mining association rules in data warehouses is the ability to explore the integrated information in fact tables, since a fact table holds the keys that connect to each dimension together with the measurements of aggregate data. When mining data warehouses, we no longer use the transaction id of transactional resources, or a single table with multiple fields such as a customers table (e.g., age, credit limits, number of cars), to discover the interesting patterns. Instead, we use all the information from the dimensions and the measurement of aggregate data (e.g., quantity) with which we want to find the interesting patterns. In short, this mining concept is suitable for non-Boolean data and summarised data, and uses data warehouses as the main resource for mining association rules. It is capable of handling interval or classification levels of data abstraction and all attributes that have been defined on its dimensions, and it is applicable only to an inter-association rule relationship with either distinct predicates or repeatable predicates.

As shown in Table 5, each mining type is classified according to its dimensionality, level capability and predicates. Boolean association rules, generalised association rules and multilevel association rules work only on an intra-dimension association relationship (they rely on a single dimension), whereas quantitative association rules, metarule-guided association rules, multidimensional data mining, and mining association rules in data warehouses (our proposed method) are capable of an inter-dimension association relationship (they rely on more than one dimension). In terms of level capability, generalised association rules, multilevel association rules, multidimensional data mining and mining association rules in data warehouses are capable of handling both interval and classification data on different levels of data abstraction, while the rest have only a single level of data abstraction. In terms of the occurrence of predicates, only mining association rules in data warehouses (our proposed method) can have either distinct predicates or repeatable (hybrid) predicates.

Table 5. Comparison of mining association rules methods across dimension, level and predicate

| Mining Type | Dimension: Intra | Dimension: Inter | Level: Single | Level: Multiple | Predicate: Non-Hybrid | Predicate: Hybrid |
| --- | --- | --- | --- | --- | --- | --- |
| Boolean Association Rules | • | | • | | • | |
| Generalized Association Rules | • | | | • | • | |
| Multilevel Association Rules | • | | | • | • | |
| Quantitative Association Rules | | • | • | | • | |
| Metarule-guided Association Rules | | • | • | | • | |
| Multidimensional Data Mining | | • | | • | • | |
| Mining Association Rules in Data Warehouses (Proposed Method) | | • | | • | • | • |

As shown in Table 6, Boolean association rules, generalised association rules, multilevel association rules and quantitative association rules are applied to transactional databases with limited tables, while metarule-guided association rules and the multidimensional data mining technique are suitable for multidimensional databases. Multidimensional databases are databases that broaden the transactional database to handle more than one dimension, but still keep the detailed transaction id from the transaction data. Mining association rules in data warehouses works on both multidimensional databases and data warehouses. Data warehouses can be built using multidimensional models, but they are not the same as multidimensional databases, which are built from transactional data. Data warehouses do not have the same structure as their transactional resources, since in data warehouses all the data are summarised and aggregated.

As shown in Table 7, mining association rules in data warehouses can handle both non-Boolean data and summarised data (e.g., quantity, total prices). Summarised data is one of the characteristics of data warehouses. Metarule-guided association rules and the multidimensional data mining technique work on non-Boolean data and can handle quantitative or categorical attributes, while the rest can handle Boolean data only.

Table 6. Comparison of mining association rules methods based on the database type

| Mining Type | Transaction | Multi Dimension | Data Warehouses |
| --- | --- | --- | --- |
| Boolean Association Rules | • | | |
| Generalized Association Rules | • | | |
| Multilevel Association Rules | • | | |
| Quantitative Association Rules | • | | |
| Metarule-guided Association Rules | | • | |
| Multidimensional Data Mining | | • | |
| Mining Association Rules in Data Warehouses (Proposed Method) | | • | • |

Table 7. Comparison of mining association rules methods based on the data type

| Mining Type | Boolean | Non-Boolean | Summarized Data |
| --- | --- | --- | --- |
| Boolean Association Rules | • | | |
| Generalized Association Rules | • | | |
| Multilevel Association Rules | • | | |
| Quantitative Association Rules | • | | |
| Metarule-guided Association Rules | | • | |
| Multidimensional Data Mining | | • | |
| Mining Association Rules in Data Warehouses (Proposed Method) | | • | • |

Finally, the works of Agrawal and Srikant (1994), Han and Fu (1995), Mannila, Toivonen, and Verkamo (1994), Park, Chen, and Yu (1995), Savasere, Omiecinski, and Navathe (1995), and Toivonen (1996) all involve finding frequent itemsets on a single dimension without considering quantitative or categorical attributes. Srikant and Agrawal (1995, 1996) introduce the use of hierarchical, quantitative and categorical attributes for mining association rules, but still use only transactional data. These approaches do not combine mining with the interval or classification taxonomical levels of data abstraction. Furthermore, in Guenzel, Albrecht, and Lehner (1999) and Kamber, Han, and Chiang (1997), more than one predicate or dimension is used. However, transactional data is only modelled in a multidimensional view, without any classification or use of a measurement of aggregate data. Guenzel, Albrecht, and Lehner (1999) do use quantity as one attribute of the measurement of aggregate data, and their approach is also capable of handling classification. However, this approach is not good enough, since using a fixed quantity as the minimum support can prevent the user from finding interesting patterns: the user may not know which quantity is suitable for mining. It would be better to have an algorithm that finds the average quantity of the measurement of data, as this reduces the risk of erasing frequent itemsets that may be important to mine. This has encouraged us to propose a concept for mining association rules in a data warehouse that uses the measurement of data, since all data in a data warehouse has been aggregated. Our approach can handle either distinct predicates or hybrid predicates, and classification or interval data in dimensions.

PRE-PROCESS ALGORITHMS FOR MINING DATA WAREHOUSES

We propose four algorithms that extract data from a data warehouse for use in mining association rules. These algorithms are HAvg, VAvg, ModusFilter, and WMAvg. The algorithms focus on a quantitative attribute of the fact table, such as quantity, in order to prepare initialised data for mining association rules in a data warehouse. We prepare the data from a data warehouse using the following SELECT statement:

Syntax Query:

    SELECT <d1>, <d2>, …, <dm>, SUM(quantity) AS Quantity
    FROM Fact_Table
    WHERE (d1 = duser1) AND (d2 = duser2) AND … AND (dm = duserm)
    GROUP BY <d1>, <d2>, …, <dm>;

Example:

    SELECT Time_id, Channel_id, Prod_id, SUM(quantity) AS Quantity
    FROM Fact_Table
    WHERE (Time_id = 'Jan 1998') AND (Channel_id = 'Direct Sales') AND (Prod_id = 'Men-Jeans')
    GROUP BY Time_id, Channel_id, Prod_id;


The previous example uses three dimensions: Times, Channels, and Products (see Figure 2 for the data warehouse structure). It also uses quantity as the measurement of the summarised data. The query finds all records for January 1998 that are direct sales of the specified men's jeans product.
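The example query can be tried on a small in-memory table. The sketch below uses Python's standard sqlite3 module; the table name fact_table, its column types, and the four rows are hypothetical stand-ins for the fact table of Figure 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_table ("
    "time_id TEXT, channel_id TEXT, prod_id TEXT, quantity INTEGER)"
)
# Hypothetical rows; the real fact table is described in Figure 2.
conn.executemany(
    "INSERT INTO fact_table VALUES (?, ?, ?, ?)",
    [
        ("Jan 1998", "Direct Sales", "Men-Jeans", 8),
        ("Jan 1998", "Direct Sales", "Men-Jeans", 12),
        ("Jan 1998", "Internet", "Men-Jeans", 5),
        ("Feb 1998", "Direct Sales", "Men-Jeans", 7),
    ],
)


def summarise(conn, time_id, channel_id, prod_id):
    """Run the SELECT ... SUM(quantity) ... GROUP BY query for the user's values."""
    query = """
        SELECT time_id, channel_id, prod_id, SUM(quantity) AS quantity
        FROM fact_table
        WHERE time_id = ? AND channel_id = ? AND prod_id = ?
        GROUP BY time_id, channel_id, prod_id
    """
    return conn.execute(query, (time_id, channel_id, prod_id)).fetchall()


rows = summarise(conn, "Jan 1998", "Direct Sales", "Men-Jeans")
```

Passing the user's dimension values as query parameters, rather than concatenating them into the SQL string, is a design choice of this sketch.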

VAvg Algorithm

Using the VAvg algorithm, we find the average quantity of the defined dimensions vertically. The final product of this algorithm is the initialised table VInitTab, which is then used for mining association rules with either non-repeatable or repeatable dimensions in data warehouses.

We use procedure VAvg to build an initialised table based on the vertical average quantity (see Figure 5). Before creating the initialised table, we create an average quantity table (VAvgTab) to store each value of dimension m with its quantity, count, and average (see Table 8). The first looping operation (lines 2-8) uses the function Check_Key to check whether the key of the selected row (Dbi) in the selected fact table (Db) already exists in VAvgTab. If the return value is true, we update the content of table VAvgTab (line 4), that is, the quantity, count and average for the selected dm.key. Otherwise, we insert a new record into table VAvgTab (line 6). The second looping operation (lines 9-13) uses the function Check_qtyVavg to find all rows in the selected fact table (Db) that satisfy the minimum quantity, which is the average stored in table VAvgTab. If a row satisfies the minimum quantity, we insert a new row into table VInitTab.

Table 8. Notations for VAvg, HAvg, and ModusFilter algorithms

| Notation | Description |
| --- | --- |
| Db | Fact table; contains (dm-1.key, dm.key, quantity) |
| N | Total rows in fact table |
| VAvgTab | Vertical average quantity table; contains (dm.key, dm.quantity, dm.count, dm.avg) |
| VInitTab | Vertical initialised table; contains (dm-1.key, dm.key, dm.quantity) |
| HAvgTab | Horizontal average quantity table; contains (dm-1.key, dm-1.quantity, dm-1.count, dm-1.avg) |
| HInitTab | Horizontal initialised table; contains (dm-1.key, dm.key, dm.quantity) |
| TmpModusFilterTab | Temporary table; contains (dm.key, dm.count, dm.quantity) |
| ModusFilterTab | Modus quantity table; contains (dm.key, dm.modusqty) |
| ModusInitTab | Initialised table; contains (dm-1.key, dm.key, modusqty) |

Figure 5. The VAvg Algorithm

    1.  Procedure VAvg
    2.  For I = 1 to N loop
    3.    IF Check_Key(Dbi) then
    4.      Update_VAvgTab(Dbi);
    5.    Else
    6.      Insert_VAvgTab(Dbi);
    7.    End if;
    8.  End loop;
    9.  For J = 1 to N loop
    10.   IF Check_qtyVavg(Dbj) Then
    11.     Insert_VInitTab(Dbj);
    12.   End if;
    13. End Loop;

We explain our VAvg algorithm in Figure 6. Here we use only two dimensions: Times and Products. The Times dimension works as a grouping dimension, so the attributes of the Products dimension are grouped according to their time frames (see Figure 2 for details of these dimensions). Suppose we want to find the average quantity of the Products dimension vertically, from the first row of the fact table to the last row. First, we compute the average quantity of product_id = 1 from the first row (May 2000) to the last row (Nov 2000). The average quantity of that product is 14.25. Then we eliminate all rows of that particular product whose quantity is less than its average quantity. Finally, we repeat the same process until we have computed the average quantity of every product in the Products dimension.
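The two passes of VAvg can be sketched in Python as follows. This is an in-memory approximation rather than the table-based procedure of Figure 5: the dictionaries stand in for VAvgTab and the returned list for VInitTab. The rows are hypothetical and were chosen only so that product 1 averages 14.25, as in the text.

```python
from collections import defaultdict


def vavg(fact_rows):
    """fact_rows: (group_key, item_key, quantity), i.e. (d_{m-1}, d_m, quantity).

    Returns the per-item averages and the rows whose quantity is at least
    the average quantity of their item_key.
    """
    total, count = defaultdict(float), defaultdict(int)
    for _, item, qty in fact_rows:  # first pass: build the averages (VAvgTab)
        total[item] += qty
        count[item] += 1
    avg = {item: total[item] / count[item] for item in total}
    # second pass: keep rows that satisfy the minimum quantity (VInitTab)
    kept = [row for row in fact_rows if row[2] >= avg[row[1]]]
    return avg, kept


# Hypothetical fact rows: (month, product_id, quantity).
rows = [
    ("May 2000", 1, 10), ("Jun 2000", 1, 20),
    ("Aug 2000", 1, 12), ("Nov 2000", 1, 15),
    ("May 2000", 2, 5), ("Nov 2000", 2, 9),
]
avg, v_init_tab = vavg(rows)
```

Rows equal to the average are kept here, since the text only eliminates rows that are less than the average.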

HAvg Algorithm

Using the proposed HAvg algorithm, we find the average quantity of the defined dimensions horizontally. The data format in the selected fact table is (dm-1.key, dm.key, quantity). We apply the HAvg procedure to create the initialised table (HInitTab) for mining association rules in a data warehouse (see Figure 7). Before creating the initialised table, we create an average quantity table (HAvgTab) to store each value of dimension m-1 with its quantity, count, and average (see Table 8). As shown in Figure 7, the first looping operation (lines 2-8) uses the function Check_Key to check whether the key of the selected row (Dbi) in the selected fact table already exists in HAvgTab (line 3). If the return value is true, we update the content of table HAvgTab (line 4), that is, the quantity, count and average for the selected dm-1.key. Otherwise, we insert a new record into table HAvgTab (line 6). The second looping operation (lines 9-13) uses the function Check_qtyHAvg to find all rows in the selected fact table that satisfy the minimum quantity based on table HAvgTab (line 10). If a row satisfies the minimum quantity, we insert a new row into table HInitTab.

We explain our HAvg algorithm using Table 9, with two dimensions, Products and Customers, to illustrate how the algorithm works (see Figure 2 for details of those dimensions). The Products dimension is used as the grouping dimension. Thus, the attributes of the Customers dimension are grouped by the Products dimension. On the Customers dimension, Country_Id is used as the attribute. First, we compute the average quantity of product_id = 10 across the Customers dimension, starting from the first row. The average quantity of that product is 62.83. Then we delete all attributes of the Customers dimension whose quantity is less than that average [e.g., DE(8), IE(2), NL(32)]. Finally, we continue until we have computed the average quantity of every product.
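A sketch of HAvg differs from VAvg only in the key used for averaging: the grouping dimension dm-1 instead of dm. In the example data below, the quantities for DE, IE and NL are the ones quoted in the text; the other three countries and their quantities are hypothetical and were chosen so that the average matches 62.83.

```python
from collections import defaultdict


def havg(fact_rows):
    """fact_rows: (group_key, item_key, quantity), i.e. (d_{m-1}, d_m, quantity).

    Averages the quantity per group_key and keeps rows at or above that average.
    """
    total, count = defaultdict(float), defaultdict(int)
    for group, _, qty in fact_rows:  # first pass: build the averages (HAvgTab)
        total[group] += qty
        count[group] += 1
    avg = {g: total[g] / count[g] for g in total}
    # second pass: keep rows that satisfy the minimum quantity (HInitTab)
    kept = [row for row in fact_rows if row[2] >= avg[row[0]]]
    return avg, kept


rows = [
    (10, "DE", 8), (10, "IE", 2), (10, "NL", 32),      # quantities quoted in the text
    (10, "US", 150), (10, "UK", 100), (10, "AU", 85),  # hypothetical values
]
avg, h_init_tab = havg(rows)
kept_countries = [country for _, country, _ in h_init_tab]
```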

Figure 6. Example showing how the VAvg algorithm works


ModusFilter Algorithm

Using the ModusFilter algorithm, we find the modus quantity of the defined dimensions vertically. The data format in the selected fact table is (dm-1.key, dm.key, quantity). We apply procedure ModusFilter to create an initialised table (ModusInitTab) for mining association rules in a data warehouse (see Figure 8). Before creating the initialised table, we create a modus quantity table (ModusFilterTab) to store each distinct value of dimension m with its modus quantity, and a temporary table (TmpModusFilterTab) to store each value of dimension m with its quantity and count (see Table 8).

As shown in Figure 8, the first looping operation (lines 2-8) uses the function Check_Key to check whether the selected row (Dbi) of the selected fact table already exists in TmpModusFilterTab (line 3). If the return value is true, we update the contents of table TmpModusFilterTab (line 4). Otherwise, we insert a new record into table TmpModusFilterTab (line 6). Next, procedure Insert_ModusFTab (line 9) finds, for each distinct value of dimension m, the quantity with the highest count, and inserts that record (e.g., dm.key and quantity) into table ModusFilterTab. The second looping operation (lines 10-14) uses the function Check_ModusF to find all rows in the selected fact table that satisfy the modus quantity (line 11). If a record satisfies the modus quantity, we insert a new record into table ModusInitTab (line 12).

Figure 7. Algorithm HAvg

    1.  Procedure HAvg
    2.  For I = 1 to N loop
    3.    IF Check_Key(Dbi) then
    4.      Update_HAvgTab(Dbi);
    5.    Else
    6.      Insert_HAvgTab(Dbi);
    7.    End if;
    8.  End loop;
    9.  For J = 1 to N loop
    10.   IF Check_qtyHAvg(Dbj) Then
    11.     Insert_HInitTab(Dbj);
    12.   End if;
    13. End Loop;

Table 9. An example of how the proposed HAvg algorithm works


As shown in Figure 9, we use the ModusFilter algorithm to find the modus quantity of the Products dimension vertically, from the first row of the fact table to the last row. This example uses Times as the dm-1 dimension and Products as the dm dimension. First, we search for the modus quantity of products.key = 1 from the first row (5 Jan 2000) to the last row (30 Feb 2000). The modus quantity is found to be 20. Then, we delete all rows of that specific product id whose quantity is not equal to its modus quantity. Finally, we continue until we have found the modus quantity of each product on the Products dimension that is involved in the user-selected dimensions.
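The ModusFilter steps can be sketched with a counter per product. The rows below are hypothetical, chosen so that product 1 has a modus quantity of 20 as in the text. When two quantities are equally frequent, this sketch keeps the one encountered first; the text does not specify a tie-breaking rule, so that behaviour is an assumption.

```python
from collections import Counter, defaultdict


def modus_filter(fact_rows):
    """fact_rows: (group_key, item_key, quantity), i.e. (d_{m-1}, d_m, quantity).

    Returns the modus (most frequent) quantity per item_key and the rows
    whose quantity equals the modus quantity of their item_key.
    """
    by_item = defaultdict(Counter)
    for _, item, qty in fact_rows:  # count each quantity per item (TmpModusFilterTab)
        by_item[item][qty] += 1
    # most frequent quantity per item (ModusFilterTab)
    modus = {item: c.most_common(1)[0][0] for item, c in by_item.items()}
    # rows that satisfy the modus quantity (ModusInitTab)
    kept = [row for row in fact_rows if row[2] == modus[row[1]]]
    return modus, kept


# Hypothetical fact rows: (date, product_key, quantity).
rows = [
    ("5 Jan 2000", 1, 20), ("9 Jan 2000", 1, 15), ("14 Jan 2000", 1, 20),
    ("2 Feb 2000", 1, 10), ("8 Feb 2000", 1, 20),
    ("5 Jan 2000", 2, 5), ("9 Jan 2000", 2, 5), ("2 Feb 2000", 2, 7),
]
modus, modus_init_tab = modus_filter(rows)
```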

Figure 8. The ModusFilter algorithm

    1.  Procedure ModusFilter
    2.  For I = 1 to N loop
    3.    IF Check_Key(Dbi) then
    4.      Update_TmpModusFTab(Dbi);
    5.    Else
    6.      Insert_TmpModusFTab(Dbi);
    7.    End if;
    8.  End loop;
    9.  Insert_ModusFTab(TmpModusFTab);
    10. For J = 1 to N loop
    11.   IF Check_ModusF(Dbj) Then
    12.     Insert_ModusInitTab(Dbj);
    13.   End if;
    14. End Loop;

Figure 9. An example of how the ModusFilter algorithm works

WMAvg Algorithm

The WMAvg algorithm finds the weighted moving average quantity of the defined dimensions vertically, in order to prepare initialised data for mining association rules in a data warehouse. We compute the weighted moving average quantity in the fact table on m dimensions. We prune all rows in the fact table whose quantity is less than the weighted moving average quantity, since we assume that such rows will not form any interesting rules.

We apply the WMAvg algorithm to create an initialised table WMAInitTab for mining hybrid association rules in a data warehouse (see Figure 10). We need to create a weighted moving average quantity table (WMAvgTab) to store the distinct dimension data with its weighted moving average quantity, and a temporary table TmpWMA to store the distinct dimension data with its total record count (see Table 10). Line 6 calculates the sum of the weights. Line 9 multiplies the quantity of each selected fact table record by its weight. Line 13 inserts a new record into table WMAvgTab. Line 16 checks which records in the selected fact table satisfy the weighted moving average quantity stored in table WMAvgTab. Line 17 inserts a new record into table WMAInitTab.

We have user input dimensions and variables, as follows (see Figure 2 for the detailed data warehouse structure):

UserVar = {Times Dimension, Channels Dimension, Products Dimension}
{d1(‘January 1998’), d2(‘Direct Sales’), d3(‘Men-Jeans’)}

Table 10. Notations

| Notation | Meaning |
| --- | --- |
| D | Set of dimensions and their values {d1, d2, …, dm} |
| ComDim | Combined dimensions and their values {d2, d3, …, dm} |
| FactTab | Fact table {D, quantity} |
| WMAvgTab | Weighted moving average quantity table {D, MAvg_Qty} |
| TmpWMA | Temporary weighted table {D, count} |
| WMAInitTab | Initialised table {D, quantity} |

Figure 10. WMAvg algorithm

    1.  Procedure WMAvg
    2.  X = {total rows in TmpWMA table};
    3.  Y = {total rows in FactTab table};
    4.  MAvg_qty = Σ{Quantity × Weighted};
    5.  For I = 1 to X Loop  //On Table TmpWMA
    6.    Z = CountWeighted(TmpWMAi.Count);
    7.    CountRec = 1;
    8.    While TmpWMAi !EOF FactTab Loop  //On Fact Table
    9.      Tmp_MAVG_qty = CalculateWeightedValue(FactTab, CountRec/Z);
    10.     CountRec++;
    11.     MAvg_qty += Tmp_MAVG_qty;
    12.   End Loop;
    13.   Insert_WMAvgTab(TmpWMAi, MAVG_qty);
    14. End loop;
    15. For K = 1 to Y loop
    16.   IF Check_WMAvg(FactTabK) Then
    17.     Insert_WMAInitTab(FactTabK);
    18.   End if;
    19. End Loop;

As shown in Table 11, the Channels and Products dimensions are not displayed, since they have the same value in every record; only the Times dimension and the quantity attribute differ between records. To fill the Weighted column, first count how many records exist (here, 11 records). The sum of the record positions is 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 = 66, and the weight of each record is its position divided by 66. For example, the weight of record one is 1/66, the weight of record two is 2/66, and so on. Next, multiply the Quantity and Weighted columns record by record. For example, on record one, the quantity 8 is multiplied by the weight 1/66, giving 0.121. Finally, the weighted moving average quantity is ∑(Quantity × Weighted) = 21.847. After finding the weighted moving average quantity, delete the records whose quantity is less than it.
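The weighting scheme can be sketched in a few lines of Python. Only the first quantity (8) and the record count (11) come from the example; the remaining quantities are hypothetical, so the result differs from the 21.847 reported for Table 11. The function names are illustrative.

```python
def weighted_moving_average(quantities):
    """Weight record i (1-based) by i / (1 + 2 + ... + n) and sum the products."""
    n = len(quantities)
    denom = n * (n + 1) // 2  # 1 + 2 + ... + n, e.g. 66 for 11 records
    return sum(q * (i / denom) for i, q in enumerate(quantities, start=1))


def wmavg_filter(quantities):
    """Keep the quantities that are at least the weighted moving average."""
    wma = weighted_moving_average(quantities)
    return wma, [q for q in quantities if q >= wma]


# 11 records in time order; only the first value (8) is taken from the text.
quantities = [8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
first_term = quantities[0] * (1 / 66)  # contribution of record one
wma, kept = wmavg_filter(quantities)
```

Because later records receive larger weights, recent quantities influence the threshold more than early ones.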

MINING ASSOCIATION RULES IN DATA WAREHOUSES

Mining association rules in data warehouses is categorised into mining non-repeatable association rules and mining repeatable (hybrid) association rules. Non-repeatable association rules in data warehouses are association rules that involve two or more dimensions, where only one dimension can have multiple attributes and the others work as grouping dimensions. Hybrid association rules in data warehouses are association rules that involve two or more repeated dimensions, where two or more of the dimensions can have multiple attributes. In the following sections, we discuss in detail the algorithms for mining non-hybrid and hybrid association rules in data warehouses.

Mining Non-Repeatable Association Rules in Data Warehouses

In order to discover interesting patterns in data warehouses for association rules with non-repeatable predicates, we start from the initialised table (InitTab) produced by the pre-process algorithms in the fourth section. Before we explain the GenNLI algorithm, which discovers large itemsets for non-repeatable association rules in data warehouses, we explain the tables that it uses. Table TabKey holds the distinct values of dimension d1, and table TabProcess keeps each dimension d1 value together with a list of dimension dm values. We do not use the remaining dimension values {d2(Val), d3(Val), …, dm-1(Val)}, since each of them works only as a selector, that is, a dimension that always has the same value (see Figure 11). Table LargeTab stores the frequent large itemsets (see Table 12). The details of our algorithm are as follows (see Figure 12):

Table 11. An example of weighted moving average quantity


• Line 4 checks whether the d1 value already exists in table TabKey, for every d1 value in InitTab.
• Line 5 inserts a new record with the d1 value into table TabKey.
• Lines 9-13 create the list of dimension dm values taken from table InitTab for the selected d1 value from table TabKey.
• Line 14 inserts a new record into table TabProcess with the specific dimension d1 value and its list of dimension dm values.
• Line 16 creates all large itemsets from table TabProcess with the user-specified minimum support and inserts the result into table LargeTab. The idea is similar to the algorithm in Bodon (2003).

After discovering all the large itemsets in table LargeTab, we have the following template for non-hybrid association rules:

d1(Val), d2(Val), …, dm(Val){≥ quantity} → dm(Val){≥ quantity}

These association rules have been explained in the third section. However, we want to emphasise that the quantity used here is the minimum quantity from the initialised table (e.g., VInitTab, HInitTab, ModusInitTab, WMAInitTab).
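The overall flow of GenNLI can be sketched as follows: group the dm values of the initialised table by their d1 value, then run an Apriori-style level-wise search over the resulting lists. This is a simplified in-memory sketch, not the authors' implementation. It assumes that minimum support is an absolute count of d1 values, omits the subset-pruning step of Apriori, and uses hypothetical data and names.

```python
from collections import defaultdict
from itertools import combinations


def gen_nli(init_rows, min_sup):
    """init_rows: (d1_value, dm_value) pairs from the initialised table.

    Returns {itemset: count} for the itemsets of d_m values that occur
    together under at least `min_sup` distinct d1 values.
    """
    tab_process = defaultdict(set)  # d1 value -> set of d_m values
    for d1, dm in init_rows:
        tab_process[d1].add(dm)
    baskets = list(tab_process.values())

    large, k = {}, 1
    candidates = {frozenset([v]) for b in baskets for v in b}
    while candidates:
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        large.update(frequent)
        k += 1
        # join step: unions of two frequent itemsets that contain exactly k items
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    return large


# Hypothetical initialised rows: (month, product).
rows = [
    ("Jan", "A"), ("Jan", "B"), ("Jan", "C"),
    ("Feb", "A"), ("Feb", "B"),
    ("Mar", "A"), ("Mar", "C"),
    ("Apr", "B"),
]
large_tab = gen_nli(rows, min_sup=2)
```

Here the d1 value plays the role that the transaction id plays in transactional mining, which is why no transaction id is needed in the fact table.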

As an example of our proposed algorithm, suppose we want to find large itemsets according to the following dimensions:

Table 12. Notations for non-repeatable association rules in data warehouses

| Notation | Meaning |
|---|---|
| Val | A dimension value |
| D | Set of dimensions and their values {d1(Val), d2(Val),…, dm(Val)} |
| Selector | Set of selector dimensions and their values {d2(Val), d3(Val),…, dm-1(Val)} |
| InitTab | Initialised table {D, quantity} |
| DimTab | Distinct dimensions table with its minimum quantity {D, MinQty} |
| TabKey | Distinct dimension d1 value table {d1(Val)} |
| ListDmValue | List of dimension dm values table {Val1, Val2,…, Valn} |
| ListDmValueQty | List of dimension dm values table with its quantities {Val1(qty), Val2(qty),…, Valn(qty)} |
| TabProcess | Process table {d1(Val), ListDmValue} |
| LargeTab | Large items table {d1(Val), Selector, ListDmValue, MinSup} |

Figure 11. Explanation of dimensions for non-repeatable association rules in data warehouses


Figure 12. GenNLI algorithm

1.  Procedure GenNLI
2.    N = {total attributes of selected dm};
3.    For I = 1 to X Loop //on table InitTab
4.      IF !CheckKey(d1) Then
5.        InsertTabKey(d1);
6.      End IF;
7.    End Loop;
8.    For J = 1 to Y Loop //on table TabKey
9.      Insert into ListDmValuej
10.       Select Val1,Val2,…,ValN
11.       From InitTab a, TabKey b
12.       Where a.(d1)=b.(d1)
13.       And b.(d1)= TabKeyj.(d1);
14.     InsertNonHybridTabProcess(TabKeyj.(d1),ListDmValuej);
15.   End Loop;
16.   Apriori_Gen_ALL(NonHybridTabProcess,LargeTab,MinSup);

Figure 13. Example of how the VAvg algorithm fits into the Apriori algorithm


InitTab = d1(Times Dimension), d2(Channels Dimension), d3(Products Dimension)
TmpTabProcess = d1(Times Dimension), d3(Products Dimension)

Here we use three dimensions: Times, Channels and Products. The Channels dimension works as the selector and is not used in the mining process. We use the Times dimension only as the grouping dimension, and the attributes of the Products dimension are grouped according to the Times dimension. Figure 13 gives an example of using the VAvg and Apriori algorithms to discover the large itemsets in a data warehouse. Suppose a user wants to find rules for the Channels dimension = “Direct Sales,” the Times dimension over the interval May2000…Nov2000, and all members of the Products dimension, with Support = 25% and Confidence = 25%. First, we use the VAvg algorithm to provide the initialised table, and then Apriori uses the initialised table to find large itemsets in the data warehouse. Finally, we generate rules based on the large itemsets.
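The final step, generating rules from the large itemsets under a minimum confidence, can be illustrated with a short Python sketch. It assumes the standard definition of confidence (support of the whole itemset divided by support of the antecedent); the function name generate_rules, the product names and the support values are hypothetical and are not taken from the experiments.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """Return (antecedent, consequent, confidence) triples.

    supports: dict mapping frozenset itemsets to their support (0..1). It is
    assumed to contain every non-empty subset of each itemset, which a
    level-wise search such as Apriori guarantees.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                confidence = sup / supports[antecedent]
                if confidence >= min_conf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical supports for two products grouped by the Times dimension.
supports = {
    frozenset({"Jeans"}): 0.5,
    frozenset({"Shirt"}): 0.25,
    frozenset({"Jeans", "Shirt"}): 0.25,
}
rules = generate_rules(supports, min_conf=0.25)
```

With a 25% minimum confidence both directions of the rule survive here; raising the threshold to 75% would keep only Shirt → Jeans, whose confidence is 0.25 / 0.25 = 1.0.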

Mining Hybrid Association Rules in Data Warehouses

We apply the CombineDims algorithm to combine the selected dimensions {d2, d3,…, dm} from the initialised table (the table produced by the pre-process algorithms) into one distinct mapping code, which is stored in MapTab (see Figure 13). For example, we want to make a mapping code as follows:

(Times Dimension, Channels Dimension/Products Dimension, Mapping Code)
(‘January 1998’, ‘Direct Sales/Men-Jeans’, ‘0001’)

Here we combine two dimensions, Channels and Products, into one new mapping code, ‘0001’. As shown in Figure 14, we apply the algorithm to combine the selected dimensions before mining hybrid association rules. Line 4 checks whether the selected dimensions {d2, d3,…, dm} already have a mapping code. Line 5 generates and stores a new mapping code for the selected dimensions. Line 9 searches table MapTab for the mapping code of the selected dimensions on table DimTab. Line 10 inserts a new record into table HybridTab using three parameters: the selected d1.key from the initialised table, the mapping code and the quantity.

After creating table HybridTab, we use that table in the GenHLI algorithm to discover large itemsets when mining hybrid association rules in data warehouses (see Figure 15). Here are the details of our algorithm:

Table 13. Notations for hybrid association rules in data warehouses

| Notation | Meaning |
|---|---|
| Val | A dimension value |
| D | Set of dimensions and their values {d1(Val), d2(Val),…, dm(Val)} |
| ComDim | Combined dimensions and their values {d2(Val), d3(Val),…, dm(Val)} |
| InitTab | Initialised table {D, quantity} |
| HybridTabKey | Hybrid key table {d1} |
| MapTab | Mapping table {MapCode, ComDim} |
| HybridTab | Hybrid table {d1, MapCode, quantity} |
| HybridTabProcess | Process hybrid table; contains {d1, List of MapCode} |
| TmpLargeTab | Temporary large itemsets table {d1(Val), List of MapCode, Sup} |
| LargeHybridTab | Large itemsets table {d1(Val), List of ComDim, MinSup} |


• Line 3 checks whether the d1 value already exists in table HybridTabKey.
• Line 4 inserts a new record, the d1 value, into table HybridTabKey.
• Lines 8-12 create the list of mapping codes taken from table HybridTab for the d1 value selected from table HybridTabKey.
• Line 13 inserts a new record into table HybridTabProcess.
• Line 15 creates all large itemsets from table HybridTabProcess with the user-specified minimum support and inserts the result into table TmpLargeTab. The idea is similar to the algorithm in Bodon (2003).
• Line 16 translates the mapping codes in table TmpLargeTab back into the real dimension values. All the large itemsets, after the codes have been mapped back, are stored in table LargeHybridTab.

After discovering all the hybrid large itemsets in table LargeHybridTab, we have our hybrid association rule template as follows:

d1(Val), d2(Val),…, dm(Val) {≥ quantity} → d2(Val),…, dm(Val) {≥ quantity}

The explanations of these hybrid association rules have been discussed in the third section. However, we want to emphasise that the quantity used here is the minimum quantity based on the initialised table (e.g., VInitTab, HInitTab, ModusInitTab, WMAInitTab).

Figure 14. CombineDims algorithm

1.  Procedure CombineDimensions
2.    X = {total rows of table InitTab};
3.    For I = 1 to X Loop //on table InitTab
4.      IF !CheckMapCode(d2,d3,…,dm) Then
5.        GenMapCode(d2,d3,…,dm);
6.      End IF;
7.    End Loop;
8.    For J = 1 to X Loop //on table InitTab
9.      S=FindMapCode(d2,d3,…,dm);
10.     InsertHybridTab(InitTabj(d1.key),S,InitTabj(Qty));
11.   End Loop;
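The two passes of Figure 14 can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: each InitTab row is assumed to be a (d1 key, tuple of combined dimension values, quantity) triple, the zero-padded four-digit code format is assumed from the ‘0001’ example, and the name combine_dims and the quantities are hypothetical.

```python
def combine_dims(init_tab):
    """Map each distinct (d2, ..., dm) combination to one mapping code.

    init_tab: list of (d1_key, dims_tuple, quantity) rows (assumed layout).
    Returns (map_tab, hybrid_tab), where map_tab maps dims_tuple -> code.
    """
    map_tab = {}
    # Lines 3-7 of Figure 14: generate a code for every unseen combination.
    for _d1, dims, _qty in init_tab:
        if dims not in map_tab:
            map_tab[dims] = f"{len(map_tab) + 1:04d}"
    # Lines 8-11: rewrite each row as (d1 key, mapping code, quantity).
    hybrid_tab = [(d1, map_tab[dims], qty) for d1, dims, qty in init_tab]
    return map_tab, hybrid_tab


# Hypothetical rows; the quantities are made up for illustration.
init_tab = [
    ("January 1998", ("Direct Sales", "Men-Jeans"), 10),
    ("January 1998", ("Internet", "Men-Jeans"), 4),
    ("February 1998", ("Direct Sales", "Men-Jeans"), 7),
]
map_tab, hybrid_tab = combine_dims(init_tab)
```

Here the combination (Direct Sales, Men-Jeans) receives the code "0001" on first sight and reuses it in the February row, so HybridTab carries one compact code per combination instead of several dimension columns.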

Figure 15. GenHLI algorithm

1.  Procedure GenHLI
2.    For I = 1 to X Loop //on table HybridTab
3.      IF !CheckKey(d1) Then
4.        InsertHybridTabKey(d1);
5.      End IF;
6.    End Loop;
7.    For J = 1 to Y Loop //on table HybridTabKey
8.      Insert into ListMapCodej
9.        Select MapCode1,MapCode2,…,MapCodem
10.       From HybridTab a, HybridTabKey b
11.       Where a.(d1)=b.(d1)
12.       And b.(d1)= HybridTabKeyj.(d1);
13.     InsertHybridTabProcess(HybridTabKeyj.(d1),ListMapCodej);
14.   End Loop;
15.   Apriori_Gen_ALL(HybridTabProcess,TmpLargeTab,MinSup);
16.   Trans_MapCode(TmpLargeTab,MapTab,LargeHybridTab);
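A compact Python sketch of Figure 15 is given below, under several assumptions: HybridTab rows are (d1, mapping code, quantity) triples, MapTab is represented as a dictionary from mapping code to the combined dimension values, and a brute-force subset count stands in for Apriori_Gen_ALL (adequate only for tiny inputs). The names gen_hli and large_itemsets and all data values are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def large_itemsets(transactions, min_sup):
    """Brute-force stand-in for Apriori_Gen_ALL; suitable for tiny inputs only."""
    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            counts.update(frozenset(c) for c in combinations(sorted(t), size))
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_sup}


def gen_hli(hybrid_tab, map_tab, min_sup):
    """hybrid_tab: (d1, map_code, quantity) rows; map_tab: code -> dimension values."""
    # Lines 2-14 of Figure 15: one set of mapping codes per d1 value.
    process = defaultdict(set)
    for d1, code, _qty in hybrid_tab:
        process[d1].add(code)
    # Line 15: large itemsets over mapping codes (TmpLargeTab).
    tmp_large = large_itemsets(list(process.values()), min_sup)
    # Line 16: translate the codes back into the real dimension values.
    return {
        frozenset(map_tab[code] for code in codes): sup
        for codes, sup in tmp_large.items()
    }


# Hypothetical input data.
hybrid_tab = [
    ("January 1998", "0001", 10),
    ("January 1998", "0002", 5),
    ("February 1998", "0001", 7),
]
map_tab = {
    "0001": ("Direct Sales", "Men-Jeans"),
    "0002": ("Internet", "Women-Shirts"),
}
large_hybrid = gen_hli(hybrid_tab, map_tab, min_sup=0.6)
```

Mining over the short codes and translating back only at the end keeps the itemset search independent of how many dimensions were combined.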


PERFORMANCE EVALUATION

In our performance experiments, we use a sample sales data warehouse (Lowry et al., 1992), which contains five dimensions (customers, products, promotions, times and channels) and one sales fact table (see Table 14). For the detailed structure of the sales data warehouse, see Figure 2. The data warehouse is built using relational OLAP (ROLAP) and is modelled as a star schema, which contains dimension tables for the hierarchies and a fact table for the dimensional attributes and measures. We perform our experiments on a 1.8 GHz Pentium IV CPU with 512 MB of memory. We use Oracle9i Database as the data warehouse repository.

Pre-Process Algorithms Results

We count the number of rows pruned by the VAvg, HAvg, WMAvg and ModusFilter algorithms and compare the effectiveness of those approaches with a No Method approach, using a single attribute and an interval or classification across dimensions. As shown in Figure 16, we compare the number of rows produced by our approaches using a single attribute with that of No Method. From one to four dimensions there is a significant reduction in rows when compared with No Method. On one-dimensional data, ModusFilter reduces the rows by about 89% compared with the No Method approach, while the other approaches reduce them by up to 60%. On two-dimensional data, ModusFilter reduces the rows by about 82% and the other approaches by up to 59%. On three- and four-dimensional data, ModusFilter reduces the rows by about 70%, which is lower than for one- and two-dimensional data. Overall, this experiment shows a similar trend across dimensions: most of our proposed methods reduce the rows by up to 60%, except for ModusFilter, which reduces the rows by up to 89% for one-dimensional data and then gradually less, to about 70% for four-dimensional data. In contrast, when five dimensions are used, all approaches produce a number of rows similar to that of the No Method approach. This is because the number of rows involved in these dimensions is small.
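The reduction percentages quoted above follow from the row counts reported for Figure 16. As a small worked check in Python (the helper name reduction_pct is hypothetical; the row counts are the one-dimensional values reported with Figure 16):

```python
def reduction_pct(no_method_rows, method_rows):
    """Percentage of rows pruned relative to the No Method baseline."""
    return 100.0 * (no_method_rows - method_rows) / no_method_rows


# One-dimensional row counts as reported for Figure 16.
no_method = 787766
modus_filter = round(reduction_pct(no_method, 85671), 1)   # 89.1
vavg = round(reduction_pct(no_method, 322844), 1)          # 59.0
```

The same calculation applies to every other percentage in this section: subtract the method's count from the No Method count and divide by the No Method count.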

As shown in Figure 17, we compare our approaches with No Method using classification or interval attributes across five dimensions. The results show that our proposed methods significantly reduce the number of rows produced when compared with the No Method approach. The overall outcome of this experiment is similar to the results shown in Figure 16: most of our proposed methods reduce the rows by up to 60% from one-dimensional to five-dimensional data, except for ModusFilter, which reduces the rows by up to 88% for one-dimensional data and then gradually less, to about 68% for five-dimensional data.

Table 14. Sales data warehouse specifications

| Table Name | Records |
|---|---|
| Customers Dimension | 50,000 |
| Products Dimension | 501 |
| Promotions Dimension | 10,000 |
| Times Dimension | 1,030 |
| Channels Dimension | 5 |
| Sales Fact Table | 1,016,271 |

Figure 16. Comparison of our proposed methods and No Method using single attribute

Trends of Using Single Attribute (number of rows by number of dimensions)

| Method | 1 Dim | 2 Dim | 3 Dim | 4 Dim | 5 Dim |
|---|---|---|---|---|---|
| VAvg | 322844 | 97076 | 45178 | 38890 | 131 |
| HAvg | 324917 | 97737 | 42982 | 36951 | 168 |
| WMAvg | 318910 | 96712 | 45001 | 38792 | 131 |
| ModusFilter | 85671 | 41112 | 27773 | 26014 | 159 |
| No Method | 787766 | 234702 | 106523 | 91219 | 168 |

Figure 17. Comparison of our proposed methods and No Method using interval or classification attribute

Trends of Using Interval or Classification Attribute (number of rows by number of dimensions)

| Method | 1 Dim | 2 Dim | 3 Dim | 4 Dim | 5 Dim |
|---|---|---|---|---|---|
| VAvg | 76807 | 49154 | 26281 | 22103 | 7093 |
| HAvg | 76968 | 49237 | 25920 | 21695 | 6766 |
| WMAvg | 75837 | 48937 | 26307 | 22060 | 7090 |
| ModusFilter | 20868 | 15624 | 15011 | 13235 | 5652 |
| No Method | 186428 | 119026 | 68868 | 57836 | 18092 |

Furthermore, as shown in Figure 18, we use a combination of five dimensions with interval and single attributes and compare the effectiveness of our approaches. For the Times dimension, HAvg reduces the rows by 31% compared with the No Method approach, while the others produce a similar number of rows to No Method. For the Products dimension, VAvg and WMAvg reduce the rows by 22%, while the others produce a similar number of rows to No Method. For the Channels dimension, ModusFilter reduces the rows by 65% and VAvg and WMAvg by 59%, while HAvg reduces them by only 10%. For the Customers and Promotions dimensions, our proposed methods reduce the rows very little or produce a number similar to the No Method approach. Overall, this experiment shows that fewer rows are pruned by our proposed methods relative to the No Method approach. This can be explained by the small amount of data involved in this experiment. Our proposed methods work well with a high volume of data. This is ideal, since in real life we need a huge volume of data to find interesting patterns extracted from data warehouses.

Figure 18. Comparison of our proposed methods and No Method using combinations of five dimensions

Trends of Using Combinations of Five Dimensions (number of rows by dimension)

| Method | Times | Products | Channels | Customers | Promotions |
|---|---|---|---|---|---|
| VAvg | 158 | 131 | 45 | 27 | 29 |
| HAvg | 108 | 168 | 100 | 34 | 45 |
| WMAvg | 158 | 131 | 45 | 27 | 29 |
| ModusFilter | 158 | 159 | 39 | 34 | 23 |
| No Method | 158 | 168 | 112 | 35 | 49 |

Mining Non-Repeatable Association Rules Results

Here, we count the number of rows pruned by the VAvg, HAvg, WMAvg and ModusFilter algorithms and compare the effectiveness of those approaches with the No Method approach across three-dimensional to five-dimensional data. We use those data as the initialised table for mining non-repeatable association rules in Figures 20, 21 and 22. As shown in Figure 19, for three-dimensional to five-dimensional data, VAvg, HAvg and WMAvg prune the rows by up to 59%. ModusFilter prunes 88% of the rows for three- and four-dimensional data and 86% for five-dimensional data, which is 2 percentage points lower than for the other dimensions. Overall, the results of this experiment show that our proposed methods remove a significant number of rows compared with the No Method approach, with ModusFilter eliminating more rows than the others.

As shown in Figure 20, we use three-dimensional data with four different minimum supports (0.75%, 1%, 1.5% and 2%) to discover large itemsets for non-repeatable association rules in data warehouses using our proposed methods, in comparison with the No Method approach. For all minimum supports, ModusFilter produces 99% fewer large items than the No Method approach, while the other approaches discover 99% to 96% fewer large items as the minimum support goes from 0.75% to 2%.

Figure 19. Comparison of our proposed methods and No Method data for mining normal association rules

Dimensional Trends (number of rows by number of dimensions)

| Method | 3 Dims | 4 Dims | 5 Dims |
|---|---|---|---|
| No Method | 8948 | 7668 | 5351 |
| VAvg | 3690 | 3157 | 2193 |
| HAvg | 3773 | 3241 | 2339 |
| WMAvg | 3634 | 3111 | 2171 |
| ModusFilter | 1023 | 897 | 699 |

Figure 20. Comparison of our proposed methods and No Method on 3-dimensional association rules

3 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 37644 | 16566 | 6096 | 3530 |
| VAvg | 429 | 274 | 152 | 114 |
| HAvg | 346 | 225 | 150 | 111 |
| WMAvg | 410 | 253 | 152 | 110 |
| ModusFilter | 55 | 46 | 39 | 29 |

As shown in Figure 21, we use four-dimensional data with four different minimum supports (0.75%, 1%, 1.5% and 2%) to discover large itemsets for non-repeatable association rules in data warehouses using our proposed methods, in comparison with the No Method approach. For a minimum support of 0.75%, VAvg and WMAvg discover 98% fewer large items than the No Method approach, while HAvg and ModusFilter produce 99% fewer. For minimum supports of 1% to 2%, ModusFilter discovers between 99% and 98% fewer large items than the No Method approach, while the others produce 97% to 95% fewer.

As shown in Figure 22, we use five-dimensional data with four different minimum supports (0.75%, 1%, 1.5% and 2%) to discover large itemsets for non-repeatable association rules in data warehouses using our proposed methods, in comparison with the No Method approach. As the minimum support goes from 0.75% to 2%, the reduction in discovered large items achieved by ModusFilter falls from 98% to 92%, in steps of about 2 percentage points. Meanwhile, the reduction achieved by HAvg falls from 96% to 90%, also in steps of about 2 percentage points, and the others produce 95% to 88% fewer large items than the No Method approach.

Figure 21. Comparison of our proposed methods and No Method on 4-dimensional association rules

4 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 25614 | 7692 | 3253 | 1872 |
| VAvg | 317 | 229 | 136 | 87 |
| HAvg | 246 | 165 | 115 | 83 |
| WMAvg | 302 | 218 | 135 | 88 |
| ModusFilter | 56 | 51 | 40 | 33 |

Figure 22. Comparison of our proposed methods and No Method on 5-dimensional association rules

5 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 3300 | 1338 | 629 | 407 |
| VAvg | 165 | 124 | 65 | 48 |
| HAvg | 118 | 81 | 53 | 37 |
| WMAvg | 162 | 110 | 66 | 49 |
| ModusFilter | 54 | 45 | 35 | 31 |


Overall, the experiments on discovering large items for mining non-repeatable association rules in data warehouses (Figures 20, 21 and 22) show that our proposed methods discover up to 99% fewer large items than the No Method approach, with the ModusFilter algorithm producing fewer large items than the other proposed methods.

Mining Hybrid Association Rules Results

For mining hybrid association rules, we use the data from Figure 23 as the initialised data to discover the large items shown in Figures 24, 25 and 26. We use our proposed methods, the VAvg, HAvg, WMAvg and ModusFilter algorithms, and compare their effectiveness with the No Method approach across three-dimensional to five-dimensional data. As shown in Figure 23, for three- and four-dimensional data, VAvg, HAvg and WMAvg prune between 52% and 58% of the rows, a spread of 6 percentage points, while ModusFilter prunes 76% and 65% of the rows, respectively. For five-dimensional data, HAvg prunes 57%, VAvg and WMAvg prune about 46% according to the row counts in Figure 23, and ModusFilter eliminates the lowest percentage (44%). Overall, the results of this experiment show that our proposed methods remove a significant share of rows, on average 60%, compared with the No Method approach, although the share eliminated by ModusFilter declines more than that of the other methods from three-dimensional to five-dimensional data.

Figure 23. Comparison of our proposed methods and No Method data for mining hybrid association rules

Dimensional Trends (number of rows by number of dimensions)

| Method | 3 Dims | 4 Dims | 5 Dims |
|---|---|---|---|
| No Method | 11785 | 11985 | 12539 |
| VAvg | 5110 | 5675 | 6738 |
| HAvg | 4894 | 4973 | 5270 |
| WMAvg | 5089 | 5664 | 6701 |
| ModusFilter | 2765 | 4143 | 6969 |

Figure 24. Comparison of our proposed methods and No Method on 3-dimensional hybrid association rules

3 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 33019 | 4105 | 1310 | 755 |
| VAvg | 371 | 199 | 114 | 82 |
| HAvg | 334 | 191 | 108 | 79 |
| WMAvg | 396 | 208 | 114 | 83 |
| ModusFilter | 192 | 125 | 59 | 34 |

Figure 25. Comparison of our proposed methods and No Method on 4-dimensional hybrid association rules

4 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 31314 | 2970 | 794 | 473 |
| VAvg | 274 | 155 | 86 | 65 |
| HAvg | 243 | 146 | 81 | 66 |
| WMAvg | 312 | 161 | 88 | 68 |
| ModusFilter | 210 | 77 | 43 | 23 |

Figure 26. Comparison of our proposed methods and No Method on 5-dimensional hybrid association rules

5 Dimensional Data (number of large items by minimum support)

| Method | 0.75% | 1% | 1.50% | 2% |
|---|---|---|---|---|
| No Method | 50609 | 1364 | 170 | 159 |
| VAvg | 253 | 90 | 44 | 18 |
| HAvg | 196 | 97 | 48 | 29 |
| WMAvg | 288 | 109 | 46 | 20 |
| ModusFilter | 1339 | 36 | 9 | 3 |


As shown in Figures 24 and 25, we use three- and four-dimensional data with four different minimum supports (0.75%, 1%, 1.5% and 2%) to discover large itemsets for hybrid association rules in data warehouses using our proposed methods, in comparison with the No Method approach. For a minimum support of 0.75%, all our proposed methods produce 99% fewer large items than the No Method approach. For minimum supports of 1% to 2%, VAvg, HAvg and WMAvg discover 94% to 86% fewer large items than the No Method approach, while ModusFilter produces 97% to 95% fewer.

As shown in Figure 26, we use five-dimensional data with four different minimum supports (0.75%, 1%, 1.5% and 2%) to discover large itemsets for hybrid association rules in data warehouses using our proposed methods, in comparison with the No Method approach. For a minimum support of 0.75%, ModusFilter discovers more large items than the other proposed methods: its reduction relative to No Method is 97%, against 99% for the others. However, for minimum supports of 1% to 2%, ModusFilter discovers fewer large items than the others, up to 95% fewer. This can be explained as follows: at a minimum support of 0.75%, ModusFilter has many large items with a similar support of 0.99%, and when the minimum support is raised to 1%, all the large items with less than 1% support are deleted. As a result, far fewer large items remain at a minimum support of 1% than at 0.75%. Overall, the experiments on discovering large items for mining hybrid association rules in data warehouses (Figures 24, 25 and 26) show that our proposed methods discover up to 99% fewer large items than the No Method approach, which is quite similar to the results for mining non-repeatable association rules in data warehouses in Figures 20, 21 and 22.
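The threshold effect described for ModusFilter can be illustrated with a toy Python example. The support values below are hypothetical and chosen only to mimic many itemsets sitting just under 1%; they are not taken from the experiments, and the name surviving_itemsets is an assumption.

```python
def surviving_itemsets(supports, min_sup):
    """Keep only the itemsets whose support reaches the minimum support."""
    return {name: s for name, s in supports.items() if s >= min_sup}


# Hypothetical supports in percent: three itemsets cluster at 0.99%.
supports = {"A": 0.99, "B": 0.99, "C": 0.99, "D": 1.40, "E": 2.10}

at_075 = surviving_itemsets(supports, 0.75)  # all five itemsets survive
at_100 = surviving_itemsets(supports, 1.0)   # only D and E survive
```

A small change in the threshold therefore removes a whole cluster of itemsets at once, which is consistent with the sharp drop reported between the 0.75% and 1% minimum supports.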

CONCLUSION AND FUTURE WORK

A framework for mining data warehouses is important, as data warehouses are widely used to store integrated databases in business applications. Without a specific framework, it would be difficult to mine the interesting rules hidden in data warehouses.

We have proposed the VAvg, HAvg, WMAvg and ModusFilter algorithms to provide efficient data initialisation for mining association rules in data warehouses by concentrating on the measurement of aggregate data, specifically on its quantity. Those algorithms mainly work by filtering the data taken from a data warehouse. Only data that satisfy the user input variables and either the minimum average quantity (for VAvg, HAvg and WMAvg) or the modus quantity (for ModusFilter) are used. These algorithms are important as pre-process steps to filter data from data warehouses, since data warehouses hold a large volume of data but we only want to find prospective data, which may produce high-quality association rules.

We have also proposed the GenNLI algorithm to discover large itemsets when mining non-repeatable association rules, and the ComDims and GenHLI algorithms to find large itemsets when mining hybrid association rules in data warehouses. Those algorithms take data from our pre-process algorithms (VAvg, HAvg, WMAvg and ModusFilter) to discover interesting patterns in data warehouses.

We have also tested the VAvg, HAvg, WMAvg and ModusFilter algorithms. The overall studies found that, by using our algorithms, we can significantly reduce the number of rows to be used as the initialised table for mining association rules in a data warehouse. Moreover, when our pre-process algorithms are used for mining non-repeatable and hybrid association rules in data warehouses, the experimental results show that up to 99% fewer large itemsets are discovered than with the No Method approach.

For future work, we will consider developing new algorithms to provide a direct connection between the proposed algorithms, which may make the discovery of interesting large itemsets in data warehouses more effective, and we will investigate the quality of rule generation with our proposed algorithms.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In SIGMOD’93, (pp. 207-216).

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB’94), (pp. 487-499).

Bodon, F. (2003). A fast Apriori implementation. In FIMI’03, November.

Chaudhuri, S., & Dayal, U. (1997). An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26, 65-74.

Dunham, M.H. (2003). Data mining: Introductory and advanced topics. Prentice Hall.

Guenzel, H., Albrecht, J., & Lehner, W. (1999). Data mining in a multidimensional environment. In Proceedings of Advances in Database and Information Systems (ADBIS’99), (pp. 191-204).

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In Proceedings of the International Conference Very Large Data Bases (VLDB’95), September, (pp. 420-431).

Inmon, W.H. (1996). Building the Data Warehouse (2nd ed.). New York: John Wiley & Sons, Inc.

Kamber, M., & Han, J. (2001). Data mining: Concepts and techniques. Morgan Kaufmann.

Kamber, M., Han, J., & Chiang, J.Y. (1997). Metarule-guided mining of multidimensional association rules using data cubes. In KDD’97, August, (pp. 207-210).

Kimball, R. (1996). The data warehouse toolkit. New York: John Wiley & Sons.

Lowry, C.A., Woodall, W.H., Champ, C.W., & Rigdon, S.E. (1992). A multivariate exponentially weighted moving average chart. Technometrics, 34.

Mannila, H., Toivonen, H., & Verkamo, A.I. (1994). Efficient algorithms for discovering association rules. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD’94), (pp. 181-192).

Oracle. (2001). Oracle 9i data warehouse guide. Retrieved from http://www.oracle.com

Park, J.S., Chen, M.S., & Yu, P.S. (1995). An effective hash based algorithm for mining association rules. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (ACM-SIGMOD’95), May, (pp. 175-186).

Savasere, A., Omiecinski, E., & Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB’95), (pp. 432-444).


Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In Proceedings of the 21st International Conference on Very Large Databases (VLDB’95), (pp. 407-419).

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (ACM SIGMOD’96), (pp. 1-12).

Toivonen, H. (1996). Sampling large databases for association rules. In Proceedings of the International Conference on Very Large Databases (VLDB’96), (pp. 134-145).

Haorianto Cokrowijoyo Tjioe ([email protected]) is a research student in the School of Business Systems, Faculty of Information Technology, Monash University, Australia. His research interests are data warehousing, data mining, and OLAP. Since January 2004, he has actively published his work in several international conferences and journals.

David Taniar ([email protected]) holds a PhD in computer science, with a particular speciality in high performance databases. His research area has now expanded to data warehousing and mining, mobile and grid databases, and Web information systems. He has published more than 30 journal papers and over 100 conference papers. He has published six books, including the forthcoming Object-Oriented Oracle. Dr. Taniar is now with the School of Business Systems, Faculty of Information Technology, Monash University, Australia. He is editor-in-chief of several international journals, including Data Warehousing and Mining, Business Intelligence and Data Mining, Web and Grid Services, Web Information Systems, Mobile Information Systems, and Mobile Multimedia. He is also a fellow of the Institute for Management Information Systems (FIMIS).