Dr. G. Mathivanan* et al. International Journal Of Pharmacy & Technology
IJPT| June-2016 | Vol. 8 | Issue No.2 | 12088-12099 Page 12088
ISSN: 0975-766X CODEN: IJPTFI
Available Online through www.ijptonline.com
Research Article
MINING CLOSED HIGH UTILITY DATASETS FOR CONCISE AND LOSSLESS TRANSACTIONS
R. Divya1, Dr. G. Mathivanan*2
PG Student, Department of Information Technology, Sathyabama University, Chennai.
Head of the department, Department of Information Technology, Sathyabama University, Chennai.
Email: [email protected]
Received on 29-04-2016 Accepted on 29-05-2016
Abstract
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives, and frequent itemset mining is one of its fundamental research topics. Utility mining is an important data mining task. Although several studies have been carried out, current methods may present too many high utility itemsets to users, which degrades the performance of the mining task in terms of execution time and memory efficiency. This paper introduces a novel framework for analyzing frequent and non-frequent itemsets using the MinEx algorithm and pruning techniques. It finds the high utility itemsets in a set of transactions using the MinEx algorithm, which explores the itemset lattice level-wise, starting from the empty set and stopping at the level of the largest frequent free-sets. Free-sets can be extracted efficiently, even on dense datasets. The framework also mines closed high utility itemsets, which serve as a compact and lossless representation of high utility itemsets. Using the efficient algorithms DAHU (Derive All High Utility itemsets) and MinEx, the approach outperforms the state-of-the-art algorithm.
Keyword: Utility mining; frequent item set; closed+ high utility item set; lossless and concise representation.
I. Introduction
Frequent itemset mining (abbreviated as FIM) is a fundamental research topic in data mining. One of its popular applications is market basket analysis, which refers to the discovery of sets of items (itemsets) that are frequently purchased together by customers. However, in this application, the traditional model of FIM may discover a large number of frequent itemsets with low profit and lose the information on valuable itemsets having low selling frequencies. These problems are caused by the facts that (1) FIM treats all items as having the same importance/unit profit/weight and (2) it assumes that every item in a transaction appears in a binary form, i.e., an item can be either present or absent in a transaction, which does not indicate its purchase quantity in the transaction. Hence, FIM cannot satisfy the requirement of users who want to discover itemsets with high utilities such as high profits.
To address these problems, utility mining has emerged as an important topic in data mining. In utility mining, each item has a weight (e.g. unit profit) and can appear more than once in each transaction. The utility of an itemset represents its importance, which can be measured in terms of weight, profit, cost, quantity or other information depending on the user preference. An itemset is called a high utility itemset (abbreviated as HUI) if its utility is no less than a user-specified minimum utility threshold. Utility mining has a wide range of applications such as website click stream analysis [5], cross-marketing analysis [6] and biomedical domains.
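The definition of itemset utility can be illustrated with a minimal sketch. The toy transactions, the unit profits and the function names below are hypothetical and chosen only for illustration; the utility of an itemset is taken here as the sum of quantity times unit profit over the transactions that contain every item of the itemset.

```python
# Toy data (hypothetical): each transaction maps item -> purchase quantity.
transactions = [
    {"A": 1, "B": 2},
    {"B": 1, "C": 3},
    {"A": 2, "B": 1, "C": 1},
]
unit_profit = {"A": 5, "B": 2, "C": 1}  # weight (unit profit) of each item


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit of the itemset's items over every
    transaction that contains the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def is_high_utility(itemset, transactions, unit_profit, min_utility):
    """An itemset is a HUI if its utility is no less than min_utility."""
    return itemset_utility(itemset, transactions, unit_profit) >= min_utility


print(itemset_utility({"A", "B"}, transactions, unit_profit))
print(is_high_utility({"A", "B"}, transactions, unit_profit, min_utility=20))
```

In this toy database, {A, B} occurs in the first and third transactions, so its utility combines the quantities of A and B in those two transactions only.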
However, mining HUIs is not an easy task since the downward closure property [1] in FIM does not hold in utility mining. The search space cannot be directly pruned to find HUIs as in FIM, since a superset of a low utility itemset can be a high utility itemset. Several studies [2] were proposed for mining HUIs, but they often present such a large number of high utility itemsets to users that comprehension of the results becomes difficult. Meanwhile, the algorithms become inefficient in terms of time and memory requirements. In particular, the performance of the mining task decreases greatly under low minimum utility thresholds or dense databases.
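A small counterexample makes the failure of downward closure concrete. The sketch below uses a hypothetical three-transaction database and hypothetical helper names: support never increases when an item is added to an itemset, but utility can.

```python
# Hypothetical toy database: item -> purchase quantity per transaction.
transactions = [
    {"A": 1, "B": 2},
    {"B": 1, "C": 3},
    {"A": 2, "B": 1, "C": 1},
]
unit_profit = {"A": 5, "B": 2, "C": 1}


def support(itemset, transactions):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if all(i in t for i in itemset))


def utility(itemset, transactions, unit_profit):
    """Total quantity * unit profit of the itemset in its transactions."""
    return sum(
        sum(t[i] * unit_profit[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


min_utility = 10
small, large = {"B"}, {"A", "B"}

# Support is anti-monotone: the superset is never more frequent.
print(support(small, transactions), support(large, transactions))
# Utility is not: the superset of a low utility itemset can be a HUI.
print(utility(small, transactions, unit_profit) >= min_utility)
print(utility(large, transactions, unit_profit) >= min_utility)
```

Here {B} falls below the threshold while its superset {A, B} reaches it, so discarding {B} during the search would lose a high utility itemset.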
To reduce the computational cost in FIM while presenting fewer and more important patterns to users, many studies developed concise representations, such as free sets [3], non-derivable sets [4], maximal itemsets and closed itemsets. These representations successfully reduce the number of itemsets found, but they were developed for frequent itemset mining rather than high utility itemset mining. Therefore, an important research question is “Is it possible to conceive a compact and lossless representation of high utility itemsets inspired by these representations to address the same problems in HUI mining?”.
Answering this question positively is not straightforward. Developing a concise and complete representation of HUIs poses several challenges:
1. Integrating concepts of concise representations from FIM into HUI mining may produce a lossy representation of all HUIs or a representation that is not meaningful to the users.
2. The representation may not achieve a significant reduction in the number of extracted patterns to justify using the representation.
3. Algorithms for extracting the representation may not be efficient. They may be slower than the best algorithms for mining all HUIs.
4. It may be hard to develop an efficient method for recovering all HUIs from the representation.
In this paper, we address all of these challenges by proposing a condensed and meaningful representation of HUIs named Closed+ High Utility Itemsets (Closed+ HUIs), which integrates the concept of closed itemset into HUI mining. Our contributions are four-fold, in correspondence with the four challenges mentioned previously:
1. The proposed representation is lossless thanks to a new structure named utility unit array that allows recovering all HUIs and their utilities efficiently.
2. The proposed representation is also compact. Experiments show that it reduces the number of itemsets by several orders of magnitude, especially for datasets containing long HUIs (up to 800 times).
3. We propose an efficient algorithm, named CHUD (Closed+ High Utility itemset Discovery), to find this representation. It includes three novel strategies named REG, RML and DCM that greatly enhance its performance. Results show that CHUD is much faster than the current best methods for mining all HUIs [2].
4. We propose a top-down method named DAHU (Derive All High Utility itemsets) for recovering all HUIs from the set of Closed+ HUIs. The combination of CHUD and DAHU provides a new way to obtain all HUIs and it outperforms UP Growth [4], the state-of-the-art algorithm for mining HUIs.
The remainder of this paper is organized as follows. In Section II, we introduce the background for compact representations and utility mining. Section III defines the representation of closed+ HUIs and presents our methods.
Existing System
In the existing system, many studies were proposed for mining HUIs, but they often present a large number of high utility itemsets to users. The system used two efficient one-pass algorithms, MHUI-BIT and MHUI-TID, for mining high utility itemsets from data streams within a transaction-sensitive sliding window. Two effective representations of item information and an extended lexicographical tree-based summary data structure are developed to improve the efficiency of mining high utility itemsets. These representations successfully reduce the number of itemsets found, but they are developed for FIM instead of HUI mining.
Proposed System
This system introduces a novel framework for mining closed+ high utility itemsets (CHUIs), which serve as a compact and lossless representation of HUIs. Further, a method called DAHU (Derive All High Utility itemsets) is proposed to recover all high utility itemsets from the set of closed+ high utility itemsets without accessing the original database. Results of experiments on real and synthetic datasets show that CHUD and DAHU are very efficient, with a massive reduction in the number of high utility itemsets. In addition, when all high utility itemsets are recovered by DAHU, the approach combining CHUD and DAHU also outperforms the state-of-the-art algorithms in mining high utility itemsets.
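The idea of recovering utilities without accessing the original database can be sketched as follows. This is only an illustration under a stated assumption, not the DAHU procedure itself: the sketch assumes that the utility unit array of a closed itemset X stores, for each item of X, the total utility of that item over the transactions containing X, and that a subset Y occurring in exactly the same transactions as X has a utility equal to the sum of the array entries of its items. All names and data are hypothetical.

```python
# Hypothetical toy database: item -> purchase quantity per transaction.
transactions = [
    {"A": 1, "B": 2},
    {"B": 1, "C": 3},
    {"A": 2, "B": 1, "C": 1},
]
unit_profit = {"A": 5, "B": 2, "C": 1}


def utility_unit_array(closed_itemset, transactions, unit_profit):
    """Assumed structure: item -> utility of that item summed over the
    transactions that contain the whole closed itemset."""
    array = {item: 0 for item in closed_itemset}
    for t in transactions:
        if all(item in t for item in closed_itemset):
            for item in closed_itemset:
                array[item] += t[item] * unit_profit[item]
    return array


def subset_utility(subset, array):
    """Utility of a subset that appears in exactly the same transactions
    as the closed itemset; no database scan is needed at this point."""
    return sum(array[item] for item in subset)


# {A, B} is closed here: {A} occurs only in transactions that also hold B.
array_ab = utility_unit_array({"A", "B"}, transactions, unit_profit)
print(array_ab)
print(subset_utility({"A"}, array_ab), subset_utility({"A", "B"}, array_ab))
```

The array is built once while the database is available; afterwards the utilities of {A} and {A, B} are obtained by simple additions.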
Related Works
In this section, we introduce the preliminaries related to high utility itemset mining and compact representations.
A. Closed Item set Mining
In this section, we introduce definitions and properties related to closed itemsets and mention relevant methods. Further details on closed itemsets can be found in the literature.
Mining frequent closed itemsets refers to the discovery of all the closed itemsets whose supports are no less than a user-specified threshold. It is well known that the number of frequent closed itemsets is much smaller than the number of frequent itemsets for real-life databases, and that mining frequent closed itemsets can also be much faster and more memory efficient than mining frequent itemsets. The set of closed itemsets is lossless since all frequent itemsets and their supports can be easily derived from it by Property 4 without scanning the original database [5-7]. Many efficient methods were proposed for mining frequent closed itemsets, such as A-Close, CLOSET+, CHARM and DCI-Closed. However, these methods do not consider the utility of itemsets. Therefore, they may present lots of closed itemsets with low utilities to users and omit several high utility itemsets.
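The lossless property of closed itemsets can be checked on a small example. The sketch below is a brute-force illustration, not one of the cited algorithms: an itemset is treated as closed if no proper superset has the same support, and the support of any frequent itemset is recovered as the largest support among the closed itemsets that contain it. Whether this matches the exact wording of the property cited above is an assumption; the data and function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_and_closed(transactions, min_support):
    """Enumerate all frequent itemsets by brute force, then keep those
    with no proper superset of equal support (the closed ones)."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            sup = support(s, transactions)
            if sup >= min_support:
                frequent[s] = sup
    closed = {
        x: sup
        for x, sup in frequent.items()
        if not any(x < y and sup == sup_y for y, sup_y in frequent.items())
    }
    return frequent, closed


def derive_support(itemset, closed):
    """Support recovered from the closed itemsets alone; None means the
    itemset is not frequent."""
    sups = [sup for c, sup in closed.items() if itemset <= c]
    return max(sups) if sups else None


transactions = [frozenset("AB"), frozenset("BC"), frozenset("ABC")]
frequent, closed = frequent_and_closed(transactions, min_support=2)
print(len(frequent), len(closed))
print(derive_support(frozenset("A"), closed))
```

Every frequent itemset of the toy database gets back its exact support from the smaller closed set, without touching the transactions again.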
B. Compact Representations of High Utility Itemsets
To present representative HUIs to users, some concise representations of HUIs were proposed. Chan et al. introduced the concept of utility frequent closed patterns [7]. However, it is based on a definition of high utility itemset that is different from [3] and our work.
Shie et al. proposed a compact representation of high utility itemsets, called maximal high utility itemsets, and the GUIDE algorithm for mining it [7]. A HUI is said to be maximal if it is not a subset of any other HUI. For example, when min_utility = 10, the set of maximal HUIs is {, }. Although this representation reduces the number of extracted HUIs, it is not lossless.
The reason is that the utilities of the subsets of a maximal HUI cannot be known without scanning the database. Besides, recovering all HUIs from maximal HUIs can be very inefficient because many subsets of a maximal HUI can be low utility. Another problem is that the GUIDE algorithm cannot capture the complete set of maximal HUIs.
Calculate the TU and Find TWU
In this phase, the transaction utility is calculated using AprioriCH algorithm techniques. First, find the absolute utility of each item in a transaction: multiply the quantity of the item by its unit profit. The Transaction Utility (TU) is the sum of the absolute utilities of all items in the transaction, so it gives the total utility of the transaction.
TU is then used to compute the transaction-weighted utilization (TWU): the TWU of an itemset is the sum of the TUs of all transactions that contain it.
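The two quantities can be computed as in the following minimal sketch. The transactions, unit profits and function names are hypothetical; the sketch only illustrates the definitions of TU and TWU and is not the AprioriCH implementation.

```python
# Hypothetical toy database: item -> purchase quantity per transaction.
transactions = [
    {"A": 1, "B": 2},
    {"B": 1, "C": 3},
    {"A": 2, "B": 1, "C": 1},
]
unit_profit = {"A": 5, "B": 2, "C": 1}


def transaction_utility(t, unit_profit):
    """TU: sum of quantity * unit profit over all items of a transaction."""
    return sum(qty * unit_profit[item] for item, qty in t.items())


def twu(itemset, transactions, unit_profit):
    """TWU: sum of the TUs of the transactions containing the itemset."""
    return sum(
        transaction_utility(t, unit_profit)
        for t in transactions
        if all(item in t for item in itemset)
    )


tus = [transaction_utility(t, unit_profit) for t in transactions]
print(tus)
print(twu({"A"}, transactions, unit_profit))
print(twu({"A", "B"}, transactions, unit_profit))
```

Because a transaction's TU counts every item in it, the TWU of an itemset is never smaller than its actual utility, which is why it can serve as a safe upper bound for pruning.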
Fig.1 System Design. (Diagram labels: Server; Transaction; Discovery of Item Set; Transaction Analysis; TU Analysis; Find Absolute Utility; Find Transaction Utility; Find HUI; Closed Itemset Discovery; Find Closed High Utility Item Set; Profit Analysis; Graphical Notation.)
CLOSED+ HIGH UTILITY ITEMSET MINING
In this section, we incorporate the concept of closed itemset with high utility itemset mining to develop a representation named closed+ high utility itemset. We theoretically prove that this new representation is meaningful, lossless and not larger than the set of all HUIs.
A. Pushing Closed Property into HUI Mining
The first point that we should discuss is how to incorporate the closed constraint into high utility itemset mining. First, we can define the closure on the utility of itemsets. In this case, a high utility itemset is said to be closed if it has no proper superset having the same utility. However, this definition is unlikely to achieve a high reduction of the number of extracted itemsets since not many itemsets have exactly the same utility as their supersets in real datasets. For example, there are seven HUIs in Example 1 and only one itemset is non-closed, since and u() = u() = 12.
A second possibility is to define the closure on the supports of itemsets. In this case, there are two possible definitions depending on the join order between the closed constraint and the utility constraint:
B. Efficient Discovery of Closed+ High Utility Itemsets
In this section, we present an efficient algorithm named CHUD (Closed+ High Utility itemset Discovery) for mining closed+ HUIs. CHUD is an extension of DCI-Closed [4], one of the best methods for mining closed itemsets, and it also integrates the TWU model and effective strategies to prune low utility itemsets. CHUD consists of two phases. In Phase I, CHUD discovers candidates for closed+ HUIs. In Phase II, the closed+ HUIs are identified from the set of candidates found in Phase I and their utility unit arrays are computed by scanning the database once.
Similar to the DCI-Closed algorithm, CHUD adopts an IT-Tree (Itemset-Tidset pair Tree) to find closed+ HUIs. In an IT-Tree, each node N(X) consists of an itemset X, its Tidset g(X), and two ordered sets of items named PREV-SET(X) and POST-SET(X). The IT-Tree is recursively explored by the CHUD algorithm until all closed itemsets that are HTWUIs (high transaction-weighted utilization itemsets) are generated. Unlike in the DCI-Closed algorithm, each node N(X) of the IT-Tree is attached with an estimated utility value EstU(X).
A data structure called TU-Table (Transaction Utility Table) [3] is adopted for storing the transaction utilities of transactions. It is a list of pairs <R, TU(TR)>, where the first value is a TID R and the second value is the transaction utility of TR. Given a TID R, the value TU(TR) is efficiently retrieved from the TU-Table. Given a node N(X) with its Tidset g(X) and a TU-Table TU, the estimated utility of the itemset X is efficiently calculated by the procedure shown in Figure 1.
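Since the referenced procedure is not reproduced in the text, the following is only a sketch of how such a computation could look. It assumes that EstU(X) is the sum of the transaction utilities of the transactions listed in g(X), which corresponds to the TWU of X; the table contents and the function name are hypothetical.

```python
# TU-Table sketch: TID R -> transaction utility TU(TR) (hypothetical values).
tu_table = {1: 9, 2: 5, 3: 13}


def estimated_utility(tidset, tu_table):
    """Assumed EstU(X): add up TU(TR) for every TID R in the Tidset g(X)."""
    return sum(tu_table[tid] for tid in tidset)


g_x = {1, 3}  # Tidset of some itemset X
print(estimated_utility(g_x, tu_table))
```

With a dictionary keyed by TID, each lookup takes constant time on average, so the cost of the estimate grows only with the size of the Tidset.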
The main procedure of CHUD is named Main and is shown in Figure 2. It takes as parameters a database D and the min_utility threshold. CHUD first scans D once to convert D into a vertical database. At the same time, CHUD computes the transaction utility for each transaction TR and calculates the TWU of items. When a transaction is retrieved, its Tid and transaction utility are loaded into a global TU-Table named GTU. An item is called a promising item if its estimated utility (e.g. its TWU) is no less than min_utility. After the first scan of the database, promising items are collected into an ordered list O = <a1, a2,…,an>, sorted according to a fixed order such as increasing order of support. Only promising items are kept in O since supersets of unpromising items are low utility itemsets. According to [7], the utilities of unpromising items can be removed from the GTU table. This step is performed at line 2 of the Main procedure. Then, CHUD generates candidates in a recursive manner, starting from candidates containing a single promising item and recursively joining items to them to form larger candidates. To do so, CHUD takes advantage of the fact that by using the total order ≺, the complete set of itemsets can be divided into n non-overlapping subspaces, where the k-th subspace is the set of itemsets containing the item ak but no item ai ≺ ak [4]. For each item ak ∈ O, CHUD creates a node N({ak}) and puts items a1 to ak-1 into PREV-SET({ak}) and items ak+1 to an into POST-SET({ak}). Then CHUD calls the CHUD Phase-I procedure for each node N({ak}) to produce all the candidates containing the item ak but no item ai ≺ ak. Finally, the Main procedure performs Phase II on these candidates to obtain all closed+ HUIs.
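The first database scan described above can be sketched as follows. This is a simplified illustration rather than the CHUD implementation: the function names, the toy database and the tie-breaking rule in the item order are assumptions, and the recursive Phase-I exploration is not shown.

```python
from collections import defaultdict


def first_scan(database, unit_profit, min_utility):
    """One pass over a horizontal database {tid: {item: quantity}}.

    Returns the vertical database (item -> Tidset) of promising items,
    the global TU-Table (tid -> transaction utility) and the ordered
    list O of promising items.
    """
    tidsets = defaultdict(set)
    gtu = {}
    item_twu = defaultdict(int)
    for tid, t in database.items():
        tu = sum(qty * unit_profit[item] for item, qty in t.items())
        gtu[tid] = tu
        for item in t:
            tidsets[item].add(tid)
            item_twu[item] += tu  # TWU of the single item

    promising = {i for i in tidsets if item_twu[i] >= min_utility}
    # Remove the utilities of unpromising items from the TU-Table.
    for tid, t in database.items():
        gtu[tid] -= sum(
            qty * unit_profit[item]
            for item, qty in t.items()
            if item not in promising
        )

    # Fixed total order: increasing support, ties broken by item name.
    order = sorted(promising, key=lambda i: (len(tidsets[i]), i))
    vertical = {i: tidsets[i] for i in order}
    return vertical, gtu, order


def root_nodes(order):
    """For each a_k: PREV-SET holds a_1..a_(k-1), POST-SET a_(k+1)..a_n."""
    return [(order[k], order[:k], order[k + 1:]) for k in range(len(order))]


database = {
    1: {"A": 1, "B": 2},
    2: {"B": 1, "C": 3},
    3: {"A": 2, "B": 1, "C": 1},
    4: {"D": 1},
}
unit_profit = {"A": 5, "B": 2, "C": 1, "D": 1}
vertical, gtu, order = first_scan(database, unit_profit, min_utility=10)
for item, prev_set, post_set in root_nodes(order):
    print(item, prev_set, post_set)
```

Item D is unpromising in this toy database, so it is dropped from the ordered list and its utility is subtracted from the TU-Table before any candidate is generated.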
Pseudocode : HUIMining
Ck : candidate itemsets of size k
Lk : frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
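The level-wise loop of the pseudocode can be written as a short runnable sketch. It follows the support-based formulation given above; the candidate generation by pairwise union, the subset-based pruning step and all names are illustrative assumptions rather than the authors' implementation.

```python
from itertools import combinations


def level_wise_mining(transactions, min_support):
    """Level-wise search: build C(k+1) from L(k), count the candidates in
    the database, and keep those reaching min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    # L1: frequent single items.
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # C(k+1): unions of two members of L(k) that have k+1 items.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-item subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent  # union of all L(k)


result = level_wise_mining([{"A", "B"}, {"B", "C"}, {"A", "B", "C"}], 2)
for itemset, sup in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), sup)
```

The subset-based pruning relies on the downward closure property of support, so the same loop cannot be applied to raw utility values; a utility-oriented variant would need an anti-monotone upper bound such as TWU in place of the support count.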
In this section, we compare the performance of CHUD and DAHU with UP Growth [2], which is, to the best of our knowledge, the state-of-the-art method for high utility itemset mining. Although CHUD and UP Growth produce different results, each of them consists of two phases. In Phase I, CHUD and UP Growth respectively generate candidates for CHUIs and HUIs. In Phase II, CHUD and UP Growth respectively identify CHUIs and HUIs from the candidates produced in their Phase I. The combination of CHUD and DAHU is denoted as CHUD+DAHU, which first applies CHUD to find all closed+ high utility itemsets and then uses DAHU to derive all high utility itemsets from the set of closed+ high utility itemsets generated by CHUD. The process of CHUD+DAHU in Phase I is the same as that of CHUD. In Phase II, CHUD+DAHU first identifies CHUIs from the set of candidates and then uses CHUIs to derive all HUIs.
Experiments were performed on a desktop computer with an Intel® Core 2 Quad Processor @ 2.66 GHz running Windows XP and 2 GB of RAM. CHUD and DAHU were implemented in Java. The implementation of UP Growth was obtained from Tseng et al. [2], and is also implemented in Java. All memory measurements were done using the Java API. The real datasets Mushroom and BMSWebView1 were obtained from the FIMI Repository [2]. Foodmart is a real dataset obtained from the Microsoft foodmart 2000 database. Except for the Foodmart dataset, the external and internal utilities of each item are generated with the settings used in [2]. Foodmart already contains unit profits and purchase quantities of items. The total utility of Foodmart is 120,160.84. Table IV shows the characteristics of the above datasets. Mushroom is a real-life dense dataset, each transaction containing 23 items. Foodmart is a real-life sparse dataset from a retail store, with real utility values. BMSWebView1 is a real-life sparse dataset of click-stream data with a mix of short and long transactions (up to 267 items). T10I8D200K is a large sparse dataset with an average transaction length of 10.
A. Experiments on Mushroom Dataset
The first experiment consisted of running UP Growth, CHUD, and DAHU on the Mushroom dataset, while varying min_utility from 100% to 1%. The execution times of UP Growth, CHUD, and CHUD+DAHU are shown in Figure 8 for Phase I and Phase II. Results show that CHUD outperforms UP Growth for both phases, and the performance gap increased as min_utility was set lower. For example, at a low min_utility threshold, CHUD is 50 times faster than UP Growth for Phase I and 63 times faster for Phase II. Moreover, when CHUD is combined with DAHU to discover all high utility itemsets, the combination largely outperforms UP Growth and is only slightly slower than CHUD. The smaller number of candidates generated by CHUD in Phase I is what makes CHUD perform better than UP Growth in Phase II and for the total execution time (because Phase II is much more costly than Phase I [20]). Lastly, we measure the reduction achieved by the representation of closed+ high utility itemsets generated by CHUD compared to the set of all high utility itemsets generated by UP Growth. As shown in Table V, a large reduction is obtained (up to 796 times). Moreover, by running DAHU, it is possible to recover all high utility itemsets.
B. Experiments on Foodmart Dataset
The second experiment consists of running UP Growth, CHUD and DAHU on the Foodmart dataset, while varying min_utility from 0.10% to 0.005% of the total utility in the database. Execution times for Phase I and Phase II are shown in Figure 9. The total execution time of UP Growth is less than that of CHUD at first. However, as the min_utility threshold becomes smaller, CHUD becomes faster (up to two times faster than UP Growth).
The reason why the performance gap between CHUD and UP Growth is smaller for Foodmart than for Mushroom is that Foodmart is a sparse dataset. As a consequence, the reduction achieved by mining closed+ high utility itemsets is smaller. Note that achieving a smaller reduction for sparse datasets is a well-known phenomenon in frequent closed itemset mining. A similar phenomenon occurs in closed+ HUI mining. Besides, when DAHU was combined with CHUD, the execution time of CHUD+DAHU was up to two times faster than UP Growth for low minimum utility thresholds and slightly slower than CHUD.
Fig.2 Mining closed High UD
C. Experiments on BMSWebView1 Dataset
The third experiment consists of running UP Growth, CHUD and CHUD+DAHU on BMSWebView1 while varying min_utility from 100% to 1% of the total utility of the database. For min_utility = 2%, UP Growth cannot terminate within the time limit of 100,000 seconds and it generates more than 1,000,000 candidates in Phase I, whereas CHUD terminates in 80 seconds and produces only 7 closed+ HUIs from 32 candidates. The reason why CHUD performs so well is that it achieves a massive reduction in the number of candidates by only generating a few long itemsets containing up to 149 items, whereas UP Growth has to consider a huge amount of redundant subsets (for a closed itemset of 149 items, there can be up to 2^149 − 2 non-empty proper subsets that are redundant). DAHU also suffers from the fact that there are too many HUIs. It runs out of memory for min_utility < 2% when trying to recover all HUIs because it has to generate too many subsets.
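The subset count quoted for a 149-item closed itemset follows from a simple formula: a set of n items has 2^n subsets, and excluding the empty set and the set itself leaves 2^n − 2 non-empty proper subsets. A minimal sketch (the function name is hypothetical) evaluates it exactly with Python integers:

```python
def non_empty_proper_subsets(n):
    """Number of non-empty proper subsets of a set with n items."""
    return 2 ** n - 2


count = non_empty_proper_subsets(149)
print(count)
print(len(str(count)), "digits")
```

The result is on the order of 10^44, which is why enumerating the subsets of a single long itemset is already infeasible.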
D. Experiments on Synthetic Dataset
The fourth experiment is to run the algorithms on T12I8D200K with min_utility varied from 0.1% to 0.02% of the total utility of the database. Results are presented in Figure 11 and Table VIII. For this dataset, CHUD is faster than UP Growth for the total execution time. Although the reduction on this synthetic dataset is not as good (since it produced the same result as UP Growth), CHUD is faster because it generates about three times fewer candidates in Phase I. CHUD takes more time to generate candidates in Phase I, but the total execution time of CHUD is less than that of UP Growth because Phase II is more costly than Phase I. CHUD+DAHU also outperforms UP Growth, since DAHU only spends one second to derive all HUIs.
Conclusion
In this paper, we address the problem of redundancy in high utility itemset mining by proposing a compact representation of all high utility itemsets named closed+ high utility itemsets. To our knowledge, this is the first study on compact and lossless representation of high utility itemsets. To mine this new type of itemsets, we proposed an efficient algorithm named CHUD. Three effective strategies named REG, RML and DCM were further proposed to enhance the performance of CHUD. To efficiently recover all high utility itemsets from this representation, we proposed a top-down method named DAHU. Real and synthetic datasets with varied characteristics were used to perform a thorough performance evaluation. Results show that the proposed representation achieves a massive reduction in the number of high utility itemsets (e.g. a reduction of up to 800 times for Mushroom and 32 times for Foodmart datasets). Besides, CHUD outperforms UP Growth, the current best algorithm, by several orders of magnitude under low minimum utility thresholds (e.g. CHUD terminates in 80 seconds on BMSWebView1 for min_utility = 2%, while UP Growth cannot terminate within 24 hours). The combination of CHUD and DAHU is also faster than UP Growth when DAHU can be applied.
References
1. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
2. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 12, pp. 1708–1721, Dec. 2009.
3. J.-F. Boulicaut, A. Bykowski, and C. Rigotti, “Free-sets: A condensed representation of Boolean data for the approximation of frequency queries,” Data Mining Knowl. Discovery, vol. 7, no. 1, pp. 5–22, 2003.
4. T. Calders and B. Goethals, “Mining all non-derivable frequent itemsets,” in Proc. Int. Conf. Eur. Conf. Principles Data Mining Knowl. Discovery, 2002, pp. 74–85.
5. K. Chuang, J. Huang, and M. Chen, “Mining top-k frequent patterns in the presence of the memory constraint,” VLDB J., vol. 17, pp. 1321–1344, 2008.
6. R. Chan, Q. Yang, and Y. Shen, “Mining high utility itemsets,” in Proc. IEEE Int. Conf. Data Min., 2003, pp. 19–26.
7. A. Erwin, R. P. Gopalan, and N. R. Achuthan, “Efficient mining of high utility itemsets from large datasets,” in Proc. Int. Conf. Pacific-Asia Conf. Knowl. Discovery Data Mining, 2008, pp. 554–561.
Corresponding Author:
R. Divya*,
Email: [email protected]