
DATA MINING

MAFIA: A MAXIMAL FREQUENT ITEM SET ALGORITHM

P. Radhika, R. Sambavi
[email protected], [email protected]
2/4 CSE
Gayathri Vidya Parishad College of Engg., Madhurawada, Visakhapatnam

ABSTRACT: In this Information Age, we are deluged by data: scientific data, medical data, demographic data, financial data and marketing data. People have no time to look at this data; human attention has become a precious resource. So we must find ways to automatically discover and characterize the trends in it. Our capabilities for both generating and collecting data have increased rapidly over the last several decades. This explosive growth in stored data has generated an urgent need for new techniques that can assist in transforming the vast amounts of data into useful information.

To analyze this data, we introduce the concepts and techniques of data mining, a

promising and flourishing frontier in database systems and new database applications.

We also deal with those algorithms that are specially designed for data mining.

Our paper focuses on a new algorithm for mining maximal frequent item sets

from a transactional database. The search strategy of the algorithm integrates a depth first

traversal of the item set lattice with effective pruning mechanisms that significantly


improve mining performance. Our implementation for support counting combines a vertical bitmap representation of the data with an efficient bitmap compression scheme. Our analysis shows that MAFIA performs best when mining long item sets and outperforms other algorithms on dense data by a factor of three to 30.

INTRODUCTION: The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge, which can be used for applications ranging from business management, production control and market analysis to engineering design and science exploration.

Data mining refers to extracting or mining knowledge from large amounts of data.

It is the task of discovering interesting patterns from large amounts of data where the data

can be stored in databases, data warehouses or other information repositories. It is an

interdisciplinary field merging ideas from statistics, machine learning, databases and

parallel computing. It is an essential step in the process of knowledge discovery in

databases. Data mining functionalities are used to specify the kinds of patterns, representing knowledge, to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. The functionalities include the discovery of concept descriptions, associations, classification, prediction, clustering, trend analysis, deviation analysis and similarity analysis.

Among the areas of data mining, the problem of deriving associations from data

has received a great deal of attention. Association analysis is widely used for market

basket or transaction data analysis. Here we are given a set of items and a large collection

of transactions which are subsets (baskets) of these items. The problem is to analyze

customers' buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such association rules helps in the development of marketing strategies by gaining insight into matters like "which items are most frequently purchased by customers". It also helps in inventory management, sales promotion strategies, etc. Hence the discovery of association rules is solely


dependent on the discovery of frequent sets. Many algorithms have been proposed for the

efficient mining of association rules.

ASSOCIATION ANALYSIS: It is the discovery of association rules showing attribute-value conditions that occur

frequently together in a given set of data. Association rule mining searches for interesting

relationships among items in a given data set.

ASSOCIATION RULE: Given a set of items I = {I1, I2, ..., In} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, an association rule is an implication of the form X ⟹ Y, where X, Y ⊆ I are sets of items called item sets and X ∩ Y = Ø.

The support (S) for an association rule X ⟹ Y is the percentage of transactions in the database that contain X ∪ Y. The confidence or strength (α) for an association rule X ⟹ Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X. The association rule problem is to identify all association rules X ⟹ Y with a minimum support and confidence. These values (S, α) are given as input to the problem. The efficiency of association rule algorithms is usually discussed with respect to the number of database scans required and the maximum number of item sets that must be counted.

The problem of mining association rules can be decomposed into two subproblems:

Find all sets of items (item sets) whose support is greater than the user-specified minimum support, S. Such item sets are called frequent item sets.

Use the frequent item sets to generate the desired rules. For example, if ABCD and AB are frequent item sets, then we can determine if the rule AB ⟹ CD holds by checking the following inequality:

S({A,B,C,D}) / S({A,B}) ≥ α

where S(X) is the support of X in T and α is the minimum confidence.
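As a concrete illustration, support and confidence follow directly from their definitions; the four-transaction market-basket database below is made up:

```python
# Hypothetical four-transaction market-basket database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, db):
    """S(X): fraction of transactions that contain every item of X."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    """Confidence of the rule X => Y, i.e. S(X u Y) / S(X)."""
    return support(x | y, db) / support(x, db)

s = support({"bread", "milk"}, transactions)       # bread and milk co-occur in 2 of 4 baskets
c = confidence({"bread"}, {"milk"}, transactions)  # milk appears in 2 of the 3 bread baskets
```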

FREQUENT SET: Let T be a transactional database and S be the user-specified minimum support. An item set X ⊆ A is said to be a frequent item set in T with respect to S if

S(X) ≥ S


Discovering all frequent item sets and their supports is a non-trivial problem if the cardinality of A, the set of items, and the database T are large. The potentially large item sets are called candidates, and the set of all potentially large item sets is called the set of candidate item sets.

For example, if |A| = m, the number of possible distinct item sets is 2^m. The problem is to identify which of these are frequent in the given set of transactions. One way to achieve this is to set up 2^m counters, one for each distinct item set, and count the support for every item set by scanning the database once. However, this approach is impractical for many applications where m can be more than 1000.

To reduce the combinatorial search space, all algorithms implement the following two

properties:

Downward Closure Property: Any subset of a frequent set is a frequent set.

Upward Closure Property: Any superset of an infrequent set is an infrequent set.

We denote the set of all frequent item sets by FI. If X is frequent and no superset of X is frequent, we say that X is a Maximal Frequent Item set. We denote the set of all maximal frequent item sets by MFI. A frequent item set X is said to be closed, and is called a Frequent Closed Item set, if there does not exist any proper superset Y ⊃ X with S(X) = S(Y).

Hence the following containment holds: MFI ⊆ FCI ⊆ FI.
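The containment MFI ⊆ FCI ⊆ FI can be checked by brute force on a toy database (the four transactions below are made up; the minimum support is a count here rather than a percentage):

```python
from itertools import chain, combinations

# Made-up toy database; minimum support count of 2.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
minsup = 2
items = sorted(set().union(*db))

def count(x):
    """Support count of item set x in db."""
    return sum(1 for t in db if x <= t)

# FI: all non-empty item sets with support count >= minsup.
FI = [frozenset(s) for s in chain.from_iterable(
          combinations(items, r) for r in range(1, len(items) + 1))
      if count(frozenset(s)) >= minsup]

# FCI: frequent item sets with no proper superset of equal support.
FCI = [x for x in FI
       if not any(x < y and count(y) == count(x) for y in FI)]

# MFI: frequent item sets with no frequent proper superset at all.
MFI = [x for x in FI if not any(x < y for y in FI)]

assert set(MFI) <= set(FCI) <= set(FI)  # MFI is contained in FCI is contained in FI
```

On this data FI has five members, FCI three, and MFI only two, which is why the MFI is the most compact of the three representations.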

A PRIORI ALGORITHM:

It is also called the level-wise algorithm. It is the most popular algorithm for finding all the frequent sets, and it makes use of the downward closure property. The algorithm is a bottom-up search, moving upward level-wise in the lattice.

The basic idea of the A Priori Algorithm is to generate candidate item sets of a particular size and then scan the database to count them to see if they are frequent. During scan i, candidates of size i, Ci, are counted. Only those candidates that are frequent are used to generate candidates for the next pass; that is, Li (the set of frequent item sets found during scan i) is used to generate Ci+1. An item set is considered a candidate only if all its subsets are frequent. To generate candidates of size i+1, joins are made of frequent item sets found in the previous pass. An algorithm called A Priori Gen is used to generate the candidate item sets for each pass after the first. All singleton item sets are used as


candidates in the first pass. In A Priori Gen, the set of frequent item sets of the previous pass, Li-1, is joined with itself to determine the candidates; after the first scan, every frequent item set is joined with every other frequent item set.
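The join-and-prune step of A Priori Gen can be sketched as follows (a simplification: item sets are represented as sorted tuples, and the sample L2 passed in below is hypothetical):

```python
def apriori_gen(Lk):
    """A Priori Gen sketch: join frequent k-item sets sharing their first
    k-1 items, then prune candidates that have an infrequent k-subset."""
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    freq = set(Lk)
    candidates = set()
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            a, b = Lk[i], Lk[j]
            if a[:-1] == b[:-1]:              # join step: common k-1 prefix
                c = a + (b[-1],)
                # prune step (downward closure): every k-subset must be frequent
                if all(tuple(sorted(set(c) - {x})) in freq for x in c):
                    candidates.add(c)
    return candidates

# Hypothetical L2: since {b, d} and {c, d} are not frequent, only
# {a, b, c} survives the prune step.
C3 = apriori_gen([{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "d"}])
```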

The A Priori Algorithm traverses the search space in pure breadth-first manner and finds support information by explicitly generating and counting each node. When the frequent patterns are long (more than 15 to 20 items), FI and even FCI become so large that traditional methods count too many item sets to be feasible. Straight A Priori-based algorithms count all of the 2^k subsets of each k-item set they discover, and thus do not scale well for long item sets. The breadth-first approach also limits the effectiveness of look-aheads, since useful longer frequent patterns have not yet been discovered.

Recently, the merits of a depth-first approach have been recognized. Here we present a new algorithm named MAFIA (A Maximal Frequent Item Set Algorithm). MAFIA uses a vertical bitmap representation for counting and effective pruning mechanisms for searching the item set lattice. By changing some of the pruning tools, MAFIA can also generate all frequent item sets and closed frequent item sets, though the algorithm is optimized for mining only maximal frequent item sets. The set of maximal frequent item sets is the smallest representation of the data that can still be used to generate the set FI. Once the set is generated, the support information can easily be recomputed from the transactional database.
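The vertical bitmap idea can be sketched in a few lines (a simplification: Python's arbitrary-precision integers stand in for the bitmaps, compression is omitted, and the four-transaction database is made up):

```python
# Hypothetical data; in MAFIA the bitmaps would additionally be compressed.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

# Vertical layout: one integer per item, with bit t set when
# transaction t contains the item.
bitmap = {}
for t, basket in enumerate(transactions):
    for item in basket:
        bitmap[item] = bitmap.get(item, 0) | (1 << t)

def support_count(itemset):
    """AND the item bitmaps together; the popcount is the support."""
    bits = (1 << len(transactions)) - 1   # start with all transactions
    for item in itemset:
        bits &= bitmap.get(item, 0)
    return bin(bits).count("1")
```

Counting a 1-extension then costs one AND and one popcount per candidate, instead of a scan over the whole database.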

MAFIA focuses on the difficulty of traversing the search space efficiently when there are long item sets, rather than on minimizing I/O costs. In a thorough experimental evaluation, we first quantify the effect of each individual pruning component on the performance of MAFIA. We then demonstrate the benefits of using compression on the bitmaps to speed up counting and yield large savings in computation. Finally, we study the performance of MAFIA versus other current algorithms for mining the MFI. Because of our strong pruning mechanisms, MAFIA performs best on dense data sets, where large portions of the search space are removed.

PRELIMINARIES: In this section, we describe the conceptual framework of the item subset lattice (fig 1). Assume there is a total ordering <L on the items I in the database, e.g. lexicographic ordering. If item i occurs before item j in this ordering, we denote this by i <L j.


Fig (2) shows the subset lattice reduced to a lexicographic subset tree. The item set identifying each node is referred to as the node's head, while the possible extensions of the node are called the tail.

[Fig 1: Subset lattice for four items — all subsets of {a, b, c, d}, from { } at the top down to {a, b, c, d} at the bottom.]

[Fig 2: Lexicographic subset tree for four items, with node P at {a} and a cut separating the frequent item sets above from the infrequent ones below.]

For example, consider node P in fig 2. P's head is {a} and the tail is the set {b, c, d}. Note that the tail contains all items lexicographically larger than any element of the head. Here P's head ∪ tail (HUT) is {a, b, c, d}. The problem of mining the frequent item sets from the lattice can be viewed as finding a cut through the lattice such that all elements above the cut are frequent item sets


and all item sets below are infrequent. For a node C, we call the items in the tail of C the 1-extensions of C. Given a transaction T, we define the projected transaction T(C) with respect to an item set C as follows: if C is not present in T, then T(C) = Ø; if C is present in T, then T(C) is defined to be all items present in the transaction T that are also frequent 1-extensions of C.
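The projected transaction T(C) amounts to a membership test plus an intersection (a hypothetical helper; `freq_exts` is assumed to hold C's precomputed frequent 1-extensions):

```python
def projected(T, C, freq_exts):
    """T(C): empty if transaction T does not contain item set C, else
    the items of T that are frequent 1-extensions of C."""
    if not (C <= T):          # C absent from T => empty projection
        return set()
    return T & freq_exts      # keep only frequent 1-extensions of C
```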

MAFIA ALGORITHM: In this section, various pruning techniques are used to reduce the search space. First we describe a simple depth-first traversal with no pruning. We use this algorithm to motivate the pruning and ordering improvements, and later the effective MFI superset checking.

SIMPLE DEPTH FIRST TRAVERSAL:

Here we traverse the lexicographic tree in pure depth-first order. At each node, each element in the node's tail is generated and counted as a 1-extension. If the support of {C's head} ∪ {1-extension} is less than the minimum support, we can stop by the A Priori principle, since any item set in the sub tree rooted at {C's head} ∪ {1-extension} would be infrequent. If none of the 1-extensions of C leads to a frequent item set, the node is a leaf.

When we reach a leaf C in the tree, we have a candidate for entry in the MFI. However, a frequent superset of C may already have been discovered. Therefore, we need to check whether a superset of the candidate item set C is already in the MFI. Only if no superset exists do we add the candidate item set C to the MFI.

ALGORITHM:

Simple (Current node C, MFI)
1. For each item i in C.tail
2.   Cn = C ∪ {i}
3.   If (Cn is frequent)
4.     Simple(Cn, MFI)
5. If (C is a leaf and C.head is not in MFI)
6.   Add C.head to MFI
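The pseudocode above can be realized directly; the following sketch uses a made-up four-transaction database and a plain subset scan for support counting (no bitmaps):

```python
# Made-up database; minimum support count of 2.
db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
minsup = 2

def is_frequent(itemset):
    return sum(1 for t in db if itemset <= t) >= minsup

def simple(head, tail, mfi):
    """Depth-first traversal of the lexicographic tree, as in the
    pseudocode: recurse on frequent 1-extensions; at a leaf, add the
    head to the MFI unless a superset is already there."""
    is_leaf = True
    for k, item in enumerate(tail):
        child = head | {item}
        if is_frequent(child):
            is_leaf = False
            simple(child, tail[k + 1:], mfi)
    if is_leaf and head and not any(head <= m for m in mfi):
        mfi.append(frozenset(head))

mfi = []
simple(frozenset(), sorted(set().union(*db)), mfi)
```

On this toy data the traversal finds {a, b, c} and {c, d} as the maximal frequent item sets; every other frequent set is a subset of one of them.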

SEARCH SPACE PRUNING: The set of all possible solutions for a problem is called its search space. The simple depth-first traversal is ultimately no better than a


comparable breadth-first traversal, since exactly the same search space is generated and counted. To realize performance gains, we must prune out parts of the search space.

EFFECTIVE PRUNING TECHNIQUES:

1. PARENT EQUIVALENCE PRUNING (PEP): One method of pruning involves comparing the transaction sets of each parent/child pair. Let X be node C's head and y an element of C's tail. If t(X) = t(X ∪ {y}), then any transaction containing X also contains y. This guarantees that any frequent item set Z containing X but not y has the frequent superset Z ∪ {y}. Since we only want the maximal frequent item sets, it is unnecessary to count item sets containing X but not y. Therefore we can move item y from the tail to the head: for node C, X becomes X ∪ {y} and element y is removed from C's tail. This can yield significant savings, since the sub tree rooted at C no longer has to count y as an extension at any node in the sub tree.

ALGORITHM: PEP (Current node C, MFI)
1. For each item i in C.tail
2.   Cn = C ∪ {i}
3.   If (Cn.support == C.support)
4.     Move i from C.tail to C.head
5.   Else if (Cn is frequent)
6.     PEP(Cn, MFI)
7. If (C is a leaf and C.head is not in MFI)
8.   Add C.head to MFI
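The PEP test on lines 3-4 amounts to a support comparison; the sketch below (with made-up data, at a single node with head {a}) moves every tail item that occurs in exactly the transactions containing the head:

```python
# Made-up database: every transaction containing {a} also contains b.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]

def support_count(itemset):
    return sum(1 for t in db if itemset <= t)

head, tail = {"a"}, ["b", "c", "d"]
new_tail = []
for y in tail:
    if support_count(head | {y}) == support_count(head):
        head |= {y}           # PEP: y occurs wherever the head occurs
    else:
        new_tail.append(y)    # y stays in the tail
```

Here b is absorbed into the head while c and d remain as extensions, so the sub tree below this node is half the size it would otherwise be.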

2. FHUT:

Another type of pruning is superset pruning. We observe that, at node C, the largest possible frequent item set contained in the sub tree rooted at C is C's HUT (head ∪ tail). If C's HUT is discovered to be frequent, we never have to explore any proper subsets of the HUT, and thus we can prune out the entire sub tree rooted at node C. We refer to this method of pruning as FHUT (Frequent Head Union Tail) pruning; the HUT's frequency can be determined by exploring the leftmost path of the sub tree at each node.

ALGORITHM: FHUT (Current node C, MFI, Boolean isHUT)
1. For each item i in C.tail
2.   Cn = C ∪ {i}
3.   isHUT = whether i is the leftmost child in the tail
4.   If (Cn is frequent)
5.     FHUT(Cn, MFI, isHUT)
6. If (C is a leaf and C.head is not in MFI)
7.   Add C.head to MFI
8. If (isHUT and all extensions are frequent)
9.   Stop exploring the sub tree and go back up the tree to the point where isHUT was changed to true

3. HUTMFI:

There are two methods for determining whether an item set is frequent:

(1) Direct counting of the support of X.

(2) Checking if a superset of X has already been declared frequent.

FHUT uses the first method; the latter approach determines whether a superset of the HUT is already in the MFI. If such a superset exists, then the HUT must be frequent, and the sub tree rooted at the node can be pruned away. We call this type of superset pruning HUTMFI. Unlike FHUT, where the leftmost branch of the sub tree is explored, HUTMFI does not expand any children to check for successful superset pruning. Hence HUTMFI is preferable to FHUT, since it counts fewer item sets.

ALGORITHM: HUTMFI (Current node C, MFI, isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.   Stop searching and return
4. For each item i in C.tail
5.   Cn = C ∪ {i}
6.   isHUT = whether i is the leftmost child in the tail
7.   If (Cn is frequent)
8.     HUTMFI(Cn, MFI, isHUT)
9. If (C is a leaf and C.head is not in MFI)
10.   Add C.head to MFI


4. DYNAMIC REORDERING:

Dynamic reordering involves rearranging the children of each node in order of increasing support instead of lexicographic order. As the size of the tree grows, dynamic reordering helps to trim out many branches of the search tree. The benefit of dynamically reordering the children of each node based on support is significant: an algorithm that trims the tail to only frequent extensions at a higher level saves a lot of computation. The order of the tail elements is therefore an important consideration, and ordering them by increasing support keeps the search space as small as possible.

Dynamic reordering greatly increases the effectiveness of pruning mechanisms.

Since PEP depends on the support of each child relative to the parent, we can move all elements for which PEP holds from the tail to the head at once, quickly reducing the size of the tail. For both FHUT and HUTMFI, ordering by increasing support yields significant savings: the infrequent extensions keep the left side of the sub tree small, while on the right side of the sub tree, where extensions are more frequent, FHUT and HUTMFI are more effective in trimming the search space.

ALGORITHM: MAFIA (Current node C, MFI, Boolean isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.   Stop searching and return
4. Count all children, use PEP to trim the tail, and reorder by increasing support
5. For each item i in C's trimmed tail
6.   isHUT = whether i is the first item in the tail
7.   Cn = C ∪ {i}
8.   MAFIA(Cn, MFI, isHUT)
9. If (isHUT and all extensions are frequent)
10.   Stop exploring the sub tree and go back up the sub tree
11. If (C is a leaf and C.head is not in MFI)
12.   Add C.head to MFI
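Putting the pieces together, the MAFIA pseudocode above can be condensed into a runnable sketch (assumptions: a made-up four-transaction database, plain subset scans instead of bitmaps, and the FHUT look-ahead omitted for brevity; PEP, HUTMFI and dynamic reordering are included):

```python
# Made-up database; minimum support count of 2.
db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
minsup = 2

def sup(x):
    return sum(1 for t in db if x <= t)

def mafia(head, tail, mfi):
    # HUTMFI: if head-union-tail has a superset in the MFI, prune the sub tree.
    hut = head | set(tail)
    if any(hut <= m for m in mfi):
        return
    # Count all children; PEP-move equal-support items into the head.
    h = sup(head)
    counted = [(i, sup(head | {i})) for i in tail]
    head = head | {i for i, s in counted if s == h}
    # Dynamic reordering: keep only frequent extensions, sorted by support.
    ordered = sorted((s, i) for i, s in counted if s >= minsup and i not in head)
    tail_items = [i for s, i in ordered]
    for k, item in enumerate(tail_items):
        mafia(head | {item}, tail_items[k + 1:], mfi)
    # Leaf: no frequent extensions left and no superset already in the MFI.
    if not tail_items and head and not any(head <= m for m in mfi):
        mfi.append(frozenset(head))

mfi = []
mafia(frozenset(), sorted(set().union(*db)), mfi)
```

On this data the pruned traversal returns the same MFI as the simple traversal, {a, b, c} and {c, d}, while visiting far fewer nodes.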


EFFECTIVE MFI SUPERSET CHECKING: In order to enumerate the exact set of maximal frequent item sets, before adding any item set to the MFI we must check the entire MFI to ensure that no superset of the item set has already been found. This check is done often, and a significant performance improvement can be realized if it is done efficiently. To this end, we adopt progressive focusing. The basic idea is that, while the entire MFI may be large, at any given node only a fraction of the MFI are possible supersets of the item set at the node.

We therefore maintain, for each node, an LMFI (Local MFI), which is the subset of the MFI that is relevant while performing superset checks at the node. Initially the LMFI for the root is the null set. Assume that we are examining node C and are about to recurse on Cn, where Cn = C ∪ {y}. The LMFI for Cn consists of all item sets in the LMFI for C with the added condition that they also contain the item y with which we extended C to form Cn. When the recursive call to Cn is finished, we add to the LMFI of C all item sets that were added to the LMFI of Cn during the call. In addition, each time we add an item set to the MFI, we also add it to the LMFI of the node we are examining. Now candidate item sets no longer have to do superset checks against the whole MFI: the LMFI contains all supersets of the current node's item set. Therefore, if the LMFI of a candidate node is empty, no supersets will be found in the entire MFI and, conversely, if the LMFI is not empty, then a superset is guaranteed to exist.

ALGORITHM: MAFIALMFI (Current node C, MFI, Boolean isHUT)
1. HUT = C.head ∪ C.tail
2. If (HUT is in MFI)
3.   Stop searching and return
4. Count all children, use PEP to trim the tail, and reorder by increasing support
5. For each item i in C's trimmed tail
6.   isHUT = whether i is the first item in the tail
7.   Cn = C ∪ {i}
8.   Sort the MFI by the new item i and update the left and right pointers for Cn
9.   MAFIALMFI(Cn, MFI, isHUT)
10.   Adjust the right LMFI pointers of C for any new item sets added to the MFI
11. If (isHUT and all extensions are frequent)
12.   Stop exploring the sub tree and go back up the sub tree
13. If (C is a leaf and C's LMFI is empty)
14.   Add C.head to MFI

ALGORITHMIC ANALYSIS:

First we present a full analysis of each pruning component of the MAFIA algorithm.

There are three types of pruning used to trim the tree: FHUT, HUTMFI and PEP.

FHUT and HUTMFI are both forms of superset pruning and thus tend to overlap in their effectiveness at reducing the search space. In addition, dynamic reordering can significantly reduce the size of the search space by removing infrequent items from each node's tail. Experiments on dense data sets support the idea that MAFIA runs fastest on long item sets, where it shows the best performance.

CONCLUSIONS:

In this paper, we present a detailed performance analysis of MAFIA. Powerful pruning techniques such as PEP and superset checking are very beneficial in reducing the search space. Thus MAFIA is highly optimized for mining long item sets, and on dense data it consistently outperforms other algorithms by a factor of 3 to 30.
