Mining Approximate Frequent Itemsets in the Presence of Noise
By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel and J. Prins
Presentation by: Apurv Awasthi
Title Statement
This paper introduces an approach for noise-tolerant frequent itemset mining over the binary matrix representation of a database
Index
• Introduction to Frequent Itemset Mining
 o Frequent Itemset Mining
 o Binary Matrix Representation Model
 o Problems
• Motivation
• Proposed Model
• Proposed Algorithm
• AFI Mining vs. Exact Frequent Itemset Mining
• Related Works
• Experimental Results
• Discussion
• Conclusion
Introduction to Frequent Itemset Mining
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Originally developed to discover association rules

Applications
• Bio-molecular applications:
 o DNA sequence analysis, protein structure analysis
• Business applications:
 o Market basket analysis, sale campaign analysis
The Binary Matrix Representation Model
• Model for representing relational databases
• Rows correspond to objects
• Columns correspond to attributes of the objects
 o ‘1’ indicates presence
 o ‘0’ indicates absence
• Frequent itemset mining is a key technique for analyzing such data
Apply Apriori algorithm
Transaction  I1  I2  I3  I4  I5
T1           1   0   1   1   0
T2           0   1   1   0   1
T3           1   1   1   0   1
T4           0   1   0   0   1
T5           1   0   0   0   0
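To make the mining step concrete, here is a minimal Apriori-style sketch in Python over the binary matrix above (an illustration only; the threshold min_sup = 2 and all names are assumptions, not from the slides):

```python
# Binary matrix from the slide: rows = transactions T1..T5, columns = items I1..I5
matrix = [
    [1, 0, 1, 1, 0],  # T1
    [0, 1, 1, 0, 1],  # T2
    [1, 1, 1, 0, 1],  # T3
    [0, 1, 0, 0, 1],  # T4
    [1, 0, 0, 0, 0],  # T5
]
items = ["I1", "I2", "I3", "I4", "I5"]
min_sup = 2  # absolute support threshold (illustrative choice)

def support(cols):
    """Count transactions that contain every item in `cols` (all 1s)."""
    return sum(all(row[c] == 1 for c in cols) for row in matrix)

# Level-wise search: extend frequent k-itemsets to (k+1)-candidates,
# keeping only candidates that still meet the support threshold
frequent, k = {}, 1
current = [(c,) for c in range(len(items)) if support((c,)) >= min_sup]
while current:
    frequent[k] = current
    k += 1
    candidates = {tuple(sorted(set(a) | set(b)))
                  for a in current for b in current
                  if len(set(a) | set(b)) == k}
    current = [c for c in candidates if support(c) >= min_sup]

for k, sets in frequent.items():
    print(k, [[items[c] for c in s] for s in sets])
```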
Problem with Frequent Itemset Mining
• The traditional model for mining frequent itemsets requires that every item must occur in each supporting transaction
• Real data is typically subject to noise
• Reasons for noise:
 o Human error
 o Measurement error
 o Vagaries of human behavior
 o Stochastic nature of the studied biological behavior
• Requiring every item to occur in each supporting transaction is therefore NOT a practical assumption!
Effect of Noise: Fragmentation of Patterns by Noise
• Discover multiple small fragments of the true itemset
• Miss the true itemset itself!

Example
• An exact frequent itemset mining algorithm will miss the main itemset ‘A’
• Observe three fragmented itemsets: itemsets 1, 2 and 3
• The fragmented itemsets may not satisfy the minimum support criterion and will therefore be discarded
Mathematical Proof of Fragmentation
With probability 1, when n is sufficiently large,
M(Y) ≤ 2·log_a(n) − 2·log_a(log_a(n))
i.e. in the presence of noise, only a fraction of the initial block of 1s can be recovered

where
• Matrix X contains the actual values recorded in the absence of any noise
• Matrix Z is a binary noise matrix whose entries are independent Bernoulli random variables, Z ~ Bern(p), for 0 ≤ p ≤ 0.5
• Y = X xor Z and a = (1 − p)⁻¹
• M(Y) is the largest k such that Y contains k transactions having k common items
From: Significance and Recovery of block structures in binary matrices with noise - by X. Sun & A.B. Nobel
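A quick sanity check on the scale of this bound (numbers chosen purely for illustration): with noise level p = 0.2 we get a = (1 − p)⁻¹ = 1.25, and for n = 1000 transactions, log_a(n) = ln(1000)/ln(1.25) ≈ 31.0 and log_a(log_a(n)) ≈ 15.4, so M(Y) ≤ 2(31.0) − 2(15.4) ≈ 31. Even if the noise-free matrix X contained a block of 1s spanning all 1000 transactions, only about 31 of them can be recovered as an exact frequent itemset.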
Motivation
The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items, or build classifiers when such errors are present
Possible Solutions
• Let the matrix contain a small fraction of 0s
 o DRAWBACK: “free riders” like column h (for matrix C) and row 6 (for matrix B)
• SOLUTION: limit the number of 0s in each row and column
Proposed Model
1. Use Approximate Frequent Itemsets (AFI)

AFI characteristics
• The sub-matrix contains a large fraction of 1s
• A supporting transaction should contain most of the items, i.e. the number of 0s in every row must fall below a user-defined threshold (єr)
• A supporting item should occur in most of the transactions, i.e. the number of 0s in every column must fall below a user-defined threshold (єc)
• Number of rows > minimum support
AFI
Mathematical definition
• For a given binary matrix D with item set I0 and transaction set T0, an itemset I ⊆ I0 is an approximate frequent itemset AFI(єr, єc) if there exists a set of transactions T ⊆ T0 with |T| ≥ |T0| · minsup such that
 o for every transaction t ∈ T: (1/|I|) · Σ_{i∈I} D(t, i) ≥ 1 − єr
 o for every item i ∈ I: (1/|T|) · Σ_{t∈T} D(t, i) ≥ 1 − єc
• Similarly, a weak AFI(є) requires only that the fraction of 1s in the whole sub-matrix defined by T and I is at least 1 − є
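As a concrete reading of this definition, the following minimal Python sketch (an illustration under the definition above, not the authors' code) checks whether a candidate sub-matrix qualifies as an AFI(єr, єc):

```python
def is_afi(D, rows, cols, eps_r, eps_c, minsup):
    """Check whether the sub-matrix of binary matrix D restricted to
    `rows` (transactions) and `cols` (items) is a valid AFI(eps_r, eps_c).
    `minsup` is the minimum support as a fraction of all transactions."""
    if len(rows) < minsup * len(D):
        return False  # support condition: |T| >= |T0| * minsup
    # Row condition: every supporting transaction contains most of the items
    if any(sum(D[t][i] for i in cols) < (1 - eps_r) * len(cols) for t in rows):
        return False
    # Column condition: every item occurs in most of the supporting transactions
    if any(sum(D[t][i] for t in rows) < (1 - eps_c) * len(rows) for i in cols):
        return False
    return True
```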
AFI example
A, B and C are weak AFI(0.25)
A: valid AFI(0.25, 0.25); B: weak AFI(*, 0.25); C: weak AFI(0.25, *)
Drawback of AFI
Apriori Property: all sub-itemsets of a frequent itemset must be frequent
But a sub-itemset of an AFI need not be an AFI, e.g. A is a valid AFI for minSupport = 4, but {b,c,e}, {b,c,d}, etc. are not valid AFIs
• PROBLEM: minimum support can no longer be used as a pruning technique
• SOLUTION: a generalization of the Apriori property for noisy conditions (called Noise-Tolerant Support Pruning)
The AFI criteria violate the Apriori property!
Proposed Model
1. Use Approximate Frequent Itemsets (AFI)
2. Noise-Tolerant Support Pruning: to prune and generate candidate itemsets
3. 0/1 Extension: to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets
Noise Tolerant Support Pruning
For given єr, єc and minsup, the noise-tolerant pruning support for a length-k itemset is:
Proof
0/1 Extensions
• Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a sequential manner
• The number of 0s allowed in each transaction of an itemset grows with the length of the itemset in a discrete manner: a length-k itemset admits ⌊k · єr⌋ zeros per row

1 Extension
If ⌊(k+1) · єr⌋ = ⌊k · єr⌋, then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets

0 Extension
If ⌊(k+1) · єr⌋ = ⌊k · єr⌋ + 1, then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets
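A minimal sketch of this extension step, assuming the floor conditions above (names and signatures are illustrative, not the authors' code):

```python
import math

def extend_transaction_set(subset_tids, k, eps_r):
    """Combine the transaction-ID sets of the length-k subsets of a
    candidate (k+1)-itemset. If growing the itemset does not admit an
    extra 0 per row, only transactions supporting every subset qualify
    (1 extension); if one more 0 is allowed, any of them may qualify
    (0 extension)."""
    if math.floor((k + 1) * eps_r) == math.floor(k * eps_r):
        return set.intersection(*subset_tids)  # 1 extension
    return set.union(*subset_tids)             # 0 extension
```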
Proof
Proposed Algorithm
AFI vs. Exact Frequent Itemset: AFI Mining
єr = єc = 1/3; n = 8; minsup = 1
AFI vs. Exact Frequent Itemset: Exact Frequent Itemset Mining
Transaction  Items
T1 a,b,c
T2 a,b
T3 a,c
T4 b,c
T5 a,b,c,d
T6 d
T7 b,d
T8 a
MinSup = 0.5, i.e. 4 transactions; n = 8

1-candidates
Itemset  Support
a        5
b        5
c        4
d        3

Freq 1-itemsets
Itemset  Support
a        5
b        5
c        4

2-candidates
Itemset  Support
ab       3
ac       3
bc       3

Freq 2-itemsets
Itemset
Null
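A tiny self-contained check (illustrative code, not from the slides) reproduces these tables: with minsup = 4 out of 8 transactions, only the 1-itemsets a, b and c are frequent and no 2-itemset survives:

```python
from itertools import combinations

# Noisy example database from the slide (8 transactions)
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c", "d"}, {"d"}, {"b", "d"}, {"a"},
]
min_sup = 4  # MinSup = 0.5 of n = 8 transactions

items = sorted(set().union(*transactions))
for k in (1, 2):
    for itemset in combinations(items, k):
        sup = sum(set(itemset) <= t for t in transactions)
        if sup >= min_sup:
            print(itemset, sup)
# Prints ('a',) 5, ('b',) 5, ('c',) 4 and nothing for k = 2,
# so exact mining misses the true pattern {a, b, c} that AFI recovers.
```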
AFI vs. Exact Frequent Itemset - Result
• Approximate Frequent Itemset mining generates the frequent itemset {a,b,c}
• Exact Frequent Itemset mining cannot generate any frequent itemset in the presence of noise for the given minimum support value
Related Works
Yang et al. (2001) proposed two error-tolerant models, termed weak error-tolerant itemsets (weak ETI, equivalent to weak AFI) and strong ETI (equivalent to AFI(єr, *))
DRAWBACKS
• No efficient pruning technique; relies on heuristics and sampling techniques
• Does not preclude columns of all 0s
Steinbach et al. (2004) proposed a “support envelope”, a tool for exploration and visualization of the high-level structures of association patterns. A symmetric ETI model is proposed in which the same fraction of errors is allowed in both rows and columns.
DRAWBACKS
• Implements the same error coefficient for rows and columns, i.e. єr = єc
• Admits only a fixed number of 0s in the itemset; the fraction of noise does not vary with the size of the itemset sub-matrix
Related Works
Seppänen and Mannila (2004) proposed to mine dense itemsets in the presence of noise, where dense itemsets are itemsets with a sufficiently large sub-matrix exceeding a given density threshold of present attributes.
DRAWBACKS
• Enforces the constraint that all sub-itemsets of a dense itemset must be frequent; will fail to identify larger itemsets with sufficient support when not all of their sub-itemsets have enough support
• Requires repeated scans of the database
Experimental Results - Scalability
Scalability
• Database of 10,000 transactions and 100 items
• Run time increases as noise tolerance increases
• Reducing the item-wise error constraint (єc) leads to a greater reduction in run time than reducing the transaction-wise error constraint (єr)
Experimental Results – Synthetic Data
Quality testing for a single cluster
• Create data with an embedded pattern
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2

Quality testing for multiple clusters
• Create data with multiple embedded patterns
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2
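The noise-injection step used in both tests can be sketched in a few lines (illustrative code; the authors' exact generator is not shown in the slides):

```python
import random

def add_noise(matrix, p):
    """Flip each binary entry independently with probability p."""
    return [[cell ^ (random.random() < p) for cell in row] for row in matrix]

# e.g. noisy = add_noise(clean_matrix, p=0.05) for any 0.01 <= p <= 0.2
```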
Experimental Results – Synthetic Data
Experimental Results – Real World Data
Zoo Data Set
• The database contains 101 instances and 18 attributes
• All instances are classified into 7 classes, e.g. mammals, fish, etc.

Method (comparing Exact, ETI(єr) and AFI(єr, єc)): generate subsets of animals in each class, then find the subsets of their common features
• Exact and ETI mining identified “fins” and “domestic” as common features, which is NOT necessarily true
• Only AFI was able to recover 3 classes with 100% accuracy
Discussion
Advantages
• Flexibility of placing constraints independently along rows and columns
• Generalized Apriori technique for pruning
• Avoids repeated scans of the database by using 0/1 extensions
Summary
The paper outlines an algorithm for mining approximate frequent itemsets from noisy data
It introduces
• an AFI model
• a generalized Apriori property for pruning
The proposed algorithm generates more useful itemsets compared to existing algorithms and is also computationally more efficient
Thank You!
Extra Slides for Questionnaire
Applying Apriori Algorithm
Database D (min_sup = 2)
TID  Items
T1   a, c, d
T2   b, c, e
T3   a, b, c, e
T4   b, e

Scan D → 1-candidates
Itemset  Sup
a        2
b        3
c        3
d        1
e        3

Freq 1-itemsets
Itemset  Sup
a        2
b        3
c        3
e        3

2-candidates: ab, ac, ae, bc, be, ce

Scan D → counting
Itemset  Sup
ab       1
ac       2
ae       1
bc       2
be       3
ce       2

Freq 2-itemsets
Itemset  Sup
ac       2
bc       2
be       3
ce       2

3-candidates: bce

Scan D → Freq 3-itemsets
Itemset  Sup
bce      2
Transaction  a  b  c  d  e
T1           1  0  1  1  0
T2           0  1  1  0  1
T3           1  1  1  0  1
T4           0  1  0  0  1
T5           0  0  0  0  0
Noise Tolerant Support Pruning - Proof
0/1 Extensions - Proof
The number of zeroes allowed in an itemset grows with the length of the itemset