Mining Approximate Frequent Itemsets in the Presence of Noise
By: J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel and J. Prins
Presentation by: Apurv Awasthi
Title Statement
This paper introduces an approach for noise-tolerant frequent itemset mining over the binary matrix representation of a database
Index
• Introduction to Frequent Itemset Mining
 o Frequent Itemset Mining
 o Binary Matrix Representation Model
 o Problems
• Motivation
• Proposed Model
• Proposed Algorithm
• AFI Mining vs. Exact Frequent Itemset Mining
• Related Works
• Experimental Results
• Discussion
• Conclusion
Introduction to Frequent Itemset Mining
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Originally developed to discover association rules

Applications
• Bio-molecular applications:
 o DNA sequence analysis, protein structure analysis
• Business applications:
 o Market basket analysis, sale campaign analysis
The Binary Matrix Representation Model
• Model for representing relational databases
• Rows correspond to objects
• Columns correspond to attributes of the objects
 o ‘1’ indicates presence
 o ‘0’ indicates absence
• Frequent itemset mining is a key technique for analyzing such data
Apply Apriori algorithm
Transaction  I1  I2  I3  I4  I5
T1           1   0   1   1   0
T2           0   1   1   0   1
T3           1   1   1   0   1
T4           0   1   0   0   1
T5           1   0   0   0   0
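To make the mining step concrete, here is a minimal Apriori-style sketch in Python over the binary matrix above (an illustration only; the threshold min_sup = 2 and all names are assumptions, not from the slides):

```python
# Binary matrix from the slide: rows = transactions T1..T5, columns = items I1..I5
matrix = [
    [1, 0, 1, 1, 0],  # T1
    [0, 1, 1, 0, 1],  # T2
    [1, 1, 1, 0, 1],  # T3
    [0, 1, 0, 0, 1],  # T4
    [1, 0, 0, 0, 0],  # T5
]
items = ["I1", "I2", "I3", "I4", "I5"]
min_sup = 2  # absolute support threshold (illustrative choice)

def support(cols):
    """Count transactions that contain every item in `cols` (all 1s)."""
    return sum(all(row[c] == 1 for c in cols) for row in matrix)

# Level-wise search: extend frequent k-itemsets to (k+1)-candidates,
# keeping only candidates that still meet the support threshold
frequent, k = {}, 1
current = [(c,) for c in range(len(items)) if support((c,)) >= min_sup]
while current:
    frequent[k] = current
    k += 1
    candidates = {tuple(sorted(set(a) | set(b)))
                  for a in current for b in current
                  if len(set(a) | set(b)) == k}
    current = [c for c in candidates if support(c) >= min_sup]

for k, sets in frequent.items():
    print(k, [[items[c] for c in s] for s in sets])
```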
Problem with Frequent Itemset Mining
• The traditional model for mining frequent itemsets requires that every item must occur in each supporting transaction
• Real data is typically subject to noise
• Reasons for noise:
 o Human error
 o Measurement error
 o Vagaries of human behavior
 o Stochastic nature of the studied biological behavior
• Requiring every item to occur in each supporting transaction is therefore NOT a practical assumption!
Effect of Noise: Fragmentation of Patterns by Noise
• Discover multiple small fragments of the true itemset
• Miss the true itemset itself!

Example
• An exact frequent itemset mining algorithm will miss the main itemset ‘A’
• Observe three fragmented itemsets: itemsets 1, 2 and 3
• The fragmented itemsets may not satisfy the minimum support criterion and will therefore be discarded
Mathematical Proof of Fragmentation
With probability 1, when n is sufficiently large,
M(Y) ≤ 2·log_a(n) − 2·log_a(log_a(n))
i.e. in the presence of noise, only a fraction of the initial block of 1s can be recovered

where
• Matrix X contains the actual values recorded in the absence of any noise
• Matrix Z is a binary noise matrix whose entries are independent Bernoulli random variables, Z ~ Bern(p), for 0 ≤ p ≤ 0.5
• Y = X xor Z and a = (1 − p)⁻¹
• M(Y) is the largest k such that Y contains k transactions having k common items
From: Significance and Recovery of block structures in binary matrices with noise - by X. Sun & A.B. Nobel
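A quick sanity check on the scale of this bound (numbers chosen purely for illustration): with noise level p = 0.2 we get a = (1 − p)⁻¹ = 1.25, and for n = 1000 transactions, log_a(n) = ln(1000)/ln(1.25) ≈ 31.0 and log_a(log_a(n)) ≈ 15.4, so M(Y) ≤ 2(31.0) − 2(15.4) ≈ 31. Even if the noise-free matrix X contained a block of 1s spanning all 1000 transactions, only about 31 of them can be recovered as an exact frequent itemset.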
Motivation
The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items, or build classifiers when such errors are present
Possible Solutions
• Let the matrix contain a small fraction of 0s
 o DRAWBACK: “free riders” like column h (for matrix C) and row 6 (for matrix B)
• SOLUTION: limit the number of 0s in each row and column
Proposed Model
1. Use Approximate Frequent Itemsets (AFI)

AFI characteristics
• The sub-matrix contains a large fraction of 1s
• A supporting transaction should contain most of the items, i.e. the number of 0s in every row must fall below a user-defined threshold (єr)
• A supporting item should occur in most of the transactions, i.e. the number of 0s in every column must fall below a user-defined threshold (єc)
• Number of rows > minimum support
AFI
Mathematical definition
• For a given binary matrix D with item set I0 and transaction set T0, an itemset I ⊆ I0 is an approximate frequent itemset AFI(єr, єc) if there exists a set of transactions T ⊆ T0 with |T| ≥ |T0| · minsup such that
 o for every transaction t ∈ T: (1/|I|) · Σ_{i∈I} D(t, i) ≥ 1 − єr
 o for every item i ∈ I: (1/|T|) · Σ_{t∈T} D(t, i) ≥ 1 − єc
• Similarly, a weak AFI(є) requires only that the fraction of 1s in the whole sub-matrix defined by T and I is at least 1 − є
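As a concrete reading of this definition, the following minimal Python sketch (an illustration under the definition above, not the authors' code) checks whether a candidate sub-matrix qualifies as an AFI(єr, єc):

```python
def is_afi(D, rows, cols, eps_r, eps_c, minsup):
    """Check whether the sub-matrix of binary matrix D restricted to
    `rows` (transactions) and `cols` (items) is a valid AFI(eps_r, eps_c).
    `minsup` is the minimum support as a fraction of all transactions."""
    if len(rows) < minsup * len(D):
        return False  # support condition: |T| >= |T0| * minsup
    # Row condition: every supporting transaction contains most of the items
    if any(sum(D[t][i] for i in cols) < (1 - eps_r) * len(cols) for t in rows):
        return False
    # Column condition: every item occurs in most of the supporting transactions
    if any(sum(D[t][i] for t in rows) < (1 - eps_c) * len(rows) for i in cols):
        return False
    return True
```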
AFI example
A, B and C are weak AFI(0.25)
A: valid AFI(0.25, 0.25); B: weak AFI(*, 0.25); C: weak AFI(0.25, *)
Drawback of AFI
Apriori Property: all sub-itemsets of a frequent itemset must be frequent
But a sub-itemset of an AFI need not be an AFI, e.g. A is a valid AFI for minSupport = 4, but {b,c,e}, {b,c,d}, etc. are not valid AFIs
• PROBLEM: minimum support can no longer be used as a pruning technique
• SOLUTION: a generalization of the Apriori property for noisy conditions (called Noise-Tolerant Support Pruning)
The AFI criteria violate the Apriori property!
Proposed Model
1. Use Approximate Frequent Itemsets (AFI)
2. Noise-Tolerant Support Pruning: to prune and generate candidate itemsets
3. 0/1 Extension: to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets
Noise Tolerant Support Pruning
For given єr, єc and minsup, the noise-tolerant pruning support for a length-k itemset is:
Proof
0/1 Extensions
• Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a sequential manner
• The number of 0s allowed in each transaction of an itemset grows with the length of the itemset in a discrete manner: a length-k itemset admits ⌊k · єr⌋ zeros per row

1 Extension
If ⌊(k+1) · єr⌋ = ⌊k · єr⌋, then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets

0 Extension
If ⌊(k+1) · єr⌋ = ⌊k · єr⌋ + 1, then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets
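A minimal sketch of this extension step, assuming the floor conditions above (names and signatures are illustrative, not the authors' code):

```python
import math

def extend_transaction_set(subset_tids, k, eps_r):
    """Combine the transaction-ID sets of the length-k subsets of a
    candidate (k+1)-itemset. If growing the itemset does not admit an
    extra 0 per row, only transactions supporting every subset qualify
    (1 extension); if one more 0 is allowed, any of them may qualify
    (0 extension)."""
    if math.floor((k + 1) * eps_r) == math.floor(k * eps_r):
        return set.intersection(*subset_tids)  # 1 extension
    return set.union(*subset_tids)             # 0 extension
```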
Proof
Proposed Algorithm
AFI vs. Exact Frequent Itemset: AFI Mining
єr = єc = 1/3; n = 8; minsup = 1
AFI vs. Exact Frequent Itemset: Exact Frequent Itemset Mining
Transaction  Items
T1 a,b,c
T2 a,b
T3 a,c
T4 b,c
T5 a,b,c,d
T6 d
T7 b,d
T8 a
MinSup = 0.5, i.e. 4 transactions; n = 8

1-candidates
Itemset  Support
a        5
b        5
c        4
d        3

Freq 1-itemsets
Itemset  Support
a        5
b        5
c        4

2-candidates
Itemset  Support
ab       3
ac       3
bc       3

Freq 2-itemsets
Itemset
Null
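A tiny self-contained check (illustrative code, not from the slides) reproduces these tables: with minsup = 4 out of 8 transactions, only the 1-itemsets a, b and c are frequent and no 2-itemset survives:

```python
from itertools import combinations

# Noisy example database from the slide (8 transactions)
transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"},
    {"a", "b", "c", "d"}, {"d"}, {"b", "d"}, {"a"},
]
min_sup = 4  # MinSup = 0.5 of n = 8 transactions

items = sorted(set().union(*transactions))
for k in (1, 2):
    for itemset in combinations(items, k):
        sup = sum(set(itemset) <= t for t in transactions)
        if sup >= min_sup:
            print(itemset, sup)
# Prints ('a',) 5, ('b',) 5, ('c',) 4 and nothing for k = 2,
# so exact mining misses the true pattern {a, b, c} that AFI recovers.
```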
AFI vs. Exact Frequent Itemset - Result
• Approximate Frequent Itemset mining generates the frequent itemset {a,b,c}
• Exact Frequent Itemset mining cannot generate any frequent itemset in the presence of noise for the given minimum support value
Related Works
Yang et al. (2001) proposed two error-tolerant models, termed weak error-tolerant itemsets (weak ETI, equivalent to weak AFI) and strong ETI (equivalent to AFI(єr, *))
DRAWBACKS
• No efficient pruning technique; relies on heuristics and sampling techniques
• Does not preclude columns of all 0s
Steinbach et al. (2004) proposed a “support envelope”, a tool for exploration and visualization of the high-level structures of association patterns. A symmetric ETI model is proposed in which the same fraction of errors is allowed in both rows and columns.
DRAWBACKS
• Implements the same error coefficient for rows and columns, i.e. єr = єc
• Admits only a fixed number of 0s in the itemset; the fraction of noise does not vary with the size of the itemset sub-matrix
Related Works
Seppänen and Mannila (2004) proposed to mine dense itemsets in the presence of noise, where dense itemsets are itemsets with a sufficiently large sub-matrix exceeding a given density threshold of present attributes.
DRAWBACKS
• Enforces the constraint that all sub-itemsets of a dense itemset must be frequent; will fail to identify larger itemsets with sufficient support when not all of their sub-itemsets have enough support
• Requires repeated scans of the database
Experimental Results - Scalability
Scalability
• Database of 10,000 transactions and 100 items
• Run time increases as noise tolerance increases
• Reducing the item-wise error constraint (єc) leads to a greater reduction in run time than reducing the transaction-wise error constraint (єr)
Experimental Results – Synthetic Data
Quality testing for a single cluster
• Create data with an embedded pattern
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2

Quality testing for multiple clusters
• Create data with multiple embedded patterns
• Add noise by flipping each entry with probability p, where 0.01 ≤ p ≤ 0.2
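The noise-injection step used in both tests can be sketched in a few lines (illustrative code; the authors' exact generator is not shown in the slides):

```python
import random

def add_noise(matrix, p):
    """Flip each binary entry independently with probability p."""
    return [[cell ^ (random.random() < p) for cell in row] for row in matrix]

# e.g. noisy = add_noise(clean_matrix, p=0.05) for any 0.01 <= p <= 0.2
```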
Experimental Results – Synthetic Data
Experimental Results – Real World Data
Zoo Data Set
• The database contains 101 instances and 18 attributes
• All instances are classified into 7 classes, e.g. mammals, fish, etc.

Method (comparing Exact, ETI(єr) and AFI(єr, єc)): generate subsets of animals in each class, then find the subsets of their common features
• Exact and ETI mining identified “fins” and “domestic” as common features, which is NOT necessarily true
• Only AFI was able to recover 3 classes with 100% accuracy
Discussion
Advantages
• Flexibility of placing constraints independently along rows and columns
• Generalized Apriori technique for pruning
• Avoids repeated scans of the database by using 0/1 extensions
Summary
The paper outlines an algorithm for mining approximate frequent itemsets from noisy data
It introduces
• an AFI model
• a generalized Apriori property for pruning
The proposed algorithm generates more useful itemsets compared to existing algorithms and is also computationally more efficient
Thank You!
Extra Slides for Questionnaire
Applying Apriori Algorithm
Database D (min_sup = 2)
TID  Items
T1   a, c, d
T2   b, c, e
T3   a, b, c, e
T4   b, e

Scan D → 1-candidates
Itemset  Sup
a        2
b        3
c        3
d        1
e        3

Freq 1-itemsets
Itemset  Sup
a        2
b        3
c        3
e        3

2-candidates: ab, ac, ae, bc, be, ce

Scan D → counting
Itemset  Sup
ab       1
ac       2
ae       1
bc       2
be       3
ce       2

Freq 2-itemsets
Itemset  Sup
ac       2
bc       2
be       3
ce       2

3-candidates: bce

Scan D → Freq 3-itemsets
Itemset  Sup
bce      2
Transaction  a  b  c  d  e
T1           1  0  1  1  0
T2           0  1  1  0  1
T3           1  1  1  0  1
T4           0  1  0  0  1
T5           0  0  0  0  0
Noise Tolerant Support Pruning - Proof
0/1 Extensions - Proof
The number of zeroes allowed in an itemset grows with the length of the itemset