Chapter 5
Mining Association Rules

Arif Djunaidy
e-mail: [email protected]
URL: www.its-sby.edu/~arif
Outline
- What is association rule mining?
- The Apriori algorithm
- Iceberg queries
- Methods to improve Apriori's efficiency
- Mining frequent patterns without candidate generation
- Interestingness measurements
- Multiple-level association rules mining
What Is Association Rule Mining?

Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.

Examples:
- buys(x, "computer") → buys(x, "software") [2%, 75%]
- age(x, "mature") ^ takes(x, "DM") → grade(x, "A") [5%, 75%]
Association Rule Mining: Basic Principle
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Also known as market basket analysis
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset: an itemset that contains k items

Support count (σ)
- Frequency of occurrence of an itemset
- E.g., σ({Milk, Bread, Diaper}) = 2

Support (s)
- Fraction of transactions that contain an itemset
- E.g., s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
- An itemset whose support is greater than or equal to a minsup threshold
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule

Association rule
- An implication expression of the form X → Y, where X and Y are itemsets
- Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
- Support (s): the fraction of transactions that contain both X and Y
- Confidence (c): measures how often items in Y appear in transactions that contain X

Example for {Milk, Diaper} → {Beer}:

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
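Both metrics can be checked directly against the table above. A minimal Python sketch (the transaction list mirrors the table; the helper name is illustrative):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    # sigma(itemset): number of transactions containing every item of itemset
    return sum(1 for t in db if t >= itemset)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 = 0.67
print(f"s = {s}, c = {c:.2f}")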
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold

High confidence = strong pattern.
High support = occurs often:
- less likely to be a random occurrence
- larger potential benefit from acting on the rule
Application 1 (Retail Stores)

- Real market baskets: chain stores keep TBs of customer purchase info
- Value?
  - how typical customers navigate stores
  - positioning tempting items
  - suggests cross-sell opportunities, e.g., a hamburger sale while raising the ketchup price
- High support needed, or no $$'s
Application 2 (Information Retrieval)

Scenario 1
- baskets = documents
- items = words in documents
- frequent word-groups = linked concepts

Scenario 2
- items = sentences
- baskets = documents containing sentences
- frequent sentence-groups = possible plagiarism
Application 3 (Web Search)

Scenario 1
- baskets = web pages
- items = outgoing links
- pages with similar references → about the same topic

Scenario 2
- baskets = web pages
- items = incoming links
- pages with similar in-links → mirrors, or the same topic
Mining Association Rules

Example of rules (from the table below):
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
Mining Association Rules

Goal: find all association rules such that
- support ≥ s
- confidence ≥ c

Reduction to the frequent itemsets problem:
- Find all frequent itemsets X
- Given X = {A1, …, Ak}, generate all rules X-Aj → Aj
  - Confidence = sup(X)/sup(X-Aj)
  - Support = sup(X)
- Exclude rules whose confidence is too low
- Observe: X-Aj is also frequent, so its support is already known
- Finding all frequent itemsets is the hard part!
Association Rule Mining: A Road Map

- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, "WINDOWS 2K") ^ buys(x, "SQLServer") → buys(x, "DBMiner") [0.2%, 50%]
  - age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
How Are Association Rules Mined from Large Databases?

Association rule mining is a two-step process:
1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence.
Itemset Lattice: An Example

null
A, B, C, D, E
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE
ABCD, ABCE, ABDE, ACDE, BCDE
ABCDE

Given m items, there are 2^m - 1 possible candidate itemsets.
Scale of the Problem

- WalMart
  - sells m = 100,000 items
  - tracks n = 1,000,000,000 baskets
- Web
  - several billion pages
  - approximately one new "word" per page
- Exponential number of itemsets
  - m items → 2^m - 1 possible itemsets
  - cannot possibly examine all itemsets for large m
  - even itemsets of size 2 may be too many: m = 100,000 → 5 trillion item pairs
Frequent Itemsets in SQL

- DBMSs are poorly suited to association rule mining
- Star schema
  - Sales fact table: transaction ID (degenerate dimension), Item dimension

Finding frequent 3-itemsets by self-joining three copies of the fact table (called SalesFact here):

SELECT Fact1.ItemID, Fact2.ItemID, Fact3.ItemID, COUNT(*)
FROM SalesFact Fact1
JOIN SalesFact Fact2
  ON Fact1.TID = Fact2.TID AND Fact1.ItemID < Fact2.ItemID
JOIN SalesFact Fact3
  ON Fact1.TID = Fact3.TID AND Fact2.ItemID < Fact3.ItemID
GROUP BY Fact1.ItemID, Fact2.ItemID, Fact3.ItemID
HAVING COUNT(*) > 1000

- Finding frequent k-itemsets requires joining k copies of the fact table
- The joins are non-equijoins
- Impossibly expensive!
Association Rules and Data Warehouses

Typical procedure:
- Use the data warehouse to apply filters
  - mine association rules for certain regions, dates
- Export all fact rows matching the filters to a flat file
  - sort by transaction ID
  - items in the same transaction are grouped together
- Perform association rule mining on the flat file

An alternative:
- Database vendors are beginning to add specialized data mining capabilities
- Efficient algorithms for common data mining tasks are built into the database system
  - decision trees, association rules, clustering, etc.
- Not standardized yet
Finding Frequent Pairs

- Frequent 2-sets
  - the hard case already
  - focus on pairs for now, later extend to k-sets
- Naïve algorithm
  - counters for all m(m-1)/2 item pairs (m = number of distinct items)
  - a single pass, scanning all baskets
  - a basket of size b increments b(b-1)/2 counters
- Failure? If memory < m(m-1)/2 counters
  - m = 100,000 → 5 trillion item pairs
  - the naïve algorithm is impractical for large m
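As a sketch, the naïve pass is just a pair counter; the counter table itself is the memory bottleneck:

from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    # Naive algorithm: one pass, one counter per item pair.
    # A basket of size b increments b*(b-1)/2 counters; with m distinct
    # items the table can need up to m*(m-1)/2 entries.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return pair_counts

baskets = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"}]
print(count_pairs(baskets).most_common(3))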
Pruning Candidate Itemsets

Monotonicity principle: if an itemset is frequent, then all of its subsets must also be frequent.

Converse: if an itemset is infrequent, then all of its supersets must also be infrequent.

The monotonicity principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Illustrating the Monotonicity Principle

[Figure: the itemset lattice over {A, B, C, D, E}. AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search space.]
Mining Frequent Itemsets: the Key Step

The Apriori principle: any subset of a frequent itemset must be frequent.

- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules
The Apriori Algorithm

- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
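A runnable Python sketch of this loop (my rendering, not the course's code; itemsets are frozensets, and the join/prune step matches the one detailed a few slides below):

from itertools import combinations

def apriori(db, min_support):
    # db: list of transactions (iterables); min_support: absolute count
    db = [set(t) for t in db]
    items = {i for t in db for i in t}
    L = {frozenset([i]) for i in items
         if sum(1 for t in db if i in t) >= min_support}   # L1
    all_frequent = set(L)
    while L:
        k = len(next(iter(L))) + 1
        # Join step: unions of frequent (k-1)-itemsets that differ in one item
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be in L(k-1)
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan the database once to count the surviving candidates
        counts = {c: sum(1 for t in db if c <= t) for c in C}
        L = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= L
    return all_frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(sorted(s) for s in apriori(db, min_support=2)))

On the example database of the next slide this returns {2, 3, 5} as the largest frequent itemset.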
The Apriori Algorithm: Example (sup_min = 2)

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1 (drop {4}, whose support is below 2):
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}

Scan D → L3:
itemset  sup
{2 3 5}  2
Generating Association Rules from Frequent Itemsets

From L2, take the frequent itemset {1, 3}:

1 → 3: sup(1∪3) = 2, conf(1→3) = sup(1∪3)/sup(1) = 2/2 = 100%
3 → 1: sup(1∪3) = 2, conf(3→1) = sup(1∪3)/sup(3) = 2/3 = 67%
Generating Association Rules from Frequent Itemsets (cont.)

From L3, take the frequent itemset {2, 3, 5}:

2∪3 → 5: sup(2∪3∪5) = 2, conf(2∪3 → 5) = sup(2∪3∪5)/sup(2∪3) = 2/2 = 100%
2 → 3∪5: sup(2∪3∪5) = 2, conf(2 → 3∪5) = sup(2∪3∪5)/sup(2) = 2/3 = 67%
2∪5 → 3: sup(2∪3∪5) = 2, conf(2∪5 → 3) = sup(2∪3∪5)/sup(2∪5) = 2/3 = 67%
3∪5 → 2: sup(2∪3∪5) = 2, conf(3∪5 → 2) = sup(2∪3∪5)/sup(3∪5) = 2/2 = 100%
3 → 2∪5: sup(2∪3∪5) = 2, conf(3 → 2∪5) = sup(2∪3∪5)/sup(3) = 2/3 = 67%
5 → 2∪3: sup(2∪3∪5) = 2, conf(5 → 2∪3) = sup(2∪3∪5)/sup(5) = 2/3 = 67%
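The pattern generalizes: every non-empty proper subset A of a frequent itemset X yields a candidate rule A → X-A with conf = sup(X)/sup(A). A sketch over the example database (function names are illustrative):

from itertools import combinations

def rules_from_itemset(X, db, min_conf):
    # Generate all rules A -> X-A from the frequent itemset X
    db = [set(t) for t in db]
    sup = lambda s: sum(1 for t in db if s <= t)   # support count
    sup_X = sup(set(X))
    rules = []
    for r in range(1, len(X)):
        for A in map(set, combinations(X, r)):
            conf = sup_X / sup(A)                  # conf(A -> X-A) = sup(X)/sup(A)
            if conf >= min_conf:
                rules.append((A, set(X) - A, conf))
    return rules

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for A, B, conf in rules_from_itemset({2, 3, 5}, db, min_conf=0.6):
    print(f"{sorted(A)} -> {sorted(B)} (conf = {conf:.0%})")

Run on {2, 3, 5}, it prints exactly the six rules listed above.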
How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
Example of Generating Candidates

- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
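As a sketch with itemsets represented as sorted tuples, the join-then-prune step that reproduces this example:

from itertools import combinations

def gen_candidates(L_prev):
    # Join L(k-1) with itself on the first k-2 items, then prune
    L_prev = sorted(L_prev)
    k = len(L_prev[0]) + 1
    prev = set(L_prev)
    # Join step: p and q agree on the first k-2 items, p's last item < q's last item
    C = {p + (q[-1],) for p in L_prev for q in L_prev
         if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: drop c if any (k-1)-subset of c is not in L(k-1)
    return {c for c in C if all(s in prev for s in combinations(c, k - 1))}

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(gen_candidates(L3))   # {('a','b','c','d')} -- acde is pruned (ade not in L3)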
Iceberg Queries

Iceberg query: compute aggregates over one attribute or a set of attributes only for those groups whose aggregate values are above a certain threshold.

Example:

select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10

Compute iceberg queries efficiently using an Apriori-like strategy:
- First compute the lower dimensions
- Then compute higher dimensions only when all the lower ones are above the threshold
Iceberg Queries (cont.)

Generate cust_list, a list of customers who bought three or more items in total, for example:

select P.cust_ID
from Purchases P
group by P.cust_ID
having SUM(P.qty) >= 3;

Generate item_list, a list of items that were purchased by any customer in quantities of three or more, for example:

select P.item_ID
from Purchases P
group by P.item_ID
having SUM(P.qty) >= 3;
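In procedural terms the strategy looks like this (a sketch over an in-memory purchase list; because quantities are non-negative, any (customer, item) group above the threshold must have its customer in cust_list and its item in item_list, so the pruning is safe):

from collections import Counter

purchases = [("c1", "i1", 2), ("c1", "i2", 2), ("c2", "i1", 1),
             ("c2", "i1", 3), ("c3", "i2", 1)]      # (cust_ID, item_ID, qty)
T = 3

# Lower dimensions first (the cust_list and item_list queries)
cust_tot, item_tot = Counter(), Counter()
for cust, item, qty in purchases:
    cust_tot[cust] += qty
    item_tot[item] += qty
cust_list = {c for c, q in cust_tot.items() if q >= T}
item_list = {i for i, q in item_tot.items() if q >= T}

# Higher dimension: count a (cust, item) pair only when both of its
# lower dimensions pass the threshold (Apriori-style pruning)
pair_tot = Counter()
for cust, item, qty in purchases:
    if cust in cust_list and item in item_list:
        pair_tot[(cust, item)] += qty
print({p: q for p, q in pair_tot.items() if q >= T})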
Is Apriori Fast Enough? Performance Bottlenecks

The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation
- Huge candidate sets:
  - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  - to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database:
  - needs (n + 1) scans, where n is the length of the longest pattern
Methods to Improve Apriori's Efficiency

Transaction reduction:
- A transaction that does not contain any frequent k-itemset is useless in subsequent scans, because it cannot contain any frequent (k+1)-itemset. Therefore, such a transaction can be removed from further consideration.

Partitioning:
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
Partitioning

Phase I (1 scan):
1. Divide D into n partitions
2. Find the frequent itemsets local to each partition

Phase II (1 scan):
3. Combine all local frequent itemsets to form the candidate itemsets
4. Find the global frequent itemsets among the candidates

Input: the transactions in D; output: the frequent itemsets in D.
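A sketch of the two phases (the local miner here is a deliberately simple brute-force counter, standing in for whatever frequent-itemset algorithm is run per partition):

from collections import Counter
from itertools import combinations

def local_frequent(part, minsup_frac, max_len=3):
    # Phase I helper: itemsets (up to max_len) frequent within one partition
    counts = Counter()
    for t in part:
        for k in range(1, max_len + 1):
            counts.update(combinations(sorted(t), k))
    return {i for i, c in counts.items() if c / len(part) >= minsup_frac}

def partition_mine(db, minsup_frac, n_parts=2, max_len=3):
    size = -(-len(db) // n_parts)   # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Phase I: union of locally frequent itemsets = global candidates (1 scan)
    candidates = set().union(*(local_frequent(p, minsup_frac, max_len)
                               for p in parts))
    # Phase II: one more scan to count the candidates globally
    counts = Counter()
    for t in db:
        s = set(t)
        counts.update(c for c in candidates if s.issuperset(c))
    return {c: n for c, n in counts.items() if n / len(db) >= minsup_frac}

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(partition_mine(db, minsup_frac=0.5))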
Scan-Once Algorithm (support count: 3)

Table: Boolean relational database D

               Item a  Item b  Item c  Item d  Item e
Transaction 1  1       1       0       1       1
Transaction 2  0       1       1       0       1
Transaction 3  1       1       0       1       1
Transaction 4  1       1       1       0       1
Transaction 5  1       1       1       1       1
Transaction 6  0       1       1       1       0
Scan-Once Algorithm (cont.)

Figure: a complete itemset tree for the five items a, b, c, d, and e of the database shown in the table.

Level 0 (C(5,1)): a, b, c, d, e
Level 1 (C(5,2)): ab, ac, ad, ae, bc, bd, be, cd, ce, de
Level 2 (C(5,3)): abc, abd, abe, acd, ace, ade, bcd, bce, bde, cde
Level 3 (C(5,4)): abcd, abce, abde, acde, bcde
Level 4 (C(5,5)): abcde
Support Count

[Table: support counts for every itemset in the tree, tallied across transactions T1–T6 of the database above.]
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - a divide-and-conquer methodology: decompose mining tasks into smaller ones
  - avoid candidate generation: sub-database tests only!
Construct an FP-tree from a Transaction DB (min_support = 0.5)

TID  Items bought              (ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o}           {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in descending frequency order
3. Scan the DB again and construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[Figure: the resulting FP-tree]
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
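A compact construction sketch (class and function names are mine, not the original FP-growth code). Ties between equally frequent items are broken alphabetically here, so the item order comes out c, f, a, b, m, p rather than the slide's f, c, a, b, m, p; the tree has the same shape up to that relabeling:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(db, min_count):
    # Scan 1: frequent single items; order by descending frequency
    freq = Counter(i for t in db for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)  # header: item -> node-links
    # Scan 2: insert each transaction's ordered frequent items as a path
    for t in db:
        node = root
        for item in sorted((i for i in t if i in freq),
                           key=lambda i: (-freq[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    # Prefix path of every node carrying `item`, weighted by that node's count
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)
print(conditional_pattern_base("m", header))
# -> [(['c', 'f', 'a'], 2), (['c', 'f', 'a', 'b'], 1)], i.e. fca:2 and fcab:1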
Benefits of the FP-tree Structure

- Completeness: preserves complete information for frequent pattern mining
- Compactness:
  - reduces irrelevant information: infrequent items are gone
  - frequency-descending ordering: more frequent items are more likely to be shared
  - never larger than the original database (not counting node-links and counts)
Mining Frequent Patterns Using the FP-tree

General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree.

Method:
- For each item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in the FP-tree
2) Construct conditional FP-tree from each conditional pattern-base
3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far
Step 1: From FP-tree to Conditional Pattern Bases

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases (for the FP-tree built above):
item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
Step 2: Construct the Conditional FP-tree

For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: {} - f:3 - c:3 - a:3
(b is infrequent within the base and is dropped)

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern Bases

Item  Conditional pattern base     Conditional FP-tree
f     Empty                        Empty
c     {(f:3)}                      {(f:3)}|c
a     {(fc:3)}                     {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}      Empty
m     {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}           {(c:3)}|p
Single FP-tree Path Generation

Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can then be generated by enumerating all the combinations of the sub-paths of P.

m-conditional FP-tree: {} - f:3 - c:3 - a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
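That enumeration is a small exercise in combinations (a sketch; the path and suffix follow the m-conditional example):

from itertools import chain, combinations

def patterns_from_single_path(path_items, suffix):
    # Every combination of the path's items, each joined with the suffix item
    subsets = chain.from_iterable(
        combinations(path_items, r) for r in range(len(path_items) + 1))
    return [set(s) | {suffix} for s in subsets]

# m-conditional FP-tree has the single path f:3 - c:3 - a:3
for p in patterns_from_single_path(["f", "c", "a"], "m"):
    print(sorted(p))
# -> the 8 patterns m, fm, cm, am, fcm, fam, cam, fcam (as item lists)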
Principles of Frequent Pattern Growth

Pattern growth property: let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.

Example: "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde".
Why Is Frequent Pattern Growth Fast?

Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.

Reasoning:
- No candidate generation, no candidate tests
- Uses a compact data structure
- Eliminates repeated database scans
- The basic operations are counting and FP-tree building
Interestingness Measurements

Objective measures:
- two popular measures: support and confidence

Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if it is
- unexpected (surprising to the user), and/or
- actionable (the user can do something with it)
Criticism of Support and Confidence

Example 1 (Aggarwal & Yu, PODS98): among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal

play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
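The lift measure introduced on the next slides quantifies this (a quick check with the numbers above):

n = 5000
p_b = 3000 / n             # P(basketball)
p_c = 3750 / n             # P(cereal)
p_bc = 2000 / n            # P(basketball and cereal)

conf = p_bc / p_b          # 0.667 -- looks strong...
lift = p_bc / (p_b * p_c)  # 0.889 < 1 -- ...but the two are negatively correlated
print(f"confidence = {conf:.3f}, lift = {lift:.3f}")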
Criticism of Support and Confidence (cont.)

Example 2: X and Y are positively correlated, X and Z are negatively related, yet the support and confidence of X ⇒ Z dominate:

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Rule    Support  Confidence
X ⇒ Y   25%      50%
X ⇒ Z   37.50%   75%

We need a measure of dependent or correlated events:

corr(A,B) = P(A ∪ B) / (P(A) P(B))

P(B|A)/P(B) is also called the lift of the rule A ⇒ B.
Other Interestingness Measures: Interest

Interest (correlation, lift):

interest(A,B) = P(A ∪ B) / (P(A) P(B))

- takes both P(A) and P(B) into consideration
- P(A ∪ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.9
Y,Z      12.50%   0.57
Multiple-Level Association Rules

- Items often form a hierarchy
- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining

[Figure: an item hierarchy rooted at "All", branching into Computer (Desktop, Laptop) and Printer (Color, B/W), with brands such as IBM, Compaq, HP, and Sony at the leaves.]

TID  Items (encoded by level)
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}
Mining Multi-Level Associations

A top-down, progressive deepening approach:
- First find high-level strong rules: computer → printer [20%, 60%]
- Then find their lower-level "weaker" rules: desktop → printer [6%, 50%]

Variations in mining multiple-level association rules:
- Level-crossed association rules: desktop → Sony color printer
- Association rules with multiple, alternative hierarchies: desktop → color printer
Uniform Support: Multi-Level Mining with Uniform Support

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 5%): Desktop [support = 6%], Laptop [support = 4%]

With a uniform threshold, Laptop (4%) falls below the 5% minimum support and is lost.
Reduced Support: Multi-Level Mining with Reduced Support

Level 1 (min_sup = 5%): Computer [support = 10%]
Level 2 (min_sup = 3%): Desktop [support = 6%], Laptop [support = 4%]

With the threshold reduced at the lower level, Laptop (4%) now meets the 3% minimum support.
Multi-Dimensional Association: Concepts

Single-dimensional rules:
buys(X, "milk") → buys(X, "bread")

Multi-dimensional rules:
- Inter-dimension association rules (no repeated predicates):
  age(X, "19-25") ^ occupation(X, "student") → buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates):
  age(X, "19-25") ^ buys(X, "popcorn") → buys(X, "coke")
Summary

- Association rule mining is probably the most significant contribution from the database community to KDD
- A large number of papers have been published, and many interesting issues have been explored
- An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time series data
End of Chapter 5