Data Mining
Association Rules Mining or
Market Basket Analysis
Prithwis Mukerjee, Ph.D.
Prithwis Mukerjee 2
Let us describe the problem ...
A retailer sells the following items:
Bread, Cheese, Coffee, Juice, Milk, Tea, Biscuits, Sugar, Newspaper
We assume that the shopkeeper keeps track of what each customer purchases:

Trans ID  Items
10        Bread, Cheese, Newspaper
20        Bread, Cheese, Juice
30        Bread, Milk
40        Cheese, Juice, Milk, Coffee
50        Sugar, Tea, Coffee, Biscuits, Newspaper
60        Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70        Bread, Cheese
80        Bread, Cheese, Juice, Coffee
90        Bread, Milk
100       Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

The retailer needs to know which items are generally sold together.
Associations
Rules expressing relations between items in a "Market Basket"
{Sugar, Tea} => {Biscuits}
Is it true that if a customer buys Sugar and Tea, she will also buy Biscuits? If so, then
- these items should be ordered together
- but discounts should not be given on these items at the same time!
We could make a guess, but it would be better if we could structure this problem in terms of mathematics.
Basic Concepts
Set of n items on sale: I = { i1, i2, i3, ......, in }
Transaction: a subset of I (T ⊆ I), the set of items purchased in an individual transaction.
A transaction with m items is ti = { i1, i2, i3, ......, im }, with m ≤ n.
If we have N transactions, then t1, t2, t3, ... tN are the unique identifiers for each transaction.
D is our total data about all N transactions: D = { t1, t2, t3, ... tN }
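This notation can be sketched directly in Python, where each transaction is a set of items; a minimal sketch using the item names from the retailer example (the dictionary layout is an illustrative choice, not part of the notation):

```python
# Minimal sketch of the notation: I is the set of items on sale, and
# D maps each transaction identifier to a transaction t, a subset of I.
I = {"Bread", "Cheese", "Coffee", "Juice", "Milk",
     "Tea", "Biscuits", "Sugar", "Newspaper"}

D = {
    10: {"Bread", "Cheese", "Newspaper"},
    20: {"Bread", "Cheese", "Juice"},
    30: {"Bread", "Milk"},
    40: {"Cheese", "Juice", "Milk", "Coffee"},
}

# every transaction is a subset of I (t ⊆ I), with m <= n items
assert all(t <= I for t in D.values())
print(len(D))  # N = 4 transactions in this fragment
```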
An Association Rule
Whenever X appears, Y also appears:
X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
X and Y may be single items or sets of items, but the same item does not appear in both.
X is referred to as the antecedent.
Y is referred to as the consequent.
Whether a rule like this exists is the focus of our analysis.
Two key concepts
Support (or prevalence): How often do X and Y appear together in the basket? If this number is very low, the rule is not worth examining. It is expressed as a fraction of the total number of transactions, say 10% or 0.1.
Confidence (or predictability): Of all the occurrences of X, in what fraction does Y also appear? It is expressed as a fraction of all transactions containing X, say 80% or 0.8.
We are interested in rules that have
- a minimum value of support: say 25%
- a minimum value of confidence: say 75%
Mathematically speaking ...
Support(X) = (number of transactions in which X appears) / N = P(X)
Support(X => Y) = (number of transactions in which X and Y both appear) / N = P(X ∪ Y)
Confidence(X => Y) = Support(X => Y) / Support(X) = P(X ∪ Y) / P(X) = the conditional probability P(Y | X)
Lift (an optional term) measures the power of the association: Lift(X => Y) = P(Y | X) / P(Y)
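The three measures can be sketched as small Python functions, assuming each transaction is a set (the four-transaction dataset below is the one used in the naive-algorithm walk-through later):

```python
# Sketch of support, confidence and lift over a list of set-valued transactions.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset` = P(X)."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """P(Y | X) = Support(X => Y) / Support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """P(Y | X) / P(Y): how much X raises the chance of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

D = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
     {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
print(support({"Bread", "Cheese"}, D))       # 0.5
print(confidence({"Juice"}, {"Cheese"}, D))  # 1.0
```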
The task at hand ...
Given a large set of transactions, we seek a procedure (or algorithm) that will discover all association rules that
- have a minimum support of p%
- have a minimum confidence level of q%
and do so in an efficient manner.
Algorithms:
- The Naive or Brute Force method
- The Improved Naive algorithm
- The Apriori algorithm
- Improvements to the Apriori algorithm
- FP (Frequent Pattern) algorithm
Let us try the Naive Algorithm manually !
This is the set of transactions that we have:

Trans ID  Items
100       Bread, Cheese
200       Bread, Cheese, Juice
300       Bread, Milk
400       Cheese, Juice, Milk

We want to find association rules with minimum 50% support and minimum 75% confidence.
Itemsets & Frequencies
Which sets are frequent? Since we are looking for a support of 50%, we need a set to appear in at least 2 out of 4 transactions: support = (# of times X appears) / N = P(X).
6 sets meet this criterion.

Item Set                        Frequency
{Bread}                         3
{Cheese}                        3
{Juice}                         2
{Milk}                          2
{Bread, Cheese}                 2
{Bread, Juice}                  1
{Bread, Milk}                   1
{Cheese, Juice}                 2
{Cheese, Milk}                  1
{Juice, Milk}                   1
{Bread, Cheese, Juice}          1
{Bread, Cheese, Milk}           0
{Bread, Juice, Milk}            0
{Cheese, Juice, Milk}           1
{Bread, Cheese, Juice, Milk}    0
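The naive method above can be sketched as a brute-force enumeration of every non-empty subset of the items, counting each against the transactions:

```python
from itertools import combinations

# Naive (brute-force) sketch: enumerate every non-empty itemset and count
# how many transactions contain it.
D = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
     {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}]
items = sorted(set().union(*D))

freq = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        freq[combo] = sum(1 for t in D if set(combo) <= t)

# 50% minimum support means a count of at least 2 out of 4 transactions
frequent = {s: f for s, f in freq.items() if f >= 2}
print(len(frequent))  # 6 frequent itemsets, matching the table above
```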
A closer look at the "Frequent Set"
Look at the frequent itemsets with more than 1 item: {Bread, Cheese} and {Cheese, Juice}. 4 rules are possible.
Look at their confidence levels: Confidence(X => Y) = Support(X => Y) / Support(X)

Rule              Confidence
Bread => Cheese   2 / 3   67%
Cheese => Bread   2 / 3   67%
Cheese => Juice   2 / 3   67%
Juice => Cheese   2 / 2   100%
The Big Picture
- List all itemsets and find the frequency of each
- Identify "frequent sets" based on support
- Search for rules within the "frequent sets" based on confidence
Looking Beyond the Retail Store
Counter terrorism: track the phone calls made or received from a particular number every day.
- Is an incoming call from a particular number followed by a call to another number?
- Are there any sets of numbers that are always called together?
Expand the item sets to include
- electronic fund transfers
- travel between two locations
- boarding cards
- railway reservations
All of this data is available in electronic format.
Major Problem
Exponential growth of the number of itemsets:
- 4 items : 2^4 = 16 itemsets
- n items : 2^n itemsets
As n becomes larger, the problem can no longer be solved in a reasonable time, so all attempts are made to reduce the number of itemsets to be processed.
"Improved" Naive algorithm: ignore itemsets with zero frequency (in the table above, 3 of the 15 itemsets never appear at all).
The APriori Algorithm
Consists of two parts:
1. First find the frequent itemsets. Most of the cleverness happens here; we will do better than the naive algorithm.
2. Then find the rules. This is relatively simpler.
APriori : Part 1 - Frequent Sets
Step 1: Scan all transactions and find all frequent items that have support above p%. This is the set L1.
Step 2 (Apriori-Gen): Build candidate sets of k items from Lk-1 by using pairs of itemsets in Lk-1 that have their first k-2 items in common, taking the one remaining item from each member of the pair. This is the candidate set Ck.
Step 3: Scan all transactions again and find the frequency of each set in Ck; the sets that are frequent form Lk.
If Lk is empty, stop; otherwise go back to Step 2.
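Under the assumption that transactions are frozensets and minsup is an absolute count, the three steps can be sketched in Python. The candidate join here is a simplification of Apriori-Gen (it unions pairs of (k-1)-sets whose union has exactly k items, and skips the subset-pruning refinement), which is equivalent for small examples:

```python
# Hedged sketch of Part 1, not the textbook pseudo-code.
def apriori_frequent_sets(transactions, minsup):
    """Return all frequent itemsets; minsup is an absolute count."""
    count = lambda s: sum(1 for t in transactions if s <= t)
    # Step 1: L1, the frequent single items
    items = sorted(set().union(*transactions))
    L = [frozenset([i]) for i in items if count(frozenset([i])) >= minsup]
    frequent, k = list(L), 2
    while L:
        # Step 2: build candidate k-sets from pairs of (k-1)-sets
        Ck = {a | b for a in L for b in L if len(a | b) == k}
        # Step 3: keep the candidates that are frequent; this is Lk
        L = [c for c in Ck if count(c) >= minsup]
        frequent.extend(L)
        k += 1                      # loop ends when Lk comes out empty
    return frequent

# The 4-transaction example from earlier, with 50% support (count >= 2):
D = [frozenset(t) for t in ({"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
                            {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"})]
print(len(apriori_frequent_sets(D, 2)))  # 6 frequent itemsets, as before
```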
Example
We have 16 items spread over 25 transactions.

Item No  Item Name
1        Biscuits
2        Bread
3        Cereal
4        Cheese
5        Chocolate
6        Coffee
7        Donuts
8        Eggs
9        Juice
10       Milk
11       Newspaper
12       Pastry
13       Rolls
14       Sugar
15       Tea
16       Yogurt

TID  Items
1    Bread, Cereal, Cheese, Coffee
2    Biscuits, Bread, Cheese, Coffee, Yogurt
3    Bread, Cheese, Coffee, Cereal, Juice
4    Cheese, Chocolate, Donuts, Juice, Milk
5    Milk, Tea
6    Bread, Cereal, Chocolate, Donuts, Juice
7    Biscuits, Bread, Cheese, Coffee, Milk
8    Eggs, Milk, Tea
9    Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea
Apriori : Step 1 – Computing L1
Count the frequency of each item and exclude those that are below the minimum support (25% of 25 transactions = at least 7 appearances).

Item No  Item Name   Frequency
1        Biscuits    4
2        Bread       13
3        Cereal      10
4        Cheese      11
5        Chocolate   9
6        Coffee      9
7        Donuts      10
8        Eggs        2
9        Juice       11
10       Milk        6
11       Newspaper   2
12       Pastry      1
13       Rolls       2
14       Sugar       1
15       Tea         4
16       Yogurt      2

This is the set L1 (items with 25% support):

Item No  Item Name   Frequency
2        Bread       13
3        Cereal      10
4        Cheese      11
5        Chocolate   9
6        Coffee      9
7        Donuts      10
9        Juice       11
Step 2 : Computing C2
Given L1, we now form the candidate pairs of C2. The d = 7 items in L1 form d*(d-1)/2 = 21 pairs – a quadratic function, not an exponential one.

 1  {Bread, Cereal}        12  {Cheese, Chocolate}
 2  {Bread, Cheese}        13  {Cheese, Coffee}
 3  {Bread, Chocolate}     14  {Cheese, Donuts}
 4  {Bread, Coffee}        15  {Cheese, Juice}
 5  {Bread, Donuts}        16  {Chocolate, Coffee}
 6  {Bread, Juice}         17  {Chocolate, Donuts}
 7  {Cereal, Cheese}       18  {Chocolate, Juice}
 8  {Cereal, Coffee}       19  {Coffee, Donuts}
 9  {Cereal, Chocolate}    20  {Coffee, Juice}
10  {Cereal, Donuts}       21  {Donuts, Juice}
11  {Cereal, Juice}
From C2 to L2, based on minimum support
This is a computationally intensive step: every candidate pair must be counted against all 25 transactions.

Candidate 2-Item Set    Freq    Candidate 2-Item Set    Freq
{Bread, Cereal}         9       {Cheese, Chocolate}     4
{Bread, Cheese}         8       {Cheese, Coffee}        9
{Bread, Chocolate}      4       {Cheese, Donuts}        3
{Bread, Coffee}         8       {Cheese, Juice}         4
{Bread, Donuts}         4       {Chocolate, Coffee}     1
{Bread, Juice}          6       {Chocolate, Donuts}     7
{Cereal, Cheese}        5       {Chocolate, Juice}      7
{Cereal, Coffee}        4       {Coffee, Donuts}        1
{Cereal, Chocolate}     5       {Coffee, Juice}         2
{Cereal, Donuts}        4       {Donuts, Juice}         9
{Cereal, Juice}         6

Keeping the pairs with 25% support (frequency >= 7) gives L2, which is not empty:

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9
Step 2 Again : Get C3
We combine the frequent 2-item sets from L2 that have the same first item and obtain four candidate itemsets, each containing three items.

Frequent 2-Item Set     Freq
{Bread, Cereal}         9
{Bread, Cheese}         8
{Bread, Coffee}         8
{Cheese, Coffee}        9
{Chocolate, Donuts}     7
{Chocolate, Juice}      7
{Donuts, Juice}         9

Candidate 3-item sets (C3):
{Bread, Cereal, Cheese}
{Bread, Cereal, Coffee}
{Bread, Cheese, Coffee}
{Chocolate, Donuts, Juice}
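The same-first-item join that produces these four candidates can be sketched as follows (a minimal sketch; the L2 itemsets are kept as alphabetically sorted tuples so that "first k-2 items" is well defined):

```python
from itertools import combinations

# Apriori-Gen join for this example: combine pairs of sorted 2-itemsets
# that share their first item to build the 3-item candidates of C3.
L2 = [("Bread", "Cereal"), ("Bread", "Cheese"), ("Bread", "Coffee"),
      ("Cheese", "Coffee"), ("Chocolate", "Donuts"), ("Chocolate", "Juice"),
      ("Donuts", "Juice")]

C3 = sorted({tuple(sorted(set(a) | set(b)))
             for a, b in combinations(L2, 2)
             if a[:-1] == b[:-1]})          # first k-2 items must match
print(C3)
```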
Step 3 Again : C3 to L3
Again based on minimum support (25%, i.e. frequency >= 7):

Candidate 3-item set         Frequency
{Bread, Cereal, Cheese}      4
{Bread, Cereal, Coffee}      4
{Bread, Cheese, Coffee}      8
{Chocolate, Donuts, Juice}   7

Frequent 3-item set          Frequency
{Bread, Cheese, Coffee}      8
{Chocolate, Donuts, Juice}   7

The two sets in L3 share no common items, so C4 cannot be formed, L4 cannot be formed, and we stop here.
APriori : Part 2 – Find Rules
Rules will be found by looking at
- 3-item sets found in L3
- 2-item sets in L2 that are not subsets of any set in L3
In each case we calculate Confidence(A => B) = P(B | A) = P(A ∪ B) / P(A).
Some shorthand: {Bread, Cheese, Coffee} is written as {B, C, D}.
Rules for Finding Rules !
A 3-item frequent set {B, C, D} results in 6 rules:
B => CD, C => BD, D => BC
CD => B, BD => C, BC => D
Also note that B => CD implies the single-item rules B => C and B => D.
We now look at the two 3-item sets in L3 (the highest L set) – {Bread, Cheese, Coffee} and {Chocolate, Donuts, Juice} – and find their confidence levels, noting that the support counts for these sets are 8 and 7.
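A sketch of this rule-generation step, assuming the support counts from the example are already available in a dictionary (the single-letter names B, C, D follow the shorthand above):

```python
from itertools import combinations

# Support counts (not fractions) from the 25-transaction example,
# for {Bread B, Cheese C, Coffee D} and its subsets.
freq = {frozenset(["B"]): 13, frozenset(["C"]): 11, frozenset(["D"]): 9,
        frozenset(["B", "C"]): 8, frozenset(["B", "D"]): 8,
        frozenset(["C", "D"]): 9, frozenset(["B", "C", "D"]): 8}

def rules_from(itemset, freq, minconf):
    """Every proper non-empty split of a frequent set gives one candidate rule."""
    itemset = frozenset(itemset)
    out = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = freq[itemset] / freq[lhs]      # support(XY) / support(X)
            if conf >= minconf:
                out.append((set(lhs), set(itemset - lhs), round(conf, 3)))
    return out

# {B, C, D} with a 70% cutoff: B => CD (8/13 ≈ 0.615) drops out, 5 rules remain
print(rules_from(["B", "C", "D"], freq, 0.7))
```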
Rules from the First of 2 Itemsets in L3
Confidence of association rules from {Bread B, Cheese C, Coffee D}, using Confidence(X => Y) = P(Y | X) = P(X ∪ Y) / P(X). The numerator is the support count of BCD (8) and the denominator is the frequency of the left-hand side.

Rule       Support of BCD   Freq of LHS   Confidence
B => CD    8                13            0.615
C => BD    8                11            0.727
D => BC    8                9             0.889
CD => B    8                9             0.889
BD => C    8                8             1.000
BC => D    8                8             1.000

One rule (B => CD) drops out because its confidence is below 70%.
Rules from the Second of 2 Itemsets in L3
Confidence of association rules from {Chocolate N, Donuts M, Juice P}. The numerator is the support count of NMP (7) and the denominator is the frequency of the left-hand side.

Rule       Support of NMP   Freq of LHS   Confidence
N => MP    7                9             0.778
M => NP    7                10            0.700
P => NM    7                11            0.636
MP => N    7                9             0.778
NP => M    7                7             1.000
NM => P    7                7             1.000

One rule (P => NM) drops out because its confidence is below 70%.
Set of 14 Rules obtained from L3
From C => BD :   1  Cheese => Bread          2  Cheese => Coffee
From D => BC :   3  Coffee => Bread          4  Coffee => Cheese
CD => B      :   5  Cheese, Coffee => Bread
BD => C      :   6  Bread, Coffee => Cheese
BC => D      :   7  Bread, Cheese => Coffee
From N => MP :   8  Chocolate => Donuts      9  Chocolate => Juice
From M => NP :  10  Donuts => Juice         11  Donuts => Chocolate
MP => N      :  12  Donuts, Juice => Chocolate
NP => M      :  13  Chocolate, Juice => Donuts
NM => P      :  14  Chocolate, Donuts => Juice
What about L2 ?
Look for sets in L2 that are not subsets of any set in L3. {Bread, Cereal} is the only candidate, which gives us two more rules:
Bread => Cereal
Cereal => Bread
Their confidences are 9/13 ≈ 0.69 for Bread => Cereal and 9/10 = 0.90 for Cereal => Bread.
Which are now added to get 16 rules
 1  Cheese => Bread              9  Chocolate => Juice
 2  Cheese => Coffee            10  Donuts => Juice
 3  Coffee => Bread             11  Donuts => Chocolate
 4  Coffee => Cheese            12  Donuts, Juice => Chocolate
 5  Cheese, Coffee => Bread     13  Chocolate, Juice => Donuts
 6  Bread, Coffee => Cheese     14  Chocolate, Donuts => Juice
 7  Bread, Cheese => Coffee     15  Bread => Cereal
 8  Chocolate => Donuts         16  Cereal => Bread
So where are we ?
We have just completed the two parts of the Apriori algorithm:
- First find the frequent itemsets – most of the cleverness happens here, and we did better than the naive algorithm.
- Then find the rules – this is relatively simpler.
The overall approach to Association Rules Mining is:
- List all itemsets and find the frequency of each
- Identify "frequent sets" based on support
- Search for rules within the "frequent sets" based on confidence
Naive algorithm : exponential time.
Apriori algorithm : pruning keeps the candidate generation at each level polynomial, which makes the problem tractable in practice.
Observations
Actual values of support and confidence: 25% and 75% are very high values; in reality one works with far smaller values.
"Interestingness" of a rule: since X and Y are related events – not independent – we have P(X ∪ Y) ≠ P(X)·P(Y). Interestingness = P(X ∪ Y) – P(X)·P(Y).
Triviality of rules: rules involving very frequent items can be trivial. You always buy potatoes when you go to the market, and so you can get rules that connect potatoes to many things.
Inexplicable rules: a toothbrush was the most frequent item on Tuesday ??
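The interestingness measure can be computed directly from the support values; a minimal sketch, using {Chocolate} => {Juice} from the 25-transaction example (support 7/25, with Chocolate appearing in 9 and Juice in 11 transactions):

```python
# Interestingness: how far the joint support departs from what
# independence would predict, P(X ∪ Y) - P(X) * P(Y).
def interestingness(sup_xy, sup_x, sup_y):
    return sup_xy - sup_x * sup_y

print(interestingness(7/25, 9/25, 11/25))  # 0.28 - 0.1584 = 0.1216
```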
Better Algorithms
Enhancements to the Apriori algorithm:
- AprioriTid
- Direct Hashing and Pruning (DHP)
- Dynamic Itemset Counting (DIC)
Frequent Pattern (FP) Tree:
- Only frequent items are needed to find association rules – so ignore the others!
- Move the data of only the frequent items to a more compact and efficient structure: a tree structure or a directed graph is used.
- Multiple transactions with the same (frequent) items are stored once, with a count.
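A minimal sketch of the FP-tree idea (not the full FP-Growth algorithm): transactions, restricted to frequent items in a fixed global order, share prefixes in a tree whose nodes carry counts:

```python
# Sketch of an FP-tree: shared transaction prefixes are stored once, counted.
class FPNode:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, frequent_order):
    root = FPNode(None)
    for t in transactions:
        node = root
        # keep only frequent items, always in the agreed global order
        for item in [i for i in frequent_order if i in t]:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1      # this prefix seen once more
    return root

D = [{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"}, {"Bread", "Milk"}]
root = build_fp_tree(D, ["Bread", "Cheese", "Juice", "Milk"])
print(root.children["Bread"].count)  # 3 – all three transactions share "Bread"
```

The compactness comes from the shared prefixes: the three transactions occupy only five tree nodes instead of eight item occurrences.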
Software Support
- KDNuggets.com : excellent collection of software available
- Bart Goethals : free software for Apriori and FP-Tree
- ARMiner : GNU open source software from UMass/Boston
- DMII : National University of Singapore
- DB2 Intelligent Miner : IBM Corporation; equivalent software is available from other vendors as well