
Page 1

Setting the scene
Basic concepts
Generating strong association rules
Further aspects

Multivariate Statistics: Association rule learning

Steffen Unkel

Department of Medical Statistics
University Medical Center Goettingen, Germany

Summer term 2017 1/21

Page 2

Motivation

Association rule learning has emerged as a popular tool for discovering interesting relationships between variables that are hidden in large data sets.

The uncovered relationships can be represented in the form of association rules.

Association rules are rules that express association or correlation between sets of items.

For example, the rule {onions, potatoes} ⇒ {burger}

would indicate that if a customer buys onions and potatoes together, he/she is likely to also buy hamburger meat.

Page 3

Applications

Figure: Market basket analysis (image: Association Rules, KDnuggets 2016)

Page 4

Applications

In market basket analysis, such rules can be used as the basis for decisions about marketing activities such as promotional pricing or product placements.

In addition, association rules are employed today in various areas including

- the health sector,
- the financial sector,
- telecommunications,
- web mining,
- fraud detection.

In contrast with sequence mining, association rule learning does not consider the order of items, either within a transaction or across transactions.

Page 5

A simple example from market basket analysis

Five transactions in a supermarket:

Customer 1: milk, bread
Customer 2: bread, butter
Customer 3: beer
Customer 4: milk, bread, butter
Customer 5: bread, butter

Given these transactions, how can we find association rules for potential purchase behaviour?

Page 6

Transaction database of binary items

Suppose there is a set I = {i1, . . . , ip} of p binary items and a set D = {t1, . . . , tn} of n transactions. Each transaction has a unique identification (ID) number and consists of a subset of the items in I.

Supermarket example: items I = {milk, bread, butter, beer} with transaction database D =

ID number  milk  bread  butter  beer
1          1     1      0       0
2          0     1      1       0
3          0     0      0       1
4          1     1      1       0
5          0     1      1       0
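The binary representation above can be reproduced in code. A minimal Python sketch (the names `ITEMS`, `transactions` and `to_binary_row` are my own, not from the slides):

```python
# Supermarket example: five transactions over the item set I.
ITEMS = ["milk", "bread", "butter", "beer"]

transactions = {
    1: {"milk", "bread"},
    2: {"bread", "butter"},
    3: {"beer"},
    4: {"milk", "bread", "butter"},
    5: {"bread", "butter"},
}

def to_binary_row(transaction):
    """Encode one transaction as a row of 0/1 indicators, one per item in I."""
    return [1 if item in transaction else 0 for item in ITEMS]

for tid, t in sorted(transactions.items()):
    print(tid, to_binary_row(t))
```

Each printed row reproduces one line of the transaction database D above.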

Page 7

Association rules – Definition

An association rule is defined as an implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. X and Y are therefore sets of items, so-called itemsets.

An itemset that contains k items is called a k-itemset.

X is called the antecedent and Y the consequent.

Supermarket example: one possible rule would be {milk, bread} ⇒ {butter}. Interpretation: if a customer buys milk and bread, he/she is likely to also buy butter.
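As a data structure, a rule can be modeled as an ordered pair of disjoint itemsets. A minimal sketch (the representation and the helper `make_rule` are my own choice, not from the slides):

```python
# A rule X => Y is an ordered pair of itemsets with X ∩ Y = ∅.
# frozenset is used so that itemsets are immutable and hashable.
def make_rule(antecedent, consequent):
    X, Y = frozenset(antecedent), frozenset(consequent)
    if X & Y:
        raise ValueError("antecedent and consequent must be disjoint")
    return (X, Y)

X, Y = make_rule({"milk", "bread"}, {"butter"})
print(sorted(X), "=>", sorted(Y))
```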

Page 8

The set of all possible rules

The power set P(I) of the set I:

{milk} ⇒ {bread}
{butter} ⇒ {bread}
{beer, milk} ⇒ {butter}
{beer, butter} ⇒ {milk}
{butter, milk} ⇒ {bread}
. . .

How to find interesting association rules?

We need measures of interestingness.

Page 9

Support

The support count of an itemset X is the number of transactions that contain the itemset X.

The (relative) support of an itemset X, supp(X), is defined as the proportion of transactions in the database that contain the itemset X (the relative frequency of the itemset).

The support of a rule, supp(X ⇒ Y), is the support of the joint itemset: supp(X ∪ Y), the relative frequency of the transactions to which the rule can be applied (= P(X ∪ Y)).

Consider the rule {milk, bread} ⇒ {butter}. Since the support count for {milk, bread, butter} is 1 and |D| = 5, the support of the rule is supp({milk, bread, butter}) = 1/5 = 0.2.
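Support can be computed directly from the transaction database. A small Python sketch (the helper name `support` is illustrative, not from the slides):

```python
# The five supermarket transactions as sets of items.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Relative support: proportion of transactions that contain the itemset."""
    count = sum(1 for t in transactions if itemset <= t)  # support count
    return count / len(transactions)

# The support of the rule {milk, bread} => {butter} is supp({milk, bread, butter}).
print(support({"milk", "bread", "butter"}, transactions))  # 0.2
```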

Page 10

Confidence

The confidence determines how frequently items in Y appear in transactions that contain X. Confidence measures the strength of a rule.

The confidence of a rule X ⇒ Y is defined as

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X),

which is the relative frequency with which transactions containing X also contain Y (= P(Y | X)).

Consider again the rule {milk, bread} ⇒ {butter}. Since supp({milk, bread, butter}) = 0.2 and supp({milk, bread}) = 0.4, conf({milk, bread} ⇒ {butter}) = 0.2/0.4 = 0.5.
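The confidence computation can be checked in code. A minimal Python sketch reusing the example data (the helper names are my own):

```python
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return support(X | Y, transactions) / support(X, transactions)

# {milk, bread} => {butter}: 0.2 / 0.4 = 0.5
print(confidence({"milk", "bread"}, {"butter"}, transactions))
```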

Page 11

Association analysis

Association analysis is concerned with finding strong association rules.

Problem: |P(I)| = 2^p.

Generating strong rules:

1. Define a minimum support threshold s and a minimum confidence threshold c.

2. Find all itemsets having support ≥ s. These itemsets are called frequent itemsets.

3. From the frequent itemsets found in the previous step, extract all rules having confidence ≥ c. These rules are called strong rules.

Page 12

Apriori algorithm

The Apriori algorithm for finding strong association rules is an iterative algorithm that is based on the Apriori principle.

Reference: Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A. I. (1995): Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, Chapter 12, AAAI/MIT Press, Cambridge, MA.

Apriori principle

If an itemset is frequent, then all of its subsets must also be frequent.

Conversely, if an itemset is infrequent, then all of its supersets are infrequent.

Page 13

First step: search for frequent itemsets

1. Determine L1, the set of frequent 1-itemsets, that is, the set of all 1-itemsets in I with support ≥ s.

2. While Lk ≠ ∅ (k = 2, . . .):

2.1 Calculate the set of candidate k-itemsets, Ck, from Lk−1 using the Apriori-Gen subroutine.

2.2 Form Lk of all sets in Ck for which the support is ≥ s.

3. The set of all frequent itemsets is ⋃k Lk.

Page 14

Apriori-Gen

The subroutine Apriori-Gen takes as an argument Lk−1, the set of all frequent (k − 1)-itemsets.

It returns Ck, a superset of the set of all frequent k-itemsets.

We assume that items in transactions and itemsets are sorted in lexicographic order.

Two-step procedure based on the Apriori principle:

1. Join step: generate candidate k-itemsets by merging each pair of (k − 1)-itemsets that share their first k − 2 items, and add the unions to the set of candidates, Ck.

2. Prune step: delete from Ck all itemsets that have a (k − 1)-subset not contained in Lk−1.

Page 15

Supermarket example

Task: find all frequent itemsets with a minimum support of 0.2!
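This task can be worked through in code. A compact pure-Python Apriori sketch on the five example transactions; the names are my own, and the join step is simplified (it merges any two (k − 1)-itemsets whose union has k items, rather than the lexicographic first-(k − 2)-items join described above, which yields the same candidates after pruning on small examples like this):

```python
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]
s = 0.2  # minimum support threshold

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori_gen(L_prev, k):
    """Join: merge pairs of frequent (k-1)-itemsets into k-itemsets.
    Prune: drop candidates that have an infrequent (k-1)-subset."""
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(sub) in L_prev for sub in combinations(c, k - 1))}

# Step 1: frequent 1-itemsets.
items = sorted(set().union(*transactions))
Lk = {frozenset([i]) for i in items if support(frozenset([i]), transactions) >= s}
frequent = set(Lk)

# Step 2: iterate until no frequent k-itemsets remain.
k = 2
while Lk:
    Ck = apriori_gen(Lk, k)
    Lk = {c for c in Ck if support(c, transactions) >= s}
    frequent |= Lk
    k += 1

for itemset in sorted(frequent, key=lambda f: (len(f), sorted(f))):
    print(sorted(itemset), support(itemset, transactions))
```

With s = 0.2 this finds all four 1-itemsets, the 2-itemsets {milk, bread}, {milk, butter}, {bread, butter}, and the 3-itemset {milk, bread, butter}.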

Page 16

Second step: forming strong rules

Each frequent k-itemset Z can generate up to 2^k − 2 association rules.

Generate association rules by partitioning Z into two non-empty subsets X and Y = Z \ X with conf(X ⇒ Y) ≥ c.

Use the Apriori principle for generating rules efficiently.

Apriori principle for rules

The confidence of the rule X̃ ⇒ Z \ X̃ cannot be greater than the confidence of the rule X ⇒ Z \ X, where X̃ denotes a subset of X.

Page 17

Supermarket example

Task: generate strong association rules with a minimum confidence of 0.7!
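This second step can also be sketched in code. A naive pure-Python version that enumerates every partition of each frequent itemset (without the Apriori pruning of rules, which matters only for larger itemsets); the names and the hard-coded list of frequent itemsets are my own:

```python
from itertools import combinations

transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]
c = 0.7  # minimum confidence threshold

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Frequent itemsets with support >= 0.2 (the result of the first step).
frequent = [frozenset(f) for f in (
    {"milk"}, {"bread"}, {"butter"}, {"beer"},
    {"milk", "bread"}, {"milk", "butter"}, {"bread", "butter"},
    {"milk", "bread", "butter"},
)]

def strong_rules(frequent, transactions, c):
    """Extract all rules X => Z \\ X with confidence >= c."""
    rules = []
    for Z in frequent:
        if len(Z) < 2:
            continue
        for r in range(1, len(Z)):          # every non-empty proper subset X of Z
            for X in map(frozenset, combinations(Z, r)):
                conf = support(Z, transactions) / support(X, transactions)
                if conf >= c:
                    rules.append((X, Z - X, conf))
    return rules

rules = strong_rules(frequent, transactions, c)
for X, Y, conf in rules:
    print(sorted(X), "=>", sorted(Y), round(conf, 2))
```

With c = 0.7 the strong rules are {milk} ⇒ {bread} (conf 1.0), {bread} ⇒ {butter} (0.75), {butter} ⇒ {bread} (1.0), and {milk, butter} ⇒ {bread} (1.0).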

Page 18

Misleading association rules

An association rule that satisfies the minimum support s and the minimum confidence c is not necessarily interesting!

Example: itemset {X, Y} with

supp(X) = 0.6, supp(Y) = 0.75 and supp(X ⇒ Y) = 0.45.

Then,

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X) = 0.45 / 0.6 = 0.75.

The confidence is exactly as large as the support of Y. The rule only reflects the support of Y and does not make any statement about the association between X and Y.

Page 19

A further measure of interestingness: lift

One way to address this problem is to apply a measure known as lift:

lift(X ⇒ Y) = conf(X ⇒ Y) / supp(Y) = supp(X ∪ Y) / (supp(X) supp(Y)),

which compares the frequency of a pattern against a baseline frequency computed under the assumption of statistical independence.

The lift describes the correlation between binary items:
- lift = 1: no correlation between X and Y
- lift > 1: positive correlation
- lift < 1: negative correlation

The lift is a symmetric measure: the lift for the rule X ⇒ Y is the same as for Y ⇒ X.

Page 20

Lift: Supermarket example

The rule {milk, bread} ⇒ {butter} has a lift of conf/supp({butter}) = 0.5/0.6 ≈ 0.83: butter actually appears about 17% less often in transactions that contain milk and bread than we would expect under statistical independence.

By contrast, the rule {butter} ⇒ {bread} has a lift of 1.0/0.8 = 1.25: bread appears 25% more often in transactions that contain butter than independence would suggest.
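Lift can be computed directly from the example data. A minimal Python sketch (helper names are my own):

```python
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(X, Y, transactions):
    """lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))."""
    return support(X | Y, transactions) / (
        support(X, transactions) * support(Y, transactions)
    )

print(lift({"milk", "bread"}, {"butter"}, transactions))  # ≈ 0.83
print(lift({"butter"}, {"bread"}, transactions))          # ≈ 1.25
```

Note that lift is symmetric: swapping antecedent and consequent gives the same value.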

Page 21

Problems with the lift measure

The lift is sensitive to supp(X) and supp(Y).

Rare sets of items can generate high values of lift.

Example: suppose the rule {mushroom pizza} ⇒ {ice cream} has

supp({mushroom pizza} ∪ {ice cream}) = 0.01 and lift({mushroom pizza} ⇒ {ice cream}) = 4.

One may think that this is an interesting rule, but only a very small number of customers buy mushroom pizza.

A marketing plan to encourage mushroom pizza buyers to purchase ice cream might not have a high impact.
