A. Nanopoulos, Y. Manolopoulos: Introduction to Data Mining and Data Warehouses

Chapter 8: Association Rules

http://delab.csd.auth.gr/books/grBooks/grBooks.html


Associations

Rules expressing relationships between items

Example

cereal, milk => fruit

“People who bought cereal and milk also bought fruit.”

Stores might want to offer specials on milk and cereal to get people to buy more fruit.


Market Basket Analysis

Analyze tables of transactions

Can we hypothesize?

Chips => Salsa

Lettuce => Spinach

Person  Basket
A       Chips, Salsa, Cookies, Crackers, Coke, Beer
B       Lettuce, Spinach, Oranges, Celery, Apples, Grapes
C       Chips, Salsa, Frozen Pizza, Frozen Cake
D       Lettuce, Spinach, Milk, Butter


Definitions

Let $I = \{i_1, \dots, i_m\}$ be a set of literals called items

Let $D$ be a set of transactions; for each $T \in D$, $T \subseteq I$

Itemset $X$: any $X \subseteq I$; k-itemset: $|X| = k$

$X$ is contained by $T$ if $X \subseteq T$

Support $s(X)$: fraction of transactions containing $X$

$$s(X) = \frac{|\{T \mid T \in D,\ X \subseteq T\}|}{|D|}$$
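The support definition can be sketched in a few lines of Python. This is only an illustration: the function name `support` and the representation of transactions as Python sets are assumptions, and the data is the market-basket table shown earlier.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

# Market-basket table (persons A to D)
baskets = [
    {"Chips", "Salsa", "Cookies", "Crackers", "Coke", "Beer"},
    {"Lettuce", "Spinach", "Oranges", "Celery", "Apples", "Grapes"},
    {"Chips", "Salsa", "Frozen Pizza", "Frozen Cake"},
    {"Lettuce", "Spinach", "Milk", "Butter"},
]

print(support({"Chips", "Salsa"}, baskets))  # 0.5
print(support({"Milk"}, baskets))            # 0.25
```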


Rule Measures: Support and Confidence

Find all the rules Beer => Diaper with minimum confidence and support

support, s: probability that a transaction contains {Beer, Diaper}

confidence, c: conditional probability that a transaction containing {Beer} also contains Diaper

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have

A => C (50%, 66.6%)

C => A (50%, 100%)


Association rules: Problem description

Association rule: X => Y, where $X \subset I$, $Y \subset I$, $X \cap Y = \emptyset$

Support of rule = $s(X \cup Y)$

Confidence of rule = $s(X \cup Y) / s(X)$

Problem: find all association rules with confidence larger than MINCONF and support larger than MINSUP

Notice that if X => Y and Y => Z, then not necessarily X => Z, because X => Z may not have enough support or confidence


Mining Association Rules—An Example

Min. support 50%, min. confidence 50%

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A => C:

support = support({A, C}) = 50%

confidence = support({A, C}) / support({A}) = 66.6%
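The two measures for this example can be checked with a small Python sketch. The helper names `s` and `rule_measures` are assumptions made for illustration; the data is the four-transaction table of this slide.

```python
# The four transactions of the example (TIDs 2000, 1000, 4000, 5000)
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def s(itemset):
    """Support: fraction of transactions in D containing the itemset."""
    return sum(1 for t in D if set(itemset) <= t) / len(D)

def rule_measures(X, Y):
    """Return (support, confidence) of the rule X => Y."""
    both = s(set(X) | set(Y))
    return both, both / s(X)

print(rule_measures({"A"}, {"C"}))  # support 0.5, confidence about 0.667
print(rule_measures({"C"}, {"A"}))  # (0.5, 1.0)
```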


Association Rule Mining

Two basic steps

1. Find all frequent itemsets: those satisfying minimum support

2. Find all strong association rules: generate association rules from the frequent itemsets and keep the rules satisfying minimum confidence


Generating Frequent Itemsets: Naive algorithm

n <- |D|
for each subset s of I do
    l <- 0
    for each transaction T in D do
        if s is a subset of T then
            l <- l + 1
    if minimum support <= l/n then
        add s to frequent subsets
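A direct Python transcription of the naive algorithm might look as follows. It is a sketch under some assumptions: the function name is hypothetical, the empty subset is skipped, and the test data is the four-transaction example used earlier.

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, items, min_support):
    """Test every non-empty subset of `items` against every transaction."""
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            count = sum(1 for t in transactions if set(s) <= t)
            if count / n >= min_support:
                frequent[frozenset(s)] = count / n
    return frequent

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
items = set().union(*D)
result = naive_frequent_itemsets(D, items, 0.5)
for itemset, sup in result.items():
    print(sorted(itemset), sup)
```

With six items the loop already tests 63 subsets against every transaction, which is the exponential growth the next slide analyzes.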


Analysis of naive algorithm

$O(2^m)$ subsets of I

Scan n transactions for each subset

$O(2^m n)$ tests of s being a subset of T

Growth is exponential in the number of items!

Can we do better?


Reducing Number of Candidates: Apriori

Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

$$\forall X, Y: (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$$

The support of an itemset never exceeds the support of its subsets

This is known as the anti-monotone property of support


Illustrating Apriori Principle

[Figure: lattice of all itemsets over the items A, B, C, D, E, from null to ABCDE. Once an itemset is found to be infrequent, all of its supersets are pruned.]


Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs)

Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets)

Itemset              Count
{Bread,Milk,Diaper}  3

If every subset is considered: $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates

With support-based pruning: 6 + 6 + 1 = 13 candidates


The Apriori Algorithm

Join Step: C_k is generated by joining L_{k-1} with itself

Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:

C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
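The pseudo-code can be turned into a compact Python sketch. The function name `apriori` and the use of an absolute support count (`min_count`) are assumptions. For brevity the join step here unites any two frequent k-itemsets whose union has k+1 items, which is a simplification of the ordered self-join; the prune step then removes the extra candidates. The data is the small database D (TIDs 100 to 400) of the worked Apriori example in this chapter.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # first scan: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}   # L_1
    frequent = dict(level)
    k = 1
    while level:
        # Join (simplified): unions of two frequent k-itemsets with k+1 items
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        # One scan of the database to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # L_{k+1}
        frequent.update(level)
        k += 1
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset, count in sorted(result.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```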


How to Generate Candidates?

Suppose the items in L_{k-1} are listed in an order

Step 1: self-joining L_{k-1}

insert into C_k
select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
from L_{k-1} p, L_{k-1} q
where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Step 2: pruning

forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k


Example of Generating Candidates

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:

acde is removed because ade is not in L3

C4={abcd}
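The join and prune steps of this example can be reproduced with a short Python sketch. The function name `generate_candidates` and the representation of itemsets as sorted tuples are assumptions; the input is assumed to be non-empty.

```python
from itertools import combinations

def generate_candidates(L_prev):
    """L_prev: non-empty set of sorted tuples, all of length k-1. Returns C_k."""
    k_minus_1 = len(next(iter(L_prev)))
    # Join step: equal on the first k-2 items, last item of p < last item of q
    joined = {p + (q[-1],) for p in L_prev for q in L_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: delete c if some (k-1)-subset of c is not in L_prev
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k_minus_1))}

L3 = {tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(generate_candidates(L3))  # {('a', 'b', 'c', 'd')}
```

The join produces abcd and acde; acde is dropped because its subset ade is not in L3.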


How to Count Supports of Candidates?

Candidate counting:

Scan the database of transactions to determine the support of each candidate itemset

To reduce the number of comparisons, store the candidates in a hash structure

Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: the N transactions are matched against a hash structure with k buckets]


Example

b=2, c=3


The Apriori Algorithm — Example

Database D

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D to obtain C1

itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1

itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1)

{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2

itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2

itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (generated from L2)

{2 3 5}

Scan D to obtain L3

itemset  sup
{2 3 5}  2


Another example of Apriori

Minsup = 2




Generating rules (2nd sub-problem)

Given a frequent itemset L, find all non-empty subsets $f \subset L$ such that f => L – f satisfies the minimum confidence requirement

If {A,B,C,D} is a frequent itemset, the candidate rules are:

ABC => D, ABD => C, ACD => B, BCD => A,
A => BCD, B => ACD, C => ABD, D => ABC,
AB => CD, AC => BD, AD => BC, BC => AD, BD => AC, CD => AB

If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L => ∅ and ∅ => L)
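Enumerating the candidate rules of a frequent itemset, and filtering them by confidence, can be sketched as below. The names `candidate_rules` and `strong_rules` are hypothetical, and `support` is assumed to be a dictionary holding the support of every needed itemset.

```python
from itertools import combinations

def candidate_rules(L):
    """All rules f => L - f where f is a non-empty proper subset of L."""
    items = sorted(L)
    rules = []
    for r in range(1, len(items)):
        for f in combinations(items, r):
            rules.append((set(f), set(items) - set(f)))
    return rules

def strong_rules(L, support, min_conf):
    """Keep the rules whose confidence s(L) / s(f) reaches min_conf."""
    return [(f, rest) for f, rest in candidate_rules(L)
            if support[frozenset(L)] / support[frozenset(f)] >= min_conf]

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))  # 14, i.e. 2**4 - 2

# Supports taken from the four-transaction example earlier in the chapter
support = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
print(strong_rules({"A", "C"}, support, min_conf=0.7))  # only C => A survives
```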


Rule Generation with anti-monotone property

How to efficiently generate rules from frequent itemsets?

In general, confidence does not have an anti-monotone property: c(ABC => D) can be larger or smaller than c(AB => D)

But the confidence of rules generated from the same itemset has an anti-monotone property

e.g., L = {A,B,C,D}:

c(ABC => D) ≥ c(AB => CD) ≥ c(A => BCD)

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule


Rule Generation: example of anti-monotonicity

Lattice of rules

ABCD => { }

BCD => A    ACD => B    ABD => C    ABC => D

CD => AB    BD => AC    BC => AD    AD => BC    AC => BD    AB => CD

D => ABC    C => ABD    B => ACD    A => BCD

[Figure: the same lattice with a low-confidence rule marked and the rules beneath it pruned]


Methods to Improve Apriori’s Efficiency

Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent

Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans

Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness

Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent


Hash-based method

|C2| = comb(|L1|, 2) = $O(|L_1|^2)$; |L1| may be in the thousands, so |C2| is the bottleneck

While counting the support of C1:

find the 2-itemsets in each transaction

hash them into buckets, accumulating counts

if a 2-itemset belongs to a bucket with count less than MINSUP, then it cannot be large


Example of hash-based method

TID  Items
100  ACD
200  BCE
300  ABCE
400  BE

C1: {A} (2), {B} (3), {C} (3), {D} (1), {E} (3)

L1: {A}, {B}, {C}, {E}

Hash table for the 2-itemsets of the transactions

Bucket count  Bucket contents
3             {C,E}, {C,E}, {A,D}
3             {B,C}, {B,C}, {A,E}
0             (empty)
3             {B,E}, {B,E}, {B,E}
1             {A,B}
3             {A,C}, {C,D}, {A,C}

L1 x L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

C2: {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
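The bucket-counting idea can be sketched in Python as follows. The hash function is not given in the text, so `bucket` below is a hypothetical choice made only for illustration (it maps pairs to 7 buckets), as are the function names. With this particular hash, {A,E} lands alone in a bucket of count 1 and is pruned as well, whereas in the example above it shares a bucket with {B,C} and survives; which infrequent pairs get filtered depends on the hash function.

```python
from itertools import combinations
from collections import Counter

def rank(item):
    return ord(item) - ord("A") + 1          # A=1, B=2, ...

def bucket(pair, n_buckets=7):
    # Hypothetical hash function, chosen only for illustration
    x, y = sorted(pair)
    return (10 * rank(x) + rank(y)) % n_buckets

def hash_pruned_c2(transactions, min_count):
    item_counts, bucket_counts = Counter(), Counter()
    for t in transactions:                   # one pass: count C1 and hash all pairs
        item_counts.update(t)
        for pair in combinations(sorted(t), 2):
            bucket_counts[bucket(pair)] += 1
    L1 = sorted(i for i, c in item_counts.items() if c >= min_count)
    # Keep a pair of L1 x L1 only if its bucket count reaches the threshold
    return [p for p in combinations(L1, 2) if bucket_counts[bucket(p)] >= min_count]

D = ["ACD", "BCE", "ABCE", "BE"]
print(hash_pruned_c2(D, min_count=2))
# [('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')]
```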


Transaction trimming

While counting C_k (phase k), an item in a transaction t can be trimmed if it does not appear in at least k members of C_k (the k-itemset candidates)

Example: ABC, ABD and BCD are candidates, but ACD is not. If t contains A, then A can be trimmed from t, because it will not participate in forthcoming passes (ABCD will not be generated)

If t contains no more than k items, discard it


Example of transaction trimming

C2: {A,C}, {B,C}, {B,E}, {C,E}

TID  Items  After trimming  After length check
100  ACD    {C}             discard (due to length)
200  BCE    {B,C,E}         {B,C,E}
300  ABCE   {B,C,E}         {B,C,E}
400  BE     {B,E}           discard (due to length)
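The two trimming rules can be sketched in Python. The function name `trim_transactions` and the dictionary representation of the database are assumptions. An item is kept only if it occurs in at least k candidates of C_k, and a trimmed transaction is kept only if it still has more than k items.

```python
from collections import Counter

def trim_transactions(transactions, candidates, k):
    """transactions: dict TID -> set of items; candidates: the k-itemsets of C_k."""
    occurrences = Counter(item for c in candidates for item in c)
    trimmed = {}
    for tid, items in transactions.items():
        kept = {i for i in items if occurrences[i] >= k}
        if len(kept) > k:        # more than k items are needed for a (k+1)-itemset
            trimmed[tid] = kept
    return trimmed

D = {100: set("ACD"), 200: set("BCE"), 300: set("ABCE"), 400: set("BE")}
C2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
result = trim_transactions(D, C2, k=2)
print(sorted(result))  # [200, 300]
```

Transactions 100 and 400 shrink to {C} and {B,E} and are then discarded because of their length.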


Partitioning method

Logically partitions the database into a number of non-overlapping partitions; each partition fits entirely in main memory

Finds the large itemsets in each partition (local large itemsets); multiple phases are performed in main memory, with I/O only to read the partition once

Determines which of the local large itemsets are also global large itemsets in one scan of the database; no global large itemset is missed


Partition Algorithm

partition_database(D)
n = number of partitions

// Phase I
for i = 1 to n do
    read i-th partition
    L_i = gen_all_large_itemsets(i-th partition)

// Phase II
C = L_1 ∪ … ∪ L_n
for i = 1 to n do
    read i-th partition
    for each candidate c ∈ C update c.count

L = {c | c ∈ C, c.count ≥ MINSUP}
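A small Python sketch of the two phases follows. It rests on several assumptions: the function names are hypothetical, the database is a list split into n contiguous chunks, MINSUP is a fraction, and a brute-force miner stands in for gen_all_large_itemsets (any in-memory algorithm could take its place).

```python
from itertools import combinations

def local_large_itemsets(partition, min_support):
    """Brute-force stand-in for gen_all_large_itemsets on one partition."""
    items = sorted(set().union(*partition))
    large = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            count = sum(1 for t in partition if set(c) <= t)
            if count >= min_support * len(partition):
                large.add(frozenset(c))
    return large

def partition_mine(D, n, min_support):
    size = -(-len(D) // n)                           # ceiling division
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Phase I: the union of the local large itemsets is the candidate set C
    C = set().union(*(local_large_itemsets(p, min_support) for p in parts))
    # Phase II: one more scan counts every candidate over the whole database
    counts = {c: sum(1 for t in D if c <= t) for c in C}
    return {c for c, cnt in counts.items() if cnt >= min_support * len(D)}

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = partition_mine(D, n=2, min_support=0.5)
print(sorted(sorted(c) for c in result))
# [['A'], ['A', 'C'], ['B'], ['C']]
```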

Partitioning example


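Why is Partition correct?

We have to show that no large itemset (global large) is missed

A global large itemset X is not missed if it is a candidate in the 2nd phase, i.e., when it is local large in at least one partition

Let X be a global large itemset which is not local large in any partition. Writing $t(X, P)$ for the number of transactions of P that contain X:

$$t(X, P_i) < \text{MINSUP} \cdot |P_i|, \quad 1 \le i \le n$$

$$\sum_{i=1}^{n} t(X, P_i) < \text{MINSUP} \cdot \sum_{i=1}^{n} |P_i| \;\Rightarrow\; t(X, D) < \text{MINSUP} \cdot |D| \;\Rightarrow\; s(X) < \text{MINSUP}$$

(contradiction)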


Size of candidate set for the 2nd phase

Based on the assumption that the number |C| of local large itemsets is small enough, i.e., comparable to |L| (global large)

Since the same MINSUP is used, |C| can be expected to be about the same as |L|

In the worst case it is n |L|; in practice, many local large itemsets overlap (the same itemset is large in different partitions)

For small n, |C| is very close to |L|

For larger n, |C| is larger, but not the worst case


Sampling method

Main idea

get a random sample s of database D

find the large itemsets S in s

verify which of S are large in D (one scan)

identify potentially “missed” large itemsets (possibly a 2nd scan)

Gains

mining s is for free in terms of I/O

one scan of D in the best case and two in the worst case


Negative border


How not to miss anything?

The goal is to avoid missing any itemset that is frequent in the full set of baskets

If X is frequent in D, but not frequent in S, then there exists $Y \subseteq X$ with $Y \in Bd^-(S)$

We check whether something in the negative border is actually frequent in D

Try to choose Minsup in S so that the probability of failure is low, while the number of itemsets checked on the second pass fits in main memory


Sampling algorithm


Example

Domain = {A, B, C, D, E, F}

S = {A,B,C}, {A,C,F}, {A,D}, {B,D} (together with all of their subsets)

Bd-(S) = {B,F}, {C,D}, {D,F}, {E}, {A,B,D}: candidates not found large in the sample

L = {A,B}, {B,F}, {A,C,F} large in D

{B,F} represents a “miss”

{A,B}, {B,F} and {A,F} generate {A,B,F}, which can possibly be large but was not counted
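The negative border of this example can be computed with a short Python sketch. It assumes the usual definition (the minimal itemsets that are not in S, i.e. itemsets outside S all of whose subsets with one item less are in S) and assumes that S is given by its maximal sets; the function names are hypothetical. Under this definition the result also contains {A,B,D}, because {A,B}, {A,D} and {B,D} all belong to S.

```python
from itertools import combinations

def downward_closure(maximal_sets):
    """All non-empty subsets of the given maximal itemsets."""
    closed = set()
    for m in maximal_sets:
        for k in range(1, len(m) + 1):
            closed.update(frozenset(c) for c in combinations(sorted(m), k))
    return closed

def negative_border(S, domain):
    """Itemsets not in S whose subsets with one item less are all in S."""
    border = set()
    max_len = max(len(x) for x in S)
    for k in range(1, max_len + 2):
        for c in map(frozenset, combinations(sorted(domain), k)):
            if c in S:
                continue
            # A single item outside S is always minimal
            if k == 1 or all(c - {i} in S for i in c):
                border.add(c)
    return border

S = downward_closure(["ABC", "ACF", "AD", "BD"])
border = negative_border(S, "ABCDEF")
print(sorted("".join(sorted(b)) for b in border))
# ['ABD', 'BF', 'CD', 'DF', 'E']
```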


Is Apriori Fast Enough? — Performance Bottlenecks

The core of the Apriori algorithm:

Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets

Use database scan and pattern matching to collect counts for the candidate itemsets

The bottleneck of Apriori: candidate generation

Huge candidate sets:

$10^4$ frequent 1-itemsets will generate $10^7$ candidate 2-itemsets

To discover a frequent pattern of size 100, e.g., $\{a_1, a_2, \dots, a_{100}\}$, one needs to generate $2^{100} \approx 10^{30}$ candidates

Multiple scans of database:

Needs (n + 1) scans, where n is the length of the longest pattern


Mining Frequent Patterns Without Candidate Generation

Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure

highly condensed, but complete for frequent pattern mining

avoids costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method

a divide-and-conquer methodology: decompose mining tasks into smaller ones

avoid candidate generation: sub-database test only


Construct FP-tree from a Transaction DB

min_support = 0.5

TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:

1. Scan DB once, find frequent 1-itemsets (single item patterns)

2. Order frequent items in descending order of frequency

3. Scan DB again, construct FP-tree

Header Table

Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

FP-tree (each header table entry is the head of the node-links of its item)

{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
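The three construction steps can be sketched in Python. The class `Node`, the function `build_fp_tree` and its `order` parameter are assumptions made for this illustration. Because several items tie in frequency, the tie-breaking order (f, c, a, b, m, p) is passed in explicitly to match the example; min_support = 0.5 over five transactions is taken as a minimum count of 3.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count, order=None):
    """Return (root, header); header maps each item to the list of its nodes."""
    # Step 1: first scan, count the items and keep the frequent ones
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Step 2: descending frequency; ties follow `order` (an assumption)
    tie = {i: n for n, i in enumerate(order or sorted(freq))}
    ranked = sorted(freq, key=lambda i: (-freq[i], tie[i]))
    position = {i: n for n, i in enumerate(ranked)}
    # Step 3: second scan, insert the ordered frequent items of each transaction
    root, header = Node(None), {i: [] for i in ranked}
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in position), key=position.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

D = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_fp_tree(D, min_count=3, order="fcabmp")
print({i: sum(n.count for n in nodes) for i, nodes in header.items()})
# {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}
```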

Mining Frequent Patterns Using FP-tree

General idea (divide-and-conquer)

Recursively grow frequent pattern paths using the FP-tree

Method

For each item, construct its conditional pattern-base, and then its conditional FP-tree

Repeat the process on each newly created conditional FP-tree

Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)


Major Steps to Mine FP-tree

1) Construct conditional pattern base for each node in the FP-tree

2) Construct conditional FP-tree from each conditional pattern-base

3) Recursively mine conditional FP-trees and grow the frequent patterns obtained so far

If the conditional FP-tree contains a single path, simply enumerate all the patterns


Step 1: From FP-tree to Conditional Pattern Base

Starting at the frequent header table in the FP-tree

Traverse the FP-tree by following the link of each frequent item

Accumulate all of the transformed prefix paths of that item to form a conditional pattern base

Conditional pattern bases

Item  Cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

[Figure: the FP-tree and header table of the running example]
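The conditional pattern bases of the example can be reproduced with a short Python sketch. As a simplification (an assumption of this sketch, not the traversal described above), the prefix paths are collected directly from the ordered frequent-item lists of the transactions instead of following node-links in the tree; identical prefixes merge into one tree path, so the resulting counts are the same. The function names are hypothetical.

```python
from collections import Counter

def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter of its prefix paths (written as strings)."""
    bases = {}
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            if pos > 0:                      # the first item has an empty prefix
                bases.setdefault(item, Counter())[t[:pos]] += 1
    return bases

def conditional_frequent_items(base, min_count):
    """Accumulate item counts over a pattern base and keep the frequent items."""
    counts = Counter()
    for path, n in base.items():
        for item in path:
            counts[item] += n
    return {i: c for i, c in counts.items() if c >= min_count}

# Ordered frequent items of transactions 100 to 500
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
bases = conditional_pattern_bases(ordered)
print(dict(bases["m"]))                            # {'fca': 2, 'fcab': 1}
print(dict(bases["p"]))                            # {'fcam': 2, 'cb': 1}
print(conditional_frequent_items(bases["m"], 3))   # {'f': 3, 'c': 3, 'a': 3}
```

The last line gives the items of the m-conditional FP-tree; b is dropped because it occurs only once in the pattern base of m.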


Step 2: Construct Conditional FP-tree

For each pattern-base

Accumulate the count for each item in the base

Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree:

{}
  f:3
    c:3
      a:3

[Figure: the FP-tree and header table of the running example]


Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item  Conditional pattern-base    Conditional FP-tree
p     {(fcam:2), (cb:1)}          {(c:3)}|p
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)}|m
b     {(fca:1), (f:1), (c:1)}     Empty
a     {(fc:3)}                    {(f:3, c:3)}|a
c     {(f:3)}                     {(f:3)}|c
f     Empty                       Empty


Single FP-tree Path Generation

Suppose an FP-tree T has a single path P

The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P

{}

f:3

c:3

a:3

m-conditional FP-tree

All frequent patterns concerning m

m,

fm, cm, am,

fcm, fam, cam,

fcam
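Enumerating the patterns of a single-path conditional FP-tree can be sketched as below. The function name and the (item, count) representation of the path are assumptions; the support of each pattern is taken as the smallest count among the chosen nodes and the suffix item.

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """path: list of (item, count) from root to leaf; suffix: (item, count)."""
    patterns = {}
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            items = "".join(i for i, _ in combo) + suffix[0]
            patterns[items] = min([c for _, c in combo] + [suffix[1]])
    return patterns

# m-conditional FP-tree: the single path f:3, c:3, a:3
patterns = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], ("m", 3))
print(patterns)
# {'m': 3, 'fm': 3, 'cm': 3, 'am': 3, 'fcm': 3, 'fam': 3, 'cam': 3, 'fcam': 3}
```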


Another example

Minsup = 2

Conditional FP-tree | f

Combining ef with every subset of dca (including the empty set): ef, def, cef, aef, dcef, daef, caef, dcaef. All have support equal to 2

(similarly for all the other conditional paths of f)




Benefits of the FP-tree Structure

Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection

Reasoning

No candidate generation, no candidate test

Uses a compact data structure

Eliminates repeated database scans

Basic operations are counting and FP-tree building

[Figure: run time (sec.) versus support threshold (%) for FP-growth and Apriori on dataset D1]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset

TID Items

1 ABC

2 ABCD

3 BCE

4 ACDE

5 DE

Minimum support = 2

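[Figure: itemset lattice from null to ABCDE in which every itemset is annotated with the IDs of the transactions that contain it, e.g. A: 1, 2, 4; B: 1, 2, 3; C: 1, 2, 3, 4; D: 2, 4, 5; E: 3, 4, 5]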
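The definition can be checked on the five transactions above with a brute-force Python sketch. The function name `closed_itemsets` is an assumption, and the exhaustive enumeration is only practical for tiny examples like this one.

```python
from itertools import combinations

def closed_itemsets(transactions, min_count):
    """Frequent itemsets none of whose immediate supersets has the same support."""
    items = sorted(set().union(*transactions))
    support = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            n = sum(1 for t in transactions if set(c) <= t)
            if n > 0:
                support[frozenset(c)] = n
    closed = {}
    for s, n in support.items():
        if n < min_count:
            continue
        supersets = (s | {i} for i in items if i not in s)
        if all(support.get(sup, 0) != n for sup in supersets):
            closed[s] = n
    return closed

D = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
closed = closed_itemsets(D, min_count=2)
print(sorted(("".join(sorted(s)), n) for s, n in closed.items()))
# [('ABC', 2), ('AC', 3), ('ACD', 2), ('BC', 3), ('C', 4), ('CE', 2), ('D', 3), ('DE', 2), ('E', 3)]
```

For instance, {A} is not closed because {A,C} has the same support (3), while {A,C} is closed because each of its immediate supersets has lower support.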


Drawbacks of support and confidence

Theatre => Cinema (15%, 75%)

P(Cinema) = 80% > 75%


Correlation

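We need a measure of dependent or correlated events

$$corr_{A,B} = \frac{P(A \cup B)}{P(A)\,P(B)}$$

where $P(A \cup B)$ is the probability that a transaction contains both A and B (the support of $A \cup B$)

If $corr_{A,B} = 1$: independence

If $corr_{A,B} > 1$: positive correlation

If $corr_{A,B} < 1$: negative correlation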


Example

Corr = 0.15/(0.2 * 0.8) = 0.9375 < 1
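The arithmetic of this example can be retraced in Python. The function name `corr` is an assumption; P(Theatre) is not stated directly, so it is derived here from the rule's support and confidence (0.15 / 0.75 = 0.2).

```python
def corr(p_ab, p_a, p_b):
    """Correlation measure: P(A and B together) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)

support, confidence, p_cinema = 0.15, 0.75, 0.80
p_theatre = support / confidence            # P(Theatre) = 0.2
value = corr(support, p_theatre, p_cinema)
print(round(value, 4))  # 0.9375
```

A value below 1 indicates a negative correlation, even though the rule passes typical support and confidence thresholds.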


Simpson's paradox

All customers (minors and adults together)

TV => Radio (conf = 99/180 = 55%)

No TV => Radio (conf = 54/120 = 45%)

Corr(TV, Radio) = 1.08

Positive correlation between TV and radio

Minors

TV => Radio (conf = 1/10 = 10%)

No TV => Radio (conf = 4/34 = 11.8%)

Corr(TV, Radio) = 0.88

Adults

TV => Radio (conf = 98/170 = 57.6%)

No TV => Radio (conf = 50/86 = 58.1%)

Corr(TV, Radio) = 0.997
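The reversal can be verified from the raw counts with a short Python sketch. The function name `measures` and the tuple layout (TV and radio, TV total, no TV and radio, no TV total) are assumptions; the correlation is computed as P(TV and Radio) / (P(TV) P(Radio)) within each group. Computed this way, the overall value is about 1.08 and the adults' value is about 0.997.

```python
def measures(tv_radio, tv_total, notv_radio, notv_total):
    """Return (conf(TV => Radio), conf(No TV => Radio), corr(TV, Radio))."""
    n = tv_total + notv_total
    p_radio = (tv_radio + notv_radio) / n
    conf_tv = tv_radio / tv_total
    conf_notv = notv_radio / notv_total
    corr = (tv_radio / n) / ((tv_total / n) * p_radio)
    return conf_tv, conf_notv, corr

minors = (1, 10, 4, 34)
adults = (98, 170, 50, 86)
everyone = tuple(m + a for m, a in zip(minors, adults))   # (99, 180, 54, 120)

for name, group in [("all", everyone), ("minors", minors), ("adults", adults)]:
    print(name, [round(x, 3) for x in measures(*group)])
# all [0.55, 0.45, 1.078]
# minors [0.1, 0.118, 0.88]
# adults [0.576, 0.581, 0.997]
```

Over all customers, owning a TV appears to raise the chance of owning a radio, yet within each age group the effect is slightly reversed.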

