Frequent itemset mining and temporal extensions
Sunita Sarawagi ([email protected])
http://www.it.iitb.ac.in/~sunita
Association rules

Given several sets of items, for example:
- Set of items purchased
- Set of pages visited on a website
- Set of doctors visited

Find all rules that correlate the presence of one set of items with another. Rules are of the form X → Y, where X and Y are sets of items. E.g., purchase of books A & B → purchase of C.
Parameters: Support and Confidence

Every rule X → Z has two parameters:
- Support: the probability that a transaction contains both X and Z
- Confidence: the conditional probability that a transaction containing X also contains Z

Association rule mining takes two thresholds:
- Minimum support s
- Minimum confidence c
    Transaction ID   Items bought
    2000             A, B, C
    1000             A, C
    4000             A, D
    5000             B, E, F

With s = 50% and c = 50%:
- A → C (support 50%, confidence 66.6%)
- C → A (support 50%, confidence 100%)
Applications of fast itemset counting
- Cross-selling in retail and banking
- Catalog design and store layout
- Applications in medicine: finding redundant tests
- Improving the predictive capability of classifiers that assume attribute independence
- Improved clustering of categorical attributes
Finding association rules in large databases
- Number of transactions: in the millions
- Number of distinct items: tens of thousands
- Lots of work on scalable algorithms

Typically the algorithm has two parts:
1. Find all frequent itemsets with support > s
2. Find all rules with confidence greater than c

The frequent itemset search is the more expensive step: Apriori algorithm, FP-tree algorithm.
The Apriori Algorithm

    L1 = {frequent itemsets of size one};
    for (k = 1; Lk ≠ ∅; k++)
        Ck+1 = candidates generated from Lk by:
            • joining Lk with itself
            • pruning any (k+1)-itemset that has a k-subset not in Lk
        for each transaction t in the database:
            • increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with at least min_support
    return ∪k Lk;
How to Generate Candidates?

Suppose the items in Lk-1 are listed in some fixed order.

Step 1: self-join Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
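The loop on the previous slide and the join/prune step above translate almost line for line into code. Below is a minimal Python sketch (illustrative names, not any library's API); it represents itemsets as frozensets so the prune and counting tests become simple subset checks.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {itemset: support count} for all frequent itemsets.

        transactions: iterable of item collections; min_support: a fraction."""
        required = min_support * len(transactions)
        # L1: count single items
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= required}
        frequent = dict(L)
        k = 1
        while L:
            # Join step: unions of two k-itemsets that differ in one item;
            # prune step: every k-subset of a candidate must be frequent.
            candidates = set()
            for p in L:
                for q in L:
                    union = p | q
                    if len(union) == k + 1 and all(
                            frozenset(s) in L for s in combinations(union, k)):
                        candidates.add(union)
            # One database pass to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                items = set(t)
                for c in candidates:
                    if c <= items:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= required}
            frequent.update(L)
            k += 1
        return frequent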
The Apriori Algorithm — Example

Database D (minimum support count 2, i.e. 50% of 4 transactions):

    TID   Items
    100   1 2 4
    200   2 3 5
    300   1 2 3 5
    400   2 5

Scan D → C1:

    itemset   sup.
    {1}       2
    {2}       4
    {3}       2
    {4}       1
    {5}       3

L1 ({4} dropped, support 1 < 2):

    itemset   sup.
    {1}       2
    {2}       4
    {3}       2
    {5}       3

C2 = all pairs from L1; scan D to count:

    itemset   sup
    {1 2}     2
    {1 3}     1
    {1 5}     1
    {2 3}     2
    {2 5}     3
    {3 5}     2

L2:

    itemset   sup
    {1 2}     2
    {2 3}     2
    {2 5}     3
    {3 5}     2

C3 = {2 3 5} (candidates {1 2 3} and {1 2 5} are pruned because {1 3} and {1 5} are not in L2); scan D:

L3:

    itemset   sup
    {2 3 5}   2
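Running the sketch above on this database reproduces the trace:

    D = [[1, 2, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    apriori(D, 0.5)
    # {1}:2 {2}:4 {3}:2 {5}:3            (L1)
    # {1 2}:2 {2 3}:2 {2 5}:3 {3 5}:2    (L2)
    # {2 3 5}:2                          (L3)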
Improvements to Apriori

Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case). Many enhancements have been proposed:
- Sampling: count in two passes
- Invert the database to column-major instead of row-major layout and count by intersecting item columns
- Count itemsets of multiple lengths in one pass

Reducing the number of passes is not that useful, since I/O is not the bottleneck. The main bottleneck is candidate generation and counting, which is not optimized for long itemsets.
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining
- Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation
Construct FP-tree from the Database

min_support = 0.5 (count ≥ 3 of 5 transactions):

    TID   Items bought                (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
    300   {b, f, h, j, o}             {f, b}
    400   {b, c, k, s, p}             {c, b, p}
    500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:

    {}
    ├─ f:4
    │   ├─ c:3
    │   │   └─ a:3
    │   │       ├─ m:2
    │   │       │   └─ p:2
    │   │       └─ b:1
    │   │           └─ m:1
    │   └─ b:1
    └─ c:1
        └─ b:1
            └─ p:1

Steps:
1. Scan the DB once, find the frequent 1-itemsets
2. Order frequent items by decreasing frequency
3. Scan the DB again, constructing the FP-tree
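A minimal sketch of this two-pass construction in Python (class and function names are illustrative, not from an FP-growth library):

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}            # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: frequent items, ordered by decreasing frequency
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        order = sorted(freq, key=lambda i: (-freq[i], str(i)))
        rank = {item: r for r, item in enumerate(order)}

        # Pass 2: insert each transaction's ordered frequent items,
        # sharing prefixes and recording node links in a header table
        root = FPNode(None, None)
        header = defaultdict(list)        # item -> list of FPNodes
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header, order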
Step 1: FP-tree to Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node links of each frequent item
- Accumulate all of that item's transformed prefix paths to form its conditional pattern base

Conditional pattern bases (for the FP-tree and header table above):

    item   cond. pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
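In code, collecting a conditional pattern base is a walk from each of the item's nodes up to the root. A sketch building on build_fp_tree above:

    def conditional_pattern_base(item, header):
        """List of (prefix path, count) pairs for one item."""
        base = []
        for node in header[item]:
            path, up = [], node.parent
            while up is not None and up.item is not None:
                path.append(up.item)
                up = up.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base

For m this yields ([f, c, a], 2) and ([f, c, a, b], 1), matching the table above.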
Step 2: Construct Conditional FP-tree

For each pattern base:
- Accumulate the count for each item in the base
- Construct an FP-tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b dropped, count 1 < 3):

    {}
    └─ f:3
        └─ c:3
            └─ a:3

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern Bases

    Item   Conditional pattern base      Conditional FP-tree
    f      Empty                         Empty
    c      {(f:3)}                       {(f:3)} | c
    a      {(fc:3)}                      {(f:3, c:3)} | a
    b      {(fca:1), (f:1), (c:1)}       Empty
    m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
    p      {(fcam:2), (cb:1)}            {(c:3)} | p

Repeat this recursively for higher items…
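This recursion is short when written over the two helpers sketched earlier. Expanding each prefix path by its count is done for clarity, not efficiency; again an illustrative sketch, not a library implementation.

    def fp_growth(transactions, min_count, suffix=frozenset()):
        """Return {itemset: support count} for all frequent itemsets."""
        root, header, order = build_fp_tree(transactions, min_count)
        patterns = {}
        for item in reversed(order):          # least frequent item first
            support = sum(n.count for n in header[item])
            new_suffix = suffix | {item}
            patterns[new_suffix] = support
            # The conditional pattern base becomes the database of the
            # subproblem, each prefix path repeated `count` times.
            cond_db = [path
                       for path, count in conditional_pattern_base(item, header)
                       for _ in range(count)]
            patterns.update(fp_growth(cond_db, min_count, new_suffix))
        return patterns

On the five-transaction example with min_count = 3, the patterns containing m come out exactly as on the previous slide: m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.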
FP-growth vs. Apriori: Scalability with the Support Threshold

[Chart: run time (sec., 0-100) against support threshold (%, 0-3) for D1 FP-growth and D1 Apriori on data set T25I20D10K.]
Criticism of Support and Confidence

X and Y are positively correlated and X and Z are negatively correlated, yet the support and confidence of X → Z dominate. We need to measure departure from expectation. For two items:

    P(A ∧ B) / (P(A) · P(B))

For k items, the expected support is derived from the support of the (k-1)-itemsets using iterative scaling methods.

    X   1 1 1 1 0 0 0 0
    Y   1 1 0 0 0 0 0 0
    Z   0 1 1 1 1 1 1 1

    Rule    Support   Confidence
    X → Y   25%       50%
    X → Z   37.5%     75%
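Plugging the table in makes the reversal concrete (P(X) = 4/8, P(Y) = 2/8, P(Z) = 7/8 from the rows above):

    P(X ∧ Y) / (P(X) · P(Y)) = 0.25  / (0.5 × 0.25)  = 2.0   > 1   (positive correlation)
    P(X ∧ Z) / (P(X) · P(Z)) = 0.375 / (0.5 × 0.875) ≈ 0.86  < 1   (negative correlation)

So the ratio ranks X → Y above X → Z even though X → Z wins on both support and confidence.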
Prevalent correlations are not interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena

1995: "bedsheets and pillow covers sell together!"
1998: "bedsheets and pillow covers sell together!" … zzzz.
What makes a rule surprising?
- It does not match prior expectation (e.g., the correlation between milk and cereal remaining roughly constant over time is no surprise)
- It cannot be trivially derived from simpler rules:
  - Milk 10%, cereal 10%, milk and cereal 10% … surprising (independence would predict 1%)
  - Eggs 10%; milk, cereal and eggs 0.1% … surprising! (expected 1%)
Finding surprising temporal patterns

Algorithms to mine for surprising patterns:
- Encode itemsets into bit streams using two models:
  - Mopt: the optimal model, which allows change along time
  - Mcons: the constrained model, which does not allow change along time
- Surprise = the difference in the number of bits between Mcons and Mopt
One item: optimal model
- Milk-buying habits are modeled by a biased coin: the customer tosses this coin to decide whether to buy milk
- Head or "1" denotes "basket contains milk"; the coin's bias is Pr[milk]
- The analyst wants to study Pr[milk] along time: a single coin with fixed bias is not interesting, but changes in the bias are
The coin segmentation problem
- Players A and B
- A has a set of coins with different biases
- A repeatedly picks an arbitrary coin and tosses it an arbitrary number of times
- B observes the heads/tails sequence, and guesses the transition points and biases

Observed sequence:

    0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
How to explain the data

Given n head/tail observations:
- We can assume n different coins, each with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed
- Or we can assume one coin: it may fit the data poorly
- The "best explanation" is a compromise:

    0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
      1/4          5/7            1/3
Coding examples
- A sequence of k zeroes: naïve encoding takes k bits; run-length encoding takes about log k bits
- 1000 bits with 10 randomly placed 1's and the rest 0's: posit a coin with bias 0.01. By Shannon's theorem, the data encoding cost is

      -10 log₂ 0.01 - 990 log₂ 0.99 ≈ 81 bits « 1000 bits

  (note that 10 log₂ 100 ≈ 66 of those bits come from the ten 1's)
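A quick numeric check of this cost, assuming the biased-coin model above (k ones among n tosses, coin bias p):

    import math

    def coding_cost_bits(k, n, p):
        """Shannon cost in bits of k ones and n-k zeroes under bias p."""
        return -k * math.log2(p) - (n - k) * math.log2(1 - p)

    print(coding_cost_bits(10, 1000, 0.01))   # ~80.8 bits, versus 1000 raw bits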
How to find optimal segments

Sequence of 17 tosses:

    0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Build a derived graph with 18 nodes, one per toss boundary, and an edge for every candidate segment:
- Edge cost = model cost + data cost
- Model cost = one node ID + one Pr[head]
- Data cost, e.g. for a segment with Pr[head] = 5/7: the cost of encoding its 5 heads and 2 tails

The optimal segmentation is the shortest path in this graph.
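Since the graph is a DAG over positions 0…n, the shortest path is a simple dynamic program. A sketch under stated assumptions: MODEL_COST is an illustrative stand-in for the node-ID-plus-Pr[head] encoding cost, and each segment is priced at its own empirical bias.

    import math

    MODEL_COST = 10.0                     # illustrative per-segment model cost

    def data_cost(tosses):
        k, n = sum(tosses), len(tosses)
        if k == 0 or k == n:              # degenerate bias encodes data for free
            return 0.0
        p = k / n
        return -k * math.log2(p) - (n - k) * math.log2(1 - p)

    def segment(tosses):
        """Return (total bits, segment boundaries) of the optimal segmentation."""
        n = len(tosses)
        best = [0.0] + [math.inf] * n     # best[j]: cheapest encoding of tosses[:j]
        cut = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                cost = best[i] + MODEL_COST + data_cost(tosses[i:j])
                if cost < best[j]:
                    best[j], cut[j] = cost, i
        bounds, j = [], n                 # walk the cut pointers back
        while j > 0:
            bounds.append((cut[j], j))
            j = cut[j]
        return best[n], list(reversed(bounds))

With this in hand, the surprise measure defined earlier is just the one-segment (Mcons) cost, MODEL_COST + data_cost(tosses), minus the optimal (Mopt) cost best[n].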
Two or more items
- "Unconstrained" segmentation: k items induce a 2^k-sided coin; for two items the faces are 00, 01, 10, 11, e.g. "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
- The shortest path finds a significant shift in any of the coin-face probabilities
- Problem: some of these shifts may be completely explained by the marginals
Example

[Chart (θ = 2): support against time (0-10) for Milk, Cereal, and Both; support axis 0-0.4.]

- The drop in the joint sale of milk and cereal is completely explained by the drop in the sale of milk
- Pr[milk & cereal] / (Pr[milk] · Pr[cereal]) remains constant over time
- Call this ratio θ
Constant-θ segmentation
- Compute the global θ over all time; all coins must share this common value of θ
- Segment as before
- Compare with the unconstrained coding cost

    θ = p11 / ((p11 + p10) · (p11 + p01))

The numerator is the observed support; the denominator is the support expected under independence of the marginals.
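As a one-line sketch, θ is computed directly from the 2×2 cell probabilities of the two items (p11 = both, p10 and p01 = one without the other):

    def theta(p11, p10, p01):
        """Observed support over the support expected under independence."""
        return p11 / ((p11 + p10) * (p11 + p01))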
Is all this really needed? A simpler alternative:
- Aggregate the data into suitable time windows
- Compute support, correlation, θ, etc. in each window
- Use a variance threshold to choose itemsets

Pitfalls:
- Arbitrary choices of windows and thresholds
- May miss fine detail
- Over-sensitive to outliers
Experiments
- Millions of baskets over several years
- Two algorithms:
  - the complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Data set: 2.8 million transactions, 7 years (1987 to 1993), 15,800 items, an average of 2.62 items per basket
Little agreement in itemset ranks

[Scatter plots: Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat), both axes 0-1600, showing little agreement between the rankings.]
Simpler methods do not approximate MDL
MDL has high selectivity

[Histograms of itemset score frequencies: MDL scores spread from about -2000 to 6000; MStat scores bunch between 0 and 15.]

The scores of the best itemsets stand out from the rest under MDL.
Three anecdotes

[Charts of support against time for three itemsets:]
- Polo shirt & shorts: high MStat score, small marginals
- Bedsheets & pillow cases: high correlation, small % variation
- Men's & women's shorts: high MDL score, significant gradual drift
Conclusion
- A new notion of surprising patterns based on:
  - the joint support expected from the marginals
  - the variation of joint support along time
- A robust MDL formulation
- Efficient algorithms:
  - near-optimal segmentation using shortest paths
  - pruning criteria
- Successful application to real data
References
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
- S. Chakrabarti, S. Sarawagi and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
- J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.)
- H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.