Frequent itemset mining and temporal extensions
Sunita Sarawagi ([email protected])
http://www.it.iitb.ac.in/~sunita
Association rules

Given several sets of items, for example:
- Set of items purchased
- Set of pages visited on a website
- Set of doctors visited

Find all rules that correlate the presence of one set of items with another. Rules are of the form X → Y, where X and Y are sets of items. E.g., purchase of books A & B → purchase of C.
Parameters: Support and Confidence

Every rule X → Z has two parameters:
- Support: the probability that a transaction contains both X and Z
- Confidence: the conditional probability that a transaction containing X also contains Z

Association rule mining takes two thresholds:
- Minimum support s
- Minimum confidence c
    Transaction ID   Items bought
    2000             A, B, C
    1000             A, C
    4000             A, D
    5000             B, E, F

With s = 50% and c = 50%:
- A → C (support 50%, confidence 66.6%)
- C → A (support 50%, confidence 100%)
Applications of fast itemset counting
- Cross-selling in retail and banking
- Catalog design and store layout
- Applications in medicine: finding redundant tests
- Improving the predictive capability of classifiers that assume attribute independence
- Improved clustering of categorical attributes
Finding association rules in large databases
- Number of transactions: in the millions
- Number of distinct items: tens of thousands
- Lots of work on scalable algorithms

Typically the algorithm has two parts:
1. Find all frequent itemsets with support > s
2. Find all rules with confidence greater than c

The frequent itemset search is the more expensive step: Apriori algorithm, FP-tree algorithm.
The Apriori Algorithm

    L1 = {frequent itemsets of size one};
    for (k = 1; Lk ≠ ∅; k++)
        Ck+1 = candidates generated from Lk by:
            • joining Lk with itself
            • pruning any (k+1)-itemset that has a k-subset not in Lk
        for each transaction t in the database:
            • increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with at least min_support
    return ∪k Lk;
How to Generate Candidates?

Suppose the items in Lk-1 are listed in some fixed order.

Step 1: self-join Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
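The loop on the previous slide and the join/prune step above translate almost line for line into code. Below is a minimal Python sketch (illustrative names, not any library's API); it represents itemsets as frozensets so the prune and counting tests become simple subset checks.

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {itemset: support count} for all frequent itemsets.

        transactions: iterable of item collections; min_support: a fraction."""
        required = min_support * len(transactions)
        # L1: count single items
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= required}
        frequent = dict(L)
        k = 1
        while L:
            # Join step: unions of two k-itemsets that differ in one item;
            # prune step: every k-subset of a candidate must be frequent.
            candidates = set()
            for p in L:
                for q in L:
                    union = p | q
                    if len(union) == k + 1 and all(
                            frozenset(s) in L for s in combinations(union, k)):
                        candidates.add(union)
            # One database pass to count the surviving candidates
            counts = {c: 0 for c in candidates}
            for t in transactions:
                items = set(t)
                for c in candidates:
                    if c <= items:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= required}
            frequent.update(L)
            k += 1
        return frequent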
The Apriori Algorithm — Example

Database D (minimum support count 2, i.e. 50% of 4 transactions):

    TID   Items
    100   1 2 4
    200   2 3 5
    300   1 2 3 5
    400   2 5

Scan D → C1:

    itemset   sup.
    {1}       2
    {2}       4
    {3}       2
    {4}       1
    {5}       3

L1 ({4} dropped, support 1 < 2):

    itemset   sup.
    {1}       2
    {2}       4
    {3}       2
    {5}       3

C2 = all pairs from L1; scan D to count:

    itemset   sup
    {1 2}     2
    {1 3}     1
    {1 5}     1
    {2 3}     2
    {2 5}     3
    {3 5}     2

L2:

    itemset   sup
    {1 2}     2
    {2 3}     2
    {2 5}     3
    {3 5}     2

C3 = {2 3 5} (candidates {1 2 3} and {1 2 5} are pruned because {1 3} and {1 5} are not in L2); scan D:

L3:

    itemset   sup
    {2 3 5}   2
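Running the sketch above on this database reproduces the trace:

    D = [[1, 2, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
    apriori(D, 0.5)
    # {1}:2 {2}:4 {3}:2 {5}:3            (L1)
    # {1 2}:2 {2 3}:2 {2 5}:3 {3 5}:2    (L2)
    # {2 3 5}:2                          (L3)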
Improvements to Apriori

Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case). Many enhancements have been proposed:
- Sampling: count in two passes
- Invert the database to column-major instead of row-major layout and count by intersecting item columns
- Count itemsets of multiple lengths in one pass

Reducing the number of passes is not that useful, since I/O is not the bottleneck. The main bottleneck is candidate generation and counting, which is not optimized for long itemsets.
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining
- Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation
Construct FP-tree from the Database

min_support = 0.5 (count ≥ 3 of 5 transactions):

    TID   Items bought                (ordered) frequent items
    100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
    200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
    300   {b, f, h, j, o}             {f, b}
    400   {b, c, k, s, p}             {c, b, p}
    500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:

    {}
    ├─ f:4
    │   ├─ c:3
    │   │   └─ a:3
    │   │       ├─ m:2
    │   │       │   └─ p:2
    │   │       └─ b:1
    │   │           └─ m:1
    │   └─ b:1
    └─ c:1
        └─ b:1
            └─ p:1

Steps:
1. Scan the DB once, find the frequent 1-itemsets
2. Order frequent items by decreasing frequency
3. Scan the DB again, constructing the FP-tree
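A minimal sketch of this two-pass construction in Python (class and function names are illustrative, not from an FP-growth library):

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}            # item -> FPNode

    def build_fp_tree(transactions, min_count):
        # Pass 1: frequent items, ordered by decreasing frequency
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: c for i, c in freq.items() if c >= min_count}
        order = sorted(freq, key=lambda i: (-freq[i], str(i)))
        rank = {item: r for r, item in enumerate(order)}

        # Pass 2: insert each transaction's ordered frequent items,
        # sharing prefixes and recording node links in a header table
        root = FPNode(None, None)
        header = defaultdict(list)        # item -> list of FPNodes
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header, order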
Step 1: FP-tree to Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node links of each frequent item
- Accumulate all of that item's transformed prefix paths to form its conditional pattern base

Conditional pattern bases (for the FP-tree and header table above):

    item   cond. pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
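In code, collecting a conditional pattern base is a walk from each of the item's nodes up to the root. A sketch building on build_fp_tree above:

    def conditional_pattern_base(item, header):
        """List of (prefix path, count) pairs for one item."""
        base = []
        for node in header[item]:
            path, up = [], node.parent
            while up is not None and up.item is not None:
                path.append(up.item)
                up = up.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base

For m this yields ([f, c, a], 2) and ([f, c, a, b], 1), matching the table above.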
Step 2: Construct Conditional FP-tree

For each pattern base:
- Accumulate the count for each item in the base
- Construct an FP-tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b dropped, count 1 < 3):

    {}
    └─ f:3
        └─ c:3
            └─ a:3

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern Bases

    Item   Conditional pattern base      Conditional FP-tree
    f      Empty                         Empty
    c      {(f:3)}                       {(f:3)} | c
    a      {(fc:3)}                      {(f:3, c:3)} | a
    b      {(fca:1), (f:1), (c:1)}       Empty
    m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
    p      {(fcam:2), (cb:1)}            {(c:3)} | p

Repeat this recursively for higher items…
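This recursion is short when written over the two helpers sketched earlier. Expanding each prefix path by its count is done for clarity, not efficiency; again an illustrative sketch, not a library implementation.

    def fp_growth(transactions, min_count, suffix=frozenset()):
        """Return {itemset: support count} for all frequent itemsets."""
        root, header, order = build_fp_tree(transactions, min_count)
        patterns = {}
        for item in reversed(order):          # least frequent item first
            support = sum(n.count for n in header[item])
            new_suffix = suffix | {item}
            patterns[new_suffix] = support
            # The conditional pattern base becomes the database of the
            # subproblem, each prefix path repeated `count` times.
            cond_db = [path
                       for path, count in conditional_pattern_base(item, header)
                       for _ in range(count)]
            patterns.update(fp_growth(cond_db, min_count, new_suffix))
        return patterns

On the five-transaction example with min_count = 3, the patterns containing m come out exactly as on the previous slide: m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.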
FP-growth vs. Apriori: Scalability with the Support Threshold

[Chart: run time (sec., 0-100) against support threshold (%, 0-3) for D1 FP-growth and D1 Apriori on data set T25I20D10K.]
Criticism of Support and Confidence

X and Y are positively correlated and X and Z are negatively correlated, yet the support and confidence of X → Z dominate. We need to measure departure from expectation. For two items:

    P(A ∧ B) / (P(A) · P(B))

For k items, the expected support is derived from the support of the (k-1)-itemsets using iterative scaling methods.

    X   1 1 1 1 0 0 0 0
    Y   1 1 0 0 0 0 0 0
    Z   0 1 1 1 1 1 1 1

    Rule    Support   Confidence
    X → Y   25%       50%
    X → Z   37.5%     75%
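Plugging the table in makes the reversal concrete (P(X) = 4/8, P(Y) = 2/8, P(Z) = 7/8 from the rows above):

    P(X ∧ Y) / (P(X) · P(Y)) = 0.25  / (0.5 × 0.25)  = 2.0   > 1   (positive correlation)
    P(X ∧ Z) / (P(X) · P(Z)) = 0.375 / (0.5 × 0.875) ≈ 0.86  < 1   (negative correlation)

So the ratio ranks X → Y above X → Z even though X → Z wins on both support and confidence.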
Prevalent correlations are not interesting
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena

1995: "bedsheets and pillow covers sell together!"
1998: "bedsheets and pillow covers sell together!" … zzzz.
What makes a rule surprising?
- It does not match prior expectation (e.g., the correlation between milk and cereal remaining roughly constant over time is no surprise)
- It cannot be trivially derived from simpler rules:
  - Milk 10%, cereal 10%, milk and cereal 10% … surprising (independence would predict 1%)
  - Eggs 10%; milk, cereal and eggs 0.1% … surprising! (expected 1%)
Finding surprising temporal patterns

Algorithms to mine for surprising patterns:
- Encode itemsets into bit streams using two models:
  - Mopt: the optimal model, which allows change along time
  - Mcons: the constrained model, which does not allow change along time
- Surprise = the difference in the number of bits between Mcons and Mopt
One item: optimal model
- Milk-buying habits are modeled by a biased coin: the customer tosses this coin to decide whether to buy milk
- Head or "1" denotes "basket contains milk"; the coin's bias is Pr[milk]
- The analyst wants to study Pr[milk] along time: a single coin with fixed bias is not interesting, but changes in the bias are
The coin segmentation problem
- Players A and B
- A has a set of coins with different biases
- A repeatedly picks an arbitrary coin and tosses it an arbitrary number of times
- B observes the heads/tails sequence, and guesses the transition points and biases

Observed sequence:

    0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
How to explain the data

Given n head/tail observations:
- We can assume n different coins, each with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed
- Or we can assume one coin: it may fit the data poorly
- The "best explanation" is a compromise:

    0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
      1/4          5/7            1/3
Coding examples
- A sequence of k zeroes: naïve encoding takes k bits; run-length encoding takes about log k bits
- 1000 bits with 10 randomly placed 1's and the rest 0's: posit a coin with bias 0.01. By Shannon's theorem, the data encoding cost is

      -10 log₂ 0.01 - 990 log₂ 0.99 ≈ 81 bits « 1000 bits

  (note that 10 log₂ 100 ≈ 66 of those bits come from the ten 1's)
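A quick numeric check of this cost, assuming the biased-coin model above (k ones among n tosses, coin bias p):

    import math

    def coding_cost_bits(k, n, p):
        """Shannon cost in bits of k ones and n-k zeroes under bias p."""
        return -k * math.log2(p) - (n - k) * math.log2(1 - p)

    print(coding_cost_bits(10, 1000, 0.01))   # ~80.8 bits, versus 1000 raw bits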
How to find optimal segments

Sequence of 17 tosses:

    0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Build a derived graph with 18 nodes, one per toss boundary, and an edge for every candidate segment:
- Edge cost = model cost + data cost
- Model cost = one node ID + one Pr[head]
- Data cost, e.g. for a segment with Pr[head] = 5/7: the cost of encoding its 5 heads and 2 tails

The optimal segmentation is the shortest path in this graph.
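Since the graph is a DAG over positions 0…n, the shortest path is a simple dynamic program. A sketch under stated assumptions: MODEL_COST is an illustrative stand-in for the node-ID-plus-Pr[head] encoding cost, and each segment is priced at its own empirical bias.

    import math

    MODEL_COST = 10.0                     # illustrative per-segment model cost

    def data_cost(tosses):
        k, n = sum(tosses), len(tosses)
        if k == 0 or k == n:              # degenerate bias encodes data for free
            return 0.0
        p = k / n
        return -k * math.log2(p) - (n - k) * math.log2(1 - p)

    def segment(tosses):
        """Return (total bits, segment boundaries) of the optimal segmentation."""
        n = len(tosses)
        best = [0.0] + [math.inf] * n     # best[j]: cheapest encoding of tosses[:j]
        cut = [0] * (n + 1)
        for j in range(1, n + 1):
            for i in range(j):
                cost = best[i] + MODEL_COST + data_cost(tosses[i:j])
                if cost < best[j]:
                    best[j], cut[j] = cost, i
        bounds, j = [], n                 # walk the cut pointers back
        while j > 0:
            bounds.append((cut[j], j))
            j = cut[j]
        return best[n], list(reversed(bounds))

With this in hand, the surprise measure defined earlier is just the one-segment (Mcons) cost, MODEL_COST + data_cost(tosses), minus the optimal (Mopt) cost best[n].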
Two or more items
- "Unconstrained" segmentation: k items induce a 2^k-sided coin; for two items the faces are 00, 01, 10, 11, e.g. "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
- The shortest path finds a significant shift in any of the coin-face probabilities
- Problem: some of these shifts may be completely explained by the marginals
Example

[Chart (θ = 2): support against time (0-10) for Milk, Cereal, and Both; support axis 0-0.4.]

- The drop in the joint sale of milk and cereal is completely explained by the drop in the sale of milk
- Pr[milk & cereal] / (Pr[milk] · Pr[cereal]) remains constant over time
- Call this ratio θ
Constant-θ segmentation
- Compute the global θ over all time; all coins must share this common value of θ
- Segment as before
- Compare with the unconstrained coding cost

    θ = p11 / ((p11 + p10) · (p11 + p01))

The numerator is the observed support; the denominator is the support expected under independence of the marginals.
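As a one-line sketch, θ is computed directly from the 2×2 cell probabilities of the two items (p11 = both, p10 and p01 = one without the other):

    def theta(p11, p10, p01):
        """Observed support over the support expected under independence."""
        return p11 / ((p11 + p10) * (p11 + p01))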
Is all this really needed? A simpler alternative:
- Aggregate the data into suitable time windows
- Compute support, correlation, θ, etc. in each window
- Use a variance threshold to choose itemsets

Pitfalls:
- Arbitrary choices of windows and thresholds
- May miss fine detail
- Over-sensitive to outliers
Experiments
- Millions of baskets over several years
- Two algorithms:
  - the complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Data set: 2.8 million transactions, 7 years (1987 to 1993), 15,800 items, an average of 2.62 items per basket
Little agreement in itemset ranks

[Scatter plots: Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat), both axes 0-1600, showing little agreement between the rankings.]
Simpler methods do not approximate MDL
MDL has high selectivity

[Histograms of itemset score frequencies: MDL scores spread from about -2000 to 6000; MStat scores bunch between 0 and 15.]

The scores of the best itemsets stand out from the rest under MDL.
Three anecdotes

[Charts of support against time for three itemsets:]
- Polo shirt & shorts: high MStat score, small marginals
- Bedsheets & pillow cases: high correlation, small % variation
- Men's & women's shorts: high MDL score, significant gradual drift
Conclusion
- A new notion of surprising patterns based on:
  - the joint support expected from the marginals
  - the variation of joint support along time
- A robust MDL formulation
- Efficient algorithms:
  - near-optimal segmentation using shortest paths
  - pruning criteria
- Successful application to real data
References
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
- S. Chakrabarti, S. Sarawagi and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
- J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.)
- H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.