

NGDM’02

Efficient Data-Reduction Methods for On-line Association Rule Mining

H. Bronnimann (Polytechnic Univ), B. Chen (Exilixis), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University), P. Haas (IBM Almaden)

[email protected] [email protected] [email protected] {manoranj,yiqiao,peters}@ece.nwu.edu


Motivation

Volume of data in warehouses and on the Internet is growing faster than Moore’s Law

Scalability is a major concern
“Classical” algorithms require one or more scans of the database
One solution: execute the algorithm on a sample

Need to adapt to streaming data
Data elements arrive on-line
Limited amount of memory
Lossy compressed synopses (sketches) of data


Motivation

Sampling Methods
Advantage: can explicitly trade off accuracy and speed
Work best when tailored to the application

Our Contributions
Sampling methods for count datasets
Base set of items; each data element is a vector of item counts
Application: association rule mining


Outline of the Presentation

Motivation
FAST
Epsilon Approximation
Experimental Results
Data Stream Reduction
Conclusion


The Problem

Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S:

$\min_{S_0 \subseteq S,\ |S_0| = n} Dist(S_0, S)$

NP-Complete: One-In-Three SAT Problem

I1(T) = set of all 1-itemsets in transaction set T

L1(T) = set of frequent 1-itemsets in transaction set T

f(A;T) = support of itemset A in transaction set T

$Dist_\infty(S_0, S) = \max_{A \in I_1(S)} |f(A;S_0) - f(A;S)|$

$Dist_2(S_0, S) = \sum_{A \in I_1(S)} (f(A;S_0) - f(A;S))^2$

$Dist_1(S_0, S) = \frac{|L_1(S_0) - L_1(S)| + |L_1(S) - L_1(S_0)|}{|L_1(S_0) \cup L_1(S)|}$
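The support-based distances can be illustrated with a minimal Python sketch. It assumes each transaction is a `frozenset` of items and covers only the max-difference and sum-of-squares distances; the function names `supports`, `dist_inf` and `dist_2` are hypothetical and not taken from the slides.

```python
from collections import Counter


def supports(transactions):
    """Map each item to the fraction of transactions that contain it."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter(item for t in transactions for item in t)
    return {item: c / n for item, c in counts.items()}


def dist_inf(s0, s):
    """Largest absolute support difference over the 1-itemsets of S."""
    f0, f = supports(s0), supports(s)
    return max(abs(f0.get(a, 0.0) - fa) for a, fa in f.items())


def dist_2(s0, s):
    """Sum of squared support differences over the 1-itemsets of S."""
    f0, f = supports(s0), supports(s)
    return sum((f0.get(a, 0.0) - fa) ** 2 for a, fa in f.items())


# Toy data: S has 4 transactions, S0 keeps the first and the last.
S = [frozenset("ab"), frozenset("a"), frozenset("bc"), frozenset("ac")]
S0 = [S[0], S[3]]
print(dist_inf(S0, S))  # 0.25   (item a: support 1.0 in S0 vs 0.75 in S)
print(dist_2(S0, S))    # 0.0625
```

Items b and c have the same support (0.5) in both sets, so only item a contributes to either distance in this example.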


FAST-trim

FAST-trim Outline
Given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:

1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.


FAST-trim

FAST-trim Algorithm
Uses input parameter k to explicitly trade off speed and accuracy

Trimming Phase

while (|S0| > n) {
  divide S0 into disjoint groups of min(k,|S0|) transactions each;
  for each group G {
    compute f(A;S0) for each item A;
    set S0 = S0 - {t*}, where Dist(S0 - {t*}, S) = min over t in G of Dist(S0 - {t}, S)
  }
}

Note: Removal of outlier t* causes maximum decrease or minimum increase in Dist(S0,S)
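The trimming phase can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: transactions are `frozenset`s, S0 starts as a copy of S, the distance is the sum-of-squares variant, and the names `fast_trim` and `dist_2` are hypothetical.

```python
from collections import Counter


def dist_2(counts0, n0, counts, n):
    """Sum of squared support differences, computed from raw item counts."""
    return sum((counts0.get(a, 0) / n0 - c / n) ** 2 for a, c in counts.items())


def fast_trim(s, n, k):
    """Trim sample s (a list of frozensets) down to n transactions."""
    if n < 1:
        raise ValueError("n must be at least 1")
    target = Counter(a for t in s for a in t)  # item counts in S
    s0 = list(s)                               # assumption: S0 starts as S
    counts0 = Counter(target)                  # item counts in S0
    while len(s0) > n:
        size = min(k, len(s0))
        groups = [s0[i:i + size] for i in range(0, len(s0), size)]
        for group in groups:
            if len(s0) <= n:
                break
            # t* = transaction in G whose removal leaves S0 closest to S
            t_star = min(
                group,
                key=lambda t: dist_2(counts0 - Counter(t), len(s0) - 1,
                                     target, len(s)),
            )
            s0.remove(t_star)
            counts0 = counts0 - Counter(t_star)
    return s0


fa, fb = frozenset("a"), frozenset("b")
print(fast_trim([fa, fb, fa, fb], n=2, k=4))  # keeps one {a} and one {b}
```

A larger k examines more candidates per removal (closer to a greedy optimum, slower); a smaller k removes one transaction per small group (faster, coarser), which is the speed/accuracy trade-off the slide refers to.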


FAST-grow

FAST-grow Algorithm
Select representative transactions from S and add them to the sample S0, which is initially empty

Growing Phase

while (|S0| < n) {
  divide S - S0 into disjoint groups of min(k,|S - S0|) transactions each;
  for each group G {
    compute f(A;S0) for each item A;
    set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min over t in G of Dist(S0 ∪ {t}, S)
  }
}
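A minimal Python sketch of the growing phase, under the assumptions that S0 starts empty, that the loop runs while |S0| < n, and that candidates are drawn from the transactions of S not yet selected. The names `fast_grow` and `dist_2` are hypothetical, and the sum-of-squares distance is used only for illustration.

```python
from collections import Counter


def dist_2(counts0, n0, counts, n):
    """Sum of squared support differences, computed from raw item counts."""
    return sum((counts0.get(a, 0) / n0 - c / n) ** 2 for a, c in counts.items())


def fast_grow(s, n, k):
    """Grow a sample of up to n transactions from s (a list of frozensets)."""
    target = Counter(a for t in s for a in t)  # item counts in S
    remaining = list(s)                        # transactions not yet in S0
    s0, counts0 = [], Counter()
    while len(s0) < n and remaining:
        size = min(k, len(remaining))
        groups = [remaining[i:i + size] for i in range(0, len(remaining), size)]
        for group in groups:
            if len(s0) >= n:
                break
            # t* = transaction in G whose addition brings S0 closest to S
            t_star = min(
                group,
                key=lambda t: dist_2(counts0 + Counter(t), len(s0) + 1,
                                     target, len(s)),
            )
            s0.append(t_star)
            remaining.remove(t_star)
            counts0 = counts0 + Counter(t_star)
    return s0


fa, fb = frozenset("a"), frozenset("b")
print(fast_grow([fa, fb, fa, fb], n=2, k=4))  # picks one {a} and one {b}
```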


Epsilon Approximation (EA)

Theory based on work in statistics on VC dimension (Vapnik & Cervonenkis ’71) shows:
Can simultaneously estimate the frequency of a collection of subsets when the VC dimension is finite

Applications to computational geometry and learning theory

Def: A sample S0 of S1 is an ε-approximation iff its discrepancy satisfies $Dist(S_0, S_1) \le \varepsilon$


Epsilon Approximation (EA)

Halving Method

Deterministically halves the data to get sample S0

Apply halving repeatedly (S1 => S2 => … => St (= S0)) while $Dist(S_0, S_1) \le \varepsilon$

Each halving step introduces a discrepancy $\varepsilon_i(n_i, m)$, where m = total no. of items in database, ni = size of sub-sample Si

Halving stops with the maximum t such that $\varepsilon_t = \sum_{i \le t} \varepsilon_i(n_i, m) \le \varepsilon$


Epsilon Approximation (EA)

How to compute halving?
Hyperbolic cosine method [Spencer]

1. Color each transaction red (in sample) or blue (not in sample)

2. Penalty for each item, reflecting how balanced its red and blue counts are
Penalty small if red/blue approximately balanced
Penalty will shoot up exponentially when
red dominates (item is over-sampled), or
blue dominates (item is under-sampled)

3. Color transactions sequentially, keeping penalty low
Key property: no increase in penalty on average
=> One of the two colors does not increase the penalty globally


Epsilon Approximation (EA)

Penalty Computation
Let Qi = Penalty for item Ai

Init Qi = 2

Suppose that we have colored the first j transactions:

$Q_i = Q_i^{(j)} = (1+\delta_i)^{r_i}(1-\delta_i)^{b_i} + (1-\delta_i)^{r_i}(1+\delta_i)^{b_i}$

where ri = ri(j) = no. of red transactions containing Ai
bi = bi(j) = no. of blue transactions containing Ai
δi = parameter that influences how fast penalty changes as function of |ri - bi|
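To see how the per-item penalty behaves, here is a small Python sketch. It assumes the penalty is the sum of the two products (1+δ)^r (1-δ)^b and (1-δ)^r (1+δ)^b, which agrees with the initial value Qi = 2 when r = b = 0; the function name `item_penalty` and the value δ = 0.1 are illustrative choices only.

```python
def item_penalty(r, b, delta):
    """Penalty for one item seen in r red and b blue transactions."""
    over = (1 + delta) ** r * (1 - delta) ** b   # grows when red dominates
    under = (1 - delta) ** r * (1 + delta) ** b  # grows when blue dominates
    return over + under


delta = 0.1
print(item_penalty(0, 0, delta))   # 2.0, the initial value
print(item_penalty(5, 5, delta))   # balanced colors: below 2
print(item_penalty(10, 0, delta))  # red dominates: above 2
print(item_penalty(0, 10, delta))  # blue dominates: same value by symmetry
```

With balanced counts each product shrinks to (1-δ²)^r, so the penalty stays small; with a lopsided coloring one of the two products grows geometrically in |r - b|.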


Epsilon Approximation (EA)

How to color transaction j+1
Compute global penalty:

$Q^{(r)} = \sum_i Q_i^{(j \| r)}$ = Global penalty assuming transaction j+1 is red

$Q^{(b)} = \sum_i Q_i^{(j \| b)}$ = Global penalty assuming transaction j+1 is blue

Choose color for which global penalty is smaller

EA is inherently an on-line method
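One halving pass with this coloring rule might look like the Python sketch below. Several details are assumptions made for illustration: a single δ shared by all items (the slides allow a per-item δi), ties resolved in favor of red, and the hypothetical function name `halve`. Only items contained in the current transaction change their penalty, so comparing the two global penalties reduces to comparing sums over those items.

```python
def halve(transactions, delta=0.1):
    """Color transactions sequentially; return the red ones (the sample)."""
    items = {a for t in transactions for a in t}
    over = {a: 1.0 for a in items}   # (1+delta)^r * (1-delta)^b per item
    under = {a: 1.0 for a in items}  # (1-delta)^r * (1+delta)^b per item
    up, down = 1 + delta, 1 - delta
    red = []
    for t in transactions:
        # Penalty contribution of the items in t under each candidate color
        q_red = sum(over[a] * up + under[a] * down for a in t)
        q_blue = sum(over[a] * down + under[a] * up for a in t)
        is_red = q_red <= q_blue     # assumption: ties go to red
        for a in t:
            over[a] *= up if is_red else down
            under[a] *= down if is_red else up
        if is_red:
            red.append(t)
    return red


stream = [frozenset("a")] * 10
print(len(halve(stream)))  # 5: identical transactions end up split evenly
```

Each transaction is colored as it arrives using only the running per-item products, which is why the method suits on-line processing.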


Performance Evaluation

Synthetic data set
IBM QUEST project [AS94]
100,000 transactions
1,000 items
number of maximal potentially large itemsets = 2000
average transaction length: 10
average length of maximal large itemsets: 4
minimum support: 0.77%
length of the maximal large itemsets: 6

Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings


Experimental Results

[Figure: Time (cpu sec) vs. Sample Ratio, FAST_trim vs. EA/SRS. Series: FAST_trim_D1, FAST_trim_D2, EA, SRS]

[Figure: Accuracy vs. Sample Ratio, FAST_trim vs. EA/SRS. Series: FAST_trim_D1, FAST_trim_D2, EA, SRS]

87% reduction in sample size for accuracy: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)


Experimental Results

[Figure: Time (cpu sec) vs. Sample Ratio, FAST_grow vs. EA/SRS. Series: FAST_grow_D1, FAST_grow_D2, EA, SRS]

[Figure: Accuracy vs. Sample Ratio, FAST_grow vs. EA/SRS. Series: FAST_grow_D1, FAST_grow_D2, EA, SRS]

FAST_grow_D2 is best for very small sampling ratios (< 2%)

EA is best overall in accuracy


Data Stream Reduction

Data Stream Reduction (DSR)
Representative sample of data stream

Assign more weight to recent data while partially keeping track of old data

Bucket#                  mS     mS-1   mS-2   …   1
Transactions in bucket   NS/2   NS/2   NS/2   …   NS/2
Kept after halving       NS/2   NS/4   NS/8   …   1

To generate an NS-element sample, halve bucket k (mS-k) times

Total #transactions = mS·NS/2
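The bucket arithmetic can be checked with a short Python sketch. It assumes idealized halving (each pass keeps exactly half of a bucket) and NS = 2^mS, so that the oldest bucket shrinks to a single transaction; `dsr_bucket_sizes` is a hypothetical helper name.

```python
def dsr_bucket_sizes(n_s, m_s):
    """Size of each bucket k = 1..m_s after halving it (m_s - k) times.

    Every bucket starts with n_s / 2 transactions; bucket m_s is the newest.
    """
    return {k: (n_s // 2) // 2 ** (m_s - k) for k in range(1, m_s + 1)}


n_s, m_s = 16, 4
sizes = dsr_bucket_sizes(n_s, m_s)
print(sizes)                 # {1: 1, 2: 2, 3: 4, 4: 8}
print(sum(sizes.values()))   # 15, roughly an NS-element sample
print(m_s * n_s // 2)        # 32 transactions seen in total
```

In this example the 32 transactions seen so far are reduced to 15, with the newest bucket contributing 8 of them and the oldest only 1.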


Data Stream Reduction

Practical Implementation

To avoid frequent halving, we use a single buffer and compute a new representative sample, by applying EA, only when the buffer is full

[Figure: buffer of size Ns. Full buffer: chunks with 0, 1 and 2 halvings. After applying EA: an empty chunk followed by chunks with 1, 2 and 3 halvings]


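Data Stream Reduction

Problem: Two users querying immediately before and after a halving operation see data that varies substantially

Continuous DSR: Buffer divided into chunks

[Figure: buffer of size Ns divided into chunks. Before: chunk boundaries at 2ns, 4ns, …, Ns-2ns, Ns. The next ns transactions arrive and the oldest chunk is halved first. After: chunk boundaries at ns (new transactions), 3ns, 5ns, …, Ns-ns, Ns]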


Conclusion

• Two-stage sampling approach based on trimming outliers or selecting representative transactions
• Epsilon approximation: deterministic method for repeatedly halving data to obtain final sample
• Can be used in conjunction with other non-sampling count-based mining algorithms
• EA-based data stream reduction

• We are investigating how to evaluate goodness of representative subset
• Frequency information to be used for discrepancy function