
CIKM 2009 - Efficient itemset generator discovery over a stream sliding window


Efficient Itemset Generator Discovery over a Stream Sliding Window

Chuancong Gao, Jianyong Wang

Database Laboratory, Department of Computer Science and Technology

Tsinghua University, Beijing 100084, China

C. Gao, J. Wang (Tsinghua Univ.) Efficient Itemset Generator Discovery over a Stream Sliding Window 1 / 28

Outline

- Introduction
  - What Is a Generator
  - Why We Need Generators
  - What Have We Done
- Related Work
- The StreamGen Algorithm
  - FP-Tree
  - Enumeration Tree
  - ADD and REMOVE Operations
- Extension for Mining Classification Rules
- Evaluation Results
- Conclusions

What Is a Generator

Example: Given the 4 transactions below, with a minimum support threshold (sup_min) of 2.

Transactions:
  A B C
  A D
  A B C D
  A B D

Frequent itemsets and their supports:
  Ø : 4
  A : 4    B : 3    C : 2    D : 3
  AB : 3   AC : 2   AD : 3   BC : 2   BD : 2
  ABC : 2  ABD : 2

[Figure: the itemset lattice above, with a legend marking Equivalence Class, Generator Itemset, and Closed Itemset.]


What Is a Generator

Equivalence class: All the frequent itemsets contained in the same set of input transactions.
Closed itemset: The maximal itemset in an equivalence class.
Generator itemsets: The minimal itemsets in an equivalence class.

Characteristics:

- Same equivalence class ⟹ same input transactions ⟹ same data distribution ⟹ same support value and confidence value;
- No sub-itemset of a generator itemset lies in the same equivalence class;
- No super-itemset of a closed itemset lies in the same equivalence class;
- Each equivalence class has exactly one closed itemset, but one or more generator itemsets;
- An itemset can be both a generator itemset and a closed itemset.
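The definitions above can be checked by brute force on the four example transactions. The following is a minimal sketch, not the authors' method: it enumerates every itemset, groups the frequent ones by the set of transactions containing them, and reports the minimal members (generators) and the maximal member (closed itemset) of each group. All function names are hypothetical.

```python
from itertools import combinations

# The four transactions of the example; the minimum support threshold is 2.
TRANSACTIONS = [{"A", "B", "C"}, {"A", "D"}, {"A", "B", "C", "D"}, {"A", "B", "D"}]
MIN_SUP = 2


def tidset(itemset, transactions):
    """Return the indices of the transactions that contain every item of itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def equivalence_classes(transactions, min_sup):
    """Group all frequent itemsets by the set of transactions containing them."""
    items = sorted(set().union(*transactions))
    classes = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = tidset(itemset, transactions)
            if len(tids) >= min_sup:
                classes.setdefault(tids, []).append(itemset)
    return classes


def generators_and_closed(members):
    """Split one equivalence class into its minimal members and its maximal member."""
    generators = [s for s in members if not any(other < s for other in members)]
    closed = max(members, key=len)
    return generators, closed


def label(itemset):
    """Render an itemset as a string such as 'ABD'; the empty itemset becomes 'Ø'."""
    return "".join(sorted(itemset)) or "Ø"


if __name__ == "__main__":
    for tids, members in equivalence_classes(TRANSACTIONS, MIN_SUP).items():
        gens, closed = generators_and_closed(members)
        print(f"support {len(tids)}: generators {[label(g) for g in gens]}, closed {label(closed)}")
```

On this data the sketch finds five equivalence classes, each with a single generator: Ø, B, C, D, and BD.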


Why We Need Generators

- Together with closed itemsets, they form a concise representation of equivalence classes;
- As classification rules / features:
  - Each equivalence class has at least one generator, sharing the same support and confidence with the other itemsets of the class;
  - Their number is much smaller than that of all frequent itemsets;
  - They are the shortest itemsets in an equivalence class;
  - Their average size tends to be the smallest;
  - They are preferred by the MDL (Minimum Description Length) principle.


What Have We Done

A novel algorithm, StreamGen, to mine frequent generator itemsets over a stream sliding window.

Contributions:

- The first algorithm for mining frequent itemset generators over stream sliding windows;
- A novel enumeration tree structure and some effective optimization techniques;
- An extension to directly mine classification rules on a sliding window;
- An extensive performance study shows that StreamGen outperforms other algorithms performing similar tasks, and achieves high classification accuracy.


Related Work

Itemset Mining Algorithms:

- Mining frequent patterns without candidate generation: A frequent-pattern tree approach. J. Han, J. Pei, Y. Yin, and R. Mao. Data Min. Knowl. Discov., 2004.
- Closet: An efficient algorithm for mining frequent closed itemsets. J. Pei, J. Han, and R. Mao. SIGMOD Workshop DMKD, 2000.
- Minimum description length principle: Generators are preferable to closed patterns. J. Li, H. Li, L. Wong, J. Pei, and G. Dong. AAAI, 2006.
- Mining statistically important equivalence classes and delta-discriminative emerging patterns. J. Li, G. Liu, and L. Wong. SIGKDD, 2007.


Related Work

Stream Itemset Mining Algorithms:

- Catch the moment: maintaining closed frequent itemsets over a data stream sliding window. Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Knowl. Inf. Syst., 2006.

Itemset based Classification Algorithms:

- On mining instance-centric classification rules. J. Wang and G. Karypis. IEEE Trans. Knowl. Data Eng., 2006.
- Discriminative frequent pattern analysis for effective classification. H. Cheng, X. Yan, J. Han, and C.-W. Hsu. ICDE, 2007.


The StreamGen Algorithm

Details of our algorithm here.

Example: A running example of stream data containing 6 transaction itemsets, with a window size of 4.

ID  Itemset
1   A B C
2   A D
3   A B C D
4   A B D
5   B C D
6   C D

[Figure: the six transactions along a time line, with sliding windows #1, #2, and #3. With a window size of 4, window #1 covers transactions 1-4, window #2 covers 2-5, and window #3 covers 3-6.]


A Few Basic Theorems

Theorem: A frequent itemset S is a generator iff there exists no subset of size |S| − 1 having the same support as S.

Hint: This can be used to check easily whether an itemset is a generator.

Theorem: Any subset of a generator is also a generator.

Theorem: Any superset of an unpromising itemset must be either unpromising or infrequent.

Hint: These two theorems help define the border between generators and non-generators, and form the foundation for the enumeration tree.
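The first theorem turns the generator test into a handful of support comparisons. Here is a minimal sketch on the first window of the running example (transactions 1-4, minimum support 2). The helper names `support` and `is_generator` are hypothetical, and supports are counted by scanning the window rather than with the structures introduced later.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def is_generator(itemset, transactions, min_sup):
    """Apply the first theorem: a frequent itemset S is a generator
    iff no subset of size |S| - 1 has the same support as S."""
    sup = support(itemset, transactions)
    if sup < min_sup:
        return False  # infrequent itemsets are not considered
    return all(support(itemset - {item}, transactions) != sup for item in itemset)


# First sliding window of the running example.
WINDOW_1 = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]

if __name__ == "__main__":
    for name in ["A", "B", "BC", "BD", "CD"]:
        print(name, is_generator(frozenset(name), WINDOW_1, 2))
```

Only the |S| immediate subsets are inspected, so the check costs |S| support lookups instead of a scan over all subsets. In this window, B and BD pass the test. A fails because it has the same support as Ø, BC fails because it has the same support as C, and CD fails because it is infrequent.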

FP-Tree

A modified FP-Tree is used to store and compress the transactions in each sliding window.

Example: FP-Tree of the first sliding window in the previous example.

ID  Itemset
1   A B C
2   A D
3   A B C D
4   A B D

[Figure: FP-Tree of the first sliding window, drawn with an ID table (transaction IDs 1, 4, 3, 2) and a head table (items A, B, D, C). Node labels in the figure: D : 3, C : 1, B : 1, A : 1, C : 1, B : 1, A : 1, A : 1, B : 1, A : 1.]


Enumeration Tree

The enumeration tree maintains the information of the mined generators and the border between generators and non-generators.

3 types of nodes:

- Infrequent Node;
- Unpromising Node;
- Generator Node.

A hash table is prepared for each level of the enumeration tree to accelerate the checking operation.

Enumeration Tree

Example: Enumeration tree of the first sliding window with minimum support 2.

ID  Itemset
1   A B C
2   A D
3   A B C D
4   A B D

Nodes of the tree (itemset : support):
  Ø : 4
  D : 3    C : 2    B : 3    A : 4
  BC : 2   BD : 2   CD : 1

Legend of the figure: solid-border ellipse = Generator Node; dotted-border ellipse = Unpromising Node; dotted-border rectangle = Infrequent Node.
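The node set of this example can be reproduced with a level-wise sketch that expands only generators. It makes two assumptions. First, an unpromising node is read as a frequent itemset that is not a generator. Second, a candidate is created only when all of its immediate subsets are generator nodes, which is an Apriori-style simplification and not necessarily how StreamGen lays out parents and children. Function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def classify(itemset, transactions, min_sup):
    """Label an itemset as 'infrequent', 'unpromising', or 'generator'."""
    sup = support(itemset, transactions)
    if sup < min_sup:
        return "infrequent"
    if any(support(itemset - {item}, transactions) == sup for item in itemset):
        return "unpromising"  # assumed meaning: frequent but not a generator
    return "generator"


def enumerate_nodes(transactions, min_sup):
    """Build the node set level by level, stopping at the generator border."""
    items = sorted(set().union(*transactions))
    status = {frozenset(): classify(frozenset(), transactions, min_sup)}
    for size in range(1, len(items) + 1):
        added = False
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            # Expand only below generators: every immediate subset must be one.
            if all(status.get(itemset - {item}) == "generator" for item in itemset):
                status[itemset] = classify(itemset, transactions, min_sup)
                added = True
        if not added:
            break
    return status


if __name__ == "__main__":
    window_1 = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
    for itemset, kind in enumerate_nodes(window_1, 2).items():
        print("".join(sorted(itemset)) or "Ø", kind)
```

For the first window this yields the eight nodes listed above: Ø, B, C, D, and BD as generators, A and BC as unpromising, and CD as infrequent. No superset of A, BC, or CD is created, which is the pruning that the third theorem justifies.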

ADD and REMOVE Operations

Core part: the status transition matrix for enumeration tree nodes.

        ADD                          REMOVE
Type    x < y    x = y    x > y      x < y    x = y    x > y
G       G        G        G          G        G/U      I/G
U       U        G/U      U          U        U        I/U
I       I        I        I/G/U      I        I        I

x = |itemset_n ∩ T|, y = |itemset_n| − 1, where itemset_n is the itemset of node n and T is the transaction being added or removed.
G = Generator, U = Unpromising, I = Infrequent
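The matrix can be encoded directly as a lookup table. The sketch below only reports which node types are possible after an operation. Deciding among them still requires the updated supports. The names `TRANSITIONS`, `compare`, and `possible_states` are hypothetical.

```python
# Possible node types after an operation, indexed by operation, current type,
# and the comparison of x = |itemset ∩ T| with y = |itemset| - 1.
TRANSITIONS = {
    "ADD": {
        "G": {"<": {"G"}, "=": {"G"}, ">": {"G"}},
        "U": {"<": {"U"}, "=": {"G", "U"}, ">": {"U"}},
        "I": {"<": {"I"}, "=": {"I"}, ">": {"I", "G", "U"}},
    },
    "REMOVE": {
        "G": {"<": {"G"}, "=": {"G", "U"}, ">": {"I", "G"}},
        "U": {"<": {"U"}, "=": {"U"}, ">": {"I", "U"}},
        "I": {"<": {"I"}, "=": {"I"}, ">": {"I"}},
    },
}


def compare(itemset, transaction):
    """Return '<', '=' or '>' for x versus y as defined in the matrix."""
    x = len(itemset & transaction)
    y = len(itemset) - 1
    if x < y:
        return "<"
    return "=" if x == y else ">"


def possible_states(operation, state, itemset, transaction):
    """Look up the node types a node may take after the operation."""
    return TRANSITIONS[operation][state][compare(itemset, transaction)]


if __name__ == "__main__":
    t_add = frozenset("BCD")  # transaction 5 of the running example
    print(possible_states("ADD", "I", frozenset("CD"), t_add))
    print(possible_states("ADD", "U", frozenset("A"), t_add))
```

The case x > y means the node's itemset is contained in T, so its own support changes. The case x = y means exactly one item is missing from T, so only the support of one immediate subset changes. For x < y, neither the itemset nor any of its immediate subsets is contained in T, which is why the type never changes in that column.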

Example of ADD Operation

Enumeration tree before the operation (itemset : support):
  Ø : 4
  D : 3    C : 2    B : 3    A : 4
  BC : 2   BD : 2   CD : 1

Transaction 5 is added to the window (T = B C D):

ID  Itemset
1   A B C
2   A D
3   A B C D
4   A B D
5   B C D   (+)

ADD
Type    x < y    x = y    x > y
G       G        G        G
U       U        G/U      U
I       I        I        I/G/U

x = |itemset_n ∩ T|, y = |itemset_n| − 1

Enumeration tree after the operation (itemset : support):
  Ø : 5
  D : 4    C : 3    B : 4    A : 4
  AB : 3   AC : 2   AD : 3   BC : 3   BD : 3   CD : 2
  ABC : 2  ACD : 1  ABD : 2

Example of REMOVE Operation

Enumeration tree before the operation (itemset : support):
  Ø : 5
  D : 4    C : 3    B : 4    A : 4
  AB : 3   AC : 2   AD : 3   BC : 3   BD : 3   CD : 2
  ABC : 2  ACD : 1  ABD : 2

Transaction 1 is removed from the window (T = A B C):

ID  Itemset
1   A B C   (−)
2   A D
3   A B C D
4   A B D
5   B C D

REMOVE
Type    x < y    x = y    x > y
G       G        G/U      I/G
U       U        U        I/U
I       I        I        I

x = |itemset_n ∩ T|, y = |itemset_n| − 1

Enumeration tree after the operation (itemset : support):
  Ø : 4
  D : 4    C : 2    B : 3    A : 3
  AB : 2   AC : 1   BC : 2

Combine Two Operations

- For sliding window:
  - ADD when the window is not full;
  - REMOVE when the window is full.
- For incremental:
  - Only ADD.
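A small driver illustrates how the two operations could be combined. It assumes that, once the window is full, each arriving transaction first triggers a REMOVE of the oldest transaction and then an ADD of the new one. It also stands in single-item support counts for the enumeration tree. The function `slide` and its callback parameters are hypothetical.

```python
from collections import Counter, deque


def slide(stream, window_size, add, remove):
    """Feed a stream through a window, calling remove and add as it moves.

    With window_size=None the window never expires anything, which
    corresponds to the incremental (ADD-only) setting.
    """
    window = deque()
    for transaction in stream:
        if window_size is not None and len(window) == window_size:
            remove(window.popleft())  # window is full: expire the oldest first
        window.append(transaction)
        add(transaction)
        yield list(window)


if __name__ == "__main__":
    stream = [frozenset(t) for t in ["ABC", "AD", "ABCD", "ABD", "BCD", "CD"]]
    counts = Counter()  # stand-in for the structures updated by ADD and REMOVE
    for window in slide(stream, 4, counts.update, counts.subtract):
        if len(window) == 4:
            print(["".join(sorted(t)) for t in window], dict(counts))
```

On the running example, the three full windows printed are transactions 1-4, 2-5, and 3-6.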

Extension for Mining Classification Rules

Algorithm 1: StreamGenRules(n)

Input: The root node n of the enumeration tree.

begin
    nodes ← getGenerators(n);
    sort nodes by info-gain;
    rules ← ∅;
    foreach cn ∈ nodes do
        if ∀r ∈ rules, r ⊄ cn then
            if cn covers at least one transaction then
                rules ← rules ∪ {cn};
                remove covered transactions;
                if no more transactions then
                    break;
    return rules;
end
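The selection loop of Algorithm 1 can be sketched as follows. Several points are assumptions: generators arrive as (itemset, info-gain) pairs with precomputed scores, sorting is by decreasing info-gain, ⊄ is read as "is not a proper subset", and class labels are left out. The scores in the usage example are made-up placeholders, and `select_rules` is a hypothetical name.

```python
def select_rules(generators, transactions):
    """Greedily pick generators as rules until every transaction is covered.

    generators: list of (itemset, info_gain) pairs (scores assumed given).
    transactions: list of item sets in the current window.
    """
    remaining = list(transactions)
    rules = []
    # Assumption: higher info-gain first.
    for itemset, _gain in sorted(generators, key=lambda g: g[1], reverse=True):
        if any(rule < itemset for rule in rules):
            continue  # an already selected rule is a proper subset of this one
        if any(itemset <= t for t in remaining):
            rules.append(itemset)
            # Remove the transactions covered by the new rule.
            remaining = [t for t in remaining if not itemset <= t]
            if not remaining:
                break
    return rules


if __name__ == "__main__":
    window = [set("ABC"), set("AD"), set("ABCD"), set("ABD")]
    # Placeholder scores, for illustration only.
    scored = [
        (frozenset("B"), 0.9),
        (frozenset("BD"), 0.8),
        (frozenset("D"), 0.5),
        (frozenset("C"), 0.3),
    ]
    print(["".join(sorted(r)) for r in select_rules(scored, window)])
```

With these placeholder scores, B is selected first and covers transactions 1, 3, and 4. BD is skipped because B is a proper subset of it. D then covers the remaining transaction, and the loop stops.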

Datasets

Dataset     # Items   # Trans.   # Pos.   # Neg.   Avg. Len.
mushroom    116       8,124      4,208    3,916    21.695
horse       89        368        232      136      16.769
adult       128       48,842     11,687   37,155   13.868
breast      45        699        458      241      8.977
hepatitis   55        155        32       123      17.923
pima        40        768        500      268      8
chess       75        3,196      -        -        37
connect     129       67,557     -        -        43
pumsb       2,113     49,046     -        -        74

The upper part (mushroom through pima) is used for both runtime evaluation and classification evaluation; the bottom part (chess, connect, pumsb) is used only for runtime evaluation.

Page 48: CIKM 2009 - Efficient itemset generator discovery over a stream sliding window

Evaluation Results

Runtime Comparing with Moment

Comparsion with Moment, one frequent closed itemset mining algorithmon sliding windows:

[Figure: eight plots of runtime (in seconds, log scale) versus minimum support threshold (in %), each comparing Moment and StreamGen. Panels: mushroom with window size 2,000 and 4,000; chess with window size 1,000 and 2,000; pumsb with window size 2,500 and 10,000; connect with window size 30,000 and 60,000.]
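For orientation, the problem setting being timed (frequent generators maintained over a sliding window) can be illustrated with a naive baseline. The sketch below is an illustration under stated assumptions, not StreamGen or Moment: it recomputes everything by brute force for each window, takes the minimum support as an absolute count, and uses hypothetical names (`support`, `frequent_generators`, `slide`). It relies on the definition that an itemset is a generator when no proper subset has the same support.

```python
from collections import deque
from itertools import combinations


def support(itemset, window):
    """Number of transactions in the window that contain the itemset."""
    return sum(1 for t in window if itemset <= t)


def frequent_generators(window, min_sup):
    """Brute-force frequent generators of a window (exponential; illustration only)."""
    items = sorted(set().union(*window))
    result = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset, window)
            if sup < min_sup:
                continue
            # Support is anti-monotone, so checking the immediate subsets is
            # enough: a generator has strictly lower support than each of them.
            if all(support(itemset - {i}, window) > sup for i in itemset):
                result[itemset] = sup
    return result


def slide(stream, window_size, min_sup):
    """Yield the frequent generators each time the window is full."""
    window = deque()
    for transaction in stream:
        window.append(frozenset(transaction))   # ADD the newest transaction
        if len(window) > window_size:
            window.popleft()                     # REMOVE the oldest transaction
        if len(window) == window_size:
            yield frequent_generators(window, min_sup)
```

On the four-transaction example from the introduction (ABC, AD, ABCD, ABD with a minimum support of 2), `frequent_generators` returns Ø, B, C, D and BD. Recomputing from scratch on every slide is exactly the cost that incremental ADD and REMOVE operations are meant to avoid.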


Memory Use Comparison with Moment

Peak memory use of Moment and StreamGen, in KB:

Dataset    Window size  supmin  Moment     StreamGen
mushroom   4,000        0.1     14,476     10,108
mushroom   2,000        0.1     12,504     8,472
chess      2,000        0.6     103,180    31,636
chess      1,000        0.75    34,624     9,176
connect-4  60,000       0.95    141,756    98,236
connect-4  30,000       0.998   73,056     52,372
pumsb      10,000       0.7     1,732,136  75,316
pumsb      2,500        0.75    90,944     23,472
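The size of the gap is easier to see as a ratio. The following sketch simply divides the Moment figure by the StreamGen figure for each row of the memory table; the variable and function names are illustrative, and the numbers are copied from the table.

```python
# Peak memory in KB, copied from the table:
# (dataset, window size, supmin, Moment, StreamGen)
PEAK_MEMORY_KB = [
    ("mushroom", 4000, 0.1, 14476, 10108),
    ("mushroom", 2000, 0.1, 12504, 8472),
    ("chess", 2000, 0.6, 103180, 31636),
    ("chess", 1000, 0.75, 34624, 9176),
    ("connect-4", 60000, 0.95, 141756, 98236),
    ("connect-4", 30000, 0.998, 73056, 52372),
    ("pumsb", 10000, 0.7, 1732136, 75316),
    ("pumsb", 2500, 0.75, 90944, 23472),
]


def memory_ratios(rows):
    """Map (dataset, window size) to Moment peak memory / StreamGen peak memory."""
    return {
        (name, window): moment / streamgen
        for name, window, _supmin, moment, streamgen in rows
    }


if __name__ == "__main__":
    for (name, window), ratio in memory_ratios(PEAK_MEMORY_KB).items():
        print(f"{name:10s} window={window:>6d}  Moment/StreamGen = {ratio:.2f}x")
```

With these figures the ratio ranges from roughly 1.4x (connect-4, window size 30,000) to roughly 23x (pumsb, window size 10,000).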


Runtime Comparison with DPM & DDPMine

Comparison with DPM, a frequent generator itemset mining algorithm on static data:

[Figure: four plots of runtime (in seconds, log scale) versus minimum support threshold (in %), each comparing DPM and StreamGen. Panels: mushroom with window size 4,000; chess with window size 1,000; connect with window size 67,000; pumsb with window size 49,000.]


Comparison with DDPMine, a frequent-itemset-based classification rule mining algorithm on static data:

[Figure: two plots of runtime (in seconds, log scale) versus minimum support threshold (in %), each comparing DDPMine and StreamGen. Panels: mushroom with window size 8,000; horse with window size 600.]

*The runtimes of DPM & DDPMine are measured only on full-sized windows.


Classification Experiment Results

Classification Accuracy:

Dataset    | StreamGen                                  | DDPMine
           | Accuracy  max. len.  avg. len.  avg. num.  | Accuracy  max. len.  avg. len.  avg. num.
breast     | 96.708    3          1.551      23.6       | 95.28     9          2.448      11.6
adult      | 82.146    3          1.831      13         | 81.292    14         4.583      7.2
mushroom   | 98.918    3          1.958      9.6        | 97.184    22         15.592     16.2
hepatitis  | 82.006    4          2.387      15         | 76.986    8          4.8        5
horse      | 81.512    2          1.389      3.6        | 81.246    20         4.88       10
pima       | 74.87     4          1.663      18.4       | 75.124    7          2.435      12.6

Rule Example on “mushroom”:

StreamGen:
  38
  12 25
  13 25
  7 67
  66
  7 68
  11 18
  6 18 37
  4 53

DDPMine:
  17 39
  5 7 8 11 13 15 16 17 18 19 20 26
  8 17 18
  5 7 9 13 14 15 16 17 18 19 20 40 41 46 53 54
  2 7 9 11 13 14 15 16 17 18 19 20 21 38 40 44 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 28 38 40 44 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 32 38 40 53 54 65 76
  2 7 9 11 13 14 15 16 17 18 19 20 22 32 38 40 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 28 32 38 40 46 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 21 32 38 40 45 46 53 54 76
  2 7 9 11 13 14 15 16 17 18 19 20 21 32 34 38 40 46 48 53 54 76


Conclusions

Conclusions

- Explored a new and challenging problem: mining frequent itemset generators over a stream sliding window;

- Devised a novel enumeration tree structure;

- Proposed effective optimization techniques;

- Outperformed other state-of-the-art algorithms in terms of efficiency and classification accuracy.


The End

Thank you for Listening!

Questions or Comments?
