
Copyright 2005, Data Mining Research Lab, The Ohio State University

Cache-conscious Frequent Pattern Mining on a Modern Processor

Amol Ghoting, Gregory Buehrer, and Srinivasan Parthasarathy

Data Mining Research Laboratory, CSEThe Ohio State University

Daehyun Kim, Anthony Nguyen, Yen-Kuang Chen, and Pradeep Dubey

Intel Corporation

Roadmap

• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions

Motivation

• Data mining applications
  – Rapidly growing segment in commerce and science
  – Interactive response time is important
  – Compute- and memory-intensive
• Modern architectures
  – Memory wall
  – Instruction level parallelism (ILP)

[Figure: FP-Growth execution time chart showing performance saturation, with 2.4x and 1.6x gaps annotated. Note: experiment conducted on specialized hardware.]

Contributions

• We characterize the performance and memory access behavior of three state-of-the-art frequent pattern mining algorithms
• We improve the performance of the three frequent pattern mining algorithms
  – Cache-conscious prefix tree (spatial locality + hardware pre-fetching)
  – Path tiling to improve temporal locality
  – Co-scheduling to improve ILP on a simultaneous multi-threaded (SMT) processor

Roadmap

• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions

Frequent pattern mining (1)

• Finds groups of items that co-occur frequently in a transactional data set
• Example:

  Customer 1: milk bread cereal
  Customer 2: milk bread eggs sugar
  Customer 3: milk bread butter
  Customer 4: eggs sugar

• Minimum support = 2
• Frequent patterns:
  – Size 1: (milk), (bread), (eggs), (sugar)
  – Size 2: (milk & bread), (sugar & eggs)
  – Size 3: None
• Search space traversal: breadth-first or depth-first
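The example above can be checked with a small brute-force counter. This is only an illustration of the definition, not any of the algorithms in the talk (which prune the search space instead of enumerating every subset):

```python
from itertools import combinations
from collections import Counter

# The four example transactions from the slide.
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread", "eggs", "sugar"},
    {"milk", "bread", "butter"},
    {"eggs", "sugar"},
]

def frequent_itemsets(transactions, min_support):
    """Count every subset of every transaction (fine for tiny examples)
    and keep those meeting the minimum support."""
    counts = Counter()
    for t in transactions:
        for k in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

freq = frequent_itemsets(transactions, min_support=2)
# (milk & bread) is frequent with support 3; (milk & eggs) co-occur
# only for customer 2, so that pair does not reach support 2.
```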

Frequent pattern mining (2)

• Algorithms under study
  – FP-Growth (based on the pattern-growth method)
    • Winner of the 2003 Frequent Itemset Mining Implementations (FIMI) evaluation
  – Genmax (depth-first search space traversal)
    • Maximal pattern miner
  – Apriori (breadth-first search space traversal)
• All algorithms use a prefix tree as a data set representation

Roadmap

• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions

Setup

• Intel Xeon processor
  – 2 GHz with 4 GB of main memory
  – 4-way 8 KB L1 data cache
  – 8-way 512 KB L2 cache on chip
  – 8-way 2 MB L3 cache
• Intel VTune Performance Analyzers to collect performance data
• Implementations
  – FIMI repository
    • FP-Growth (Gosta Grahne and Jianfei Zhu)
    • Genmax (Gosta Grahne and Jianfei Zhu)
    • Apriori (Christian Borgelt)
  – Custom memory managers

Execution time breakdown

FP-Growth: Count-FPGrowth() – 61%, Project-FPGrowth() – 31%, Other – 8%
Genmax:    Count-GM() – 91%, Other – 9%
Apriori:   Count-Apriori() – 70%, Candidate-Gen() – 25%, Other – 5%

Note: the Count functions perform support counting in a prefix tree; Count-GM() is similar to Count-FPGrowth().

Operation mix

                                        Count-FPGrowth()  Count-GM()  Count-Apriori()
Integer ALU operations per instruction  0.65              0.64        0.34
Memory operations per instruction       0.72              0.69        0.66

Note: each column need not sum up to 1.

Memory access behavior

                           Count-FPGrowth()  Count-GM()  Count-Apriori()
L1 hit rate                89%               87%         86%
L2 hit rate                43%               42%         49%
L3 hit rate                39%               40%         27%
L3 misses per instruction  0.03              0.03        0.04
CPU utilization            9%                9%          8%

Roadmap

• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions

FP-tree

Minimum support = 3

[Figure: three snapshots of FP-tree construction under root r, growing from a single path (a:1, c:1, f:1, m:1, p:1) to the final tree with counts a:4, c:3, f:3, m:2, p:2 plus single-count branch nodes for b, c, m, and p. Each node carries the fields COUNT, ITEM, NODE POINTER, CHILD POINTERS, and PARENT POINTER.]

• Index based on largest common prefix
  – Typically results in a compressed data set representation
• Node pointers allow for fast searching of items
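As a rough sketch of the node layout and prefix-sharing insertion described above (class, field, and transaction names are ours, not from the talk):

```python
class FPNode:
    """Pointer-based FP-tree node with the five fields from the slide:
    count, item, node pointer (next node holding the same item),
    child pointers, and parent pointer."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}          # item -> FPNode (child pointers)
        self.next_same_item = None  # "node pointer" header-list link

def insert(root, transaction, header):
    """Insert a frequency-ordered transaction, sharing the largest
    common prefix with paths already in the tree."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # Thread the new node onto the header list for its item.
            child.next_same_item = header.get(item)
            header[item] = child
        child.count += 1
        node = child

root, header = FPNode(None), {}
for t in [["a", "c", "f", "m", "p"], ["a", "c", "f", "m", "p"],
          ["a", "c", "f", "b"], ["a", "b"]]:
    insert(root, t, header)
```

Because the first three transactions share the prefix a, c, f, the tree stores that path once with per-node counts, which is what makes the representation compressed.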

FP-Growth

• Basic step:
  – For each item in the FP-tree, build a conditional FP-tree
  – Recursively mine the conditional FP-tree

[Figure: the FP-tree from the previous slide and the conditional FP-tree for item m (a:3, c:3, f:3), built by a bottom-up traversal from m's nodes; items p, f, c, a, and b are processed similarly.]

• Dynamic data structure, and only two node fields are used during the bottom-up traversal → poor spatial locality
• Large data structure → poor temporal locality
• Pointer de-referencing → poor instruction level parallelism (ILP)
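The bottom-up traversal behind the basic step can be sketched as follows. The Node fields mirror the slide's layout, but all names are ours and the tree fragment is hand-built for illustration; note how the walk touches only the item and parent-pointer fields of each node, the behavior the locality analysis points at:

```python
class Node:
    # Minimal stand-in for an FP-tree node (our naming).
    def __init__(self, item, count, parent, nxt=None):
        self.item, self.count, self.parent = item, count, parent
        self.next_same_item = nxt  # header-list ("node pointer") link

# Hand-built fragment of a tree: root -> a:4 -> c:3 -> f:3 -> m:2.
root = Node(None, 0, None)
a = Node("a", 4, root)
c = Node("c", 3, a)
f = Node("f", 3, c)
m = Node("m", 2, f)
header = {"m": m}

def conditional_pattern_base(header, item):
    """Collect (prefix path, count) pairs for `item` by following its
    header list and climbing parent pointers to the root: the
    pointer-chasing, bottom-up traversal analyzed in the talk."""
    paths = []
    node = header.get(item)
    while node is not None:               # header-list walk
        path, cur = [], node.parent
        while cur is not None and cur.item is not None:
            path.append(cur.item)         # parent-pointer walk (bottom-up)
            cur = cur.parent
        if path:
            paths.append((list(reversed(path)), node.count))
        node = node.next_same_item
    return paths
```

Each successive parent dereference can land on an unrelated heap address, which is exactly why the traversal shows poor spatial locality and ILP on a pointer-based tree.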

Cache-conscious prefix tree

• Improves cache line utilization
  – Smaller node size + DFS allocation
• Allows the use of hardware cache line pre-fetching for bottom-up traversals

[Figure: the example tree allocated in DFS order; each node is reduced to essentially the ITEM field, with node pointers factored out into separate header lists and node counts into separate count lists, in place of the per-node COUNT, NODE POINTER, CHILD POINTERS, and PARENT POINTER fields.]
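A minimal sketch of the idea under our own naming (the talk's exact node encoding may differ): nodes live in one DFS-ordered array, and counts and header lists move into separate structures so each node shrinks and neighbors on a path sit on nearby cache lines:

```python
class CCTree:
    """Array-based prefix tree, nodes stored contiguously in DFS
    allocation order; counts and header lists are parallel structures
    rather than per-node pointer fields."""
    def __init__(self):
        self.item = []      # node index -> item id (DFS order)
        self.parent = []    # node index -> parent index (-1 for root)
        self.count = []     # separate count list, one slot per node
        self.headers = {}   # item id -> list of node indices

    def add_node(self, item, parent_idx):
        idx = len(self.item)
        self.item.append(item)
        self.parent.append(parent_idx)
        self.count.append(0)
        self.headers.setdefault(item, []).append(idx)
        return idx

    def bottom_up_path(self, idx):
        """Walk toward the root. With DFS allocation this visits
        strictly decreasing indices, a regular access pattern that
        hardware cache line pre-fetchers can exploit."""
        path = []
        while idx != -1:
            path.append(self.item[idx])
            idx = self.parent[idx]
        return path

t = CCTree()
r = t.add_node("r", -1)
a = t.add_node("a", r)
c = t.add_node("c", a)
```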

Path tiling to improve temporal locality

• DFS order enables breaking the tree into tiles based on node addresses
  – These tiles can partially overlap
• Maximize tree reuse
  – All tree accesses are restructured to operate on a tile-by-tile basis

[Figure: the tree under root r partitioned into Tile 1, Tile 2, …, Tile N-1, Tile N.]
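The tile-by-tile restructuring can be sketched as follows (names and the exact tiling policy are our own): many bottom-up walks advance together, but each only while it stays inside the current tile's address range, so a tile is brought into cache once and reused by all walks:

```python
def tiled_bottom_up(parent, starts, tile_size, visit):
    """Run many bottom-up walks over a DFS-ordered node array,
    tile by tile. `parent[i]` is node i's parent index (-1 at root);
    `visit(walk, node)` is called once per node on each walk's path."""
    n = len(parent)
    frontiers = list(starts)            # one in-progress walk per start
    for hi in range(n, 0, -tile_size):  # tiles of decreasing addresses
        lo = max(0, hi - tile_size)
        for i, idx in enumerate(frontiers):
            # Advance this walk only while it stays inside the tile.
            while idx is not None and lo <= idx < hi:
                visit(i, idx)
                p = parent[idx]
                idx = p if p != -1 else None
            frontiers[i] = idx          # park the walk at the tile edge
    return frontiers
```

Without tiling, each walk would sweep the whole tree before the next one starts, evicting shared nodes between walks; with tiling, every walk's work on a tile is finished before the tile leaves the cache.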

Improving ILP using SMT

• Simultaneous multi-threading (SMT)
  – Intel hyper-threading (HT) maintains two hardware contexts on chip
  – Improves instruction level parallelism (ILP)
    • When one thread waits, the other thread can use CPU resources
• Identifying independent threads is not good enough
  – Unlikely to hide long cache miss latency
  – Can lead to cache interference (conflicts)
• Solution: restructure the multi-threaded computation to reuse cache on a tile-by-tile basis

[Figure: Thread 1 and Thread 2 co-scheduled on the same tile sequence (same data, but different computation).]
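A structural sketch of the co-scheduling idea (ours, not the paper's code): two threads walk the same tile sequence in lockstep, with a barrier after each tile keeping them on the same data so they share rather than compete for the cache. CPython threads do not truly run simultaneously, so this illustrates only the scheduling structure; the actual SMT benefit needs native hardware threads:

```python
import threading

def co_scheduled(tiles, work_a, work_b):
    """Apply two different computations to the same tile sequence,
    one per thread, barrier-synchronized tile by tile."""
    barrier = threading.Barrier(2)
    results = [[], []]

    def run(slot, work):
        for tile in tiles:
            results[slot].append(work(tile))
            barrier.wait()      # stay in lockstep on the current tile

    t1 = threading.Thread(target=run, args=(0, work_a))
    t2 = threading.Thread(target=run, args=(1, work_b))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

# Same data (the tiles), different computation per thread.
out = co_scheduled([[1, 2], [3, 4]], sum, max)
```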

Speedup for FP-Growth (synthetic data set)

[Figure: speedup chart; x-axis values 4000, 4500, 5000, 5500.]

Speedup for FP-Growth (real data set)

[Figure: speedup chart; x-axis values 50000, 58350, 66650, 75000.]

For FP-Growth, the L1 hit rate improves from 89% to 94%, and the L2 hit rate improves from 43% to 98%.

Speedup: Genmax – up to 4.5x, Apriori – up to 3.5x

Roadmap

• Motivation and Contributions
• Background
• Performance characterization
• Cache-conscious optimizations
• Related work
• Conclusions

Related work (1)

• Data mining algorithms
  – Characterizations
    • Self-organizing maps: Kim et al. [WWC99]
    • C4.5: Bradford and Fortes [WWC98]
    • Sequence mining, graph mining, outlier detection, clustering, and decision tree induction: Ghoting et al. [DAMON05]
  – Memory placement techniques for association rule mining
    • Considered the effects of memory pooling and coarse-grained spatial locality on association rule mining algorithms in serial and parallel settings: Parthasarathy et al. [SIGKDD98, KAIS01]

Related work (2)

• Database algorithms
  – DBMS on modern hardware
    • Ailamaki et al. [VLDB99, VLDB2001]
  – Cache-sensitive search trees and B+-trees
    • Rao and Ross [VLDB99, SIGMOD00]
  – Prefetching for B+-trees and hash join
    • Chen et al. [SIGMOD00, ICDE04]

Ongoing and future work

• Algorithm re-design for next-generation architectures
  – e.g., graph mining on multi-core architectures
• Cache-conscious optimizations for other data mining and bioinformatics applications on modern architectures
  – e.g., classification algorithms, graph mining
• Out-of-core algorithm designs
• Microprocessor design targeted at data mining algorithms

Conclusions

• Characterized the performance of three popular frequent pattern mining algorithms
• Proposed a tile-able cache-conscious prefix tree
  – Improves spatial locality and allows for cache line pre-fetching
  – Path tiling improves temporal locality
• Proposed a novel thread-based decomposition for improving ILP by utilizing SMT
  – Overall, up to a 4.8-fold speedup
• Effective algorithm design in data mining needs to take modern architectural designs into account

Thanks

• We would like to acknowledge the following grants:
  – NSF: CAREER-IIS-0347662
  – NSF: NGS-CNS-0406386
  – NSF: RI-CNS-0403342
  – DOE: ECPI-DE-FG02-04ER25611