PROFESSOR: DR. GOSTA GRAHNE
LAB INSTRUCTOR: ASHKAN AZARNIK
GROUP 15: ADITYA DEWAL, MOHAMMAD IFTEKHARUL HOQUE SALEH AHMED
Advanced Database Systems and Applications
COMP 6521
PROJECT 1
Develop a program that sorts numbers in ascending order using Two-Phase Multiway Merge Sort (2PMMS), with a limit of 5 MB of virtual memory.
External sorting is required when the data being sorted does not fit into the main memory of a computing device and must instead reside in slower external memory (usually a hard drive).
Our approach to solving the problem
External sorting typically uses a sort-merge technique.
In the sorting phase, chunks of data small enough to fit in main memory are read, sorted in ascending order using the quicksort algorithm, and written out to temporary files.
In the merge phase, the sorted temporary files are combined by a multiway merge into a single larger file.
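The two phases above can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's actual program: the class name `TwoPhaseSortSketch`, the one-number-per-line file format, the `chunkSize` parameter, and the `sortViaDisk` helper are all hypothetical. The JDK's `Arrays.sort` stands in for the quicksort step, and a priority queue picks the smallest head among the sorted runs during the merge.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

class TwoPhaseSortSketch {
    // Phase 1: read up to chunkSize numbers at a time, sort each chunk in
    // memory, and write it out as a sorted temporary run file.
    static List<Path> createSortedRuns(Path input, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            long[] chunk = new long[chunkSize];
            int n = 0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                chunk[n++] = Long.parseLong(line);
                if (n == chunkSize) { runs.add(writeRun(chunk, n)); n = 0; }
            }
            if (n > 0) runs.add(writeRun(chunk, n));
        }
        return runs;
    }

    static Path writeRun(long[] chunk, int n) throws IOException {
        Arrays.sort(chunk, 0, n); // stand-in for the quicksort step
        Path run = Files.createTempFile("run", ".txt");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(run))) {
            for (int i = 0; i < n; i++) out.println(chunk[i]);
        }
        return run;
    }

    // Phase 2: merge all runs at once. Heap entries are {value, runIndex},
    // so the smallest current head across all runs is always on top.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        List<BufferedReader> readers = new ArrayList<>();
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(output))) {
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i));
                readers.add(r);
                String line = r.readLine();
                if (line != null) heap.add(new long[] {Long.parseLong(line), i});
            }
            while (!heap.isEmpty()) {
                long[] head = heap.poll();
                out.println(head[0]);
                String next = readers.get((int) head[1]).readLine();
                if (next != null) heap.add(new long[] {Long.parseLong(next), head[1]});
            }
        } finally {
            for (BufferedReader r : readers) r.close();
            for (Path run : runs) Files.deleteIfExists(run);
        }
    }

    // Hypothetical convenience wrapper: writes data to a temporary file,
    // sorts it through both phases, and reads the result back.
    static long[] sortViaDisk(long[] data, int chunkSize) {
        try {
            Path input = Files.createTempFile("input", ".txt");
            Path output = Files.createTempFile("sorted", ".txt");
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(input))) {
                for (long v : data) out.println(v);
            }
            mergeRuns(createSortedRuns(input, chunkSize), output);
            long[] result = Files.readAllLines(output).stream()
                    .mapToLong(Long::parseLong).toArray();
            Files.delete(input);
            Files.delete(output);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `TwoPhaseSortSketch.sortViaDisk(data, 3)` on seven numbers produces three runs (of sizes 3, 3, and 1) before the merge. In a memory-limited setting, `chunkSize` would be chosen so that one chunk fits within the available memory.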
Challenges
Which algorithm to choose?
Quicksort is one of the fastest and simplest sorting algorithms, because its inner loop can be implemented efficiently on most architectures.
Its average case is efficient compared to other sorting algorithms.
The average-case complexity of quicksort is O(n log n).
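For reference, a minimal in-place quicksort might look like the sketch below. It is an illustration only: the slides do not say which pivot rule or partition scheme the group used, so the middle-element pivot with Hoare-style partitioning and the class name `QuickSortSketch` are assumptions.

```java
class QuickSortSketch {
    // Sorts the array in ascending order, in place.
    static void quickSort(int[] a) {
        quickSort(a, 0, a.length - 1);
    }

    private static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) return;          // zero or one element: already sorted
        int p = partition(a, lo, hi);
        quickSort(a, lo, p);           // left part: elements <= pivot
        quickSort(a, p + 1, hi);       // right part: elements >= pivot
    }

    // Hoare-style partition around the middle element. The tight inner
    // loops (one comparison and one index step each) are what makes
    // quicksort fast in practice.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo - 1;
        int j = hi + 1;
        while (true) {
            do { i++; } while (a[i] < pivot);
            do { j--; } while (a[j] > pivot);
            if (i >= j) return j;
            int t = a[i];
            a[i] = a[j];
            a[j] = t;
        }
    }
}
```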
List of Data Structures
Primitive Types: Boolean, Integer, Long
Abstract Types: Array, String
Arrays (Linear Data Structure): Integer Array, Boolean Array, Long Array
I/O: Scanner, PrintWriter
Buffer Size Experiments
[Figure: The execution time (sec) as a function of the buffer size (KB). X-axis: Buffer Size KB, 0 to 300. Y-axis: Execution Time Sec, 0 to 200. Series: Small, Medium, Large.]
Conclusion
After our buffer size experiments, we concluded that working with 160,000 numbers at a time, which occupy 2.5 MB of memory, gave us the best execution time.
Results from Demo
Our program took 3 minutes to run during the demo.
It was slow because of the way the program read its input and wrote its output.
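The slides do not say what exactly was wrong with the I/O or how it might be fixed. One common approach in Java, sketched here purely as an assumption, is to read and write through explicitly sized buffers and to parse whole lines instead of scanning tokens. The class name `BufferedIoSketch`, the 64 KB `BUFFER_BYTES` value, and the one-number-per-line format are hypothetical.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

class BufferedIoSketch {
    static final int BUFFER_BYTES = 64 * 1024; // assumed size, to be tuned by experiment

    // Writes one number per line through a buffered writer.
    static void writeNumbers(Path file, int[] numbers) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new BufferedWriter(new FileWriter(file.toFile()), BUFFER_BYTES))) {
            for (int n : numbers) out.println(n);
        }
    }

    // Reads one number per line through a buffered reader.
    static int[] readNumbers(Path file) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new FileReader(file.toFile()), BUFFER_BYTES)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) values.add(Integer.parseInt(line));
            }
        }
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    // Hypothetical helper: writes the numbers to a temporary file and reads them back.
    static int[] roundTrip(int[] numbers) {
        try {
            Path file = Files.createTempFile("numbers", ".txt");
            try {
                writeNumbers(file, numbers);
                return readNumbers(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether this would have shortened the 3-minute run is not something the slides allow us to say; the buffer size would need to be measured in the same way as in the buffer size experiments.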
Project 2
Mining Frequent Itemsets from Secondary Memory
Build an application that computes the frequent itemsets of all sizes (pairs, triples, quadruples, etc.) from a set of transactions, based on an input support threshold percentage.
Algorithms Considered
Apriori: Horizontal Data Layout
Eclat: Vertical Data Layout
Apriori: Breadth-First Traversal
Eclat: Depth-First Traversal
ECLAT
Better Execution Time: execution time is better than Apriori's.
Memory Efficient: requires less memory than Apriori if the itemsets are small in number.
Depth-First Search: explore the unexplored.
ECLAT Algorithm
For each item, store a list of transaction ids (tids)
Horizontal Data Layout

TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical Data Layout (TID-list)

A  B   C  D  E
1  1   2  2  1
4  2   3  4  3
5  5   4  5  6
6  7   8  9
7  8   9
8  10
9
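The conversion from the horizontal layout to the vertical one can be sketched as below, using the ten example transactions from this slide. The class and method names are hypothetical, and the maps are an illustrative choice, not necessarily the structures the group used.

```java
import java.util.*;

class VerticalLayoutSketch {
    // Turns tid -> items into item -> tid-list. Transactions are visited
    // in ascending tid order, so every tid-list comes out sorted.
    static Map<String, List<Integer>> toVertical(Map<Integer, List<String>> horizontal) {
        Map<Integer, List<String>> sorted = new TreeMap<>(horizontal);
        Map<String, List<Integer>> vertical = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> row : sorted.entrySet()) {
            for (String item : row.getValue()) {
                vertical.computeIfAbsent(item, k -> new ArrayList<>()).add(row.getKey());
            }
        }
        return vertical;
    }

    // The ten example transactions, with tids 1 to 10.
    static Map<Integer, List<String>> exampleTransactions() {
        String[] rows = {"A,B,E", "B,C,D", "C,E", "A,C,D", "A,B,C,D",
                         "A,E", "A,B", "A,B,C", "A,C,D", "B"};
        Map<Integer, List<String>> transactions = new TreeMap<>();
        for (int i = 0; i < rows.length; i++) {
            transactions.put(i + 1, Arrays.asList(rows[i].split(",")));
        }
        return transactions;
    }
}
```

For the example data, `toVertical(exampleTransactions()).get("A")` is the list 1, 4, 5, 6, 7, 8, 9, which matches column A of the vertical layout.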
Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.
3 traversal approaches: top-down, bottom-up, hybrid
Example:
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
AB: 1, 5, 7, 8
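The intersection step can be sketched on sorted tid-lists as a two-pointer merge. The class name `TidListIntersectionSketch` and the use of plain `int` arrays are assumptions made for illustration.

```java
import java.util.Arrays;

class TidListIntersectionSketch {
    // Intersects two ascending tid-lists. The length of the result is the
    // support count of the combined itemset.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {              // tid present in both lists
                out[n++] = a[i];
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Intersecting the tid-lists of A and B from the example gives 1, 5, 7, 8, so the itemset AB has a support count of 4.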
ECLAT Implementation
List of Data Structures
Primitive Types: Boolean, Integer, Double
Abstract Types: Map, Set, List, Array, String
Arrays (Linear Data Structure): Hash Map (Hash Table), Hash Set (Hash Map), Array List (Dynamic Array), Bit Set (Bit Array), String Array
Trees: Search Tree
Our implementation represents each set of transactions as a bit set.
It intersects rows to determine the support of itemsets.
The search follows a depth-first traversal of a prefix tree, as shown in Figure 1.
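A depth-first Eclat search over bit sets can be sketched as follows. This is an illustration under assumptions, not the group's code: the class name `EclatBitSetSketch` is hypothetical, item names are single letters so that an itemset can be written as a concatenated string, and the data is the ten-transaction example from the earlier slide.

```java
import java.util.*;

class EclatBitSetSketch {
    // Returns every frequent itemset with its support count.
    static Map<String, Integer> mine(Map<String, BitSet> tidSets, int minSupport) {
        List<String> names = new ArrayList<>();
        List<BitSet> rows = new ArrayList<>();
        Map<String, BitSet> sorted = new TreeMap<>(tidSets);
        for (Map.Entry<String, BitSet> e : sorted.entrySet()) {
            if (e.getValue().cardinality() >= minSupport) {
                names.add(e.getKey());
                rows.add(e.getValue());
            }
        }
        Map<String, Integer> frequent = new TreeMap<>();
        extend("", names, rows, minSupport, frequent);
        return frequent;
    }

    // One node of the prefix tree: rows.get(i) holds the tids of the
    // itemset prefix + names.get(i). Children are built by intersecting
    // that row with the rows of the later siblings.
    private static void extend(String prefix, List<String> names, List<BitSet> rows,
                               int minSupport, Map<String, Integer> frequent) {
        for (int i = 0; i < names.size(); i++) {
            String itemset = prefix + names.get(i);
            frequent.put(itemset, rows.get(i).cardinality());
            List<String> nextNames = new ArrayList<>();
            List<BitSet> nextRows = new ArrayList<>();
            for (int j = i + 1; j < names.size(); j++) {
                BitSet common = (BitSet) rows.get(i).clone();
                common.and(rows.get(j)); // bitwise AND is the tid-set intersection
                if (common.cardinality() >= minSupport) {
                    nextNames.add(names.get(j));
                    nextRows.add(common);
                }
            }
            extend(itemset, nextNames, nextRows, minSupport, frequent); // go deeper first
        }
    }

    // Bit sets for the ten example transactions (bit index = tid).
    static Map<String, BitSet> exampleTidSets() {
        String[] rows = {"ABE", "BCD", "CE", "ACD", "ABCD", "AE", "AB", "ABC", "ACD", "B"};
        Map<String, BitSet> tidSets = new TreeMap<>();
        for (int tid = 1; tid <= rows.length; tid++) {
            for (char c : rows[tid - 1].toCharArray()) {
                tidSets.computeIfAbsent(String.valueOf(c), k -> new BitSet()).set(tid);
            }
        }
        return tidSets;
    }
}
```

With a minimum support count of 3, `mine(exampleTidSets(), 3)` finds 11 frequent itemsets, including AB with support 4 and ACD with support 3, while AE (support 2) is pruned.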
Divide and Conquer Phase
Divide the file into N partitions. If an item is frequent in one partition, we do not check it again.
Merge Phase
An item that is not frequent in any single partition may still be frequent globally; such an item is found when we merge.
In the merge phase, we run the algorithm again on the infrequent items.
Example: File size = 10000, Threshold = 2%.
An item is frequent if it occurs >= 200 times.
We get intermediate results by checking all the partitions.
In the merge phase, we work with the infrequent items of each partition, and then merge the results to get the final output list of frequent items.
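The threshold arithmetic and the partition-then-merge flow can be sketched for single items as below. This is one possible reading of the slides, stated as an assumption: the global count threshold (200 in the example) is applied inside each partition, so an item that reaches it in one partition is certainly frequent overall, and the counts of all other items are summed across partitions in the merge step. The class, method, and variable names are hypothetical, and each transaction is assumed to list an item at most once.

```java
import java.util.*;

class PartitionedCountSketch {
    // Converts a percentage threshold into a minimum occurrence count.
    static int minSupportCount(int totalTransactions, double thresholdPercent) {
        return (int) Math.ceil(totalTransactions * thresholdPercent / 100.0);
    }

    static Set<String> frequentItems(List<List<List<String>>> partitions, int minCount) {
        Set<String> frequent = new TreeSet<>();
        Map<String, Integer> pending = new HashMap<>(); // counts of not-yet-frequent items
        for (List<List<String>> partition : partitions) {
            Map<String, Integer> local = new HashMap<>();
            for (List<String> transaction : partition) {
                for (String item : transaction) {
                    // Items already known to be frequent are not checked again.
                    if (!frequent.contains(item)) local.merge(item, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                if (e.getValue() >= minCount) {
                    frequent.add(e.getKey());
                    pending.remove(e.getKey());
                } else {
                    pending.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }
        // Merge step: items infrequent in every partition may still reach
        // the threshold once their partition counts are added up.
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() >= minCount) frequent.add(e.getKey());
        }
        return frequent;
    }

    // Two small hypothetical partitions of three transactions each.
    static List<List<List<String>>> examplePartitions() {
        String[][] raw = {{"A,B", "A", "B,C"}, {"A", "C", "C,D"}};
        List<List<List<String>>> partitions = new ArrayList<>();
        for (String[] part : raw) {
            List<List<String>> transactions = new ArrayList<>();
            for (String row : part) transactions.add(Arrays.asList(row.split(",")));
            partitions.add(transactions);
        }
        return partitions;
    }
}
```

For the slide's numbers, `minSupportCount(10000, 2.0)` is 200. In the small hypothetical data set with a minimum count of 3, no item reaches the threshold inside a single partition, but A and C do once the partition counts are merged.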
Eclat Execution Time
Execution time of Eclat for Small and Medium datasets:
[Figure, two panels: Small Dataset and Medium Dataset. Each plots Eclat Time (ms) against Support, with Support from 0.02 to 0.22. The time axis runs from 0 to 250 ms for the Small Dataset and from 0 to 500 ms for the Medium Dataset.]
Eclat vs. Apriori
We compared the execution times of Apriori and Eclat on the Small and Medium datasets and found the following:
[Figure, two panels: Small Dataset and Medium Dataset. Each plots Apriori Time and Eclat Time (ms) against Support (0 to 0.25) on separate time axes. Small Dataset: Apriori axis 0 to 25000, Eclat axis 0 to 250. Medium Dataset: Apriori axis 0 to 80000, Eclat axis 0 to 500.]
Benefits of Divide and Conquer
The program can process large files. It also gives better performance.
Results from Demo
Execution time was 35 seconds.
REFERENCES
Project 1
Hector Garcia-Molina, Jeff Ullman, and Jennifer Widom, Database Systems: The Complete Book
http://en.wikipedia.org/wiki/Quicksort
Project 2
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=846291&userType=inst
http://www.ece.northwestern.edu/~yingliu/papers/para_arm_cluster.pdf
http://ceur-ws.org/Vol-90/borgelt.pdf
http://www.isca.in/COM_IT_SCI/Archive/v1i1/2.ISCA-RJCITS-2013-001.pdf
http://www.intsci.ac.cn/shizz/fimi.pdf