
Frequency of Word Combinations using Apriori Algorithm
B649 Term Project - Group 9

Hemanth Gokavarapu                    Santhosh Kumar Saminathan
[email protected]                     [email protected]

Introduction:

Data processing and data mining are becoming more popular and vital in the modern days of data-intensive computing. We execute the computations in parallel in order to increase speed and performance. In this scenario, we have to get the right data at the right time effectively. Our project concentrates on searching for the right combinations of words in a big data set to find the correct pairs. In this way we can get the correct combinations of words from a huge data set. Our approach uses the Apriori algorithm. The project is implemented using the MapReduce framework and the HBase database.

Motivation:

Frequently accessed itemsets are very useful in real-world applications for providing better solutions. For example, an e-commerce application is very interested in knowing what customers have bought frequently in the past. Similarly, frequently asked queries or questions are useful in many modern applications. In our project we calculate the frequency of exact combinations of words using the Apriori algorithm.

Approach:

The Apriori algorithm is used to find the association rules for a given transaction set. It finds the most frequent subsets by using association rule mining. It follows a bottom-up approach in which frequent subsets are extended one item at a time (the candidate generation step), and groups of candidates are tested against the data. The algorithm terminates when no further extensions are found. This approach is explained in the flow chart below. Counting the number of times each item occurs in the given set forms the initial set of frequent sets. From this set the candidate itemsets are generated for each combination, leading to the next level. For the newly formed set, the frequency is determined by checking it against the data. The following measures are considered when eliminating itemsets:

Confidence(A -> B) = (Number of tuples containing both A and B) / (Number of tuples containing A)
Support(A -> B) = (Number of tuples containing both A and B) / (Total number of tuples)
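The report itself contains no code, so as a minimal illustration of these two measures, the sketch below computes support and confidence for a word pair over a small in-memory list of transactions. The class name, method name and sample data are ours rather than the project's; in the actual project the counts would come from scanning the full data set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal illustration of the support and confidence measures used by Apriori.
// Transaction contents and all names here are illustrative only.
public class SupportConfidence {

    // Counts how many transactions contain every item in 'items'.
    static long countContaining(List<Set<String>> transactions, Set<String> items) {
        return transactions.stream().filter(t -> t.containsAll(items)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("data", "mining", "apriori")),
                new HashSet<>(Arrays.asList("data", "mining")),
                new HashSet<>(Arrays.asList("data", "cloud")),
                new HashSet<>(Arrays.asList("mining", "apriori")));

        Set<String> a = new HashSet<>(Arrays.asList("data"));
        Set<String> ab = new HashSet<>(Arrays.asList("data", "mining"));

        double support = (double) countContaining(transactions, ab) / transactions.size();
        double confidence = (double) countContaining(transactions, ab)
                / countContaining(transactions, a);

        // support(data -> mining) = 2/4 = 0.50, confidence(data -> mining) = 2/3
        System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
    }
}
```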


The next step is to generate the set of candidate items for the next level. This operation is repeated until the candidate set becomes empty; once the set is empty, the association rules are formed.
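The report does not show how this level-wise step is coded. The sketch below is one possible version, under our own naming: it joins frequent k-itemsets into (k+1)-candidates and prunes any candidate that has an infrequent k-subset, which is the standard Apriori join-and-prune idea rather than the project's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the Apriori level-wise step: join frequent k-itemsets to form
// (k+1)-candidates, then prune any candidate with an infrequent k-subset.
public class CandidateGeneration {

    static List<Set<String>> nextCandidates(List<Set<String>> frequentK) {
        List<Set<String>> candidates = new ArrayList<>();
        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                Set<String> union = new TreeSet<>(frequentK.get(i));
                union.addAll(frequentK.get(j));
                // Join step: keep only unions that are exactly one item larger.
                if (union.size() != frequentK.get(i).size() + 1 || candidates.contains(union)) {
                    continue;
                }
                // Prune step: every k-subset of the candidate must itself be frequent.
                boolean allSubsetsFrequent = true;
                for (String item : union) {
                    Set<String> subset = new TreeSet<>(union);
                    subset.remove(item);
                    if (!frequentK.contains(subset)) {
                        allSubsetsFrequent = false;
                        break;
                    }
                }
                if (allSubsetsFrequent) {
                    candidates.add(union);
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Frequent 1-itemsets; the next level contains their pairwise unions.
        List<Set<String>> level1 = new ArrayList<>();
        level1.add(new TreeSet<>(Arrays.asList("data")));
        level1.add(new TreeSet<>(Arrays.asList("mining")));
        level1.add(new TreeSet<>(Arrays.asList("apriori")));
        System.out.println(nextCandidates(level1));
    }
}
```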

[Flow chart: level-wise generation and testing of candidate itemsets]

Implementation:

In this project we use the MapReduce framework for parallel execution and HBase to store the results.

MapReduce:

MapReduce is a framework introduced by Google to perform large computations over large sets of data in a distributed environment, on a cluster or cloud platform. With MapReduce, the computation can be executed in parallel, which increases performance. It has Mapper and Reducer tasks. The Mapper partitions the input into sub-problems and assigns them to the worker nodes; the worker nodes process these sub-problems. The Reducer collects all the output sent by the worker nodes and forms the final output.
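The project's Hadoop sources are not included in the report; the following is a minimal sketch of a Mapper/Reducer pair, written against Hadoop's org.apache.hadoop.mapreduce API, that counts single-word frequencies, corresponding to the first Apriori level. The class names are ours, and the actual project code may differ.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a Mapper/Reducer pair that counts single-word frequencies,
// i.e. the first Apriori level (1-itemsets). Names are illustrative only.
public class FrequentItems {

    public static class FrequentItemsMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    public static class FrequentItemsReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the partial counts; items below the support threshold would be
            // dropped here before moving to the next Apriori level.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```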


HBase:

HBase is a non-relational, open-source database modeled after Google's Bigtable. It runs on top of HDFS. In our project, every time we calculate the frequent sets we store the values in the HBase database (a minimal write sketch is shown at the end of this section).

We use three map tasks in our project, namely FrequentItemsMap, CandidateGenMap and AssociationRuleMap. FrequentItemsMap calculates the initial frequency of the items, CandidateGenMap calculates the candidate sets for the intermediate results, and AssociationRuleMap calculates the association rules. Similarly, we have the following reducers: FrequentItemsReduce, CandidateGenReduce and AssociationRuleReduce.

Time Schedule:

1 week  – Talking to the experts at FutureGrid to understand the problem
2 weeks – Survey of HBase and the Apriori algorithm
4 weeks – Kick-start and work on implementing the Apriori algorithm
2 weeks – Testing the code and getting the results
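The report does not show how the results are written to HBase. Below is a minimal sketch using the current HBase client API (which post-dates the original project), storing one frequent itemset and its count; the table name "frequent_sets", the column family "f" and the row-key layout are assumptions of ours, not taken from the report.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of storing one frequent itemset and its count in HBase.
// Table name, column family and row-key scheme are assumed, not from the report.
public class FrequentSetStore {

    public static void storeCount(String itemset, long count) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("frequent_sets"))) {
            // Row key: the itemset itself (e.g. "data mining"); one column holds the count.
            Put put = new Put(Bytes.toBytes(itemset));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(count));
            table.put(put);
        }
    }
}
```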

Results:

We implemented the project using Hadoop and measured the computation time for 2, 4 and 6 mappers, obtaining results for both single-node and multi-node environments.

Using the results from this program, we developed a web interface: when a word is typed, it returns the corresponding combinations along with their frequencies.
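The report does not describe how the web interface queries the stored combinations. One possible lookup, assuming the table layout from the write sketch above, is a prefix scan over the row keys for the queried word; everything here, including the filter choice, is an assumption rather than the project's actual code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the lookup a web interface might run: scan for all stored itemsets
// whose row key starts with the queried word and print their frequencies.
// The table layout matches the (assumed) write-side sketch above.
public class CombinationLookup {

    public static void printCombinations(String word) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("frequent_sets"))) {
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes(word)));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    String itemset = Bytes.toString(result.getRow());
                    long count = Bytes.toLong(
                            result.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
                    System.out.println(itemset + " : " + count);
                }
            }
        }
    }
}
```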


Conclusion:

From the results we obtained, we conclude that the execution time is higher for a single node. As the number of mappers is increased in the multi-node environment, we see better performance in terms of time. When the data is extensively large, single-node execution takes more time and sometimes behaves erratically.

Acknowledgement:

We thank Professor Judy Qiu for helping us by clarifying our doubts and kindling our thoughts to come up with new ideas and implementations. We also thank the assistant instructor, Stephen, who helped us by giving details about the errors we encountered.

Future Work:

The project can be enhanced further in many ways. We have implemented it using Hadoop; the same project can also be implemented in Twister, and a performance analysis of both can be carried out to give a better view of each. This simple algorithm can be molded into many real-world applications that involve machine-learning techniques.