Frequent Word Combinations Mining and Indexing on HBase
Hemanth Gokavarapu, Santhosh Kumar Saminathan
Introduction
Many projects use HBase to store large amounts of data for distributed computation
Processing this data becomes a challenge for programmers
Frequent terms help us in many areas of machine learning
e.g., frequently purchased items, frequently asked questions, etc.
Problem
These projects build indexes over the data stored in HBase
Using these indexes, the frequency of a single word is easy to find
It is hard to find the frequency of a combination of words
For example: “cloud computing”
Searching for these words separately may return results like “scientific computing” or “cloud platform”
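The distinction above can be made concrete: counting an exact word combination requires matching adjacent tokens, not the individual words. A minimal illustrative sketch (the documents and the `phrase_frequency` helper are invented for this example, not part of the project):

```python
def phrase_frequency(docs, phrase):
    """Count occurrences of an exact word combination across documents."""
    words = phrase.lower().split()
    n = len(words)
    count = 0
    for doc in docs:
        tokens = doc.lower().split()
        # Slide a window of len(words) over the token stream
        count += sum(1 for i in range(len(tokens) - n + 1)
                     if tokens[i:i + n] == words)
    return count

docs = [
    "cloud computing scales scientific computing workloads",
    "a cloud platform supports cloud computing services",
]
print(phrase_frequency(docs, "cloud computing"))  # 2 (exact phrase only)
print(sum(d.split().count("computing") for d in docs))  # 3 (single word, overcounts)
```

A single-word index would report three hits for “computing”, but only two of them belong to the phrase “cloud computing”.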
Objective
This project focuses on finding the frequency of a combination of words
We use the concepts of data mining and the Apriori algorithm
We will use MapReduce and HBase for this project
Survey Topics
Apriori Algorithm
HBase
MapReduce
Data Mining
What is Data Mining?
The process of analyzing data from different perspectives
and summarizing it into useful information
Data Mining
How does Data Mining work?
Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries
What technological infrastructure is needed?
Two critical technological drivers answer this question:
Size of the database
Query complexity
Apriori Algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules
Association rules are a widely applied data mining approach
Association rules are derived from frequent itemsets
Apriori uses a level-wise search based on the frequent-itemset (downward-closure) property
Algorithm Flow
Apriori Algorithm & Problem Description
Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt
If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.
Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%
If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes ⇒ Jacket   Support = 50%, Confidence = 66%
Jacket ⇒ Shoes   Support = 50%, Confidence = 100%
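The support and confidence figures above follow directly from the definitions (support = fraction of transactions containing the itemset; confidence of X ⇒ Y = support(X ∪ Y) / support(X)). A small sketch checking them against the table, with helper names invented for this example:

```python
transactions = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Shoes", "Jacket"}))       # 0.5
print(confidence({"Shoes"}, {"Jacket"}))  # 0.666... (50% / 75%)
print(confidence({"Jacket"}, {"Shoes"}))  # 1.0
```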
Apriori Algorithm Example
Database D (min. support = 50%, i.e. 2 of 4 transactions):

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:  {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
Prune  → L1:  {1}: 2, {2}: 3, {3}: 3, {5}: 3

Join L1 → C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D  → C2: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
Prune   → L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

Join L2 → C3: {2 3 5}
Scan D  → L3: {2 3 5}: 2
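The three passes of the example can be reproduced step by step. The sketch below mirrors the slide's C1/L1 → C2/L2 → C3/L3 progression (the join here does not apply the prune step, so C3 may briefly hold candidates the slide already discarded):

```python
from itertools import combinations

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
min_count = 2  # 50% of 4 transactions

def support_count(itemset):
    """Number of transactions in D containing the whole itemset."""
    return sum(1 for t in D.values() if itemset <= t)

# C1 -> L1: single items surviving the first scan
L1 = [frozenset([i]) for i in {1, 2, 3, 4, 5}
      if support_count(frozenset([i])) >= min_count]
# C2 -> L2: pairs joined from L1, pruned by a second scan
C2 = {a | b for a, b in combinations(L1, 2)}
L2 = [c for c in C2 if support_count(c) >= min_count]
# C3 -> L3: triples joined from L2, pruned by a third scan
C3 = {a | b for a, b in combinations(L2, 2) if len(a | b) == 3}
L3 = [c for c in C3 if support_count(c) >= min_count]

print(sorted(sorted(s) for s in L3))  # [[2, 3, 5]]
```

The output agrees with the slide: {2 3 5} is the only frequent 3-itemset, with support count 2.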
Apriori Advantages & Disadvantages
ADVANTAGES:
Uses the large-itemset property
Easily parallelized
Easy to implement
DISADVANTAGES:
Assumes the transaction database is memory-resident
Requires many database scans
HBase
What is HBase?
A Hadoop database
Non-relational
An open-source, distributed, versioned, column-oriented store
Modeled after Google Bigtable
Runs on top of HDFS (the Hadoop Distributed File System)
MapReduce
A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster)
Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured)
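The map/shuffle/reduce pipeline described above can be simulated in a few lines. This is a generic word-count illustration of the programming model, not the project's Hadoop code; all names here are invented for the sketch:

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for each word in an input line."""
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce: sum all counts emitted for one key."""
    return key, sum(values)

def run_mapreduce(records):
    # Shuffle: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

counts = run_mapreduce(["hbase stores data", "hbase scales"])
print(counts["hbase"])  # 2
```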
MapReduce
Mappers and Reducers
Mappers:
FrequentItemsMap – finds the combinations and assigns the key/value pair for each combination
CandidateGenMap
AssociationRuleMap
Reducers:
FrequentItemsReduce
CandidateGenReduce
AssociationRuleReduce
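The slide names the mapper and reducer classes but does not show their bodies. The sketch below is a hypothetical Python rendering of what the FrequentItemsMap / FrequentItemsReduce pair might do (the project itself presumably implements these as Hadoop classes); the function bodies and the driver loop are assumptions:

```python
from collections import defaultdict
from itertools import combinations

def frequent_items_map(transaction, k):
    """Emit each k-item combination in a transaction as (combination, 1)."""
    for combo in combinations(sorted(transaction), k):
        yield combo, 1

def frequent_items_reduce(combo, counts, min_count):
    """Keep a combination only if its total count meets the support threshold."""
    total = sum(counts)
    return (combo, total) if total >= min_count else None

# Simulated shuffle + reduce over the example database, k = 2
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
shuffled = defaultdict(list)
for t in transactions:
    for key, value in frequent_items_map(t, 2):
        shuffled[key].append(value)
frequent = dict(r for r in (frequent_items_reduce(c, v, 2)
                            for c, v in shuffled.items()) if r)
print(frequent[(2, 5)])  # 3
```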
Flow Chart
Start → Find Frequent Items → Find Candidate Itemsets → Find Frequent Items → Set null? (No: repeat from Find Candidate Itemsets; Yes: Generate Association Rules)
Schedule
1 week – talking to the experts at FutureGrid
1 week – survey of HBase and the Apriori algorithm
4 weeks – implementing the Apriori algorithm
2 weeks – testing the code and getting results
Results
Conclusion
Execution takes more time on a single node
As the number of mappers increases, performance improves
When the data is very large, single-node execution takes more time and behaves erratically
Screenshot
Known Issues
When frequencies are very low in a large dataset, the reducer takes more time
e.g., a text paragraph in which words are not repeated often
Future Work
The analysis can be repeated with Twister and other platforms
The algorithm can be extended to other applications that use machine learning techniques
References
http://en.wikipedia.org/wiki/Text_mining
http://en.wikipedia.org/wiki/Apriori_algorithm
http://hbase.apache.org/book/book.html
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx
http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
Questions?