Data Crackers on YELP Dataset
- Prashanth Sandela (PS)
- Vimal Chandra Gorijala (VG)
- Parineetha Tirumali Gandhi (PG)
Agenda
Project Vision
Data Mining Tasks
Hypothesis
Data
Solutions
Experiment results
Models Comparison
Conclusion
Timelines
Project Vision
The Yelp dataset covers a variety of businesses for which users write reviews and give star ratings.
Our task is to classify user star ratings based on the review text, the business, and the user.
We applied several data mining models, together with model tuning techniques, to maximize classification accuracy.
Data Mining Task
This is a multi-class classification problem.
Data Processing:
• Converting to CSV
• Stop word removal
• Special character removal
• Lower case conversion
• Consolidation of dataset
Formulated Models:
• Naïve Bayes implementation in HIVE
• Naïve Bayes Multinomial classification using WEKA
• Naïve Bayes Multinomial Text (best case)
• Decision Tree
• KNN
Evaluation Metrics:
• N-grams
• Features
• Percentage division of test and training data
• Accuracy
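The text-cleaning steps listed under Data Processing can be sketched in a few lines of Python. This is only an illustration: the deck says Pentaho Data Integration was used, not Python, and the stop-word list and the function name `clean_review` below are assumptions made up for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption; the deck does not say which list was used).
STOP_WORDS = {"a", "an", "and", "is", "the", "this", "was", "it", "of", "to"}

def clean_review(text):
    """Lower-case the text, strip special characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean_review("The pizza was GREAT!!! Service... not so much :("))
# ['pizza', 'great', 'service', 'not', 'so', 'much']
```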
Hypothesis
Because the classification is based on text, Naïve Bayes could be a good model to consider.
Dividing text into n-grams will increase accuracy. We expected unigrams to give the best accuracy; this assumption proved wrong, as bigrams gave better accuracy.
Using business id and user id together will not give better accuracy than using either one as an individual feature.
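As a small illustration of the n-gram hypothesis, the sketch below splits a token list into unigrams and bigrams. The helper name `ngrams` and the sample tokens are made up for the example. One possible reason bigrams helped (the deck does not state it) is that a bigram such as "not good" keeps a negation that unigrams split apart.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["food", "was", "not", "good"]
print(ngrams(tokens, 1))  # ['food', 'was', 'not', 'good']
print(ngrams(tokens, 2))  # ['food was', 'was not', 'not good']
```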
Data
[Chart: count of reviews by class. X-axis: Review Stars (Negative, Neutral, Positive); y-axis: Count. Count labels: 213509, 163761, 748188. Share labels: 66.5%, 13.5%, 20%.]
It contains 1.3 million reviews, 40 thousand businesses, and 250 thousand users.
Classify review stars based on review text. Stars range from 1 to 5. To simplify the problem, we map stars 4 and 5 to 1, star 3 to 0, and stars 1 and 2 to -1.
Features:
• Business Id
• Review Id
• User Id
• Review text
• Stars (Ratings)
Classification:
• Negative
• Neutral
• Positive
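The reduction from five star values to three classes can be written as a small mapping. The function name `star_to_class` and the `LABEL_NAMES` dictionary are hypothetical names chosen for this sketch; the mapping itself (4 and 5 to 1, 3 to 0, 1 and 2 to -1) is the one stated on the slide.

```python
LABEL_NAMES = {1: "POSITIVE", 0: "NEUTRAL", -1: "NEGATIVE"}

def star_to_class(stars):
    """Map a 1-5 star rating to the reduced class label 1, 0, or -1."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"stars must be an integer from 1 to 5, got {stars!r}")
    if stars >= 4:
        return 1    # 4 or 5 stars: positive
    if stars == 3:
        return 0    # 3 stars: neutral
    return -1       # 1 or 2 stars: negative

for s in range(1, 6):
    print(s, LABEL_NAMES[star_to_class(s)])
```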
Tools and Technologies Used
• Pentaho Data Integration
• HIVE (in AWS, using S3 and EC2)
• WEKA
• Experimented with H2O
Solution
Naïve Bayes (Implementation in HIVE)
Naïve Bayes can be implemented in HIVE. Doing so is scalable, with no limit on the size of the data. The main effort is that every step has to be expressed as a query.
Naïve Bayes Multinomial (WEKA)
A classifier that models each feature with a multinomial distribution. It is well suited to multi-class classification. Initially we had five classes for the reviews, which we reduced to three (positive, neutral, negative).
Naïve Bayes Multinomial Text (WEKA)
A Bayesian classifier that operates only on string attributes, which suits text data best. Other attribute types are accepted but are ignored during training and classification. It uses word frequencies rather than a bag-of-words representation.
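To make the idea behind multinomial Naïve Bayes concrete, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. It is not WEKA's implementation and not the team's HIVE code; the function names, the smoothing choice, and the toy reviews are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: one class label per document."""
    class_counts = Counter(labels)        # documents per class (for the prior)
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the class with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                                 # skip unseen words
            # Add-one smoothed likelihood of the word given the class.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data for illustration only (1 = positive, 0 = not positive).
docs = [["laptop", "is", "good"], ["hp", "is", "bad"], ["lenovo", "is", "good"],
        ["pizza", "is", "good"], ["pizza", "is", "bad"], ["product", "is", "bad"]]
labels = [1, 0, 1, 1, 0, 0]
model = train_multinomial_nb(docs, labels)
print(predict(model, ["product", "is", "good"]))  # 1
print(predict(model, ["pizza", "is", "bad"]))     # 0
```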
Decision Tree (WEKA)
Instances are described by a fixed set of attributes and their values.
Suited to almost all kinds of input, such as text, numeric, and nominal data.
Easily extends to learning functions with more than two possible outcomes.
Learning methods are robust to errors
KNN (WEKA)
Unknown instances are classified by relating them to known instances according to some distance/similarity function.
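A minimal sketch of KNN classification with the two distance functions compared later in the deck (Manhattan and Euclidean). This is not WEKA's implementation; the feature vectors (counts of the words "good" and "bad"), the labels, and k = 3 are assumptions chosen to keep the example small, whereas the experiments in the deck used 5NN.

```python
from collections import Counter

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, query, k, distance):
    """train: list of (vector, label). Return the majority label of the k nearest vectors."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy vectors: [count of "good", count of "bad"]; labels use the 1 / 0 / -1 scheme.
train = [([2, 0], 1), ([1, 0], 1), ([0, 2], -1), ([0, 1], -1), ([1, 1], 0)]

print(knn_predict(train, [3, 0], 3, manhattan))  # 1
print(knn_predict(train, [3, 0], 3, euclidean))  # 1
print(knn_predict(train, [0, 3], 3, manhattan))  # -1
```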
Experiment Results (Naïve Bayes Multinomial & Multinomial Text)
                                  Naïve Bayes Multinomial   Naïve Bayes Multinomial Text
Initial %                         48                        54
Stopwords                         53                        56
Stemmer                           54                        59
Unigrams                          59                        61
Min Word Frequency from 5 - 10    65                        66
Bigrams                           70                        71
Trigrams                          63                        62
Business and user id              63.5                      63
Business id                       74                        75
User id                           74                        74.7
Attribute Selection Filter        78                        78.4
Bag of Words Count                79
Overall Accuracy                  79.49                     79.6
Experiment Results (KNN & Decision Tree)
                        5NN        Decision Tree
Initial %               68.2535    72.4138
Stopwords               68.2535    72.6521
Stemmer                 68.2535    72.9523
Unigrams                64.5532    73.1538
Bigrams                 65.5532    73.3251
Trigrams                65.5532    72.5216
Business and user id    66.3256    69.5364
Business id             69.6529    73.3526
User id                 70.1253    74.2596
Bag of words            71.1253    74.2596
Overall accuracy        71.1253    74.2596

                        Manhattan  Euclidean
Bi/Trigrams             65.5131    64.5532
Stemmer (Lovins)        68.2535    68
Words to Keep (5000)    71.1253    70
Experiment Results (Naïve Bayes in HIVE)
Sl. No  Action                                  *Probability Model
1       Initial Dataset                         44%
2       Refining of Training and Test Data      7%
3       Change of Stemmer                       0.50%
4       Ngrams:
          Unigrams                              3%
          Bigrams                               7%
          Trigrams                              2%
5       Including Features:
          Business id and User id               -1%
          Business id                           2%
          User id                               4%
6       Bag of words                            5%
7       Overall Accuracy on 100,000 records     ~72%
8       Accuracy on complete dataset            ~74%
Experiment Results (Naïve Bayes Multinomial)
Sl. No  Action                                  *Naïve Bayes Multinomial
1       Initial Dataset                         46%
2       Refining of Training and Test Data      N/A
3       Change of Stemmer                       N/A
4       Ngrams:
          Unigrams                              3.50%
          Bigrams                               7.50%
          Trigrams                              2%
5       Including Features:
          Business id and User id               1%
          Business id                           3%
          User id                               5%
6       Bag of words                            4%
7       Overall Accuracy on 100,000 records     ~76%
8       Accuracy on complete dataset            N/A
Comparison
Use Case (Why not business id and User id together)

Training Data
Business id  User id  Review text       Stars
1            1        Laptop is good    1
1            2        HP is bad         0
1            3        Lenovo is good    1
2            1        Pizza is good     1
2            4        Pizza is bad      0
3            1        Product is bad    0

Test Data
Business id  User id  Review text
1            1        Product is good
2            4        Pizza is good

Business id
Bus id   Words                 Probability  Stars
1        Laptop, HP, Lenovo    0.15         1
1        Good                  0.35         1
1        Bad                   0.15         0
2        Pizza                 0.5          1
2        Good                  0.25         1
2        Bad                   0.25         0

User id
User id  Words                 Probability  Stars
1        Laptop, Pizza         0.16         1
1        Product               0.16         0
1        Good                  0.32         1
1        Bad                   0.16         0
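One possible reading of this use case, which the slide does not spell out, is that conditioning on the (business id, user id) pair leaves far fewer training reviews to estimate word probabilities from than conditioning on either id alone. The sketch below counts the matching training rows for each test row, using the ids from the Training Data and Test Data tables; the variable names are made up for the example.

```python
from collections import Counter

# (business_id, user_id, stars) rows from the Training Data table.
training = [(1, 1, 1), (1, 2, 0), (1, 3, 1), (2, 1, 1), (2, 4, 0), (3, 1, 0)]
# (business_id, user_id) rows from the Test Data table.
test = [(1, 1), (2, 4)]

by_business = Counter(b for b, _, _ in training)
by_user = Counter(u for _, u, _ in training)
by_pair = Counter((b, u) for b, u, _ in training)

for b, u in test:
    print(f"business {b}: {by_business[b]} rows, "
          f"user {u}: {by_user[u]} rows, "
          f"pair ({b}, {u}): {by_pair[(b, u)]} rows")
# business 1: 3 rows, user 1: 3 rows, pair (1, 1): 1 rows
# business 2: 2 rows, user 4: 1 rows, pair (2, 4): 1 rows
```

With the pair as the key, each test row is backed by a single training review, so any per-word probability for that pair rests on one example.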
Project Management
Team: PG, VG, PS
Timeline: 2014, weeks 1 to 13
Milestones:
• Project Proposal: 9/11/2014
• Report 1 - Initial Attempt: 9/30/2014
• Report 2 - Data Processing & Feature Selection: 10/28/2014
• Report 3 - Model Selection & Tuning: 11/27/2014
Phases: Project Proposal, Initial Attempt, Data Processing & Feature Selection, Model Selection & Tuning
Tasks shown on the timeline:
• Dataset Decision
• Data Mining Problems
• Key Attributes
• Machine Learning Models
• Data Sampling
• Data Cleaning
• Model Selection and implementation
• Future Task
• Data Quality Problems
• Data Processing
• Feature Selection
• Feature Extraction
• Model Selection
• Model Tuning & Results Comparison
Thank You!!
Questions…?