Empowering Businesses using Yelp Reviews Mining

Vipul Munot, Pralhad Sapre, Nishant Salvi, Neelam Tikone, Rutuja Kulkarni

Fall 2016

Advisor: Prof. Xiaozhong Liu

Z534 ILS Search

Yelp Dataset Mining

Agenda

• Task 1 - Predicting categories of a business (multi-class and multi-label)

• Task 2 - Predicting pros and cons of a business (topic modelling)

Technologies

• MongoDB
• Python
• Gensim
• NLTK
• Scikit-learn
• TextBlob
• R

Exploratory Data Analysis

• Total number of reviews: 2685066
• Total number of businesses: 85901

Ratings for Businesses

Top Categories

• Restaurants: 26729
• Shopping: 12444
• Food: 10143
• Beauty: 7490
• Health & Medical: 6106
• Home Services: 5866
• Nightlife: 5507

Data Preprocessing

• Merge the reviews and businesses on business ID.
• Merge all the reviews for a business ID into a single passage.
• Remove stop words from the reviews.
• Use TF-IDF to create the word vectors.
• Class labels: all categories for that business ID.
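
The preprocessing steps above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the project's actual pipeline: the record layout (`business_id`, `text`, `categories`), the tiny stop-word list, and the plain `tf * log(N / df)` weighting are all assumptions made for the example, since the slides only state that TF-IDF was used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
reviews = [
    {"business_id": "b1", "text": "The coffee is great and the cake is fresh"},
    {"business_id": "b1", "text": "Great coffee"},
    {"business_id": "b2", "text": "The oil change was quick"},
]
businesses = [
    {"business_id": "b1", "categories": ["Food", "Cafes"]},
    {"business_id": "b2", "categories": ["Automotive"]},
]

STOP_WORDS = {"the", "is", "and", "was", "a", "of"}  # tiny illustrative list


def build_passages(reviews):
    """Merge all reviews of a business into one token list, dropping stop words."""
    passages = defaultdict(list)
    for review in reviews:
        tokens = [t for t in review["text"].lower().split() if t not in STOP_WORDS]
        passages[review["business_id"]].extend(tokens)
    return dict(passages)


def tfidf(passages):
    """Weight each term by its term frequency times log(N / document frequency)."""
    n_docs = len(passages)
    doc_freq = Counter(term for tokens in passages.values() for term in set(tokens))
    vectors = {}
    for business_id, tokens in passages.items():
        counts = Counter(tokens)
        vectors[business_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


passages = build_passages(reviews)
vectors = tfidf(passages)
# Class labels: all categories listed for the business.
labels = {b["business_id"]: b["categories"] for b in businesses}
```

On a dataset of this size a library vectorizer would be the practical choice; the sketch is only meant to make the order of the steps concrete.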

Data

Multi-Class: Hotel Chocolate - [Coffee & Tea, Food, Cafes, Chocolatiers & Shops, Specialty Food, Event Planning & Services, Hotels & Travel, Hotels, Restaurants]

Multi-Label: Prediction of at least 45% of the distinct labels that “Hotel Chocolate” has.

Task 1

Prediction of Business categories using reviews

• Naive Bayes
• Logistic Regression
• Random Forest

We built the Naive Bayes classifier from the ground up. For the rest we used scikit-learn.
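
The slides do not show the hand-built classifier, so the following is only a minimal sketch of one common variant: multinomial Naive Bayes with add-one (Laplace) smoothing, extended to return the top-k most probable labels. How multi-label training data was handled is an assumption here, with each (document, label) pair treated as one training instance. The class name `TinyNaiveBayes` and its methods are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, documents, label_sets):
        self.label_counts = Counter()            # training instances per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()
        for tokens, labels in zip(documents, label_sets):
            for label in labels:  # assumption: one instance per (doc, label) pair
                self.label_counts[label] += 1
                self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log prior plus smoothed log likelihood for every label."""
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.label_counts.items():
            words = self.word_counts[label]
            denominator = sum(words.values()) + vocab_size
            score = math.log(count / total)  # log prior
            for token in tokens:
                if token in self.vocabulary:  # ignore words never seen in training
                    score += math.log((words[token] + 1) / denominator)
            scores[label] = score
        return scores

    def predict_top_k(self, tokens, k=7):
        """Return the k labels with the highest scores, best first."""
        scores = self.log_scores(tokens)
        return sorted(scores, key=scores.get, reverse=True)[:k]


model = TinyNaiveBayes().fit(
    [["coffee", "cake"], ["oil", "tires"], ["coffee", "espresso"]],
    [["Food", "Cafes"], ["Automotive"], ["Cafes"]],
)
top_labels = model.predict_top_k(["coffee"], k=2)
```

Working in log space avoids numeric underflow when a passage contains many tokens, which matters when all reviews of a business are merged into one document.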

Challenges Faced

• Multi-class classification
• Adapting existing classifiers (one vs all)
• Preprocessing the data (engineering problem)
• Defining our own accuracy function based on the nature of the problem (partial accuracy - 45% in our case)
• The labels assigned are not mutually exclusive
• There is an inherent class hierarchy, which could be learned by association rule mining

Adapting classifiers to multi-class, multi-label problems

• Make a probabilistic prediction
• Take the top 7 categories
• Accurate += 1 if |prediction ∩ truth| > len(truth) * 0.45
• This is the idea of a partial match

e.g.

"predicted_labels" : [ "Automotive", "Oil Change Stations", "Auto Repair", "Tires", "Shopping", "Auto Parts & Supplies", "Gas & Service Stations" ]

"labels" : [ "Automotive", "Auto Parts & Supplies" ]

Evaluation Metrics

• Hamming Loss: the fraction of wrong labels to the total number of labels

• Hamming Score: the number of correct labels divided by the size of the union of predicted and true labels
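
For a single business, the two Hamming metrics can be computed as in the sketch below. The function names and the five-label space are assumptions made for illustration; "wrong labels" is read here as labels that are either predicted but not true, or true but not predicted.

```python
def hamming_loss(predicted, truth, all_labels):
    """Share of labels in the label space that are wrongly included or missed."""
    wrong = set(predicted) ^ set(truth)  # symmetric difference
    return len(wrong) / len(all_labels)


def hamming_score(predicted, truth):
    """Correct labels divided by the union of predicted and true labels."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        return 1.0  # nothing to predict and nothing predicted
    return len(predicted & truth) / len(union)


all_labels = ["Automotive", "Tires", "Shopping", "Food", "Nightlife"]
predicted = ["Automotive", "Tires", "Shopping"]
truth = ["Automotive", "Food"]

loss = hamming_loss(predicted, truth, all_labels)  # 3 wrong out of 5 labels
score = hamming_score(predicted, truth)            # 1 correct out of 4 in the union
```

A dataset-level figure would then be the mean of these per-business values.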

Evaluation metrics (contd.)

• Precision: the fraction of retrieved instances that are relevant.

• Recall: the fraction of relevant instances that are retrieved.
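
Applied to label sets, "retrieved" corresponds to the predicted labels and "relevant" to the true labels. The sketch below is one way to compute both per business; the function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions.

```python
def precision_recall(predicted, truth):
    """Per-example precision and recall over sets of labels."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


predicted = [
    "Automotive", "Oil Change Stations", "Auto Repair", "Tires",
    "Shopping", "Auto Parts & Supplies", "Gas & Service Stations",
]
truth = ["Automotive", "Auto Parts & Supplies"]

precision, recall = precision_recall(predicted, truth)
```

In this example, predicting seven labels for a business with two true labels gives full recall (2 of 2) but low precision (2 of 7).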

Naive Bayes

• Avg precision - 0.33
• Avg recall - 0.80
• Avg Hamming score - 0.31

Performance of classifiers (Partial Label Match - 45%)

Businesses                     Classifier (One vs All)   Accuracy (75%-25% split)
0 - 80000 (full set)           Naive Bayes               90.94
0 - 20000 (537K reviews)       Random Forest             76.18
20000 - 40000 (897K reviews)   Random Forest             75.97
20000 - 40000 (897K reviews)   Logistic Regression       90.64
40000 - 60000 (574K reviews)   Random Forest             68.62
40000 - 60000 (574K reviews)   Logistic Regression       89.61

Task 2: Objectives

• The prediction goal was to identify the words, phrases, ratings, and patterns that predict the pros and cons of a business.

• We also extract the good and bad features of every restaurant, which can help in providing suggestions to Yelp users.

Task 2: Tools and Techniques

• Gensim: used for applying the LDA algorithm
• TextBlob: used for assigning POS tags
• NLTK: used for removing stop words, extracting nouns, and creating the bag of words
• LDA (Latent Dirichlet Allocation): used for grouping similar terms from negative and positive reviews together and associating a name with each grouping

Task 2: Build Model

Task 2: Utilize Model

Task 2: Analysis

Top 10 Good Topics

1. Customer Service
2. Food
3. Bar & Liquor
4. Overall Quality
5. Mexican Food
6. Breakfast
7. Ambiance & Hospitality
8. Expensive
9. Location
10. Entertainment

Task 2: Analysis

Top 10 Bad Topics

1. Staff and Service
2. Coffee and Cake
3. Ambiance and Hospitality
4. Bad Service
5. Pet Friendliness
6. Delivery Services
7. Entertainment
8. Parking and Utilities
9. Food
10. Mexican Food

Task 2: Results

• Business ID - 1vQLTKwmcmZXtNzfKEvMmA
• Good points - Food, Mexican Food, Overall Quality
• Bad points - Delivery Services, Staff and Service

Future Scope

• Use association rules to define a hierarchy of labels

• Devise a formula to convert good and bad topics into a rating

• Collect human feedback to evaluate Task 2

Questions?

Thank You