Yelp Business Rating Prediction

8/18/2019 Yelp Business Rating Prediction


Rahul


Tamilmani Manoharan


Venkatakrishnan


ABSTRACT

Yelp has an enormous amount of data on a variety of local businesses, but it provides only the aggregated rating for each business. The rich information about businesses, users and reviews can be used to predict how a restaurant's reviews have been changing, how various features of a business affect its ratings, how the behaviour of users impacts the ratings, and so on. Several studies examine how the various features of a business affect its ratings and how much impact review text has on the ratings a business receives. In this project, we investigate various features of the Yelp data to build models for business rating prediction. We consider both the business features and the review text when generating these models.

Keywords

Yelp, Ratings, Prediction, Feature Extraction, Linear Regression, NLP

1. INTRODUCTION

Yelp, Inc. is a company that operates a social networking, user review, and local search web site of the same name. Over 31 million people access Yelp's website each month, putting it in the top 150 of U.S. Internet web sites. The company also trains small businesses in how to respond to reviews, hosts social events for reviewers, and provides data about businesses, including health inspection scores.

The goal of our project is to predict the ratings of various businesses given their features, their reviews and the data about the users who have written the reviews. We have approached this problem in two different ways:

i) Predict the ratings of different businesses from their given aggregated features and the features we have extracted for each business by combining user and review data with the business data.

ii) Learn how the ratings of a particular business have been changing over time and predict how they will change in the future.

2. PRIOR WORK

In the paper "Semantic Feature Analysis and Mining for Yelp Rating Prediction", the author predicts review ratings from the average business rating, the average user rating, the review count and the number of votes. The average rating was taken as the initial feature, and the train and test MSE were calculated. The MSE decreased when features such as review count and vote count were added to the average rating. The author also performed topic modeling and calculated topic scores for each review. Combining these scores with the features above reduced the MSE further. We found this idea meaningful and planned to integrate it into our prediction model.

In the paper "Inferring Future Business Attention", the authors take the review text into account to predict the future ratings a business will receive, using sentiment analysis and keyword-opinion extraction. We planned to follow the sentiment analysis approach to generate feature vectors from the review text in order to predict the review rating. In the paper "Data Mining Yelp Data - Predicting rating stars from review text", the authors use Latent Dirichlet Allocation (LDA) and Term Frequency-Inverse Document Frequency (TF-IDF) to predict the rating. In addition to this model, we planned to combine the text features with the other features described above, and to extract features using LDA as well.

3. DATA DESCRIPTION

Yelp provides a snapshot of its enormous amount of data as the "Yelp Dataset Challenge". This set includes information about local businesses in 10 cities across 4 countries. The dataset is split into 5 '.json' files, each one representing a different category: business data, check-in data, user data, review data and tips provided by users to improve the businesses.

All user-related data carry a unique user_id, business data carry a business_id, and the other data, such as check-ins, reviews and tips, connect users and businesses through these two ids. The overall size of the data is around 1.64GB.



For our project we have used three datasets, namely 'yelp_academic_dataset_review.json', 'yelp_academic_dataset_user.json' and 'yelp_academic_dataset_business.json', which are the Review, User and Business datasets respectively. Some details about the datasets are as follows:

Table 1: Dataset Description

Dataset     Rows         Size
User        366,715      166.2MB
Business    61,184       55.4MB
Review      1,578,264    1.4GB

The review table includes information about each review. Specifically, it contains business_id, user_id, stars (a star rating on a scale of 1-5), text (the raw review text), date, and votes (the number of 'useful', 'funny' or 'cool' votes). The user table consists of user_id, name, review_count, average_stars (the average rating on a scale of 1-5 given by the user), and votes (the total number of votes for reviews made by this user). The business table contains details about each business, including id, name, neighborhoods, address and geographic information, stars, review_count (the total number of reviews about this business), categories (a list of category tags for this business), and other attributes.

Data Preparation and Cleaning

We used an open source application from GitHub called 'jsontocsv' that converts a .json file to a .csv file. The application also converts nested attributes into separate columns in the csv.

When the feature set did not include the review text, we separated the 'text' attribute from the review dataset to reduce its size and improve execution time. We extracted only the features we needed and combined them with the features extracted from the review text, which reduced the load on the machine and shortened execution time.
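The text-separation step can be sketched as follows. This is a minimal illustration, assuming the review file holds one JSON object per line; the helper name `strip_text` and the two sample records are hypothetical and stand in for the real file.

```python
import json

def strip_text(json_lines, drop=("text",)):
    """Parse line-delimited JSON records and drop the bulky fields."""
    slim = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        slim.append({k: v for k, v in record.items() if k not in drop})
    return slim

# Hypothetical sample standing in for the review file.
sample = [
    '{"business_id": "b1", "user_id": "u1", "stars": 4, "text": "Great food"}',
    '{"business_id": "b1", "user_id": "u2", "stars": 2, "text": "Slow service"}',
]
slim_reviews = strip_text(sample)
```

For the real dataset, an open file handle could be passed in place of `sample`, since iterating over a file yields one line at a time.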

4. EXPLORATORY DATA ANALYSIS

Since many businesses have very few reviews, we planned to work on data that are heavily reviewed. In order to get rich data with sufficient attributes for predicting business ratings, EDA was performed to find the categories that have the most reviews and to extract only the data that are heavily reviewed. As explained in the Problem Statement, a business with insufficient reviews will not be helpful in any analysis.

Figure 1: Rating Stars distribution

The plot of stars vs. counts shown above indicates that more than half of the reviews are positive, i.e. they are rated 4 or 5. Even the third highest count is 3 stars, which may indicate that users tend to rate a place only if they really like it.

Figure 2: Businesses of cities distribution

Very few categories appear more than a thousand times in the business data. The Restaurant category appears in more than 20k entries, which accounts for about one third of the business data. The other categories appear far less often, so as a preliminary step we have taken only the data under the Restaurant category. After EDA, the data was filtered and refined to keep only the most reviewed and useful content, in order to minimize outliers and to perform the prediction analysis effectively.

5. POLYNOMIAL FIT OF REVIEW STARS

After EDA we found that 3 businesses are reviewed more than 1000 times in the Review dataset. For these 3 businesses only, we sorted the reviews by increasing time and tried to fit a curve to see whether the review stars followed any common pattern.



Figure 3: Highest reviewed business

Figure 4: 2nd Highest reviewed business

Figure 5: 3rd Highest reviewed business

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$

R-squared is always between 0 and 100%. In general, the higher the R-squared, the better the model fits the data.

Table 2: Mean Squared Error

       Business 1    Business 2    Business 3
MSE    1.2           1.1           0.87

Table 3: R-squared values for various fits

Fit               Business 1    Business 2    Business 3
Linear            0.0099        0.0007        0.00065
Quadratic         0.012         0.0073        0.00065
Polynomial 4th    0.015         0.013         0.0048
Polynomial 6th    0.022         0.014         0.0067

R-squared values this close to zero mean that none of these fits describes our data. We need more features or a different model to predict the future review stars a business might obtain.
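The kind of fit comparison reported in Table 3 can be sketched as below. This is only an illustration under stated assumptions: the ratings are synthetic random stars with no time trend, not the data of the three businesses, and the helper `polyfit_r2` is a hypothetical name. It computes R-squared as one minus the ratio of unexplained to total variation, which for a least-squares polynomial fit equals the explained share of variation.

```python
import numpy as np

def polyfit_r2(x, y, degree):
    """Fit a polynomial of the given degree and return its R-squared."""
    coeffs = np.polyfit(x, y, degree)
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total variation
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in: star ratings with no time trend.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 1000)                # review order, rescaled to [0, 1]
y = rng.integers(1, 6, size=1000).astype(float)

r2_by_degree = {d: polyfit_r2(x, y, d) for d in (1, 2, 4, 6)}
```

On trendless data like this, raising the degree can only nudge R-squared upward by a small amount, which mirrors the pattern in the table.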

6. LINEAR REGRESSION ON FEATURES

In this section, we investigate how each feature of users, businesses and reviews influences the rating stars of a business.

Linear regression models the target Y as a linear function of the feature variables X_i, a bias term (α) and a regularization term λ:

$$Y = \alpha + \sum_i w_i X_i + \lambda$$

The coefficients (w_i) are what the training procedure learns. Each model coefficient describes the expected weight of influence on the target variable associated with its feature.

Intuitively, the coefficients often tell an interesting story of how much each feature matters in predicting target values, while the bias term indicates the average target value. For example, for business ratings on Yelp, the magnitude of a coefficient shows the strength of the feature and its sign (positive or negative) indicates the direction of its association with the final rating.

Exploratory Rating Prediction

We aggregated and combined the user and review data to run linear regression on various combinations of features and checked how they performed when predicting the business rating stars. The data were split into train and test sets; in each set, the feature columns were given as the input and the stars column as the target. We used Python's scikit-learn package (sklearn) to perform linear regression. We tried different train/test splits and achieved similar results.
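A minimal sketch of this train/test procedure is shown below. The feature columns and the target are synthetic and hypothetical, not the Yelp data, and the imports use the current scikit-learn module layout (`sklearn.model_selection`), which may differ from the version used for the reported results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Hypothetical features: average user rating, review count, useful votes.
X = np.column_stack([
    rng.uniform(1, 5, n),
    rng.integers(1, 200, n),
    rng.integers(0, 50, n),
])
# Synthetic target driven mostly by the first feature, plus noise.
y = 0.8 * X[:, 0] + 0.5 + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
```

After fitting, `model.coef_` holds one learned weight per feature column and `model.intercept_` holds the bias term, so the relative influence of each feature can be read off directly.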

A few of the feature combinations that we used are "All user attributes", "Cool, funny, useful votes of review", "Average stars given by users", "Cool, funny, useful votes received by users" and so on. The MSE values for a few of the feature sets are as follows:



Table 4: Root Mean Square Error Values

Feature Set                 train MSE      test MSE
all user features           1.262155615    1.289070039
cool,funny,useful review    1.719629405    1.724435433
avg star user               1.328749457    1.330678437
cool,funny,useful user      1.814434709    1.837739733

The mean squared error function computes the mean squared error, a risk metric corresponding to the expected value of the squared (quadratic) error or loss. If ŷ_i is the predicted value of the i-th sample and y_i is the corresponding true value, then the mean squared error (MSE) estimated over n_samples is defined as

$$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} (y_i - \hat{y}_i)^2$$

In short, the lower the MSE, the better the model and the closer the fit is to the data. Without any feature analysis, the raw features give very high MSE values (> 1). Hence, we decided to investigate the features further and perform prediction on only those features that strongly impact the business ratings.
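As a small worked example of the MSE definition, the snippet below evaluates it by hand on four hypothetical ratings; the numbers are illustrative only.

```python
import numpy as np

y_true = np.array([5.0, 4.0, 3.0, 1.0])   # hypothetical actual star ratings
y_pred = np.array([4.5, 4.0, 3.5, 2.0])   # hypothetical model predictions

squared_errors = (y_true - y_pred) ** 2    # [0.25, 0.0, 0.25, 1.0]
mse = squared_errors.mean()                # 1.5 / 4 = 0.375
```

A single prediction that is off by one full star contributes as much as four predictions that are each off by half a star, which is why an MSE above 1 on a 1-5 scale signals a poor fit.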

7. FEATURE ANALYSIS

In order to find the features which best predict the business rating, we computed the correlation between different features of a business and the aggregated rating the business has received. We selected and extracted the following features and computed their correlation with the business rating:

i) Average rating of all the users who reviewed a particular business so far
ii) Review count
iii) Useful votes
iv) Cool votes
v) Funny votes

Figure 6: Correlation Matrix Visualization 1

Figure 7: Correlation Matrix Visualization 2

From the above correlation plots, we could deduce that the average rating of users is highly correlated with business stars compared to the other features. We therefore took the average rating of users as the primary feature for predicting the business rating, then added other features and checked whether the MSE decreased.

We tried regression with many combinations of features alongside the user average rating, but only the category average (i.e. the average rating of a particular category such as 'Italian', 'Chinese' etc.), the review count and the cool/funny/useful votes gave good results.
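The correlation screening described in this section can be sketched as follows. The data are synthetic: `avg_user_rating` is deliberately constructed to track the target, while the other two columns are unrelated by construction, so the numbers say nothing about the real dataset. All column names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
business_stars = rng.uniform(1, 5, n)
features = {
    "avg_user_rating": business_stars + rng.normal(0, 0.4, n),  # built to track the target
    "review_count": rng.integers(1, 500, n).astype(float),      # unrelated by construction
    "useful_votes": rng.integers(0, 100, n).astype(float),      # unrelated by construction
}

# Pearson correlation of each feature with the business rating.
correlations = {
    name: np.corrcoef(values, business_stars)[0, 1]
    for name, values in features.items()
}
best_feature = max(correlations, key=lambda name: abs(correlations[name]))
```

Ranking by absolute correlation, as done here, treats strong negative associations as equally informative as strong positive ones.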

Table 5: Root Mean Square Error Values

Feature Set                                         train MSE       test MSE
avg user rating                                     0.3311218195    0.343218655
avg user rating, category average                   0.5421097938    0.53110551
avg user rating, review count, cool/funny/useful    0.2941391645    0.3005636476

From the above table it is clear that these features perform much better than the initial feature sets used before the correlation analysis. This model has an MSE as low as 0.3, which means it is a good fit for predicting the business star ratings.

Table 6: Average of Predicted Rating

Actual Rating    Count    Avg of predicted Rating
5                88866    4.727753151
4                78000    3.897875831
3                37588    3.511951464
2                24438    2.882491759
1                30886    2.437493715

From the EDA we know that more than half of the reviews are positive, i.e. they are rated 4 or 5. Even the third highest count is 3 stars, which may indicate that users tend to rate a place only if they really like it.



The above table shows that the model does well on ratings between 3 and 5 but not as well on ratings 1 and 2. One reason could be that the number of 4 and 5 star reviews is more than twice the number of reviews with 1-3 stars, as we have seen in the EDA, and that the model's predictions tend to have a smaller spread than the real ratings.

8. GENERATING REVIEW TEXT FEATURES

There is still room for improvement, so we decided to use the review text as a feature in addition to the above features to predict the business star ratings. Review text is one of the most important and rich components of the Yelp dataset. It can be processed and used in a variety of ways for a regression model. Some of the most common methods used to process review text are n-gram analysis, keyword association and extraction, and sentiment analysis.

Frequently Occurring Keywords

Initially, we took the approach followed by Bryan Hood, Victor Hwang and Jennifer King in their paper "Inferring Future Business Attention". The first method we tried was to mine the most frequently occurring keywords among all restaurant reviews. We then used counts of the sentiments for each of these keywords as features.

We use the algorithm below to produce a feature vector containing the counts of positive and negative words present in each review. This is a two-step process. In the first step, we compute the top keywords among all the restaurant reviews. In the second step, we count, for each review, the number of times these keywords occur. We used the Python Natural Language Toolkit (NLTK) for this process.

Steps:

• The first step is to clean the review text. This step involves spell checking and removing the punctuation characters.

• Next, we tokenize each review into sentences and each sentence into words. Then we remove the tokens which are stopwords.

• Then, we part-of-speech (POS) tag each token. The Penn Treebank Corpus was used to tokenize and POS-tag the review text.

• From the tagged tokens, we only keep the adjectives (tags starting with 'JJ'). We ran the above algorithm on all the reviews of all the businesses under a specific category (i.e. restaurants).

• The 25 most frequently used words were extracted using this technique.

The next step is to construct the feature vector for each individual review by counting the occurrences of these most frequent words in it. We then aggregate the counts of each word by the business id. Finally, we combine this feature vector with the average review stars for each business and perform linear regression with the average review stars as the target.
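The counting and aggregation step can be sketched in plain Python as below. The three-word `TOP_WORDS` list is a hypothetical stand-in for the 25 extracted adjectives, the reviews are invented, and the text is assumed to be already cleaned of punctuation. The POS-tagging stage is not reproduced here.

```python
from collections import Counter, defaultdict

TOP_WORDS = ["good", "great", "bad"]  # hypothetical stand-in for the 25 extracted words

def review_vector(text, top_words=TOP_WORDS):
    """Count how often each top word occurs in one review."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in top_words]

def aggregate_by_business(reviews, top_words=TOP_WORDS):
    """Sum per-review count vectors over all reviews of each business."""
    totals = defaultdict(lambda: [0] * len(top_words))
    for business_id, text in reviews:
        for i, c in enumerate(review_vector(text, top_words)):
            totals[business_id][i] += c
    return dict(totals)

reviews = [
    ("b1", "good food and great service"),
    ("b1", "good value"),
    ("b2", "bad service bad food"),
]
features = aggregate_by_business(reviews)
```

Each resulting vector has one entry per top word, so the rows can be stacked into a feature matrix and paired with the average stars of the matching business.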

Table 7: MSE for Top words feature vector

Feature Set            train MSE       test MSE
Top words frequency    0.4681823533    0.4656203843

This model performs worse than our initial feature vector. Hence we decided to try other models for review text feature analysis.

tf-idf

Term frequency only measures how often a word occurs in a document, not how important the word is. In order to score words by importance we chose the tf-idf method. tf-idf is short for term frequency-inverse document frequency.

tf-idf measures how important a word is to a document in a collection or corpus. It gives more weight to words that appear less frequently across all documents. For this, we aggregated all reviews for a business and created a separate document for each business. We got around 2500 business review documents. We then selected around 15 words and computed the tf-idf score of each word for each business, using the Python NLTK library. Using the tf-idf score of each word as a feature, we used linear regression to predict the business stars.
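To illustrate why tf-idf favours words that are rare across documents, here is a minimal hand-rolled version. It assumes the plain variant tf × log(N / df); library implementations often add smoothing or normalization, so their numbers will differ. The three-document corpus is invented.

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf of a term in one tokenized document, using the plain log(N / df) variant."""
    tf = doc.count(term) / len(doc)                  # relative frequency in this document
    df = sum(1 for d in corpus if term in d)         # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    "the food was good".split(),
    "the service was slow".split(),
    "the pasta was good".split(),
]
common = tf_idf("the", corpus[0], corpus)    # appears in every document
rare = tf_idf("food", corpus[0], corpus)     # appears in one document only
```

Both words occur once in the first document, yet "the" scores zero because it appears everywhere, while "food" keeps a positive weight.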

Table 8: MSE for tf-idf feature vector

Feature Set    train MSE        test MSE
tf-idf         0.07837244585    0.1047954418

This model produced excellent results, with an MSE as low as 0.1. Each run on 50,000 rows of reviews took around 3 hours, so due to the heavy nature of this algorithm we were initially able to run it only on a subset of the review dataset: 50,000 reviews out of 1.5 million. This may be a reason why the model performs so well.

We therefore decided to run the model on an increased number of rows to check how it performed with more data. After the initial run on about 50,000 rows, which contained information about 650+ restaurants, we ran the algorithm on about 150,000 reviews, which contained information about more than 1000 restaurants. Finally we ran the algorithm on 400,000 reviews, which covered more than 1700 restaurants.



Table 9: MSE for tf-idf feature vector

#Reviews    train MSE        test MSE
50000       0.07837244585    0.1047954418
150000      0.05933798492    0.06423109019
400000      0.06115728106    0.06005993051

Initially we chose the most frequently used words to compute tf-idf scores. Since most reviews are positive, the chosen words were mostly positive, which affected the accuracy of the business rating prediction. So we split the reviews into two parts: the first contains highly rated reviews (4-5 stars) and the second contains low rated reviews (1-2 stars). We chose the 10 most frequent words from the first part and the 5 most frequent words from the second part.
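The balanced word selection can be sketched as follows. The reviews are invented and the helper `top_words` is a hypothetical name; the report takes 10 and 5 words, while this tiny sample takes 2 and 1 so that the result is easy to follow.

```python
from collections import Counter

def top_words(texts, k, stopwords=frozenset()):
    """Return the k most frequent non-stopword tokens across the given texts."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(k)]

reviews = [  # (stars, text) pairs, hypothetical
    (5, "great food great staff"),
    (4, "great place friendly staff"),
    (1, "rude staff terrible food"),
    (2, "terrible wait"),
]
positive = [text for stars, text in reviews if stars >= 4]
negative = [text for stars, text in reviews if stars <= 2]

# Words from the highly rated part first, then words from the low rated part.
vocabulary = top_words(positive, 2) + top_words(negative, 1)
```

If the same word ranks highly in both parts it would appear twice in `vocabulary`, so a real pipeline would likely deduplicate the combined list.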

The reviews for each business were aggregated into a separate document and placed in a specific directory. We created around 2500 documents. To generate the tf-idf model, we considered only 1500 documents due to memory and processing constraints. We iterate through every file in that directory, converting the text to lowercase and removing punctuation. The reviews are then tokenized and preprocessed using NLTK's tokenizer and Porter stemmer, and passed to the TfidfVectorizer method, which generates a matrix of tf-idf vectors.

After generating this tf-idf score matrix, we found, for each individual review, the tf-idf scores of the chosen 15 words. We took the average of the tf-idf scores for each business; the averaged scores of these 15 words form the feature vector, which is passed to the linear regression model to predict the business star rating.
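A simplified sketch of the vectorizing step is given below, assuming scikit-learn's `TfidfVectorizer`. It restricts the vocabulary to the chosen words up front and scores one document per business, which is a simplification of the per-review averaging described above; stemming and punctuation removal are also omitted. The documents and the three chosen words are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per business: all of its reviews concatenated (hypothetical sample).
business_docs = {
    "b1": "great food great staff friendly place",
    "b2": "terrible food rude staff",
    "b3": "great pasta but terrible wait",
}
chosen_words = ["great", "terrible", "staff"]  # stand-in for the 15 selected words

# A fixed vocabulary keeps one matrix column per chosen word, in this order.
vectorizer = TfidfVectorizer(lowercase=True, vocabulary=chosen_words)
tfidf_matrix = vectorizer.fit_transform(list(business_docs.values()))
features = dict(zip(business_docs, tfidf_matrix.toarray()))
```

Each row of `tfidf_matrix` can then serve as the feature vector of one business in the regression, with the business star rating as the target.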

It is evident from Table 9 and its low MSE values that this model is a very good predictor of the star ratings of a restaurant. Although we had initially thought that the low MSE values were due to the small number of rows taken as a sample, the test MSE did not increase but rather decreased as the number of input reviews grew.

Table 10: Average of Predicted Rating

Actual Rating    Count     Avg of predicted Rating
5                116911    4.882645761
4                106044    4.145951575
3                65632     3.082441265
2                52482     2.172148637
1                58930     1.343798834

Comparing Table 6 with Table 10, it is evident that the average predicted rating under tf-idf is closer to the actual rating than what was obtained from the other features. Some difference remains, mostly because of the skewed distribution of the number of reviews received for each rating from 1 to 5. From Tables 9 and 10, tf-idf seems to be the best model to predict the ratings.


Latent Dirichlet Allocation (LDA)

After experimenting with frequent words and tf-idf, we decided to test topic modeling and check whether it has any impact on the accuracy of business rating prediction. To accomplish this, we studied topic modeling algorithms and chose LDA over Latent Semantic Indexing (LSI). Even though LSI is faster than LDA, the latter gives more accuracy than the former.

LDA represents documents as mixtures of topics that emit words with certain probabilities. The business review texts are filtered using NLTK stop words and then tokenized into words. The words are then converted to a dictionary which contains word-id mappings, so that each word in the corpus is assigned a unique id. As gensim uses the Bag of Words (BoW) representation, the documents are then converted to BoW format using these mappings. The function doc2bow() in gensim simply counts the number of occurrences of each distinct word, converts the word to its id, and returns the result as a sparse vector (word ids and their counts).
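The dictionary and Bag of Words step can be illustrated without gensim. The sketch below is a plain-Python imitation of the idea, not gensim's implementation: the helper names `build_dictionary` and `to_bow` are hypothetical, and the ids gensim assigns to words may differ from the first-appearance order used here.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign each distinct word a unique integer id, in order of first appearance."""
    word_to_id = {}
    for doc in tokenized_docs:
        for word in doc:
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id

def to_bow(doc, word_to_id):
    """Return a sparse vector of (word id, count) pairs, sorted by id."""
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())

docs = [["good", "food", "good", "service"], ["slow", "service"]]
word_to_id = build_dictionary(docs)
bow_corpus = [to_bow(doc, word_to_id) for doc in docs]
```

Words that never occur in a document are simply absent from its vector, which is what keeps the representation sparse.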

Instead of using raw word frequencies, we planned to incorporate tf-idf before passing the data to the LDA model. The BoW sparse vectors are transformed to tf-idf vectors using the gensim tf-idf model. The tf-idf model expects BoW vectors as input and returns vectors of the same dimensionality with tf-idf valued weights. These tf-idf valued vectors are given to the LDA model to generate the required number of topics.

The number of topics to be generated can be specified in the gensim LDA model. We chose 10 topics. Using this model, we took each review from the dataframe and computed how strongly it was related to each topic generated by the LDA model. We computed this score for around 150,000 reviews and took the average score for each business id. The ten topic scores form the feature vector, from which we tried to predict the business rating using linear regression. We used the Python gensim package to perform topic modeling using LDA.

Table 11: LDA MSE

#Businesses    2916
Train MSE      0.380210916
Test MSE       0.393376747

This model performs better than the frequent words model and is on par with the business features model, but performs worse than the tf-idf model.

9. TIME SERIES ANALYSIS

The autoregressive (AR) model specifies that the output variable depends linearly on its own previous values and on a stochastic term. A moving-average (MA) model is conceptually a linear regression of the current value of the series against current and previous (unobserved) white noise error terms or random shocks. The random shocks at each point are assumed to be mutually independent and to come from the same distribution, typically a normal distribution, with location at zero and constant scale. An ARMA model is a combination of the two. We decided to try the AR model first because we lacked a deeper understanding of the MA model.

For time series analysis, we extracted the ratings of the top 3 most reviewed businesses together with their respective dates. The records were then sorted by the date column in ascending order, so the dataset for each of the 3 businesses had only two columns: the date in ascending order and the corresponding rating received. We used the statsmodels package in Python for AR fitting and prediction.

We initially fit the AR model on the entire data and predicted the ratings for the entire time span. We then also tried using the first 70-80% of the data for training and the rest as the test data. The split had to preserve the order of the data because the output variable depends linearly on its own previous values. The actual ratings and the predicted ratings were then passed to the mean squared error function to measure accuracy.
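The chronological split and AR fit can be sketched as below. Note the substitution: this uses an ordinary least-squares fit written with NumPy rather than the statsmodels AR routines used for the reported results, the lag order p=2 is an arbitrary choice, and the ratings are synthetic random stars. The helper names are hypothetical.

```python
import numpy as np

def lagged_rows(series, p):
    """Row for time t holds [x[t-1], ..., x[t-p]], for every t >= p."""
    return np.array([series[t - p:t][::-1] for t in range(p, len(series))])

def fit_ar(series, p):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p]."""
    rows = lagged_rows(series, p)
    X = np.column_stack([np.ones(len(rows)), rows])
    coeffs, *_ = np.linalg.lstsq(X, series[p:], rcond=None)
    return coeffs  # [c, a1, ..., ap]

def predict_one_step(series, coeffs):
    """One-step-ahead predictions for every t >= p, using the true past values."""
    p = len(coeffs) - 1
    return coeffs[0] + lagged_rows(series, p) @ coeffs[1:]

rng = np.random.default_rng(3)
ratings = rng.integers(1, 6, size=1000).astype(float)  # synthetic, date-ordered stars

p = 2
split = int(0.8 * len(ratings))          # chronological split, no shuffling
train, test = ratings[:split], ratings[split:]
coeffs = fit_ar(train, p)
pred = predict_one_step(test, coeffs)
test_mse = np.mean((test[p:] - pred) ** 2)
```

Because the synthetic ratings carry no dependence on their own past, the fitted model can do little better than predicting the mean, so its MSE stays close to the variance of the ratings.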

Table 12: AR MSE

Business 1    0.4306942204
Business 2    0.4868993014
Business 3    0.4729743146

As we can infer from the table, this model gives a reasonably high MSE compared to the other models. We suspect that the main reason is that the change in the ratings of a business over time depends on a variety of factors related to the business and its surroundings, not just on its previous ratings. This may be why models that include other features of a business perform better than this model. Due to time constraints, we were unable to proceed further with time series analysis (TSA). We expect TSA to perform better if we incorporate time-dependent features instead of just the previous ratings.

10. CROSS-FOLD VALIDATION

We validated the models that gave us the best results using cross-fold validation.

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting.

A solution to this problem is a procedure called cross-validation (CV). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets:

- A model is trained using k-1 of the folds as training data.
- The resulting model is validated on the remaining part of the data.

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.

We used the cross_validation package from scikit-learn and its cross_val_score method to calculate the score. We passed this method four arguments: the estimator object used to fit the data, the feature data, the target values, and cv, where cv is a cross-validation generator or an iterable, i.e. cv determines the cross-validation splitting strategy, for example by specifying the number of folds.
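A minimal sketch of such a call is shown below, on synthetic data with a single hypothetical feature. It imports `cross_val_score` from `sklearn.model_selection`, its location in current scikit-learn releases, which is an assumption relative to the package name given above. For a regressor, `cross_val_score` reports the R-squared of each fold by default, so the value formatted as "Accuracy" here is a mean R-squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(1, 5, size=(400, 1))                  # hypothetical single feature
y = 0.9 * X[:, 0] + rng.normal(0, 0.5, size=400)      # synthetic target

# Arguments: estimator, feature matrix, target values, number of folds.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
summary = "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)
```

The `scores` array holds one value per fold, so changing `cv` changes its length but, for a stable model, should leave the mean roughly unchanged.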

Table 13: Cross Fold Score

model                                               cross fold score
avg user rating                                     Accuracy: 0.37 (+/- 0.09)
avg user rating, review count, cool/funny/useful    Accuracy: 0.36 (+/- 0.11)
most freq words                                     Accuracy: 0.08 (+/- 0.13)
tf-idf (653 businesses)                             Accuracy: 0.59 (+/- 0.13)
tf-idf (1327 businesses)                            Accuracy: 0.67 (+/- 0.15)
tf-idf (1777 businesses)                            Accuracy: 0.71 (+/- 0.10)

The accuracy is the mean of the returned scores variable, i.e. scores.mean(), and the +/- value is scores.std() * 2. We tried various values of cv and did not observe much change. As expected from our analysis so far, tf-idf performs better than the other models, and accuracy seems to increase as more reviews are analysed.

11. CONCLUSION

The motivation for this project was to come up with a good method to predict a business's star rating from its features. This has many potential applications, such as determining the features that contribute to a good review, detecting fraudulent reviews, and suggesting what a business with given features could add to improve its ratings.

In this report, we have discussed our approach, which involved the generation of various correlated features for linear regression prediction, review text feature generation and time series analysis. Several feature extraction methods, such as term frequency, Latent Dirichlet Allocation (LDA) and TF-IDF, have been used and evaluated.

The average user rating, review count and cool/funny/useful votes were the business features that best predicted the business rating. Among the review text based feature predictions, the tf-idf model produced better results than LDA and term frequency.



12. FUTURE WORK

In our project, we generated business features and review text features separately and evaluated them, but we did not test a combined feature matrix including both. We can combine the business features and the review text feature matrix and test how the combined set of features predicts the ratings of a business.

Also, due to time constraints, we could not carry out the time series analysis completely. The analysis performed with the AR model did not give very good results. We can improve this by incorporating time-dependent features instead of just the previous ratings. We can generate time-based features such as total reviews so far, cool/funny/useful votes so far, etc. These can be used in the time series analysis to predict how the ratings of a business would change in the future, given the time-dependent features generated from the ratings received so far.

13. REFERENCES

[1] Bryan Hood, Victor Hwang and Jennifer King, "Inferring Future Business Attention", 2013, retrieved from http://www.yelp.com/dataset_challenge.

[2] Wael Farhan, "Predicting Yelp Restaurant Reviews", UCSD.

[3] Yinshi Zhang, "Semantic Feature Analysis and Mining for Yelp Rating Prediction", UCSD.

[4] Rakesh Chada, Chetan Naik, "Data Mining Yelp Data - Predicting rating stars from review text", Stony Brook University.

[5] J. Huang, S. Rogers, and E. Joo, "Improving restaurants by extracting subtopics from yelp reviews", 2014.

[6] https://radimrehurek.com/gensim/tut2.html
