Classifying What Destinations Airbnb Users Choose
Moazum Munawer
Abstract
The importance of an industry-wide emphasis on analytics underscores the valuable insights that can be
derived from user data. Machine learning and predictive modeling in analyzing user behavior and data
can result in highly effective targeted ads which increase the probability of a user making a purchase. If a
company knows what their user is interested in then they can market the right product to them. Using data
from 2010 to 2015 of Airbnb users, the current study uses a range of classification and predictive models
to understand the relationship between select Airbnb user behavioral features and which country
destination a user books. Significant predictors of country destinations included user age, length of time a
user took from their first activity to their booking, and length of time between creating an account and
booking. Methods utilized were k-Nearest Neighbors, Extreme Gradient Boosting, Support Vector
Machine, Random Forest, and Neural Networks. Most models showed similar performance predicting
country destination, with accuracies hovering near 71%. Note that the data are unbalanced, with ‘US’ as
the country destination in 71% of the observations in the processed training data. The best-performing
model was the Support Vector Machine, with an accuracy of 90.6%.
Keywords
Airbnb, Classification, Decision Trees, Boosting, Country Destination, Support Vector Machine, Extreme
Gradient Boosting, Random Forest, k-Nearest Neighbors, Neural Networks, Marketing, Predictive
Models, Exploratory Data Analysis
Introduction
Description
The goal of this analysis is to build a model that will accurately predict where a new Airbnb user will
book their first travel experience. Airbnb allows users to book accommodations in more than 81,000
cities and 191 countries1; predicting the destination country of a user’s first booking allows Airbnb to
create a more personalized experience for their users, as well as make the booking process more efficient.
The objective of this analysis is to determine which classification technique best predicts the destination
country of first booking. Several classification techniques are explored, including K-nearest neighbors,
decision trees, support vector machines, boosting, and neural networks. The dataset used for analysis
includes user data collected between 2010 and 2015; any records from users that did not book an
accommodation were disregarded. Both qualitative and quantitative variables are included as predictors;
one-hot encoding is used to convert each level of the categorical variables to its own dummy variable for
better model performance. The dataset was split into separate training and test sets for model building and
validation.
Research Questions
This analysis is interested in determining how user characteristics can be used to best predict the
destination country for a user’s first booking on Airbnb.
Statistical Questions
Specific statistical questions of interest for this analysis include:
- Which classification method best predicts the destination country?
- Which user variables are significant predictors of destination country?
Variables of Interest
The response variable of interest is destination country. The predictor variables include all available user
information, including characteristics such as age, gender, signup method, and other web session data
collected when a user makes their first booking.
1 https://press.airbnb.com/about-us
Exploratory Data Analysis
Data Source
The “airbnb” datasets are from the Airbnb New User Bookings data science competition hosted on
Kaggle. The competition goal was to predict a new user’s first booking destination given a list of users
along with their demographics, web session records, and some summary statistics. The dataset used in
this analysis is:
● train_users_2.csv - the training set of users consisting of 213,451 observations on the following
16 variables:
○ id: user id
○ date_account_created: the date of account creation
○ timestamp_first_active: timestamp of the first activity, note that it can be earlier than
date_account_created or date_first_booking because a user can search before
signing up
○ date_first_booking: date of first booking
○ gender
○ age
○ signup_method
○ signup_flow: the page a user came to sign up from
○ language: international language preference
○ affiliate_channel: what kind of paid marketing
○ affiliate_provider: where the marketing is e.g. google, craigslist, other
○ first_affiliate_tracked: the first marketing the user interacted with before
signing up
○ signup_app
○ first_device_type
○ first_browser
○ country_destination: the destination country of the user’s first booking.
■ There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES',
'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that
'NDF' is different from 'other' because 'other' means there was a booking, but is to a
country not included in the list, while 'NDF' means there wasn't a booking.
The train_users_2.csv data file is read into R as tr.users dataset.
Data Quality
The tr.users dataset is not a “tidy” dataset. For example, there are 87,990 (~41%) missing
observations in the age feature and 124,543 (~58%) “empty” observations in the
date_first_booking feature (see Table 1, Figure 1, and Figure 2 in Appendix A).
There is also evidence of class imbalance in the response variable, country_destination. Close to
60% of country_destination observations are 'NDF' and close to 30% are 'US' (see Figure 3 in
Appendix A). This may impact our classification methods.
Given the data missingness and class imbalance, further analysis was performed to determine any
patterns that might help in handling these dataset issues. I discovered that all “empty” observations in
date_first_booking were restricted to country_destination = NDF (see Figure 4 in Appendix
A). Further, the missingness in age and date_first_booking often occurred together
when the country_destination is ‘NDF’ (see Figure 5 in Appendix A).
To decrease the missingness in both age and date_first_booking, I eliminated the rows where
country_destination is ‘NDF’. This also slightly improves the class imbalance issue, since NDF had
the highest count of observations. To further improve the class imbalance, I chose to eliminate
country_destination classes with fewer than 1,000 observations. The resulting dataset of 87,390
rows still contained ~23% missing age observations.
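As a minimal sketch of these filtering steps (a toy data frame stands in for tr.users, with the frequency threshold shrunk from 1,000 to 3 for illustration):

```r
# Toy stand-in for tr.users; the real threshold was 1,000 observations
toy <- data.frame(
  country_destination = factor(c(rep("NDF", 5), rep("US", 6), rep("FR", 4), "PT"))
)
# Drop rows with no booking
toy <- toy[toy$country_destination != "NDF", , drop = FALSE]
# Drop destination classes below the frequency threshold (3 here, 1,000 in the report)
counts <- table(droplevels(toy$country_destination))
keep.classes <- names(counts)[counts >= 3]
toy <- toy[toy$country_destination %in% keep.classes, , drop = FALSE]
toy$country_destination <- droplevels(toy$country_destination)
levels(toy$country_destination)  # "FR" "US"
```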
The age feature contained obvious erroneous values such as 1924, 1925, 2014, etc., which appeared to
be birth years instead of age values. I chose to “correct” these erroneous values by subtracting them
from 2014 to derive a more realistic age value. The age feature also contained several values below 18
years of age and many of 100+ years. Given that the end-user license agreement for Airbnb references
a minimum age of 18, I chose to replace age values less than 18 with NA values. I also chose 85 years of
age as the upper limit, since it is difficult to envision many older individuals being tech-savvy enough and
open to an experience like Airbnb stays; age values over 85 were therefore also replaced with
NA values. The resulting missingness for age increased to ~24.5%. I then used the knnImputation()
function from the DMwR library to impute the missing values in age.
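The age corrections described above can be sketched in base R (toy vector; the remaining NA values would then go through DMwR::knnImputation()):

```r
age <- c(28, 1985, 16, 110, 45)
# Values that look like birth years: subtract from 2014 to recover an age
age <- ifelse(age > 1000, 2014 - age, age)
# Enforce the 18-85 plausibility window with NA values
age[age < 18 | age > 85] <- NA
age  # 28 29 NA NA 45
```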
I attempted to use Principal Component Analysis (PCA) with k-Means Clustering for visualizing the
cleansed tr.users dataset. However, the resulting clustering was not useful for summarizing and
visualizing the dataset given the large portion of categorical features.
Feature Engineering
I converted date_account_created and date_first_booking features to seasons
(date_account_created_season and date_first_booking_season) based on the Northern
Meteorological Seasons definition where the season starts within the month of the equinox instead of a
specific date within the month. I also created lag features using the date_account_created,
date_first_booking, and timestamp_first_active as follows:
● activity_to_account = date_account_created - timestamp_first_active
● activity_to_booking = date_first_booking - timestamp_first_active
● account_to_booking = date_first_booking - date_account_created
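A minimal sketch of these lag features with base R dates (in the real data, timestamp_first_active is a numeric timestamp that must be parsed to a date first):

```r
first_active    <- as.Date("2014-05-01")  # parsed from timestamp_first_active
account_created <- as.Date("2014-05-03")
first_booking   <- as.Date("2014-05-10")
activity_to_account <- as.numeric(account_created - first_active)   # 2 days
activity_to_booking <- as.numeric(first_booking - first_active)     # 9 days
account_to_booking  <- as.numeric(first_booking - account_created)  # 7 days
```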
Finally, I converted the categorical features using one-hot encoding. I chose to create two datasets based
on different one-hot-encoded (OHE) methodologies. For distance-based classification modeling, I created
an OHE dataset (airbnb.false.rank) where dummy variables were established for each and every
unique value within a categorical variable, also known as the k-dummy-variable approach, which
resulted in 145 total features. For non-distance-based classification modeling, I created an OHE dataset
(airbnb.full.rank) where dummy variables were established for the k-1 unique values within each
categorical variable, which resulted in 133 total features.
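The difference between the two encodings can be sketched with base R's model.matrix() on a single toy factor (for many factors at once, caret::dummyVars() with its fullRank argument behaves analogously):

```r
x <- factor(c("a", "b", "c", "a"))
full.rank  <- model.matrix(~ x)[, -1]  # k-1 dummies per factor (intercept dropped)
false.rank <- model.matrix(~ x - 1)    # k dummies per factor
ncol(full.rank)   # 2
ncol(false.rank)  # 3
```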
The response variable, country_destination, was converted to a numeric factor,
country_destination_num, using the following mapping: ‘CA’ = 0, ‘DE’ = 1, ‘ES’ = 2,
‘FR’ = 3, ‘GB’ = 4, ‘IT’ = 5, ‘other’ = 6, and ‘US’ = 7. The conversion was
necessary since many of the classification methods prefer numeric instead of character values.
Train/Test Dataset Creation
Each of the OHE datasets was split using an 80/20 ratio to create training and test datasets.
These training and test (hold out) datasets were used in the modeling efforts detailed in the Analysis
section of the report. The training datasets contained 69,913 observations while the test datasets contained
17,477 observations. The training datasets were used for modeling and the test datasets were used to
measure the prediction accuracy of the models.
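A minimal sketch of the 80/20 split, assuming a toy row count in place of the 87,390 processed rows:

```r
set.seed(42)
n <- 1000  # stand-in for the 87,390 processed rows
train.idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80% for training
test.idx  <- setdiff(seq_len(n), train.idx)             # remaining 20% held out
length(train.idx)  # 800
length(test.idx)   # 200
```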
Analysis
I applied several classification methodologies to determine which method achieved the best overall
prediction accuracy for the datasets. Each model was trained using the training data and then tested
against the hold-out (test) dataset to determine the overall prediction accuracy.
k-Nearest Neighbor (KNN)
KNN is a classifier that first identifies the K points in the training data whose features are closest to those
of the test observation, then uses the responses of those K neighbors to estimate a conditional probability
for each potential class, and finally assigns the test observation to the class with the highest probability.
For this analysis, I used KNN to predict the country destination for each user in the test data based on
the country destinations of the closest users in the training data.
I tried different values of K (1, 5, and 10) and produced confusion matrices for all three values.
When using K = 1, the model correctly assigns the test observation to the right class 55.9% of the time.
Refer to Table 3 in Appendix A for the confusion matrix.
In order to increase the proportion of correct predictions, the number of neighbors was increased
to 5.
Using K = 5, the model assigns the test observation to the right class 68.9% of the time, an increase of
13 percentage points from K = 1. A larger value of K was then tried to see if it yields a larger proportion
of correct assignments. Refer to Table 4 in Appendix A for the confusion matrix.
Using K = 10, the model correctly assigns the test observation to the right class 71% of the time. While
increasing K caused the number of correct predictions to increase, the percentage only climbed by 2.1%.
Refer to the table below for the confusion matrix.
Confusion matrix for KNN when k = 10
When increasing the value of K, the KNN model correctly assigned more of the test observations to the
right class. However, the model mostly increased its number of correct predictions by assigning more
observations to class 7. Since 71.4% of the observations in the test data set belong to class 7 (‘US’),
simply assigning more observations to that class makes the model appear to do a better job of predicting
the responses of the test observations.
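That 71.4% figure is simply the no-information rate, i.e. the accuracy of always predicting the majority class; a toy sketch with illustrative counts:

```r
# Toy test labels with a 71% majority class
y <- factor(c(rep("US", 71), rep("FR", 20), rep("IT", 9)))
nir <- max(table(y)) / length(y)  # accuracy of always predicting 'US'
nir  # 0.71
```

Any model whose accuracy hovers near this value may just be echoing the class imbalance rather than learning.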
Extreme Gradient Boosting (XGBoost)
Gradient boosting is a machine learning technique for regression and classification problems, which
produces a prediction model in the form of an ensemble of weak prediction models, typically decision
trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them
by allowing optimization of an arbitrary differentiable loss function.2
XGBoost is one implementation of the gradient boosting concept, but what makes XGBoost unique
is that it uses “a more regularized model formalization to control over-fitting, which gives it better
performance,” according to the author of the algorithm, Tianqi Chen. Therefore, it helps to reduce
overfitting.3
2 https://en.wikipedia.org/wiki/Gradient_boosting
3 https://blog.exploratory.io/introduction-to-extreme-gradient-boosting-in-exploratory-7bbec554ac7
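As a hedged sketch, the regularization Chen refers to is exposed through xgboost's lambda (L2) and alpha (L1) penalty parameters; the values below are illustrative, and the commented-out call mirrors the listing in Appendix A:

```r
# Illustrative parameter list; lambda/alpha are xgboost's L2/L1 penalties
params <- list(
  objective = "multi:softmax",
  num_class = 8,
  eta       = 0.3,
  max_depth = 6,
  lambda    = 1,  # L2 regularization on leaf weights (the xgboost default)
  alpha     = 0   # L1 regularization (off by default)
)
# xgb.model <- xgboost(data = data.matrix(X.train.full.rank),
#                      label = train.labels, params = params, nrounds = 25)
```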
I attempted to use hyperparameter tuning, but model run times were exceedingly long (e.g. multiple days),
so the effort was abandoned due to time constraints. Hyperparameter tuning was lengthy due to the
training data dimensions (69,913 observations of 131 predictors).
I chose to run the default parameter values for the xgboost() function from the xgboost library. The
resulting xgboost model yielded a test prediction accuracy of 71.4% which is roughly the same as
guessing ‘US’ as destination country for each prediction (see XGBoost Confusion Matrix for Test Data
below). The class imbalance is likely the cause of the model favoring ‘US’ as the destination country of
the first booking.
XGBoost Confusion Matrix for Test Data
The feature importance was determined using the xgb.importance() function on the training dataset.
The top ten (10) features listed in decreasing level of importance are: age, activity_to_booking,
account_to_booking, affiliate_channel.other, date_first_booking_season.Spring,
gender.FEMALE, gender.MALE, signup_app.Web, signup_method.facebook, and
affiliate_channel.sem.non.brand. Refer to Table 6 in Appendix A for additional details.
Support Vector Machine (SVM)
The support vector machine is a classification model that is an extension of the support vector classifier;
the goal of these methods is to find a hyperplane that separates the data as well as possible. A kernel
approach is used in SVM to enlarge the feature space and allow for a non-linear boundary; the two kernel
types explored in this analysis are the polynomial and radial kernels. A value must be selected for the
cost tuning parameter, which is important as it determines the extent to which the model underfits or
overfits the data. A small value means the cost of misclassification is low and thus allows more
observations to be misclassified; a larger value penalizes misclassification more heavily and typically
yields higher training accuracy4.
When using SVM for a multi-class problem such as this analysis, two approaches can be used: one-vs-
one classification or one-vs-all classification. A one-vs-one approach is used here, meaning that for K
classes, an SVM is created for each pair of classes and an observation is classified as the class to which
it was most frequently assigned across these K-choose-2 SVMs.5
4 Gareth James, et al., “Support Vector Machines,” in An Introduction to Statistical Learning with Applications in R,
337-358.
5 Ibid.
SVM works well for smaller datasets with a larger number of features and for data with a clear margin of
separation. However, issues arise when the dataset is large and/or when classes overlap 6, as experienced
when attempting to perform SVM on this dataset. With almost 70,000 observations in the training
dataset, computation time was quite lengthy and thus attempts to tune the model were very limited.
Using the “e1071” library in R, the tune() function ran for over 24 hours without completing and thus was
abandoned for SVM as well. Using a smaller portion of the training set (first 20,000 rows) allowed the
ksvm function from the “kernlab” library to produce results within a reasonable time frame. After several
iterations, it was found that a radial kernel with a high cost tuning value (C=1,000) resulted in the highest
prediction accuracy, 91.0%. A table showing the training prediction accuracies for all attempts can be
seen in Table 7 in Appendix A. Running the test data through this model results in a 90.6% prediction
accuracy and 13,773 support vectors. The confusion matrix for the test data is as follows:
SVM Confusion Matrix for Test Data
Given that destination class 7 (‘US’) makes up about 70% of the training data and a large
number of the other classes are miscategorized into this class, weighting the response classes may be
necessary; however, time did not allow for exploration of this approach.
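One option would be inverse-frequency class weights, which kernlab's ksvm() accepts via its class.weights argument; a toy sketch with illustrative counts (the commented-out call mirrors the models in Appendix A):

```r
# Toy class counts mirroring the imbalance
counts <- c(US = 12500, other = 2000, FR = 950)
w <- max(counts) / counts  # weight rare classes more heavily
round(w, 2)  # US = 1.00, other = 6.25, FR = 13.16
# svm.weighted <- ksvm(ytrain ~ ., data = train2, kernel = "rbfdot",
#                      C = 1000, class.weights = w)
```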
Decision Trees (Random Forest)
Decision trees are a class of predictive data mining tools which predict either a categorical or continuous
response variable. They get their name from the structure of the models built. A series of decisions are
made to segment the data into homogeneous subgroups. This is also called recursive partitioning. When
drawn out graphically, the model can resemble a tree with branches. Random forests or random decision
forests are an ensemble learning method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. Random decision forests
correct for decision trees' habit of overfitting to their training set.7
Decision Trees and Random Forests do not need to have categorical variables one-hot-encoded with
dummy variables and can handle multiple factors within a feature; I will use a data set that follows this
6 https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
7 https://en.wikipedia.org/wiki/Random_forest
structure. I fit a Random Forest model using the default parameters for the randomForest() function
from the randomForest library and all 16 remaining predictor variables, which can be referenced in
Table 2 in Appendix A. The resulting model yielded a test prediction accuracy rate of 71.1% (see
Random Forest Confusion Matrix below). The mean decrease in Gini coefficient is a measure of how
each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.8 The
Age variable has the largest mean decrease at 4671.54, the values for the remaining variables can be seen
in Table 5.
Random Forest Confusion Matrix
Neural Network
A neural network uses algorithms that can recognize patterns in data, in a way similar to how the human
brain works. Neural networks are made of layers of nodes, which weight the inputs they receive in order
to optimize an algorithm to predict the specified output.9
To create this neural network, because of the constraints on computing with such a large dataset, the
problem was simplified to predict only 1 for ‘US’ or 0 for ‘non-US’. Also, as neural networks do
not work well with sparse data sets10, the model’s parameters were reduced and only two of the most
significant predictors, activity_to_booking and age, were used to predict the test observations.
The neural network was run with 2 hidden layers, with 4 and 2 nodes respectively. The threshold value
chosen was 0.01; the run ended with an error of 7134.3 after 1030 steps. The graph for the output network
plot can be found in Figure 6 in Appendix A. After training, the neural network predicted that every test
observation would be ‘US’, resulting in a prediction accuracy of 71.4%.
Overall the neural network did not perform well with the data I had as neural networks do not work well
with sparse data sets. Below is the table showing the neural network simply choosing ‘US’ for all test
observations.
8 https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html
9 https://skymind.ai/wiki/neural-network#define
10 https://www.quora.com/Why-are-deep-neural-networks-so-bad-with-sparse-data
Conclusion
For most of the statistical methods, the main problem was that the models predicted ‘US’ for most of the
observations. With 71.4% of the test observations having the country destination of ‘US’, this meant that
most of the models would approach this threshold without surpassing it. Also, because of the large
datasets, the ability to fine-tune the models was limited, as running cross-validation to determine
model parameters took too much time.
For KNN, the model with k = 10 performed the best compared to the k = 1 and k = 5 models, with a
prediction accuracy of 71%. However, almost all of the test observations were predicted to be ‘US’,
showing the limitations of this model.
For XGBoost, the same problem was encountered where, while the model had a prediction accuracy of
71.4%, almost all of the test observations were predicted to be ‘US’.
For the Random Forest, the prediction accuracy of the model was 71.1%, and it had the same problem
as the previous models in predicting almost all of the test observations to be ‘US’.
For the neural network, the problem was simplified to predict either ‘US’ or non-‘US’; however, even
with this simplification the neural network still predicted every test observation to be ‘US’, resulting in a
prediction accuracy of 71.4%.
The best model was the SVM that used a one-vs-one approach. The model had a prediction accuracy for
the test data of 90.6%. This model seemed to work the best with the data and most accurately predicted
the country destination of the test users.
Looking at which variables were the best predictors, Tables 5 and 6 in Appendix A show the best
predictors of country destination. The age, gender, account_to_booking, and
activity_to_booking variables were the best predictors overall of country destination.
Below is a summary table of each method used and the corresponding prediction accuracy of the model.
Modeling Method Prediction Accuracy
kNN, k = 1 55.9%
kNN, k = 5 68.9%
kNN, k = 10 71.0%
XGBoost (default parameters) 71.4%
SVM, kernel = radial, C = 1,000 90.6%
Random Forest (default parameters) 71.1%
Neural Network 71.4%
Appendix A
Table 1: tr.users (train_users_2.csv) Dataset Summary
> summary(tr.users)
id date_account_created timestamp_first_active
00023iyk9l: 1 2014-05-13: 674 Min. :2.01e+13
0005ytdols: 1 2014-06-24: 670 1st Qu.:2.01e+13
000guo2307: 1 2014-06-25: 636 Median :2.01e+13
000wc9mlv3: 1 2014-05-20: 632 Mean :2.01e+13
0012yo8hu2: 1 2014-05-14: 622 3rd Qu.:2.01e+13
001357912w: 1 2014-05-21: 602 Max. :2.01e+13
(Other) :213445 (Other) :209615
date_first_booking gender age signup_method
:124543 -unknown-:95688 Min. : 1 basic :152897
2014-05-22: 248 FEMALE :63041 1st Qu.: 28 facebook: 60008
2014-06-11: 231 MALE :54440 Median : 34 google : 546
2014-06-24: 226 OTHER : 282 Mean : 50
2014-05-21: 225 3rd Qu.: 43
2014-06-10: 223 Max. :2014
(Other) : 87755 NA's :87990
signup_flow language affiliate_channel affiliate_provider
Min. : 0.00 en :206314 direct :137727 direct :137426
1st Qu.: 0.00 zh : 1632 sem-brand : 26045 google : 51693
Median : 0.00 fr : 1172 sem-non-brand: 18844 other : 12549
Mean : 3.27 es : 915 other : 8961 craigslist: 3471
3rd Qu.: 0.00 ko : 747 seo : 8663 bing : 2328
Max. :25.00 de : 732 api : 8167 facebook : 2273
(Other): 1939 (Other) : 5044 (Other) : 3711
first_affiliate_tracked signup_app first_device_type
untracked :109232 Android: 5454 Mac Desktop :89600
linked : 46287 iOS : 19019 Windows Desktop:72716
omg : 43982 Moweb : 6261 iPhone :20759
tracked-other: 6156 Web :182717 iPad :14339
: 6065 Other/Unknown :10667
product : 1556 Android Phone : 2803
(Other) : 173 (Other) : 2567
first_browser country_destination
Chrome :63845 NDF :124543
Safari :45169 US : 62376
Firefox :33655 other : 10094
-unknown- :27266 FR : 5023
IE :21068 IT : 2835
Mobile Safari:19274 GB : 2324
(Other) : 3174 (Other): 6256
Table 2: Airbnb.not.ohe.train Dataset Summary
Figure 1: tr.user Missing Data
Figure 2: Distribution of Missing age Observation by country_destination
Figure 3: Distribution of Response Variable (country_destination)
Figure 4: Distribution of Missing date_first_booking Observations by
country_destination
Figure 5: Distribution of Missing age and date_first_booking (both_missing)
Observations by country_destination
Table 3: Confusion matrix for KNN when k = 1
Table 4: Confusion matrix for KNN when k = 5
Table 5: Random Forest Variables Importance
Table 6: XGBoost Top 10 Variable Importance
Table 7: SVM Prediction Accuracies for Reduced Training Data
Figure 6: Neural Network Plot
R Code Listing 1: KNN Model
library(class)
# k = 1
knn.pred = knn(train.X, test.X, train.Y, k = 1)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.5571322
# k = 5
knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.688276
# k = 10
knn.pred = knn(train.X, test.X, train.Y, k = 10)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.7094467
# baseline: always predicting class 7 ('US')
mean(7 == te.Y)
[1] 0.7137953
R Code Listing 2: XGBoost Model
# load the processed datasets
airbnb.full.rank <-readRDS("final_airbnb.full.rank.RDS")
airbnb.full.rank.ids <-readRDS("final_airbnb.full.rank.ids.RDS")
X.train.full.rank <-readRDS("final_X.train.full.rank.RDS")
X.test.full.rank <-readRDS("final_X.test.full.rank.RDS")
Y.train.full.rank <-readRDS("final_Y.train.full.rank.RDS")
Y.test.full.rank <-readRDS("final_Y.test.full.rank.RDS")
airbnb.false.rank <-readRDS("final_airbnb.false.rank.RDS")
airbnb.false.rank.ids <-readRDS("final_airbnb.false.rank.ids.RDS")
X.train.false.rank <-readRDS("final_X.train.false.rank.RDS")
X.test.false.rank <-readRDS("final_X.test.false.rank.RDS")
Y.train.false.rank <-readRDS("final_Y.train.false.rank.RDS")
Y.test.false.rank <-readRDS("final_Y.test.false.rank.RDS")
options(digits =3)
library(tidyverse)
## ── Attaching packages
────────────────────────────────────────────────────────────────────
tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts
───────────────────────────────────────────────────────────────────────
tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dlookr)
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# set up labels for xgboost() function
train.labels <-as.matrix(Y.train.full.rank)
test.labels <-as.matrix(Y.test.full.rank)
# train xgboost
xgb.model <-xgboost(data =data.matrix(X.train.full.rank),
label =train.labels,
eta =0.3,
max_depth =6,
min_child_weight =1,
nround =25,
subsample =1,
colsample_bytree =1,
num_parallel_tree =1,
seed =1,
eval_metric ="merror",
objective ="multi:softmax",
num_class =8,
nthread =8
)
## [1] train-merror:0.286070
## [2] train-merror:0.286156
## [3] train-merror:0.286141
## [4] train-merror:0.286113
## [5] train-merror:0.286056
## [6] train-merror:0.286056
## [7] train-merror:0.286013
## [8] train-merror:0.285970
## [9] train-merror:0.285855
## [10] train-merror:0.285798
## [11] train-merror:0.285769
## [12] train-merror:0.285612
## [13] train-merror:0.285512
## [14] train-merror:0.285440
## [15] train-merror:0.285383
## [16] train-merror:0.285369
## [17] train-merror:0.285355
## [18] train-merror:0.285340
## [19] train-merror:0.285297
## [20] train-merror:0.285183
## [21] train-merror:0.285054
## [22] train-merror:0.284954
## [23] train-merror:0.284954
## [24] train-merror:0.284897
## [25] train-merror:0.284854
# predict values in test set
y_pred <-predict(xgb.model, data.matrix(X.test.full.rank))
# ensure prediction factor levels match the test factor levels
y_pred <-factor(y_pred, levels =c(0,1,2,3,4,5,6,7))
# change the prediction and test labels back to character country values
y_pred <-as.factor(recode(y_pred,
'0'='CA', '1'='DE', '2'='ES', '3'='FR',
'4'='GB', '5'='IT', '6'='other', '7'='US')
)
y_test <-as.factor(recode(Y.test.full.rank,
'0'='CA', '1'='DE', '2'='ES', '3'='FR',
'4'='GB', '5'='IT', '6'='other', '7'='US')
)
# check the test error
table(y_pred, y_test)
## y_test
## y_pred CA DE ES FR GB IT other US
## CA 0 0 0 0 0 0 0 0
## DE 0 0 0 0 0 0 0 0
## ES 0 0 0 0 0 0 0 1
## FR 0 0 0 0 0 0 0 1
## GB 0 0 0 0 0 1 0 0
## IT 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 1 2
## US 304 232 443 961 468 571 2021 12471
mean(y_test ==y_pred)
## [1] 0.714
confusionMatrix(y_pred,
y_test
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CA DE ES FR GB IT other US
## CA 0 0 0 0 0 0 0 0
## DE 0 0 0 0 0 0 0 0
## ES 0 0 0 0 0 0 0 1
## FR 0 0 0 0 0 0 0 1
## GB 0 0 0 0 0 1 0 0
## IT 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 1 2
## US 304 232 443 961 468 571 2021 12471
##
## Overall Statistics
##
## Accuracy : 0.714
## 95% CI : (0.707, 0.72)
## No Information Rate : 0.714
## P-Value [Acc > NIR] : 0.524
##
## Kappa : 0
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: CA Class: DE Class: ES Class: FR Class: GB
## Sensitivity 0.0000 0.0000 0.00e+00 0.00e+00 0.00e+00
## Specificity 1.0000 1.0000 1.00e+00 1.00e+00 1.00e+00
## Pos Pred Value NaN NaN 0.00e+00 0.00e+00 0.00e+00
## Neg Pred Value 0.9826 0.9867 9.75e-01 9.45e-01 9.73e-01
## Prevalence 0.0174 0.0133 2.53e-02 5.50e-02 2.68e-02
## Detection Rate 0.0000 0.0000 0.00e+00 0.00e+00 0.00e+00
## Detection Prevalence 0.0000 0.0000 5.72e-05 5.72e-05 5.72e-05
## Balanced Accuracy 0.5000 0.5000 5.00e-01 5.00e-01 5.00e-01
## Class: IT Class: other Class: US
## Sensitivity 0.0000 4.95e-04 0.9997
## Specificity 1.0000 1.00e+00 0.0004
## Pos Pred Value NaN 3.33e-01 0.7138
## Neg Pred Value 0.9673 8.84e-01 0.3333
## Prevalence 0.0327 1.16e-01 0.7138
## Detection Rate 0.0000 5.72e-05 0.7136
## Detection Prevalence 0.0000 1.72e-04 0.9997
## Balanced Accuracy 0.5000 5.00e-01 0.5000
# feature importance
importance_matrix <- xgb.importance(names(input_x), model = xgb.model)
importance_matrix[1:10, ] # list out the top 10 most important features
## Feature Gain Cover Frequency Importance
## 1: age 0.1962 0.14446 0.21529 0.1962
## 2: activity_to_booking 0.1651 0.09900 0.16876 0.1651
## 3: account_to_booking 0.0672 0.05121 0.04281 0.0672
## 4: affiliate_channel.other 0.0283 0.04286 0.00955 0.0283
## 5: date_first_booking_season.Spring 0.0275 0.03514 0.01898 0.0275
## 6: gender.FEMALE 0.0263 0.01403 0.02246 0.0263
## 7: gender.MALE 0.0231 0.02638 0.02358 0.0231
## 8: signup_app.Web 0.0227 0.03282 0.01067 0.0227
## 9: signup_method.facebook 0.0204 0.00585 0.02370 0.0204
## 10: affiliate_channel.sem.non.brand 0.0180 0.01613 0.01228 0.0180
R Code Listing 3: Support Vector Machine
###training data
ytrain=as.factor(readRDS("final_Y.train.false.rank.RDS"))
xtrain=readRDS("final_X.train.false.rank.RDS")
train = cbind(xtrain, ytrain)
train2 = train[1:20000,]
library(kernlab)
set.seed(2345)
svm1 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=2)
pred_y=predict(svm1,train2[,1:144])
table(pred_y,train2$ytrain)
svm1
svm2 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", C=2)
pred_y2=predict(svm2,train2[,1:144])
table(pred_y2,train2$ytrain)
svm2
svm3 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", degree
= 4, C=2)
pred_y3=predict(svm3,train2[,1:144])
table(pred_y3,train2$ytrain)
svm3
svm4 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=10)
pred_y4=predict(svm4,train2[,1:144])
table(pred_y4,train2$ytrain)
svm4
svm5 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", C=2)
pred_y5=predict(svm5,train2[,1:144])
table(pred_y5,train2$ytrain)
svm5
svm6 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=100)
pred_y6=predict(svm6,train2[,1:144])
table(pred_y6,train2$ytrain)
svm6
svm7 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=500)
pred_y7=predict(svm7,train2[,1:144])
table(pred_y7,train2$ytrain)
svm7
svm8 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=1000)
pred_y8=predict(svm8,train2[,1:144])
table(pred_y8,train2$ytrain)
svm8
svm9 = ksvm(train$ytrain~., data=train, scaled=T, kernel="rbfdot", C=1000)
pred_y9=predict(svm9,xtrain)
table(pred_y9,train$ytrain)
svm9
###test data
ytest=as.factor(readRDS("final_Y.test.false.rank.RDS"))
xtest=readRDS("final_X.test.false.rank.RDS")
test = cbind(xtest, ytest)
# predict the held-out test data with the model trained on the full training set
pred_y_test=predict(svm9, xtest)
table(pred_y_test, test$ytest)
(209+187+312+663+329+379+1305+12455)/(17477)
R Code Listing 4: Decision Trees (Random Forest)
library(randomForest)
airbnb.tree.3 = randomForest(country_destination ~ ., data = airbnb.not.ohe.train)
summary(airbnb.tree.3)
importance(airbnb.tree.3)
airbnb.tree.3.predict = predict(airbnb.tree.3, newdata = airbnb.not.ohe.test)
summary(airbnb.tree.3.predict)
# Confusion Matrix
table(airbnb.not.ohe.test$country_destination, airbnb.tree.3.predict)
misclass.pred = sum(airbnb.not.ohe.test$country_destination != airbnb.tree.3.predict)
misclass.pred / length(airbnb.tree.3.predict)
R Code Listing 5: Neural Network
library(neuralnet)
# binary response: 1 for 'US' (class 7), 0 otherwise
train[,145] = ifelse(train.Y < 7, 0, 1)
names(train)[145] <- "US"
net = neuralnet(US ~ age + activity_to_booking, data=train, hidden=c(4,2),
linear.output=FALSE, threshold=0.01)
newTest.Y = ifelse(test.Y < 7, 0, 1)
net.results <- compute(net, newTest.X)
results <- data.frame(actual = newTest.Y, prediction = net.results$net.result)
roundedresults <- sapply(results, round, digits=0)
roundedresultsdf = data.frame(roundedresults)
table(roundedresultsdf$actual, roundedresultsdf$prediction)