Classifying What Destinations Airbnb Users Choose
Moazum Munawer
Abstract
The importance of an industry-wide emphasis on analytics underscores the valuable insights that can be
derived from user data. Machine learning and predictive modeling in analyzing user behavior and data
can result in highly effective targeted ads which increase the probability of a user making a purchase. If a
company knows what their user is interested in then they can market the right product to them. Using data
from 2010 to 2015 of Airbnb users, the current study uses a range of classification and predictive models
to understand the relationship between select Airbnb user behavioral features and which country
destination a user books. Significant predictors of country destinations included user age, length of time a
user took from their first activity to their booking, and length of time between creating an account and
booking. Methods utilized were k-Nearest Neighbors, Extreme Gradient Boosting, Support Vector
Machine, Random Forest, and Neural Networks. Most models showed similar performance predicting
country destination, with accuracies hovering near 71%. Note that the data are unbalanced, with ‘US’ as
the country destination in 71% of the observations in the processed training data. The best-performing
model was the Support Vector Machine, with an accuracy of 90.6%.
Keywords
Airbnb, Classification, Decision Trees, Boosting, Country Destination, Support Vector Machine, Extreme
Gradient Boosting, Random Forest, k-Nearest Neighbors, Neural Networks, Marketing, Predictive
Models, Exploratory Data Analysis
Introduction
Description
The goal of this analysis is to build a model that will accurately predict where a new Airbnb user will
book their first travel experience. Airbnb allows users to book accommodations in more than 81,000
cities and 191 countries1; predicting the destination country of a user’s first booking allows Airbnb to
create a more personalized experience for their users, as well as make the booking process more efficient.
The objective of this analysis is to determine which classification technique best predicts the destination
country of first booking. Several classification techniques are explored, including K-nearest neighbors,
decision trees, support vector machines, boosting, and neural networks. The dataset used for analysis
includes user data collected between 2010 and 2015; any records from users that did not book an
accommodation were disregarded. Both qualitative and quantitative variables are included as predictors;
one-hot encoding is used to convert each level of the categorical variables to its own dummy variable for
better model performance. The dataset was split into separate training and test sets for model building and
validation.
Research Questions
This analysis is interested in determining how user characteristics can be used to best predict the
destination country for a user’s first booking on Airbnb.
Statistical Questions
Specific statistical questions of interest for this analysis include:
- Which classification method best predicts the destination country?
- Which user variables are significant predictors of destination country?
Variables of Interest
The response variable of interest is destination country. The predictor variables include all available user
information, including characteristics such as age, gender, signup method, and other web session data
collected when a user makes their first booking.
1 https://press.airbnb.com/about-us
Exploratory Data Analysis
Data Source
The “airbnb” datasets are from the Airbnb New User Bookings data science competition hosted on
Kaggle. The competition goal was to predict a new user’s first booking destination given a list of users
along with their demographics, web session records, and some summary statistics. The dataset used in
this analysis is:
● train_users_2.csv - the training set of users consisting of 213,451 observations on the following
16 variables:
○ id: user id
○ date_account_created: the date of account creation
○ timestamp_first_active: timestamp of the first activity, note that it can be earlier than
date_account_created or date_first_booking because a user can search before
signing up
○ date_first_booking: date of first booking
○ gender
○ age
○ signup_method
○ signup_flow: the page a user came to sign up from
○ language: international language preference
○ affiliate_channel: what kind of paid marketing
○ affiliate_provider: where the marketing is e.g. google, craigslist, other
○ first_affiliate_tracked: the first marketing the user interacted with before
signing up
○ signup_app
○ first_device_type
○ first_browser
○ country_destination: the destination country of the user’s first booking.
■ There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES',
'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that
'NDF' is different from 'other' because 'other' means there was a booking, but is to a
country not included in the list, while 'NDF' means there wasn't a booking.
The train_users_2.csv data file is read into R as tr.users dataset.
Data Quality
The tr.users dataset is not a “tidy” dataset. For example, there are 87,990 (~41%) missing
observations in the age feature and 124,543 (~58%) “empty” observations in the
date_first_booking feature (see Table 1, Figure 1, and Figure 2 in Appendix A).
There is also evidence of class imbalance in the response variable, country_destination. Close to
60% of country_destination observations are 'NDF' and close to 30% are 'US' (see Figure 3 in
Appendix A). This may impact our classification methods.
Given the data missingness and class imbalance, further analysis was performed to determine any
patterns that might help in handling these dataset issues. I discovered that all “empty” observations in
date_first_booking were restricted to country_destination = NDF (see Figure 4 in Appendix
A). Further, the missingness in age and date_first_booking often occurred together
when the country_destination is ‘NDF’ (see Figure 5 in Appendix A).
To decrease the missingness in both age and date_first_booking, I eliminated the rows where
country_destination is ‘NDF’. This also slightly improves the class imbalance issue, since NDF had
the highest count of observations. To further improve the class imbalance, I chose to eliminate
country_destination classes with fewer than 1,000 observations. The resulting dataset of 87,390
rows still contained ~23% missing age observations.
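As a minimal sketch of these filtering steps (a toy data frame stands in for tr.users, with the frequency threshold shrunk from 1,000 to 3 for illustration):

```r
# Toy stand-in for tr.users; the real threshold was 1,000 observations
toy <- data.frame(
  country_destination = factor(c(rep("NDF", 5), rep("US", 6), rep("FR", 4), "PT"))
)
# Drop rows with no booking
toy <- toy[toy$country_destination != "NDF", , drop = FALSE]
# Drop destination classes below the frequency threshold (3 here, 1,000 in the report)
counts <- table(droplevels(toy$country_destination))
keep.classes <- names(counts)[counts >= 3]
toy <- toy[toy$country_destination %in% keep.classes, , drop = FALSE]
toy$country_destination <- droplevels(toy$country_destination)
levels(toy$country_destination)  # "FR" "US"
```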
The age feature contained obvious erroneous values such as 1924, 1925, 2014, etc., which appeared to
be birth years instead of age values. I chose to “correct” these erroneous values by subtracting them
from 2014 to derive a more realistic age value. The age feature also contained several values below 18
years of age and many of 100+ years. Given that the end-user license agreement for Airbnb references
a minimum age of 18, I chose to replace age values less than 18 with NA values. I also chose 85 years of
age as the upper limit, since it is difficult to envision many older individuals being tech-savvy enough and
open to an experience like Airbnb stays; age values over 85 were therefore also replaced with
NA values. The resulting missingness for age increased to ~24.5%. I then used the knnImputation()
function from the DMwR library to impute the missing values in age.
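The age corrections described above can be sketched in base R (toy vector; the remaining NA values would then go through DMwR::knnImputation()):

```r
age <- c(28, 1985, 16, 110, 45)
# Values that look like birth years: subtract from 2014 to recover an age
age <- ifelse(age > 1000, 2014 - age, age)
# Enforce the 18-85 plausibility window with NA values
age[age < 18 | age > 85] <- NA
age  # 28 29 NA NA 45
```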
I attempted to use Principal Component Analysis (PCA) with k-Means Clustering for visualizing the
cleansed tr.users dataset. However, the resulting clustering was not useful for summarizing and
visualizing the dataset given the large portion of categorical features.
Feature Engineering
I converted date_account_created and date_first_booking features to seasons
(date_account_created_season and date_first_booking_season) based on the Northern
Meteorological Seasons definition where the season starts within the month of the equinox instead of a
specific date within the month. I also created lag features using the date_account_created,
date_first_booking, and timestamp_first_active as follows:
● activity_to_account = date_account_created - timestamp_first_active
● activity_to_booking = date_first_booking - timestamp_first_active
● account_to_booking = date_first_booking - date_account_created
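A minimal sketch of these lag features with base R dates (in the real data, timestamp_first_active is a numeric timestamp that must be parsed to a date first):

```r
first_active    <- as.Date("2014-05-01")  # parsed from timestamp_first_active
account_created <- as.Date("2014-05-03")
first_booking   <- as.Date("2014-05-10")
activity_to_account <- as.numeric(account_created - first_active)   # 2 days
activity_to_booking <- as.numeric(first_booking - first_active)     # 9 days
account_to_booking  <- as.numeric(first_booking - account_created)  # 7 days
```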
Finally, I converted the categorical features using one-hot encoding. I chose to create two datasets based
on different one-hot-encoded (OHE) methodologies. For distance-based classification modeling, I created
an OHE dataset (airbnb.false.rank) where dummy variables were established for each and every
unique value within a categorical variable, also known as the k-dummy-variable approach, which
resulted in 145 total features. For non-distance-based classification modeling, I created an OHE dataset
(airbnb.full.rank) where dummy variables were established for the k-1 unique values within each
categorical variable, which resulted in 133 total features.
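The difference between the two encodings can be sketched with base R's model.matrix() on a single toy factor (for many factors at once, caret::dummyVars() with its fullRank argument behaves analogously):

```r
x <- factor(c("a", "b", "c", "a"))
full.rank  <- model.matrix(~ x)[, -1]  # k-1 dummies per factor (intercept dropped)
false.rank <- model.matrix(~ x - 1)    # k dummies per factor
ncol(full.rank)   # 2
ncol(false.rank)  # 3
```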
The response variable, country_destination, was converted to a numeric factor,
country_destination_num, using the following mapping: ‘CA’ = 0, ‘DE’ = 1, ‘ES’ = 2,
‘FR’ = 3, ‘GB’ = 4, ‘IT’ = 5, ‘other’ = 6, and ‘US’ = 7. The conversion was
necessary since many of the classification methods prefer numeric instead of character values.
Train/Test Dataset Creation
Each of the OHE datasets was split using an 80/20 ratio to create training and test datasets.
These training and test (hold out) datasets were used in the modeling efforts detailed in the Analysis
section of the report. The training datasets contained 69,913 observations while the test datasets contained
17,477 observations. The training datasets were used for modeling and the test datasets were used to
measure the prediction accuracy of the models.
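A minimal sketch of the 80/20 split, assuming a toy row count in place of the 87,390 processed rows:

```r
set.seed(42)
n <- 1000  # stand-in for the 87,390 processed rows
train.idx <- sample(seq_len(n), size = floor(0.8 * n))  # 80% for training
test.idx  <- setdiff(seq_len(n), train.idx)             # remaining 20% held out
length(train.idx)  # 800
length(test.idx)   # 200
```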
Analysis
I applied several classification methodologies to determine which method achieved the best overall
prediction accuracy for the datasets. Each model was trained using the training data and then tested
against the hold-out (test) dataset to determine the overall prediction accuracy.
k-Nearest Neighbor (KNN)
KNN is a classifier that first identifies the K points in the training data whose features are closest to those
of the test observation, then uses the responses of those K neighbors to estimate a conditional probability
for each potential class, and finally assigns the test observation to the class with the highest probability.
For this analysis, I used KNN to predict the country destination for each user in the test data based on
the country destinations of the closest users in the training data.
I tried different values of K (1, 5, and 10) and produced confusion matrices for all three values.
When using K = 1, the model correctly assigns the test observation to the right class 55.9% of the time.
Refer to Table 3 in Appendix A for the confusion matrix.
In order to increase the proportion of correct predictions, the number of neighbors was increased
to 5.
Using K = 5, the model assigns the test observation to the right class 68.9% of the time, an increase of
13 percentage points from K = 1. A larger value of K was then tried to see if it yields a larger proportion
of correct assignments. Refer to Table 4 in Appendix A for the confusion matrix.
Using K = 10, the model correctly assigns the test observation to the right class 71% of the time. While
increasing K caused the number of correct predictions to increase, the percentage only climbed by 2.1%.
Refer to the table below for the confusion matrix.
Confusion matrix for KNN when k = 10
When increasing the value of K, the KNN model correctly assigned more of the test observations to the
right class. However, the model mostly increased its number of correct predictions by assigning more
observations to class 7. Since 71.4% of the observations in the test data set belong to class 7 (‘US’),
simply assigning more observations to that class makes the model appear to do a better job of predicting
the responses of the test observations.
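That 71.4% figure is simply the no-information rate, i.e. the accuracy of always predicting the majority class; a toy sketch with illustrative counts:

```r
# Toy test labels with a 71% majority class
y <- factor(c(rep("US", 71), rep("FR", 20), rep("IT", 9)))
nir <- max(table(y)) / length(y)  # accuracy of always predicting 'US'
nir  # 0.71
```

Any model whose accuracy hovers near this value may just be echoing the class imbalance rather than learning.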
Extreme Gradient Boosting (XGBoost)
Gradient boosting is a machine learning technique for regression and classification problems, which
produces a prediction model in the form of an ensemble of weak prediction models, typically decision
trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them
by allowing optimization of an arbitrary differentiable loss function.2
XGBoost is one implementation of the gradient boosting concept, but what makes XGBoost unique
is that it uses “a more regularized model formalization to control over-fitting, which gives it better
performance,” according to the author of the algorithm, Tianqi Chen. Therefore, it helps to reduce
overfitting.3
2 https://en.wikipedia.org/wiki/Gradient_boosting
3 https://blog.exploratory.io/introduction-to-extreme-gradient-boosting-in-exploratory-7bbec554ac7
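As a hedged sketch, the regularization Chen refers to is exposed through xgboost's lambda (L2) and alpha (L1) penalty parameters; the values below are illustrative, and the commented-out call mirrors the listing in Appendix A:

```r
# Illustrative parameter list; lambda/alpha are xgboost's L2/L1 penalties
params <- list(
  objective = "multi:softmax",
  num_class = 8,
  eta       = 0.3,
  max_depth = 6,
  lambda    = 1,  # L2 regularization on leaf weights (the xgboost default)
  alpha     = 0   # L1 regularization (off by default)
)
# xgb.model <- xgboost(data = data.matrix(X.train.full.rank),
#                      label = train.labels, params = params, nrounds = 25)
```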
I attempted to use hyperparameter tuning, but model run times were exceedingly long (e.g. multiple days),
so the effort was abandoned due to time constraints. Hyperparameter tuning was lengthy due to the
training data dimensions (69,913 observations of 131 predictors).
I chose to run the default parameter values for the xgboost() function from the xgboost library. The
resulting xgboost model yielded a test prediction accuracy of 71.4% which is roughly the same as
guessing ‘US’ as destination country for each prediction (see XGBoost Confusion Matrix for Test Data
below). The class imbalance is likely the cause of the model favoring ‘US’ as the destination country of
the first booking.
XGBoost Confusion Matrix for Test Data
The feature importance was determined using the xgb.importance() function on the training dataset.
The top ten (10) features listed in decreasing level of importance are: age, activity_to_booking,
account_to_booking, affiliate_channel.other, date_first_booking_season.Spring,
gender.FEMALE, gender.MALE, signup_app.Web, signup_method.facebook, and
affiliate_channel.sem.non.brand. Refer to Table 6 in Appendix A for additional details.
Support Vector Machine (SVM)
The support vector machine is a classification model that is an extension of the support vector classifier;
the goal of these methods is to find a hyperplane that separates the data as well as possible. A kernel
approach is used in SVM to enlarge the feature space and allow for a non-linear boundary; the two kernel
types explored in this analysis are the polynomial and radial kernels. A value must be selected for the
cost tuning parameter, which is important as it determines the extent to which the model underfits or
overfits the data. A small value means the cost of misclassification is low and thus allows more
observations to be misclassified; a larger value penalizes misclassification more heavily and typically
yields higher training accuracy4.
When using SVM for a multi-class problem such as this analysis, two approaches can be used: one-vs-
one classification or one-vs-all classification. A one-vs-one approach is used here, meaning that for K
classes, an SVM is created for each pair of classes and an observation is classified as the class to which
it was most frequently assigned across these K-choose-2 SVMs.5
4 Gareth James, et al., “Support Vector Machines,” in An Introduction to Statistical Learning with Applications in R,
337-358.
5 Ibid.
SVM works well for smaller datasets with a larger number of features and for data with a clear margin of
separation. However, issues arise when the dataset is large and/or when classes overlap 6, as experienced
when attempting to perform SVM on this dataset. With almost 70,000 observations in the training
dataset, computation time was quite lengthy and thus attempts to tune the model were very limited.
Using the “e1071” library in R, the tune() function ran for over 24 hours without completing and thus was
abandoned for SVM as well. Using a smaller portion of the training set (first 20,000 rows) allowed the
ksvm function from the “kernlab” library to produce results within a reasonable time frame. After several
iterations, it was found that a radial kernel with a high cost tuning value (C=1,000) resulted in the highest
prediction accuracy, 91.0%. A table showing the training prediction accuracies for all attempts can be
seen in Table 7 in Appendix A. Running the test data through this model results in a 90.6% prediction
accuracy and 13,773 support vectors. The confusion matrix for the test data is as follows:
SVM Confusion Matrix for Test Data
Given that destination class 7 (‘US’) makes up about 70% of the training data and a large
number of the other classes are miscategorized into this class, weighting the response classes may be
necessary; however, time did not allow for exploration of this approach.
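One option would be inverse-frequency class weights, which kernlab's ksvm() accepts via its class.weights argument; a toy sketch with illustrative counts (the commented-out call mirrors the models in Appendix A):

```r
# Toy class counts mirroring the imbalance
counts <- c(US = 12500, other = 2000, FR = 950)
w <- max(counts) / counts  # weight rare classes more heavily
round(w, 2)  # US = 1.00, other = 6.25, FR = 13.16
# svm.weighted <- ksvm(ytrain ~ ., data = train2, kernel = "rbfdot",
#                      C = 1000, class.weights = w)
```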
Decision Trees (Random Forest)
Decision trees are a class of predictive data mining tools which predict either a categorical or continuous
response variable. They get their name from the structure of the models built. A series of decisions are
made to segment the data into homogeneous subgroups. This is also called recursive partitioning. When
drawn out graphically, the model can resemble a tree with branches. Random forests or random decision
forests are an ensemble learning method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting the class that is the mode of the
classes (classification) or mean prediction (regression) of the individual trees. Random decision forests
correct for decision trees' habit of overfitting to their training set.7
Decision Trees and Random Forests do not need to have categorical variables one-hot-encoded with
dummy variables and can handle multiple factors within a feature; I will use a data set that follows this
6 https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
7 https://en.wikipedia.org/wiki/Random_forest
structure. I fit a Random Forest model using the default parameters for the randomForest() function
from the randomForest library and all 16 remaining predictor variables, which can be referenced in
Table 2 in Appendix A. The resulting model yielded a test prediction accuracy rate of 71.1% (see
Random Forest Confusion Matrix below). The mean decrease in Gini coefficient is a measure of how
each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest.8 The
Age variable has the largest mean decrease at 4671.54, the values for the remaining variables can be seen
in Table 5.
Random Forest Confusion Matrix
Neural Network
A neural network uses algorithms that can recognize patterns in data, in a way similar to how the human
brain works. Neural networks are made of layers of nodes, which weight the inputs they receive in order
to optimize an algorithm to predict the specified output.9
To create this neural network, because of the constraints on computing with such a large dataset, the
problem was simplified to predict only 1 for ‘US’ or 0 for ‘non-US’. Also, as neural networks do
not work well with sparse data sets10, the model’s parameters were reduced and only two of the most
significant predictors, activity_to_booking and age, were used to predict the test observations.
The neural network was run with 2 hidden layers, with 4 and 2 nodes respectively. The threshold value
chosen was 0.01; the run ended with an error of 7134.3 after 1030 steps. The graph for the output network
plot can be found in Figure 6 in Appendix A. After training, the neural network predicted that every test
observation would be ‘US’, resulting in a prediction accuracy of 71.4%.
Overall the neural network did not perform well with the data I had as neural networks do not work well
with sparse data sets. Below is the table showing the neural network simply choosing ‘US’ for all test
observations.
8 https://dinsdalelab.sdsu.edu/metag.stats/code/randomforest.html
9 https://skymind.ai/wiki/neural-network#define
10 https://www.quora.com/Why-are-deep-neural-networks-so-bad-with-sparse-data
Conclusion
For most of the statistical methods, the main problem was that the models predicted ‘US’ for most of the
observations. With 71.4% of the test observations having the country destination of ‘US’, this meant that
most of the models would approach this threshold without surpassing it. Also, because of the large
datasets, the ability to fine-tune the models was limited, as running cross-validation to determine
model parameters took too much time.
For KNN, the model with k = 10 performed the best compared to the k = 1 and k = 5 models, with a
prediction accuracy of 71%. However, almost all of the test observations were predicted to be ‘US’,
showing the limitations of this model.
For XGBoost, the same problem was encountered where, while the model had a prediction accuracy of
71.4%, almost all of the test observations were predicted to be ‘US’.
For the Random Forest, the prediction accuracy of the model was 71.1%, and it had the same problem
as the previous models in predicting almost all of the test observations to be ‘US’.
For the neural network, the problem was simplified to predict either ‘US’ or non-‘US’; however, even
with this simplification the neural network still predicted every test observation to be ‘US’, resulting in a
prediction accuracy of 71.4%.
The best model was the SVM that used a one-vs-one approach. The model had a prediction accuracy for
the test data of 90.6%. This model seemed to work the best with the data and most accurately predicted
the country destination of the test users.
Looking at which variables were the best predictors, Tables 5 and 6 in Appendix A show the best
predictors of country destination. The age, gender, account_to_booking, and
activity_to_booking variables were the best predictors overall of country destination.
Below is a summary table of each method used and the corresponding prediction accuracy of the model.
Modeling Method Prediction Accuracy
kNN, k = 1 55.9%
kNN, k = 5 68.9%
kNN, k = 10 71.0%
XGBoost (default parameters) 71.4%
SVM, kernel = radial, C = 1,000 90.6%
Random Forest (default parameters) 71.1%
Neural Network 71.4%
Appendix A
Table 1: tr.users (train_users_2.csv) Dataset Summary
> summary(tr.users)
id date_account_created timestamp_first_active
00023iyk9l: 1 2014-05-13: 674 Min. :2.01e+13
0005ytdols: 1 2014-06-24: 670 1st Qu.:2.01e+13
000guo2307: 1 2014-06-25: 636 Median :2.01e+13
000wc9mlv3: 1 2014-05-20: 632 Mean :2.01e+13
0012yo8hu2: 1 2014-05-14: 622 3rd Qu.:2.01e+13
001357912w: 1 2014-05-21: 602 Max. :2.01e+13
(Other) :213445 (Other) :209615
date_first_booking gender age signup_method
:124543 -unknown-:95688 Min. : 1 basic :152897
2014-05-22: 248 FEMALE :63041 1st Qu.: 28 facebook: 60008
2014-06-11: 231 MALE :54440 Median : 34 google : 546
2014-06-24: 226 OTHER : 282 Mean : 50
2014-05-21: 225 3rd Qu.: 43
2014-06-10: 223 Max. :2014
(Other) : 87755 NA's :87990
signup_flow language affiliate_channel affiliate_provider
Min. : 0.00 en :206314 direct :137727 direct :137426
1st Qu.: 0.00 zh : 1632 sem-brand : 26045 google : 51693
Median : 0.00 fr : 1172 sem-non-brand: 18844 other : 12549
Mean : 3.27 es : 915 other : 8961 craigslist: 3471
3rd Qu.: 0.00 ko : 747 seo : 8663 bing : 2328
Max. :25.00 de : 732 api : 8167 facebook : 2273
(Other): 1939 (Other) : 5044 (Other) : 3711
first_affiliate_tracked signup_app first_device_type
untracked :109232 Android: 5454 Mac Desktop :89600
linked : 46287 iOS : 19019 Windows Desktop:72716
omg : 43982 Moweb : 6261 iPhone :20759
tracked-other: 6156 Web :182717 iPad :14339
: 6065 Other/Unknown :10667
product : 1556 Android Phone : 2803
(Other) : 173 (Other) : 2567
first_browser country_destination
Chrome :63845 NDF :124543
Safari :45169 US : 62376
Firefox :33655 other : 10094
-unknown- :27266 FR : 5023
IE :21068 IT : 2835
Mobile Safari:19274 GB : 2324
(Other) : 3174 (Other): 6256
Table 2: Airbnb.not.ohe.train Dataset Summary
Figure 1: tr.user Missing Data
Figure 2: Distribution of Missing age Observation by country_destination
Figure 3: Distribution of Response Variable (country_destination)
Figure 4: Distribution of Missing date_first_booking Observations by
country_destination
Figure 5: Distribution of Missing age and date_first_booking (both_missing)
Observations by country_destination
Table 3: Confusion matrix for KNN when k = 1
Table 4: Confusion matrix for KNN when k = 5
Table 5: Random Forest Variables Importance
Table 6: XGBoost Top 10 Variable Importance
Table 7: SVM Prediction Accuracies for Reduced Training Data
Figure 6: Neural Network Plot
R Code Listing 1: KNN Model
library(class)
# k = 1
knn.pred = knn(train.X, test.X, train.Y, k = 1)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.5571322
# k = 5
knn.pred = knn(train.X, test.X, train.Y, k = 5)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.688276
# k = 10
knn.pred = knn(train.X, test.X, train.Y, k = 10)
table(knn.pred, te.Y)
mean(knn.pred == te.Y)
[1] 0.7094467
# baseline: always predicting class 7 ('US')
mean(7 == te.Y)
[1] 0.7137953
R Code Listing 2: XGBoost Model
# load the processed datasets
airbnb.full.rank <-readRDS("final_airbnb.full.rank.RDS")
airbnb.full.rank.ids <-readRDS("final_airbnb.full.rank.ids.RDS")
X.train.full.rank <-readRDS("final_X.train.full.rank.RDS")
X.test.full.rank <-readRDS("final_X.test.full.rank.RDS")
Y.train.full.rank <-readRDS("final_Y.train.full.rank.RDS")
Y.test.full.rank <-readRDS("final_Y.test.full.rank.RDS")
airbnb.false.rank <-readRDS("final_airbnb.false.rank.RDS")
airbnb.false.rank.ids <-readRDS("final_airbnb.false.rank.ids.RDS")
X.train.false.rank <-readRDS("final_X.train.false.rank.RDS")
X.test.false.rank <-readRDS("final_X.test.false.rank.RDS")
Y.train.false.rank <-readRDS("final_Y.train.false.rank.RDS")
Y.test.false.rank <-readRDS("final_Y.test.false.rank.RDS")
options(digits =3)
library(tidyverse)
## ── Attaching packages
────────────────────────────────────────────────────────────────────
tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts
───────────────────────────────────────────────────────────────────────
tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dlookr)
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:base':
##
## transform
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# set up labels for xgboost() function
train.labels <-as.matrix(Y.train.full.rank)
test.labels <-as.matrix(Y.test.full.rank)
# train xgboost
xgb.model <-xgboost(data =data.matrix(X.train.full.rank),
label =train.labels,
eta =0.3,
max_depth =6,
min_child_weight =1,
nround =25,
subsample =1,
colsample_bytree =1,
num_parallel_tree =1,
seed =1,
eval_metric ="merror",
objective ="multi:softmax",
num_class =8,
nthread =8
)
## [1] train-merror:0.286070
## [2] train-merror:0.286156
## [3] train-merror:0.286141
## [4] train-merror:0.286113
## [5] train-merror:0.286056
## [6] train-merror:0.286056
## [7] train-merror:0.286013
## [8] train-merror:0.285970
## [9] train-merror:0.285855
## [10] train-merror:0.285798
## [11] train-merror:0.285769
## [12] train-merror:0.285612
## [13] train-merror:0.285512
## [14] train-merror:0.285440
## [15] train-merror:0.285383
## [16] train-merror:0.285369
## [17] train-merror:0.285355
## [18] train-merror:0.285340
## [19] train-merror:0.285297
## [20] train-merror:0.285183
## [21] train-merror:0.285054
## [22] train-merror:0.284954
## [23] train-merror:0.284954
## [24] train-merror:0.284897
## [25] train-merror:0.284854
# predict values in test set
y_pred <-predict(xgb.model, data.matrix(X.test.full.rank))
# ensure prediction factor levels match the test factor levels
y_pred <-factor(y_pred, levels =c(0,1,2,3,4,5,6,7))
# change the prediction and test labels back to character country values
y_pred <-as.factor(recode(y_pred,
'0'='CA', '1'='DE', '2'='ES', '3'='FR',
'4'='GB', '5'='IT', '6'='other', '7'='US')
)
y_test <-as.factor(recode(Y.test.full.rank,
'0'='CA', '1'='DE', '2'='ES', '3'='FR',
'4'='GB', '5'='IT', '6'='other', '7'='US')
)
# check the test error
table(y_pred, y_test)
## y_test
## y_pred CA DE ES FR GB IT other US
## CA 0 0 0 0 0 0 0 0
## DE 0 0 0 0 0 0 0 0
## ES 0 0 0 0 0 0 0 1
## FR 0 0 0 0 0 0 0 1
## GB 0 0 0 0 0 1 0 0
## IT 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 1 2
## US 304 232 443 961 468 571 2021 12471
mean(y_test ==y_pred)
## [1] 0.714
confusionMatrix(y_pred,
y_test
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CA DE ES FR GB IT other US
## CA 0 0 0 0 0 0 0 0
## DE 0 0 0 0 0 0 0 0
## ES 0 0 0 0 0 0 0 1
## FR 0 0 0 0 0 0 0 1
## GB 0 0 0 0 0 1 0 0
## IT 0 0 0 0 0 0 0 0
## other 0 0 0 0 0 0 1 2
## US 304 232 443 961 468 571 2021 12471
##
## Overall Statistics
##
## Accuracy : 0.714
## 95% CI : (0.707, 0.72)
## No Information Rate : 0.714
## P-Value [Acc > NIR] : 0.524
##
## Kappa : 0
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: CA Class: DE Class: ES Class: FR Class: GB
## Sensitivity 0.0000 0.0000 0.00e+00 0.00e+00 0.00e+00
## Specificity 1.0000 1.0000 1.00e+00 1.00e+00 1.00e+00
## Pos Pred Value NaN NaN 0.00e+00 0.00e+00 0.00e+00
## Neg Pred Value 0.9826 0.9867 9.75e-01 9.45e-01 9.73e-01
## Prevalence 0.0174 0.0133 2.53e-02 5.50e-02 2.68e-02
## Detection Rate 0.0000 0.0000 0.00e+00 0.00e+00 0.00e+00
## Detection Prevalence 0.0000 0.0000 5.72e-05 5.72e-05 5.72e-05
## Balanced Accuracy 0.5000 0.5000 5.00e-01 5.00e-01 5.00e-01
## Class: IT Class: other Class: US
## Sensitivity 0.0000 4.95e-04 0.9997
## Specificity 1.0000 1.00e+00 0.0004
## Pos Pred Value NaN 3.33e-01 0.7138
## Neg Pred Value 0.9673 8.84e-01 0.3333
## Prevalence 0.0327 1.16e-01 0.7138
## Detection Rate 0.0000 5.72e-05 0.7136
## Detection Prevalence 0.0000 1.72e-04 0.9997
## Balanced Accuracy 0.5000 5.00e-01 0.5000
# feature importance
importance_matrix <- xgb.importance(names(input_x), model = xgb.model)
importance_matrix[1:10, ] # list out the top 10 most important features
## Feature Gain Cover Frequency Importance
## 1: age 0.1962 0.14446 0.21529 0.1962
## 2: activity_to_booking 0.1651 0.09900 0.16876 0.1651
## 3: account_to_booking 0.0672 0.05121 0.04281 0.0672
## 4: affiliate_channel.other 0.0283 0.04286 0.00955 0.0283
## 5: date_first_booking_season.Spring 0.0275 0.03514 0.01898 0.0275
## 6: gender.FEMALE 0.0263 0.01403 0.02246 0.0263
## 7: gender.MALE 0.0231 0.02638 0.02358 0.0231
## 8: signup_app.Web 0.0227 0.03282 0.01067 0.0227
## 9: signup_method.facebook 0.0204 0.00585 0.02370 0.0204
## 10: affiliate_channel.sem.non.brand 0.0180 0.01613 0.01228 0.0180
R Code Listing 3: Support Vector Machine
###training data
ytrain=as.factor(readRDS("final_Y.train.false.rank.RDS"))
xtrain=readRDS("final_X.train.false.rank.RDS")
train = cbind(xtrain, ytrain)
train2 = train[1:20000,]
library(kernlab)
set.seed(2345)
svm1 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=2)
pred_y=predict(svm1,train2[,1:144])
table(pred_y,train2$ytrain)
svm1
svm2 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", C=2)
pred_y2=predict(svm2,train2[,1:144])
table(pred_y2,train2$ytrain)
svm2
svm3 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", degree
= 4, C=2)
pred_y3=predict(svm3,train2[,1:144])
table(pred_y3,train2$ytrain)
svm3
svm4 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=10)
pred_y4=predict(svm4,train2[,1:144])
table(pred_y4,train2$ytrain)
svm4
svm5 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="polydot", C=2)
pred_y5=predict(svm5,train2[,1:144])
table(pred_y5,train2$ytrain)
svm5
svm6 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=100)
pred_y6=predict(svm6,train2[,1:144])
table(pred_y6,train2$ytrain)
svm6
svm7 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=500)
pred_y7=predict(svm7,train2[,1:144])
table(pred_y7,train2$ytrain)
svm7
svm8 = ksvm(train2$ytrain~., data=train2, scaled=T, kernel="rbfdot", C=1000)
pred_y8=predict(svm8,train2[,1:144])
table(pred_y8,train2$ytrain)
svm8
svm9 = ksvm(train$ytrain~., data=train, scaled=T, kernel="rbfdot", C=1000)
pred_y9=predict(svm9,xtrain)
table(pred_y9,train$ytrain)
svm9
###test data
ytest=as.factor(readRDS("final_Y.test.false.rank.RDS"))
xtest=readRDS("final_X.test.false.rank.RDS")
test = cbind(xtest, ytest)
# predict the held-out test data with the model trained on the full training set
pred_y_test=predict(svm9, xtest)
table(pred_y_test, test$ytest)
(209+187+312+663+329+379+1305+12455)/(17477)
R Code Listing 4: Decision Trees (Random Forest)
library(randomForest)
airbnb.tree.3 = randomForest(country_destination ~ ., data = airbnb.not.ohe.train)
summary(airbnb.tree.3)
importance(airbnb.tree.3)
airbnb.tree.3.predict = predict(airbnb.tree.3, newdata = airbnb.not.ohe.test)
summary(airbnb.tree.3.predict)
# Confusion Matrix
table(airbnb.not.ohe.test$country_destination, airbnb.tree.3.predict)
misclass.pred = sum(airbnb.not.ohe.test$country_destination != airbnb.tree.3.predict)
misclass.pred / length(airbnb.tree.3.predict)
R Code Listing 5: Neural Network
library(neuralnet)
# binary response: 1 for 'US' (class 7), 0 otherwise
train[,145] = ifelse(train.Y < 7, 0, 1)
names(train)[145] <- "US"
net = neuralnet(US ~ age + activity_to_booking, data=train, hidden=c(4,2),
linear.output=FALSE, threshold=0.01)
newTest.Y = ifelse(test.Y < 7, 0, 1)
net.results <- compute(net, newTest.X)
results <- data.frame(actual = newTest.Y, prediction = net.results$net.result)
roundedresults <- sapply(results, round, digits=0)
roundedresultsdf = data.frame(roundedresults)
table(roundedresultsdf$actual, roundedresultsdf$prediction)