NYC Open Data Project II: Predict Where to Get and Return My Citibike

Citibike data and prediction

Which station should I choose?

Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)

Check out our blog and newsletters at http://www.supstat.com and http://nycdatascience.com

Citibike data and prediction http://nycdatascience.com/slides/NYCOpenDataMeetup/citibike_v2/citibike.html#1

1 of 58 6/12/14, 5:37 PM

Overview

1. Overview
2. Citibike Data
3. Scraping
4. Data Description
5. Modeling


Citibike

Citibike operates a public bike-share service.

There are many bike stations in NYC.

People want to take a bike from a station with at least one available bike.

And when they get to their destination, they want to return the bike to a station with at least one available slot.

Our goal is to predict where to rent and where to return.


Citibike data

Where are the data sets?

Citibike is good about opening its datasets.

They publish historical datasets about trips.

But that's not what we are looking for now.


Citibike data

Where can we find data on each station's bikes and slots?

We can visit http://citibikenyc.com/stations/json to see the current data.

With historical data, we want to make predictions and guide people to a better choice.


Historical data

We want to scrape data from the website every 5 minutes.

How do we do that in R?


Data scraping

We use the following code:

require(rjson)
jsonURL = "http://citibikenyc.com/stations/json"
json_data = fromJSON(file = jsonURL)

names(json_data)

## [1] "executionTime" "stationBeanList"

executionTime is the time at which we retrieved the data.

json_data$executionTime

## [1] "2014-04-24 11:11:03 AM"


Data scraping

Our data is in the form of a list. We want to change it into a data.frame.

What can we get from this data?

names(json_data$stationBeanList[[1]])

##  [1] "id"                    "stationName"
##  [3] "availableDocks"        "totalDocks"
##  [5] "latitude"              "longitude"
##  [7] "statusValue"           "statusKey"
##  [9] "availableBikes"        "stAddress1"
## [11] "stAddress2"            "city"
## [13] "postalCode"            "location"
## [15] "altitude"              "testStation"
## [17] "lastCommunicationTime" "landMark"


Data scraping

We just need id, availableDocks, availableBikes, and executionTime.

executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)
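The flattening step can be tried without network access. The sketch below is a minimal illustration that assumes a snapshot has the same shape as the feed (an executionTime plus a stationBeanList of per-station records). The helper name snapshot_to_df and the object sample_feed are hypothetical; the two station records reuse values from the sample output in this deck.

```r
# Offline stand-in for one snapshot of the feed (assumption: same structure
# as json_data; only the fields we need are included).
sample_feed <- list(
  executionTime = "2014-04-24 11:11:03 AM",
  stationBeanList = list(
    list(id = 72, availableDocks = 19, availableBikes = 18),
    list(id = 79, availableDocks = 13, availableBikes = 15)
  )
)

# Hypothetical helper: turn one snapshot into a data.frame, one row per station.
# vapply() is used instead of sapply() so a malformed record fails loudly.
snapshot_to_df <- function(feed) {
  stations <- feed$stationBeanList
  data.frame(
    time       = feed$executionTime,
    station_id = vapply(stations, function(x) as.numeric(x$id), numeric(1)),
    free       = vapply(stations, function(x) as.numeric(x$availableDocks), numeric(1)),
    bikes      = vapply(stations, function(x) as.numeric(x$availableBikes), numeric(1)),
    stringsAsFactors = FALSE
  )
}

snapshot <- snapshot_to_df(sample_feed)
print(snapshot)
```

Wrapping the conversion in a function makes it easy to call from the scheduled script once per snapshot.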


Data scraping

And we can get something like this:

head(data)

##                     time station_id free bikes
## 1 2014-04-24 11:11:03 AM         72   19    18
## 2 2014-04-24 11:11:03 AM         79   13    15
## 3 2014-04-24 11:11:03 AM         82   10    17
## 4 2014-04-24 11:11:03 AM         83   44    17
## 5 2014-04-24 11:11:03 AM        116    8    30
## 6 2014-04-24 11:11:03 AM        119   16     2


CRON

We use cron to schedule our tasks, including our web scraper.

The log service for cron is off by default. We can first open the rsyslog configuration with

sudo vi /etc/rsyslog.d/50-default.conf

and delete the '#' before '#cron.*'. Then we restart rsyslog with

sudo service rsyslog restart

Now we have successfully enabled logging for cron.

Use this to check the cron log:

sudo vi /var/log/cron.log


CRON

Then we can restart the cron service.

sudo service cron restart

If the following command returns a PID, then our cron service is running.

pgrep cron

Or you can use this alternative command:

ps aux | grep 'cron'


CRON

The simplest way to add tasks is to create a .sh script.

For example, we create a shell script named "citibike.sh".

It is preferable to use absolute paths.

/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R


CRON

The final step is to add our script to the list of cron tasks.

sudo vi /etc/crontab

And we can add the following line to the end of crontab:

*/5 * * * * root /home/vivianzhang/citibike/citibike.sh

And restart cron to apply the change.

Here, the first parameter "*/5" is the minute field and means run every 5 minutes.

The next four parameters correspond to hour, day of month, month, and day of week. In /etc/crontab they are followed by the user to run as (here root).

And finally comes the command to run.
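To see what "*/5" means in the minute field, the following R sketch expands a step expression into the minutes at which the job would fire. The helper name expand_minute_step is hypothetical and handles only the "*/n" form; it is an illustration, not a cron parser.

```r
# Hypothetical helper: expand a "*/n" minute field into the matching minutes.
expand_minute_step <- function(field) {
  stopifnot(grepl("^\\*/[0-9]+$", field))       # accept only the "*/n" form
  step <- as.integer(sub("^\\*/", "", field))   # the number after "*/"
  seq(0, 59, by = step)                         # minutes 0..59 divisible by the step
}

minutes <- expand_minute_step("*/5")
print(minutes)

# Runs per day if every hour, day, month and weekday matches ("* * * *").
runs_per_day <- length(minutes) * 24
print(runs_per_day)
```

With a 5-minute step this gives 12 runs per hour, or 288 snapshots per day.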


CRON

Other examples of cron tasks.

0 23-7/2,8 * * * echo "Have a good dream：）" >> /tmp/test.txt

Minute 0, from 23:00 to 7:00, every 2 hours; the "," means 23:00-7:00 or 8:00.

This task will append a sentence to test.txt at 23:00, 1:00, 3:00, 5:00, 7:00, and 8:00.

What if we want to run a task every 90 minutes? A single step value cannot express this, so we use two lines:

0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...


CRONTAB

On an Apple Mac, we use crontab.

1. Create a file, or open an existing file, to hold your task descriptions, such as 'crontest'.
2. Edit your tasks as stated previously.
3. Start crontab, and list running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks after you are done.


CRONTAB

# make a new crontab file
sudo touch /etc/crontest

# change the content into this
sudo vi /etc/crontest

# content of the file
# solution to cron every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt

# load the job into your cron task list
crontab /etc/crontest

# check crontab list
crontab -l

# check whether the log is written to your temp file
vi /tmp/test.txt


CRONTAB

# you should see a few lines in the file

# remove the cron job
crontab -r

# double check to see if the job is removed
crontab -l


PostgreSQL

We chose PostgreSQL as the database; it is open source and R-friendly.

We can easily connect to it with a command like this:

require(RPostgreSQL)

conn = dbConnect(dbDriver("PostgreSQL"), user = "vivianzhang", password = "123456", dbname = "station_all", host = "127.0.0.1", port = "5432")


PostgreSQL

Our server has only 1GB of memory, so we can't fetch too many records at once. 10,000 records per fetch is okay.

The following code lets us extract the first 100 records in the table:

res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)

And we can fetch the 101st through 10,000th records in the table:

data2 <- fetch(res, n = -1)
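When the whole table is needed but memory is tight, the same idea can be repeated in a loop. The sketch below is one possible way to do it, assuming the DBI package is installed and that conn is an open connection such as the RPostgreSQL one created earlier. The function name fetch_in_chunks and its handle_chunk callback are hypothetical.

```r
# Sketch: stream a query result in fixed-size chunks instead of loading it at once.
# `handle_chunk` is a user-supplied function applied to each chunk (a data.frame).
fetch_in_chunks <- function(conn, sql, chunk_size = 10000, handle_chunk) {
  res <- DBI::dbSendQuery(conn, sql)
  on.exit(DBI::dbClearResult(res), add = TRUE)   # always release the result set
  total <- 0
  while (!DBI::dbHasCompleted(res)) {
    chunk <- DBI::dbFetch(res, n = chunk_size)
    if (nrow(chunk) == 0) break                  # nothing left to read
    handle_chunk(chunk)
    total <- total + nrow(chunk)
  }
  total                                          # number of rows processed
}

# Example call (commented out because it needs a live database):
# n_rows <- fetch_in_chunks(conn, "SELECT * FROM citibike",
#                           handle_chunk = function(chunk) print(dim(chunk)))
```

Each chunk can be summarised or written to disk inside handle_chunk, so only one chunk is held in memory at a time.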


PostgreSQL

The table may be larger than the available memory.

An alternative method is to work with PostgreSQL directly. We can copy the table to a local file.

First we need to use a valid database user.

To use the default user in PostgreSQL, one can run

sudo su - postgres

Then start psql and, in the interactive interface, use the following commands to export the table.

psql
\c station_all
\copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER


Data preprocessing

It is easy to handle date-type data with the following code:

dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")

Our data is clean, and the useful information includes

· time
· available bikes
· available spots


Data preprocessing

We extract data from a single station, and name it "data_all". This is what we are going to use:

load("data_all.rda")

head(data_all)

##          station_time bikes free
## 1 2013-08-21 14:10:00     1   37
## 2 2013-08-21 14:15:00     2   36
## 3 2013-08-21 14:20:00     2   36
## 4 2013-08-21 14:25:00     2   36
## 5 2013-08-21 14:30:00     2   36
## 6 2013-08-21 14:35:00     3   35

Let us explore the first 10,000 records.

data = data_all[1:10000, ]


Time Series Model

We would like to predict the ratio of bikes to total docks at this station.

data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)

##          station_time bikes free total   ratio
## 1 2013-08-21 14:10:00     1   37    38 0.02632
## 2 2013-08-21 14:15:00     2   36    38 0.05263
## 3 2013-08-21 14:20:00     2   36    38 0.05263
## 4 2013-08-21 14:25:00     2   36    38 0.05263
## 5 2013-08-21 14:30:00     2   36    38 0.05263
## 6 2013-08-21 14:35:00     3   35    38 0.07895


Time Series Model

The time interval between our data points is 5 minutes, so one day has 288 points. Let's check whether there are any trends over the first five days:

five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")


Time Series Model

Then we turn it into a time series object with frequency = 288, the number of 5-minute points in one day.

data.ts <- ts(data$ratio, start = 1, frequency = 288)

Let's check our data.

sum(is.na(data.ts))

## [1] 1

There is an NA value in our sequence.


Time Series Model

Use the following code to fill each NA with the previous value.

na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))

## [1] FALSE
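Filling from position - 1 works here because there is a single isolated NA. If a series had several NAs in a row, the value just before a later NA would itself be missing at the time of the lookup. A more general last-observation-carried-forward fill can be sketched in base R as follows; the function name fill_forward is hypothetical, and a leading NA is left as NA because there is no earlier value to copy.

```r
# Hypothetical helper: carry the last non-NA value forward over runs of NAs.
fill_forward <- function(x) {
  ok  <- !is.na(x)
  idx <- cumsum(ok)           # how many non-NA values seen so far (0 before the first)
  c(NA, x[ok])[idx + 1]       # slot 1 is NA for positions before any observation
}

demo <- c(NA, 0.1, NA, NA, 0.3)
filled <- fill_forward(demo)
print(filled)
```

Note that the result is a plain numeric vector, so a ts object would need to be rebuilt with ts() after filling.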


Time Series Model

The "seasonal" trend is obvious. We need to make use of this information.

fit <- stl(data.ts, "periodic")

stl is a smoothing function that extracts the seasonal pattern and lets us focus on the higher-level trends.

colnames(fit$time.series)

## [1] "seasonal" "trend" "remainder"
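The three components are additive: at every time point, seasonal + trend + remainder gives back the original value. The sketch below checks this on a synthetic series, since the station data is not bundled here. The series x_ts is an assumption made up for illustration (a daily sine wave plus noise at 288 points per day).

```r
# Synthetic stand-in for one week of 5-minute ratios (assumption, not real data).
set.seed(1)
period <- 288
n_days <- 7
t_idx  <- seq_len(period * n_days)
x      <- 0.5 + 0.3 * sin(2 * pi * t_idx / period) + rnorm(length(t_idx), sd = 0.05)
x_ts   <- ts(x, start = 1, frequency = period)

# Same call pattern as above: a periodic seasonal window.
fit_demo <- stl(x_ts, s.window = "periodic")
comp     <- fit_demo$time.series

# Rebuild the series from its parts and measure the largest discrepancy.
recon   <- as.numeric(comp[, "seasonal"] + comp[, "trend"] + comp[, "remainder"])
max_err <- max(abs(recon - x))
print(max_err)
```

Because the remainder is defined as whatever the seasonal and trend parts leave unexplained, the discrepancy should be at the level of floating-point rounding.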


Time Series Model

The fitted result looks like:

head(fit$time.series)

##      seasonal  trend remainder
## [1,]  -0.2251 0.2772 -0.025791
## [2,]  -0.2133 0.2784 -0.012396
## [3,]  -0.2126 0.2795 -0.014250
## [4,]  -0.2156 0.2806 -0.012383
## [5,]  -0.2067 0.2817 -0.022373
## [6,]  -0.2089 0.2828  0.005042


Time Series Model

The black line is the original data, showing what fraction of bikes is available at each time point. The red line is the extracted seasonal effect.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)


Time Series Model

The green line is the trend:

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)


Time Series Model

We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined effect of trend and seasonal. The remaining difference is the remainder.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))


Time Series Model

Generally, a single trip with Citibike is around 30 minutes, and a normal user pays additional charges for a journey over 30 minutes.

We want to focus on predicting the next 30 minutes. Since the data updates every 5 minutes, we will forecast 6 data points.


Time Series Model

With the R package 'forecast', we can do time series prediction easily.

library(forecast)

# h is the number of periods to forecast
pred = as.numeric(forecast(fit, h = 6)$mean)


Machine Learning Model

Machine learning can also be applied to time series data.

Here we are going to use GBM for demonstration.

Before we apply gbm to our data, we need to extract some more time-related features.

In particular, we need to use previous values as predictors.


Feature extraction

traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)

##                  time       y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895


Feature extraction

Time points for which we make predictions:

h = 6
new_time <- seq(from=traindata$time[nrow(traindata)], by='5 min', length.out=h+1)[-1]
new_time

## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"


Feature extraction

Let's combine our train and test data so that further features are computed on both.

test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id

## [1] 2001 2002 2003 2004 2005 2006


Feature extraction

Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.

traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)

## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday


Feature extraction

The time of day is useful:

hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- hh + 60 * mm + 3600 * ss
head(traindata)

##                  time       y   weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday     1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday     1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday     2114
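The time_hms formula above weights minutes by 60 and seconds by 3600, which is why 14:10:00 maps to 614 (14 + 60 * 10). That value is what the outputs in this deck were computed with. If a seconds-since-midnight feature is what is wanted, the weights go the other way round; a minimal sketch, with the hypothetical helper name seconds_since_midnight, could look like this:

```r
# Hypothetical helper: seconds elapsed since midnight, in the time zone
# stored on the POSIXct object itself.
seconds_since_midnight <- function(time) {
  hh <- as.numeric(format(time, "%H"))
  mm <- as.numeric(format(time, "%M"))
  ss <- as.numeric(format(time, "%S"))
  3600 * hh + 60 * mm + ss
}

# UTC is chosen here only to keep the example independent of the local time zone.
t0 <- as.POSIXct(c("2013-08-21 14:10:00", "2013-08-21 14:15:00"),
                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
secs <- seconds_since_midnight(t0)
print(secs)
```

With this encoding, consecutive 5-minute points differ by 300, and the value grows monotonically across the day instead of wrapping every hour.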


Feature extraction

How do we bring in previous information? We need to compute a lagged time series.

A lagged time series is a "delayed" time series, as shown below:

f_lag <- function(x, lag=0) c(rep(NA, lag), x[1:(length(x)-lag)])

f_lag(1:10, 1)

## [1] NA 1 2 3 4 5 6 7 8 9

f_lag(1:10, 4)

## [1] NA NA NA NA 1 2 3 4 5 6


Feature extraction

To use the information from 12:30 when predicting 12:40, we can use a lagged time series.

for (lag in 1:12) {
  traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]

##                  time       y   weekday time_hms   lag_1   lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614      NA      NA    NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914 0.02632      NA    NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214 0.05263 0.02632    NA
##   lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA    NA    NA    NA     NA     NA     NA


Feature extraction

Don't worry about those NAs! They are inevitable in a lagged series.

traindata[1:10, 5:7]

##      lag_1   lag_2   lag_3
## 1       NA      NA      NA
## 2  0.02632      NA      NA
## 3  0.05263 0.02632      NA
## 4  0.05263 0.05263 0.02632
## 5  0.05263 0.05263 0.05263
## 6  0.05263 0.05263 0.05263
## 7  0.07895 0.05263 0.05263
## 8  0.05263 0.07895 0.05263
## 9  0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263


Feature extraction

Finally, we have our data:

test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)

##         y   weekday time_hms   lag_1   lag_2   lag_3   lag_4   lag_5 lag_6
## 1 0.02632 Wednesday      614      NA      NA      NA      NA      NA    NA
## 2 0.05263 Wednesday      914 0.02632      NA      NA      NA      NA    NA
## 3 0.05263 Wednesday     1214 0.05263 0.02632      NA      NA      NA    NA
## 4 0.05263 Wednesday     1514 0.05263 0.05263 0.02632      NA      NA    NA
## 5 0.05263 Wednesday     1814 0.05263 0.05263 0.05263 0.02632      NA    NA
## 6 0.07895 Wednesday     2114 0.05263 0.05263 0.05263 0.05263 0.02632    NA
##   lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA     NA     NA     NA
## 4    NA    NA    NA     NA     NA     NA
## 5    NA    NA    NA     NA     NA     NA
## 6    NA    NA    NA     NA     NA     NA


Machine Learning Model

Now we can use gbm to do prediction.

Wait, what is gbm?


Machine Learning Model

gbm refers to a particular supervised learning algorithm. It goes by several names.

· In the original publication, "gbm" is short for "Gradient Boosting Machine".
· In the R package, it is short for "Generalized Boosted Regression Models".
· And its wiki page names it "Gradient boosting".


Machine Learning Model

gbm is derived from a relatively simple principle.

Briefly speaking, it is "hundreds of heads are better than one".

This algorithm generates many regression trees and combines their results into the final model.
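To make the "many heads" idea concrete, here is a toy sketch of boosting for squared error in base R. It is not the gbm package's implementation: each "tree" is a one-split stump on a single predictor, every stump is fitted to the residuals left by the stumps before it, and its contribution is scaled by a shrinkage factor. All function names (fit_stump, predict_stump, boost, predict_boost) and the synthetic data are assumptions made for illustration.

```r
# Fit a one-split regression stump to residuals r: choose the split on x that
# minimises the squared error of predicting each side by its mean.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  splits <- (xs[-1] + xs[-length(xs)]) / 2      # midpoints between distinct values
  best <- list(sse = Inf)
  for (s in splits) {
    left <- x <= s
    pl <- mean(r[left]); pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) best <- list(sse = sse, split = s, left = pl, right = pr)
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Boosting loop: start from the mean, then repeatedly fit a stump to the
# current residuals and add a shrunken copy of it to the prediction.
boost <- function(x, y, n_trees = 200, shrinkage = 0.1) {
  f0 <- mean(y)
  pred <- rep(f0, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred                           # what is still unexplained
    stumps[[m]] <- fit_stump(x, resid)
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
  }
  list(f0 = f0, stumps = stumps, shrinkage = shrinkage)
}

predict_boost <- function(model, x) {
  pred <- rep(model$f0, length(x))
  for (s in model$stumps) pred <- pred + model$shrinkage * predict_stump(s, x)
  pred
}

# Synthetic data (assumption): a noisy sine curve.
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)
model_demo <- boost(x, y)

rmse_mean  <- sqrt(mean((y - mean(y))^2))                       # one "head": the mean
rmse_boost <- sqrt(mean((y - predict_boost(model_demo, x))^2))  # many stumps combined
print(c(mean_only = rmse_mean, boosted = rmse_boost))
```

No single stump can follow the curve, but the sum of many small corrections can, which is the intuition behind the "hundreds of heads" description. The comparison here is on training data only, so it says nothing about overfitting.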


Machine Learning Model

With the following code, we can fit the model:

library(gbm)

model <- gbm(formula = y ~ .,
             data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
             distribution = 'gaussian',
             n.trees = 2000,
             interaction.depth = 5,
             shrinkage = 0.01,
             cv.folds = 0,
             keep.data = F)

Here n.trees is the number of "heads" (trees) for this problem.


Machine Learning Model

In prediction, using too many trees may cause overfitting.

Therefore we need a validation-based estimate, such as cross-validation, to choose the number of trees.

gbm provides a convenient tool for this. Here we use its out-of-bag (OOB) estimate, since the model was fitted with cv.folds = 0:

best_ntree <- gbm.perf(model, method = "OOB")


Machine Learning Model

Then we can make the prediction:

best_ntree

## [1] 539

predict(model, as.data.frame(test[1,,drop=F]), n.trees=best_ntree, type='response')

## [1] 0.1287


Performance testing

How do we compare these two models? We set up a test.

Every day we get 288 data points. We want to predict the next 6 points using data from the previous week, i.e. 2016 data points.

We randomly choose 50 time points and make a prediction for the next 30 minutes from each.

Then we compare their performance with RMSE:

rmse = function(pred, real) sqrt(mean((pred - real)^2))


Performance testing

Here is the result, with one RMSE value per forecast step (5 to 30 minutes ahead):

stl_precision

## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698

gbm_precision

## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258

We can see that gbm is slightly better than the time series prediction.


Performance testing

However, our performance is not ideal.

We can use a straightforward prediction: assume the value stays constant for the next 30 minutes. How's the result?

y_precision

## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311

Why is this happening?
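The straightforward prediction and the per-step RMSE can be sketched as follows. This is an illustration of the test setup described earlier (50 random time points, one week of history, 6 steps ahead), run on a synthetic random-walk series rather than the station data. The helper name persistence_forecast and the objects series and baseline_precision are hypothetical.

```r
rmse <- function(pred, real) sqrt(mean((pred - real)^2))

# Hypothetical helper: repeat the last observed value for the next h steps.
persistence_forecast <- function(history, h = 6) rep(history[length(history)], h)

# Synthetic stand-in for the ratio series (assumption): a slow random walk
# clipped to [0, 1].
set.seed(42)
series <- pmin(pmax(0.5 + cumsum(rnorm(5000, sd = 0.01)), 0), 1)

h           <- 6
window_size <- 2016                                  # one week of 5-minute points
origins     <- sample(window_size:(length(series) - h), 50)

# One row per forecast origin, one column per step ahead.
pred <- t(sapply(origins, function(i) {
  persistence_forecast(series[(i - window_size + 1):i], h)
}))
real <- t(sapply(origins, function(i) series[(i + 1):(i + h)]))

# RMSE for each step ahead (5, 10, ..., 30 minutes).
baseline_precision <- sapply(seq_len(h), function(k) rmse(pred[, k], real[, k]))
print(baseline_precision)
```

Any other model can be plugged in by replacing persistence_forecast with a function that takes the same history window and returns h predictions, which keeps the comparison between methods on identical time points.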


Performance testing

This picture gives some hints.

plot(diff(data.ts), type = "l")


Performance testing

We can see that the value tends to stay the same over the next 5 minutes, or even longer.

sum(diff(data.ts) == 0)

## [1] 6622

There are many 5-minute intervals in which nobody comes to this station: 6,622 of the 9,999 consecutive differences are zero. Therefore the most straightforward prediction outperformed those two advanced methods.


More to do

There are many things to do in the future:

· Apply other algorithms to this problem, like neural networks.
· Use information from nearby stations: empty nearby stations will lead people to come to this one.
· Combine with weather records: nobody rides on a rainy day!
· Path finding: design the whole trip for people.

The sky is the limit!


Our Packages

We are developing an R package for citibike, including

· Data scraping
· Database interaction and retrieval
· Time series prediction
· GBM prediction

There was an app written in Ruby on Rails offering our prediction service. Our Heroku app went to sleep since the service didn't get much traffic, but one of our meetup members spent some time to make it live today and emailed me the link!
