Citibike data and prediction
Which station should I choose?

Data Scientist Team at SupStat Inc (Vivian Zhang, Yibo Chen, Kai Xiao, Tong He)
Check out our blog and newsletters at http://www.supstat.com and http://nycdatascience.com

Citibike data and prediction http://nycdatascience.com/slides/NYCOpenDataMeetup/citibike_v2/citibike.html#1
1 of 58 6/12/14, 5:37 PM

NYC Open Data Project II -- predict where to get and return my Citibike


DESCRIPTION

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, GBM, Machine learning, Time Series, Citibike usage prediction, advanced R


Overview

1. Overview
2. Citibike Data
3. Scraping
4. Data Description
5. Modeling


Citibike

Citibike runs a public bike-sharing service.

There are many bike stations in NYC.

People want to pick up a bike from a station with at least one available bike.

And when they get to their destination, they want to return the bike to a station with at least one available slot.

Our goal is to predict where to rent and where to return a bike.


Citibike data

Where are the data sets?

Citibike is great about opening its datasets.

They provide historical datasets about trips.

But that's not what we are looking for now.


Citibike data

Where can we find data on each station's bikes and slots?

We can visit http://citibikenyc.com/stations/json to see the current data.

With historical data, we want to provide predictions and guide people to a better choice.


Historical data

We want to scrape data from the website every 5 minutes.

How do we do that in R?


Data scraping

We use the following code:

require(rjson)
jsonURL = "http://citibikenyc.com/stations/json"
json_data = fromJSON(file = jsonURL)

names(json_data)

## [1] "executionTime"   "stationBeanList"

This is the time we get the data.

json_data$executionTime

## [1] "2014-04-24 11:11:03 AM"


Data scraping

Our data is in the form of a list. We want to convert it into a data.frame.

What can we get from this data?

names(json_data$stationBeanList[[1]])

##  [1] "id"                    "stationName"
##  [3] "availableDocks"        "totalDocks"
##  [5] "latitude"              "longitude"
##  [7] "statusValue"           "statusKey"
##  [9] "availableBikes"        "stAddress1"
## [11] "stAddress2"            "city"
## [13] "postalCode"            "location"
## [15] "altitude"              "testStation"
## [17] "lastCommunicationTime" "landMark"


Data scraping

We just need id, availableDocks, availableBikes, and executionTime.

executionTime = json_data$executionTime
ids = sapply(json_data$stationBeanList, function(x) x$id)
free = sapply(json_data$stationBeanList, function(x) x$availableDocks)
bikes = sapply(json_data$stationBeanList, function(x) x$availableBikes)
data = data.frame(time = executionTime, station_id = ids, free = free, bikes = bikes)


Data scraping

And we can get something like this:

head(data)

##                     time station_id free bikes
## 1 2014-04-24 11:11:03 AM         72   19    18
## 2 2014-04-24 11:11:03 AM         79   13    15
## 3 2014-04-24 11:11:03 AM         82   10    17
## 4 2014-04-24 11:11:03 AM         83   44    17
## 5 2014-04-24 11:11:03 AM        116    8    30
## 6 2014-04-24 11:11:03 AM        119   16     2
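The extraction steps can be wrapped in one small helper so that every 5-minute snapshot is flattened the same way. This is a minimal sketch, not the authors' package code: `stations_to_df` is a hypothetical name, and `mock_feed` is made-up data shaped like the parsed feed (only the fields used here), so the example runs without network access.

```r
# Hypothetical helper: flatten one parsed feed snapshot into a data.frame.
stations_to_df <- function(json_data) {
  stations <- json_data$stationBeanList
  data.frame(
    time       = json_data$executionTime,
    station_id = vapply(stations, function(x) as.integer(x$id), integer(1)),
    free       = vapply(stations, function(x) as.integer(x$availableDocks), integer(1)),
    bikes      = vapply(stations, function(x) as.integer(x$availableBikes), integer(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up snapshot with the same shape as the parsed JSON (not real data).
mock_feed <- list(
  executionTime = "2014-04-24 11:11:03 AM",
  stationBeanList = list(
    list(id = 72, availableDocks = 19, availableBikes = 18),
    list(id = 79, availableDocks = 13, availableBikes = 15)
  )
)

snapshot <- stations_to_df(mock_feed)
print(snapshot)
```

Using vapply instead of sapply is a design choice: if a station entry lacks one of the fields, vapply stops with an error instead of quietly returning a list.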


CRON

We use cron to schedule our tasks, including our web scraper.

The log service for cron is off by default. We can first open

sudo vi /etc/rsyslog.d/50-default.conf

and delete the '#' before '#cron.*'. Then we restart rsyslog with

sudo service rsyslog restart

And now we have successfully enabled the log management system of cron.

Use this to check the log of cron:

sudo vi /var/log/cron.log


CRON

Then we can restart the cron service.

sudo service cron restart

If the following command returns a pid, then our cron service is on.

pgrep cron

Or you can use this alternative command:

ps aux | grep 'cron'


CRON

The simplest way to add tasks is to create a .sh script.

For example, we create a shell script named "citibike.sh".

It is preferable to use absolute paths.

/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/citibike.R
/usr/R/R-3.0/bin/Rscript /home/vivianzhang/citibike/writeDB.R


CRON

The final step is to add our script to the list of cron tasks.

sudo vi /etc/crontab

And we can add the following line to the end of crontab:

*/5 * * * * root /home/vivianzhang/citibike/citibike.sh

And restart cron so that the change takes effect.

Here, the first parameter "*/5" means run it every 5 minutes.

The next four parameters correspond to hour, day of month, month, and day of week.

In /etc/crontab the next field is the user to run as (root here), and finally comes the command to run.


CRON

Other examples of cron tasks.

0 23-7/2,8 * * * echo "Have a good dream:)" >> /tmp/test.txt

0th minute, 23:00 to 7:00, every 2 hours; the "," means 23:00-7:00 or 8:00.

This task will print a sentence into test.txt at 23:00, 1:00, 3:00, 5:00, 7:00, and 8:00.

What if we want cron to run a task every 90 minutes? We can combine two lines, one on the hour and one on the half hour:

0 0,3,6,9,12,15,18,21 ...
30 1,4,7,10,13,16,19,22 ...


CRONTAB

On an Apple Mac, we use crontab.

1. Create a file, or open an existing file, to hold your task description, such as 'crontest'.
2. Edit your tasks as stated previously.
3. Start crontab, and list running tasks.
4. Check whether it runs correctly.
5. You can remove all the cron tasks after you are done.


CRONTAB

# make a new crontab file
sudo touch /etc/crontest

# change the content into this
sudo vi /etc/crontest

# content of the file
# solution to cron every minute
*/1 * * * * echo "test cron" >> /tmp/test.txt

# run the job into your cron task list
crontab /etc/crontest

# check crontab list
crontab -l

# check whether the log is written to your temp file
vi /tmp/test.txt


CRONTAB

# you should see a few words in the file

# remove the cron job
crontab -r

# double check to see if the job is removed
crontab -l


PostgreSQL

We chose PostgreSQL as the database; it is open source and R-friendly.

We can easily connect to it with a command like this:

require(RPostgreSQL)

conn = dbConnect(dbDriver("PostgreSQL"),
                 user = "vivianzhang",
                 password = "123456",
                 dbname = "station_all",
                 host = "127.0.0.1",
                 port = "5432")


PostgreSQL

Our server has only 1 GB of memory, so we can't fetch too many records at once. 10,000 records per fetch is okay.

The following code lets us extract the first 100 records in the table:

res <- dbSendQuery(conn, statement = "SELECT * FROM citibike limit 10000")
data1 <- fetch(res, n = 100)

And we can fetch the 101st through the 10,000th record in the table:

data2 <- fetch(res, n = -1)


PostgreSQL

The size of the table may be larger than the memory.

An alternative method is to work with PostgreSQL directly. We can copy the table to a local file.

First we need to use a valid database user.

To use the default user in PostgreSQL, one can run

sudo su - postgres

Then in the interactive interface, use the following commands to export the table.

psql
\c station_all
\copy (SELECT * FROM citibike) TO '/tmp/data.csv' WITH CSV HEADER


Data preprocessing

It is easy to handle date-type data with the following code:

dat$station_time = as.POSIXct(dat$station_time, format = "%Y-%m-%d %H:%M:%S")

Our data is clean, and useful information includes

· time
· available bikes
· available spots.


Data preprocessing

We extract data from a single station, and name it "data_all". This is what we are going to use:

load("data_all.rda")

head(data_all)

##          station_time bikes free
## 1 2013-08-21 14:10:00     1   37
## 2 2013-08-21 14:15:00     2   36
## 3 2013-08-21 14:20:00     2   36
## 4 2013-08-21 14:25:00     2   36
## 5 2013-08-21 14:30:00     2   36
## 6 2013-08-21 14:35:00     3   35

Let us explore the first 10,000 records.

data = data_all[1:10000, ]


Time Series Model

We would like to predict the ratio of bikes in this station.

data$total <- data$bikes + data$free
data$ratio <- data$bikes/data$total
head(data)

##          station_time bikes free total   ratio
## 1 2013-08-21 14:10:00     1   37    38 0.02632
## 2 2013-08-21 14:15:00     2   36    38 0.05263
## 3 2013-08-21 14:20:00     2   36    38 0.05263
## 4 2013-08-21 14:25:00     2   36    38 0.05263
## 5 2013-08-21 14:30:00     2   36    38 0.05263
## 6 2013-08-21 14:35:00     3   35    38 0.07895


Time Series Model

The time interval between our data points is 5 minutes, so one day has 24 * 60 / 5 = 288 points. Let's check whether there are any trends over the first five days:

five_day_ind = 1:(288 * 5)
plot(data$ratio[five_day_ind], type = "l")


Time Series Model

Then we turn it into a time series object with frequency = 288 (one day of 5-minute points).

data.ts <- ts(data$ratio, start = 1, frequency = 288)

Let's check our data:

sum(is.na(data.ts))

## [1] 1

There is an NA value in our sequence.


Time Series Model

Use the following code to fill it with the previous value.

na.position <- which(is.na(data.ts))
data.ts[na.position] <- data.ts[na.position - 1]
any(is.na(data.ts))

## [1] FALSE
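The one-line fill works here because there is a single isolated NA. It copies all replacement values in one vectorized step, so in a run of two or more consecutive NAs the later ones would be filled from positions that were themselves NA. A minimal sketch of a fill that also handles runs is below; `fill_forward` is a hypothetical helper name, and the input values are made up.

```r
# Hypothetical helper: replace each NA with the most recent non-NA value.
# A leading NA (nothing before it) is left as NA.
fill_forward <- function(x) {
  for (i in seq_along(x)) {
    if (is.na(x[i]) && i > 1) x[i] <- x[i - 1]
  }
  x
}

filled <- fill_forward(c(0.2, NA, NA, 0.5, NA))
print(filled)
# 0.2 0.2 0.2 0.5 0.5
```

Because the loop walks left to right, each filled value is available to the next position, which is what carries a value across a run of NAs.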


Time Series Model

The "seasonal" trend is obvious. We need to make use of this information.

fit <- stl(data.ts, "periodic")

stl is a smoothing function: it extracts the seasonal pattern and lets us focus on the higher-level trends.

colnames(fit$time.series)

## [1] "seasonal"  "trend"     "remainder"


Time Series Model

The fitted result looks like:

head(fit$time.series)

##      seasonal  trend remainder
## [1,]  -0.2251 0.2772 -0.025791
## [2,]  -0.2133 0.2784 -0.012396
## [3,]  -0.2126 0.2795 -0.014250
## [4,]  -0.2156 0.2806 -0.012383
## [5,]  -0.2067 0.2817 -0.022373
## [6,]  -0.2089 0.2828  0.005042


Time Series Model

The black line is the original data, showing what fraction of bikes is available at each time point. The red line is the extracted seasonal effect.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
leg.txt = c("origin", "seasonal")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:2)


Time Series Model

The green line is the trend:

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1], col = 2)
lines(fit$time.series[five_day_ind, 2], col = 3)
leg.txt = c("origin", "seasonal", "trends")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = 1:3)


Time Series Model

We get an approximation of our data by adding the trend and seasonal effects. The blue line shows the combined effect of trend and seasonal. The remaining difference is the remainder.

plot(data$ratio[five_day_ind], type = "l", ylim = c(-0.5, 1), xlim = c(0, 1500))
lines(fit$time.series[five_day_ind, 1] + fit$time.series[five_day_ind, 2], col = 4)
leg.txt = c("origin", "approx")
legend(1200, 1, leg.txt, cex = 1, lty = 1, col = c(1, 4))
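The additive structure of the decomposition (seasonal + trend + remainder = original) can be checked directly. The sketch below uses a made-up series, not the Citibike data, so that it runs on its own; the names `x.ts`, `fit_demo`, and `parts` are illustrative only.

```r
set.seed(1)
period <- 288                      # 5-minute points per day
n_days <- 7
t <- seq_len(period * n_days)

# Made-up series: slow upward trend + daily cycle + noise (not Citibike data)
x <- 0.3 + 0.0001 * t + 0.2 * sin(2 * pi * t / period) + rnorm(length(t), sd = 0.02)
x.ts <- ts(x, start = 1, frequency = period)

fit_demo <- stl(x.ts, "periodic")
parts <- fit_demo$time.series

# The three components add back up to the original series
reconstructed <- rowSums(parts)
print(max(abs(reconstructed - as.numeric(x.ts))))

# With "periodic", the seasonal component repeats from one day to the next
day1 <- parts[1:period, "seasonal"]
day2 <- parts[(period + 1):(2 * period), "seasonal"]
print(max(abs(day1 - day2)))
```

Both printed differences should be zero up to floating-point error, which is why seasonal plus trend alone is only an approximation: everything else lands in the remainder.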


Time Series Model

Generally, a single Citibike trip is around 30 minutes, and a regular user pays additional charges for a trip longer than 30 minutes.

We therefore focus on predicting the next 30 minutes. Since the data updates every 5 minutes, that means forecasting 6 data points.


Time Series Model

With the R package 'forecast', we can do time series prediction easily.

library(forecast)

# h is number of periods for forecasting
pred = as.numeric(forecast(fit, h = 6)$mean)


Machine Learning Model

Machine learning can also be applied to time series data.

Here we are going to use GBM for demonstration.

Before we apply gbm to our data, we need to extract some more time-related features.

In particular, we need to use previous values as predictors.


Feature extraction

traindata = data[1:2000, ]
traindata = traindata[c("station_time", "ratio")]
names(traindata) <- c("time", "y")
head(traindata)

##                  time       y
## 1 2013-08-21 14:10:00 0.02632
## 2 2013-08-21 14:15:00 0.05263
## 3 2013-08-21 14:20:00 0.05263
## 4 2013-08-21 14:25:00 0.05263
## 5 2013-08-21 14:30:00 0.05263
## 6 2013-08-21 14:35:00 0.07895


Feature extraction

Time points to make prediction:

h = 6
new_time <- seq(from=traindata$time[nrow(traindata)], by='5 min', length.out=h+1)[-1]
new_time

## [1] "2013-08-28 12:50:00 EST" "2013-08-28 12:55:00 EST"
## [3] "2013-08-28 13:00:00 EST" "2013-08-28 13:05:00 EST"
## [5] "2013-08-28 13:10:00 EST" "2013-08-28 13:15:00 EST"


Feature extraction

Let's combine our train and test data so that we can build further features.

test_id <- seq(nrow(traindata) + 1, by = 1, length.out = h)
traindata <- rbind(traindata, data.frame(time = new_time, y = NA))
test_id

## [1] 2001 2002 2003 2004 2005 2006


Feature extraction

Of course, this service may be more popular on weekends than on weekdays, so we need a variable to mark the day of the week.

traindata$weekday <- as.factor(weekdays(traindata$time))
head(traindata$weekday)

## [1] Wednesday Wednesday Wednesday Wednesday Wednesday Wednesday
## Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday


Feature extraction

The time stamp is useful:

hh <- as.numeric(strftime(traindata$time, format = "%H", tz = "EST"))
mm <- as.numeric(strftime(traindata$time, format = "%M", tz = "EST"))
ss <- as.numeric(strftime(traindata$time, format = "%S", tz = "EST"))
traindata$time_hms <- hh + 60 * mm + 3600 * ss
head(traindata)

##                  time       y   weekday time_hms
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214
## 4 2013-08-21 14:25:00 0.05263 Wednesday     1514
## 5 2013-08-21 14:30:00 0.05263 Wednesday     1814
## 6 2013-08-21 14:35:00 0.07895 Wednesday     2114

Note that this formula weights hours by 1 and seconds by 3600, so 14:10:00 becomes 14 + 60 * 10 = 614 rather than the number of seconds since midnight. The value is still distinct for every 5-minute slot of the day, but it does not increase steadily through the day.
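If the intent of time_hms is "seconds since midnight" (an assumption; the slide does not say so), the weights would be 3600 for hours, 60 for minutes, and 1 for seconds. A minimal sketch follows; `seconds_since_midnight` is a hypothetical helper, and using it in place of the slide's formula would change the time_hms values printed on the slides.

```r
# Hypothetical helper: seconds elapsed since midnight for each timestamp.
seconds_since_midnight <- function(time, tz = "EST") {
  lt <- as.POSIXlt(time, tz = tz)
  3600 * lt$hour + 60 * lt$min + lt$sec
}

t0 <- as.POSIXct("2013-08-21 14:10:00", tz = "EST")
print(seconds_since_midnight(t0))
# 14 * 3600 + 10 * 60 + 0 = 51000
```

A feature that grows monotonically through the day lets a tree split such as "time_hms < 32400" mean "before 9:00", which is harder to express with the mixed weighting.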


Feature extraction

How do we bring in previous information? We need to compute a lagged time series.

A lagged time series is a "delayed" time series, as shown below:

f_lag <- function(x, lag=0) c(rep(NA, lag), x[1:(length(x)-lag)])

f_lag(1:10, 1)

## [1] NA 1 2 3 4 5 6 7 8 9

f_lag(1:10, 4)

## [1] NA NA NA NA 1 2 3 4 5 6


Feature extraction

To use the information from 12:30 when predicting 12:40, we can use lagged time series.

for (lag in 1:12) {
  traindata[[paste("lag_", lag, sep = "")]] <- f_lag(traindata$y, lag)
}
traindata[1:3, ]

##                  time       y   weekday time_hms   lag_1   lag_2 lag_3
## 1 2013-08-21 14:10:00 0.02632 Wednesday      614      NA      NA    NA
## 2 2013-08-21 14:15:00 0.05263 Wednesday      914 0.02632      NA    NA
## 3 2013-08-21 14:20:00 0.05263 Wednesday     1214 0.05263 0.02632    NA
##   lag_4 lag_5 lag_6 lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA    NA    NA    NA     NA     NA     NA


Feature extraction

Don't worry about those NAs! They are inevitable in a lagged series.

traindata[1:10, 5:7]

##      lag_1   lag_2   lag_3
## 1       NA      NA      NA
## 2  0.02632      NA      NA
## 3  0.05263 0.02632      NA
## 4  0.05263 0.05263 0.02632
## 5  0.05263 0.05263 0.05263
## 6  0.05263 0.05263 0.05263
## 7  0.07895 0.05263 0.05263
## 8  0.05263 0.07895 0.05263
## 9  0.05263 0.05263 0.07895
## 10 0.05263 0.05263 0.05263
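The lag construction can be sketched on a tiny made-up vector to see exactly which rows end up with NAs. `lag_vec` and `make_lag_frame` are hypothetical helpers; `lag_vec` is a variant of f_lag that also returns the right length when the lag is as long as the series or longer.

```r
# Hypothetical variant of f_lag that also behaves when lag >= length(x)
lag_vec <- function(x, lag = 0) {
  n <- length(x)
  c(rep(NA, min(lag, n)), head(x, max(n - lag, 0)))
}

# Build a data.frame with y and its first n_lags lagged copies
make_lag_frame <- function(y, n_lags) {
  out <- data.frame(y = y)
  for (lag in seq_len(n_lags)) {
    out[[paste("lag_", lag, sep = "")]] <- lag_vec(y, lag)
  }
  out
}

y <- c(0.10, 0.20, 0.30, 0.40, 0.50)   # made-up values
lagged <- make_lag_frame(y, 2)
print(lagged)

# Rows usable by a learner that cannot handle missing predictors
complete_rows <- lagged[complete.cases(lagged), ]
print(complete_rows)
```

With n_lags lags, the first n_lags rows always contain at least one NA. gbm accepts missing predictor values, which is why the slides can keep those rows; for a learner that does not, complete.cases drops them.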


Feature extraction

Finally, we have our data:

test <- traindata[test_id, -1]
train <- traindata[-test_id, -1]
train <- train[!is.na(train$y), ]
head(train)

##         y   weekday time_hms   lag_1   lag_2   lag_3   lag_4   lag_5 lag_6
## 1 0.02632 Wednesday      614      NA      NA      NA      NA      NA    NA
## 2 0.05263 Wednesday      914 0.02632      NA      NA      NA      NA    NA
## 3 0.05263 Wednesday     1214 0.05263 0.02632      NA      NA      NA    NA
## 4 0.05263 Wednesday     1514 0.05263 0.05263 0.02632      NA      NA    NA
## 5 0.05263 Wednesday     1814 0.05263 0.05263 0.05263 0.02632      NA    NA
## 6 0.07895 Wednesday     2114 0.05263 0.05263 0.05263 0.05263 0.02632    NA
##   lag_7 lag_8 lag_9 lag_10 lag_11 lag_12
## 1    NA    NA    NA     NA     NA     NA
## 2    NA    NA    NA     NA     NA     NA
## 3    NA    NA    NA     NA     NA     NA
## 4    NA    NA    NA     NA     NA     NA
## 5    NA    NA    NA     NA     NA     NA
## 6    NA    NA    NA     NA     NA     NA


Machine Learning Model

Now we can use gbm to do prediction.

Wait, what is gbm?


Machine Learning Model

gbm refers to a particular supervised learning algorithm. It goes by a lot of names.

· In the original publication, "gbm" is short for "Gradient Boosting Machine".
· In the R package, it is short for "Generalized Boosted Regression Models".
· And its wiki page names it "Gradient boosting".


Machine Learning Model

gbm is derived from a relatively simple principle.

Briefly speaking, it is "hundreds of heads are better than one".

The algorithm generates many regression trees, each one fit to the errors left by the trees before it, and combines their results into the final model.
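The "many heads" idea can be sketched in base R with the simplest possible tree, a one-split stump, under squared-error loss. This is a toy illustration under stated assumptions, not the gbm package's implementation (which supports deeper trees, other loss functions, and subsampling); `fit_stump`, `boost`, and `predict_boost` are hypothetical names, and the data is made up.

```r
# Fit one regression stump (a single split) to residuals r on a numeric feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  if (length(xs) < 2) return(list(split = Inf, left = mean(r), right = mean(r)))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbouring values
  best <- NULL
  best_sse <- Inf
  for (s in cuts) {
    left <- r[x <= s]
    right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Boosting: start from the mean, then repeatedly fit a stump to the current residuals.
boost <- function(x, y, n_trees = 200, shrinkage = 0.1) {
  init <- mean(y)
  pred <- rep(init, length(y))
  stumps <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred)            # residuals under squared-error loss
    pred <- pred + shrinkage * predict_stump(stump, x)
    stumps[[m]] <- stump
  }
  list(init = init, stumps = stumps, shrinkage = shrinkage)
}

predict_boost <- function(model, x, n_trees = length(model$stumps)) {
  pred <- rep(model$init, length(x))
  for (m in seq_len(n_trees)) {
    pred <- pred + model$shrinkage * predict_stump(model$stumps[[m]], x)
  }
  pred
}

set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)    # made-up data

toy_model <- boost(x, y)
rmse <- function(pred, real) sqrt(mean((pred - real)^2))
rmse_1 <- rmse(predict_boost(toy_model, x, 1), y)
rmse_all <- rmse(predict_boost(toy_model, x), y)
print(c(one_tree = rmse_1, all_trees = rmse_all))
```

The training error with all 200 stumps is lower than with one, which is the sense in which many weak "heads" combine into a stronger model. The shrinkage factor plays the same role as the shrinkage argument of gbm: each tree contributes only a small step.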


Machine Learning Model

With the following code, we can fit the model:

library(gbm)

model <- gbm(formula = y ~ .,
             data = train[c('y', 'weekday', 'time_hms', paste('lag_', 1:12, sep = ''))],
             distribution = 'gaussian',
             n.trees = 2000,
             interaction.depth = 5,
             shrinkage = 0.01,
             cv.folds = 0,
             keep.data = FALSE)

Here n.trees is the number of "heads" (trees) for this problem.


Machine Learning Model

In prediction, using too many trees may cause overfitting.

Therefore we need an estimate of out-of-sample error to choose the number of trees.

gbm provides a convenient tool, gbm.perf. Since the model above was fit with cv.folds = 0, we use the out-of-bag estimate rather than cross-validation; OOB means "Out Of Bag":

best_ntree <- gbm.perf(model, method = "OOB")


Machine Learning Model

best_ntree

## [1] 539

Then we can make the prediction:

predict(model, as.data.frame(test[1, , drop = FALSE]), n.trees = best_ntree, type = 'response')

## [1] 0.1287


Performance testing

How do we compare these two models? We set up a test.

Every day we get 288 data points. We want to predict the next 6 points using data from the previous week, i.e. 288 * 7 = 2016 data points.

We randomly choose 50 time points and make a prediction for the next 30 minutes at each.

Then we compare their performance with RMSE:

rmse = function(pred, real) sqrt(mean((pred - real)^2))


Performance testing

Here is the result:

stl_precision

## [1] 0.03496 0.04656 0.05912 0.07045 0.07626 0.08698

gbm_precision

## [1] 0.02011 0.03447 0.04900 0.06536 0.07186 0.08258

We can see that gbm is slightly better than the time series prediction.


Performance testing

However, our performance is not ideal.

We can try a straightforward prediction: assume the value stays unchanged for the next 30 minutes. How does that do?

y_precision

## [1] 0.01903 0.03021 0.02599 0.02401 0.02541 0.03311

Its RMSE is lower than that of both models at all six horizons. Why is this happening?
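The "stay unchanged" baseline and the per-horizon RMSE can be sketched as follows. The series here is made up (a mostly flat random walk, not Citibike data), and `naive_forecast` and `evaluate_by_horizon` are hypothetical helper names; the window of 2016 points and the 50 random origins follow the test setup described in the slides.

```r
rmse <- function(pred, real) sqrt(mean((pred - real)^2))

# Persistence ("no change") baseline: repeat the last observed value h times.
naive_forecast <- function(history, h = 6) rep(history[length(history)], h)

# Evaluate a forecasting function at several origins; returns one RMSE per horizon step.
evaluate_by_horizon <- function(series, origins, forecaster, h = 6, window = 2016) {
  errors <- sapply(origins, function(t) {
    history <- series[(t - window + 1):t]
    forecaster(history, h) - series[(t + 1):(t + h)]
  })
  # errors is an h x length(origins) matrix; summarise each row (one row per horizon)
  apply(errors, 1, function(e) sqrt(mean(e^2)))
}

set.seed(42)
n <- 288 * 10
# Made-up stand-in series: mostly flat, with occasional small jumps
steps <- sample(c(0, -0.05, 0.05), n, replace = TRUE, prob = c(0.9, 0.05, 0.05))
series <- pmin(pmax(0.5 + cumsum(steps), 0), 1)

origins <- sample(2016:(n - 6), 50)
naive_rmse <- evaluate_by_horizon(series, origins, naive_forecast)
print(round(naive_rmse, 5))
```

Any other forecaster with the same (history, h) signature can be passed to evaluate_by_horizon, so all methods are scored on the same origins and horizons.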


Performance testing

This plot gives some hints.

plot(diff(data.ts), type = "l")


Performance testing

We can see that the value tends to stay the same over the next 5 minutes, or even longer.

sum(diff(data.ts) == 0)

## [1] 6622

Of the 9999 differences between consecutive points, 6622 are zero: there are many 5-minute intervals in which nobody comes to this station. Therefore the most straightforward prediction outperformed the two more advanced methods.
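Counting zero differences shows how often the value is unchanged; run lengths show how long it stays unchanged. A small sketch on a made-up vector (not the Citibike series), using base R's rle:

```r
# Made-up ratio series: flat stretches with a few changes
x <- c(0.10, 0.10, 0.10, 0.15, 0.15, 0.15, 0.15, 0.10, 0.10, 0.20)

unchanged <- diff(x) == 0
share_unchanged <- mean(unchanged)                # fraction of steps with no change

runs <- rle(unchanged)
longest_flat <- max(runs$lengths[runs$values])    # longest run of unchanged steps

print(share_unchanged)   # 6 of 9 steps are unchanged
print(longest_flat)      # 3
```

On 5-minute data, a run of k unchanged steps means the station count did not move for k * 5 minutes, which is the situation in which a "no change" forecast is hard to beat.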


More to do

There are many things to do in the future:

· Apply other algorithms to this problem, like neural networks.
· Use information from nearby stations: empty nearby stations will lead people to come to this one.
· Combine with weather records: nobody rides on a rainy day!
· Path finding: design the whole trip for people.

The sky is the limit!


Our Packages

We are developing an R package for citibike, including

· Data scraping
· Database interaction and retrieval
· Time Series prediction
· GBM prediction

There was an app written in Ruby on Rails offering our prediction service. Our Heroku app went to sleep since the service didn't get much traffic, but one of our meetup members spent some time to make it live today and emailed me the link!
