Predict NBA Game Pace ECE 539 final project Fa Wanghomepages.cae.wisc.edu/~ece539/fall13/project/WangFa_rpt.pdf · Predict NBA Game Pace ECE 539 final project Fa Wang Abstract Long

Predict NBA Game Pace

ECE 539 final project

Fa Wang

Abstract

It was long believed that putting all the best players on the court would guarantee winning
a game or even a championship title. More recent historical data show that a winning team
does not necessarily have the highest score on offense or the lowest score allowed on
defense. To evaluate the quality of a team, an index called the pace value has been
introduced. The goal of our project is to predict the pace value of an NBA game by analyzing
a large amount of historical data available on the Internet. To do that, we first extract key
features from play-by-play data and game stats data using the map-reduce pattern. Because
the volume of historical play-by-play data is large, we use HBase [1] to store and process it.
Unlike traditional standalone machine learning designs, this paper focuses on applying the
map-reduce pattern together with a machine learning method to solve a big data problem. By
examining the mean-square error (MSE) of the final result, we can judge how far our
predictions are from the real values and thereby evaluate the accuracy of our model.

I. INTRODUCTION

Predicting NBA game scores has been a hot topic for a long time. In recent years, people
have turned to possession-based score models, which have been shown to give more accurate
statistical results. Later, a new index was introduced to evaluate a team's performance in a
given game: the pace value [2]. According to basketball-reference, the pace factor is an
estimate of the number of possessions a team uses per game. Knowing an opponent's pace
value helps a coach prepare for a game, because the pace factor is closely related to the
score of a game. The score is given by Score = ASPM * Pace, where ASPM [3] denotes points
per possession and Pace is the measure of possessions. An accurate prediction of the pace
value therefore leads to a more precise prediction of the final score. In this paper, we merge
play-by-play (PBP) data with each game's stats data to predict the pace value of future
games. Predicting a game's pace is challenging. Different defensive and offensive strategies
directly influence the pace of a game, especially during clutch time, when a game comes
down to a few possessions; different teams may use the clock differently to defeat their
opponents. An older team may also be more likely to spend more time on each offensive
possession than a younger team. Pace may even depend on players' health and attitude. As a
result, the pace factor is nonlinear and difficult to predict from a small set of data.
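As a small worked example of the relation Score = ASPM * Pace, the sketch below shows how an error in the predicted pace carries over to the predicted score. The inputs (1.05 points per possession, a pace of 95, a pace error of 4) are illustrative assumptions, not values taken from the project data.

```python
def predicted_score(points_per_possession: float, pace: float) -> float:
    """Score = ASPM * Pace, reading ASPM as points per possession (as in the text)."""
    return points_per_possession * pace


# Hypothetical inputs, chosen only for illustration.
points_per_possession = 1.05
score = predicted_score(points_per_possession, 95.0)

# Because the relation is linear in pace, a pace error of 4 possessions
# shifts the predicted score by ASPM * 4 points.
score_shift = predicted_score(points_per_possession, 4.0)

print(round(score, 2), round(score_shift, 2))
```

Since the relation is linear, the score error scales directly with the pace error, which is why the rest of the paper concentrates on predicting pace.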

II. MOTIVATION

Precisely predicting the result of a future NBA game is a tough but appealing task for NBA
fans and non-fans alike. Recent research shows that a precise game pace value is the key
factor, so our team decided to tackle this sub-topic as a trial.

With a precise pace evaluation for a team, we can apply it to all of the team's players so
that each player is tagged with a pace capability. A faster team may prefer faster players,
so our assessment can serve as a tool for teams selecting players. NBA game betting is
another appealing application of our research.

III. RELATED WORK

There is a considerable body of research on the topic of pace value. Paper [4] examines the
optimality of the shooting decisions of National Basketball Association (NBA) players using
a rich dataset of 1.4 million offensive possessions. The decision to shoot is a complex
problem that involves weighing the continuation value of the possession against the outside
option of a teammate shooting. To apply their abstract model to the data, the authors make
assumptions about the distribution of potential shots. In line with dynamic efficiency, they
find that the “cut threshold” declines monotonically with the time remaining on the shot
clock, at approximately the correct rate. Most line-ups show strong adherence to allocative
efficiency. The authors link departures from optimality to line-up experience, player salary
and overall ability.

Paper [5] discusses how teams trade off the value of controlling the length of a game. Using
the 859 games played by the 30 NBA teams during the 2008-2009 regular season, the author
finds that the leading team tries to decrease the pace while the trailing team tries to
increase it as the game nears its end. The author also concludes that the leading team
scores fewer points in the last few minutes of the game than in the 3rd quarter. Whether a
player's decision increases or decreases the pace, given the opponent's strategy, still
needs more investigation.

Bigtable

The Bigtable [6] data structure is a key component of this project and is in the mainstream
of today's big data analytics. The paper “Bigtable: A Distributed Storage System for
Structured Data” gives a thorough overview of the design and usage of the data structure,
covering index keys, data compression, implementation and a Google Earth example. Bigtable
is a big help for our project because its data structure is sparse and distributed across the
cluster. Most operations in our project are mapping and reducing, and because Bigtable is
indexed by row key, column key and timestamp, it helps us retrieve data across multiple
machines efficiently.
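To make the indexing scheme concrete, here is a minimal in-memory sketch of a sparse map keyed by row key, column key and timestamp. It is a toy stand-in written for illustration only: the class `SparseTable`, its methods and the column name `pbp:html` are assumptions, and none of this is the Bigtable or HBase API.

```python
from collections import defaultdict


class SparseTable:
    """Toy map (row key, column key, timestamp) -> value. Not the HBase API."""

    def __init__(self):
        # row -> column -> {timestamp: value}; cells that were never written
        # take no space, which is what "sparse" means here.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if omitted."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: dict(v) for c, v in self._rows[row].items()}


# Hypothetical rows keyed by a date-like game ID.
table = SparseTable()
table.put("20131203", "pbp:html", 1, "<html>old</html>")
table.put("20131203", "pbp:html", 2, "<html>new</html>")
table.put("20131204", "pbp:html", 1, "<html>other</html>")
scanned = [row for row, _ in table.scan("20131203", "20131204")]
```

Keeping rows in key order is what makes range scans cheap, and it is the property a map task relies on when it is handed a contiguous slice of rows.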

HBase

Managing a huge amount of data is a nightmare for engineers: it leaves no tolerance for
faults in either the system or the program. Processing large data efficiently and safely is
also a key part of the project. MySQL was a reliable ACID storage platform over the last
decade, until the rise of big data analytics. A related paper, “Solving Big Data Challenges
for Enterprise Application”, compares the workload performance of MySQL and HBase [7].
Their experiments show that MySQL has very good scan throughput on a single node but does
not scale with the number of nodes as the data size increases. HBase, on the other hand, can
be scaled to obtain a linear increase in throughput with the number of nodes.

Another reason for using HBase is faster program execution. In the traditional approach,
data is copied to and processed on a local machine. The paper “MapReduce: Simplified Data
Processing on Large Clusters” [8] discusses locality in large cluster environments: the
MapReduce master takes the location of the input data into account and schedules each map
task on or near a machine that holds a replica of its input.

Fg1. Program Diagram

IV. DESIGN

The design of the program has three main phases, as shown in the design diagram above. In
the first phase, data source retrieval, a large set of HTML play-by-play files is parsed and
stored in HBase alongside the Game status data source. In the second phase, feature
generation, map-reduce is used to combine the two data sources in HBase and create 20
features for computation. In the third phase, data training, the data size has already been
reduced in phase 2, so we can use the support vector regression (SVR) algorithm to train on
the data set in the “old-fashioned” way.

Data source retrieval

Two types of data source are used in this paper. The first is PBP (Play By Play) data, which
is retrieved from the NBA website. The PBP data records ball possession, shot attempts and
turnovers at specific moments of each NBA game. The PBP data used in this paper consists of
4000 games, where each game's PBP record is presented in HTML format on the NBA website.
The total PBP data size is about 2.76GB. Each PBP record has two tables that are useful to
our project: one gives the abbreviations of the home team and the away team, and the other
holds the detailed PBP data. To reduce size and improve accuracy, a Java preprocessor
program translates the raw data and eliminates noise from the website. After preprocessing,
5000 PBP records are inserted into an HBase table as HTML-format strings for the feature
generation phase. Game status data is the second data source; it contains more static
information about each team's condition, such as the pace value of the current game and how
long the home team has rested before the current game. Parts of the Game status data can be
computed after the map-reduce process during the feature generation phase.

gameID   | Column
20131203 | <html>.........</html>

Features generation

With 5000 HTML file entries stored in HBase, we can now implement the map-reduce functions.
The input to the mapper function is a pair consisting of a gameID and the corresponding
column value in the HBase table. In the mapper function, each HTML file is translated into
home team attack time and home team defense time. Table2 is a sample PBP HTML record: to
compute the home team attack time, we subtract the Time value of row two from that of row
one, since the game clock counts down. The defense time is computed in a similar way. The
mapper function outputs a pair of gameID and feature value (the set of attack times and
defense times) to the reduce function.

Table2. Sample PBP record

Time    | New York                                                  | Score | Indiana
11:35.0 |                                                           | 0-2   | L. Stephenson makes 2-pt shot from 1 ft
11:15.0 | Shumpert makes 3-pt shot from 23 ft (assist by C Anthony) | 3-2   |
10:56.0 |                                                           | 3-2   | Personal block foul by I. Shumpert
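The clock subtraction used by the mapper can be sketched as follows, using the three Time values from the sample record. The function names are illustrative assumptions. Deciding whether each interval counts as home attack time or home defense time depends on which team holds the ball, which this sketch does not model.

```python
def clock_to_seconds(clock: str) -> float:
    """Convert a game-clock string such as '11:35.0' (time remaining) to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def elapsed_between(earlier_row_clock: str, later_row_clock: str) -> float:
    """Seconds elapsed between two consecutive PBP rows; the clock counts down."""
    return clock_to_seconds(earlier_row_clock) - clock_to_seconds(later_row_clock)


# Time column of the sample PBP record.
clocks = ["11:35.0", "11:15.0", "10:56.0"]
durations = [elapsed_between(a, b) for a, b in zip(clocks, clocks[1:])]
print(durations)
```

Summing these intervals per team over a whole game gives the attack and defense times that the mapper emits for each gameID.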

The reduce function is called once for each gameID. In our case there is no real work for
the reduce phase: its only task is to write the output of the mapper function directly into
a database, alongside the pace value of the current game. The mapper phase has already
extracted features from the raw data and reduced the data size heavily. With a set of base
features for each gameID entry, more features can be computed in a post-map-reduce phase.
For example, the average home attack time over the last 5 games is computed by iterating
through the last 5 records, and the average game pace value is computed in a similar
fashion. The first half of the feature list is shown in Table3; the second half is the same
except that it applies to the away team.

Here is the list:

Table3. Features for the home team

Features               | Description
gameID                 | ID to identify each game.
HomeTeam               | Home team name
Avg1 Home defend time  | Average of home team's defend time for all the games before the current game.
Avg2 Home defend time  | Average of home team's defend time for all the games this season.
Avg1 Home offense time | Average of home team's offense time for all the games before the current game.
Avg2 Home offense time | Average of home team's offense time for all the games this season.
Avg1 Home Pace         | Average of home team's pace value for all the games before the current game.
Avg2 Home Pace         | Average of home team's pace value for all the games this season.
Std1 Var Home Pace     | Standard variance of home team's pace value for all the games before the current game.
Std2 Var Home Pace     | Standard variance of home team's pace value for all the games this season.
                       | team's pace value for last 3 games before the current game.
                       | team's pace value for last 5 games before the current game.
                       | team's pace value for last 7 games before the current game.
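The "all games before" and "last 3, 5 and 7 games" pace features in the list above can be sketched as below. This is a minimal illustration under stated assumptions: the dictionary keys are hypothetical names, the input is one team's pace values ordered oldest first, and when fewer than k earlier games exist the sketch averages whatever is available.

```python
from statistics import mean


def last_k_average(history, k):
    """Mean of the most recent k values in `history` (oldest first), or None if empty."""
    window = history[-k:] if k > 0 else []
    return mean(window) if window else None


def pace_features(paces_before_game):
    """Rolling pace features for one team, using only games before the current one."""
    return {
        "avg_all": mean(paces_before_game) if paces_before_game else None,
        "avg_last3": last_k_average(paces_before_game, 3),
        "avg_last5": last_k_average(paces_before_game, 5),
        "avg_last7": last_k_average(paces_before_game, 7),
    }


# Hypothetical pace values for six earlier games of one team.
feats = pace_features([90.0, 92.0, 94.0, 96.0, 98.0, 100.0])
print(feats)
```

Restricting the input to games played strictly before the current one keeps the current game's own pace, which is the prediction target, out of its features.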

The first step in calculating features is to read the data from HBase and store it in a
container. Because features are calculated by Game ID and team name, we need to look up
data whenever we use it. A hash map is chosen as the container because lookup is very fast
when the key is known.

Three hash maps are needed. The first stores the imported data. The Game ID is the key, and
all the other data, such as average attack time, defend time, actual days since the start,
rest days and actual pace, are stored in an array that serves as the value. Most
importantly, the team name must also be an element of this array, because features are
calculated not only by Game ID but also by team name. The first element of the array is the
home team's name, and the first element of the second half of the array is the away team's
name. The second hash map stores intermediate data. Team names are the keys, and the values
hold data such as the sum of the average attack time before the day considered, the sum of
the average defend time up to the day considered, the number of games the team has played,
and so on. The third hash map stores the final result. The Game ID is the key, and all the
features are stored in an array that is the value.
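The three-map layout can be sketched with Python dictionaries as follows. The record layout (team name, attack time, defend time for each half), the team abbreviations and the numbers are simplifying assumptions for illustration; the project's arrays hold more fields. The sketch also assumes that date-like Game IDs sort chronologically.

```python
HALF = 3  # assumed layout of each half of a record: [team name, attack time, defend time]


def build_features(games):
    """games is Map 1: Game ID -> [home half..., away half...]."""
    totals = {}    # Map 2: team name -> [attack sum, defend sum, games played]
    features = {}  # Map 3: Game ID -> list of features
    for game_id in sorted(games):  # assumes IDs such as "20131203" sort by date
        record = games[game_id]
        row = []
        for start in (0, HALF):  # home half first, then away half
            team, attack, defend = record[start:start + HALF]
            a_sum, d_sum, n = totals.get(team, [0.0, 0.0, 0])
            # Averages use only the games this team played before the current one.
            row += [a_sum / n if n else None, d_sum / n if n else None]
            totals[team] = [a_sum + attack, d_sum + defend, n + 1]
        features[game_id] = row
    return features


# Hypothetical imported data.
games = {
    "20131201": ["IND", 14.0, 16.0, "NYK", 16.0, 14.0],
    "20131203": ["NYK", 15.0, 15.0, "IND", 15.0, 15.0],
    "20131205": ["IND", 12.0, 18.0, "NYK", 18.0, 12.0],
}
features = build_features(games)
```

Keeping running sums per team in the second map means each game is visited once, instead of rescanning a team's whole history for every feature.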

The second step is the calculation. The average attack time is calculated by dividing the
sum of a team's attack times by the number of games it has played. The average defend time
and the average pace value are calculated in a similar way. The variance is calculated by
averaging the squared differences between each pace value and the average pace. The rest
days of each team take one of eight forms, so eight features are added for each team: the
feature corresponding to the team's form is 1 and the other seven are 0. All of this data is
stored directly in the third hash map, with Game IDs as keys and the arrays of features as
values.
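The variance and the eight rest-day indicator features can be sketched as below. The text does not say what the eight rest-day forms are, so the bucketing here (0 to 6 days each get their own slot, 7 or more share the last one) is purely an assumption made to have something runnable.

```python
NUM_REST_FORMS = 8  # eight rest-day forms, as stated in the text


def population_variance(values):
    """Average of the squared differences between each value and the mean."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values) / len(values)


def one_hot_rest(rest_days):
    """Eight 0/1 features with exactly one set to 1.

    Assumed bucketing (not from the paper): 0..6 rest days map to slots 0..6,
    and 7 or more rest days share slot 7.
    """
    index = min(max(rest_days, 0), NUM_REST_FORMS - 1)
    vector = [0] * NUM_REST_FORMS
    vector[index] = 1
    return vector


pace_variance = population_variance([94.0, 96.0, 98.0])
rest_features = one_hot_rest(2)
```

Encoding rest days as indicators rather than a single number lets the regression treat each form separately instead of assuming that pace changes proportionally with the number of rest days.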

Data training

We decided to use Support Vector Regression (SVR) as our main learning method for this
project. As discussed in the previous section, the pace value of a game is difficult to
predict because it depends on multiple non-linear factors, such as players' health, team
chemistry and the coach's game plan. These factors are unpredictable, which is why SVR is
the primary algorithm for the data training phase. To understand this model, let us start
with the most elementary algorithm, linear regression, which minimizes the quadratic cost
function (1). In linear regression, we solve the equation by finding the optimal vector w.

Equation (1) has another problem, even with a linear set of training data. The problem is
called over-fitting: it appears when the function performs very well on the training set but
poorly on unseen data. To avoid this problem, a penalty term on w is introduced in
equation (2).

With the penalized form of the linear equation, we can now develop a non-linear form by
adding a basis function that moves the existing vectors into a higher dimension. The idea is
very simple: if we cannot predict a pace value accurately with the existing features, we
simply add more features. A basis function is a mapping from finite-dimensional vectors to
finite- or infinite-dimensional vectors. Before creating the basis function, we rewrite
equation (2) as equation (3); the basis function is applied to convert xi to Bi = B(xi).

With some algebra, the final form of w is derived as equation (4). Note that B is a vector
whose size is the number of features. Substituting (4) into the original equation gives (5),
the final form of the equation. The inner product of basis functions can be evaluated by a
kernel function.
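The numbered equations do not appear in this text. One standard derivation that matches the steps described above, written as penalized least squares, would read as follows. The symbols $y_i$ (observed pace), $\lambda$ (penalty weight), $\alpha_i$ and $K$ are notation assumed here, and the original equations may have differed in constants or sign conventions.

```latex
\begin{align}
C(w) &= \frac{1}{2} \sum_{i} \left( y_i - w^{\top} x_i \right)^2 \tag{1} \\
C(w) &= \frac{1}{2} \sum_{i} \left( y_i - w^{\top} x_i \right)^2
        + \frac{\lambda}{2} \lVert w \rVert^2 \tag{2} \\
C(w) &= \frac{1}{2} \sum_{i} \left( y_i - w^{\top} B_i \right)^2
        + \frac{\lambda}{2} \lVert w \rVert^2,
        \qquad B_i = B(x_i) \tag{3} \\
\frac{\partial C}{\partial w}
     &= -\sum_{i} \left( y_i - w^{\top} B_i \right) B_i + \lambda w = 0
        \;\Longrightarrow\;
        w = \sum_{i} \alpha_i B_i,
        \qquad \alpha_i = \frac{1}{\lambda} \left( y_i - w^{\top} B_i \right) \tag{4} \\
f(x) &= w^{\top} B(x) = \sum_{i} \alpha_i \, B_i^{\top} B(x)
      = \sum_{i} \alpha_i \, K(x_i, x) \tag{5}
\end{align}
```

This is the penalized least-squares form that the text walks through. Full SVR replaces the squared loss with an epsilon-insensitive loss, as covered in [10], but the step from (4) to (5), where the prediction depends on the data only through inner products, is the same.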

The kernel function used in this project is the Radial Basis Function, which maps data from
a finite-dimensional space to an infinite-dimensional one. The optimal parameters of the
algorithm are found through quadratic optimization. The complete derivation and optimization
of the algorithm can be found in the paper by Alex J. Smola [10].

We slightly change our feature file by prepending the numbers required by the SVR input
format. We then split the feature file into two parts: one covering 2009 to 2012 and one
covering 2013. The former is used as training data and the latter as testing data. The
training data is used to fit the model, and the testing data is used to evaluate its
estimates. We use the mean-square error (MSE) to evaluate our prediction, that is, the mean
of the squared difference between the estimated result and the real result. To date, the
best team in the world achieves 14.2, so we set our goal at 16 or less. We follow the
principle that lower is better.
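The data preparation and evaluation steps can be sketched as below. The text does not name the SVR tool, so the LIBSVM-style line format (`<target> <index>:<value> ...`, with 1-based indices) is an assumption, as are the function names and the sample values.

```python
def to_libsvm_line(target, features):
    """Format one sample as '<target> 1:<v1> 2:<v2> ...' (assumed LIBSVM-style input)."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{target} {pairs}"


def split_by_year(samples, test_year=2013):
    """samples: list of (year, target, features). Earlier years train, test_year tests."""
    train = [s for s in samples if s[0] < test_year]
    test = [s for s in samples if s[0] == test_year]
    return train, test


def mean_square_error(predicted, actual):
    """Mean of the squared differences between predictions and real values."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical samples: (season year, pace value, feature values).
samples = [
    (2009, 95.0, [14.5, 15.5]),
    (2012, 96.0, [15.0, 15.0]),
    (2013, 97.0, [15.5, 14.5]),
]
train, test = split_by_year(samples)
line = to_libsvm_line(95.5, [14.5, 15.5])
error = mean_square_error([96.0, 100.0], [100.0, 96.0])
```

Splitting by season rather than at random keeps every test game later in time than every training game, which mirrors how the model would be used to predict future games.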

Next, we set the SVR core (regression or classification) and mode (the kernel) to build the
model that best fits our data. Because our targets do not fall into two or a fixed number
of categories, we use the regression core instead of the classification core. For the mode,
there are two main choices: a linear fit and a Gaussian fit. We test both and compare the
outcomes.
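The two kernel choices can be sketched as plain functions. The `gamma` value is a hypothetical setting, not the one used in the project, and the vectors are made-up feature values.

```python
import math


def linear_kernel(x, z):
    """Inner product of the raw feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma=0.5):
    """Radial basis function exp(-gamma * ||x - z||^2); gamma is an assumed setting."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors for two games.
game_a = [1.0, 2.0]
game_b = [3.0, 4.0]
k_linear = linear_kernel(game_a, game_b)
k_gauss = gaussian_kernel(game_a, game_b)
```

The linear kernel grows with the magnitude of the features, while the Gaussian kernel depends only on the distance between two games and equals 1 when they are identical. This is why the Gaussian fit can capture non-linear relations that the linear fit cannot.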

V. RESULTS

Since choosing representative features is key to achieving a good result, the first step of
our experiment is to select a set of features that best describe the target game. We
consulted an industry expert to generate tens of candidate features that may contribute to
our system. We then drew graphs such as Fg2 to see which of them may be more important.

Fg2. Feature Selection

From the charts shown above, we can see that pace is highly related to features such as
attack/defense time but less sensitive to the pace standard deviation. Since extra features
do not harm our system much, we do not go deep into proving the reliability or correctness
of the chosen features. We simply let the final prediction results tell us which features
help and which do not.

Prediction Accuracy

We ran through a number of feature combinations to perform the final prediction. We use the
MSE (Mean Square Error), a classic metric for the average error of a system, to reflect the
accuracy of our prediction.

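Fg3. MSE result (MSE, on an axis from 15.4 to 17, for Feature Set 1 to Feature Set 3 with the linear and Gaussian kernels)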

As Fg3 shows, a more complex combination of features reduces the MSE to a certain degree.
The pace value is about 100, so an MSE of 16 corresponds to a typical error of about 4 (the
square root of 16), or about 4%.
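The arithmetic behind the 4% figure, spelled out as a short check. Taking the square root gives the root-mean-square error, which is used here as the "typical" error; it is not exactly the same as the mean absolute error.

```python
import math

mse = 16.0            # target MSE stated in the text
typical_pace = 100.0  # approximate pace value stated in the text

rmse = math.sqrt(mse)                 # back in the same units as pace
relative_error = rmse / typical_pace  # fraction of a typical pace value

print(rmse, relative_error)
```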

VI. FUTURE WORK

Evaluating the pace value of an NBA game is not an easy task in general. Even with the
play-by-play data we have collected from the website, arriving at meaningful, analytical
data that truly describes a game is still challenging. Play-by-play data is an abstract
description of a game at certain points in time; we cannot know the exact context and
motivation of each play. Another limitation relates to a team's strategic reasons. Our
current model treats each game equally. In reality, to get a better position and shape for
the post-season, some teams have different game plans throughout a season. For example,
some teams may slow down their pace after securing their playoff positions. Players,
coaches and training staff may also differ from one year to the next, so historical data
may not reflect the current team's pace value. These more dynamic factors are not taken
into account in this project. In the future, new parameters could be introduced to
characterize this kind of impact.

From a technical point of view, we may require more data to improve the prediction
accuracy. However, from the paper by Alex J. Smola [10], we realize that SVR is a very
expensive operation, which may take quadratic time to solve an optimization equation. When
the data size reaches the bottleneck level, the training phase in our design becomes a
time-consuming process. One way to address this is to add another map-reduce layer that
compresses the raw data further; we could also migrate the training phase into another
map-reduce task to solve the Newtonian equation.

VII. CONCLUSION

We have accomplished the objective that we proposed at the early stage. Each design phase
allowed us to practice and develop skills for dealing with big data. Using MapReduce and
HBase to compress the data and extract features, from 2.1GB down to a few MB, is a key
component of our project, because it would be very inefficient to train on and analyze a
huge data set with a sophisticated algorithm like SVM. Furthermore, studying big data is
really about revealing the useful, characteristic data hidden in tons of data, not about
handling a huge data set full of noise. With that in mind, our key features describe the
behavior of the pace value of each game, as our results show.

Acknowledgment

I would like to acknowledge the help and support of my friends: Jiawei Dong, Cai Qi.

References

[1] Shoji Nishimura: MD-HBase: A Scalable Multi-dimensional Data . 2010
[2] NBA Basketball reference: http://www.basketball-reference.com/about/glossary.html, CA, May 2012.
[3] Historical NBA ASPM: http://godismyjudgeok.com/DStats/2013/nba-stats/historical-nba-aspm-and-hall-rating-released/ CA, May 2009.
[4] Matt Goldman and Justin M. Rao: Tick-Tock Shot Clock: Optimal Stopping in the NBA. Fall 2011.
[5] Tim Xin: The Value of Pace In the NBA. Spring 2012.
[6] Fay Chang: Bigtable: A Distributed Storage System for Structured Data. 2006.
[7] Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier: Distributed Semantic Web Data Management in HBase and MySQL Cluster.
[8] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. CA, 2004.
[9] Hao Helen Zhang and Marc Genton: Compactly Supported Radial Basis Function Kernels.
[10] Alex J. Smola and Bernhard Scholkopf: A Tutorial on Support Vector Regression. September 30, 2003.
