Predict NBA Game Pace
ECE 539 final project
Fa Wang
Abstract
It was long believed that putting the best players on the court would guarantee winning a
game or even a championship title. More recent historical data show that a winning team does
not necessarily have the highest score on offense and the lowest score on defense. To
evaluate the quality of a team, an index called the pace value was introduced. The goal of
our project is to predict the pace value of an NBA game by analyzing a large amount of
historical data available on the Internet. To do that, we first extract key features from
play-by-play data and game stats data using the map-reduce pattern. Because the amount of
historical play-by-play data is large, we use HBase[1] to store and process our data. Unlike
traditional standalone machine learning designs, this paper focuses on applying the
map-reduce pattern together with a machine learning method to solve a big data problem. By
examining the mean-square error (MSE) of the final result, we can judge how far our
prediction is from the real value and thereby evaluate the accuracy of our model.
I. INTRODUCTION
Predicting NBA game scores has been a popular topic for a long time. In recent years, people
have turned to possession-based score models, which have proved to give more accurate
statistical results. Later, a new index was introduced to evaluate the performance of a team
in a given game. The index is called the pace value. According to basketball-reference [2],
pace factor is an estimate of the number of possessions a team uses per game. Understanding
an opponent's pace value allows a coach to prepare better for a game, because pace factor is
closely related to the score of a game. The formula for the score is given as: Score =
ASPM * Pace. ASPM[3] denotes points per possession and Pace is the metric for possessions.
That means an accurate prediction of the pace value leads to a more precise prediction of
the final score. In this paper, we merge play-by-play (PBP) data and each game's stats data
to predict the pace value of future games. The challenge of predicting a game's pace is
clear. Different styles of defense and offense may directly influence the pace of a game,
especially during clutch time, when a game comes down to a few possessions; different teams
may use the clock differently to defeat their opponents. An older team may also be more
likely to spend a longer time on each offensive possession than a younger team. Pace value
may even depend on players' health and attitude. As a result, the pace factor is nonlinear
and difficult to predict with a small set of data.
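As a rough illustration of the relation Score = ASPM * Pace quoted above, the sketch below
multiplies points per possession by a possession count. The function name and all numbers
are hypothetical and are not taken from the project's data; they only show how an error in
the pace estimate carries through to the predicted score.

```python
def predicted_score(points_per_possession, pace):
    """Score = points per possession * number of possessions (pace)."""
    return points_per_possession * pace


# Hypothetical values chosen only for illustration.
points_per_possession = 1.05
for pace in (92.0, 96.0, 100.0):
    print(pace, round(predicted_score(points_per_possession, pace), 2))
```

With a fixed points-per-possession value, every possession of error in the pace estimate
shifts the predicted score by that same value.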
II. MOTIVATION
Precisely predicting the result of a future NBA game is a tough but appealing task for NBA
fans and non-fans alike. Recent research suggests that a precise game pace value is a key
factor, so our team decided to take this sub-topic as a trial.
With a precise pace evaluation of a team, we can apply it to all of its players so that each
player can be tagged with a pace capability. A faster team may prefer faster players, so our
assessment could serve as a tool for teams selecting players. NBA game betting is another
appealing application of our research.
III. RELATED WORK
There is a body of research related to pace value. Paper [4] examines the optimality of the
shooting decisions of National Basketball Association (NBA) players using a rich dataset of
1.4 million offensive possessions. The decision to shoot is a complex problem that involves
weighing the continuation value of the possession and the outside option of a teammate
shooting. To apply their abstract model to the data, the authors make assumptions about the
distribution of potential shots. In line with dynamic efficiency, they find that the “cut
threshold” declines monotonically with time remaining on the shot clock at approximately the
correct rate. Most line-ups show strong adherence to allocative efficiency. The authors link
departures from optimality to line-up experience, player salary and overall ability.
Paper [5] discusses how teams value controlling the pace of a game. Using the 859 games
played by the 30 NBA teams during the 2008-2009 regular season, the author finds that the
leading team tries to decrease the pace while the trailing team tries to increase it as the
game nears its end. The author also concludes that the leading team scores fewer points in
the last few minutes of the game than in the 3rd quarter. Whether a player's decision
increases or decreases the pace, given the strategy of the opponent, still needs more
investigation.
Bigtable
The Bigtable[6] data structure is a key component of the project and part of the mainstream
of today's big data analytics. The paper “Bigtable: A Distributed Storage System for
Structured Data” gives a thorough overview of the design and usage of the data structure,
from index keys and data compression to implementation and the Google Earth example. We
consider Bigtable a big help for our project because its data structure is sparse and
distributed across all clusters. Most operations in our project are mapping and reducing,
and a table indexed by row key, column key and timestamp helps us retrieve data across
multiple clusters efficiently.
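The indexing scheme described above can be pictured as a sparse map from (row key, column
key, timestamp) to a value. The following is a minimal in-memory sketch of that idea only,
not the Bigtable or HBase API; the class name SparseTable and the column name "pbp:html" are
hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory map: (row key, column key, timestamp) -> value."""

    def __init__(self):
        # Only cells that were written exist, which keeps the table sparse.
        self._cells = defaultdict(dict)  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None


table = SparseTable()
table.put("20131203", "pbp:html", 1, "<html>v1</html>")
table.put("20131203", "pbp:html", 2, "<html>v2</html>")
latest = table.get("20131203", "pbp:html")
```

Here `latest` holds the version written with timestamp 2, while passing `timestamp=1`
returns the older version.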
HBase.
Managing a huge amount of data is a nightmare for engineers: it leaves no tolerance for
faults in the system or the program. Processing large data efficiently and safely is also a
key part of the project. MySQL was a reliable ACID storage platform for the last decade,
until the arrival of big data analytics. A related paper, “Solving Big Data Challenges for
Enterprise Application”, compares the workload performance of MySQL and HBase [7]. The
results of their experiment show that MySQL has very good scan throughput on a single node,
but it does not scale with the number of nodes as the data size increases. HBase, on the
other hand, can be scaled to obtain a linear increase in throughput with the number of
nodes.
Another reason for using HBase is to speed up program execution. In the traditional
approach, data is copied to and processed on a local machine. The paper “MapReduce:
Simplified Data Processing on Large Clusters” [8] discusses locality in a large cluster
environment. The MapReduce master takes the location of the input data into account and
schedules each map task on or near a machine that holds a replica of that task's input data.
Fg1. Program Diagram
IV. DESIGN
The design of the program has three main phases, as shown in the design diagram above. In
the first phase, data source retrieval, a large set of HTML play-by-play files is parsed and
stored in HBase, alongside the game status data source. In the second phase, features
generation, map-reduce is used to combine the two data sources in HBase and create 20
features for computation. In the third phase, data training, the data size has already been
reduced in phase 2, so we can use the support vector regression (SVR) algorithm to train on
the data set in the “old-fashioned” way.
Data source retrieval
Two types of data source are used in this paper. The first is PBP (Play By Play) data, which
is retrieved from the NBA website. The PBP data records ball possession, shot attempts and
possession turnovers at given moments of each NBA game. The PBP data used in this paper
consists of 4000 games, where each game's PBP record is presented in HTML format on the NBA
website. The total PBP data size is about 2.76GB. Each PBP record has two tables that are
useful to our project: one holds the abbreviations of the home team and the away team, and
the other holds the detailed PBP data. To reduce size and improve accuracy, a Java
preprocessor program was written to translate the raw data and eliminate noise from the
website. After preprocessing, 5000 PBP records are inserted into an HBase table as HTML
strings for the features generation phase. Game status data is the second data source. It
contains more static information about each team's condition, such as the pace value of the
current game and how long the home team has rested before the current game. Parts of the
game status data can be computed after the map-reduce process during the features
generation phase.
gameID Column
20131203 <html>.........</html>
Features generation
With 5000 HTML file entries stored in HBase, we can now implement the map-reduce function.
The input to the mapper function is a pair of gameID and column value from the HBase table.
In the mapper function, each HTML file is translated into home team attack time and home
team defense time. Table2 is a sample PBP HTML record. To compute the home team attack time,
we subtract the Time value of row two from that of row one. A similar process computes the
defense time. After the mapper function, a pair of gameID and a feature value (the set of
attack time and defense time) is output to the reduce function.
Time    | New York                                                  | Score | Indiana
11:35.0 |                                                           | 0-2   | L. Stephenson makes 2-pt shot from 1 ft
11:15.0 | Shumpert makes 3-pt shot from 23 ft (assist by C Anthony) | 3-2   |
10:56.0 |                                                           | 3-2   | Personal block foul by I. Shumper
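The mapper and reducer steps can be sketched in a few lines. This is a minimal illustration
rather than the project's Java implementation, and it rests on stated assumptions: events
are already parsed into (clock, team) pairs, the time between two consecutive events is
credited to the team that performs the later event, and New York is treated as the home team
purely for the example. The function names are hypothetical; the gameID is reused from the
sample row shown earlier.

```python
def clock_to_seconds(clock):
    """Convert a game clock string such as '11:35.0' to seconds remaining."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def map_game(game_id, events, home_team):
    """Mapper sketch: emit (gameID, (home attack seconds, home defense seconds)).

    events: list of (clock, team) tuples in play order within one period.
    Assumption: the time between two consecutive events is credited to the
    team that performs the later event.
    """
    attack = defense = 0.0
    for (prev_clock, _), (clock, team) in zip(events, events[1:]):
        elapsed = clock_to_seconds(prev_clock) - clock_to_seconds(clock)
        if team == home_team:
            attack += elapsed
        else:
            defense += elapsed
    return game_id, (attack, defense)


def reduce_game(game_id, values, pace):
    """Reducer sketch: pass the mapper output through and attach the pace."""
    attack, defense = values
    return game_id, {"attack": attack, "defense": defense, "pace": pace}


# First two rows of the sample record: 11:35.0 (Indiana), 11:15.0 (New York).
events = [("11:35.0", "Indiana"), ("11:15.0", "New York")]
result = map_game("20131203", events, "New York")
```

For these two rows, the 20 seconds between 11:35.0 and 11:15.0 are counted as attack time,
so `result` is `("20131203", (20.0, 0.0))`.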
The reduce function is called once for each gameID. In our case, the reduce phase has no
real work to do: its only task is to write the output of the mapper function directly into a
database, alongside the pace value of the current game. The mapper phase has already
extracted features from the raw data and reduced the data size heavily. With a set of base
features for each gameID entry, more features can be computed after the map-reduce phase.
The average home attack time for the last 5 games can be computed by iterating through the
last 5 records. The average game pace value can be computed in a similar fashion, and so on.
The first half of the feature list is shown in Table3; the second half is the same except
that it applies to the away team.
Table3. Features (home team half):
Features               | Description
gameID                 | ID to identify each game.
HomeTeam               | Home team name
Avg1 Home defend time  | Average of home team's defend time for all the games before the current game.
Avg2 Home defend time  | Average of home team's defend time for all the games this season.
Avg1 Home offense time | Average of home team's offense time for all the games before the current game.
Avg2 Home offense time | Average of home team's offense time for all the games this season.
Avg1 Home Pace         | Average of home team's pace value for all the games before the current game.
Avg2 Home Pace         | Average of home team's pace value for all the games this season.
Std1 Var Home Pace     | Standard variance of home team's pace value for all the games before the current game.
Std2 Var Home Pace     | Standard variance of home team's pace value for all the games this season.
Avg3 Home Pace         | Average of home team's pace value for last 3 games before the current game.
Avg5 Home Pace         | Average of home team's pace value for last 5 games before the current game.
Avg7 Home Pace         | Average of home team's pace value for last 7 games before the current game.
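The Avg3, Avg5 and Avg7 features are trailing averages over a team's most recent games. A
minimal sketch of that computation follows; the function name and the pace values are
hypothetical, and averaging over whatever history exists when fewer games are available is
an assumption, since the text does not say how early-season games are handled.

```python
def trailing_average(values, window):
    """Average of the last `window` entries of `values`; None if no history.

    `window` is assumed to be a positive integer. If fewer than `window`
    games exist, the average covers the games that are available.
    """
    recent = values[-window:]
    return sum(recent) / len(recent) if recent else None


# Hypothetical pace values of one team's earlier games, oldest first.
previous_paces = [94.0, 98.0, 96.0, 100.0, 92.0]
avg3 = trailing_average(previous_paces, 3)  # (96 + 100 + 92) / 3
avg5 = trailing_average(previous_paces, 5)  # mean of all five values
```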
The first step in calculating features is to read the data from HBase and store it in a
container. Because features are calculated according to game ID and team name, we need to
look up data whenever we use it. A hash map is chosen as the container because lookup is
very fast when the key is known.
Three hash maps are needed. The first one stores the imported data. The game ID is the key,
and all the other data, such as average attack time, defend time, actual day since start,
rest day and actual pace, is stored in an array that is the value of the hash map. The most
important point about the first hash map is that the team name must also be an element of
the array, because features are calculated not only by game ID but also by team name. The
first value of the array is the home team's name, and the first value of the second half of
the array is the away team's name. The second hash map stores intermediate data. Team names
are the keys, and the values hold data such as the sum of the average attack time before the
day considered, the sum of the average defend time up to the day considered, the number of
games the team has played, and so on. The third hash map stores the final result. The game
ID is the key, and all the features are stored in an array that is the value of the hash
map.
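The three-map layout can be sketched with plain dictionaries. The version below is a
simplified illustration under stated assumptions: it keeps only the home-team half of each
array, it treats game IDs as date-like strings that sort chronologically, and the function
name build_features is hypothetical.

```python
def build_features(imported):
    """Compute per-game averages over each home team's earlier games.

    imported: first map, gameID -> [home_name, attack_time, defend_time, pace].
    Returns the third map, gameID -> [avg_attack, avg_defend, avg_pace],
    where each average covers only games before the current one.
    """
    running = {}   # second map: team -> [attack_sum, defend_sum, pace_sum, games]
    features = {}  # third map: gameID -> feature array
    for game_id in sorted(imported):  # assumption: IDs sort chronologically
        home, attack, defend, pace = imported[game_id]
        sums = running.setdefault(home, [0.0, 0.0, 0.0, 0])
        games = sums[3]
        if games:
            features[game_id] = [s / games for s in sums[:3]]
        else:
            features[game_id] = [None, None, None]  # no history yet
        # Update the running totals only after the features were emitted,
        # so the current game never leaks into its own features.
        sums[0] += attack
        sums[1] += defend
        sums[2] += pace
        sums[3] += 1
    return features


# Hypothetical values for three games of one team.
imported = {
    "20131201": ["NYK", 10.0, 12.0, 90.0],
    "20131203": ["NYK", 14.0, 16.0, 100.0],
    "20131205": ["NYK", 12.0, 14.0, 95.0],
}
features = build_features(imported)
```

Keying the intermediate map by team name keeps each update constant-time, which matches the
reason the text gives for choosing hash maps.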
The second step is the calculation. The average attack time is calculated by dividing the
sum of attack time by the number of games the team has played. The average defend time and
the average pace value are calculated in the same way. The variance is calculated by
averaging the squared difference between each pace value and the average pace. The rest day
of each team takes one of eight forms, so eight features are added for each team: the
feature corresponding to the team's rest day is 1 and the other seven are 0. All of this
data is stored directly in the third hash map, with game IDs as keys and the feature arrays
as values.
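The variance and the eight rest-day indicator features can be sketched as follows. The text
does not define the eight rest-day forms, so the mapping used here (category k means k rest
days, with the last category absorbing anything longer) is an assumption, as are the
function names.

```python
def pace_variance(paces):
    """Population variance: mean of squared deviations from the mean pace."""
    mean = sum(paces) / len(paces)
    return sum((p - mean) ** 2 for p in paces) / len(paces)


def rest_day_one_hot(rest_days, categories=8):
    """Encode rest days as `categories` indicator features.

    Assumption: category k means k rest days, and the last category
    absorbs any longer rest period.
    """
    index = min(max(rest_days, 0), categories - 1)
    return [1 if k == index else 0 for k in range(categories)]


variance = pace_variance([90.0, 100.0])  # deviations of -5 and +5
encoded = rest_day_one_hot(2)            # exactly one indicator is set
```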
Data training
We decided to use Support Vector Regression (SVR) as the main learning method for this
project. As discussed in the previous section, the pace value of each game is difficult to
predict because it depends on multiple non-linear factors, such as player health, team
chemistry, and the coach's game plan. These factors are unpredictable. Therefore SVR is the
primary algorithm for the data training phase. To understand this model, let us start with
the most elementary algorithm, linear regression, which minimizes the quadratic cost
function (1). In linear regression, we solve the equation by finding the optimal vector w.
Equation (1) has a further problem, even with a set of linear training data. The problem is
called over-fitting, which appears when the function performs very well on the training set
but poorly on unseen data. To avoid this problem, a penalty term on w is introduced in
equation (2).
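Equations (1) and (2) are not reproduced in this copy. Assuming the standard least-squares
cost and its ridge-penalized form, they would read as follows; the symbols N (number of
training games), y_i (observed pace of game i), x_i (feature vector of game i) and the
penalty weight \lambda are notation introduced here for illustration.

```latex
% (1) quadratic cost of linear regression
J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - w^{T} x_i \right)^{2}

% (2) the same cost with a penalty term on w
J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - w^{T} x_i \right)^{2}
       + \frac{\lambda}{2} \lVert w \rVert^{2}
```

A larger \lambda shrinks w more strongly, trading a worse fit on the training set for less
sensitivity to it.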
With the penalized form of the linear equation, we can now develop a non-linear form by
adding a basis function that moves the existing vectors into a higher dimension. The idea is
simple: if we cannot predict a pace value accurately with the existing features, we add more
features. A basis function is a mapping from finite-dimensional vectors to finite- or
infinite-dimensional vectors. Before creating the basis function, we rewrite equation (2) as
equation (3), in which the basis function converts xi to Bi = B(xi).
With some algebra, the final form of w, equation (4), is derived. Note that B is a vector
whose length is the number of features. Substituting (4) into the original equation gives
form (5), which is the final form of the equation. The inner product of basis functions can
be evaluated by a kernel function.
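Equations (3) to (5) are likewise not reproduced in this copy. One reading consistent with
the description, under the same assumed notation (N training games, targets y_i, penalty
weight \lambda, dual coefficients \alpha_i, kernel K), is the squared-loss derivation below.
It follows the regularized regression path the text describes; SVR as defined in [10]
replaces the squared loss with an \epsilon-insensitive loss but arrives at the same kernel
expansion.

```latex
% (3) penalized cost after the basis-function mapping B_i = B(x_i)
J(w) = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - w^{T} B_i \right)^{2}
       + \frac{\lambda}{2} \lVert w \rVert^{2}

% (4) setting the gradient of (3) to zero writes w in terms of the training points
w = \sum_{i=1}^{N} \alpha_i B_i ,
\qquad \alpha_i = \frac{1}{\lambda} \left( y_i - w^{T} B_i \right)

% (5) substituting (4): the prediction depends only on inner products of basis vectors
f(x) = w^{T} B(x) = \sum_{i=1}^{N} \alpha_i \, B_i^{T} B(x)
     = \sum_{i=1}^{N} \alpha_i \, K(x_i, x)
```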
The kernel function used in this project is the radial basis function, which moves data from
a finite-dimensional space to an infinite-dimensional one. The optimal parameters of the
algorithm are found through quadratic optimization. The complete derivation and optimization
of the algorithm can be found in the paper by Alex J. Smola [10].
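For reference, the Gaussian radial basis function kernel has the standard form
K(x, z) = exp(-gamma * ||x - z||^2). The sketch below evaluates it for two feature vectors;
the function name and the default value of gamma are hypothetical and are not the project's
settings.

```python
import math


def rbf_kernel(x, z, gamma=0.1):
    """Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical vectors give 1.0
apart = rbf_kernel([0.0], [1.0], gamma=1.0)  # equals exp(-1)
```

The kernel value falls from 1 toward 0 as two games' feature vectors move apart, so nearby
training games dominate a prediction.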
We slightly change our feature file by prepending a number to each entry, as required by the
SVR input format. Then we split the feature file into two parts, one covering the years 2009
to 2012 and one covering the year 2013. The former is used as training data and the latter
as testing data. The training data is used to fit the model and the testing data is used to
evaluate its estimates. We use the mean-square error to evaluate our prediction, that is,
the average of the squared error between the estimated result and the real result. To date,
the best team in the world achieves 14.2, so we set our goal at 16 or less. We follow the
principle: the lower, the better.
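The split and the evaluation metric can be sketched as below. This is a minimal
illustration, assuming each row carries a gameID whose first four characters are the year
(as in the sample ID 20131203); the function names and sample rows are hypothetical.

```python
def split_by_year(rows, test_year=2013):
    """Split rows into training (earlier years) and testing (test_year).

    rows: list of (gameID, features, pace) tuples.
    Assumption: the gameID starts with the four-digit year.
    """
    train = [r for r in rows if int(str(r[0])[:4]) < test_year]
    test = [r for r in rows if int(str(r[0])[:4]) == test_year]
    return train, test


def mean_square_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical rows: (gameID, feature array, pace).
rows = [
    ("20091030", [0.1], 95.0),
    ("20121225", [0.2], 98.0),
    ("20131203", [0.3], 96.0),
]
train, test = split_by_year(rows)
mse = mean_square_error([100.0, 96.0], [96.0, 100.0])  # two errors of 4 each
```

Splitting by year rather than at random keeps every test game later than every training
game, which mirrors how the model would be used on future games.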
Next, we have to set the SVR core and mode to build the model that best fits our data.
Because our target is a continuous value rather than one of two or more fixed categories, we
use the regression core instead of the classification core. For the mode, there are two main
choices: a linear fit and a Gaussian fit. We test both and compare the outcomes.
V. RESULTS
Since choosing representative features is key to achieving a good result, the first step of
our experiment is to select a set of features that best describe the target game. We
consulted an industry expert to generate tens of candidate features that may contribute to
our system. Then we drew graphs like those in Fg2 to see which of them may be more
important.
Fg2. Feature Selection
From the charts, we can see that pace is highly related to features like attack/defense time
but less sensitive to the pace standard deviation. Since an extra feature does not hurt our
system much, we do not go deep into proving the reliability or correctness of the chosen
features. We let the final prediction result tell us which features help and which do not.
Prediction Accuracy
We ran a number of feature combinations to perform the final prediction. We use the MSE
(mean square error) to reflect the accuracy of our prediction. MSE is a classic metric for
evaluating the average error of a system.
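Fg3. MSE result: MSE (axis range 15.4 to 17) for Feature Set 1, Feature Set 2 and Feature
Set 3, with Linear Kernel and Gaussian Kernel series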
From Fg3, we can see that a more complex combination of features reduces the MSE to a
certain degree. Pace values are around 100, so an MSE of 16 corresponds to a
root-mean-square error of 4, which means the average error is about 4%.
VI. FUTURE WORK
Evaluating the pace value of an NBA game is not an easy task in general. Even with the
play-by-play data we have collected from the website, obtaining meaningful, analytical data
that truly describes a game is still challenging. Play-by-play data is an abstract
description of a game at a certain time frame; we cannot know the exact context and
motivation of each play. Another limitation relates to a team's strategic reasons. In our
current model, we treat each game equally. In reality, to get a better seed and be in better
shape for the post-season, some teams may follow different game plans over the course of a
season. For example, some teams may slow down their pace after securing their playoff
positions. Players, coaches and training staff may also differ from one year to another. As
a result, historical data may have little bearing on the current team's pace value. These
more dynamic factors are not taken into account in this project. In the future, new
parameters could be introduced to characterize this kind of impact. From a technical point
of view, we may
require more data to improve the prediction accuracy. However, from the paper by Alex J.
Smola [10], we realize that SVR is a very expensive operation that may take quadratic time
to solve an optimization problem. When the data size reaches the bottleneck level, the
training phase in our design would become a time-consuming process. One way to address this
is to add another map-reduce layer that heavily compresses the raw data. We could also
migrate the training phase into another map-reduce task to solve the Newtonian equation.
VII. CONCLUSION
We have accomplished the objective that we proposed at the early stage. Each design phase
allowed us to practice and develop skills for dealing with big data. Using map-reduce and
HBase to compress and extract features from 2.1GB of data down to a few MB is a key
component of our project, because it would be very inefficient to train and analyze a huge
data set with a sophisticated algorithm like SVM. Furthermore, studying big data is about
revealing the useful characteristic data hidden in a large volume of data, not about
handling a huge amount of noise. With that in mind, our key features describe the behavior
of the pace value of each game, as our results show.
Acknowledgment
I would like to acknowledge the help and support of my friends: Jiawei Dong, Cai Qi.
References
[1] Shoji Nishimura: MD-HBase: A Scalable Multi-dimensional Data. 2010.
[2] NBA Basketball reference: http://www.basketball-reference.com/about/glossary.html, CA,
May 2012.
[3] Historical NBA ASPM:
http://godismyjudgeok.com/DStats/2013/nba-stats/historical-nba-aspm-and-hall-rating-released/
CA, May 2009.
[4] Matt Goldman and Justin M. Rao: Tick-Tock Shot Clock: Optimal Stopping in the NBA. Fall
2011.
[5] Tim Xin: The Value of Pace In the NBA. Spring 2012.
[6] Fay Chang: Bigtable: A Distributed Storage System for Structured Data. 2006.
[7] Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier: Distributed
Semantic Web Data Management in HBase and MySQL Cluster.
[8] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large
Clusters. CA, 2004.
[9] Hao Helen Zhang and Marc Gento: Compactly Supported Radial Basis Function Kernels.
[10] Alex J. Smola and Bernhard Scholkopf: A Tutorial on Support Vector Regression. September
30, 2003.