© 2015 Royal Statistical Society 0964–1998/15/179000

J. R. Statist. Soc. A (2015)

Beyond completion rate: evaluating the passing ability of footballers

Łukasz Szczepański

University of Salford, and Smartodds, London, UK

and Ian McHale

University of Manchester, UK

[Received February 2013. Final revision January 2015]

Summary. Passing the ball is one of the key skills of a football player, yet the metrics commonly used to evaluate passing ability are crude and largely limited to various forms of a pass completion rate. These metrics can be misleading for two general reasons: they account neither for the difficulty of the attempted pass nor for the various levels of uncertainty involved in empirical observations based on different numbers of passes per player. We address both these deficiencies by building a statistical model in which the success of a pass depends on the skill of the executing player as well as other factors including the origin and destination of the pass, the skill of his teammates and the opponents, and proxies for the defensive pressure put on the executing player, as well as random chance. We fit the model by using data from the 2006–2007 season of the English Premier League provided by Opta, estimate each player’s passing skill and make predictions for the next season. The model predictions considerably outperform a naive method of simply using the previous season’s completion rate as a predictor of the following season’s completion rate. In particular, we show how a change in the difficulty of passes attempted between the two seasons explains a significant proportion of the shift in the observed performance of some players, a fact that is ignored if the raw completion rate is used to evaluate player skill.

Keywords: Generalized additive mixed models; Ranking; Rating; Soccer; Sport

1. Introduction

Applications of statistics to problems in sport have been increasing in recent years, both within the industry and within academia. US sports such as baseball have embraced the idea that statistics can help to gain an advantage over the competition. Association football (herein football) has been a little slower in adopting statistics as a helpful tool. Perhaps this is in part due to the complex, dynamic nature of the game, compared with the simpler discrete nature of baseball, meaning that applying statistical thinking to football is more difficult. However, with ever increasing computing power and data becoming readily available, football is now more accessible for statistical analysis.

There are many potential ways to use statistics in football to aid team performance, but the two most obvious are to inform playing strategy (e.g. which players play well together, where should players be on the pitch and which of the opposition players is most dangerous?), and to advise on player recruitment. It is the latter of these with which we are concerned in this paper.

Address for correspondence: Ian McHale, School of Mathematics, University of Manchester, Manchester, M13 9PL, UK. E-mail: [email protected]


Player recruitment essentially boils down to two tasks: rating players and estimating each player’s value to a team or club. Estimating a player’s value is perhaps the task of financial analysts but statisticians can certainly help with rating the players, and there have been many attempts. Duch et al. (2010), Oberstone (2011) and McHale et al. (2012) have all presented methods to rate players. However, these methods look retrospectively at how players have performed in previous games and are primarily meant to be used by the media, fans and pundits for debate and not to estimate how players might perform in the future.

Some work has been done in football to predict future performance. McHale and Szczepański (2014) used a mixed effects model to estimate the goal scoring ability of footballers and measured the performance of their ratings by using them to predict the number of goals scored by each player in the season after the data that were used in the model fitting process. Goals are undoubtedly the most important aspect of a football game since goals are what decide the match outcome, but goals are normally a consequence of a passage of play in which a team passes the ball from player to player, with the objective of creating a shooting (and hence scoring) opportunity.

As a consequence of its important role in creating scoring opportunities, the skill of passing the ball is given great weight when evaluating players and the media often report each player’s ‘number of passes’ and/or ‘pass completion percentage’ to demonstrate that a player has had a good (or bad) game. However, these statistics are crude metrics. It is well known, for example, that completing a pass becomes more difficult as the pass approaches the opposition’s goal (the space on the pitch becomes more crowded with opposition players as they defend their goal). Adjustments to these naive statistics can be made, like measuring ‘passes in the final third of the pitch’. Putting empirical pass completion rates in the context of the part of the pitch that they were executed from or their direction is certainly an improvement over quoting a single value per player. However, it leads to some new problems: why, for example, should we focus on passes in the final third of the pitch and not, say, the final quarter? Making such an arbitrary decision is clearly not attractive from a scientific standpoint. A more appropriate metric to measure passing ability would not simply be a player’s pass completion percentage in the final third of the pitch but would include information on the origin and intended destination of the pass. To do this, and to measure passing ability properly, we employ statistical modelling techniques.

The method that is proposed in this paper attempts to provide a comprehensive framework for evaluating the passing ability of footballers. Our approach to this problem is to use a generalized additive mixed effects model to estimate the probability that a pass is successful, i.e. that the pass finds another player on the passing player’s team. Our model controls for the effect of external circumstances on the outcome of the pass, such as the quality of the player’s team and the opposition, and the origin and destination of the pass. We believe that the key strength of the model comes from acknowledging the random component that is inherent in empirical pass completion percentages. For example, players who boasted an exceptionally high or low pass completion rate in one season are likely to have benefited from respectively good or bad luck and may not record as extreme a rate in the following season. Our mixed effects model recognizes that there may be noise and shrinks each player’s empirical pass completion percentage towards the mean.

The influence of noise in empirical statistics relating to players’ performances has been accounted for previously. Leading the way is baseball: Efron and Morris (1975) used Stein’s estimator to shrink observed batting averages and showed the positive effect on predictions. Albert (1992, 2006) used random-effects models to analyse home run hitters and to estimate the ability of pitchers respectively. Related methods have been applied to analysing other aspects of baseball with Jensen et al. (2009) focusing on fielding performance and Loughin and Bargen (2008) investigating pitcher and catcher influence on base stealing. As far as we are aware, the only academic paper looking at a similar problem in football is McHale and Szczepański (2014) who studied the goal scoring ability of players.

The paper is structured as follows. Section 2 outlines the data that we employ for our analysis. We present this before our modelling approach as understanding the data that are available justifies some modelling choices that we make. Section 3 describes our model of passing. In Section 4 we present the results of fitting our model and then use it to make various types of prediction which we compare with the empirical data. We conclude with some discussion in Section 5.

2. Data

The dynamic nature of the game of football means that reducing a football match to a data set ready for analysis is more difficult than for discrete sports like cricket and baseball. However, companies like Prozone and Opta now record the timing and location (x–y-co-ordinates on the pitch) of every action (goal, shot, pass, tackle and so on) during a game. We obtained data from Opta (http://www.optasports.com). The data are recorded by a team of three analysts watching each game live (at the venue) and inputting the location and type of event as it happens into specialist software on tablet computers. These data are used, for example, to feed Web sites giving live commentary during match days. After each match, the data are checked for errors and corrections made if necessary. The data that we have give the details of actions in the 760 games of the 2006–2007 and 2007–2008 seasons of the English Premier League. This database includes information on passes attempted during each match such as the executing player, the game time of the event in minutes and seconds, the pitch co-ordinates of the origin of the pass, the pitch co-ordinates of the destination of the pass, whether the pass was attempted with a player’s foot or his head and a success indicator, among many others.

To focus our attention on what we expect to be a roughly homogeneous group of events we select all open play passes between two outfield players, giving us $I = 253090$ events to analyse. There are $K = 481$ outfield players among $T = 20$ teams in our fitting sample, which is the 2006–2007 season only.

2.1. Factors influencing pass success and their proxies

Our approach to estimating each player’s passing ability is to model the probability that each pass is successful, given information on the environment in which the pass was made and, of course, the identity of the player making the pass. We thus want to create a set of covariates describing the situation of the pass from the data that we have available to us.

Many factors potentially influence the outcome of a pass. These include

(a) the inherent skill of the player passing the ball,
(b) the degree of control that the executing player has on the ball when attempting the pass (for example, a ball bouncing at waist height is more difficult to pass than a ball that is stationary on the ground),
(c) the level of pressure that the opposition team put on the executor of the pass,
(d) the distance of the attempted pass,
(e) the level of pressure that the opposition team put on the player receiving the ball and
(f) familiarity with the type of situation that the pass is attempted in (for example, home team players may know the surface better; similarly a winger is more likely to make a successful cross from a wide area of the field than a central forward who finds himself in this area only occasionally).

Of these factors only pass distance can be derived directly from our data set. The information about the other factors is not directly available; however, we can develop proxies for these factors by using the data. For instance, we do not have information on how much pressure is being placed on the player receiving the pass, but we can hypothesize that it will generally be greater the closer he is to the opposition’s goal, yet may be less as the opposition players tire towards the end of the match. This leads us to the idea of using the intended destination of the pass and the timing of the pass as proxies for pressure on the player receiving the ball and to experiment with including them as covariates in our model to estimate the probability that a pass is successful.

Continuing with this mode of thought, we create several variables derived from the data to proxy factors influencing success of a pass. Each of these variables can be considered to influence the success of the pass in many ways.

(a) The origin of the pass (x- and y-co-ordinates on the pitch) and the intended destination of the pass (which we denote by the $x_{\mathrm{end}}$- and $y_{\mathrm{end}}$-co-ordinates on the pitch) proxy the pressure on the passing player, the pressure on the receiving player and the difficulty of the pass in terms of distance.

(b) The time since the previous pass (which we denote $\delta t$) and the pass number in the current sequence of passes for that team, $e$, proxy the control that the passing player may have of the ball and the pressure the opposition players are placing him and the receiving player under.

(c) The game time ($t$, in minutes) proxies the pressure that the passing player and the receiving player might be under. In addition, it may reflect the fatigue of the player with the ball and affect the passing player negatively.

(d) Whether the pass is performed with the player’s head or foot, or indeed whether the previous pass was executed with the head or foot, serves as a proxy for how well the player is in control of the ball. We give this covariate the symbol $a$.

(e) From the data set we can extract information on whether each action followed a duel (according to the data provider’s definition, a duel is a closely balanced contest between two players of opposing sides in the match) and for pass events this serves as a proxy for the pressure that the passing player might be under. We denote duels as $d_a$ for an aerial duel and $d_t$ for a duel on the ground. These variables equal 1 if the pass immediately follows a duel. A third duel variable, $d_s$, indicates that the pass was made by the player involved in the duel. For example, if player A successfully tackles an opponent, takes control of the ball and makes a pass, $d_t = 1$, $d_a = 0$ and $d_s = 1$ for this pass. If following the tackle the ball falls to his teammate B instead, $d_s = 0$ for B’s pass that follows.

(f) Whether the player is at his home ground, $h$, serves as a proxy for familiarity with the conditions.

(g) Lastly, we consider the player’s position as a proxy for whether he is under pressure from the opposition and whether the player whom he is passing to is under pressure. We denote this in terms of the average x–y-co-ordinates for the kth player in games before the jth match as $\bar{x}_{k,j}$ and $\bar{y}_{k,j}$. We discuss the definition of and the meaning of this variable in more detail below.
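The duel indicators in item (e) can be illustrated with a minimal sketch. The event record and its labels (`kind`, `player`, `"aerial_duel"`, `"tackle_duel"`) are hypothetical and are not the data provider's schema; the sketch also assumes that a duel event is recorded against the player who emerges with the ball.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    kind: str    # hypothetical labels: "pass", "aerial_duel", "tackle_duel"
    player: str  # hypothetical player identifier


def duel_indicators(events: List[Event]) -> List[Dict[str, object]]:
    """For each pass, flag whether the immediately preceding event was a duel."""
    rows = []
    previous = [None] + events[:-1]  # event directly before each event
    for prev, cur in zip(previous, events):
        if cur.kind != "pass":
            continue
        d_a = int(prev is not None and prev.kind == "aerial_duel")
        d_t = int(prev is not None and prev.kind == "tackle_duel")
        # d_s = 1 only when the passer is the player recorded in the duel
        d_s = int(bool(d_a or d_t) and prev.player == cur.player)
        rows.append({"player": cur.player, "d_a": d_a, "d_t": d_t, "d_s": d_s})
    return rows


# Player A wins a tackle and passes; later A wins a tackle and B passes.
example = [
    Event("tackle_duel", "A"), Event("pass", "A"),
    Event("tackle_duel", "A"), Event("pass", "B"),
]
rows = duel_indicators(example)
```

In this toy sequence the first pass gets $d_t = 1$, $d_a = 0$, $d_s = 1$ and the second gets $d_s = 0$, mirroring the example in the text.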

The resultant covariates are defined, with their symbols, in Table 1. Also included is the factor or factors that each covariate is serving as a proxy for and whether we include a lag of the variable in the model. Whether or not these covariates carry information about pass success rate can be verified when including them in our statistical model described in Section 3.

Table 1. Covariates used to proxy factors influencing pass success†

Approximated factors: control, passing player pressure, distance, receiving player pressure, familiarity.

Type        Covariate                                 Symbol                                              Lags   Factors marked
----------  ----------------------------------------  --------------------------------------------------  -----  --------------
Continuous  Origin and destination                    $x$, $y$, $x_{\mathrm{end}}$, $y_{\mathrm{end}}$    0      ✓ ✓ ✓
Continuous  Time since previous pass (s)              $\delta t$                                          0, 1   ✓ ✓ ✓
Continuous  Pass number in this sequence of passes    $e$                                                 0      ✓ ✓ ✓
Continuous  Game time (min)                           $t$                                                 0      ✓ ✓ ✓
Continuous  Player position in the game               $\bar{x}_{k,j}$, $\bar{y}_{k,j}$                    0      ✓ ✓ ✓
Indicator   Headed pass                               $a$                                                 0, 1   ✓
Indicator   Duel (aerial, tackle, same player)        $d_a$, $d_t$, $d_s$                                 1      ✓ ✓
Indicator   Home advantage                            $h$                                                 0      ✓

†Lags indicate whether the value corresponding to the executed pass (lag = 0) or the previous pass (lag = 1) is considered.

2.2. Defining a player’s average position

A key variable in our model is the passing player’s playing position during the game. We intend to use this as a proxy for the pressure that the passing player might be under from the opposition. A complicating factor with using this as a covariate is that of possible reverse causation between playing position and pass success rate. For example, a player’s position (which is usually assigned by the manager or coach) may be a function of his pass success rate in that players who are deemed to have high pass success rates are asked to play nearer the opposition’s goal, which in turn lowers their pass success rate as a consequence of increased pressure on that player. To guard against the possibility of endogeneity, we use ‘anticipated playing position’ which is a weighted average of a player’s average co-ordinates in matches before the current match. A player’s passing ability in a given game cannot affect his anticipated playing position defined this way.

We now describe our methodology for deriving this from the individual events data. More sophisticated algorithms could be used for this but, since this is not a primary focus of our study, we settle for the following method to anticipate the position of the kth player in the jth game.

(a) Calculate the absolute value of the distance from the centre of the pitch in the width co-ordinate for each event $i$ involving the player as

$$\bar{y}_i = \left| y_i - \tfrac{1}{2} \right| .$$

The pitch co-ordinates are $x \in \langle 0, 1 \rangle$ for the pitch length (0 being the co-ordinate of the goal of the team on the ball) and $y \in \langle 0, 1 \rangle$ for its width (0 for the right sideline). We use the absolute value of this distance to avoid cancelling of terms when we calculate the average distance from the centre of the pitch during a match for players who switch wide positions during the game.


Fig. 1. Contour plot of anticipated player positions in games in the 2006–2007 season: the x-axis represents the sideline, $\bar{x} = 0.0$ is that team’s goal line and $\bar{x} = 1.0$ is the opposition’s goal line; the y-axis is the distance from the axis going through the centre of the goals so $\bar{y} = 0.0$ is the centre of the goals and $\bar{y} = 0.5$ corresponds to the two throw-in lines; players’ nominal positions are based on the boundary definitions shown (LRD, left or right defender; CD, central defender; LRM, left or right midfielder; CM, central midfielder; CA, central attacker)

(b) Calculate $(\bar{x}_{k,j}, \bar{y}_{k,j})$ as the weighted average co-ordinates of all the kth player’s events (shots, passes, tackles, duels, dribbles, etc.) for matches before the jth game. The weights for events from a game played on day $d_m$ depend exponentially on the number of days between that day and the day of the jth fixture and are given by $\exp\{-\phi (d_j - d_m)\}$. We set $\phi = 0.1$ which means that the co-ordinates from a game contribute around half as much to the average as the co-ordinates from a game played a week later (since $\exp(-0.7) \approx 0.5$). This choice is entirely arbitrary but was found to work well in our model (in that players were, in general, assigned to the position that we would expect them to be, given knowledge of a player’s expertise). These continuous variables are what enter our model as covariates.
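Steps (a) and (b) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name and the input layout (a list of `(day, [(x, y), ...])` pairs) are hypothetical, and every event is weighted by the age of the game in which it occurred.

```python
import math
from typing import List, Optional, Tuple

PHI = 0.1  # decay rate per day, the value quoted in the text


def anticipated_position(
    past_games: List[Tuple[int, List[Tuple[float, float]]]],
    fixture_day: int,
    phi: float = PHI,
) -> Optional[Tuple[float, float]]:
    """Weighted average (x, |y - 1/2|) over events in games before fixture_day.

    past_games holds (day, [(x, y), ...]) pairs with co-ordinates in [0, 1].
    Returns None when the player has no earlier game (e.g. his first game).
    """
    w_sum = x_sum = y_sum = 0.0
    for day, coords in past_games:
        if day >= fixture_day:
            continue  # only games played before the fixture contribute
        weight = math.exp(-phi * (fixture_day - day))
        for x, y in coords:
            w_sum += weight
            x_sum += weight * x
            y_sum += weight * abs(y - 0.5)  # step (a): distance from the centre
    if w_sum == 0.0:
        return None
    return x_sum / w_sum, y_sum / w_sum


# A game played a week before another carries about half of its weight.
week_ratio = math.exp(-PHI * 7)
```

Returning `None` for a player without an earlier game corresponds to the missing values for first appearances, which the text removes from the fitting sample.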

To present our results and to ease interpretation we categorize players to nominal positions in each game on the basis of the $(\bar{x}_{k,j}, \bar{y}_{k,j})$ values according to the rule that is illustrated in Fig. 1. We further categorize players to nominal positions for the whole season on the basis of how frequently they were assigned to each position in that season. This last step is used just to present player passing ratings by position and does not appear in the model of the outcome of the pass.

Finally, because we anticipate players’ positions on the basis of past games, data are missing (for $(\bar{x}_{k,j}, \bar{y}_{k,j})$) in the first game of each player in the sample. We remove these observations from our fitting sample, leaving $I = 242478$ data points and $K = 456$ players who featured in more than one game.

3. Model for estimating the passing ability of football players

As described above, we estimate the passing ability of players by using a model to predict the probability that a pass is successful. The model that we use comes from the generalized additive mixed model framework of Lin and Zhang (1999). A generalized additive mixed model is an extension of a generalized linear model in which the linear predictor is allowed to involve smooth functions of covariates as well as random effects.

Let the outcome of the ith pass be $o_i$, where $o_i = 1$ for a successful pass and $o_i = 0$ otherwise. We assume that pass outcomes follow a Bernoulli distribution with the probability of success represented by the inverse logit function of the linear predictor $\eta_i$:

$$(o_i \mid \eta_i) \sim \mathrm{Bernoulli}(p_i) \tag{1}$$

where

$$p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}. \tag{2}$$
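Equations (1) and (2) can be made concrete with a short sketch. The function names are illustrative; the two-branch form of the inverse logit is only a numerical safeguard and is algebraically identical to equation (2).

```python
import math
import random


def inv_logit(eta: float) -> float:
    """Equation (2): p = exp(eta) / (1 + exp(eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def simulate_pass(eta: float, rng: random.Random) -> int:
    """Equation (1): draw a pass outcome o ~ Bernoulli(p) with p = inv_logit(eta)."""
    return 1 if rng.random() < inv_logit(eta) else 0


# A linear predictor of 0 corresponds to a 50% chance of completion.
p_even = inv_logit(0.0)
```

Because the inverse logit is monotone, anything that raises the linear predictor (a more skilled passer, an easier pass) raises the completion probability.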

We let $\eta_i$ be a function of fixed effects $\beta$ and random effects $b$. The fixed effects correspond to a matrix of all the indicator variables listed in Table 1 and the intercept

$$W = (1, a^{(n)}, a^{(n-1)}, d_a^{(n-1)}, d_t^{(n-1)}, d_s^{(n-1)}, h^{(n)})$$

so $W_i$ is a row of this matrix corresponding to the ith pass. The superscript $n - L$ indicates a value lagged by $L$, i.e. corresponding to the event $L$ before the current pass, e.g. $d_a^{(n-1)} = 1$ for all the passes preceded by an aerial duel and $d_a^{(n-1)} = 0$ for the rest.

The random effects are given by a vector $b$, with the first $K = 456$ elements representing the passing ability of players and the remaining $2 \times T = 2 \times 20$ elements corresponding to the ability of the passing player’s team and the ability of the opposition facilitating and hampering pass execution respectively, so that

$$b_{(K+2T) \times 1} = \left( (b^{(p)})^T \; (b^{(t)})^T \; (b^{(o)})^T \right)^T .$$

Within the generalized additive mixed model framework $\eta_i$ is given by

$$
\begin{aligned}
\eta_i = {} & W_i \beta + Z_i b + s_1(t_i^{(n)}) + s_2(e_i^{(n)}) + s_3(\delta t_i^{(n)}) + s_4(\delta t_i^{(n-1)}) + s_5(\bar{x}^{(n)}_{[k,j](i)}, \bar{y}^{(n)}_{[k,j](i)}) \\
& + s_6(x_i^{(n)}, x_{\mathrm{end},i}^{(n)}, |y_i^{(n)} - 0.5|, |y_{\mathrm{end},i}^{(n)} - 0.5|) + s_7(x_i^{(n)}, x_{\mathrm{end},i}^{(n)}, (y_i^{(n)} - 0.5)(y_{\mathrm{end},i}^{(n)} - 0.5)),
\end{aligned} \tag{3}
$$

where the indices $k$ and $j$ correspond to the jth game of the kth player and $\bar{x}^{(n)}_{[k,j](i)}$, $\bar{y}^{(n)}_{[k,j](i)}$ are average co-ordinates of previous game events of the player executing the ith pass (see Fig. 1). $Z_i$ is a row of a design matrix selecting the elements of the random-effects vector $b$ corresponding to the player executing the ith pass, the team he plays for and the opposition. The first two blocks of columns of $Z$ (players and passing teams) consist of 0s and 1s whereas the third block (opposition teams) consists of 0s and −1s so that good

(a) individual skill at passing,
(b) team ability to facilitate passing and
(c) team ability to hamper passing

are all reflected in relatively high values of the corresponding random effects. $s_1, \ldots, s_7$ are smooth functions that we discuss below. We note that we truncate $e$, the pass number in the sequence of passes, so that the covariate that we use in the model is actually $\min(e, 15)$. This is because the shape of the fitted smooth function corresponding to this covariate suggests that it is fitting noise for values above 15.
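The role of $Z_i$ and of the truncation of $e$ can be sketched as below. The index layout (players first, then passing teams, then opposition teams) follows the ordering of $b$ given in the text; the function names and the toy dimensions are illustrative assumptions.

```python
from typing import List


def z_row(player_idx: int, team_idx: int, opp_idx: int, K: int, T: int) -> List[float]:
    """One row of Z for a pass: +1 for passer and his team, -1 for the opposition."""
    row = [0.0] * (K + 2 * T)
    row[player_idx] = 1.0          # player block: columns 0 .. K-1
    row[K + team_idx] = 1.0        # passing-team block: columns K .. K+T-1
    row[K + T + opp_idx] = -1.0    # opposition block: columns K+T .. K+2T-1
    return row


def random_effect_term(row: List[float], b: List[float]) -> float:
    """Z_i b: passer skill + team facilitation - opposition hampering."""
    return sum(z * effect for z, effect in zip(row, b))


def truncate_pass_number(e: int, cap: int = 15) -> int:
    """Covariate actually used for the pass number in the sequence: min(e, 15)."""
    return min(e, cap)


# Toy example with K = 3 players and T = 2 teams (the paper has K = 456, T = 20).
b = [0.3, -0.1, 0.0, 0.2, -0.2, 0.1, 0.4]
row = z_row(player_idx=1, team_idx=0, opp_idx=1, K=3, T=2)
term = random_effect_term(row, b)  # -0.1 + 0.2 - 0.4
```

The −1 entry means that a larger opposition effect lowers the linear predictor, so a high value of $b^{(o)}$ reads as a good ability to hamper passing.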

Finally, for the random effects we assume that

$$b \sim N\{0, \Sigma(\sigma)\}, \tag{4}$$

where $\Sigma(\sigma) = \Sigma(\sigma_p, \sigma_t, \sigma_o)$ is a $(K + 2T)$-dimensional diagonal skill covariance matrix with the first $K$ elements on the diagonal equal to the player skill variance $\sigma_p^2$, the next $T$ elements equal to the player’s team skill variance $\sigma_t^2$ and the final $T$ elements equal to the opposite team ability variance $\sigma_o^2$. This reflects our belief that extremely good (and bad) players and teams are less common than average ones.

In this application, the values of the random effects for the players are the key parameters of interest since they can be interpreted as the passing ability of the players. The other random effects can be used as estimates of the abilities of each team to facilitate and hamper passing.


3.1. Smooth functions

The $s_f$, $f = 1, \ldots, 7$, terms in equation (3) are smooth functions. Such functions offer a large amount of flexibility in specifying the relationship between covariates and a response variable. In the model fitting procedure each of them is represented as a sum of some basis functions (of the covariates) weighted by regression coefficients that need to be estimated.

The risk that is attached to the flexibility of this approach is that, given a sufficiently large number of basis functions, the smooth functions can overfit the observed data with a shape that is unlikely to represent the underlying data-generating process. There is a trade-off between the smoothness of a function and the extent to which it fits the observed data. The optimal smoothness, measured by a strictly defined formula (e.g. the integral of the squared second derivative of the function), is determined in the model fitting procedure.
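The trade-off between fit and smoothness can be illustrated with a deliberately simplified stand-in. The sketch below is not the spline machinery used in the paper: it swaps in a discrete penalized least squares smoother in which squared second differences play the role of the integrated squared second derivative, and a hypothetical parameter `lam` controls the penalty. It minimizes $\|y - f\|^2 + \lambda \sum_i (f_i - 2 f_{i+1} + f_{i+2})^2$.

```python
from typing import List


def second_diff_penalty(n: int) -> List[List[float]]:
    """Return P = D'D, where D takes second differences of an n-vector."""
    P = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        stencil = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, di in stencil.items():
            for j, dj in stencil.items():
                P[i][j] += di * dj
    return P


def solve(A: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [v] for row, v in zip(A, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def smooth(y: List[float], lam: float) -> List[float]:
    """Penalized fit: solve (I + lam * P) f = y."""
    n = len(y)
    P = second_diff_penalty(n)
    A = [[(1.0 if i == j else 0.0) + lam * P[i][j] for j in range(n)] for i in range(n)]
    return solve(A, y)


def roughness(f: List[float]) -> float:
    """Sum of squared second differences, the discrete wiggliness measure."""
    return sum((f[i] - 2 * f[i + 1] + f[i + 2]) ** 2 for i in range(len(f) - 2))


noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
light = smooth(noisy, 0.1)   # follows the data closely
heavy = smooth(noisy, 10.0)  # much smoother, fits the data less closely
```

A larger `lam` yields a smoother curve at the cost of fidelity to the observations, which is the trade-off described above; in the paper's setting the degree of smoothing is determined as part of the model fitting rather than chosen by hand.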

There are many ways to choose the basis, which is a set of basis functions that define the space that is supposed to contain an approximation of the target function. Here, for the smooth functions of a single covariate, $s_f$, $f = 1, \ldots, 4$ (i.e. game time $t$, time since previous pass, $\delta t$, and its lag, and pass number in the current sequence of passes for the team, $e$) we use thin plate regression splines, which approximate thin plate splines. The latter are a theoretically appealing solution to a general smoothing problem (see Wood (2006), pages 154–156) but impractical from the computational point of view, and hence the approximation.

For the smooth functions $s_5$, $s_6$ and $s_7$ we use tensor product smooths. This is because thin plate regression splines are isotropic in that they treat the smoothness of the fitted spline equally in all dimensions. In our application there is no reason to believe that such isotropy exists. For example, the smoothness of the functions $s_5$, $s_6$ and $s_7$ along the pitch is almost certainly different from the smoothness of the functions across it, even if we scale both dimensions to the same real scale (e.g. metres). In contrast, tensor product smooths are not necessarily isotropic and so are employed here.

We describe the location of the pass (the origin and destination) component with two functions: $s_6$ and $s_7$. To some extent we want to impose symmetry on pass completion with respect to the left and the right (along the y-axis) side of the pitch. For example, holding all other factors constant, we expect that passes that are 10 m left of the axis going through the centre of both goals have the same chance of success as passes 10 m right of it, and the same for passes to this point. This belief is reflected in the use of the absolute values in the $s_6$-function. However, we want to distinguish a pass from a point 10 m right of the axis played 1 m to the right from one played 21 m to the left (for the same $x$). The $s_6$-function does not allow this distinction to be made (the values of $|y_i^{(n)} - 0.5|$ and $|y_{\mathrm{end},i}^{(n)} - 0.5|$ are the same for both these passes). For this reason we introduce the $(y_i^{(n)} - 0.5)(y_{\mathrm{end},i}^{(n)} - 0.5)$ term which is positive for passes played to the same side of the pitch and negative for those crossing the axis of the pitch. We use this variable in the $s_7$ smoothing function together with $x$- and $x_{\mathrm{end}}$-covariates to allow its effect to differ with the distance of the origin and destination of the pass from both goal lines.
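The example with the two passes can be checked numerically. The sketch below assumes a pitch width of 68 m purely to convert metres into the $\langle 0, 1 \rangle$ width co-ordinate (the paper does not state the width), with $y = 0$ at the right sideline as defined earlier; the function name is illustrative.

```python
from typing import Tuple

PITCH_WIDTH_M = 68.0  # assumed width, used only to convert metres to [0, 1]


def lateral_covariates(y: float, y_end: float) -> Tuple[float, float, float]:
    """Return (|y - 0.5|, |y_end - 0.5|, (y - 0.5)(y_end - 0.5))."""
    return abs(y - 0.5), abs(y_end - 0.5), (y - 0.5) * (y_end - 0.5)


# Origin 10 m right of the central axis (y = 0 is the right sideline).
origin = 0.5 - 10.0 / PITCH_WIDTH_M
same_side = 0.5 - 11.0 / PITCH_WIDTH_M   # played 1 m further to the right
cross_axis = 0.5 + 11.0 / PITCH_WIDTH_M  # played 21 m to the left

short_pass = lateral_covariates(origin, same_side)
long_pass = lateral_covariates(origin, cross_axis)
```

Both passes share the same pair of absolute values, so $s_6$ cannot tell them apart, while the product term is positive for the short pass and negative for the pass that crosses the axis.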

The parameters of our generalized additive mixed model in equations (1)–(4) are estimated by using the penalized quasi-likelihood method (Schall, 1991; Breslow and Clayton, 1993) in R (R Core Team, 2012) using the mgcv package (Wood, 2006).

3.2. Making predictions and estimating the passing ability of players

The random effects for the players in equation (4) can be interpreted as the player passing abilities. From the model that is described in equations (1)–(4), we can calculate several pass completion rate predictions which will be of interest in the analysis.


(a) Full predictions, $p^{(f)}_i$, can be obtained by substituting the fixed parameters in equation (3) with their estimates $\hat{\beta}$, the random effects with their predictions $(\hat{b}^{(p)}, \hat{b}^{(t)}, \hat{b}^{(o)})$ and using the fitted smooth functions of the remaining covariates. This is the most complete type of our predictions in the sense that it accounts for both the skill of the passing player and the difficulty of the predicted passes. It predicts the pass completion rate for the passes actually attempted. By comparing the actual pass completion rate with the average $\bar{p}^{(f)}_{k,s}$ of this value for the kth player’s passes in the second season ($s \equiv$ 2007–2008) we can tell how well he performed relative to the model expectations. However, this metric is not a good measure of a player’s passing skill as it also contains information about difficulty of the pass. It is useful though as an indicator of model fit. The next type of prediction is designed to filter the pass difficulty out so that a fair comparison can be made of players’ passing abilities.

(b) Prediction for an ‘average’ difficulty pass, in the season 2006–2007 by the kth player, is obtained as $p^{(\mathrm{av})}_{k,2006/2007}$. It averages the ease of pass out of the full predictions so that a fair comparison of the player random-effect predictions can be made. It is calculated by using the following procedure.

    (i) For each ith pass we calculate the linear predictors $\eta_i$ in the same way as for the full predictions except that players’ random effects $b^{(p)}$ are set to 0.
    (ii) We calculate the average of the linear predictor for all the passes in the season 2006–2007.
    (iii) We add the above averaged linear predictors to the players’ random-effect predictions $\hat{b}^{(p)}$.
    (iv) Finally, we put the values on the probability scale by calculating the inverse logit function of the above adjusted linear predictors.

We use this prediction as a measure of passing ability. Of course, we can use just the player’s random-effect predictions $\hat{b}^{(p)}$ instead for this. However, we use this transformation to put it on the scale of pass completion rate for ease of interpretation.

(c) Fixture-specific prediction for an ‘average’ difficulty pass is obtained as $p^{(\mathrm{pto})}_{k,j}$, for player $k$ in fixture $j$ (the (pto) abbreviation stands for ‘player, team, opponent’). Compared with the full predictions, it ignores all pass difficulty information except for abilities to facilitate and hamper passing of the teams playing in a given fixture. It is calculated by using the following procedure.

    (i) First, we calculate the average linear predictor for passes in the season 2006–2007 in a similar way to that for the prediction for an average difficulty pass, except that for each pass we set all the random effects, for players, their teams and their opponents, to 0 (and all the other parameters to their estimates).
    (ii) For each player $k$ in each fixture $j$ in the season 2007–2008 we add the above averaged linear predictor to the predictions of random effects for players, their teams and their opponents. (For the teams newly promoted to the league in the season 2007–2008, which do not have their own random-effect predictions, we use averages of the respective random effects of the teams relegated from the league in the season 2006–2007.)
    (iii) We put the values on the probability scale by calculating the inverse logit function of the above adjusted linear predictors.

For each fixture j we calculate the average of these predictions for the home, ¯p.pto/h,j , and the

away team players, ¯p.pto/a,j . We also calculate corresponding averages of naive predictions,

¯oh,j and ¯oa,j, according to which in the jth fixture of the season 2007–2008 the kth playeris expected to complete passes at his average rate in the fitting sample (season 2006–2007).We use these two sets of averages as predictors of the score in the jth fixture to evaluate

10 Ł. Szczepanski and I. McHale

Table 2. Estimates of the parametric model terms (respective elements of vector β)

Covariate Name Estimate Standard z-valueerror

1 Intercept 1.28 0.03 41.91a.n/ Headed pass −1.22 0.02 −77.03a.n−1/ Previous pass was headed −0.21 0.02 −12.82d

.n−1/a Previous event was an aerial duel −0.51 0.05 −9.52

d.n−1/t Previous event was a tackle 0.22 0.04 5.03

d.n−1/s Previous event was a duel involving the pass 0.13 0.04 2.97

executorh.n/ Pass executor plays for the home team 0.09 0.01 8.11

(a) (b)

(c) (d)

Fig. 2. Time-related-component smooth functions on the scale of the linear predictor ( , 95% con-fidence intervals): (a) game time (in minutes); (b) pass number in a given possession; (c) time since theprevious pass (in seconds); (d) time between the previous pass and the one before

the utility of our model in comparison with the raw pass completion rate as a measure ofplayer skill.

(d) Average player prediction p.e/i predicts the probability that a given pass would be success-

fully completed if it was executed by an average player. It is calculated in the same way asthe full predictions except that players’ random effects b.p/ are set to 0. This value can bethought of as a proxy for the ease of pass. We also calculate the average ¯p.e/

k,s of this valuefor all the kth player’s passes in both seasons s.
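The prediction types (b), (c) and (d) listed in Section 3.2 all reduce to averaging or adjusting linear predictors and applying the inverse logit. The following is a minimal sketch of that arithmetic only, assuming the per-pass linear predictors and the random-effect predictions have already been extracted from a fitted model. All function names and numerical values are hypothetical, and the opponent term is passed in as its contribution to the linear predictor, whatever sign convention equation (4) gives it.

```python
import math
from statistics import mean


def inv_logit(eta):
    """Map a linear predictor to the probability scale."""
    return 1.0 / (1.0 + math.exp(-eta))


def ease_of_pass(eta_without_player):
    """Prediction (d): per-pass completion probability for an average player.

    eta_without_player holds the linear predictor of each pass computed
    with the player random effect set to 0.
    """
    return [inv_logit(eta) for eta in eta_without_player]


def average_difficulty_rating(eta_without_player, b_player):
    """Prediction (b): steps (ii)-(iv) of the procedure.

    The linear predictors (player effects at 0) are averaged over all
    fitting-season passes, each player's random-effect prediction is
    added, and the result is put on the probability scale.
    """
    eta_bar = mean(eta_without_player)
    return {k: inv_logit(eta_bar + b) for k, b in b_player.items()}


def fixture_prediction(eta_bar_no_random, b_player, b_team, b_opponent):
    """Prediction (c) for one player in one fixture.

    eta_bar_no_random is the fitting-season average linear predictor with
    all three random effects set to 0.
    """
    return inv_logit(eta_bar_no_random + b_player + b_team + b_opponent)


# Hypothetical inputs, not estimates from the paper.
etas = [1.2, 0.4, -0.3, 2.1]
effects = {"A": 0.3, "B": 0.0, "C": -0.2}
ratings = average_difficulty_rating(etas, effects)
ease = ease_of_pass(etas)
weak_team = fixture_prediction(0.85, 0.3, -0.1, 0.0)
strong_team = fixture_prediction(0.85, 0.3, 0.4, 0.0)
```

Because the same averaged linear predictor is added to every player in (b), the ranking of players by this rating is the same as the ranking by their random-effect predictions; only the scale changes.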


4. Results

4.1. Generalized linear mixed model estimation results

Table 2 presents estimates of the parametric model terms contained in the vector $\beta$.

As expected, headed passes ($a^{(n)} = 1$) are less accurate than passes played with a foot, since the executing player has less control over the ball when heading it. They also have a negative effect on the following pass ($a^{(n-1)} = 1$), perhaps because they force the receiver either to head the ball again or to take more time to control it and to bring it down to his foot. If a pass is a direct result of winning an aerial duel ($d_a^{(n-1)} = 1$), the chance of its completion drops further, but this effect is partly compensated for if the same player wins the duel and makes the pass ($d_s^{(n-1)} = 1$). Passes that are made immediately after regaining the ball from the opposition with a tackle ($d_t^{(n-1)} = 1$) are generally more likely to be completed, perhaps because the opposition needs some time to reorganize themselves (for instance, the tackled player may be on the ground when the pass is made).
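To give a feel for the size of the effects in Table 2, the estimates can be pushed through the inverse logit. The sketch below is purely illustrative: it assumes that the smooth terms and all random effects contribute 0 to the linear predictor, which is not how the full model predictions are formed, and the dictionary keys are hypothetical labels for the covariates.

```python
import math


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


# Estimates copied from Table 2 (scale of the linear predictor).
BETA = {
    "intercept": 1.28,
    "headed_pass": -1.22,
    "previous_pass_headed": -0.21,
    "previous_aerial_duel": -0.51,
    "previous_tackle": 0.22,
    "previous_duel_same_player": 0.13,
    "home_team": 0.09,
}


def parametric_probability(**indicators):
    """Completion probability from the parametric terms only.

    Assumption: smooth terms and random effects are all set to 0.
    """
    eta = BETA["intercept"]
    for name, value in indicators.items():
        eta += BETA[name] * value
    return inv_logit(eta)


foot = parametric_probability()
header = parametric_probability(headed_pass=1)
header_after_duel = parametric_probability(headed_pass=1, previous_aerial_duel=1)
after_tackle = parametric_probability(previous_tackle=1)
```

Under these simplifying assumptions a headed pass moves the linear predictor from 1.28 to 0.06, i.e. from a probability of roughly 0.78 to roughly 0.51.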

Fig. 2 presents the estimated smooth functions of time-related covariates on the scale of the linear predictor. Passes made under time pressure (Fig. 2(c)) have a relatively low probability of success as do those made before teams establish possession having exchanged a few passes (Fig. 2(b)). Interestingly, it is generally easier to pass later in the game (Fig. 2(a)) perhaps as teams become tired and cannot apply pressure on the passer as effectively as they might have done in the early stages of the match; however, the effect is quite small.

The success of a pass is also related to the executing player’s average position in previous games as evidenced in Fig. 3. Controlling for everything else, defenders (players with low $x^{(n)}$) seem to have it easier than all the other players, followed by wingers and central midfielders. Central forwards are usually faced with the toughest task. To appreciate why this may be true, consider a pass from near one’s own goal. If it is a central forward making the pass, it is likely that the team are under extreme pressure from the opposition and all the players are pushed back near their own goal with very few options for where to pass the ball. Alternatively, if it is a defender making the pass, it is more likely that that player (and his team) will be under less pressure.

Fig. 3. Smooth function $s_6(x^{(n)}, \bar{y}^{(n)})$ of the executing player’s average position (the position of all events involving him) in previous games on the scale of the linear predictor: $x^{(n)}$ corresponds to the length of the pitch and $\bar{y}^{(n)}$ to its width; $\bar{y}^{(n)}$ averages the distance from the axis going through the centre of both goals so as not to allow values of $y$ to cancel out for players who switch sides of the pitch

Presenting the influence of the origin and destination of a pass on its probability of success is a bigger challenge since in our model the linear predictor for the latter depends on the pitch co-ordinates through two multi-dimensional functions. Fig. 4 is an attempt to address this challenge. The idea is to fix the location of the pass origin at a certain point (the thick dots in the figure), the indicator variables at the most commonly occurring values and the continuous variables at the closest observed value to the median. Fig. 4 shows the contours of the linear predictor against the location of the destination of the pass.

Firstly, note the designed symmetry with respect to the axis going through the centre of both goals. Apart from this, passes played towards the opponent’s goal (along the horizontal axis of the thick dots) tend to have a smaller chance of success than those played sideways or, in particular, backwards. Furthermore, passing to either of the wings is more likely to succeed than straight ahead. This is because the defending team tend to concentrate their efforts on not allowing the team on the ball to move into convenient shooting positions straight ahead of goal. Finally, the probability of success tends to dip just ahead of the passing player (assuming that he is facing the opponent’s goal). This may be because there is usually an opponent in front of the passing player obstructing the most direct route to the goal (the dotted line).

Predictions of the team random effects are presented in Fig. 5. The vertical axis corresponds to the terms $b^{(t)}$ representing each team’s ability to facilitate passing (e.g. by clever ‘off-the-ball’ movement). The higher it is the better the team is. The horizontal axis contains the terms $b^{(o)}$ capturing each team’s ability to prevent passing of their opponents (e.g. by aggressive pressing and close marking). Again the higher the number the better it is. Distance from the diagonal broken line can be viewed as a summary of the team’s ability in these two aspects. There are four clear outliers in the plot: Arsenal, Chelsea, Manchester United and Liverpool, who dominated the league particularly in terms of the ability to facilitate passing. Arsenal are an extreme example here as they were the best at facilitating passing but only average at preventing it. Liverpool, in contrast, were almost equally good in both areas.

Football aficionados will probably find these results on the effect of the covariates on the pass success probability intuitive and reasonable. We believe that this is an indication that the model is working well.

4.2. Ease of pass

We approximate ease of each pass in the fitting sample with the probability of the pass being completed had it been played by the average player, $p_i^{(e)}$. Fig. 6(a) shows the resulting histogram. The further to the right the easier the pass is (the more likely it is to be completed). Interestingly, the distribution is highly skewed towards the easy passes: half of the passes have an expected completion probability of more than about 76% and a quarter of the passes are 90% or more likely to be successfully executed. In contrast, only about a quarter of the passes are less likely to be completed than not.
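The summary figures quoted for the ease-of-pass distribution (median, upper quartile, share of passes below 50%) are simple quantile calculations over the per-pass values $p_i^{(e)}$. A minimal sketch follows, using a small hypothetical sample rather than the Opta data; the function name `ease_summary` is an assumption made for illustration.

```python
from statistics import quantiles


def ease_summary(p_e):
    """Quartile cut-offs of the ease-of-pass values and the share below 0.5."""
    q1, median, q3 = quantiles(p_e, n=4)
    share_hard = sum(p < 0.5 for p in p_e) / len(p_e)
    return {"q1": q1, "median": median, "q3": q3, "share_below_half": share_hard}


# Hypothetical ease-of-pass values, not data from the paper.
sample = [0.35, 0.42, 0.48, 0.55, 0.70, 0.76, 0.80, 0.88, 0.92, 0.95, 0.97, 0.99]
summary = ease_summary(sample)
```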

We can break down the pass difficulty information by the nominal position of the players. This is done in Fig. 6(b). A relatively high proportion of the easy passes (the furthest to the right) are attempted by players playing in the central midfield. The more difficult the passes, the lower the proportion of them executed by this group of players and, conversely, the more are attempted by the offensive players in the central (central attacker) and the wide positions (left and right midfielder). The proportion of the passes that were made by the defenders (central defenders and left and right defenders) is virtually constant with respect to the ease of pass.

Fig. 4. Value of the linear predictor with respect to the location of the origin of the pass and its destination: the contours are values of the linear predictor for the destination of the pass as on the horizontal ($x^{(n)}_{\mathrm{end}}$) and the vertical ($y^{(n)}_{\mathrm{end}}$) axes and the pass origin variables ($x^{(n)}$ and $y^{(n)}$) fixed at the values indicated by the thick dot, which are selected from a $(0.25, 0.50, 0.75) \times (0.25, 0.50, 0.75)$ grid (dotted line, direct route to the goal)

We argue that one of the reasons why the raw pass completion rate is a poor measure of players’ passing ability is that it is polluted by the difficulty of the attempted passes. In other words, this simple metric can fluctuate purely because of changes in the type of attempted passes rather than the inherent level of skill of the executing player. If that is so, then we may expect the completion rate to increase from one season to another for players who attempted easier passes in the second season and vice versa. This is what is analysed in Fig. 7. It compares the average 2007–2008 completion rate with that of 2006–2007 (Fig. 7(a)) and the average of the full model predictions, $\bar{p}^{(f)}_{k,2007/2008}$ (Fig. 7(b)). Focusing on Fig. 7(a) first, there is some correlation between the empirical values from one season to another. However, it is also clear that many of the deviations could be explained by the ease of passes attempted as players whose performance increased (above the broken identity line) tended to be faced with an easier task in the season 2007–2008 than in the previous season, $\bar{p}^{(e)}_{k,2007/2008} - \bar{p}^{(e)}_{k,2006/2007} > 0$. Conversely, the completion rate of the players who attempted more difficult passes in the second season tended to drop. Since the model can control for the pass difficulty, the relationship between its predictions and the 2007–2008 empirical values is much stronger (Fig. 7(b)) with the Pearson correlation coefficient 0.92 for the model and 0.72 for the naive predictions. In addition to illustrating the effect of pass difficulty, this analysis can also serve as some validation of the model.
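The comparison behind Fig. 7 amounts to correlating the observed 2007–2008 completion rates with two alternative predictors. A minimal sketch of that comparison is given below, with a hand-written Pearson coefficient so that it needs only the standard library. The five per-player values are hypothetical and are not taken from the paper, so the resulting coefficients are not the published 0.92 and 0.72.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-player season averages (not values from the paper).
observed_2007 = [0.80, 0.72, 0.66, 0.85, 0.58]  # observed completion rate, 2007-2008
naive_2006 = [0.74, 0.75, 0.70, 0.84, 0.62]     # previous season's completion rate
model_2007 = [0.79, 0.73, 0.67, 0.84, 0.59]     # average full model prediction

r_naive = pearson(naive_2006, observed_2007)
r_model = pearson(model_2007, observed_2007)
```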

Table 3 lists the top five players for each position according to the model together with their predictions and empirical values. The list is limited to players who made at least 100 passes in the season 2007–2008 to allow a reliable comparison between model predictions and the observed values in the validation sample. Table 3 reveals some specific examples of how the model incorporates, and accounts for, pass difficulty in making predictions.
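The selection rule for Table 3 (rank by model rating within each position, keeping only players with at least 100 passes in 2007–2008) can be sketched as follows. The three central attackers use the ratings and 2007–2008 pass counts shown in Table 3; the fourth record, `"Player X"`, is hypothetical and is included only to show the filter at work. The record layout and function name are assumptions.

```python
from collections import defaultdict


def top_by_position(players, k=5, min_passes=100):
    """Top-k players per position by rating, among those with enough passes."""
    by_position = defaultdict(list)
    for player in players:
        if player["n_2007"] >= min_passes:
            by_position[player["position"]].append(player)
    return {
        position: sorted(group, key=lambda p: p["rating"], reverse=True)[:k]
        for position, group in by_position.items()
    }


players = [
    {"name": "John Carew", "position": "CA", "rating": 0.79, "n_2007": 723},
    {"name": "Darren Bent", "position": "CA", "rating": 0.78, "n_2007": 160},
    {"name": "Carlos Tevez", "position": "CA", "rating": 0.78, "n_2007": 950},
    # Hypothetical record: a high rating but too few passes to be listed.
    {"name": "Player X", "position": "CA", "rating": 0.85, "n_2007": 40},
]
top = top_by_position(players)
```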


Fig. 5. Team random-effects prediction

For example, Carlos Tevez’s pass completion rate $o$ jumped by a few percentage points from the first season 2006–2007 to the next (from 0.74 to 0.80). However, the model anticipates it very well ($\bar{p}^{(f)}_{2007/2008} = 0.80$) as a big proportion of the improvement can be explained by the fact that the ease of the passes attempted was much higher in the second season ($\bar{p}^{(e)}_{2007/2008} = 0.78$ compared with $\bar{p}^{(e)}_{2006/2007} = 0.71$). In the case of Tevez this has much to do with the fact that he played with players of better quality in 2007–2008 after he was transferred from West Ham United, a team that was threatened with relegation in 2006–2007, to Manchester United, the Premier League champions in 2007–2008.
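The figures quoted for Tevez can be set side by side as a short worked calculation, using only the rounded values reported in Table 3. The comparison is indicative rather than an exact decomposition, since ease of pass and completion rate are both on the probability scale and do not add exactly.

$$
\begin{aligned}
o_{2007/2008} - o_{2006/2007} &= 0.80 - 0.74 = 0.06,\\
\bar{p}^{(e)}_{2007/2008} - \bar{p}^{(e)}_{2006/2007} &= 0.78 - 0.71 = 0.07,\\
o_{2007/2008} - \bar{p}^{(f)}_{2007/2008} &= 0.80 - 0.80 = 0.00.
\end{aligned}
$$

The change in ease of pass is of the same size as the change in the observed completion rate, and the completion rate bias in the second season is zero to two decimal places.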

4.3. Evaluating players

Fig. 8 plots model-derived players’ passing abilities, $p^{(av)}$, against their observed pass completion rates $o$ in the season 2006–2007. The broken line is the identity function. Specific player examples can be examined in Table 3.

Naturally, there is positive correlation between the empirical completion rate in the fitting sample, $o$, and the model-based passing ability $p^{(av)}$ as players who complete passes at a higher rate are generally considered to be better at this skill by the model. There are, however, considerable departures from this naive rule.

First, the circumstances from which the players attempted passes differ. Some of them passed in easier situations and/or chose easier options which boosted their observed completion rate above what could be expected simply on the basis of their passing ability. Conversely, some were faced with an unusually difficult task which made their empirical completion rate look worse than they deserve when a fair comparison is made. This is reflected in the positive correlation between the average pass ease (the brightness of the points) and the observed success rate (the horizontal axis). To illustrate how the model takes pass difficulty into account when rating players’ skill consider the pair of central forwards John Carew and Yakubu Aiyegbeni. The former had a lower empirical pass completion rate in the fitting sample; however, his skill is rated higher by the model as the passes that he attempted were generally more difficult (a darker point). Similarly, in Table 3 Sami Hyypia’s passing skill ($p^{(av)}$) is rated slightly above Ricardo Carvalho’s although his observed completion rate $o$ was much lower because the passes that he attempted were on average considerably more difficult (lower $\bar{p}^{(e)}$).

Fig. 6. Ease of pass: (a) all positions, with the cut-offs of consecutive quartiles marked; (b) relative frequency of the ease of passes made by players from given nominal positions (left or right midfielder, left or right defender, central midfielder, central defender and central attacker)

Fig. 7. Average observed 2007–2008 pass completion rate $o_{k,2007/2008}$ against (a) the naive, $o_{k,2006/2007}$, and (b) the model, $\bar{p}^{(f)}_{k,2007/2008}$, predictions: the quantity $\bar{p}^{(e)}_{k,2007/2008} - \bar{p}^{(e)}_{k,2006/2007}$ is the change in the value of the proxy for ease of pass from season 2006–2007 to 2007–2008 for the $k$th player (plotting symbols distinguish positive and negative changes at magnitudes 0.05, 0.10 and 0.15; broken line, identity function)

Secondly, some players’ success rates are based on few observations, making their numbers less reliable. The model recognizes this fact by regressing the individual performance to the overall mean, represented by the full horizontal line, the effect being stronger for fewer passes.

As an extreme example, consider Matthew Upson who had a 100% completion rate but achieved it on just six passes. The model recognizes that very little information is contained in such small samples. In contrast, Paul Scholes completed many more passes (a bigger point) in the fitting sample and, as a result, is rated much higher although his empirical completion rate is lower. Similarly, in Table 3 Chris Riggott’s passing skill $p^{(av)}$ is rated about the same as Ricardo Carvalho’s although the difference between the empirical completion rate $o$ and the difficulty of passes $\bar{p}^{(e)}$ is much bigger for the former. This is because Carvalho proved his unusually high completion rate on many more passes.

Table 3. Top five passers by position based on 2006–2007 season performance

| Position† | Forename | Surname | 2006–2007: Team | 2006–2007: n | 2006–2007: Average observed completion rate, $o$ | 2006–2007: Average ease of pass, $\bar{p}^{(e)}$ | 2006–2007: Passing rating, $p^{(av)}$ | 2007–2008: Team | 2007–2008: n | 2007–2008: Average of the full predictions, $\bar{p}^{(f)}$ | 2007–2008: Average observed completion rate, $o$ | 2007–2008: Average ease of pass, $\bar{p}^{(e)}$ | 2007–2008: Completion rate bias, $o - \bar{p}^{(f)}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CD | John | Terry | Chelsea | 1058 | 0.88 | 0.81 | 0.83 | Chelsea | 613 | 0.84 | 0.84 | 0.79 | 0.00 |
| CD | William | Gallas | Arsenal | 730 | 0.86 | 0.81 | 0.81 | Arsenal | 884 | 0.89 | 0.92 | 0.87 | 0.03 |
| CD | Sami | Hyypia | Liverpool | 1088 | 0.75 | 0.72 | 0.79 | Liverpool | 648 | 0.78 | 0.79 | 0.76 | 0.01 |
| CD | Ricardo | Carvalho | Chelsea | 1193 | 0.83 | 0.80 | 0.79 | Chelsea | 621 | 0.82 | 0.86 | 0.81 | 0.03 |
| CD | Chris | Riggott | Middlesbrough | 142 | 0.72 | 0.65 | 0.79 | Middlesbrough | 220 | 0.67 | 0.72 | 0.65 | 0.05 |
| LRD | Pascal | Chimbonda | Wigan and Tottenham | 1351 | 0.75 | 0.72 | 0.80 | Tottenham Hotspur | 1119 | 0.77 | 0.79 | 0.74 | 0.02 |
| LRD | Fabio | Aurelio | Liverpool | 435 | 0.71 | 0.66 | 0.80 | Liverpool | 592 | 0.72 | 0.70 | 0.69 | −0.03 |
| LRD | Andrew | Taylor | Middlesbrough | 1174 | 0.68 | 0.65 | 0.79 | Middlesbrough | 453 | 0.67 | 0.68 | 0.64 | 0.01 |
| LRD | Steve | Finnan | Liverpool | 1311 | 0.75 | 0.73 | 0.79 | Liverpool | 631 | 0.76 | 0.74 | 0.74 | −0.02 |
| LRD | Stephen | Carr | Newcastle United | 858 | 0.73 | 0.70 | 0.79 | Newcastle United | 297 | 0.72 | 0.70 | 0.70 | −0.02 |
| CM | Paul | Scholes | Manchester United | 1898 | 0.90 | 0.84 | 0.85 | Manchester United | 1341 | 0.89 | 0.89 | 0.84 | −0.00 |
| CM | Stilian | Petrov | Aston Villa | 996 | 0.79 | 0.74 | 0.80 | Aston Villa | 562 | 0.78 | 0.79 | 0.74 | 0.01 |
| CM | Michael | Essien | Chelsea | 1637 | 0.84 | 0.81 | 0.80 | Chelsea | 1220 | 0.82 | 0.80 | 0.79 | −0.01 |
| CM | Michael | Carrick | Manchester United | 1762 | 0.82 | 0.79 | 0.80 | Manchester United | 1280 | 0.82 | 0.82 | 0.79 | 0.01 |
| CM | Didier | Zokora | Tottenham Hotspur | 1105 | 0.82 | 0.79 | 0.80 | Tottenham Hotspur | 1032 | 0.83 | 0.84 | 0.81 | 0.01 |
| LRM | Mikel | Arteta | Everton | 1292 | 0.73 | 0.68 | 0.81 | Everton | 877 | 0.69 | 0.72 | 0.65 | 0.03 |
| LRM | Alexander | Hleb | Arsenal | 1455 | 0.80 | 0.78 | 0.79 | Arsenal | 1349 | 0.80 | 0.82 | 0.78 | 0.02 |
| LRM | Gareth | Barry | Aston Villa | 1451 | 0.68 | 0.65 | 0.79 | Aston Villa | 1203 | 0.71 | 0.72 | 0.69 | 0.01 |
| LRM | Kevin | Kilbane | Everton and Wigan | 831 | 0.62 | 0.59 | 0.79 | Wigan Athletic | 725 | 0.64 | 0.57 | 0.62 | −0.07 |
| LRM | Cristiano | Ronaldo | Manchester United | 1200 | 0.78 | 0.76 | 0.78 | Manchester United | 1017 | 0.76 | 0.76 | 0.74 | 0.00 |
| CA | John | Carew | Aston Villa | 309 | 0.56 | 0.52 | 0.79 | Aston Villa | 723 | 0.56 | 0.55 | 0.54 | −0.01 |
| CA | Darren | Bent | Charlton Athletic | 679 | 0.65 | 0.63 | 0.78 | Tottenham Hotspur | 160 | 0.59 | 0.58 | 0.57 | −0.01 |
| CA | Carlos | Tevez | West Ham United | 518 | 0.74 | 0.71 | 0.78 | Manchester United | 950 | 0.80 | 0.80 | 0.78 | 0.00 |
| CA | Nicolas | Anelka | Bolton Wanderers | 813 | 0.66 | 0.64 | 0.78 | Bolton and Chelsea | 620 | 0.68 | 0.66 | 0.66 | −0.01 |
| CA | Nwankwo | Kanu | Portsmouth | 902 | 0.72 | 0.70 | 0.78 | Portsmouth | 436 | 0.76 | 0.75 | 0.75 | −0.01 |

†CD, central defender; LRD, left or right defender; CM, central midfielder; LRM, left or right midfielder; CA, central attacker.

Fig. 8. Players’ estimated passing ability $p^{(av)}_{k,2006/2007}$ against observed pass completion rate $\bar{o}_{k,2006/2007}$, a proxy for the ease of pass, $\bar{p}^{(e)}_{k,2006/2007}$ (point brightness; legend values 0.8, 0.7, 0.6, 0.5 and 0.4) and the number of passes in the fitting sample (point size; legend values 500, 1000, 1500 and 2000): the points corresponding to the labelled players are those to the bottom right of their names and are marked additionally with a vertical bar; broken line, identity function; full horizontal line, passing ability of an average player

In all, to be recognized for his empirical passing success in the model framework a player would have to pass at a higher rate than an average player in these circumstances and to provide enough evidence for this.
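The way small samples are pulled towards the overall mean can be illustrated with a much simpler stand-in than the mixed model used in the paper: a beta-binomial posterior mean. This is not the authors’ estimator, it ignores pass difficulty entirely, and the prior mean and prior strength below are hypothetical values chosen for illustration. The counts for the second example are derived approximately from the rounded Table 3 entries for Paul Scholes (1898 passes at a rate of about 0.90).

```python
def shrunk_rate(completed, attempted, prior_mean=0.75, prior_strength=50.0):
    """Posterior mean completion rate under a Beta prior.

    A stand-in for the shrinkage produced by the player random effect:
    the fewer passes attempted, the closer the result is to prior_mean.
    prior_mean and prior_strength are hypothetical, not fitted values.
    """
    a = prior_mean * prior_strength
    b = (1.0 - prior_mean) * prior_strength
    return (a + completed) / (a + b + attempted)


# 100% on six passes, as in the Matthew Upson example.
small_sample = shrunk_rate(6, 6)

# Approximate counts: about 0.90 on 1898 passes.
large_sample = shrunk_rate(1708, 1898)
```

With these assumed prior settings the six-pass player ends up rated below the player with the lower raw rate but far more passes, which is the qualitative behaviour described in the text.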

4.4. Comparing predictive utility

The ultimate test of a rating method is its predictive utility. Verifying it is complicated in our case because what we try to rate, i.e. the passing skill of footballers, is not observable. For instance, we could not just use the observed pass completion rate in the season 2007–2008 as a benchmark for predictions, since the very essence of our argument is that it is a poor indicator of passing skill.

One objective measure that exists is team success. If the talent pool of a team as evaluated by one index is a better predictor of the team’s future results than one based on another index, then the former should be preferred. In other words, football clubs should assess players on the basis of methods that are informative about the future team results. A ‘good’ player is one who helps his team to win. With this in mind, for every fixture of the 2007–2008 season we calculated two statistics that are supposed to capture the general level of the passing skill in both competing teams: one based on the raw pass completion rate in the season 2006–2007, $\bar{o}$, and one based on our model fitted on that season, $\bar{p}^{(pto)}$. The key feature of the $p^{(pto)}$-predictions (which were defined in more detail in item (c) of the list of prediction types in Section 3.2) is that they use the information on the player executing the pass and the teams involved in the game but no information on the difficulty of the passes beyond that. As a result, one such value is produced per player per fixture in the 2007–2008 season (using the estimates based on the season 2006–2007). The average of these values among the home team players in a given fixture constitutes a model-based passing index for that team, $\bar{p}^{(pto)}_h$, in that fixture. The corresponding index for the away team is $\bar{p}^{(pto)}_a$.

We check how well a difference in values of these indices for competing teams predicts the result of a fixture.

Firstly, for each fixture we plot the difference in the home and away team goals against the difference in the indices for both teams in Fig. 9. The Pearson correlation coefficient for the pass-completion-based index (Fig. 9(a)) with the home team goals supremacy in the season 2007–2008 is 0.309 with a 90% confidence interval of (0.22, 0.392), whereas its value for the model-based predictor (Fig. 9(b)) is 0.417 with a (0.335, 0.493) 90% confidence interval. Of course, we would not expect there to be a ‘perfect’ relationship between measures of passing quality and game result as there are other factors which determine match outcomes, like quality of shooting. However, the fact that the model-based measure of passing ability has a stronger linear relationship with match outcome is reassuring.

Fig. 9. Home team goals supremacy in fixtures of the 2007–2008 season against the difference in the average predictor of the pass completion rate for the home and the away team players in that fixture: (a) predictor based on raw pass completion rate ($\bar{o}_x$ is the average of the average previous season’s pass completion rates of players in team $x$ in a given fixture); (b) predictor based on the model pass completion rate predictions ($\bar{p}^{(pto)}_x$ is the average of the pass completion rate predictions (conditional only on the player and teams information) of players in team $x$ in a given fixture)
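One common way to attach a confidence interval to a Pearson correlation is the Fisher z-transformation, sketched below. The paper does not state how its intervals were obtained or how many fixtures entered the calculation, so both the method and the value `N_FIXTURES = 380` (the number of fixtures in a 20-team double round-robin season) are assumptions, and the intervals this produces need not match the published ones.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.90):
    """Confidence interval for a correlation via the Fisher z-transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - q * se), math.tanh(z + q * se)


N_FIXTURES = 380  # assumption: every fixture of the season is used

lo_naive, hi_naive = pearson_ci(0.309, N_FIXTURES)  # raw completion rate index
lo_model, hi_model = pearson_ci(0.417, N_FIXTURES)  # model-based index
```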

Secondly, we fit two ordered logit regression models of the game outcome (home win, draw or away win) with the difference in the average passing index for the home and away team as the only covariate: one model for the index based on the raw pass completion rate, $\bar{o}$, and one for the model-based index, $\bar{p}^{(pto)}$. The latter model offers a better fit with the log-likelihood of −291.28 compared with −303.25 for the model based on the pass completion rate (both models have the same number of parameters).
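The ordered logit comparison rests on the log-likelihood of a three-category proportional-odds model with a single covariate. The sketch below shows how that log-likelihood is evaluated for given parameter values; it does not fit the model, and the slope, cut-points and fixture data are hypothetical. With equal numbers of parameters the two fitted models can be compared directly on this quantity, and the reported values differ by 303.25 − 291.28 = 11.97 in favour of the model-based index.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def outcome_probs(diff, slope, cut_low, cut_high):
    """(P(away win), P(draw), P(home win)) under a proportional-odds model.

    diff is the home-minus-away difference in the passing index;
    cut_low < cut_high are the two cut-points on the latent scale.
    """
    p_away = inv_logit(cut_low - slope * diff)
    p_away_or_draw = inv_logit(cut_high - slope * diff)
    return p_away, p_away_or_draw - p_away, 1.0 - p_away_or_draw


def log_likelihood(fixtures, slope, cut_low, cut_high):
    """Sum of log-probabilities of the observed outcomes.

    fixtures is a list of (diff, outcome) pairs with outcome coded as
    0 = away win, 1 = draw, 2 = home win.
    """
    return sum(
        math.log(outcome_probs(diff, slope, cut_low, cut_high)[outcome])
        for diff, outcome in fixtures
    )


# Hypothetical fixtures and parameter values, for illustration only.
fixtures = [(0.05, 2), (-0.03, 0), (0.01, 1), (0.08, 2), (-0.06, 0)]
ll = log_likelihood(fixtures, slope=20.0, cut_low=-0.6, cut_high=0.5)
probs_even = outcome_probs(0.0, 20.0, -0.6, 0.5)
probs_home_better = outcome_probs(0.05, 20.0, -0.6, 0.5)
```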

We checked that these results are not sensitive to the minimum number of players with a passing skill rating (i.e. players who were also observed in the fitting sample) in both teams in a given fixture required to calculate the average passing skill indices.

5. Discussion

In this paper we present a method which can be used to evaluate passing skill of footballers controlling for the difficulty of their attempts. We combine proxies for various factors influencing the probability that a pass is successful in a statistical model and evaluate the inherent player skill in this context. The measure of player passing skill has a natural interpretation in this framework, as does the metric that is proposed for pass difficulty. Finally, we can comprehensively handle all the players in the observed sample with the same procedure without a need to discard players arbitrarily who have been observed too few times to be reliably evaluated. The reliability of empirical passing rates based on a small number of observations is naturally taken into account within the framework proposed.

A complication with our approach to estimating player passing abilities based on, among other things, the pass difficulty is that there is likely to be some endogeneity: skilled players are likely to be able to create easier passing opportunities for themselves and hence pass difficulty is possibly not exogenous to player skill. However, even with this potential flaw, the results seem good and players who are known to be good at passing are identified by our approach.

When comparing the utility of the proposed method against the raw pass completion rate for predicting fixture results, we used model predictions conditional on the estimates of the abilities of the players as well as the teams that are involved in each fixture. This is because the team ability is confounded with player’s ability in the pass completion rate statistic. Ignoring team abilities in model predictions would give the naive method an unfair advantage since most of the players play for the same team in the fitting and the prediction sample. It might be argued that the approach that we took, in turn, gives our method an advantage because some players do change teams between the two seasons. However, we regard the fact that our method can disentangle player abilities from team abilities and other factors and put them back together in a different configuration to be one of the strengths of our approach.

Note also that the team parameters in the predicted period may correspond to a different team from the team that they were estimated for (in the case of personnel changes). However, this would just add noise to the passing index based on the model predictions and could only act to their disadvantage; hence it is all the more reassuring that they perform relatively well.

One important point that needs to be made about player evaluations that are produced by this model is that we believe that they are most useful when compared among players performing similar types of passes, in similar circumstances. This is likely to be a consequence of a player’s playing position and that player’s passing ability not being independent, so that some of the ability of midfielders (for example) is ‘given’ to the coefficient estimate on position ‘midfielder’. Thus, breaking down the results by position is one way to compare players making similar types of pass and mitigates the confounding effect of passing ability and playing position being related. To suggest, for example, that a central defender would maintain his passing rating when transferred to a winger position without at least a period of transition would be naive.

Speaking of positions, we classified players to only a few categories and based only on the location of their actions on the ball in their previous games. As any football fan will know, this is a very simplistic approach as there are more possible positions and other factors that determine which of them a player belongs to. Classifying players to positions on the basis of the actions that they perform could itself be an interesting research problem. In this paper a simple classification algorithm is used just to highlight some potentially interesting aspects of our results (in Fig. 6 and Table 3) but is not a component of our model. Therefore we settled for the simplistic approach as far as this classification is concerned.

Another caveat for the results that are presented here is that, whereas we do take the general team ability to facilitate successful pass completion into account, the individual skill of the pass receiver is not factored in. Therefore it may still be possible that the latter may be confounded in a rating of a player who tends to play an unusually high proportion of passes towards certain teammates. For example, John Terry’s rating may be inflated if he frequently played long passes which are normally difficult to complete but perhaps less so if Didier Drogba is the destination player (Drogba was renowned for his ability to receive a pass from a long distance). Including pass receiver in the equation could be a potential model extension. However, Opta does not currently collect information on the intended pass receiver for unsuccessful passes.

Another piece of information that could help us to improve the model, but is not available in our data set, is weather data. Conditions such as strong winds, rain and snow can all have an effect on a player’s passing performance and controlling for them could possibly refine our estimates of his passing skill.

Further work in this area could also involve evaluating passes on the basis of their value for the team rather than the difficulty. It may be that some players can add value with their passing above what could be expected by the difficulty of their passes, whereas others tend to attempt unnecessarily difficult passes, which is not recognized in the framework that is proposed here. Further, another way that model validity could be determined would be to obtain expert judgements on player passing ability and to compare these expert assessments with the model predictions and the pass completion rates. Finally, our model, as specified here, may reward players for attempting difficult passes that have no positive effect, and possibly even a negative effect, on the team. However, despite this possibility we believe that our results demonstrate that the model is valuable, and it is certainly a step in the right direction if statistical modelling is to be used to measure passing ability of footballers.

Acknowledgements

We thank Smartodds Ltd who sponsor Łukasz Szczepański’s doctoral research, Opta for allowing us to use the data, and the three referees and the Associate Editor for their useful comments in improving the paper.

References

Albert, J. (1992) A Bayesian analysis of a Poisson random effects model for home run hitters. Am. Statistn, 46, 246–253.

Albert, J. (2006) Pitching statistics, talent and luck, and the best strikeout seasons of all-time. J. Quant. Anal. Sprts, 2, no. 1.

Breslow, N. E. and Clayton, D. G. (1993) Approximate inference in generalized linear mixed models. J. Am. Statist. Ass., 88, 9–25.

Duch, J., Waitzman, J. S. and Amaral, L. A. N. (2010) Quantifying the performance of individual players in a team activity. PLOS ONE, 5, no. 6, article e10937.

Efron, B. and Morris, C. (1975) Data analysis using Stein’s estimator and its generalizations. J. Am. Statist. Ass., 70, 311–319.

Jensen, S. T., Shirley, K. E. and Wyner, A. J. (2009) Bayesball: a Bayesian hierarchical model for evaluating fielding in major league baseball. Ann. App. Statist., 3, 491–520.

Lin, X. and Zhang, D. (1999) Inference in generalized additive mixed models by using smoothing splines. J. R. Statist. Soc. B, 61, 381–400.

Loughin, T. M. and Bargen, J. L. (2008) Assessing pitcher and catcher influences on base stealing in Major League Baseball. J. Sprts Sci., 26, 15–20.

McHale, I. G., Scarf, P. and Folker, D. (2012) On the development of a soccer player performance rating system for the English Premier League. Interfaces, 42, 339–351.

McHale, I. G. and Szczepański, Ł. (2014) A mixed effects model for identifying goal scoring ability of footballers. J. R. Statist. Soc. A, 177, 397–417.

Oberstone, J. (2011) Evaluating English Premier League player performance using the MAP model. In Proc. 3rd Int. Conf. Mathematics in Sport (eds D. Percy, J. Reade and P. Scarf), pp. 153–159. Southend-on-Sea: Institute of Mathematics and Its Applications.

R Core Team (2012) R: a Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Schall, R. (1991) Estimation in generalized linear models with random effects. Biometrika, 78, 719–727.

Wood, S. N. (2006) Generalized Additive Models: an Introduction with R. Boca Raton: Chapman and Hall–CRC.