View
7
Download
0
Category
Preview:
Citation preview
ASHESI UNIVERSITY COLLEGE
PREDICTING INJURY IN FOOTBALL USING PITCH QUALITY,
PLAYER’S FUNCTION, PLAYER’S AGE AND MATCH INTENSITY:
A CASE STUDY OF THE 2017 AFRICAN CUP OF NATIONS
APPLIED PROJECT
B.Sc. Management Information Systems
Ayeley Commodore-Mensah
2017
Page | 1
Branding and Identity Guide The Ashesi brand and logo are integral parts of our worldwide image and identity. We must be careful of how and where the Ashesi is used to ensure we maintain the integrity of our organization.
This guide has been developed to help you clearly understand our policies towards the use of the Ashesi logo in a variety of mediums, as well as type faces and a color palate to help you produce materials that maintain the brand’s integrity. We would request that you seek approval from the Ashesi University College Marketing Committee before creating any media that reproduces the Ashesi logo.
Contents The Logo ........................................................................................................................................................ 2
Using the Logo .............................................................................................................................................. 3
Clear Space and Logo Design ........................................................................................................................ 5
Unacceptable Logo Uses ............................................................................................................................... 6
The Ashesi Seal .............................................................................................................................................. 7
Color Palette ................................................................................................................................................. 8
Fonts.............................................................................................................................................................. 8
Mission Statement ........................................................................................................................................ 9
ASHESI UNIVERSITY COLLEGE
Predicting injury in football using pitch quality, player’s function,
player’s age and match intensity: A case study of the 2017 African Cup
of Nations
APPLIED PROJECT
Applied Project submitted to the Department of Computer Science, Ashesi
University College in partial fulfilment of the requirements for the award of
Bachelor of Science degree in Management Information Systems
Ayeley Commodore-Mensah
April 2017
i
DECLARATION
I hereby declare that this Applied Project is the result of my own original work and that no
part of it has been presented for another degree in this university or elsewhere.
Candidate’s Signature:
……………………………………………………………………………………………
Candidate’s Name:
……………………………………………………………………………………………
Date:
……………………………………………………………………………………………
I hereby declare that preparation and presentation of this Applied Project were supervised
in accordance with the guidelines on supervision of such laid down by Ashesi University
College.
Supervisor’s Signature:
……………………………………………………………………………………………
Supervisor’s Name:
……………………………………………………………………………………………
Date:
……………………………………………………………………………………………
ii
Acknowledgement
To my supervisor Dr Charles Jackson, who encouraged, supported and guided me
from the beginning to the final stage of this project. I am also thankful to Mr Jon Boafo
and Mr Nathan Quao for their assistance in getting this project completed. I am indebted
to my family and colleagues for the support I received from them.
Above all, to God, the source of knowledge and wisdom, for seeing me through
this project.
iii
Abstract Football is the most popular sport in the world, and many individuals have taken
advantage of it to earn a living and improve upon their standards of living. Injuries are also
unfortunate incidents that occur in daily life and in sports, which affect an individual’s
ability to make good use of his sporting talent to earn a living for himself and his family.
In this project, modifiable risk factors that affect a player’s likelihood of getting an
injury are identified, and their individual contributions to injury of a player is assessed. A
predictive model for determining important risk factors for determining injuries in football
is generated using the identified risk factors: pitch quality, match intensity, player function
and player age.
iv
Table of Contents DECLARATION ................................................................................................................... i
Acknowledgement ................................................................................................................ ii
Abstract ................................................................................................................................ iii
List of Figures ...................................................................................................................... vi
List of Abbreviations .......................................................................................................... vii
Chapter 1 ............................................................................................................................... 1 1.1 Introduction and Background ...................................................................................... 1
1.2 Motivation ................................................................................................................... 1
1.3 Problem Description .................................................................................................... 2
1.4 Benefits ........................................................................................................................ 3
1.5 Objectives .................................................................................................................... 3
1.6 Outline of Project ........................................................................................................ 4
Chapter 2: Related Works ..................................................................................................... 5
2.1 Background ................................................................................................................. 5 2.2 Relevance of Sports Injury Prediction ........................................................................ 5
2.3 Factors that Result in Injuries ...................................................................................... 6
2.4 Related Works ............................................................................................................. 7
2.4.1 Oslo Sports Trauma Research Center ................................................................... 7
2.4.2 NSW Waratahs ..................................................................................................... 8
2.4.3 SAP SE ................................................................................................................. 8
2.4.4 Team from the University of Birmingham and Southampton Football Club ....... 8
2.4.5 Kitman Labs ......................................................................................................... 9 2.4.6 Sports Injury Predictor .......................................................................................... 9
2.4.7 Researchers from Dow Jones and Wall Street Journal ......................................... 9
2.5 Findings and Proposed Solution ................................................................................ 10
2.6 Proposed Solution ..................................................................................................... 10
Chapter 3: Requirements ..................................................................................................... 11
3.1 Functional Requirements ........................................................................................... 11
3.1.1 Business Understanding ..................................................................................... 11
3.1.2 Data Inventory and Understanding ..................................................................... 12
3.1.3 Data Preparation ................................................................................................. 14 3.1.4 Data modelling ................................................................................................... 14
3.1.5 Testing, Evaluation, Interpretation, Understanding ............................................ 15
v
3.1.6 Deployment ........................................................................................................ 15
3.2 Non-Functional Requirements .................................................................................. 15
3.2.1 Product Requirements ......................................................................................... 15
3.2.2 Security Requirements ........................................................................................ 16
Chapter 4: High Level Architecture .................................................................................... 17 4.1 Data Mining Algorithm ............................................................................................. 17
Chapter 5: Implementation ................................................................................................. 18
5.1 Data sources .............................................................................................................. 18
5.1.1 AFCON Data ...................................................................................................... 18
5.1.2 Match intensity data ............................................................................................ 22
5.2 Tools .......................................................................................................................... 25
5.3 Language ................................................................................................................... 25
5.4 Implementation .......................................................................................................... 25 5.4.1 Dataset 1 ............................................................................................................. 26
5.4.2 Dataset 2 ............................................................................................................. 27
5.4.3 Logistic regression .............................................................................................. 30
Chapter 6: Testing ............................................................................................................... 37
6.1 Testing results ........................................................................................................... 37
Chapter 7: Conclusions and Recommendations ................................................................. 39
References ........................................................................................................................... 41
Appendix ............................................................................................................................. 43
vi
List of Figures
Figure 2.1 The Meeuwisse model for classifying injury risk factors ................................................ 7 Figure 5.1 Injury distribution based on periods in the match .......................................................... 20 Figure 5.2 Number of matches played on the different pitches ....................................................... 22 Figure 5.3 Match intensity levels for all matches ............................................................................ 24 Figure 5.4 Linear regression results on Dataset 1 (very serious injuries) ........................................ 27 Figure 5.5 Linear regression results on Dataset 2 (all injuries) ....................................................... 28 Figure 5.6 Plot of player function and injuries sustained ................................................................ 29 Figure 5.7 Plot of pitch quality and injuries sustained ..................................................................... 30 Figure 5.8 Head view of dataset ...................................................................................................... 31 Figure 5.9 Plot of all variables considered in the analysis ............................................................... 32 Figure 5.10 Plot of player function on different pitch qualities ....................................................... 33 Figure 5.11 Correlation output ......................................................................................................... 34 Figure 5.12 Logistic regression output ............................................................................................ 35 Figure 6.1 Result of testing in R ...................................................................................................... 38 Figure 7.1 Proposed interface for club administrators to work with ............................................... 39
vii
List of Abbreviations
The Fédération Internationale de Football Association – FIFA
Confederation of African Football CAF
National Football League-NFL
Major League Soccer - MLS
Major League Baseball – MLB
African Cup of Nations- AFCON
Confederación Sudamericana de Fútbol –CONMEBOL
Union of European Football Associations-UEFA
1
Chapter 1
1.1 Introduction and Background
Football is regarded as the most popular sport in the world. It is an enjoyable form
of exercise, and helps develop agility, balance, coordination and sense of team work
Stopsportsinjuries.com (n.d.). It is a contact sport and as is common with all contact
sports, there is the likelihood of an injury occurring in the life of a player. Players, football
teams, player agents, physiotherapists and the country are some important stakeholders in
the sport who are affected by injuries sustained by a footballer.
Stephen Appiah, Michael Essien, Junior Agogo, Kwadwo Asamoah and Kevin
Prince Boateng are all players who have represented Ghana in football at various times in
the past. That’s not all they have in common. These players had their careers destroyed by
injuries they sustained while playing football (Pulse.com.gh, 2016). These players were
fortunate to have played to play in well-established leagues outside Ghana, which meant
they got the right treatment for their injuries. They recovered but not to the level they were
before they got injured. For a player who plied his career on the local scene, the story is
different. A serious injury means his career is over. Former Accra Hearts of Oak and
Kumasi Asante Kotoko player, Charles Taylor, is one whose career was ended by injuries
(Goal.com, 2012). There are other players who could have also become stars like Charles
but injuries cut their careers short before it even began.
1.2 Motivation
In Ghana, there are no specialized sports hospitals to take care of the sports
injuries. Typically, when a player sustains an injury while on duty for his team he is let go
with no hope for treatment. When a player sustains an injury, the impact is not felt by him
2
alone. First, he can no longer play and his only means of supporting himself or making
money has been lost. His family is affected, as the person they look up to for financial
support is no longer available to assist them. The most affected stakeholder is the team he
plays for. The team that doled out huge sums of money to buy him in the first place must
contend with being without their player for some weeks, months or in some cases years.
One player who comes to mind in this case is Andre Ayew, a Ghanaian player who
sustained an injury on his first outing for English Premier League club, West Ham United.
This was after they bought him for a club record fee. His injury meant that he would be
out for close to four months, during which time West Ham’s value for their money would
be lost (Forbes.com, 2015).
This project seeks to develop a solution that uses the risk factors for sustaining an
injury to create a predictive model that will reduce the likelihood of a player sustaining an
injury.
1.3 Problem Description
The most common injuries sustained by the sportsman are the lower extremity
injuries such as sprains and strains, cartilage tears and anterior cruciate ligament sprains in
the knee, overuse lower extremity injury being soreness in the calf (shin splints), pain in
the knee or the back of the ankle (Achilles tendinitis), upper extremity injuries, and head,
neck and face injuries (Stopsportsinjuries.com, n.d.). Injury has been known to be a major
cause of derailing the careers of major sports men and women all over the world. These
effects may be physical or psychological. Some emotions that are associated with injuries
include sadness, anger, and frustration. In 2015, it was estimated that the average cost of
player injuries in the top four professional leagues in Europe was $12.4 million per team
(Goal.com, 2016).
3
Technology has been used in advanced countries to mitigate the risk factors
leading to injuries. However, these practices use sophisticated technology which are
expensive and not readily available in Ghana to be used by the team and players that ply
their trade here.
1.4 Benefits
Being able to predict an injury means it can be prevented to an extent. When the
risk factors that are most likely to result in a player getting injured are identified, focus can
be placed on eliminating these unfavourable risk factors, which ultimately results in less
injuries occurring. It will be useful to coaches to identify when their players are more
susceptible to injury, and inform their decision to rest them or play them. This saves the
team money that would have been lost in treating the player. The player in turn earns
money that would have been lost if he were out due to injury and makes sure his family is
well taken care of. With respect to the sports entertainment industry, this will be extremely
useful in fantasy football team selections for sports fans.
1.5 Objectives
This project seeks to generate a model using multiple logistic regression analysis
that will assist in the prediction of injuries and the possible prevention of injuries in
football. Significant objectives that will be achieved in this process include:
• Identifying modifiable risk factors that affect a player’s likelihood of getting
injured.
• Identifying how these different factors contribute to a player getting injured.
• Design a model to predict the probability of an injury occurring.
• Outlining steps to be taken by stakeholders involved to reduce the occurrences of
injuries.
4
1.6 Outline of Project
This paper contains six chapters and will be outlined as follows:
Chapter 1 introduces readers to the project and the problem it is trying to solve,
highlighting the motivation behind undertaking this project in the first place. Chapter 2
reviews related works in sports injury prediction, highlighting scholarly work emphasizing
the importance of injury prediction, as well as identifies major risk factors to be
considered in injury analysis and prediction. Furthermore, it analyses existing models to
find out what new feature can be added or what can be done differently. Chapter 3
describes the functional and non-functional requirements using the the CRoss Industry
Standard Process for Data Mining (CRISP-DM) 1.0 data mining process model. Chapter 4
focuses on the architecture and factors that will be used in predicting a model. Chapter 5
deals with the implementation of the project, describing the tools and technology and
platforms that will be used and explore the reasons why they were chosen. Chapter 6 will
cover the testing of the model and results from testing will be discussed. Finally, Chapter
7 will make conclusions and recommendations. Limitations encountered and suggestions
for further work are made.
5
Chapter 2: Related Work
This chapter tackles related work in the field of sports injury prediction. It
discusses the relevance of predicting injuries, and then highlights instances where injury
prediction was effectively in sports.
2.1 Background
Predicting and possibly avoiding injuries is touted as the next big thing in sports
data. For this chapter, a background study into the importance of sports injury prediction
was covered in section 2.2. It further analysed the different risk factors that could be
assessed in injury prediction. Under section 2.4 a study into the different risk factors was
discussed, as well as existing solutions in injury prediction were analysed. Finally, section
2.5 summarises the major solutions currently in place and section 2.6 gives
recommendations on how the existing measures can be adapted and improved upon in this
project.
2.2 Relevance of Sports Injury Prediction
Gabett (n.d.) in his article, “Injury prevention and performance enhancement in
team sports: Train smarter and harder”, discusses how injuries can be prevented and
performances enhanced in team sports, basically through training smarter. The paper
weighs the argument of the correlation between training loads and injuries from three
angles, the first being that suggesting that the harder these athletes train the more injuries
they will sustain, and the second that the if training loads exceeded a planned ‘threshold’,
athletes were ‘managed away’ from potential injury and finally that insufficient training
may lead to increased injury risk.
In their paper, Colston and Wilkerson looked at physiological factor that could lead
to a player developing an injury. It used a 3-factor prediction model that looked at injury
6
risk factors that could be used to identify injuries. The research that accompanied the
paper was designed in the form of a cohort study. The purpose of this study was to
investigate the relationship between physical workload and injury risk in elite youth
football players. The researchers used the workload data and injury incidence of 32
players, monitored throughout two seasons. This approach to injury prediction relied on
multiple regression to compare cumulative loads between injured and non-injured players
for specific GPS and accelerometer-derived variables. It was discovered that higher
accumulated and acute workloads were associated with a greater injury risk. However,
progressive increases in chronic workload may develop the players' physical tolerance to
higher acute loads and resilience to getting injured.
2.3 Factors that Result in Injuries
Murphy, Connolly and Beynnon in October 2002 undertook a study which
investigated risk factors among athletes and military recruits aged between 14 and 39
years for lower extremity injuries. The results of this study were published a paper titled
Risk factors for lower extremity injury: a review of the literature, were divided into
extrinsic and intrinsic risk factors. The extrinsic risk factors considered were level of
competition, skill level, shoe type, ankle bracing and playing surface. Intrinsic factors
studied included age, sex, phase of the menstrual cycle (for women), previous injury and
inadequate rehabilitation, aerobic fitness, body size, muscle strength, imbalance and
reaction time. Their research concluded that there was an increased incidence of injuries
on artificial turfs than on grass or gravel. Additionally, there was an increased likelihood
of injuries occurring in less skilled players as compared to highly skilled players who can
easily maneuver away from an imminent tackle. With respect to intrinsic factors, the study
revealed that there was an increased incidence of injuries for players older than 25 years.
7
2.4 Related Work
2.4.1 Oslo Sports Trauma Research Center
Roald Bahr and Ingar Holme of the University of Sport and Physical Education at
the Oslo Sports Trauma Research Centre Education conducted research into the various
factors that could lead to injuries in sports. This was published in a paper titled Risk
factors for sports injuries — a methodological approach. Using a multivariate statistical
approach, they investigated potential risk factors for injuries. These risk factors were
classified using the Meeuwisse model which divides risk factors into intrinsic and
extrinsic, and measured the impact these factors had on injuries. These researches used the
linear logistic regression model.
Figure 2.1 The Meeuwisse model for classifying injury risk factors
8
2.4.2 NSW Waratahs
In Australia, the NSW Waratahs Rugby Team’s use of IBM’s Predictive Analytics
helped reduce player injury. This in turn optimized team performance. The analysis model
used predicted the likelihood of a player being injured, informing the coaching team to
monitor each player’s training program and minimize their chance of getting injured
(IBM, 2013).
2.4.3 SAP SE
German company SAP SE, uses sensors and cloud computing through its Injury
Risk Monitor to predict and prevent football injuries. It uses huge catalogues of data and
its HANA cloud platform to make injuries less likely – and even preventable. Players wear
sensors which gathers data while they play, and along with statistics collected from a
player's entire career and held on SAP's HANA Cloud Platform. The Injury Risk Monitor
then gives a percentage to indicate how likely each player is to injure themselves in their
next match. The system considers how fit each player is, based on their diet and exercise
regime, along with the date of their last injury, and how long they usually take to recover
from a variety of injuries (Ibtimes.co.uk, n.d.).
2.4.4 Team from the University of Birmingham and Southampton Football Club
A team from the University of Birmingham and Southampton Football Club used
GPS technology to analyze the performance of youth team players, to study the link
between training activity and rates of injury. These GPS trackers monitored their speed,
distance travelled and total forces experienced by their bodies on the pitch during games
and training. This data was cross referenced against any recorded injuries which caused
players to miss training activity – and classified as mild, moderate and severe
(Sciencedaily.com, 2016).
9
2.4.5 Kitman Labs
Kitman Labs, analyses data about players and can predict when a player might get
injured with an unprecedented degree of accuracy. Partnering with the Leinster and Irish
rugby teams, they gathered data from athletes via sensors like GPS vests and heart rate
monitors (www.thejournal.ie, 2014). It pulls together data recorded on the player’s sleep
pattern, work done on the pitch, heart rate variability and other metrics. Kitman is looking
to move away from rugby into the United States market where they will look at
opportunities in MLB, MLS and every other track and field sport.
2.4.6 Sports Injury Predictor
Sports injury predictor is an algorithm that determines the probability of a player
being injured. It uses an injury database, considering every injury that has taken place,
type of injury and kind of treatment required (Sportsinjurypredictor.com, 2017). It uses an
injury correlation matrix to determine the statistical probability of an injury occurring
based on previous injury. It also considers biometrics data like age, height and weight,
play by play data, position and how many times player is likely to touch the ball. This s
used by the National Football League (NFL) in the United States of America. This
algorithm is however pending a patent.
2.4.7 Researchers from Dow Jones and Wall Street Journal
Researchers from Dow Jones and Wall Street Journal applied advanced machine
learning to predict the probability of an injury for a player in the NBA. This was revealed
at the MIT SLOAN Sports Analytics Conference. The model they created was based on
play-by-play game data, player workload and measurements, and team schedules covering
a period of two years. Their approach enabled team management and decision-makers to
identify the best time for a team to rest their star players and reduce the risk of long-term
injuries, while optimizing team strategies (Talukder & Vincent, 2016).
10
2.5 Findings and Proposed Solution
From the study of related works, it is seen that most existing models used
sophisticated technology to monitor a player’s likelihood of injury. These were in the form
of wearables like heart rate monitors and GPS vests. Advanced machine learning and
multivariate logistic regression were used in some cases by researchers to predict and
prevent injuries, using the data collected from monitoring the player’s vitals. In some
instances, the data was collected in game, which would immediately prompt the medical
staff if any change was necessary based on risk factors encountered. It is evident that the
use of these technology in predicting injury significantly led to a reduction in the
incidence of injuries in various sports ranging from football, basketball and rugby.
2.6 Proposed Solution
The model will take into consideration a mix of intrinsic and extrinsic factors,
based on the Meeuwisse model. Considering the limited technology available in Ghana to
carry out this prediction, multiple logistic regression is a statistical tool that can be used to
determine the probability of an injury occurring.
11
Chapter 3: Requirements
In this chapter, section 3.1 covers functional requirements using the Cross industry
Standard Process for Data Mining. Section 3.2 covers non-functional requirements.
3.1 Functional Requirements
For this project, the CRoss Industry Standard Process for Data Mining (CRISP-
DM) 1.0 data mining process model will be used (Chapman et al, 2000). Major stages in
that model are outlined below in this requirements plan. These stages include business
understanding, data understanding, data preparation, modelling, evaluation, and
deployment.
3.1.1 Business Understanding
This section seeks to provide an overview of the project context. It covers the
problem that exists and how data mining can be used to provide a solution. Resources that
are used in the project are identified as well as constraints. Finally, the criteria by which
how success will be measured is outlined.
3.1.1.1 Background
In sports, an injury is basically an event that causes absence from one or more
games or practice sessions. For this analysis on the 2017 African Cup of Nations, an injury
is any event which led to a stoppage in play, and required the presence of the medical
personnel of the team on the pitch to treat the player. This may have resulted in a player
being absent from subsequent games.
Throughout the course of the 2017 African Cup of Nations, there were injuries in
almost every game, with most players having their tournament cut short as a result. The
main objective of this project is to assess the likelihood of a player getting injured at the
12
2017 African Cup of Nations tournament and create a model for predicting the likelihood
of an injury, using data collected on players who were at the tournament.
3.1.1.2 Resources and Constraints
Data will be collected from primary sources online. Data sourced from online is
open source and will be adequate referenced. A sports expert who was present in Gabon
during the African Cup of Nations tournament will be consulted. Data mining tools like R
and Excel will be used. A constraint on the data used will be the small sample size
involved, as the Cup of Nations tournament lasted for only three weeks and involved only
368 players, out of which less than 250 actively participated in the tournament.
3.1.1.3 Success Criteria
Success will be measured in the short term if its identified that there exists a
relationship between injury and risk factors identified. Medium term success will be
determined by improvement in the states of the significant risk factors that have been
identified to have strong relationship with injuries. In the long term, success will be
measured by a reduction in the injuries in footballers.
3.1.2 Data Inventory and Understanding
The next step in this project would be to collect the data that is available for this
project. A description of database acquired is given. Steps are then taken to verify data
quality.
3.1.2.1 Description of data
Proprietary data in this case is obtained from the 2017 African Cup of Nations and
interviews with experts who were in Gabon for the tournament. External data sources on
match intensity would be generated from FIFA.com.
13
There are different factors that can cause an injury and these are classified as
intrinsic or extrinsic. Intrinsic are those factors that are peculiar to the player while
extrinsic are environmental factors the player may encounter. These environmental factors
include weather, pitch surface and football boot type. The factors likely to cause injuries
may further be classified as modifiable and non-modifiable. Examples of non-modifiable
factors may be gender and age, while pitch surface quality are modifiable factors. These
factors may be further classified as continuous, for example age or categorical. Pitch
surface quality is regarded as categorical.
3.1.2.2 Understanding data
In identifying the risk factors to use in the analysis, different variables were
collected. The next step gives an indication of which variables were selected to be used as
risk factors. Minutes played was identified as a factor/variable that could be used but this
is an after the fact variable, as every player would want to play 90 minutes, given the
absence of injury. Fouls suffered is also an indication of how a player is susceptible to
injury but the player does not determine how often he is fouled in a match situation.
However, the number of fouls a player suffers can be high or low based on the function he
plays on the field.
After the data understanding process, player’s age, pitch surface quality, player
function and match intensity were selected as risk factors to be analysed in the data mining
process.
3.1.2.3 Data quality
Data collected online is verified across different sources. The data is passed
through the sports expert to be the final check on the quality of data obtained from online.
14
3.1.3 Data Preparation
The next step in this would be analysing the data with the objective of determining
whether the available data can be used to come up with a model for predicting injury. If
the data is found not to be suitable, then the data will be modified or manipulated to suit
the objectives. If this is still not possible, objectives may be scaled back to reflect what is
possible with the data given.
3.1.3.1 Data Selection
Data collected online included the tackles a player made, duels a player was
involved in, minutes played, total passes completed. However, these risk factors were not
used as there was no way these could be predicted before a game.
3.1.3.2 Data Cleaning and Construction
The huge amount of data received will be prepared and put in a format which is
much easier to work with so that it can be successfully modelled. This would involve a
straightforward examination of the main components of the data expected to be relevant to
modelling. Steps to be taken include scanning the data for extreme and/or invalid values,
locating oddities in the data as well as ensuring the data provides appropriate values. Data
collected from different data sources are merged into one dataset to be used in this
analysis. Data is then stored in the .csv format that can be used in the R software. To run
the logistic regression, the dependent variable in the model, injury, is converted into
binary format, with 0 for no injury and 1 for injury, irrespective of the number of injuries
sustained in the game.
3.1.4 Data modelling
To test this project to see how viable it is, a test run will be done on a handful of
participants. At the very least, analytics would be conducted on a radically slimmed down
15
version of the data set to accelerate modelling run times. Afterwards, this would be
expanded to the entire dataset. This stage will involve automatically running and
summarizing a series of experiments. Using a logistic regression model, data will be
analysed to find a relation between the key factors identified and the likelihood of injury.
In creating the model, a statistical approach is taken using the multiple logistic
regression.
3.1.5 Testing, Evaluation, Interpretation, Understanding
This stage will take advantage of the model created to decide on whether injuries
can be prevented or not. The model will be interpreted in the sense of what it can do to
help prevent injuries and preserve the playing lives of players. As compared to the
previous stage that will make use of logistic regression, this stage will just involve
analysing the results of the previous stage.
3.1.6 Deployment
This is done once the project is tested and provides the model required. Data is
deployed in the test environment. A varied selection of possible variables that would have
existed should another match have been played at the tournament will be run through the
model.
3.2 Non-Functional Requirements
3.2.1 Product Requirements
3.2.1.1 Performance
The model should be able to perform the basic function of predicting the
possibility of an injury when it is fed with data. It should possess the ability to handle
large datasets.
16
3.2.2 Security Requirements
The system should keep player data secure and confidential.
17
Chapter 4: High Level Architecture
This chapter discusses the algorithm that will be used in analysing the data.
4.1 Data Mining Algorithm
The data mining algorithm used in this prediction model generation is multiple
logistic regression. This is used because the model will try to determine binary outcomes.
The logistic regression approach, was used because it gives the user a reliable way
of assessing the model and assessing predictors. It assists the user to predict an outcome
variable that is categorical from predictor variables that are continuous and/or categorical
(Ziegler-Hill, n.d.). The presence of age (continuous variable) and pitch quality
(categorical variable) made it essential to use logistic regression.
Additionally, logistic regression was used because it could answer the following
questions which were highlighted in the project objectives.
• Can modifiable risk factors that affect a player’s likelihood of getting injured be
identified?
• How do these different factors contribute to a player getting injured?
• Do some of the risk factors interact with each other?
18
Chapter 5: Implementation
This implementation section will cover the datasets needed to complete this project
and sources where it will be obtained from, as well as tools and languages used. For this
analysis, the external factors considered were pitch quality, match intensity and player
position. Age was considered as an intrinsic factor. It also indicates how the data was
entered into the Excel workbook.
5.1 Data sources
Data was collected from BBC.com, which gave a play by play commentary on
Match proceedings. This data was corroborated with data from Guardian.com, which gave
a play by play commentary as well. A third data source used was Sofascore.com which
provided match and player statistics, as seen in appendix A. FIFA.com also provided data
on match intensity.
5.1.1 AFCON Data
The first step of the implementation involved collecting data on player injuries at
the 2017 African Cup of Nations. This competition provided the best avenue for gathering
information as there was a widespread criticism on the high number of injuries that
occurred, with multiple attributed to the bad quality pitches. Characteristics of this
tournament are given in the following sentences. 16 nations qualified for this tournament
hosted in Gabon, with each country presenting a 23-man squad of players. There were four
stadia used with four teams stationed permanently at each of these venues throughout the
group stages. Each team played 3 matches during the group stages in a round robin format,
with the top two teams from each group qualified for the quarter finals. From this stage
onwards, all games were played in a knockout format till the finals. In all, there were 32
games played at the tournament. A total of 52 goals were scored across all 32 games
19
indicating an average of 1.6 goals per game. The low number of goals scored on the
average was an indication of the defensive minded approach most teams took towards the
games. 84 yellow cards were given out in total, an indication of how aggressive and
physical the matches were.
The nature of data gathered from the AFCON is found below:
• Player injury data
• Player’s function on the pitch (forward, midfielder, defender, goalkeeper)
• Player’s age (continuous data- integer value)
• Pitch quality data
5.1.1.1 Player injury data
All instances of a player being substituted because of an injury were noted by 1.
Searching for this data was made easy using Sofascore.com, which indicated the players
who were substituted because of an injury. This resulted in 24 unique observations of
players who had to be substituted. This data was corroborated with the play by play
accounts on both BBC.com and Guardian.com. Using the web browser’s inbuilt search
function, the keyword injury from bbc.com was searched for, and all observations noted.
A different record of injuries was also obtained which noted the different times a
medic team had to come onto the football pitch to treat an injury. For anytime play had to
be stopped for the medics to treat a player or for the player to be stretchered off the field, it
with 1, and all other instances with a 0. For this record that was taken, there was a total of
104 injuries. For the analysis performed on this dataset, the second record of injuries
requiring medics was used. This was because it was observed that because some teams had
exhausted their substitution options, some players had to play through their injury. This
would then make using instances where players were substituted inaccurate.
20
From Figure 5.1 below it is observed that most injuries happened in the second
half, with a total of 92, as compared to 12 in the first half.
Figure 5.1 Injury distribution based on periods in the match
5.1.1.2 Player’s function on the pitch
For any football match, a player can assume the position of goalkeeper, defender,
midfielder or forward. In all, there were 48 goalkeepers at the tournament, 119 defenders,
114 midfielders, and 87 forwards. This data was obtained from the official Confederation
of African Football website, from the official list of players submitted. In entering the data
into Excel, goalkeepers were denoted with 1, defenders with 2, midfielders with 3 and
forwards with 4.
12
92
0
10
20
30
40
50
60
70
80
90
100
1st half 2nd half
Num
ber o
f inj
urie
s
Half
Number of injuries per half
21
5.1.1.3 Player’s age
The player’s age represented continuous data. This data was gathered from
transfermarkt.com which is an online database of professional football leagues, clubs and
players. Each player’s name was entered into the search panel on the transfermarkt.com
website which gave the players date of birth and age. This assisted in deriving the player’s
age as at the time the tournament started. The oldest player was 44 years old while the
youngest player was 18 years old. The mean age was 26 years. 108 out of the 368 players
present at the tournament were aged 24, making it the modal age.
5.1.1.4 Pitch quality
Pitch surface quality was selected as an external factor to be analysed after coaches
and doctors attributed the high injury rate to the pitches. Frank Boahene, a pitch expert and
managing director of Green Grass Technology (GGT), a pitch development solutions firm,
revealed that the pitch could be blamed for the knee joint, ankle and hamstring injuries
that players suffered at the tournament (Boahene, 2017).
There were four different stadia used at the 2017 AFCON, with their pitches
possessing varying degrees of quality. The stadium in Franceville was viewed as having
the best pitch and was given a rank of 1. Group B played its group matches here. This was
followed by the stadium in Oyem, ranked 2nd. This was a new one built specifically for the
tournament. Group C matches were played here. Notwithstanding its good quality,
drainage on the pitch was poor and it was known to get soggy whenever there was very
heavy rain. The stadium in Libreville hosted Group A. This pitch was regarded as sandy
and was given a rank of 3. The last match venue was the stadium in Port Gentil, which
was one of the two new stadia built for the tournament. Unlike Oyem, this was regarded as
the worst pitch in the tournament. It was very sandy, which made balls bounce awkwardly.
This stadium hosted group D and was ranked 4th. This ranking was done based on insight
22
gained from an interview with Citi FM sports editor Nathan Quao (2017). Figure 5.2
below shows the number of matches played on different pitches at the tournament. Most
matches were played on pitch quality 2 which was the best at the tournament.
Figure 5.2 Number of matches played on the different pitches
5.1.2 Match intensity data
Data for determining match intensity was sourced from the FIFA Coca Cola world
ranking found on the FIFA website. This is because any team that performs in football
gains points which enable it to rise in the rankings. Countries that are ranked close to each
other have a similar performance record, while countries that are far apart on the rankings
have varying performance levels. Matches involving countries close on the rankings
indicate that they have the same level of skill and so it would be an intense match
involving two teams that are well equipped to win. This would make a match between
these two teams an intense one, as compared to one with teams far away from each other
8
9
7
8
0
1
2
3
4
5
6
7
8
9
10
Pitch Quality 1 Pitch Quality 2 Pitch Quality 3 Pitch Quality 4
Num
ber o
f mat
ches
play
ed
Pitch
Number of matches played on different pitches
23
on the ranking. Comparison of the different countries is seen in Appendix B and Appendix
C.
The ranking is determined by the Fédération Internationale de Football Association
(FIFA). The world football governing body releases a ranking monthly on all male senior
national teams. To ensure that the team’s standing reflects current performance, a team’s
total number of points over a four-year period is determined by adding the average number
of points gained from matches during the past 12 months and the average number of points
gained from matches older than 12 months which depreciates yearly.
5.1.2.1 FIFA’s calculation of rankings
The rankings are calculated as follows based on the FIFA formula (2017):
𝐏oints = 𝐌×𝐈×𝐓×𝐂
where:
‘M’ is the points for match result.
‘I’ is the importance of the match.
‘T’ is the strength of opposing team
and ‘C’ is the Strength of confederation
Figure 5.3 below shows the different levels of match intensity that was witnessed
at the tournament. There were 14 matches that were extremely intense, with 5 moderately
intense.
24
Figure 5.3 Match intensity levels for all matches
5.1.2.2 Calculating match intensity
The rankings for January 2017 was scraped from the FIFA website using,
DataMiner, a web based add-on. This was then placed in an Excel workbook after which it
was manipulated into a format that would be easy to work with.
First, the list of over 210 member countries of FIFA was pruned down to the 16
countries that qualified for the AFCON tournament, with their respective ranking on the
log. Next, the difference in ranking for all the participating nations were determined. For
instance, if Senegal was ranked 1st and Ghana was ranked 9th, then the difference in their
ranking would be 9 places. This was done for all countries, as is seen in Appendix A. It
was observed that the highest difference in ranking was 70 and the lowest difference in
ranking was 1. This difference in ranking was used to determine match intensity as
follows:
14
9
45
0
2
4
6
8
10
12
14
16
Extremely intense Intense Moderately intense Not too intense
Num
ber o
f mat
ches
Level of match intensity
Match intensity levels
25
• All 32 matches that took place were noted down with their corresponding
differences in ranking.
• With the varying differences in ranking, all matches with differences greater less
than 17.25 were given a rank of 1. Denoting that the match was very intense.
Matches with a difference score ranking between 17.25 and 34.50 were given a
rank of 2. For matches with difference score between 34.50 and 51.75, a score of 3
was given. 4 was reserved for matches with difference score of 51.75 and above.
Appendix B.
5.2 Tools
For data collection DataMiner, a web based add on for the Google Chrome
browser was used. Excel was used in putting the data together into a format that would be
easy to manipulate in R. R statistical software was used for data analysis.
5.3 Language
The language used in the generation of this model is R. A multiple logistic
regression was run on the dataset which included 5 variables: Injury, Pitch quality, Match
intensity, Player function and the Player’s age.
5.4 Implementation
Both linear and logistic regression was run on the dataset to observe the factors
that were very significant in the model. The linear regression model was run on two
datasets. The first dataset had players whose injuries were so serious to the extent that they
had to be substituted. The second dataset involved injuries that had medics coming onto
the pitch, including those that led to substitutions. Afterwards, the logistic regression was
run on the second dataset which was a more accurate representation of the injury statistics.
Different stages with the implementation and results at each phase is explained below. The
26
implementation will seek to confirm whether risk factors that affect a player’s likelihood
of getting injured can be identified, and the level at which rate these factors contribute to a
player getting injured. Appendix D shows a cross section of the data that was collected
and used in the analysis.
5.4.1 Dataset 1
This recorded injury of players who had to be substituted. In all there were 24
unique observations of this. Other columns which were included were player’s age, Player
function, Match intensity and pitch quality. A regression was run on this data, and the
result seen as follows:
The only hint of a slightly significant value was with match intensity with a
significance level of 0.1. All other variables were insignificant. This prompted the use of
Phase B data. This was because it was identified that some players had to play through
their injury due to selection constraints on the match. Thus, analysing this data without
them would have rendered the results inaccurate.
27
Figure 5.4 Linear regression results on Dataset 1 (very serious injuries)
5.4.2 Dataset 2
This accounted for all cases a medic team had to come onto the football pitch to
treat an injury. There were 104 times this happened. From the regression output, it is
observed that Pitch quality and Player position are significant in determining injuries.
From a table plot generated of player position and injuries sustained, it is observed that
goalkeepers stand a higher chance of getting injured based on historical data from the
2017 AFCON.
28
Figure 5.5 Linear regression results on Dataset 2 (all injuries)
Given the significant factors identified, this prompted a bar plot to be drawn to
show the relationship between player function and injury, and find which function was
more likely to get the player injured.
29
Figure 5.6 Plot of player function and injuries sustained
Additionally, a bar plot to be drawn to show the relationship between pitch quality
and injury, and give a graphical view of which pitch quality was more likely to get the
player injured.
30
Figure 5.7 Plot of pitch quality and injuries sustained
5.4.3 Logistic regression
The Data being analysed consist of injury data for 876 instances a player stepped
onto the pitch, and a binary variable coded as a 1 if a player gets an injury and a 0 if not.
For each player, it is also indicated if he is a goalkeeper (1), defender (2), midfielder (3) or
forward (4). The pitch surface quality they played on where very good is coded as 1, good
is coded as 2, okay is coded as 3 and poor is coded as 4. A very intense match is coded 1
31
and a match with very low intensity is coded 4. Shown below is the data for the first 6
players.
Figure 5.8 Head view of dataset
A plot of the data was then generated to show the relationship between the sets of
data.
32
Figure 5.9 Plot of all variables considered in the analysis
A bar plot was then generated to display categorical data in a bid to assess if there
are any interesting trends to consider.
33
Figure 5.10 Plot of player function on different pitch qualities
The findings from the linear regression is enforced, that goalkeepers stand a higher
chance of getting injured. Also, pitches with quality 3 and 4 which were deemed to be
poor had more injuries occurring on it. For goalkeepers, they would prefer to play on pitch
quality 2, as the number of injuries that occurred on this pitch were relatively less than the
rest. Additionally, a closer look at the injury plot for Midfielders show they sustained
much injuries on pitch quality 3 and 4. Also, defenders suffered more injuries on pitch
quality 3. However, for forwards, the numbers are relatively equal.
1 2 3 4
InjuryNo injury
Goalkeeper
Pitch quality
Num
ber o
f pla
yers
05
1015
1 2 3 4
InjuryNo injury
Defender
Pitch quality
Num
ber o
f pla
yers
020
4060
80
1 2 3 4
InjuryNo injury
Midfielder
Pitch quality
Num
ber o
f pla
yers
020
4060
80
1 2 3 4
InjuryNo injury
Forward
Pitch quality
Num
ber o
f pla
yers
010
2030
4050
60
34
Correlation.
A correlation matrix was run to display the correlations between each of the
variables with each other. It is observed that there is positive correlation between Match
intensity and Player Position and a negative correlation between Match intensity and Pitch
quality.
Figure 5.11 Correlation output
A logistic regression model was then run on the dataset.
35
Figure 5.12 Logistic regression output
The above gives us regression coefficients estimates with standard errors. The
player position variable is significantly different from zero with a 0.01 significance level.
None of the other coefficients are significantly different from zero with only Pitch Quality
having significance of 0.1. This reveals that out of age, player function, pitch quality and
match intensity, player function and pitch quality are the most likely to make a player at
the African Cup of Nations get an injury, with the function a player performs on the pitch
36
being the likeliest factor to cause an injury. This logistic regression coefficients give the
change in the log odds of the outcome for a one unit increase in the predictor variable
(Ziegler-Hill, n.d.). For example, from figure 5.4.4, for every one unit change in age, the
log odds of sustaining an injury decreases by 0.00326. The logistic regression also
confirms the results from the bar plots and linear regression, that pitch quality and player
function play a part in the likelihood of a player getting injured.
Implications of Results
Having identified pitch quality and player function as serious risk factors, these can
be worked on and modified to reduce the possibilities of an injury occurring. Out of these
two, player function however has a higher impact on a player getting injured. Pitch quality
is the most modifiable of all the factors, and regarding this, stadium administrators can
place more emphasis on upgrading pitches to grade A pitches. Players should also take
precautionary measure when playing in conditions that pre-exposes them to injury risk
factors.
37
Chapter 6: Testing
For this testing, the model generated will be fitted with dummy variables to
determine if it is accurate. This will be compared to results from the predict function in R.
For instance, from the observed historical data, it is seen that goalkeepers stand a greater
risk of getting injured as compared to player in other positions. Thus, if a goalkeeper’s
details are to be keyed into the model, the likelihood of an injury occurring should be high.
6.1 Testing results
The inbuilt R prediction model was used to test the testing variables generated.
Unsurprisingly, the testing data that generated the highest percentage of an injury
belonged to a goalkeeper playing on poor playing surface. Using 876 testing variables
generated, the likelihood of an injury occurring ranged between 0.04962 and 0.26840. For
this analysis, a cut-off point was decided on for determining the occurrence of injuries. If
the likelihood of getting an injury was above 18%, then there was sure to be an instance of
a player getting injured, and any percentage that fell below 12% indicated that the
likelihood of an injury was low.
When this result from the test was compared with the result from the logistic
regression and bar plots, goalkeepers had injury percentages above the cut-off point. This
was seen in high injury percentages for pitches with poor surface quality.
Figure 6.1 shows the results of the test carried out in R, with 382 unique
probabilities generated.
38
Figure 6.1 Result of testing in R
39
Chapter 7: Conclusions and Recommendations
It is proven that pitch quality and player position has a role to play in a player
sustaining an injury. This model generated meets the expected requirement of being able
to predict an injury. However, the dataset and variables used till date were not too
predictive because of its limited size. For future development, the formula will be fine-
tuned to provide a more accurate model. A much larger dataset would also be needed to
get a more perfect model. Getting a user interface for coaches and club administrators to
work with during game play will also be considered. This will influence their substitution
decisions.
Figure 7.1 Proposed interface for club administrators to work with
The study has shown that pitch surface quality, player function, player age and
match intensity are important variables that can be used to predict injury, especially with
pitch surface quality and player function being particularly significant. There is however
40
room for more variables, such as individual player characteristics, to be included, which
will be tackled in future implementations of the project.
41
References
Bahr, R., & Holme, I. (2003). Risk factors for sports injuries — a methodological
approach. British Journal of Sports Medicine, 37, 384-392.
Boahene, F. (2017). Pitch expert blasts Gabon surfaces. Retrieved from myjoyonline.com:
http://www.myjoyonline.com/sports/2017/January-20th/pitch-expert-blasts-gabon-
surfaces.php
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R.
(2000). crisp-dm.pdf. Retrieved from https://www.the-modeling-agency.com/crisp-
dm.pdf
FIFA. (2017). The FIFA/Coca Cola World Ranking - Men's Ranking Procedure. Retrieved
from fifa.com: http://www.fifa.com/fifa-world-ranking/procedure/men.html
Forbes.com. (2015). The crippling cost of Sports Injuries. Retrieved from Forbes.com:
http://www.forbes.com/sites/sap/2015/08/11/the-crippling-cost-of-sports-
injuries/#3ea9f9241d0d
Gabett, T. (n.d.). Injury prevention and performance enhancement in team sports: Train
smarter and harder. Aspetar sports medicine journal, 218-223.
Goal.com. (2012). Charles Taylor set to quit football as injuries hamper his career.
Retrieved from Goal.com:http://www.goal.com/en-gh/news/4389/ghanaian-
football/2012/08/27/3333650/charles-taylor-set-to-quit-football-as-injuries-
hamper-his
Goal.com. (2016). Slaven Bilic issues Andre Ayew injury update. Retrieved from
Goal.com: http://www.goal.com/en-gh/news/4392/ghanaians-
abroad/2016/08/26/26906442/slaven- bilic- issues-andre-ayew-injury-update
42
IBM. (2013). IBM predictive analytics reduces player injury and optimises team
performance for NSW Waratahs rugby team. Retrieved from www-03.ibm.com:
http://www- 03.ibm.com/press/au/en/pressrelease/42613.wss
Ibtimes.co.uk. (n.d.). injury risk monitor sap uses sensors cloud computing predict prevent
football injuries. Retrieved from ibtimes.co.uk: http://www.ibtimes.co.uk/injury-
risk-monitor-sap-uses-sensors-cloud-computing-predict-prevent-football-injuries-
1514453
Pulse.com.gh. (2016). Black stars list of Ghanaian players destroyed by injury. Retrieved
from Pulse.com.gh:http://pulse.com.gh/sports/features/black-stars-list-of-
ghanaian-players- destroyed-by-injury-id4211632.html
Sciencedaily.com. (2016). Study uses GPS technology to predict football injuries.
Retrieved from sciencedaily.com:
https://www.sciencedaily.com/releases/2016/08/160802222248.htm
Talukder, H., & Vincent, T. (2016). Preventing in game injuries for NBA players. MIT
Sloan Sports Analytics Conference. Boston.
The Journal (2014). Predict injuries make a business big data. Retrieved from
www.thejournal.ie: http://www.thejournal.ie/predict-injuries-make-a-business-
big-data-1541702-Jun2014/
Zeigler-hill. (n.d.). psy_512_logistic_regression.pdf. Retrieved from http://www.zeigler-
hill.com/uploads/7/7/3/2/7732402/psy_512_logistic_regression.pdf
43
Appendix
Appendix A
44
Appendix B
45
Appendix C
Appendix D: Excel view of data
46
Appendix E
#setting working directory
setwd("/Users/ayeleycommodore-mensah/Desktop/Spring
Senior/COMMODORE-MENSAH, Ayeley - 61402017/data")
getwd()
list.files()
#read in dataset to be used.
data=read.csv("data_log_reg.csv", header=TRUE)
head(data)
plot(data)
#draw boxplots to show relationship between pitch quality and
player function and injury
par(mfrow=c(1,4))
barplot(table(subset(data,data$Player.position==1)$Injury,
subset(data,data$Player.position==1)$PQ),
col=c("red","lightgreen"),
legend.text=c("No injury","Injury"),main="Goalkeeper",ylab="Number
of players",
xlab="Pitch quality")
barplot(table(subset(data,data$Player.position==2)$Injury,
47
subset(data,data$Player.position==2)$PQ),
col=c("red","lightgreen"),
legend.text=c("No injury","Injury"),main="Defender",ylab="Number
of players",
xlab="Pitch quality")
barplot(table(subset(data,data$Player.position==3)$Injury,
subset(data,data$Player.position==3)$PQ),
col=c("red","lightgreen"),
legend.text=c("No injury","Injury"),main="Midfielder",ylab="Number
of players",
xlab="Pitch quality")
barplot(table(subset(data,data$Player.position==4)$Injury,
subset(data,data$Player.position==4)$PQ),
col=c("red","lightgreen"),
legend.text=c("No injury","Injury"),main="Forward",ylab="Number of
players",
xlab="Pitch quality")
cor(data[,c("PQ","Match.intensity","Age","Player.position")])
#logistic regression
fit <- glm(formula=Injury ~ PQ + Match.intensity + Age +
Player.position,
data=data, family=binomial(logit))
48
fit
summary(fit)
#prediction dataset
newdata=read.csv("data_predict.csv", header=TRUE)
newdata2=predict(fit, newdata, type="response")
write.csv(newdata2,file = "prediction_output.csv",row.names=FALSE,
na="")
tail(newdata2)
summary(newdata2)
#draw boxplots to show relationship between player function and
injury
par(mfrow=c(1,4))
barplot(table(subset(data,data$Player.position==1)$Injury),
col="pink", main="Goalkeeper",ylab="Number of players",
xlab="Injury")
barplot(table(subset(data,data$Player.position==2)$Injury),
col="lightblue",main="Defender",ylab="Number of players",
xlab="Injury")
barplot(table(subset(data,data$Player.position==3)$Injury),
col="lightyellow",main="Midfielder",ylab="Number of players",
xlab="Injury")
49
barplot(table(subset(data,data$Player.position==4)$Injury),
col="lightgreen",main="Forward",ylab="Number of players",
xlab="Injury")
#draw boxplots to show relationship between pitch quality and
injury
par(mfrow=c(1,4))
barplot(table(subset(data,data$PQ==1)$Injury), col="pink",
main="Pitch quality 1",ylab="Number of players",
xlab="Injury")
barplot(table(subset(data,data$PQ==2)$Injury),
col="lightblue",main="Pitch quality 2",ylab="Number of players",
xlab="Injury")
barplot(table(subset(data,data$PQ==3)$Injury),
col="lightyellow",main="Pitch quality 3",ylab="Number of players",
xlab="Injury")
barplot(table(subset(data,data$PQ==4)$Injury),
col="lightgreen",main="Pitch quality 4",ylab="Number of players",
xlab="Injury")
#linear regression
data2=read.csv("data_lin_reg.csv", header=TRUE)
50
fit <- lm(Injury ~ PQ + Age + Match.intensity +Player.position,
data=data2)
fit
summary(fit)
Recommended