Statistical Analysis of Google Flu Trends
A PROJECT
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Melissa Sandahl
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE
Yang Li and Kang James
July, 2016
© Melissa Sandahl 2016
ALL RIGHTS RESERVED
Acknowledgements
Firstly, I would like to thank my advisors Yang Li and Kang James for all of their
time, guidance and expert support while completing this project. I would also like to
thank Xuan Li for serving on my committee and for the positive support. Lastly, I am very
thankful for everything that the University’s Mathematics & Statistics department has
provided me with and helped me accomplish over the last two years.
Abstract
Predicting the behavior of influenza is crucial to helping health officials prepare for and
decrease possible outbreaks of the infectious disease. This project discusses methods
for testing Google flu count data taken from 2008 - 2014 for spatial autocorrelation,
seasonality and temporal effects. We will generate an appropriate seasonal ARIMA
model to fit the data for the overall nation as well as use the statistical program R to
develop multiple state models. Lastly, the Ljung-Box test will be applied to test for
goodness of fit and model adequacy. The goal of this project is to be able to forecast
future influenza outbreaks from Google flu trends across the United States in hopes of
increasing preparation standards.
Contents

Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
2 Spatial Data
  2.1 Spatial Weights Matrix
  2.2 Spatial Dependence
    2.2.1 Global Test for Spatial Autocorrelation
    2.2.2 Local Test for Spatial Autocorrelation
3 Time Series Analysis
4 Seasonal ARIMA Model
  4.1 The General Model
    4.1.1 Simple ARIMA Example
  4.2 ACF and PACF
  4.3 R ARIMA Model
  4.4 State Models
  4.5 US Model Accuracy
  4.6 State Model Accuracy
5 Model Forecasting
6 Conclusion and Discussion
References
Appendix A. Spatial Matrix
Appendix B. Moran I Proof

List of Tables

2.1 Global Moran Is: January 2008 - June 2011
2.2 Global Moran Is: July 2011 - December 2014
4.1 ARIMA model generated from ACF and PACF plots after differencing
4.2 Results from ARIMA model generated from R
4.3 Seasonal ARIMA models selected by R: Alabama - Montana
4.4 Seasonal ARIMA models selected by R: Nebraska - Wyoming
4.5 Seasonal ARIMA model statistics: Alabama - Montana
4.6 Seasonal ARIMA model statistics: Nebraska - Wyoming

List of Figures

2.1 Average Monthly Flu Counts from 2008 - 2014
2.2 Average Monthly Flu Counts Per Person from 2008 - 2014
2.3 Time Series of observed Global Moran Is
2.4 Local Moran I values for Alabama
2.5 Local Moran I values for Montana
2.6 Local Moran I values for North Dakota
2.7 Local Moran I values for South Dakota
2.8 Time series of Local Moran I values for Alabama, Montana and North Dakota
3.1 Time Series of original data collected
3.2 Time series of data after it was logged
4.1 ACF and PACF for US monthly data
4.2 Time Series of the data after 12 month differencing
4.3 ACF and PACF for US monthly data after differencing for 12 months
4.4 Time Series, ACF, and PACF of the model's residuals
5.1 Forecast for the $\text{ARIMA}(1,0,1)\times(2,0,0)_{12}$ model
A.1 Map of the continental states with corresponding numbers [13]
A.2 List of Neighbors for the 48 States [13]
Chapter 1
Introduction
Accurate models to help predict influenza outbreaks across the United States can help public health officials track and prepare for future events and hopefully decrease the number of deaths. Nowadays, large data sets for diseases and epidemics can be collected quickly and easily through internet-based programs. However, generating useful models to help predict epidemics depends largely on the availability and accuracy of this data [14].
The data for this project came from Google Flu Trends [5]. Google tracks the number of times the word 'flu' or flu-like symptoms are searched on its website and maintains a database with this information on a weekly basis. The data is collected both nationwide and for each individual state.
Previous studies have examined Google-search-based tracking models, such as the AutoRegression with Google data (ARGO) model developed by Yang, Santillana and Kou [14] and the Google Flu Trends state-space SEIR model by Dukic, Lopes and Polson [3]. Both models were created with hopes of tracking disease behavior at different temporal and spatial inputs.
We will start the project by using the data collected for each state and investigating whether any spatial autocorrelation is present. When testing for spatial autocorrelation, we must build a spatial weights matrix which depicts the locational similarities between the areas in the study. We will test for both global and local spatial autocorrelation in the data in order to get more accurate results.
Next, we will analyze the data as a seasonal time series. Using the ACF and PACF
plots, we will determine if any lags in the data are significant in developing an ap-
propriate seasonal ARIMA model that fits the data. For this project we will test the
model’s accuracy using the Ljung-Box test which will test if the data is independently
distributed.
Our goal for this project is to create a model that can forecast the future behavior of Google flu trends at the nationwide and/or state level. In doing so, we would be able to predict whether one state will have a rise in flu cases based on the frequency of Google flu counts in neighboring states and in previous time periods.
Chapter 2
Spatial Data
Spatial data is the term used to describe data that has a spatial or geographical component associated with it. These data can be characteristic or, more commonly, numerical observations.
There are two common spatial structures used when modeling regional spatial data. One method determines the proximity of areas based on the distances between the centroids of each areal unit. The observation is assumed to lie at the centroid of each region, and a spatial covariance structure is then developed based on the distances between the centroids. This strategy does not account for the fact that the centroid does not accurately describe the behavior across the entire region [13].
The second method, and the one used in this project, uses a neighborhood structure to create the spatial covariance matrix. Here the proximity of the areal units is determined by the borders shared in the regional lattice structure; regions are considered neighbors when they share a common border. Unfortunately, this method uses irregular lattices and has not been studied as in-depth as the first method.
In this project, we used area data of Google flu counts from the 48 continental U.S. states, collected from Google [5]. Area data is defined as the type of spatial data where the observations are associated with a fixed set of areal units. As stated before, we used a neighborhood structure that contained areas/zones with irregular boundaries [4].
2.1 Spatial Weights Matrix
We utilized what is known as a spatial weights matrix to describe the spatial relationships within our data. Since we wanted to investigate whether a rise in Google flu counts in one state would affect the states surrounding it, we used the neighbors of each state to generate our spatial weights matrix. Thus, when two states, areas $i$ and $j$, share a common border, the entries $W_{i,j}$ and $W_{j,i}$ are given a value of 1. All diagonal entries $W_{i,i}$ are assigned a value of 0 [4].
The spatial weights matrix $\tilde{W}$ of the 48 continental U.S. states is:

$$\tilde{W} = \begin{pmatrix} 0 & W_{1,2} & \cdots & W_{1,48} \\ W_{2,1} & 0 & \cdots & W_{2,48} \\ \vdots & \vdots & \ddots & \vdots \\ W_{48,1} & W_{48,2} & \cdots & 0 \end{pmatrix}, \qquad W_{i,j} = \begin{cases} 1 & \text{if area } j \text{ is a neighbor of area } i, \\ 0 & \text{otherwise.} \end{cases}$$
See Appendix A for a map of the 48 areal units labeled in alphabetical order and a list of all the neighbors for each state. Before moving on with the data, we row-standardized $\tilde{W}$ so that each entry can be interpreted as the portion of spatial influence that area $j$ has on area $i$.
2.2 Spatial Dependence
In order to create a useful model with this data, we must first check whether there is spatial dependence among the observations. We will investigate whether there is spatial autocorrelation in the number of Google flu counts based on the proximity of the 48 continental states. We will run two tests to check for spatial dependence: one on a global scale, and a second that looks at local spatial dependence for each location.
Before we ran any tests, we wanted to look at what was happening with the flu counts per person in each state at multiple time periods. Figure 2.2 below shows the average per-person counts for every other month across the United States. We observed that a larger percentage of people seem to search flu terms in the upper mid-west year round than in other areas across the country.
When looking for spatial dependence, we are comparing the similarity between the observations for each of the 48 locations. We will use the following variables to test for spatial autocorrelation:

- $n$: number of areas in the sample
- $i, j$: two different areas in the sample
- $z_i$: observation collected from area $i$
- $\bar{z}$: average of all $n$ observations
- $W_{ij}$: similarity in location of areas $i$ and $j$
- $M_{ij}$: similarity in observations of areas $i$ and $j$
To visualize the counts we collected, we took the average monthly data from 2008 - 2014 and generated density plots over the geographical region. Figure 2.1 displays the density counts for every other month.
We noticed that the states with the largest density of counts were the ones with the largest populations, e.g. Texas, Arizona, California. For that reason, Figure 2.1 is not useful for making any conjectures, since it would not be spatial effects that cause those states to have higher Google flu counts. To account for this, we divided the monthly counts by each state's population to obtain per capita data. We then generated the same density plots with the monthly average per capita data; see Figure 2.2.
When we graphed the newly transformed data, we saw that the clustering is no longer primarily in the southern region of the US; it now appears mainly in the upper mid-west region instead. One possible explanation is that this region generally has harsher winters and colder temperatures year round, which would increase the chances of contracting influenza.
[Figure 2.1: Average Monthly Flu Counts from 2008 - 2014. Six density maps of the continental United States (panels: average January, March, May, July, September and November counts; axes: longitude and latitude).]
[Figure 2.2: Average Monthly Flu Counts Per Person from 2008 - 2014. Six density maps of the continental United States (panels: average January, March, May, July, September and November per-person counts; axes: longitude and latitude).]
2.2.1 Global Test for Spatial Autocorrelation
Global spatial autocorrelation measures and tests use the entire spatial weights matrix $\tilde{W}$ to determine whether there is spatial autocorrelation over the total area in the study, whereas local measures calculate a statistic for each area in the study using a smaller, restricted set of areal units [4].
When measuring global spatial autocorrelation, we compared the similarities in the observations $M_{ij}$ with the similarities in the locations $W_{ij}$ by using the cross-product:

$$\sum_{i=1}^{n}\sum_{j=1}^{n} M_{ij}W_{ij} \tag{2.1}$$
The two most commonly used methods for measuring spatial autocorrelation among areal units are the Moran's I and Geary's c statistics. Both determine the overall degree of spatial correlation in the data set as a whole. For this project, we used Moran's I statistic to determine whether global spatial autocorrelation is present in our data. Moran's I uses cross-products to measure value similarity, $M_{ij} = (z_i - \bar z)(z_j - \bar z)$, whereas Geary's c uses squared differences such as $(z_i - z_j)^2$.
The global Moran I statistic is:

$$I = \frac{n \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}(z_i - \bar z)(z_j - \bar z)}{\sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij}\ \sum_{i=1}^{n}(z_i - \bar z)^2} \tag{2.2}$$

$$E[I] = -\frac{1}{n-1} \tag{2.3}$$

$$\operatorname{var}(I) = \frac{n^2(n-1)W_1 - n(n-1)W_2 - 2W_0^2}{(n+1)(n-1)^2 W_0^2} \tag{2.4}$$

where

$$W_0 = \sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij} \tag{2.5}$$

$$W_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j \neq i}^{n}(W_{ij} + W_{ji})^2 \tag{2.6}$$

$$W_2 = \sum_{k=1}^{n}\left(\sum_{j=1}^{n} W_{kj} + \sum_{i=1}^{n} W_{ik}\right)^2 \tag{2.7}$$
Please see Appendix B for the proof of the expected value for Moran I.
The null hypothesis associated with the Moran I statistic is that the spatial process generating the observed values is random and there is no spatial autocorrelation. To test for the significance of spatial autocorrelation, R randomly assigns the observations to the areal units and calculates the observed Moran I for a large number of these random assignments. The actual observed Moran I is then compared to this random set of I values; if it falls below the 5th percentile or above the 95th percentile, then there is spatial autocorrelation present at the α = 0.05 level [6]. Therefore, we looked for significant p-values (< 0.05), which would imply that we could reject the null hypothesis and conclude that there is spatial autocorrelation present [4].
When spatial autocorrelation is present in large data sets, the observed Moran I statistic will be large relative to its expected value under the null hypothesis of no spatial relation. To see this, consider two neighboring areas $i$ and $j$ that both have high observation values. Both will be larger than the average $\bar z$, so the cross-product $(z_i - \bar z)(z_j - \bar z)$ will be a large positive value.
Using the Moran.I() command in the package {ape} in R, we found the global Moran I statistics for all 84 time periods in the study. These results are shown in Tables 2.1 and 2.2 below.
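A minimal sketch of one month's computation, assuming a vector z of the 48 per-capita counts and the row-standardized matrix W_std from Section 2.1 (the object names are ours, not the author's):

```r
library(ape)

# Global Moran's I for a single month of per-capita flu counts.
result <- Moran.I(z, weight = W_std)

# The returned components match the columns of Tables 2.1 and 2.2:
result$observed  # observed I
result$expected  # expected I, equal to -1/(n - 1)
result$sd        # standard deviation
result$p.value   # p-value
```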
Date Observed I Expected I St. Dev P val
Jan-08 0.066 -0.021 0.096 0.367
Feb-08 0.138 -0.021 0.095 0.094
Mar-08 0.198 -0.021 0.092 0.018
Apr-08 0.119 -0.021 0.093 0.133
May-08 0.062 -0.021 0.092 0.364
Jun-08 0.039 -0.021 0.091 0.509
Jul-08 0.024 -0.021 0.089 0.615
Aug-08 -0.003 -0.021 0.086 0.831
Sep-08 -0.036 -0.021 0.086 0.864
Oct-08 -0.054 -0.021 0.088 0.714
Nov-08 -0.047 -0.021 0.091 0.778
Dec-08 0.017 -0.021 0.093 0.682
Jan-09 0.023 -0.021 0.090 0.627
Feb-09 0.098 -0.021 0.094 0.208
Mar-09 0.115 -0.021 0.093 0.141
Apr-09 0.141 -0.021 0.095 0.088
May-09 0.050 -0.021 0.091 0.432
Jun-09 0.121 -0.021 0.089 0.111
Jul-09 0.018 -0.021 0.087 0.655
Aug-09 0.096 -0.021 0.095 0.218
Sep-09 -0.001 -0.021 0.097 0.837
Oct-09 0.183 -0.021 0.094 0.030
Nov-09 0.133 -0.021 0.093 0.098
Dec-09 -0.026 -0.021 0.093 0.958
Jan-10 -0.030 -0.021 0.091 0.924
Feb-10 -0.011 -0.021 0.095 0.910
Mar-10 0.016 -0.021 0.094 0.694
Apr-10 0.041 -0.021 0.093 0.505
May-10 0.034 -0.021 0.092 0.544
Jun-10 0.012 -0.021 0.091 0.715
Jul-10 0.014 -0.021 0.090 0.697
Aug-10 0.007 -0.021 0.089 0.751
Sep-10 -0.038 -0.021 0.092 0.858
Oct-10 -0.036 -0.021 0.093 0.872
Nov-10 -0.030 -0.021 0.093 0.929
Dec-10 -0.001 -0.021 0.095 0.833
Jan-11 -0.020 -0.021 0.095 0.986
Feb-11 0.085 -0.021 0.096 0.270
Mar-11 0.188 -0.021 0.095 0.027
Apr-11 0.147 -0.021 0.095 0.076
May-11 0.079 -0.021 0.094 0.285
Jun-11 0.039 -0.021 0.092 0.510
Table 2.1: Global Moran Is: January 2008 - June 2011
Date Observed I Expected I St. Dev P val
Jul-11 0.018 -0.021 0.089 0.659
Aug-11 -0.008 -0.021 0.089 0.884
Sep-11 -0.051 -0.021 0.091 0.745
Oct-11 -0.040 -0.021 0.090 0.832
Nov-11 -0.030 -0.021 0.091 0.923
Dec-11 -0.043 -0.021 0.094 0.816
Jan-12 -0.019 -0.021 0.094 0.983
Feb-12 0.017 -0.021 0.095 0.691
Mar-12 0.123 -0.021 0.095 0.128
Apr-12 0.050 -0.021 0.094 0.450
May-12 0.021 -0.021 0.093 0.651
Jun-12 0.015 -0.021 0.090 0.685
Jul-12 0.004 -0.021 0.087 0.775
Aug-12 -0.031 -0.021 0.087 0.911
Sep-12 -0.072 -0.021 0.089 0.569
Oct-12 -0.010 -0.021 0.092 0.899
Nov-12 0.050 -0.021 0.097 0.461
Dec-12 0.089 -0.021 0.096 0.250
Jan-13 0.182 -0.021 0.096 0.034
Feb-13 0.071 -0.021 0.096 0.338
Mar-13 0.000 -0.021 0.094 0.824
Apr-13 0.002 -0.021 0.093 0.805
May-13 -0.008 -0.021 0.090 0.887
Jun-13 -0.005 -0.021 0.090 0.861
Jul-13 -0.013 -0.021 0.091 0.931
Aug-13 0.002 -0.021 0.095 0.808
Sep-13 -0.079 -0.021 0.094 0.538
Oct-13 -0.045 -0.021 0.095 0.802
Nov-13 -0.024 -0.021 0.095 0.977
Dec-13 0.199 -0.021 0.097 0.022
Jan-14 0.064 -0.021 0.096 0.376
Feb-14 -0.006 -0.021 0.093 0.870
Mar-14 -0.022 -0.021 0.094 0.990
Apr-14 -0.008 -0.021 0.092 0.887
May-14 -0.036 -0.021 0.092 0.877
Jun-14 0.001 -0.021 0.094 0.813
Jul-14 -0.016 -0.021 0.092 0.958
Aug-14 -0.061 -0.021 0.083 0.637
Sep-14 -0.027 -0.021 0.084 0.950
Oct-14 0.002 -0.021 0.095 0.804
Nov-14 -0.019 -0.021 0.095 0.983
Dec-14 0.151 -0.021 0.095 0.068
Table 2.2: Global Moran Is: July 2011 - December 2014
A graph of the observed global Moran I values for all 84 time periods is shown below in Figure 2.3, along with a reference line for the expected global Moran I and bands one standard deviation above and below the observed values. Notice that none of the observed global Moran values are even one standard deviation away from the expected value, although we did find five months that displayed significant spatial autocorrelation among the I values in Tables 2.1 and 2.2. This result is not very strong, and we could not conclude that there is global spatial autocorrelation in the data. Thus, we decided to test each state for any local spatial autocorrelation.
[Figure 2.3: Time Series of observed Global Moran Is (y-axis: observed global Moran I; x-axis: time, 2008 - 2014).]
2.2.2 Local Test for Spatial Autocorrelation
Since we did not find any significant global spatial autocorrelation, our next step was to test for signs of local spatial autocorrelation. Local statistics are used to determine whether each areal unit has a large amount of spatial clustering, or whether there are notable similarities in the observations of surrounding areas. Local indicators of spatial association (LISA) were created by Luc Anselin to determine the influence of each individual observation rather than looking at the entire sample area [2].
The test for local spatial autocorrelation utilizes the cross-product:

$$\sum_{j=1}^{n} M_{ij}W_{ij} \tag{2.8}$$

which compares spatial autocorrelation for a specific observation or areal unit. As in the global test, $M_{ij} = (z_i - \bar z)(z_j - \bar z)$. We now also have neighborhood sets $J_i$, the collection of neighbors for area $i$, and $\bar z$ now represents the average observation value for just area $i$'s neighboring states.
The local Moran $I_i$ statistic for area $i$ is:

$$I_i = (z_i - \bar z) \sum_{j \in J_i} W_{ij}(z_j - \bar z) \tag{2.9}$$

$$E[I_i] = -\frac{1}{n-1}\sum_{j=1}^{n} W_{ij} \tag{2.10}$$

$$\operatorname{var}[I_i] = \frac{W_{i(2)}(n - b_2)}{n-1} + \frac{2\,W_{i(kh)}(2b_2 - n)}{(n-1)(n-2)} - \frac{\tilde{W}_i^{\,2}}{(n-1)^2} \tag{2.11}$$

where (in Anselin's notation, $b_2$ is the sample kurtosis coefficient $m_4/m_2^2$ [2]) and

$$W_{i(2)} = \sum_{j \neq i}^{n} W_{ij}^2 \tag{2.12}$$

$$2\,W_{i(kh)} = \sum_{k \neq i}^{n}\sum_{h \neq i}^{n} W_{ik}W_{ih} \tag{2.13}$$

$$\tilde{W}_i = \sum_{j=1}^{n} W_{ij} \tag{2.14}$$
One thing to notice is that the sum of $I_i$ over all areas is proportional to the global Moran I statistic we used in Equation 2.2:

$$\sum_{i=1}^{n} I_i = \sum_{i=1}^{n} (z_i - \bar z)\sum_{j \in J_i} W_{ij}(z_j - \bar z) \tag{2.15}$$
Using the localmoran() command in the package {spdep} in R, we again calculated Moran I statistics for each of the 84 time periods. However, when testing for local spatial autocorrelation we generated a test statistic for each of the 48 areas at each time period, for a total of 4,032 local Moran I statistics.
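A minimal sketch of one month's computation, assuming a neighbor list nb built from the borders in Appendix A and a vector z of the 48 per-capita counts (object names are ours):

```r
library(spdep)

# Row-standardized weights in list form, built from the neighbor list.
lw <- nb2listw(nb, style = "W")

# Local Moran's I for a single month; one row of output per state.
local_I <- localmoran(z, listw = lw)

# The first column holds the observed I_i for each state; the expectation
# and variance columns are what flag significant local clustering.
head(local_I)
```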
From these results, we generated a graph for each state that includes the calculated test statistics along with the expected $I_i$ value and bands one standard deviation above and below it for reference. From these, we found that only six states were consistently further than one standard deviation away from the expected value: Delaware, Montana, North Dakota, South Dakota, Texas and Wyoming. The corresponding graphs for Montana, North Dakota and South Dakota are displayed below, along with the plot for Alabama, which did not display significant spatial autocorrelation.
[Figure 2.4: Local Moran I values for Alabama (y-axis: observed local Moran Is; x-axis: time, 2008 - 2014).]

[Figure 2.5: Local Moran I values for Montana (y-axis: observed local Moran Is; x-axis: time, 2008 - 2014).]
[Figure 2.6: Local Moran I values for North Dakota (y-axis: observed local Moran Is; x-axis: time, 2008 - 2014).]

[Figure 2.7: Local Moran I values for South Dakota (y-axis: observed local Moran Is; x-axis: time, 2008 - 2014).]
[Figure 2.8: Time series of Local Moran I values for Alabama, Montana and North Dakota (y-axis: observed local Moran Is; x-axis: time, 2008 - 2014).]
Figure 2.8 displays the trends in the observed local Moran Is for three of the states. Alabama is used as a reference for a state that does not display any local spatial autocorrelation, whereas both Montana's and North Dakota's Moran Is indicated significant local spatial autocorrelation.
However, despite having six states with significant local spatial autocorrelation, our results are similar to those for global spatial autocorrelation. We concluded that our data did not show enough significant local spatial autocorrelation to continue to include it in our model building.
Chapter 3
Time Series Analysis
For this project, we use average monthly data observed over 7 years, for a total of 84 data points for each location. Since the data we collected was weekly, we manually converted it to monthly averages. Weekly data was listed by the first day of the week that the counts started; thus, if a week started at the end of January but also contained days in February, it was only counted toward January's average. A time series plot of the raw data is shown in Figure 3.1 below. Notice that there is a large amount of variance between the peaks.
Looking at Figure 3.2, we can see that after we take the log of the data there is much less variance in the output. The data now ranges roughly from 6 - 10, versus 50 - 1100 for the raw data. Also, since our data are counts, taking the log helps normalize them. We will continue the project using the log-transformed data to generate useful models.
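A minimal sketch of this transformation, assuming the monthly averages sit in a CSV with a count column (the file and column names are hypothetical):

```r
# Hypothetical input: one row per month, January 2008 - December 2014.
us_raw <- read.csv("us_monthly_flu.csv")

# Build the monthly time series and take logs to stabilize the variance.
us_ts <- ts(log(us_raw$count), start = c(2008, 1), frequency = 12)
plot(us_ts, main = "Monthly Average US Logged Data")
```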
Another option would have been to difference the data. Differencing is commonly used with seasonal time series: seasonality typically makes a time series nonstationary, since it is the main factor affecting the output at certain time periods. Differencing, computing the difference between consecutive observations, can help make a time series stationary and remove patterns in the data. However, we decided not to difference here and to keep the seasonal effects in the data, for modeling purposes we discuss later in the project [7].
Next, notice that there is an obvious spike in the data that occurs once every year. These spikes represent flu season, which typically peaks between December and February, with the majority of years peaking in February [12]. Because of this, we incorporated a 12 month seasonal effect into our models.
Another observation is the atypical spike for the 2012 - 2013 flu season. According to the Centers for Disease Control and Prevention (CDC), the flu vaccine that year was 52% effective at preventing acute respiratory illness that required medical attention. However, the vaccine's effectiveness was lower for those aged 65 years and older, which could have been one factor in that flu season's severity [12].
[Figure 3.1: Time Series of original data collected (y-axis: raw monthly average flu counts, roughly 2000 - 10000; x-axis: 2008 - 2014).]

[Figure 3.2: Time series of data after it was logged (y-axis: logged monthly average flu counts, roughly 6.0 - 9.0; x-axis: 2008 - 2014).]
Chapter 4
Seasonal ARIMA Model
Since we are using monthly data to determine a model for predicting Google flu counts, we would expect our model to be seasonal. Seasonality in a time series means there is a pattern that repeats every $m$ time periods. Since flu season generally peaks around the same time each year, we would expect our data to have a recurring pattern every 12 months.
4.1 The General Model
The seasonal ARIMA model includes both seasonal and non-seasonal autoregressive and moving average components. The model also incorporates differencing, which was mentioned earlier. The backshift operator, $B$, is defined as:

$$B^m x_t = x_{t-m} \tag{4.1}$$

The notation for the general model is:

$$\text{ARIMA}(p, d, q)\times(P, D, Q)_m$$
- $p$: non-seasonal AR order
- $d$: non-seasonal differencing
- $q$: non-seasonal MA order
- $P$: seasonal AR order
- $D$: seasonal differencing
- $Q$: seasonal MA order
- $m$: number of months per season
The model can be written as:

$$\Phi(B^m)\,\phi(B)\,(1-B)^d\,(1-B^m)^D\,(x_t - \mu) = \Theta(B^m)\,\theta(B)\,w_t \tag{4.2}$$
The non-seasonal AR component is written as:

$$\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p \tag{4.3}$$

The non-seasonal MA component is written as:

$$\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q \tag{4.4}$$

The seasonal AR component is written as:

$$\Phi(B^m) = 1 - \Phi_1 B^m - \dots - \Phi_P B^{mP} \tag{4.5}$$

The seasonal MA component is written as:

$$\Theta(B^m) = 1 + \Theta_1 B^m + \dots + \Theta_Q B^{mQ} \tag{4.6}$$
4.1.1 Simple ARIMA Example
To illustrate a simple example of a seasonal ARIMA model, consider an $\text{ARIMA}(0,1,1)\times(1,0,1)_{12}$ model. Our model starts as:

$$\Phi(B^{12})(1-B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.7}$$

Now, replace $(x_t - \mu)$ with $z_t$ and substitute in the appropriate polynomials to get:

$$(1 - \Phi B^{12})(1-B)z_t = (1 + \Theta B^{12})(1 + \theta B)w_t \tag{4.8}$$

$$(1 - B - \Phi B^{12} + \Phi B^{13})z_t = (1 + \theta B + \Theta B^{12} + \Theta\theta B^{13})w_t \tag{4.9}$$

Thus, we get the resulting equation:

$$z_t = z_{t-1} + \Phi z_{t-12} - \Phi z_{t-13} + w_t + \theta w_{t-1} + \Theta w_{t-12} + \Theta\theta w_{t-13} \tag{4.10}$$
4.2 ACF and PACF
Next, we created ACF and PACF plots for the nationwide data which are shown in
Figure 4.1.
[Figure 4.1: ACF and PACF for US monthly data (series: USData; lags 0 - 30).]
We observed that the ACF plot in Figure 4.1 has cyclic spikes that do not seem to dampen in magnitude at any lag value. Thus, it would be difficult to determine a model for this data from these plots alone.
With seasonal data, it is often helpful to analyze the ACF and PACF plots after differencing. Seasonality commonly brings about nonstationarity, since it is typical for seasonal data to fluctuate and peak during certain time periods. Thus, we applied a 12 month difference to our data; the resulting time series plot can be found in Figure 4.2. Notice that the new plot no longer has the consistent peak-and-valley pattern present in Figure 3.2. We then generated ACF and PACF plots for the differenced data, which can be found in Figure 4.3.
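A minimal sketch of this step, assuming the logged series us_ts from Chapter 3 (object names are ours):

```r
# 12 month seasonal difference of the logged series (Figure 4.2) and
# the ACF/PACF of the differenced data (Figure 4.3).
us_diff <- diff(us_ts, lag = 12)
plot(us_diff)
acf(us_diff, lag.max = 36)
pacf(us_diff, lag.max = 36)
```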
[Figure 4.2: Time Series of the data after 12 month differencing (y-axis: 12 month differenced flu counts, roughly -1.5 to 1.5; x-axis: 2008 - 2014).]

[Figure 4.3: ACF and PACF for US monthly data after the 12 month difference (lags 0 - 30).]
From analyzing Figure 4.3 we were able to make an educated guess at the parameters to include in a seasonal ARIMA model. We used the ACF plot to determine the seasonal and non-seasonal MA orders, and the PACF plot for the number of seasonal and non-seasonal AR parameters.
Looking at just the first couple of lags in the ACF, we determined that a non-seasonal MA order of 1 would fit, since there is only a spike at lag 1. Next, we analyzed the lags at 12, 24 and 36 and decided to use a seasonal MA order of 1 as well, since the lag at 12 seemed to be the only significant seasonal lag.
We used the same methodology with the PACF plot to pick the AR components of the model. Thus, we decided to use a non-seasonal AR order of 2 and a seasonal AR order of 0.
Thus, we created the following model:

$$\text{ARIMA}(2, 0, 1)\times(0, 0, 1)_{12}$$

Applying these parameters to equation (4.2) we get:

$$\phi(B)(x_t - \mu) = \Theta(B^{12})\,\theta(B)\,w_t \tag{4.11}$$

$$(1 - \phi_1 B - \phi_2 B^2)(x_t - \mu) = (1 + \Theta B^{12})(1 + \theta B)w_t \tag{4.12}$$

For simplicity, let $z_t = (x_t - \mu)$ and multiply the two polynomials on the right side:

$$(1 - \phi_1 B - \phi_2 B^2)z_t = (1 + \theta B + \Theta B^{12} + \Theta\theta B^{13})w_t \tag{4.13}$$

$$z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + w_t + \theta w_{t-1} + \Theta w_{t-12} + \Theta\theta w_{t-13} \tag{4.14}$$
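A minimal sketch of fitting this hand-identified model in R, again assuming the logged series us_ts; the estimates it returns correspond to those reported in Table 4.1:

```r
# Fit the ARIMA(2,0,1) x (0,0,1)_12 model identified from the plots.
fit_manual <- arima(us_ts, order = c(2, 0, 1),
                    seasonal = list(order = c(0, 0, 1), period = 12))
fit_manual  # coefficient estimates and their standard errors
```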
Type Coef S.E.
AR 1 (φ1) 1.0545 0.1558
AR 2 (φ2) -0.4153 0.1504
MA 1 (θ) 0.5563 0.1603
SMA 1 (Θ) 0.5034 0.1276
Constant 7.1957 0.1713
AIC=26.95 σ̂2=0.0651
Table 4.1: ARIMA model generated from ACF and PACF plots after differencing
Substituting the coefficient estimates from Table 4.1 into equation (4.14), our final fitted model is:

$$z_t = 1.0545\,z_{t-1} - 0.4153\,z_{t-2} + w_t + 0.5563\,w_{t-1} + 0.5034\,w_{t-12} + 0.2800\,w_{t-13} \tag{4.15}$$
4.3 R ARIMA Model
Now that we have analyzed the ACF and PACF plots for the data and generated a model from them, we decided to see what model R would choose to fit the data. We used the auto.arima() command in the package {forecast} in R to develop a potentially different model. This function selects the best model based on the Akaike information criterion (AIC). The AIC estimates the quality of a model relative to other models: it determines which model is the most useful out of a set of candidate models, but it does not determine the significance of an individual model.
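A minimal sketch of this search, again assuming the logged series us_ts (object names are ours):

```r
library(forecast)

# Search over candidate seasonal ARIMA models by AIC and keep the best.
fit <- auto.arima(us_ts, seasonal = TRUE)
summary(fit)  # reports the chosen orders, coefficients and AIC
```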
Our results were as follows:
$$\text{ARIMA}(1, 0, 1)\times(2, 0, 0)_{12}$$

Applying these parameters to equation (4.2) we get:

$$\Phi(B^{12})\,\phi(B)(x_t - \mu) = \theta(B)\,w_t \tag{4.16}$$

$$(1 - \Phi_1 B^{12} - \Phi_2 B^{24})(1 - \phi B)(x_t - \mu) = (1 + \theta B)w_t \tag{4.17}$$

For simplicity, let $z_t = (x_t - \mu)$ and multiply the two polynomials on the left side to get:

$$(1 - \Phi_1 B^{12} - \Phi_2 B^{24} - \phi B + \Phi_1\phi B^{13} + \Phi_2\phi B^{25})z_t = (1 + \theta B)w_t \tag{4.18}$$

$$z_t = \Phi_1 z_{t-12} + \Phi_2 z_{t-24} + \phi z_{t-1} - \Phi_1\phi z_{t-13} - \Phi_2\phi z_{t-25} + w_t + \theta w_{t-1} \tag{4.19}$$
Type Coef S.E.
AR 1 (φ) 0.6375 0.0913
MA 1 (θ) 0.7387 0.0920
SAR 1 (Φ1) 0.3779 0.1091
SAR 2 (Φ2) 0.3785 0.1251
constant 7.2354 0.3217
AIC=15.57 σ̂2=0.0568
Table 4.2: Results from ARIMA model generated from R
Our final model is:

$$z_t = 0.3779\,z_{t-12} + 0.3785\,z_{t-24} + 0.6375\,z_{t-1} - 0.2409\,z_{t-13} - 0.2413\,z_{t-25} + w_t + 0.7387\,w_{t-1} \tag{4.20}$$
Our first model, developed by analyzing the ACF and PACF plots, had an AIC value of 26.95, whereas the model chosen by R had an AIC value of only 15.57. Thus, we concluded that the model generated by R, $\text{ARIMA}(1, 0, 1)\times(2, 0, 0)_{12}$, is a better model for the data than the one we identified ourselves.
4.4 State Models
Next, we ran auto.arima() on each of the 48 individual states to obtain state-specific models. The models R chose for each state are listed below in Tables 4.3 and 4.4.
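A minimal sketch of this per-state loop, assuming state_ts is a list of the 48 logged monthly state series (object names are ours):

```r
library(forecast)

# Fit a seasonal ARIMA model to each state's series and collect the
# AIC values reported in Tables 4.3 and 4.4.
state_fits <- lapply(state_ts, auto.arima)
sapply(state_fits, AIC)
```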
State AR MA SAR SMA Period NonSDiff SDiff AIC
Alabama 1 1 0 2 12 0 1 56.393
Arizona 2 0 1 1 12 0 0 51.056
Arkansas 1 3 1 2 12 0 0 45.365
California 2 0 2 0 12 0 0 19.249
Colorado 1 1 1 1 12 0 0 52.624
Connecticut 1 1 2 0 12 0 0 92.317
Delaware 1 1 2 0 12 0 0 16.141
Florida 1 1 1 0 12 0 1 -10.093
Georgia 1 1 2 0 12 0 0 42.990
Idaho 3 0 2 0 12 0 0 73.055
Illinois 1 1 0 1 12 0 1 21.765
Indiana 2 0 1 0 12 0 1 45.295
Iowa 1 1 1 0 12 0 1 56.480
Kansas 1 1 0 1 12 0 1 64.199
Kentucky 2 0 2 0 12 0 0 73.715
Louisiana 1 1 2 0 12 0 0 31.711
Maine 0 2 2 0 12 0 0 89.355
Maryland 1 1 1 0 12 0 1 -0.696
Massachusetts 2 0 1 0 12 0 1 75.148
Michigan 1 1 1 0 12 0 1 4.383
Minnesota 1 1 2 0 12 0 0 60.982
Mississippi 2 0 2 0 12 0 0 75.025
Missouri 1 1 0 2 12 0 1 34.746
Montana 3 1 1 2 12 0 0 79.302
Table 4.3: Seasonal ARIMA models selected by R: Alabama - Montana
State AR MA SAR SMA Period NonSDiff SDiff AIC
Nebraska 2 0 1 0 12 0 0 85.521
Nevada 2 0 2 0 12 0 0 29.620
New Hampshire 1 1 2 0 12 0 0 100.658
New Jersey 1 1 1 2 12 0 0 62.443
New Mexico 1 2 1 1 12 0 0 30.337
New York 1 1 2 0 12 0 0 35.076
North Carolina 1 1 2 0 12 0 0 62.608
North Dakota 1 1 2 0 12 0 1 83.835
Ohio 2 0 2 0 12 0 0 37.356
Oklahoma 4 0 2 0 12 0 0 62.203
Oregon 2 0 2 0 12 0 0 65.044
Pennsylvania 1 1 2 0 12 0 0 -9.615
Rhode Island 2 0 1 0 12 0 1 92.514
South Carolina 2 0 1 1 12 0 0 77.623
South Dakota 2 0 1 1 12 0 0 100.126
Tennessee 1 1 1 1 12 0 0 63.286
Texas 1 1 2 0 12 0 0 30.807
Utah 0 2 1 2 12 0 0 62.374
Vermont 2 1 1 0 12 0 0 80.656
Virginia 2 0 2 0 12 0 0 6.013
Washington 1 1 1 2 12 0 0 36.091
West Virginia 2 0 2 0 12 0 0 40.571
Wisconsin 1 1 1 0 12 0 1 30.775
Wyoming 2 1 0 0 12 0 0 70.470
Table 4.4: Seasonal ARIMA models selected by R: Nebraska - Wyoming
Notice that the states display fairly different model specifications, with a wide variety of models across the 48 states. The $\text{ARIMA}(1, 0, 1)\times(2, 0, 0)_{12}$ model does appear to be the most common in Tables 4.3 and 4.4, which makes sense since this is also the model R chose to describe the behavior of the overall country's flu counts. However, 13 states include a seasonal difference in their model, which is something we did not include in ours.
4.5 US Model Accuracy
We then made a time series plot of the residuals for this model as well as the ACF
and PACF plots which are shown in Figure 4.4 below.
[Figure 4.4: Time Series, ACF, and PACF of the model's residuals (residual series over the 84 months; ACF and PACF over lags 0 - 25).]
There does not appear to be any trend in the time series of the residuals, which indicates a decent model. We also notice that there are no significant spikes in either the ACF or the PACF plot, which means no patterns remain in the lags of the model's residuals. Based on these plots, we concluded that this model sufficiently describes the data.
Next, we ran a Ljung-Box test on the model to help us further determine the model's usefulness [8]. The Ljung-Box test is defined as:

$H_0$: The data are independently distributed.
$H_a$: The data are not independently distributed (correlation is present).

with the test statistic:

$$Q = T(T+2)\sum_{g=1}^{h} \frac{\hat{\rho}_g^{\,2}}{T-g}$$

where

- $T$: length of the time series
- $\hat{\rho}_g$: sample autocorrelation at lag $g$
- $h$: number of lags being tested
- $K$: number of model parameters
Under the null hypothesis, the test statistic $Q$ follows a Chi-squared distribution, typically with $h$ degrees of freedom, $\chi^2(h)$. However, since we are using this test on an ARIMA model, we are testing whether the errors resemble independence and must adjust the degrees of freedom accordingly. For a seasonal ARIMA model, the degrees of freedom are the number of lags being tested less the number of model parameters. Thus, with a significance level of $\alpha$, the rejection region for the hypothesis of randomness in the residuals is:

$$Q > \chi^2_{1-\alpha,\,h-K}$$
According to Rob Hyndman and George Athanasopoulos [9], there is no standard value one should use for $h$, but as a rule of thumb for seasonal data one should let $h = \min(2m, T/5)$, where $T$ is the length of the time series.
Applying this test to our model generated from R at the $\alpha = 0.05$ level, we get:

$$h = \min(2 \cdot 12,\ 84/5) \approx 17,$$
$$Q = 8.1738,$$
$$\chi^2_{0.95,\,13} = 22.36,$$
$$p\text{-value} = 0.8321,$$
$$\text{AIC} = 15.57.$$
We would reject the hypothesis of randomness in the residuals if $Q > 22.36$. Since the test statistic for our model is 8.1738, we do not reject the null hypothesis and conclude that the residuals appear independently distributed.
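A minimal sketch of this check, assuming fit is the auto.arima() model from Section 4.3; fitdf subtracts the K = 4 estimated ARMA parameters from the h = 17 lags being tested:

```r
# Ljung-Box test on the residuals; the reported statistic and p-value
# correspond to the Q and p-value quoted above.
Box.test(residuals(fit), lag = 17, type = "Ljung-Box", fitdf = 4)
```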
4.6 State Model Accuracy
Next, we ran the Ljung-Box test on the models that R generated for each individual state. The test statistics, degrees of freedom, p-values and AIC values are listed in Tables 4.5 and 4.6 below. Using the same test as for the overall US model, we reject the hypothesis of independence only for the state of Arkansas at the α = 0.05 level. We conclude that the residuals for the remaining 47 states appear independently distributed and that those models are useful for our data.
State Q-statistic DF P val AIC
Alabama 13.478 14 0.489 56.393
Arizona 9.418 13 0.741 51.056
Arkansas 19.891 10 0.030 45.365
California 8.344 13 0.820 19.249
Colorado 4.444 13 0.985 52.624
Connecticut 16.314 13 0.233 92.317
Delaware 10.380 13 0.663 16.141
Florida 9.847 15 0.829 -10.093
Georgia 13.392 13 0.418 42.990
Idaho 9.476 12 0.662 73.055
Illinois 14.532 15 0.486 21.765
Indiana 14.995 15 0.452 45.295
Iowa 11.571 15 0.711 56.480
Kansas 8.317 15 0.910 64.199
Kentucky 10.906 13 0.619 73.715
Louisiana 7.865 13 0.852 31.711
Maine 11.078 13 0.604 89.355
Maryland 7.327 15 0.948 -0.696
Massachusetts 10.519 15 0.786 75.148
Michigan 13.373 15 0.574 4.383
Minnesota 9.025 13 0.771 60.982
Mississippi 21.479 13 0.064 75.025
Missouri 11.488 14 0.647 34.746
Montana 6.296 10 0.790 79.302
Table 4.5: Seasonal ARIMA model statistics: Alabama - Montana
State Q-statistic DF P val AIC
Nebraska 16.957 14 0.258 85.521
Nevada 3.778 13 0.993 29.620
New Hampshire 12.350 13 0.499 100.658
New Jersey 10.271 12 0.592 62.443
New Mexico 3.737 12 0.988 30.337
New York 14.578 13 0.334 35.076
North Carolina 12.101 13 0.519 62.608
North Dakota 11.379 14 0.656 83.835
Ohio 17.240 13 0.189 37.356
Oklahoma 11.168 11 0.429 62.203
Oregon 10.200 13 0.677 65.044
Pennsylvania 12.022 13 0.526 -9.615
Rhode Island 11.614 15 0.708 92.514
South Carolina 6.925 13 0.906 77.623
South Dakota 10.969 13 0.613 100.126
Tennessee 9.210 13 0.757 63.286
Texas 5.074 13 0.974 30.807
Utah 9.045 12 0.699 62.374
Vermont 13.333 13 0.422 80.656
Virginia 13.872 13 0.383 6.013
Washington 7.749 12 0.804 36.091
West Virginia 7.615 13 0.868 40.571
Wisconsin 10.877 15 0.761 30.775
Wyoming 8.633 14 0.854 70.470
Table 4.6: Seasonal ARIMA model statistics: Nebraska - Wyoming
Chapter 5
Model Forecasting
Our final task is to forecast future events from the model that we generated. To do this, we used R's forecast() command and applied it to our $\text{ARIMA}(1, 0, 1)\times(2, 0, 0)_{12}$ model. The resulting forecast is shown below in Figure 5.1.
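A minimal sketch of this step, assuming fit is the fitted model from Section 4.3 (object names are ours):

```r
library(forecast)

# 24 month ahead forecast with 80% and 95% prediction intervals,
# plotted over the observed series as in Figure 5.1.
fc <- forecast(fit, h = 24, level = c(80, 95))
plot(fc)
```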
[Figure 5.1: 24 month forecast for the $\text{ARIMA}(1, 0, 1)\times(2, 0, 0)_{12}$ model (y-axis: logged counts, 6.0 - 9.0; x-axis: 2008 - 2016).]
The black line on the left represents the actual data that we collected, and the blue line gives the fitted values of the model for 24 months into the future. The shaded areas are prediction intervals for the forecast: the light grey region is the 95% prediction interval and the purple area is the 80% prediction interval for the forecasted data.
The dotted line in Figure 5.1 is data that we collected from Google for January 2015 - July 2015. From this graph, the forecast looks to be a good fit: the newly observed values stay inside both the 80% and 95% prediction intervals.
Chapter 6
Conclusion and Discussion
Starting with weekly flu count data for 2008 - 2014, we converted it to average monthly data and log-transformed it to normalize and reduce the variance in the data. We were then left with a seasonal time series ready to be analyzed for spatial and temporal effects.
Our first step was to test for both global and local spatial autocorrelation, using Moran's I statistic for both. Unfortunately, we did not find enough evidence of significant global spatial autocorrelation, and although a handful of states displayed local spatial autocorrelation, we decided this was not enough evidence to continue using spatial analysis in our model. Thus, we continued without spatial effects and focused on temporal effects alone.
We created two potential seasonal ARIMA models: one from analyzing the ACF and PACF plots of the seasonally differenced data, and the other generated by R. We then used the AIC values to choose between the two; the lower AIC value led us to choose the model that R developed for the data.
Once we had our final model, we applied the Ljung-Box test to determine the overall quality of the seasonal ARIMA model. This test showed that the residuals appear independently distributed at the α = 0.05 level: we failed to reject the null hypothesis of independence, with no evidence of remaining correlation. Thus, our model was an adequate fit for the data.
Lastly, we created a 24 month forecast from the model. We generated the 80% and 95% prediction intervals based on the 24 month forecast and then plotted 7 months of data from 2015 to see how well the model could forecast. The actual data from January - July 2015 stayed within both prediction intervals, so we accepted this forecast and would suggest using it for predicting future Google flu trends across the United States.
References
[1] Anselin, Luc. Spatial Econometrics A Companion to Theoretical Econometrics.
Chapter 14 (2003), 310-325.
[2] Anselin, Luc. Local Indicators of Spatial Association - LISA. Geographical Analysis 27 (2) (1995): 93-115.
[3] Dukic, Vanja, Hedibert F. Lopes and Nicholas G. Polson. Tracking Epidemics With
Google Flu Trends Data and a State-Space SEIR Model Journal of the American
Statistical Association 107:500, 1410-1426. Web.
[4] Fischer, Manfred M., & Jinfeng Wang. Spatial Data Analysis: Models, Methods
and Techniques Heidelberg: Springer, 2011. Springer Link. Springer International
Publishing. Web.
[5] Flu Trends. Google. N.p., 2015. Web. 22 July 2015, available at https://www.
google.org/flutrends/about/data/flu/us/data.txt
[6] Franklin, Meredith. Spatial Statistics USC Keck School of Medicine, 2013. Web. 6
July 2016.
[7] Hyndman, Rob J., and George Athanasopoulos., 8.1 Stationarity and Differencing
OTexts, 2016. Web.
[8] Hyndman, Rob J. Thoughts on the Ljung-Box test WordPress, 2014. Web. 02 July
2016.
[9] Hyndman, Rob J., and George Athanasopoulos., 8.9 Seasonal ARIMA Models
OTexts, 2016. Web.
[10] LeSage, James P. Lecture 1: Maximum likelihood estimation of spatial regression
models (2004), available at www4.fe.uc.pt/spatial/doc/lecture1.pdf Web.
[11] Pace, R. Kelley, Ronald Barry, Otis W. Gilley and C.F. Sirmans. "A method for spatial-temporal forecasting with an application to real estate prices." International Journal of Forecasting 16 (2000): n. pag. Web.
[12] ”The Flu Season.” Centers for Disease Control and Prevention. Centers for Disease
Control and Prevention, 22 Oct. 2014. Web.
[13] Wall, Melanie M. ”A close look at the spatial structure implied by the CAR and
SAR models” Journal of statistical planning and inference 121 (2004): 311-324.
Web.
[14] Yang, Shihao, Mauricio Santillana and S. C. Kou. "Accurate estimation of influenza epidemics using Google search data via ARGO." Proceedings of the National Academy of Sciences 112.47 (2015): 14473-14478. Web.
Appendix A
Spatial Matrix
The spatial weights matrix W which is used for finding spatial autocorrelation is
determined by the collection of neighboring states for each state. The geographical map
of the states numbered in alphabetical order is displayed below in Figure A.1. The
corresponding list of neighbors associated with the map is found in Figure A.2 [13].
Figure A.1: Map of the continental states with corresponding numbers [13]
Figure A.2: List of Neighbors for the 48 States [13]
Appendix B
Moran I Proof
The global Moran I statistic and its expected value are:

$$I = \frac{n \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}(z_i - \bar z)(z_j - \bar z)}{\sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij}\ \sum_{i=1}^{n}(z_i - \bar z)^2}, \qquad E[I] = \frac{-1}{n-1}$$

Proof. Start by noting:

1. $z_1, \dots, z_n$ are given, so $\sum_{i}(z_i - \bar z)^2$ is a constant.

2. For $k \neq l$, $E[(z_k - \bar z)(z_l - \bar z)]$ is the same for all $k, l$.

3. For $k \neq l$, averaging over all ordered pairs and using the fact that the deviations sum to zero, so that $\sum_{l \neq k}(z_l - \bar z) = -(z_k - \bar z)$:

$$E[(z_k - \bar z)(z_l - \bar z)] = \frac{\sum_{k}\sum_{l \neq k}(z_k - \bar z)(z_l - \bar z)}{n(n-1)} = \frac{\sum_{k}(z_k - \bar z)\big(-(z_k - \bar z)\big)}{n(n-1)} = -\frac{\sum_{k}(z_k - \bar z)^2}{n(n-1)}$$

4. $\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij} = \sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij}$, since $W_{kk} = 0$ for all $k$.

Now we can algebraically prove:

$$E[I] = \frac{n \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\, E[(z_i - \bar z)(z_j - \bar z)]}{\sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij}\ \sum_{i=1}^{n}(z_i - \bar z)^2} = \frac{n \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\left(-\dfrac{\sum_{k}(z_k - \bar z)^2}{n(n-1)}\right)}{\sum_{i=1}^{n}\sum_{j \neq i}^{n} W_{ij}\ \sum_{i=1}^{n}(z_i - \bar z)^2} = -\frac{1}{n-1} \qquad \square$$