Efficient Algorithms for Mining Large Spatio-Temporal Data
Feng Chen
Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in
partial fulfillment of the requirements for the degree of
Doctor of Philosophy
In
Computer Science and Applications
Chang-Tien Lu, Chair
Ing Ray Chen
Naren Ramakrishnan
Wenjing Lou
Yue Wang
November 30, 2012
Falls Church, VA
Keywords: Spatio-Temporal Analysis, Outlier Detection, Robust Prediction, Energy
Disaggregation
Efficient Algorithms for Mining Large Spatio-Temporal Data
Feng Chen
ABSTRACT
Knowledge discovery on spatio-temporal datasets has attracted growing interest. Recent advances in remote sensing technology mean that massive amounts of spatio-temporal data are being collected, and the volume keeps increasing at an ever faster pace. It has become critical to design efficient algorithms for identifying novel and meaningful patterns in massive spatio-temporal datasets. Unlike other data sources, these data exhibit significant space-time statistical dependence, and the i.i.d. assumption is no longer valid. Exact modeling of space-time dependence renders model complexity exponential in the data size. This research focuses on the construction of efficient and effective approaches using approximate inference techniques for three main mining tasks: spatial outlier detection, robust spatio-temporal prediction, and novel applications to real-world problems.
Spatial novelty patterns, or spatial outliers, are data points whose characteristics differ markedly from those of their spatial neighbors. There are two major branches of spatial outlier detection methodologies: global Kriging based and local Laplacian smoothing based. The former requires the exact modeling of spatial dependence, which is computationally expensive; the latter requires the i.i.d. assumption on the smoothed observations, which is not statistically sound. Both approaches are restricted to numerical data, but in real-world applications we are often faced with a variety of non-numerical data types, such as count, binary, nominal, and ordinal. To summarize, the main research challenges are: 1) how much spatial dependence can be eliminated via Laplacian smoothing; 2) how to effectively and efficiently detect outliers in large numerical spatial datasets; 3) how to generalize numerical detection methods and develop a unified outlier detection framework suitable for large non-numerical datasets; 4) how to achieve accurate spatial prediction even when the training data have been contaminated by outliers; and 5) how to handle spatio-temporal data in the preceding problems.
To address the first and second challenges, we mathematically validated the effectiveness of Laplacian smoothing in eliminating spatial autocorrelation. This work provides fundamental support for existing Laplacian smoothing based methods. We also discovered a nontrivial side effect of Laplacian smoothing: it introduces additional spatial variation into the data due to convolution effects. To capture this extra variability, we proposed a generalized local statistical model, and designed two fast forward and backward outlier detection methods that achieve a better balance between computational efficiency and accuracy than most existing methods, and are well suited to large numerical spatial datasets.
We addressed the third challenge by mapping non-numerical variables to latent numerical variables via a link function, such as the logit function used in logistic regression, and then utilizing error-buffer artificial variables, which follow a Student-t distribution, to capture the large variations caused by outliers. We proposed a unified statistical framework that integrates the advantages of the spatial generalized linear mixed model, the robust spatial linear model, reduced-rank dimension reduction, and Bayesian hierarchical modeling. A linear-time approximate inference algorithm was designed to infer the posterior distribution of the error-buffer artificial variables conditioned on the observations. We demonstrated that traditional numerical outlier detection methods can be applied directly to the estimated artificial variables for outlier detection. To the best of our knowledge, this is the first linear-time outlier detection algorithm that supports a variety of spatial attribute types, such as binary, count, ordinal, and nominal.
To address the fourth and fifth challenges, we proposed a robust version of the Spatio-Temporal Random Effects (STRE) model, namely the Robust STRE (R-STRE) model. The regular STRE model is a recently proposed statistical model for large spatio-temporal data with linear time complexity, but it is not well suited to non-Gaussian and contaminated datasets. This deficiency can be systematically addressed by increasing the robustness of the model, using heavy-tailed distributions, such as the Huber, Laplace, or Student-t distribution, to model the measurement error instead of the traditional Gaussian. However, the resulting R-STRE model becomes analytically intractable, and direct application of approximate inference techniques still has cubic time complexity. To address this computational challenge, we reformulated the prediction problem as a maximum a posteriori (MAP) problem with a non-smooth objective function, transformed it into an equivalent quadratic programming problem, and developed an efficient interior-point numerical algorithm with near-linear complexity. This work presents the first near-linear-time robust prediction approach for large spatio-temporal datasets in both offline and online cases.
Acknowledgements
First and foremost, I would like to thank my advisor, Dr. Chang-Tien Lu. Dr. Lu has contributed to
this work in many ways, and has taught me a tremendous amount. It was his energy and enthusiasm
that drew me to Virginia Tech, and led me down my current research path. Second, I would like
to thank my committee members, Dr. Ing Ray Chen, Dr. Naren Ramakrishnan, Dr. Wenjing Lou,
and Dr. Yue Wang; and my previous committee member Dr. Michael K. Badawy for many helpful
comments and insightful discussions from my proposal to my final defense. Special thanks go to Dr. Wenjing Lou, who was willing to join my final defense committee at the last moment.
I would like to express appreciation to my friends in the Spatial Data Management Laboratory,
Xutong Liu, Yen-Cheng Lu, Bingsheng Wang, Haili Dong, Ting Hua, Liang Zhao, Kaiqun Fu, Manu
Shukla, Jing Dai, Ying Jin, Bing Liu, Arnold Boedijardjo, Edward Devilliers, Ray Dos Santos,
Wendell Jordan-Brangman, and Chad Steel. Many thanks for their precious comments on my
dissertation. Each discussion with them sparked new thoughts in my research. They made my
Ph.D. study an enjoyable journey with many happy memories.
Most importantly, I would like to thank my family and friends, for all of their love and support.
Contents
List of Figures x
List of Tables xi
1 Introduction 1
1.1 Research Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Spatial Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Robust Spatio-Temporal Prediction . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Proposal Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Theoretical Foundations and Related Works 9
2.1 Spatial Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Laplacian Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Approximate Inference Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Outlier Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 A Generalized Approach to Numerical Spatial Outlier Detection 21
3.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Spatial Local Statistics and Related Works . . . . . . . . . . . . . . . . . . . . . 23
3.3 Generalized Local Spatial Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Generalized Local Statistic Model (GLS) . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Theoretical Properties of GLS . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Estimation and Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Generalized Least Squares Regression . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.2 GLS-Backward Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 GLS-Forward Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.4 Connections with Existing Methods . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.1 Simulation Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.2 Detection Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3 Computational Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 A Generalized Approach to Non-Numerical Spatial Outlier Detection 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Theoretical Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Reduced-Rank Spatial Linear (Gaussian Process) Model . . . . . . . . . . . . . 48
4.2.2 Spatial Generalized Linear Mixed Model (SGLMM) . . . . . . . . . . . . . . . . 49
4.3 Robust and Reduced-Rank Bayesian SGLMM model . . . . . . . . . . . . . . . . 50
4.3.1 The Observations Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3.2 The Latent Robust Gaussian process Layer . . . . . . . . . . . . . . . . . . . . 52
4.3.3 The Parameters Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.4 Theoretical Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Robust Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.1 Inference on Latent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Inference on Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.3 Non-Numerical Spatial Outlier Detection . . . . . . . . . . . . . . . . . . . . . 55
4.4.4 Time and Space Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.2 Detection Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.3 Detection Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.4 Impact of Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5 Robust Prediction for Large Spatio-Temporal Data Sets 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Theoretical Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.1 Spatio-Temporal Random Effects Model . . . . . . . . . . . . . . . . . . . . . . 72
5.2.2 Fixed Rank Spatio-Temporal Prediction . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.1 Robust Spatio-Temporal Random Effects Model . . . . . . . . . . . . . . . . . 74
5.3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4 A General Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4.1 MAP Estimation of η1:T |T , ξ1:T |T . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.2 LA Estimation of the Precision Matrix G1:T |T . . . . . . . . . . . . . . . . . . 77
5.5 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5.1 Primal-Dual Optimization for Huber Distribution . . . . . . . . . . . . . . . . . 79
5.5.2 Primal-Dual Optimization for Laplace Distribution . . . . . . . . . . . . . . . . 82
5.5.3 Time and Space Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . 83
5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.6.1 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6.2 Experiments on Aerosol Optical Depth Data . . . . . . . . . . . . . . . . . . . 87
5.6.3 Experiments on Traffic Volume Data . . . . . . . . . . . . . . . . . . . . . . . 88
6 Application 1: Activity Analysis Based on Low Sample Rate Smart Meters 95
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.1 Problem and Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.2 Research Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.3 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3 A NEW STATISTICAL DISAGGREGATION FRAMEWORK . . . . . . . . . . . . . 101
6.4 DISAGGREGATION APPROACHES . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.1 HMM-based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4.2 Classification-GMM-based Approach . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Evaluation & Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.5.2 Parameter Settings & Baseline Methods . . . . . . . . . . . . . . . . . . . . . . 110
6.5.3 Effectiveness Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.5.4 Impact of Sample Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.5.5 Disaggregation for Pilot Households . . . . . . . . . . . . . . . . . . . . . . . . 113
6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7 Application 2: Wireless Passive Device Fingerprinting using Infinite Hidden Markov Random Field 118
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2.1 Radio-metric Based Device Fingerprinting . . . . . . . . . . . . . . . . . . . . . 121
7.2.2 RSS Based Device Fingerprinting . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.3 Features for Device Fingerprinting . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.3.1 Time Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.3.2 Frequency Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.3.3 Phase Shift Difference Measurement . . . . . . . . . . . . . . . . . . . . . . . 124
7.3.4 Angle of Arrival Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3.5 Radio Signal Strength (RSS) Measurement . . . . . . . . . . . . . . . . . . . . 125
7.4 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.5 Theoretical Backgrounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.5.1 Hidden Markov Random Field . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.5.2 Infinite Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.6 Infinite Hidden Markov Random Field (iHMRF) . . . . . . . . . . . . . . . . . . . 130
7.7 Incremental Variational Inference for the IHMRF Model . . . . . . . . . . . . . . 132
7.7.1 Model Building Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.7.2 Compression Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.7.3 Incremental Batch Update Phase . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.8 Simulation Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.8.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.8.2 Impacts of Instable RSS Collection Rates . . . . . . . . . . . . . . . . . . . . . 140
7.8.3 Impacts of Transmission Power Changes . . . . . . . . . . . . . . . . . . . . . 141
7.8.4 Comparisons on Precision, Recall, and F-Measure . . . . . . . . . . . . . . . . . 141
7.8.5 Comparison on Time Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.8.6 A Case Study on Detecting Masquerade Attacks . . . . . . . . . . . . . . . . . 142
7.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8 Achievements and Future Work 147
8.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.2.1 Spatial and Spatio-Temporal Outlier Detection . . . . . . . . . . . . . . . . . . 151
8.2.2 Spatio-Temporal Anomalous Cluster Detection . . . . . . . . . . . . . . . . . . 152
8.2.3 Energy Disaggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.2.4 Wireless Device Fingerprinting . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.3 Published Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
A Appendix 157
A.1 Estimated Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
A.2 Definition of Matrices M and E . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
A.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
A.4 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A.5 Offline Inference Solution for iHMRF . . . . . . . . . . . . . . . . . . . . . . . . . 163
Bibliography 164
List of Figures
3.1 An example of correlation: it reflects the strength and direction of a linear relationship . . 29
3.2 The neighborhoods defined by 4 or 12-nearest-neighbors rules in gridded data, equal to
those defined by radiuses r and 2r . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Comparison on computational cost (setting: linear trend, isolated outliers, α = 0.1, σ₀² = 2, c = 15, K = 8, n = 200) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Outlier ROC Curve Comparison (the same setting: n = 200, b = 5, σ_C² = 20) . . . . . 45
4.1 Graphic Model Representation of the 3RB-SGLMM Model . . . . . . . . . . . . . . . . 51
4.2 Spatial Distribution of Four Simulation Datasets . . . . . . . . . . . . . . . . . . . . . 60
4.3 Spatial Distribution of Six Real Life Datasets . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Spatial Distribution of Simulation Data . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Detection Rate Comparison on Four Real Datasets . . . . . . . . . . . . . . . . . . . . 65
4.6 Time Cost Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Detection Rate Comparison Using Different Knot Sizes . . . . . . . . . . . . . . . . . . 67
5.1 pdfs of Heavy Tailed Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Approximations of Heavy Tailed Distributions . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Comparison between the FR-STP and RFR-STP using the data observed at four different
times and with different numbers of isolated outliers (15 unobserved locations from s =
113 to s = 127) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.5 Comparison between the FR-STP and RFR-STP using the data observed at two different
times and with different sizes of regional outliers (15 unobserved locations from s = 113
to s = 127) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6 Comparison between the FR-STP and RFR-STP on the contaminated AOD data sets
observed at time t = 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.7 Comparison between the FR-STP and RFR-STP using the Traffic Volume Data on the
4th day. (Detectors #75 and #215 are spatial neighbors) . . . . . . . . . . . . . . . . . 94
6.1 An Example of Data and Disaggregated Activities . . . . . . . . . . . . . . . . . . . . 97
6.2 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Smarter Water Service Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Disaggregation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5 Impact of Interval Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.6 Distribution vs. Demographic Info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.7 Washer Usage vs. Day of Week . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.8 Shower vs. Day of Week . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.9 Shower/Washer vs. Time of Day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.1 Illustration of phase shift difference for constellation of QPSK symbols of two transmitters 124
7.2 Features extraction from packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.3 Graphical Model Representation of iGMM . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.4 Graphical Model Representation of iHMRF . . . . . . . . . . . . . . . . . . . . . . . . 132
7.5 Spatial Distribution of Simulation Data . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.6 Comparison on Time Costs (Seconds) . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.7 Visualization for the UdelModels Data with 1 Building 10 Floors . . . . . . . . . . . . . 144
7.8 Visualization for the UdelModels - Chicago9B1k Data with Pedestrians and Cars . . . 145
7.9 Visualization for the UdelModels - Chicago9B1k Data with Only Cars . . . . . . . . . . 146
A.1 The comparison between the true correlation |ρ(ω_i*, ω_j*; θ)| and the estimated bound function. Here, K = 12, c = 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.2 The comparison between the true correlation |ρ(ω_i*, ω_j*; θ)| and the estimated bound function. Here, K = 12, c = 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.3 The comparison between the true correlation |ρ(ω_i*, ω_j*; θ)| and the estimated bound function. Here, K = 12, c = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A.4 The comparison between the true correlation |ρ(ω_i*, ω_j*; θ)| and the estimated bound function. Here, K = 12, c = 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A.5 The comparison between the true correlation |ρ(ω_i*, ω_j*; θ)| and the estimated bound function. Here, K = 12, c = 40. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
List of Tables
3.1 Description of major symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Combination of parameter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Competition statistics for different combinations of parameter settings . . . . . . . . . 43
4.1 Simulation Model Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Real life Data Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.1 Comparison of Time Cost using the Simulated and AOD Data (Seconds) . . . . . . . . 91
5.2 Comparison of Robustness using the AOD data . . . . . . . . . . . . . . . . . . . . . . 91
6.1 Terms & Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Water Journaling of One Household . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3 Precision, Recall, and F-measure on Simulation Data . . . . . . . . . . . . . . . . . . . 112
6.4 Precision, Recall, and F-measure on Volunteers . . . . . . . . . . . . . . . . . . . . . . 113
7.1 Device Fingerprinting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2 Definition of TP, FP, FN, and TN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3 Simulation Data Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.4 Simulation Results Based on UdelModels with 1 Building 10 Floors . . . . . . . . . . . 139
7.5 Simulation Results Based on UdelModels - Chicago9Blk - with Pedestrians and Cars . . 139
7.6 Simulation Results Based on UdelModels - Chicago9Blk - with Only Cars . . . . . . . . 139
7.7 Simulation Results Based on UdelModels - Chicago9Blk - with Only Pedestrians . . . . 140
7.8 Unstable RSS Rates (UdelModels - Chicago9Blk - with Only Pedestrians) . . . . . . . 141
7.9 Change of Transmission Power (UdelModels - Chicago9Blk - with Only Pedestrians) . . 141
7.10 Detection Rates for Masquerade Attacks Based on UdelModels - Chicago9B1k - Pedestrians . . . 143
7.11 Detection Rates for Masquerade Attacks on UdelModels - Chicago9B1k - 1 Building 10
Floors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chapter 1
Introduction
In recent years, with the advancement of remote sensing techniques and the widespread use of mobile devices such as GPS receivers and smartphones, the amount of spatial (or geographic) data has multiplied. The ever-increasing volume of spatial data has greatly challenged our ability to store, retrieve, and extract useful but implicit knowledge from it. This capability is crucial for many application domains, including ecology and environmental management, public safety, transportation, earth science, epidemiology, and climatology [4]. A number of research works have been conducted to develop Spatial Database Management Systems (SDBMS). The major research areas in spatial databases include spatial data modeling, spatial data access, spatial data querying, spatial data visualization, and spatial data mining (or knowledge discovery).
Spatial data mining [285,224,264,263] is the process of discovering previously unknown and potentially useful patterns from large spatial data sets. As with traditional data mining, spatial data mining techniques can be categorized into clustering, classification, co-location mining, and outlier detection [4]. However, traditional data mining techniques may not be directly applicable to spatial data because of the complexity of spatial data, intrinsic spatial relationships, and spatial autocorrelation. By the first law of geography, "Everything is related to everything else, but nearby things are more related than distant things" [55].
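Spatial autocorrelation of this kind is commonly quantified with Moran's I. The sketch below is illustrative only, not a method developed in this research; the binary spatial weight matrix supplied by the caller is an assumption of the example:

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I statistic for spatial autocorrelation.

    `weights` is an n x n spatial weight matrix (weights[i][j] > 0 when
    i and j are neighbors). Values near +1 indicate strong positive
    spatial autocorrelation ("nearby things are more related"); values
    near the null expectation -1/(n-1) indicate spatial randomness.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = len(x)
    z = x - x.mean()                      # deviations from the mean
    num = np.sum(w * np.outer(z, z))      # cross-products of neighbors
    den = np.sum(z ** 2)                  # total variation
    return (n / w.sum()) * (num / den)
```

For a smoothly increasing attribute on a chain of neighboring sites, the statistic comes out positive, matching the intuition of the first law.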
In many applications, especially in sensor networks, spatial data are continuously collected, and the addition of temporal information to spatial data makes the mining of spatial patterns even more challenging. It is crucial to consider both spatial and temporal dependence during the knowledge discovery process. To process temporal and streaming data, a number of works have been conducted on modeling [36], querying [42, 240, 244], classification [252, 230, 289, 291], clustering [245], and visualization [210].
This research focuses on the development of local space and geometry based techniques for three spatial mining tasks: spatial outlier detection, anomalous cluster detection, and spatial classification. These tasks have a wide array of applications, some of which are described below. In the following chapters, we use "anomaly detection" to denote both of the first two tasks.
• Event detection in sensor networks. Nowadays, sensor networks [214,5,6] have attracted increasing attention, and many sensor networks are being deployed, such as in habitat monitoring applications [29], the smart grid [30], and IBM Smarter Planet [31] projects. There are a variety of sensor network applications where anomaly detection is central. Typical examples include: (1) environment monitoring, in which anomaly detection can identify when and where an event occurs based on the regional temperature and humidity information collected by sensors [27]; (2) habitat monitoring, in which sensors are attached to endangered species to monitor their daily lives, and anomaly detection can indicate abnormal behaviors [26]; (3) health and medical monitoring, in which sensors are attached to different parts of patients' bodies and anomaly detection can indicate potential diseases [28]; (4) industry monitoring, in which anomaly detection can identify possible malfunctions and other abnormalities from temperature, pressure, and vibration amplitude sensors installed in machines [29]; (5) target tracking, in which moving targets can be tracked via GPS sensors and anomaly detection can filter erroneous information to improve tracking accuracy and efficiency [7, 8]; (6) detection of traffic incidents and traffic congestion [56, 286]; and (7) detection of radioactive, biological, or chemical materials [10, 9, 11].
• Object detection in digital images. The literature on the detection of spatial objects in images spans several decades, focusing mainly on satellite imagery [12–15], computer vision [16, 17], and medical imaging [18–21]. One of its most recent applications is in the brain imaging domain. Spatial anomalous cluster detection has been applied to identify brain regions affected by diseases, such as stroke or degenerative diseases [78]. It has also been applied to identify brain regions that correlate with particular brain activities. For example, it is possible to tell whether a person is watching a movie or reading a book by monitoring functional magnetic resonance imaging (fMRI) images of their brain activity [80].
• Disease Outbreak Surveillance. Disease surveillance is one of the major application domains for spatial anomalous cluster detection. It is of great practical utility to detect emerging disease outbreaks as early as possible. The presence of chemical and biological pollutants in some geographic regions can also be detected indirectly if these materials affect human health [22–24].
• Intrusion and virus detection in a computer network. With the widespread use of Internet technologies, computers can easily be infected by viruses or worms spreading through a computer network [25]. The slightly abnormal symptoms (e.g., a slight loss of performance and the presence of system instability) exhibited by infected computers can be difficult to detect on a single machine.
1.1 Research Issues
This research aims to investigate and develop local based efficient and effective learning techniques
for spatio-temporal data. The major research issues are stated as follows:
1.1.1 Spatial Outlier Detection
Spatial outlier detection aims to find a small group of data objects that deviate significantly from the
rest large amount of data, by considering the effects of spatial autocorrelations. Existing solutions
for spatial outlier detection can be categorized into two branches, including global and local based
detection methods. Global based methods are designed based on the robust estimation of global
statistical models (e.g., ordinary or universal Kriging models). For this category, outlier detection
can be regarded as a by-product of the robust estimation of a prediction model. However, there are
applications where outlier detection, rather than prediction, is central. It may be important and
more efficient to identify outliers without having to estimate the complete model. This is the
major motivation for local based detection methods. The basic idea of local based methods is
to first calculate the local difference (or Laplacian-smoothed value) for each object, which is the
difference between the non-spatial attribute of the object and the aggregated value (e.g., average)
of its spatial neighbors. By assuming i.i.d. normal distributions for these local differences, the local
based approach discovers outlier objects by robust estimation of the related local model parameters,
such as the aggregated values, mean, and standard deviation. There are four major issues that this
research addresses for spatial outlier detection.
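As an illustration, the local-difference idea described above can be sketched in a few lines; the grid data, the neighborhood size K, and the z-score ranking below are illustrative assumptions, not the detection methods proposed in this research:

```python
import numpy as np

def local_differences(coords, z, k=4):
    """For each site, subtract the average of its k nearest spatial neighbors."""
    coords, z = np.asarray(coords, float), np.asarray(z, float)
    diffs = np.empty_like(z)
    for i in range(len(z)):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]       # skip the site itself
        diffs[i] = z[i] - z[nbrs].mean()       # local (Laplacian-smoothed) difference
    return diffs

# Smooth field z = x + y on a 5x5 grid, with one injected outlier.
coords = [(x, y) for x in range(5) for y in range(5)]
z = np.array([x + y for x, y in coords], float)
z[12] += 10.0                                  # contaminate the center site

d = local_differences(coords, z)
scores = np.abs(d - d.mean()) / d.std()        # z-scores of the local differences
print(int(np.argmax(scores)))                  # index of the injected outlier
```

The score of the contaminated site dominates because its non-spatial value departs sharply from its neighborhood average, which is exactly the signal that local based methods exploit.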
1. Statistical foundations for local based methods. Existing local based detection methods
have the advantages of simplicity and high efficiency. These methods were designed based on
the fundamental assumption that the calculated local differences are i.i.d. normal. However,
no justifications for this assumption have ever been proposed. It is important to study the
situations where this assumption is appropriate and where it is not. The appropriateness
can be measured at a statistical significance level, e.g., the 0.5% level. A variety of
scenarios need to be tested, which can be modeled by different statistical frameworks (e.g., or-
dinary and universal kriging) under different parameter settings. Example parameters include
different data structures (e.g., continuous space, lattice space, and transportation network),
neighborhood definitions (e.g., defined by K nearest neighbors or by Voronoi), neighborhood
size, and covariance models (e.g., spherical, exponential, and Gaussian kernels).
2. Accuracy and performance parametrization. There are common situations where the
assumption of i.i.d. normality is violated. In these situations, the performance of existing local
based detection methods deteriorates significantly. There are four major scenarios to be considered.
First, some data may exhibit a linear or nonlinear global trend, which can be represented in
some parametric form, such as a polynomial of the spatial locations or a linear combination of other
basis functions (e.g., Gaussian basis functions). Second, the local (or Laplacian) smoothing
process of calculating local differences can help reduce spatial autocorrelations between data
objects. However, this smoothing process will also increase correlations between data objects
because of the convolution effect [54]. Third, some spatial data may have different regional
characteristics, such as population density, community types, and spatial heterogeneities, e.g.,
two cities separated by a mountain range. These regional features will lead to varying auto-
correlations across different regions. Fourth, some spatial data may exhibit a complex trend
structure that cannot be described in a parametric form. In this case, nonparametric
estimation techniques need to be considered.
3. Comparisons between local and global based methods. Few published studies have
compared the performance of local and global based methods theoretically
and empirically. From the theoretical side, the key is to identify the situations where the spatial
autocorrelations between objects cannot be removed significantly (e.g., at the 0.05 level) by local (or
Laplacian) smoothing. In these situations, global based methods will outperform
local based methods. From the empirical side, a variety of real data sets need to be tested to
further justify the results derived from the preceding theoretical analysis.
4. Extension to non-numerical spatial outlier detection. Most existing spatial outlier
detection methods are proposed for numerical spatial data. However, due to the spatial het-
erogeneity, data are often of different types, such as continuous, ordinal, and binary, each
of which conveys important information. For example, in economics studies, the living ar-
eas (continuous variables), the ages of dwelling (ordinal variables), and the indicator which
shows if a dwelling is located in a certain county (binary variables), are usually measured to
characterize the sale prices of houses. There is an emerging need to generalize univariate outlier
detection techniques to non-numerical data. Two of the major challenges are: 1) the modeling of spatial
dependence for non-numerical data is different from that for numerical data. It is necessary
to design a unified spatial model to capture the spatial dependence for different data types;
2) Laplacian smoothing is mainly applicable to numerical data. It is necessary to find an
alternative approximation strategy to speed up the outlier detection process.
1.1.2 Robust Spatio-Temporal Prediction
Efficient prediction for massive amounts of spatio-temporal data is an emerging challenge in the
data mining field. The state-of-the-art fixed rank spatio-temporal prediction (FR-STP) offers a
promising dimension-reduced approach for predicting large spatio-temporal data in linear time, but
is not applicable to the nonlinear dynamic environments common in many real applications. This
deficiency can be systematically addressed by increasing the robustness of the FR-STP using heavy
tailed distributions, such as the Huber, Laplace, and Student’s t distributions. There are two major
issues that this research addresses for robust spatio-temporal prediction.
1. Robust spatio-temporal prediction for numerical data. There are currently two
approaches for predicting spatio-temporal data, namely the Kriging based and the dynamic
(mechanic or probabilistic) specification based approaches. Both approaches have measurement
error components that can be modeled using heavy tailed distributions to increase the
models' robustness. This extension will make the resulting approaches analytically intractable,
and efficient approximate algorithms need to be designed. For the dynamic (mechanic or
probabilistic) specification based approach, the most advanced model is the Spatio-Temporal
Random Effects (STRE) model. It is technically challenging to design a robust version of the
STRE model, and design efficient algorithms that can do robust spatio-temporal prediction
in near linear time. In addition, strategies also need to be developed to estimate the confidence
intervals of the prediction results. The theoretical properties of the robust version of the STRE
model and its connection with the STRE model need to be explored.
2. Robust spatio-temporal prediction for non-numerical data. The key challenge is to
efficiently model spatial autocorrelations between attributes of different data types, such as
numerical, binary, count, and categorical. Based on spatial generalized linear models, the
observations of different data types at each time stamp can be mapped to a latent vector
of numerical random variables modeled by a multivariate Gaussian distribution. The spatial
autocorrelations between different data types can be then modeled using the covariance matrix
of the multivariate Gaussian distribution. The latent random vectors with different time
stamps can then be modeled by a first order autoregressive model (linear dynamic system) to
capture temporal autocorrelations. There are two major computational challenges. The first
is the need to invert an n × n covariance matrix, which has a time complexity of O(n³).
The second is the need to apply MCMC for inference. These two challenges
can be addressed by modeling the latent spatial process as a reduced-rank Gaussian process,
and by using the Integrated Nested Laplace Approximation (INLA) to conduct approximate
inferences [301]. In order to increase the robustness of our proposed model, the model can be
further extended by adding a noise component with a heavy tailed distribution (e.g., Laplace,
Student-t distributions) to the latent Gaussian random variables, and the reduced-rank representation
and INLA can be applied to conduct robust and approximate inferences.
1.2 Contributions
The major proposed research contributions can be stated as follows:
Spatial Outlier Detection
1. A generalized local statistics framework
Propose a new generalized local statistics (GLS) model and evaluate its major statistical prop-
erties. This new GLS model provides statistical interpretations and connections for existing
local and global based outlier detection methods. Propose improved detection methods based
on the GLS model. Conduct extensive simulations and real data sets evaluations to compare
1.2 Contributions 6
the performance between the proposed detection methods and all state of the art local and
global based detection methods. The simulations will consider broad settings (e.g, different
data sizes, global trend functions, distance metrics, neighborhood sizes, and kernel models),
in order to test a variety of scenarios.
2. Significance Evaluation for Laplacian Smoothing
Derive statistical relationships between the quality of Laplacian smoothing and different data
settings, such as data size, neighborhood size, and the spatial distance metric (e.g., Euclidean
and Manhattan distances) used. The objective is to study the situations where Laplacian
smoothing could help reduce autocorrelations between data objects to a significance level, e.g.,
0.05, for the problem of spatial outlier detection.
3. Extension of GLS to non-numerical spatial data
Generalize the proposed GLS model to non-numerical data. The generalized model will use
generalized spatial linear model to capture the spatial dependence between non-numerical
data, use heavy tailed distributions to capture variations due to outliers, and use approximate
inference algorithms, such as the integrated nested Laplace approximation (INLA), to achieve
near linear time detection efficiency.
To summarize, we proposed two efficient outlier detection approaches that are best suited for
large numerical and non-numerical spatial datasets, respectively.
Robust Spatio-Temporal Prediction
1. Formalization of the robust spatio-temporal prediction problem
A Robust Spatio-Temporal Random Effects (R-STRE) model is proposed in which the mea-
surement error follows a heavy tailed distribution, in place of the traditional Gaussian
distribution. The RFR-STP problem is then formalized as a maximum a posteriori (MAP) prediction
problem based on the R-STRE model.
2. Design of a general RFR-STP algorithm
A general prediction algorithm is proposed utilizing a framework of Newton’s methods that can
be applied to most existing heavy tailed distributions. The proposed algorithm outperformed
traditional algorithms in nonlinear environments, where some of the underlying distributional
assumptions of Gaussian processes and linear dynamic systems are violated.
3. Development of optimization techniques
For the special Huber and Laplace distributions, the corresponding robust prediction problems
with non-continuously differentiable objective functions were first reformulated as Quadratic
Programming (QP) problems, and then primal-dual interior point methods were applied to
achieve a near-linear-order time prediction efficiency.
4. Comprehensive experiments to validate the new algorithm’s robustness and effi-
ciency
The proposed techniques were evaluated using an extensive simulation study and experiments
on two real life data sets. The results demonstrated that the proposed algorithm outperformed
traditional prediction algorithms when the data were contaminated by a small portion of
outliers.
To summarize, we proposed the first near-linear-time robust prediction approach for large
spatio-temporal datasets in both offline and online cases.
Novel Applications
1. Activity Analysis Based on Low Sample Rate Smart Meters
Activity-level consumption insights were provided to residents and the city management team
to support decision making. A general disaggregation framework was designed with two imple-
mentations for different scenarios. Appropriate smart meter sample rate to enable consumption
disaggregation was explored. Interesting consumption patterns were identified from the dis-
aggregation results. To the best of our knowledge, this is the first unsupervised approach to
human activity analysis based on low sample rate smart meter data.
2. Device Fingerprinting to Enhance Wireless Security using Infinite Hidden Markov
Random Field
Wireless device fingerprinting is an emerging approach for detecting spoofing attacks in wireless
networks. Existing methods utilize either time-independent features or time-dependent features,
but not both concurrently due to the complexity of different dynamic patterns. We proposed
a unified approach to fingerprinting based on iHMRF. The proposed approach is able to model
both time-independent and time-dependent features, and to automatically track the dynamically
varying number of devices. We designed an efficient iHMRF-based online classification
algorithm for wireless environment using variational incremental inference, micro-clustering
techniques, and batch updates. Based on our literature survey, this is the first approach to
wireless device fingerprinting using iHMRF.
1.3 Proposal Organization
The remainder of this research proposal is organized as follows. Chapter 2 presents theoretical
backgrounds and literature survey. Chapter 3 defines a generalized local statistical framework and
three efficient and effective methods for spatial numerical outlier detection. Chapter 4 proposes
a generalized approach to non-numerical spatial outlier detection, based on generalized linear
models and robust statistics. Chapter 5 presents a robust spatial temporal random effects model
and three efficient algorithms for near linear time robust prediction. Chapter 6 designs a general
statistical framework for energy disaggregation of water smart meter data. Chapter 7 presents
a novel application of infinite hidden Markov random fields (iHMRF) to the wireless fingerprinting
problem. Chapter 8 concludes and discusses our future work.
Chapter 2
Theoretical Foundations and Related Works
This chapter first describes the fundamental concepts of spatial data mining, including spatial ran-
dom field, covariogram and semivariogram, spatial model decomposition, kriging models, and Lapla-
cian smoothing. It then presents literature surveys on outlier detection, anomalous cluster detection,
and locally linear classification.
2.1 Spatial Data Modeling
This section introduces four major statistical components for spatial data modeling, including spatial
random field, covariogram and semivariogram, spatial model decomposition, and kriging models.
Spatial Random Field
A spatial random field (SRF) refers to a collection of random variables indexed by a set of spatial
coordinates. It can be represented as
{Z(s) | s ∈ D ⊂ R²}, (2.1)
where D is a fixed spatial region. A spatial random field is called a Gaussian spatial random field
if any finite collection Z(s1), Z(s2), . . . , Z(sn), with {s1, . . . , sn} ⊂ D, follows a multivariate
Gaussian distribution. Note that D is an infinite collection of spatial indexes, and in real applications
only a partial sample of a particular realization of the random field is available.
A spatial random field is strictly (or strongly) stationary if its distribution is invariant
under translations of the coordinates. It is second-order (or weakly) stationary if the mean is
constant and the covariance between random variables Z(si) and Z(sj) is a function of their
spatial separation:
E(Z(s)) = µ; Cov[Z(si), Z(sj)] = C(h), (2.2)
where h = si − sj. C(h) is called the covariance function of the spatial process. A second-order
stationary spatial process is called isotropic if the covariance function C(h) = C(‖ h ‖), where ‖ h ‖
is a norm of the lag vector h (or the spatial distance between si and sj). Examples of distance
metrics include Euclidean distance, Manhattan distance, and network distance.
Covariogram and Semivariogram
Let {Z(s) | s ∈ D ⊂ R²} be a spatial process and define
C∗(si, sj) = Cov(Z(si), Z(sj)). (2.3)
If C∗(si, sj) = C(si−sj), a function of spatial coordinate difference between si and sj, then C(si−sj)
is called the covariogram of the spatial process. If C(si − sj) = C(‖ si − sj ‖), it is called an
isotropic covariogram. There are four popular isotropic covariogram models C(h; θ), including the
linear, spherical, exponential, and Gaussian covariograms [53]. Two example models are formulated
as follows.
A spherical model is defined as
C(h; θ = [b, c]^T) =
    b,                               if h = 0,        (2.4)
    b(1 − 3h/(2c) + (1/2)(h/c)³),    if 0 < h ≤ c,    (2.5)
    0,                               if h > c.        (2.6)
An exponential model is defined as
C(h; θ = [b, c]^T) =
    b,               if h = 0,    (2.7)
    b·exp(−h/c),     if h > 0.    (2.8)
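As a minimal sketch, the two covariogram models can be coded directly from their definitions; the parameter values below are arbitrary, and the exponential form used here is the standard b·exp(−h/c) decay:

```python
import math

def spherical_cov(h, b, c):
    """Spherical covariogram: equals the sill b at zero lag, 0 beyond the range c."""
    if h < 0:
        raise ValueError("distance must be nonnegative")
    if h > c:
        return 0.0
    return b * (1.0 - 1.5 * h / c + 0.5 * (h / c) ** 3)

def exponential_cov(h, b, c):
    """Exponential covariogram: decays smoothly toward 0 with distance."""
    return b * math.exp(-h / c)

b, c = 2.0, 3.0
print(spherical_cov(0.0, b, c))   # sill b at zero lag
print(spherical_cov(c, b, c))     # exactly 0 at the range c
```

Both functions are decreasing in h, as a valid covariogram of a second-order stationary process must be; the spherical model reaches zero exactly at the range c, while the exponential model only approaches it asymptotically.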
A covariogram model provides a parametric form of the variance-covariance matrix: Var(Z) = Σ(θ),
where Σij = C(si − sj). A second-order stationary process can be cast in terms of a covariogram
function. The covariogram concept also indicates an implicit requirement of a second-order stationary
process: Var(Z(s)) = C(s − s) = C(0), which is independent of s. Note that, for nonstationary
processes, the function C∗(si, sj) remains valid and the variance-covariance matrix Var(Z) = Σ can
still be constructed, but it is not called a covariogram.
Similar to the concept of the covariogram, if the function γ∗(si, sj) = (1/2)Var[Z(si) − Z(sj)] is a
function of the coordinate difference, with γ∗(si, sj) = γ(si − sj), then the function γ(si − sj) is called the
semivariogram of the spatial process. There is a close relation between the covariogram and the
semivariogram. If C(h) is well-defined, then the covariogram and the semivariogram define the same
stationary process. The equivalence can be derived as follows:
Var[Z(si) − Z(sj)] = Var[Z(si)] + Var[Z(sj)] − 2Cov[Z(si), Z(sj)] (2.10)
                   = 2[C(0) − C(si − sj)] = 2γ(si − sj). (2.11)
Spatial Model Decomposition
A popular model decomposition for a spatial random field can be formulated as:
Z(s) = µ(s) + ω(s) + e(s), (2.12)
where µ(s) is the large scale trend (mean) of the spatial random field, ω(s) is the smooth-scale
variation, and e(s) is the white noise measurement error. The first component is determinis-
tic and the other two components are random processes. The large scale trend µ(s) is usually
modeled by a function of s and its related covariates x(s): µ(s) = f(x(s), β), where β is a vector
of unknown function parameters. For example, we can define f(x(s), β) = x(s)^T β, where
x(s) = [s_{d1}, s_{d2}, s²_{d1}, s²_{d2}, s_{d1}·s_{d2}]^T, and s_{d1} and s_{d2} refer to the first and
second dimension coordinates of s, respectively. In this case, the large scale trend is assumed to be a
second-order polynomial function of the spatial locations. The smooth-scale variation ω(s) is a spatial
process that causes spatial dependencies between data objects.
Suppose a set of observations Z(s1), Z(s2), ..., Z(sn) is generated from a Gaussian spatial random
field that is second-order stationary and isotropic. By employing the above decomposition,
let Z = [Z(s1), ..., Z(sn)]^T, ω = [ω(s1), ..., ω(sn)]^T, e = [e(s1), . . . , e(sn)]^T, and X = [x1, . . . , xn]^T.
Then we have
Z = Xβ + ω + e ∼ N(Xβ, Σ(θ)), (2.13)
where Σ(θ) = Var(Z) = Σ_ω(θ) + σ₀²I, ω ∼ N(0_{n×1}, Σ_ω(θ)), and e ∼ N(0_{n×1}, σ₀²I_{n×n}).
Kriging Models
Kriging is a family of Best Linear Unbiased Predictors (BLUP) for spatial data. The three
most popular Kriging models are simple Kriging, ordinary Kriging, and universal Kriging.
Simple Kriging is designed for spatial data with known means, ordinary Kriging is designed for
spatial data with constant but unknown means, and universal Kriging is designed for varying and
unknown means. The first two models can be viewed as special cases of universal Kriging. The basic
idea of universal Kriging (UK) is stated as follows.
Given a set of observations S = {Z(s1), Z(s2), ..., Z(sn)} ⊂ U = {Z(s) | s ∈ D ⊂ R²}, the objective
is to predict the Z value at a "new" location s, Z(s) ∈ U − S. Universal Kriging assumes the linear
mean structure E[Z(s)] = x(s)^T β, where x(s) is a vector of covariates of s. The mean squared
prediction error is used as the error score function.
2.1 Spatial Data Modeling 12
Let Z = [Z(s1), ..., Z(sn)]^T and X = [x1, . . . , xn]^T. Assume that the variance-covariance matrix
Var[Z] = Σ, Cov[Z, Z(s)] = σ, and Var[Z(s)] = σ0. Universal Kriging solves the following
optimization problem:
minimize_a  E[(a^T Z − Z(s))²]
subject to  E[a^T Z] = E[Z(s)]. (2.14)
By the method of Lagrange multipliers, we can derive the analytical solution as
a = Hσ + Σ⁻¹X(X^T Σ⁻¹X)⁻¹x(s), (2.15)
where H = Σ⁻¹ − Σ⁻¹X(X^T Σ⁻¹X)⁻¹X^T Σ⁻¹.
From the form of a, it can be readily derived that β̂_UK = (X^T Σ⁻¹X)⁻¹X^T Σ⁻¹Z, and the best
linear unbiased predictor of Z(s) can be written as
P_UK(Z; s) = x(s)^T β̂_UK + σ^T Σ⁻¹(Z − Xβ̂_UK). (2.16)
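This predictor can be checked numerically. The toy sketch below (one-dimensional sites and an assumed exponential covariogram; all settings are illustrative) verifies the classical property that, in the absence of a nugget effect, the universal Kriging predictor interpolates the data exactly at an observed site:

```python
import numpy as np

def uk_predict(s_new, sites, Z, X, x_new, cov):
    """Universal Kriging predictor P_UK(Z; s) of Eq. (2.16)."""
    Sigma = np.array([[cov(abs(si - sj)) for sj in sites] for si in sites])
    sigma = np.array([cov(abs(si - s_new)) for si in sites])
    Si = np.linalg.inv(Sigma)
    # GLS estimate of the trend coefficients (beta_UK)
    beta = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Z)
    return x_new @ beta + sigma @ Si @ (Z - X @ beta)

cov = lambda h: 2.0 * np.exp(-h / 3.0)        # assumed exponential covariogram
sites = np.array([0.0, 1.0, 2.5, 4.0])
Z = np.array([1.0, 2.0, 1.5, 3.0])
X = np.column_stack([np.ones(4), sites])      # linear trend covariates x(s) = [1, s]
pred = uk_predict(1.0, sites, Z, X, np.array([1.0, 1.0]), cov)
print(round(pred, 6))                         # exact interpolation at site s = 1.0 -> 2.0
```

Because s_new coincides with an observed site, σ^T Σ⁻¹ reduces to a unit row vector, so the correction term recovers the observed residual exactly and the prediction equals the observation.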
The above optimization process assumes that the components Σ, σσσ, and σ0 are known. However,
in real applications, these components are unavailable and need to be treated as unknown model
parameters to be estimated. Without any assumption about the structures of these components, the
total number of unknown parameters will be greater than N², whereas the total number of training
observations is only N. It is impossible to accurately estimate all these parameters, given the
limited training data. To make the estimation process practical, some covariogram function C(h;θθθ)
is usually predefined, and the preceding components can be rewritten as Σ(θθθ), σσσ(θθθ), and σ0(θθθ).
Then the optimization problem becomes the search for the optimal a and θ such that the mean squared
prediction error is minimized. Noticing the relationship between a and β, the optimization
problem can also be reformulated as a generalized least squares problem:
minimize_{β,θ}  [Z − Xβ]^T Σ(θ)⁻¹ [Z − Xβ]
subject to  the constraints on θ defined by the covariogram function. (2.17)
In this form, it is clear that the above problem is nonconvex and has no analytical
solution because of the component Σ(θ)⁻¹ in the objective function. A numerical method termed
iteratively re-weighted generalized least squares (IRWGLS) has been proposed to search for a local
optimum, but it is still computationally expensive [54]. The basic idea is to estimate the parameters
β and θ iteratively, similar to the popular EM algorithm.
Spatial Linear (Gaussian Process) Model
Let {Y(s) : s ∈ D ⊂ R²} be a real-valued spatial process. The Spatial Linear Model (SLM) first
decomposes the spatial process into two additive components
Y(s) = Z(s) + ε(s), s ∈ D, (2.18)
where ε(s) is a spatial white noise process with mean zero and var(ε(s)) = τ² > 0, and τ² is a
parameter to be estimated. The white noise assumption implies that cov(ε(s), ε(r)) = 0 unless
s = r. The hidden process Z(s) is assumed to have the linear mean structure
Z(s) = µ(s) + η(s), s ∈ D, (2.19)
where µ(s) is a deterministic (spatial) mean or trend function, modeling large scale
variations, and the random process η(s) captures the small scale variations. A common strategy is
to define µ(s) = x^T(s)β, where x(s) refers to a vector of known covariates, and the coefficients β are
unknown. The hidden process η(s) is assumed to follow a zero mean spatial Gaussian process
η(s) ∼ GP(0, σ²C(η(s), η(s′)|φ)), (2.20)
where σ² refers to the variance, and C(η(s), η(s′)|φ) refers to the correlation function of the process,
controlled by the parameter φ. By definition, a Gaussian process implies that any subset of latent
variables η = [η(s1), · · · , η(sN)]^T follows a multivariate Gaussian distribution: η ∼ N(0, Σ), where
Σ_{i,j} = σ²C(η(si), η(sj)). The correlation function C(η(si), η(sj)) controls the smoothness and scale
between the latent variables η(s), and can be selected freely as long as the resulting covariance matrix
is symmetric and positive semi-definite. A popular choice is the exponential correlation function
C(η(si), η(sj)) = exp(−‖si − sj‖₂ / φ). (2.21)
Combining Equations (2.18) to (2.20) and defining µ(s) := x^T(s)β, the SLM model can then be
described as
Y(s) = x^T(s)β + η(s) + ε(s),
η(s) ∼ GP(0, σ²C(η(s), η(s′)|φ)),
ε(s) ∼ N(0, τ²). (2.22)
Let Y = [Y(s1), · · · , Y(sN)]^T be the vector of observations at the N sampled locations. A discretized
version of the SLM model can be formalized as
Y = Xβ + η + ε,
η ∼ N(0, σ²R(φ)),
ε ∼ N(0, τ²I), (2.23)
where X = [x(s1), · · · , x(sN)]^T, η = [η(s1), · · · , η(sN)]^T, ε = [ε(s1), · · · , ε(sN)]^T, and
R_{ij}(φ) = C(η(si), η(sj)|φ).
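A minimal sketch of assembling and sampling the discretized model; the site layout, exponential correlation function, and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed settings: N random 2-D sites, exponential correlation, arbitrary parameters.
N, phi, sigma2, tau2 = 30, 2.0, 1.5, 0.1
sites = rng.uniform(0, 10, size=(N, 2))
beta = np.array([1.0, 0.5])
X = np.column_stack([np.ones(N), sites[:, 0]])   # intercept + first coordinate

# Correlation matrix R(phi) from the exponential correlation function.
D = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
R = np.exp(-D / phi)

# Draw one realization Y = X beta + eta + eps of the discretized SLM.
eta = rng.multivariate_normal(np.zeros(N), sigma2 * R)
eps = rng.normal(0.0, np.sqrt(tau2), size=N)
Y = X @ beta + eta + eps

print(R.shape, float(R[0, 0]))   # unit diagonal: each site correlates perfectly with itself
```

The resulting R is symmetric with ones on the diagonal and positive eigenvalues, i.e., a valid correlation matrix, so σ²R + τ²I is a valid covariance for Y.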
Robust Spatial Linear (Gaussian Process) Model
Recently, [255] presented a robust version of the spatial linear model, using a zero-mean Student's t
distribution to model the measurement error instead of the traditional Gaussian distribution. The
robust SLM model can be formalized as
Y = Xβ + η + ε,
η ∼ N(0, σ²R(φ)),
ε_n ∼ Student-t(0, ν, τ), n = 1, · · · , N. (2.24)
The zero-mean Student's t distribution Student-t(0, ν, τ) has the probability density function
p(ε_n) = [Γ(ν/2 + 1/2) / Γ(ν/2)] (1/(πντ))^{1/2} (1 + ε_n²/(ντ))^{−ν/2 − 1/2}, (2.25)
where ν is the degrees of freedom and τ is the scale parameter.
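The density in Equation (2.25) can be sanity-checked numerically; in this sketch the values of ν and τ are arbitrary, and a simple Riemann sum verifies that the density integrates to one:

```python
import math

def student_t_pdf(e, nu, tau):
    """Zero-mean Student's t density of Eq. (2.25), with scale parameter tau."""
    coef = math.gamma(nu / 2 + 0.5) / (
        math.gamma(nu / 2) * math.sqrt(math.pi * nu * tau))
    return coef * (1.0 + e * e / (nu * tau)) ** (-(nu + 1) / 2)

# Riemann sum over a wide interval; the tails are heavy, so use generous bounds.
nu, tau, step = 4.0, 1.0, 0.01
total = sum(student_t_pdf(-50 + i * step, nu, tau) * step for i in range(10000))
print(round(total, 3))   # close to 1.0
```

The heavier-than-Gaussian tails of this density are what make the robust SLM down-weight outlying observations instead of letting them dominate the fit.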
Different from the regular SLM model, inferences based on the robust SLM model are analytically
intractable, and approximate methods need to be considered. The authors evaluated the performance
of the robust SLM model using a variety of approximate inference methods, including Markov
chain Monte Carlo (MCMC), Laplace approximation, factorizing variational approximation (fVB),
and expectation propagation (EP). The results indicate that the EP approach outperformed the other
approximate inference methods overall, in both efficiency and effectiveness.
Bayesian Hierarchical Model
A Bayesian hierarchical model refers to a type of statistical model in which the parameters of a
hierarchical model are themselves treated as random variables; the second-level parameters are known
as hyper-parameters. In the spatial generalized linear mixed model (SGLMM), the model parameters
include β, σ², φ, and τ. Prior distributions can be defined on these parameters. Specifically, β is
assigned a multivariate Gaussian prior, i.e., β ∼ N(µ_β, Σ_β). The variance component σ² is assigned
an inverse-Gamma prior, i.e., σ² ∼ Inv-Gamma(α_σ, β_σ). The correlation parameter φ is usually
assigned an informative prior decided based on the underlying spatial domain, e.g., a uniform
distribution over a finite range. The prior distribution of the dispersion parameter τ depends on the
specific exponential family distribution. For the Gaussian distribution, the prior is an inverse-Gamma
distribution. For the binomial and Poisson distributions, τ is set to 1, a deterministic value, and hence
no prior is needed.
2.2 Laplacian Smoothing
This section introduces the concepts of the (continuous) Laplace operator and the discrete Laplace
operator, and discusses Laplacian smoothing and its connections with local based spatial outlier
detection methods. The discussion focuses on a two-dimensional spatial space and can be
straightforwardly generalized to higher dimensional spaces.
Continuous and Discrete Laplace Operator
The continuous Laplace operator (∆) is defined as the divergence of the gradient of a function f.
Given a twice-differentiable real-valued function f(x) : x = [x1, x2]^T ∈ R² → R, the Laplacian of f
is defined by
∆f = ∇²f = Σ_{i=1}^{2} ∂²f/∂x_i². (2.26)
Let G = (V, E) be a graph with vertices V and edges E. Let f : V → R be a real-valued function
of the vertices. The discrete (or graph) Laplacian (∆) is defined by
(∆f)(u) = Σ_{v∈N(u)} W_{uv}[f(u) − f(v)], (2.27)
where N(u) refers to the nearest neighbors of the vertex u and W_{uv} refers to the weight of the edge
between u and v.
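As a quick numerical check of the continuous operator: for f(x1, x2) = x1² + x2², the Laplacian is exactly 4 everywhere, and a five-point finite-difference stencil (an illustrative discretization, not part of the original derivation) reproduces it:

```python
def laplacian_fd(f, x, y, h=1e-3):
    """Five-point finite-difference approximation of the continuous Laplacian."""
    return (f(x + h, y) + f(x - h, y)
            + f(x, y + h) + f(x, y - h) - 4 * f(x, y)) / h**2

# Laplacian of x^2 + y^2 is d2f/dx2 + d2f/dy2 = 2 + 2 = 4 at every point.
f = lambda x, y: x**2 + y**2
print(round(laplacian_fd(f, 0.7, -1.2), 6))   # -> 4.0
```

For quadratic functions the stencil is exact up to floating-point roundoff, which makes this a convenient sanity check before moving to the graph version of the operator.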
Edge weights can be defined based on specific application requirements. For a set of spatial
observations Z(s1), Z(s2), ..., Z(sn), a K-nearest neighbor graph is usually employed to model the
spatial neighborhood relationships. In this graph, each vertex corresponds to a spatial location, and
the function f gives the nonspatial attribute value: f(si) = Z(si). There are two popular weight
functions, the averaging and heat kernels.
The averaging kernel is defined by
W_{ij} = 1/K,    if sj ∈ N(si),    (2.28)
W_{ij} = 0,      otherwise.        (2.29)
The heat kernel is defined by
W_{ij} = exp(−‖sj − si‖² / (4t)),    if sj ∈ N(si),    (2.30)
W_{ij} = 0,                          otherwise.        (2.31)
Laplacian Smoothing
A Laplacian matrix L_{n×n} is defined as
L_{ij} = −W_{ij},                 if sj ∈ N(si),    (2.32)
L_{ij} = Σ_{j'=1}^{n} W_{ij'},    if i = j,         (2.33)
L_{ij} = 0,                       otherwise.        (2.34)
Let D be a diagonal matrix with D_{ii} = Σ_{j=1}^{n} W_{ij}. It can be derived that L = D − W. Let
Z = [Z(s1), Z(s2), ..., Z(sn)]^T; then the discrete Laplacians can be calculated by
∆Z = LZ. (2.35)
The linear transform Z∗ = LZ is called Laplacian smoothing, and the components of Z∗ are
called adjusted observations after Laplacian smoothing (or Laplacian-smoothed observations).
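Laplacian smoothing with the averaging kernel can be sketched on a toy one-dimensional chain; the data values and the K = 2 neighborhood below are assumptions for illustration:

```python
import numpy as np

# Six sites on a line; each site's neighbors are its K = 2 nearest other sites.
sites = np.arange(6, dtype=float)
Z = np.array([1.0, 2.0, 3.0, 9.0, 5.0, 6.0])   # site 3 deviates from the linear trend
K = 2

W = np.zeros((6, 6))
for i in range(6):
    nbrs = np.argsort(np.abs(sites - sites[i]))[1:K + 1]
    W[i, nbrs] = 1.0 / K                        # averaging kernel weights

D = np.diag(W.sum(axis=1))
L = D - W                                       # Laplacian matrix, L = D - W
delta = L @ Z                                   # Laplacian-smoothed values
print(int(np.argmax(np.abs(delta))))            # site with the largest local difference -> 3
```

Each entry of LZ is exactly the observation minus the average of its neighbors, so the deviating site stands out with the largest smoothed value, which is the quantity local based detectors threshold.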
There is a close connection between Laplacian smoothing and local based spatial outlier detection
methods. The local statistic defined in Equation 2.45 is the same as a Laplacian smoothing process
based on an averaging kernel. Notice that a second-order stationary process has a stable energy
across different realizations of the process. Assume that we are given the whole set of observations
R = {Z(s) | s ∈ D ⊂ R²}. Define the function f as f(s) = Z(s). Then the set {Z(s) | s ∈ D ⊂ R²}
relates to a three-dimensional manifold surface and the energy of the spatial process can be calculated
as
E(f) = ∫_R ‖∇f(s)‖² ds = C, (2.36)
where C is a constant value.
Suppose only partial observations of the surface (or realization) are available: R = {Z(s1), ..., Z(sn)};
then we can use the discrete form of the energy function
E(f) = Z^T L Z = Σ_{i,j} W_{ij} [Z(si) − Z(sj)]² ≈ C. (2.37)
The presence of outliers in the set R will increase the energy E(f) of the spatial process. Therefore,
outlier detection actually amounts to identifying a small number of observations such that the updated
energy after their removal is minimized.
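This energy view can be sketched directly: score each observation by how much the discrete energy drops when it is excluded. The data and the K = 2 neighborhood below are illustrative assumptions:

```python
import numpy as np

def energy(W, Z, keep):
    """Discrete graph energy, restricted to the kept observations."""
    idx = np.flatnonzero(keep)
    return sum(W[i, j] * (Z[i] - Z[j]) ** 2 for i in idx for j in idx)

sites = np.arange(7, dtype=float)
Z = np.array([0.0, 1.0, 2.0, 3.0, 20.0, 5.0, 6.0])   # site 4 is an outlier
W = np.zeros((7, 7))
for i in range(7):
    nbrs = np.argsort(np.abs(sites - sites[i]))[1:3]  # K = 2 nearest neighbors
    W[i, nbrs] = 0.5

keep = np.ones(7, dtype=bool)
base = energy(W, Z, keep)
drops = []
for i in range(7):
    keep[i] = False
    drops.append(base - energy(W, Z, keep))           # energy saved by excluding site i
    keep[i] = True
print(int(np.argmax(drops)))                          # excluding the outlier helps most -> 4
```

Excluding the anomalous site removes every large squared-difference term it participates in, so it yields by far the biggest energy reduction, which matches the removal criterion described above.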
2.3 Approximate Inference Techniques
This section introduces two advanced approximate inference techniques, including the Integrated
Nested Laplace Approximation and Expectation Propagation.
The Integrated Nested Laplace Approximation
The integrated nested Laplace approximation (INLA) [217] is a computational approach
proposed as an alternative to the time-consuming MCMC method. The INLA approximation
performs Bayesian inference in latent Gaussian fields. It approximates the marginal posteriors of the
latent variables as well as of the parameters of the latent Gaussian model, given by
π(v_i|Y) = ∫ π(v_i|θ, Y) π(θ|Y) dθ. (2.38)
This approximation is an efficient combination of Laplace approximations to the full conditionals
π(θ|Y) and π(v_i|θ, Y), followed by numerical integration routines to integrate out the
parameter θ.
The INLA approach consists of three main approximations to obtain the marginal posteriors for
each latent variable. The first step is to approximate the full posterior π(θ|Y ), which is executed
using the Laplace approximation
π(θ|Y) ∝ [ π(v, θ, Y) / π_G(v|θ, Y) ] |_{v = v∗(θ)}. (2.39)
As shown above, we need to approximate the full conditional distribution π(v|Y, θ), which can
be achieved by a multivariate Gaussian density π_G(v|Y, θ) [218]. Here, v∗(θ) is the mode of the full
conditional distribution of v for a given θ and can be estimated using π_G(v|Y, θ). The posterior
π(θ|Y) will be used later to integrate out the uncertainty with respect to θ when approximating
π(v_i|Y).
The second step executes the Laplace approximation of the full conditionals π(vi|θ, Y ) for specified
θ values. The density π(vi|θ, Y ) is approximated using Laplace approximation defined by
π_LA(vi|θ, Y) ∝ π(v, θ, Y) / π_G(v_{−i}|vi, θ, Y) |_{v_{−i} = v*(vi, θ)}, (2.40)
where π_G(v_{−i}|vi, θ, Y) refers to the Gaussian approximation of π(v_{−i}|vi, θ, Y), treating vi as
a fixed value, and v*(vi, θ) is the mode of π(v_{−i}|vi, θ, Y).
Finally, we can approximate the marginal posterior density of vi by combining the full posteriors
obtained in the previous steps. The approximation expression is shown as follows.
π(vi|Y) ≈ Σ_k π(vi|θk, Y) π(θk|Y) Δk. (2.41)

It is a numerical summation over a representative set of points θk, with area weights Δk for k = 1, ..., K.
Note that a good choice of the set θk is crucial to the accuracy of the above numerical integration.
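The numerical-integration step of Equation 2.41 can be illustrated on a toy conjugate model. The model below (y | v ~ N(v, 1), v | θ ~ N(0, θ), a flat prior over an assumed θ grid, and equal area weights Δk) is a simplifying assumption chosen so every conditional is available in closed form; a real INLA implementation instead builds the grid around the mode of π(θ|Y).

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def inla_style_marginal(y, thetas):
    """Approximate pi(v | y) by a weighted sum over a hyperparameter grid,
    mimicking the summation in Equation 2.41 (toy conjugate model, assumed)."""
    # pi(theta_k | y) is proportional to the marginal likelihood N(y; 0, 1 + theta_k)
    weights = [normpdf(y, 0.0, 1.0 + t) for t in thetas]
    total = sum(weights)
    weights = [w / total for w in weights]          # normalize over the grid
    # Conditional posteriors pi(v | theta_k, y) are Gaussian by conjugacy.
    posts = []
    for t in thetas:
        var = 1.0 / (1.0 + 1.0 / t)
        posts.append((var * y, var))                # (posterior mean, variance)
    # Mixture density: pi(v | y) ~= sum_k pi(v | theta_k, y) pi(theta_k | y) Delta_k
    def density(v):
        return sum(w * normpdf(v, m, s) for w, (m, s) in zip(weights, posts))
    return density

dens = inla_style_marginal(y=2.0, thetas=[0.5, 1.0, 2.0, 4.0])
```

The returned density is a finite mixture of the conditional posteriors, weighted by the (normalized) posterior over the grid; refining the grid corresponds to choosing the set θk more carefully.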
Expectation Propagation
Expectation Propagation [219] is an efficient approximate inference framework that has been shown
to achieve better predictive performance than traditional inference approaches, such as variational
approximation and Laplace approximation [255]. Given observed data D and hidden variables (including
parameters) θ, for many probabilistic models, the posterior distribution of θ given D comprises a
product of factors with the form
p(θ|D) = (1/p(D)) ∏_i f_i(θ). (2.42)
EP aims to approximate p(θ|D) by a product of simpler factors

q(θ) = (1/p(D)) ∏_i f̃_i(θ), (2.43)

in which each approximating factor f̃_i(θ) corresponds to one of the true factors f_i(θ) in Equation 2.42.
The factors f̃_i(θ) are usually constrained to parametric forms (e.g., the exponential family) in order
to make the inference algorithm practical.
Basically, EP iteratively refines the approximate posterior q(θ) by passing messages through the
factors. In each iteration, EP first removes one approximating factor f̃_i(θ) from q(θ), forming the
cavity distribution q^{\i}(θ) ∝ q(θ)/f̃_i(θ), and combines the cavity with the true factor f_i(θ),
denoted q^{\i}(θ)f_i(θ). It then computes the refined posterior q^new(θ) by matching moments
between q^new(θ) and q^{\i}(θ)f_i(θ). After that, the factor f̃_i(θ) is updated as

f̃_i(θ) ∝ q^new(θ) / q^{\i}(θ). (2.44)

EP continues the refinement iterations until all factors f̃_i(θ) converge. Note that the convergence
of EP has not been theoretically justified in general, but in practice convergence is often achieved,
as it is in our problem.
2.4 Outlier Detection
This section first introduces general outlier detection, and then presents related works on spatial
outlier detection and multivariate spatial outlier detection [53, 54].
General Outlier Detection
Existing outlier detection algorithms can be classified into the following categories: clustering-based,
distribution-based, depth-based, density-based, and distance-based. A few clustering-based algo-
rithms have been designed to identify outliers as exceptional data points that do not belong to any
cluster [156,128,141]. Since these algorithms are not specifically designed for outlier detection, their
efficiency and effectiveness are not optimized. Distribution-based methods use a standard distribu-
tion to fit the data set so that data points deviating from this distribution are defined as outliers [154].
The primary limitation of these methods is that in many applications, the exact distribution of a
data set is unknown beforehand. Depth-based methods organize the data in different layers of k-d
convex hulls where data in the outer layers tend to be outliers [144, 283]. These methods are not
widely used due to their high computation costs for multi-attribute data. Density-based algorithms
define outliers in terms of their local reachability densities [123, 133]. Local outlier factor (LOF) is a
typical example of density-based algorithms; it evaluates the outlierness of an object by comparing
its density with those of its neighbors. Distance-based methods may be the most widely used
techniques; they define an outlier as a data point that is exceptionally far from the other data
points [262, 280].
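As a minimal illustration of the distance-based family, the sketch below scores each point by its distance to its k-th nearest neighbor; thresholding (or ranking) that score flags outliers. The toy data and choice of k are assumptions for illustration.

```python
def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor;
    large scores indicate distance-based outliers."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    scores = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(ds[k - 1])    # distance to the k-th nearest neighbor
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]   # (10, 10) is isolated
scores = knn_distance_scores(pts, k=2)
print(scores.index(max(scores)))  # prints 4
```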
Spatial Outlier Detection
Traditional outlier detection algorithms can be applied to spatial data. However, their performance
is not assured since they treat spatial attributes and non-spatial attributes equally. For spatial
outlier detection, spatial and non-spatial dimensions should be considered separately. The spatial
dimension is used to define the neighborhood relationship, while the non-spatial dimension is often
used to define the discrepancy quantity. By the first law of geography, “Everything is related to
everything else, but nearby things are more related than distant things” [55].
A number of algorithms have been specifically designed to deal with spatial data. These methods
can be generally grouped into two categories, namely, graphic and quantitative approaches. Graphic
approaches are based on visualization of spatial data which highlights spatial outliers. Examples
include variogram clouds and pocket plots [247,277]. A Scatterplot shows the attribute value on the
X-axis and the average of the attribute values over the neighborhood on the Y -axis. Nodes far away
from the least square regression line are flagged as potential spatial outliers. A Moran scatterplot
is a plot of normalized attribute value against the neighborhood average of normalized attribute
values. It contains four quadrants where spatial outliers can be identified from the upper left and
lower right quadrants.
Quantitative methods provide tests to distinguish spatial outliers from the remainder of the data
set. These methods can be further grouped into two categories, namely, local statistics and global
statistics based approaches. Given a set of observations Z(s1), Z(s2), ..., Z(sn), a local spatial
statistic [56] is defined as

S(s) = Z(s) − E_{si∈N(s)}(Z(si)), (2.45)

where G = {s1, ..., sn} ⊂ R² is a set of spatial locations, s ∈ G, Z(s) ∈ R represents the value of the
Z attribute at location s, N(s) is the set of spatial neighbors of s, and E_{si∈N(s)}(Z(si)) represents the
average attribute value over the neighbors of s. It is assumed that the set of local spatial statistics
{S(s1), ..., S(sn)} is independently and identically normally distributed (i.i.d. normal). The popular
Z-test [56] for detecting spatial outliers can then be described as follows: flag s as a spatial outlier if
ZS(s) = |(S(s) − µs)/σs| > Φ^{-1}(1 − α/2), where Φ is the cumulative distribution function (CDF) of
the standard normal distribution, α refers to the significance level and is usually set to 0.05, and µs and σs
are the sample mean and standard deviation of the local statistics, respectively.
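The Z-test above can be sketched directly. The synthetic one-dimensional locations, neighborhood size K = 2, and the 1.96 threshold (α = 0.05) below are illustrative assumptions.

```python
def spatial_z_test(locs, z, k=2, threshold=1.96):
    """Flag spatial outliers via the local statistic S(s) = Z(s) - neighbor mean,
    standardized by the sample mean and standard deviation of all S(s)."""
    n = len(locs)
    s_stats = []
    for i in range(n):
        # K-nearest neighbors by location distance (1-D locations for simplicity)
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: abs(locs[j] - locs[i]))[:k]
        s_stats.append(z[i] - sum(z[j] for j in nbrs) / k)
    mu = sum(s_stats) / n
    sd = (sum((s - mu) ** 2 for s in s_stats) / (n - 1)) ** 0.5
    return [i for i in range(n) if abs((s_stats[i] - mu) / sd) > threshold]

locs = list(range(10))
z = [1.0, 1.1, 0.9, 1.0, 8.0, 1.1, 0.9, 1.0, 1.1, 0.9]   # spike at location 4
print(spatial_z_test(locs, z))  # prints [4]
```

Note that the spike also inflates the local statistics of its neighbors (locations 3 and 5), which is exactly the contamination that motivates the robust variants discussed next.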
Lu et al. [57] pointed out that the Z-test is susceptible to the well-known masking and swamping
effects. When multiple outliers exist in the data, the quantities E_{si∈N(s)}(Z(si)), µs, and σs are
biased estimates of the population mean and standard deviation. As a result, some true outliers are
“masked” as normal objects and some normal objects are “swamped” and misclassified as outliers.
The authors proposed an iterative approach that detects outliers over multiple iterations. Each iteration
identifies only one outlier and modifies its attribute value so that it does not impact the results
of subsequent iterations. Later, Chen et al. [58] proposed a median based approach that uses the
median estimator for the quantities E_{si∈N(s)}(Z(si)) and µs, and the median absolute deviation (MAD)
estimator for σs. Hu and Sung [60] proposed an approach similar to [58], but using a trimmed mean
to estimate E_{si∈N(s)}(Z(si)) instead of the median estimator. Sun and Chawla [61] presented a
spatial local outlier measure to capture the local behavior of data in their neighborhood. Shekhar et
al. [286] employed a graph-based method to define spatial neighborhoods (N(s)) and their method
is applied to a special case of transportation network.
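A sketch of the median/MAD variant: replacing the neighborhood mean with the median and the scale estimate σs with 1.4826·MAD makes the test resistant to the masking and swamping effects. The data, neighborhood size, and threshold below are illustrative assumptions.

```python
import statistics

def robust_spatial_outliers(locs, z, k=3, threshold=1.96):
    """Median-based local statistic with a MAD scale estimate
    (robust variant of the Z-test sketched above)."""
    n = len(locs)
    s_stats = []
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: abs(locs[j] - locs[i]))[:k]
        # Median of the neighborhood instead of the mean:
        s_stats.append(z[i] - statistics.median(z[j] for j in nbrs))
    med = statistics.median(s_stats)
    mad = statistics.median(abs(s - med) for s in s_stats)
    scale = 1.4826 * mad            # MAD rescaled to estimate sigma under normality
    return [i for i in range(n) if abs((s_stats[i] - med) / scale) > threshold]

locs = list(range(10))
z = [1.0, 1.1, 0.9, 1.0, 8.0, 1.1, 0.9, 1.0, 1.1, 0.9]
print(robust_spatial_outliers(locs, z, k=3))  # prints [4]
```

With K = 3, the neighborhood median of the spike's neighbors ignores the contaminated value, so only the true outlier is flagged.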
Global based approaches identify outliers using the robust estimator of a global kriging model which
is the best linear unbiased estimator for geostatistical data. Particularly, Christensen et al. [62]
proposed diagnostics to detect spatial outliers on the estimation of covariance function. Cerioli and
Riani [63] developed a forward search procedure to identify spatial outliers for an ordinary kriging
model. Militino et al. [64] further generalized the forward search method in [63] to a universal kriging
model.
Multivariate Outlier Detection
The above methods for detecting outliers focus on low dimensional data. For detecting outliers with
numerous attributes, traditional outlier detection approaches are ineffective due to the curse of high
dimensionality, i.e., the sparsity of the data objects in a high dimensional space [212]. It has been
shown that the distance between any pair of data points in a high dimensional space is so similar
that either every data point or none of the data points can be viewed as an outlier if the concept of
proximity is used to define outliers [209]. As a result, traditional Euclidean distance cannot be used
to effectively detect outliers in high dimensional data sets. Two categories of research work have
been conducted to address this issue. One is to project high dimensional data to low dimensional
data [211, 212, 122, 249], and the other is to re-design distance functions to accurately define the
proximity relationship between data points [209].
Currently, only a limited number of methods have been proposed for multivariate spatial outlier detection.
Two representative approaches basically generalize local and global based univariate approaches to
multivariate spatial data. Particularly, Chen et al. [58] extend the univariate median based (local)
method [58] to multivariate data. The Mahalanobis distance is used to capture the correlations
between different attributes in the local differences, and the Minimum Covariance Determinant (MCD)
estimator is used to replace the median estimator. Militino et al. [64] extend the univariate forward search
(global based) method to multivariate data. Multivariate kriging (or named co-kriging) model is
used to replace (univariate) kriging model. Other related methods include robust trend parameters
estimation [94] and robust covariogram parameters estimation [95] for multivariate spatial data.
Chapter 3
A Generalized Approach to Numerical Spatial Outlier Detection
Local based approaches form a major category of methods for spatial outlier detection (SOD). Currently,
there is a lack of systematic analysis of the statistical properties of this framework. For example,
most methods assume independent and identically distributed normal (i.i.d. normal) local differences,
but no justification for this critical assumption has been presented. The methods' detection
performance on geostatistical data with linear or nonlinear trends is also not well
studied. In addition, there is a lack of theoretical connections and empirical comparisons between
local and global based SOD approaches. This chapter discusses all these fundamental issues under
the proposed generalized local statistical (GLS) framework. Furthermore, robust estimation and
outlier detection methods are designed for the new GLS model. Extensive simulations demonstrate
that the SOD method based on the GLS model significantly outperforms all existing approaches
when the spatial data exhibits a linear or nonlinear trend.
This chapter is organized as follows. Section 3.1 introduces background and motivation. Section 3.2
introduces the generalized local statistical model and presents a rigorous theoretical treatment of
its fundamental statistical properties. Section 3.3 introduces several robust estimation and outlier
detection methods for the GLS model, and analyzes the connection between different SOD methods.
Section 3.4 provides the simulations and discussions, and Section 3.5 gives the conclusion.
3.1 Background and Motivation
The ever-increasing volume of spatial data has greatly challenged our ability to extract useful but
implicit knowledge from them. As an important branch of spatial data mining, spatial outlier detec-
tion aims to discover the objects whose non-spatial attribute values are significantly different from
the values of their spatial neighbors [53]. In contrast to traditional outlier detection, spatial outlier
detection must differentiate spatial and non-spatial attributes, and consider the spatial continuity
and autocorrelation between nearby samples. By the first law of geography, “Everything is related
to everything else, but nearby things are more related than distant things.” [55]
There are two main streams for spatial outlier detection (SOD): local and global based approaches.
Local based approach [56] first calculates the local difference (statistic) for each object, which is the
difference between the non-spatial attribute of the object and the aggregated value (e.g., average)
of its spatial neighbors. By assuming i.i.d. normal distributions for these local differences, the
local based approach discovers outlier objects by robust estimation of model parameters, such as
the aggregated values, mean, and standard deviation. Various methods have been presented by
using various spatial neighborhood definitions and robust estimation techniques [57,61]. The second
stream, global based, is to identify outliers using the robust estimator of a global kriging model which
is the best linear unbiased estimator for geostatistical data. Particularly, Christensen et al. [62]
proposed diagnostics to detect spatial outliers on the estimation of covariance function. Cerioli and
Riani [63] developed a forward search procedure to identify spatial outliers for an ordinary kriging
model. Militino et al. [64] further generalized the forward search method in [63] to a universal kriging
model. We focus on local based methods because they are simpler to understand and implement
and can achieve better efficiency with minimal loss of accuracy. This will be justified
by extensive simulations in Section 3.5.
This work is primarily motivated by the current situation where there is still no systematic study
about the statistical properties of local based SOD methods. For example, existing works assume
i.i.d. on local differences, but no justifications have ever been proposed. Also, their performance
on spatial data with linear or nonlinear trends has not been well studied. There is also a lack of
research on the theoretical connections and empirical comparisons between local and global based
SOD methods. To that end, this chapter presents a generalized framework for local based SOD
methods and theoretically and empirically compares it to global based SOD methods. The proposed
framework is cast within the statistical abstraction of a spatial Gaussian random field, which is
the most popular model for geostatistical data [53, 54]. A major reason for its popularity is that
the optimal solution based on the Gaussian random field is equivalent to a best linear unbiased
estimator that imposes no particular distributional assumption.
A spatial Gaussian random field refers to a collection of dependent random variables that are asso-
ciated with a set of spatial indexes, {Z(s) : s ∈ D ⊂ R²}, where D is a continuous fixed region. This
family of random variables can be characterized by a joint Gaussian probability density or distribu-
tion. In real applications, only partial observations of one realization (or a partial sample of size one)
are available: {Z(s1), ..., Z(sn)}. In order to make this model operational, the requirements for sta-
tionarity and isotropy, such as second-order or intrinsic stationarity, are further imposed. Imposing
such an assumption reduces the number of model parameters required to be estimated. When the
data is second-order stationary and isotropic, the spatial correlation structure is described by some
semivariogram or covariance function, in which the correlation between two variables is dependent
on their spatial distance. Statistical inferences are then performed by assuming some explicit forms
of the covariance and mean functions.
Our major contributions are as follows:
• Design of a generalized local statistical framework: The general local statistical (GLS)
model is a generalized statistical framework for existing local based SOD methods. It can
effectively handle complex situations where the spatial data exhibits a global trend or non-
negligible dependences between local differences.
• Robust estimation and outlier detection methods based on the proposed GLS
framework: Analyze contamination issues that cause the masking and swamping effects of
outlier detection. Based on the analysis, two robust algorithms, GLS-backward search and
GLS-forward search, are proposed to estimate the parameters for the GLS model.
• In-depth study on the connection between different SOD methods: Present theo-
retical foundations for existing local based SOD methods and discuss the crucial connections
between local and global based SOD methods.
• Comprehensive simulations to validate the effectiveness and efficiency of GLS:
This is the first work that provides extensive comparisons between existing popular methods
through a systematic simulation study. The results show that the proposed GLS-SOD ap-
proach significantly outperformed all existing methods when the spatial data exhibits a linear
or nonlinear trend.
3.2 Spatial Local Statistics and Related Works
Given a set of observations Z(s1), Z(s2), ..., Z(sn), a local spatial statistic [56] is defined as

S(s) = Z(s) − E_{si∈N(s)}(Z(si)), (3.1)

where G = {s1, ..., sn} ⊂ R² is a set of spatial locations, s ∈ G, Z(s) ∈ R represents the value of the
Z attribute at location s, N(s) is the set of spatial neighbors of s, and E_{si∈N(s)}(Z(si)) represents the
average attribute value over the neighbors of s. It is assumed that the set of local spatial statistics
{S(s1), ..., S(sn)} is independently and identically normally distributed (i.i.d. normal). The popular
Z-test [56] for detecting spatial outliers can then be described as follows: flag s as a spatial outlier if
ZS(s) = |(S(s) − µs)/σs| > Φ^{-1}(1 − α/2), where Φ is the cumulative distribution function (CDF) of
the standard normal distribution, α refers to the significance level and is usually set to 0.05, and µs and σs
are the sample mean and standard deviation of the local statistics, respectively. A number of improved methods have
been proposed based on robust estimation of local model parameters, such as local statistics, mean,
and standard deviation [57, 58, 60, 61, 286].
Most existing local based methods assume that the set of local statistics {S(s1), ..., S(sn)} is i.i.d.
normal, but no justifications for this assumption have been proposed. As we will discuss in sub-
sequent sections, this i.i.d. assumption is only approximately true in certain scenarios, and the
dependencies between different local differences (statistics) must be considered when the spatial
data exhibits a linear or nonlinear trend or the selected neighborhood size for each object is small. As
shown in our simulations in Section 3.5, violation of the i.i.d. assumption can significantly impact
the accuracy of the outlier detection methods.
3.3 Generalized Local Spatial Statistics
This section first introduces some preliminary background on spatial Gaussian random field, then
presents the generalized local statistical (GLS) model, and finally discusses the statistical properties
of our GLS model. Table 3.1 summarizes the key notations used in this chapter.
Table 3.1: Description of major symbols
Symbol: Description
{Z(si)}_{i=1}^{n}: A given set of observations, where si ∈ R² is the spatial location and Z(·) is the Z attribute value
{x(si)}_{i=1}^{n}: x(si) is a vector of covariates of si, such as the bases of the spatial coordinates of si
Z: Z = [Z(s1), ..., Z(sn)]^T
X: X = [x(s1), ..., x(sn)]^T
F: Neighborhood weight matrix; see Equation 3.4
N(s): A general definition of the spatial neighbors of s
NK(s): The K-nearest neighbors of s; we consider NK(s) as the specification of N(s)
K: Neighborhood size; the major parameter used to define spatial neighbors (NK(s))
SOD: Spatial Outlier Detection
GLS: Generalized Local Statistics model
β, σ, σ0: The unknown parameters in the GLS model
3.3.1 Generalized Local Statistic Model (GLS)
Consider a spatial Gaussian random field {Z(s) : s ∈ D ⊂ R²} with the following form:
Z(s) = f(x(s),β) + ω(s) + e(s), (3.2)
where D is a fixed region, f(x(s),β) is the large scale trend (mean) of the process, ω(s) is the smooth-
scale variation that is a Gaussian stationary process, and e(s) is the white noise measurement error
with variance σ0². For the large scale trend f(x(s), β), x(s) is a vector of covariates, and β is a
vector of parameters for the trend model. We assume that x(s) is a vector of the basis of spatial
coordinates of s, and f(x(s),β) is a linear function with f(x(s),β) = x(s)Tβ. The nonlinear degree
of the trend depends on the polynomials of the elements in x(s). For the smooth-scale variation
ω(s), we assume that it is an isotropic second order stationary process, which means the covariance
Cov(Z(s1), Z(s2)) is a function of the spatial distance between s1 and s2: C(‖ s1 − s2 ‖). Various
distance metrics may be selected, such as L2 (Euclidean distance), L1 (Manhattan distance), and
graph distance [62].
Given a set of observations Z(s1), Z(s2), ..., Z(sn) that is a partial sample of a particular realization
of the spatial Gaussian random field, let Z = [Z(s1), ..., Z(sn)]T , ω = [ω(s1), ..., ω(sn)]T , e =
[e(s1), ..., e(sn)]^T, and X = [x(s1), ..., x(sn)]^T. Then we have

Z = Xβ + ω + e ∼ N(Xβ, Σ + σ0² I), (3.3)

where ω ∼ N(0_{n×1}, Σ_{n×n}) and e ∼ N(0_{n×1}, σ0² I_{n×n}).
The vector of local spatial statistics calculated by Equation 3.1 can be reformulated as the matrix
form
diff(Z) = FZ, (3.4)

where F ∈ R^{n×n} is a neighborhood weight matrix with F_ij = 1 when i = j; F_ij = −1/K when
sj ∈ NK(si); and F_ij = 0 otherwise. By Equations 3.3 and 3.4, we can readily derive the generalized
local statistical (GLS) model as

diff(Z) ∼ N(FXβ, FΣF^T + σ0² FF^T). (3.5)

As shown in Section 3.3.2, FΣF^T can be approximated by σ² I. It follows that the GLS form (3.5)
becomes asymptotically equivalent to

diff(Z) ∼ N(FXβ, σ² I + σ0² FF^T). (3.6)

As indicated in Theorem 1 of Section 3.3.2, when the neighborhood size is relatively large with K ≥ 8,
the component σ0² FF^T can be further approximated by σ0² I. This leads to a simpler form of GLS
as

diff(Z) ∼ N(FXβ, (σ² + σ0²) I). (3.7)
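Forming the neighborhood weight matrix F and the local differences diff(Z) = FZ can be sketched as below. The grid layout and parameter values are illustrative assumptions; the sketch also checks that each row of F sums to zero, so any constant mean component is removed from diff(Z).

```python
def knn_weight_matrix(locs, k):
    """Build F with F_ii = 1 and F_ij = -1/K for the K nearest neighbors of s_i."""
    n = len(locs)
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(locs[i], locs[j])))[:k]
        F[i][i] = 1.0
        for j in nbrs:
            F[i][j] = -1.0 / k
    return F

def local_differences(F, z):
    """diff(Z) = F Z."""
    return [sum(fij * zj for fij, zj in zip(row, z)) for row in F]

# 3x3 grid of locations (assumed layout), K = 4.
locs = [(x, y) for x in range(3) for y in range(3)]
F = knn_weight_matrix(locs, k=4)
diffs = local_differences(F, [5.0] * 9)    # constant field: differences vanish
print(max(abs(d) for d in diffs))  # prints 0.0
```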
Discussion: The local statistic is a popular technique used to reduce the dependence between sample
points. However, by employing the decomposition form indicated in the above equations, we
observe that local statistics help reduce the correlations between sample points caused by smooth-
scale random variations, but at the same time they also induce “new” correlations due to the averaging
of white noise variations. As discussed in [54], correlated data can be expressed as a linear combination
of uncorrelated data. The approximate GLS form (3.6) explicitly models the “new” correlations caused
by the averaging of white noise variations. The approximate GLS form (3.7) essentially ignores these
“new” correlations. The form (3.7) may be considered when users expect high efficiency and allow
some loss of accuracy. This tradeoff is studied in Section 3.5 by simulations.
The generalized local statistical model above has the unknown parameters β, σ, and σ0. The robust
estimation of these parameters will be discussed in Section 3.4.
3.3.2 Theoretical Properties of GLS
This section studies the properties of the two major covariance components σ0² FF^T and FΣF^T, and
discusses the situations where they can be approximated by σ0² I and σ² I, respectively. As shown
in Equation 3.3, σ0² FF^T and FΣF^T are the covariance matrices of the random vectors e* = Fe and
ω* = Fω, respectively. We focus on the study of their correlation structures. Because they are both
multivariate normally distributed, the correlation structure gives important information about the
related dependence structure (e.g., zero correlation implies independence). Three related theorems
are stated as follows:
Theorem 1 The random vector e* has two major properties:

1. The variance Var(e*_i) = ((K+1)/K) σ0², i = 1, ..., n;

2. The correlation |ρ(e*_i, e*_j)| ≤ 2/(K+1), ∀ i, j with i ≠ j,

where e*_i refers to the i-th element of the vector e*.

Proof First, we prove Property 1. Recall that Var(e*) = σ0² FF^T, where F is the neighborhood
weight matrix (see Section 3.3.1, Equation 3.4, for the definition). For simplicity, we represent F as
[F_1, F_2, ..., F_n]^T and let F_ij denote the j-th component of the vector F_i. According to the definition
of F, F_ii = 1; F_ij = −1/K if sj ∈ NK(si); otherwise, F_ij = 0. It follows that Var(e*_i) = [σ0² FF^T]_ii =
σ0² F_i^T F_i = σ0² (1 + Σ_{k=1}^{K} 1/K²) = σ0² (1 + 1/K) = ((1+K)/K) σ0², ∀ i = 1, ..., n. This proves Property 1.
Second, we prove Property 2. ∀ i, j ∈ {1, ..., n}, the correlation ρ(e*_i, e*_j) = [σ0² FF^T]_ij / (((K+1)/K) σ0²) =
(K/(K+1)) F_i^T F_j = (K/(K+1)) Σ_{t=1}^{n} F_it F_jt = (K/(K+1)) (F_ii F_ji + F_ij F_jj + Σ_{t=1, t≠i,j}^{n} F_it F_jt).
The third component in this expression satisfies Σ_{t=1, t≠i,j}^{n} F_it F_jt ∈ [0, 1/K], since F_it and F_jt can
only be −1/K or zero, and the set {F_it}_{t=1, t≠i}^{n} (or {F_jt}_{t=1, t≠j}^{n}) has at most K elements with
value −1/K. As to the components F_ii F_ji and F_ij F_jj, we consider four different situations:

1. sj ∈ NK(si), si ∈ NK(sj): It implies that F_ii F_ji = F_ij F_jj = −1/K. Hence
|ρ(e*_i, e*_j)| = (K/(K+1)) |−2/K + Σ_{t≠i,j} F_it F_jt| ≤ (K/(K+1)) · (2/K) = 2/(K+1).

2. sj ∈ NK(si), si ∉ NK(sj): It implies that F_ii F_ji = 0 and F_ij F_jj = −1/K. Hence
|ρ(e*_i, e*_j)| = (K/(K+1)) |−1/K + Σ_{t≠i,j} F_it F_jt| ≤ (K/(K+1)) · (1/K) = 1/(K+1).

3. sj ∉ NK(si), si ∈ NK(sj): It implies that F_ii F_ji = −1/K and F_ij F_jj = 0. Hence
|ρ(e*_i, e*_j)| = (K/(K+1)) |−1/K + Σ_{t≠i,j} F_it F_jt| ≤ (K/(K+1)) · (1/K) = 1/(K+1).

4. sj ∉ NK(si), si ∉ NK(sj): It implies that F_ii F_ji = F_ij F_jj = 0. Hence
|ρ(e*_i, e*_j)| = (K/(K+1)) |Σ_{t≠i,j} F_it F_jt| ≤ (K/(K+1)) · (1/K) = 1/(K+1).

Therefore, we conclude that |ρ(e*_i, e*_j)| ≤ 2/(K+1), ∀ i, j with i ≠ j.
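Theorem 1 can be checked numerically: build F on a small grid, form FF^T, and verify the variance and the correlation bound. The 5×5 grid and K = 4 below are assumptions for the check.

```python
def check_theorem1(side=5, k=4):
    """Verify Var(e*_i)/sigma0^2 = (K+1)/K and |rho(e*_i, e*_j)| <= 2/(K+1)
    for e* = F e with i.i.d. white noise e (so Var(e*) = sigma0^2 F F^T)."""
    locs = [(x, y) for x in range(side) for y in range(side)]
    n = len(locs)
    F = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = sorted((j for j in range(n) if j != i),
                      key=lambda j: (locs[i][0] - locs[j][0]) ** 2
                                    + (locs[i][1] - locs[j][1]) ** 2)[:k]
        F[i][i] = 1.0
        for j in nbrs:
            F[i][j] = -1.0 / k
    # M = F F^T equals Var(e*) up to the factor sigma0^2.
    M = [[sum(F[i][t] * F[j][t] for t in range(n)) for j in range(n)]
         for i in range(n)]
    var_ok = all(abs(M[i][i] - (k + 1) / k) < 1e-9 for i in range(n))
    # Equal diagonals, so correlations are M_ij / M_ii.
    max_corr = max(abs(M[i][j]) / M[i][i]
                   for i in range(n) for j in range(n) if i != j)
    return var_ok, max_corr, 2 / (k + 1)

var_ok, max_corr, bound = check_theorem1()
print(var_ok, max_corr <= bound + 1e-9)  # prints True True
```

The bound is tight here: mutually neighboring grid points with no shared neighbors attain correlation exactly 2/(K+1).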
Theorem 1 indicates that when the neighborhood size is relatively large, the correlations between
the components in e* are very low (e.g., smaller than 0.2 when K = 10) and the variance of each
component is very close to σ0². In this case, σ0² FF^T ≈ σ0² I. However, for a small neighborhood
size, as shown in the simulations (Section 3.5), the dependence between the components in e* must be
considered.
The next two theorems are related to the random vector ω*. It is very difficult to analytically
evaluate ω*, because it is generated by an isotropic second order stationary process, and even when
the explicit form of the covariance function is known, the statistical properties of ω* are still not
straightforward. For this reason, several additional assumptions (constraints) need to be considered.
The following are three assumptions required for Theorem 2:
1. If NK(sl) ∩ NK(sd) ≠ ∅, then, ∀ si, sj, st ∈ NK(sl) ∩ NK(sd), their pairwise spatial distances
are approximately equal: ‖sj − si‖ ≈ ‖st − si‖ ≈ ‖sj − st‖.

2. If sj ∈ NK(si), st ∉ NK(si), and NK(st) ∩ NK(si) = ∅, then ‖st − si‖ ≈ ‖st − sj‖.
3. The distance between any points that are K-nearest neighbors is approximately constant
everywhere.
The intuition behind assumptions 1 and 2 is that, because neighbors are close to each other, they share
similar pairwise distances, and also share similar distances to the points that are not their neighbors.
Assumption 3 is valid when the spatial locations follow a uniform distribution or a grid structure.
Note that assumption 3 holds in many applications [65]. The situations where assumptions 1
and 2 are potentially violated will be discussed in Theorem 3.
Theorem 2 If the above assumptions 1 and 2 hold, then the random vector ω* has two major
properties:

1. The variance Var(ω*_i) ≈ ((1+K)/K)(σ² − C̄_si), i = 1, ..., n;

2. The correlation ρ(ω*_i, ω*_j) ≈ −1/K if sj ∈ NK(si) or si ∈ NK(sj); otherwise, ρ(ω*_i, ω*_j) ≈ 0,

where C̄_si refers to the average covariance value between si and its K-nearest neighbors, and
σ² = C(0) refers to the constant variance of each component of ω. Further, if assumption
3 also holds, then the variance Var(ω*_i) becomes constant everywhere.
Proof Let Σ = Var(ω), D = Var(ω*) = FΣF^T, and T = FΣ. Recall that ω* = Fω, where ω
is the smooth scale variation (see Section 3.3.1, Equation 3.3). The covariance component Σ_ij =
Cov(ω_i, ω_j) = C(‖si − sj‖), where C(·) is a covariance function (e.g., an exponential or spherical
function) that depends on the distance h_ij = ‖si − sj‖. By the covariance function C(·) and
assumption 1, neighboring points must have the same covariance. For each point si, we represent
the constant covariance between si and its K-nearest neighbors as C̄_si. Let σ² = C(0). The variance
of each component of ω can be calculated as Var(ω_i) = Cov(ω_i, ω_i) = C(‖si − si‖) = C(0) =
σ², ∀ i = 1, ..., n. Then, by matrix computation,

T_ij ≈ σ² − C̄_si, if i = j; (3.8)
T_ij ≈ (1/K)(C̄_si − σ²), if sj ∈ NK(si) or si ∈ NK(sj); (3.9)
T_ij ≈ 0, otherwise. (3.10)

Particularly, by assumption 1, if i = j, then T_ij = Σ_{t=1}^{n} F_it Σ_tj ≈ σ² + K · (−(1/K) C̄_si) = σ² − C̄_si.
If i ≠ j and sj ∈ NK(si) (or si ∈ NK(sj)), then T_ij = Σ_k F_ik Σ_kj ≈ [(K−1) · (−(1/K) C̄_si) +
(−(1/K) σ²)] + C̄_si = (1/K)(C̄_si − σ²). For the other cases, derived from assumption 2, T_ij = Σ_t F_it Σ_tj =
Σ_{st∈NK(si)} (−(1/K) C(‖st − sj‖)) + C(‖sj − si‖) ≈ 0. As to the covariance matrix D = FΣF^T = TF^T, by
matrix computation we have that

D_ij ≈ ((1+K)/K)(σ² − C̄_si), if i = j; (3.11)
D_ij ≈ ((K+1)/K²)(C̄_si − σ²), if sj ∈ NK(si) or si ∈ NK(sj); (3.12)
D_ij ≈ 0, otherwise. (3.13)

Particularly, if i = j, then D_ij = Σ_t T_it [F^T]_tj ≈ Σ_{t=1}^{K} (−(1/K) · (1/K)(C̄_si − σ²)) + (σ² − C̄_si) =
((1+K)/K)(σ² − C̄_si). If i ≠ j and sj ∈ NK(si) or si ∈ NK(sj), then D_ij ≈ [Σ_{t=1}^{K−1} (−(1/K) · (1/K)(C̄_si − σ²)) −
(1/K)(σ² − C̄_si)] + (1/K)(C̄_si − σ²) = (1/K + 1/K²)(C̄_si − σ²). For the other cases, where sj ∉ NK(si) and
si ∉ NK(sj), we have D_ij = Σ_t T_it [F^T]_tj = 0. We prove this statement by contradiction. Assume
that D_ij does not equal zero in this situation. Then there must be some t ∈ {1, ..., n}
such that T_it · [F^T]_tj ≠ 0. This means st ∈ NK(si) and st ∈ NK(sj). According to assumption 1,
either si ∈ NK(sj) or sj ∈ NK(si) must be true, a contradiction. Recall that D = Var(ω*). The
above results prove that Var(ω*_i) = D_ii ≈ ((1+K)/K)(σ² − C̄_si); ρ(ω*_i, ω*_j) = D_ij/D_ii ≈ −1/K if
sj ∈ NK(si) or si ∈ NK(sj); and ρ(ω*_i, ω*_j) ≈ 0 in the other cases.
Theorem 2 indicates that the correlations between the components in ω* are mostly zero, except
for neighboring points. Particularly, the correlations between neighboring points are all negative,
and their major impact factor is the neighborhood size K. The greater the value of K, the less the
neighboring points are correlated. However, K cannot be arbitrarily large; otherwise, the assumptions
made above will be violated. For example, suppose n = 200 and K = 10; then only about 5% of
pairs are correlated. For these correlated components, the correlations are only close to −0.1. As
shown in Figure 3.1, 0.1 indicates a negligible correlation.
Figure 3.1: An example of correlation: it reflects the noise and direction of a linear relationship
Theorem 2 states two approximate properties of ω*. However, it is not directly known how these
properties are impacted if assumptions 1 and 2 are violated. Theorem 3 will delve deeper
into this issue and provide a more specific analysis of ω*_i. For Theorem 3, the following less restrictive
assumptions are employed:
1. The spatial locations s1, . . . , sn follow a grid structure and n ≤ 2500;
2. The spatial distance is defined by L2 (Euclidean) distance;
3. The covariance function Cov(Z(si), Z(sj)) = C(h), where h =‖ si − sj ‖2, follows a popular
spherical model;
4. Consider the 4- or 12-nearest neighbors as the spatial neighbors of each object.
Assumptions 1 and 2 are generic properties that can be readily applied to spatial data in general [53,
54]. In many applications, the total number of spatial locations is smaller than 200. Here, we consider
a much enlarged range with n ≤ 2500, for the purpose of generality. For assumption 3, a spherical
model is defined as
C(h; θ) = b, if h = 0, (3.14)
C(h; θ) = b·(1 − 3h/(2c) + (1/2)·(h/c)³), if 0 < h ≤ c, (3.15)
C(h; θ) = 0, if h > c, (3.16)

where θ = (b, c)^T, b ≥ 0, c ≥ 0, b = C(0; θ) refers to the constant variance of each object s, and C(h; θ) is a decreasing function of the distance h.
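The spherical model above can be written directly as a small function; the function name and argument order below are illustrative.

```python
# A direct implementation of the spherical covariogram (3.14)-(3.16).

def spherical_cov(h, b, c):
    """C(h; b, c): variance b at h = 0, smoothly decreasing to 0 at the range c."""
    if h < 0:
        raise ValueError("distance must be nonnegative")
    if h > c:
        return 0.0
    return b * (1.0 - 1.5 * h / c + 0.5 * (h / c) ** 3)

print(spherical_cov(0.0, 1.0, 2.0))  # b = 1.0 at h = 0
print(spherical_cov(2.0, 1.0, 2.0))  # exactly 0.0 at the range h = c
```

Note that the h = 0 branch agrees with the polynomial branch (which also evaluates to b at h = 0), so this nugget-free model is continuous, and it decreases monotonically to zero at the range c.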
The reason for using a spherical model, as opposed to exponential or Gaussian models, is that the spherical model leads to closed-form analytical results, which provide important insights into its statistical properties. As for assumption 4, K is set to 4 or 12 because of the grid structure (assumption 1). In the grid, each object has four nearest objects at the same distance r, eight next-nearest objects within distance 2r (four diagonal neighbors at distance √2·r and four at distance 2r), and so on, where r is the grid cell size. Hence, we can select K = 4, 12, 24, …. We select the first two values, K = 4 and K = 12, which are equivalent to defining neighborhoods with radii of r and 2r, respectively.
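The shell structure just described is easy to verify by counting the grid points within the two radii; the grid size and center below are arbitrary choices for illustration.

```python
# On a unit grid (r = 1), an interior point has 4 neighbors within radius r
# and 12 within radius 2r (4 at r, 4 at sqrt(2)*r, 4 at 2r).

r = 1
center = (10, 10)
grid = [(x, y) for x in range(21) for y in range(21)]

def within(radius):
    return [p for p in grid
            if p != center
            and (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2 <= radius ** 2]

print(len(within(r)))      # 4
print(len(within(2 * r)))  # 12
```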
To keep the results concise, we further set r²h/c³ ≈ 0 and r³/c³ ≈ 0, since r/c is usually very small (e.g., 0.1) and h ≤ c (if h > c, then C(h; θ) = 0 and the covariance vanishes). These components are negligible compared to the components r/c and rh²/c³.
Theorem 3 Under the above four assumptions, the random vector ω* has the following properties on its correlation structure:

1. If K = 4, then
   a) ρ(ω*_i, ω*_j) = 0, if d(s_j, s_i) > c + 2r;
   b) |ρ(ω*_i, ω*_j)| ≤ 0.4, if c ≤ 2r and d(s_j, s_i) ≤ 2r;
   c) |ρ(ω*_i, ω*_j)| ≤ 0.22, if c > 2r and d(s_j, s_i) ≤ 2r;
   d) |ρ(ω*_i, ω*_j)| ≤ 0.05, if d(s_j, s_i) > 2r;
2. If K = 12 and d(s_j, s_i) ≥ c + 4r, then ρ(ω*_i, ω*_j) = 0;
3. If K = 12 and c < 4r, then
   a) |ρ(ω*_i, ω*_j)| ≤ 0.220, if d(s_j, s_i) ≤ 2r;
   b) |ρ(ω*_i, ω*_j)| ≤ 0.110, if 2r < d(s_j, s_i) ≤ 3r;
   c) |ρ(ω*_i, ω*_j)| ≤ 0.050, if d(s_j, s_i) > 3r;
4. If K = 12, c ≥ 4r, and row(s_j) = row(s_i) (or col(s_j) = col(s_i)), then
   a) |ρ(ω*_i, ω*_j)| ≤ 0.4741 − (0.1179·c²/r²)/(1 + c²/(2.707·r²)), if d(s_j, s_i) = r;
   b) |ρ(ω*_i, ω*_j)| ≤ 0.1203, if d(s_j, s_i) = 2r;
   c) |ρ(ω*_i, ω*_j)| ≤ 0.1719 − (0.0158·h_ij²/r²)/(1 + c²/(10.5174·r²)), otherwise;
5. If K = 12, c ≥ 4r, row(s_j) ≠ row(s_i), and col(s_j) ≠ col(s_i), then |ρ(ω*_i, ω*_j)| ≤ 0.1085 − (0.0028·h_ij²/r²)/(1 + h_ij²/(37.6723·r²)),

where r refers to the grid cell size, row(s_i) and col(s_i) refer to the row and column locations of the object s_i in the grid structure, and h_ij = d(s_j, s_i) is the L2 (Euclidean) distance between s_i and s_j.
Figure 3.2: The neighborhoods defined by the 4- and 12-nearest-neighbors rules in gridded data, equal to those defined by radii r and 2r. (a) K = 4; (b) K = 12.
Proof The neighborhood topologies defined by the 4- and 12-nearest-neighbors rules are shown in Figure 3.2. The grayed objects are the spatial neighbors of the black object s_i. The symbol r refers to the grid cell size.
Recall that ω* = Fω, where ω is the smooth-scale variation (see Section 3.3.1, Equation 3.2). Let Σ = Var(ω), D = Var(ω*) = FΣF^T, and T = FΣ. By assumption 3, Σ_ij = Cov(ω_i, ω_j) = C(h_ij; θ). Given that F is a neighborhood weight matrix (see Equation 3.4), the component T_ij = Σ_{t=1}^n F_it·Σ_tj = C(h_ij; θ) − (1/K)·Σ_{s_t ∈ N_K(s_i)} C(h_tj; θ). By the relation D = TF^T, we have that D_ij = T_ij − (1/K)·Σ_{s_t ∈ N_K(s_j)} T_it. The correlation ρ(ω*_i, ω*_j) has the analytical form

ρ(ω*_i, ω*_j; θ) = D_ij / D_ii = (T_ij − (1/K)·Σ_{s_t ∈ N_K(s_j)} T_it) / D_11, (3.17)

where D_ii is constant over i, so the same denominator D_11 can be used for every pair. Notice that the form (3.17) is actually a sum of K² weighted spherical functions C(·; θ). This complex form makes key properties of the function hard to interpret, such as its minimum value, its maximum value, and its global trend with respect to the major parameters h_ij and c. For this reason, we further develop a tight upper bound function of (3.17) that is monotone and has a simpler analytical form. The development is based on the five cases stated in Theorem 3. Here we focus on two representative cases, the second and the fifth; the upper bound functions for the other cases can be derived similarly.
• Case 2: K = 12 and d(s_j, s_i) ≥ c + 4r.
In this case C(h_ij; θ) = 0 and C(h_td; θ) = 0 for all s_t ∈ N_K(s_j) ∪ {s_j} and s_d ∈ N_K(s_i) ∪ {s_i}. This implies that ρ(ω*_i, ω*_j) = 0.
• Case 5: K = 12, c ≥ 4r, row(s_j) ≠ row(s_i), and col(s_j) ≠ col(s_i).
Based on observations from visualization, we select a rational quadratic model, f(h; α) = α_1 + (α_2·h²)/(1 + h²/α_3), as the upper bounding function. The estimation of the parameters α is based on the following steps:

Step 1: Let S_1 = {1, 2, 3, …, 49}, S_2 = {1, 2, 3, …, 49}, and S_3 = {4, 5, 6, …, 15, 20, 40, 60, 80} ⊂ S_c = {c | c ∈ R, c ≥ 4}.

Step 2: Solve the following optimization problem:

α̂ = argmin_{α ∈ R³} Σ_{row(s_j)−row(s_i) ∈ S_1, col(s_j)−col(s_i) ∈ S_2, c ∈ S_3} ( f(h_ij; α) − |ρ(ω*_i, ω*_j; θ)| )
subject to f(h_ij; α) ≥ |ρ(ω*_i, ω*_j; θ)| for all i, j, c with row(s_j) − row(s_i) ∈ S_1, col(s_j) − col(s_i) ∈ S_2, and c ∈ S_3, where θ = (b, c) and b = 1. (3.18)

Step 3: For every (i, j) ∈ S_1 × S_2, solve the following optimization problem:

ĉ_ij = argmin_{c ∈ R, c ≥ 4} ( f(h_ij; α̂) − |ρ(ω*_i, ω*_j; θ)| ), subject to θ = (b, c), b = 1. (3.19)

Step 4: If for every (i, j) ∈ S_1 × S_2 the condition f(h_ij; α̂) − |ρ(ω*_i, ω*_j; θ = [1, ĉ_ij]^T)| ≥ 0 is satisfied, then return α̂ as the estimated value of α and terminate the algorithm; otherwise, select a larger subset (e.g., S_3 = {1, 2, 3, …, 100}) of the feasible set {x | x ∈ R, x ≥ 4} for the parameter c, and go to Step 2.
The objective of the above algorithm is to estimate a locally optimal setting of α. In particular, by assumption 1, the spatial locations follow a grid structure and the total number of points is smaller than 2500. This implies that the set S_1 × S_2 includes all valid settings of the pair (row(s_j) − row(s_i), col(s_j) − col(s_i)). The feasible set of the parameter c is S_c = {c | c ∈ R, c ≥ 4}. In Step 1, we only select a representative subset S_3 of S_c. The optimization problem (3.18) finds a tight upper bound function based on this subset. Steps 3 and 4 then test whether the estimated parameters α̂ satisfy the upper bounding condition f(h_ij; α̂) ≥ |ρ(ω*_i, ω*_j; θ)| for every valid setting of i, j, and c. If the test passes, we conclude that a feasible and locally optimal α̂ has been obtained; otherwise, the algorithm starts a new iteration based on an enlarged subset of S_c.
The optimization problem (3.18) is a nonconvex problem. A locally optimal solution of (3.18) can be obtained by numerical methods, such as the interior point method [66]; the estimated parameters are α̂ = (0.1085, −0.0028, 37.6723). A locally optimal solution of (3.18) is acceptable here, since our objective is to find a tight upper bound function, not necessarily a globally optimal bound.
The optimization problem (3.19) is also a non-convex problem. Because it serves as a feasibility test, a globally optimal solution must be obtained. This can be achieved by exploiting the special structure of (3.19). In particular, the denominator of ρ(ω*_i, ω*_j; θ) is D_11; by the relation r³/c³ ≈ 0, it follows that D_11 = τ·r/c, where τ is some scalar constant. Recall that the numerator of ρ(ω*_i, ω*_j; θ) is a weighted sum of 144 spherical functions. Let S = {h_td | s_t ∈ N_K(s_j) ∪ {s_j}, s_d ∈ N_K(s_i) ∪ {s_i}}. The set S has 144 components (scalars), which can be used to divide the feasible region S_c = {c | c ∈ R, c ≥ 4} into 145 sub-regions. It can be readily derived that, in each sub-region, the correlation has the polynomial form ρ(ω*_i, ω*_j; θ) = τ_1 + τ_2/c + τ_3/c², where τ_1, τ_2, and τ_3 are scalar constants depending on the sub-region. By this polynomial form, |ρ(ω*_i, ω*_j; θ)| has only one local (global) maximum in each sub-region. By checking the maximum value in each sub-region, we can obtain a globally optimal solution of problem (3.19).
• Other cases:
The upper bound functions can be obtained by procedures similar to those used in the cases above. The complete form of the estimated upper bound function is stated in Theorem 3; readers are referred to Appendix A.1 for an empirical plot of the estimated bounds.
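As a quick sanity check on the case-5 bound, whose closed form depends only on h_ij, the fitted rational quadratic can be evaluated directly. The function below simply plugs in the reported parameters α̂ = (0.1085, −0.0028, 37.6723), with h measured in units of the grid cell size r.

```python
# The case-5 upper bound of Theorem 3 as a function of h = d(s_j, s_i) / r,
# using the fitted parameters alpha = (0.1085, -0.0028, 37.6723).

A1, A2, A3 = 0.1085, -0.0028, 37.6723

def rho_bound(h_over_r):
    """Rational quadratic bound f(h; alpha) = a1 + a2*h^2 / (1 + h^2/a3)."""
    h2 = h_over_r ** 2
    return A1 + A2 * h2 / (1.0 + h2 / A3)

print(rho_bound(0.0))    # 0.1085 at h = 0
print(rho_bound(100.0))  # approaches a1 + a2*a3 ≈ 0.0030 for large h
```

The bound is monotonically decreasing in h and stays strictly positive, consistent with small but nonzero correlations at large separations in this regime.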
Theorem 3 implies patterns similar to those drawn from Theorem 2, although Theorem 2 provides only approximate properties; Theorem 3 is a further justification of these patterns. In the following discussion, we consider the situation with c ≥ 5r; the situation with c < 5r will be discussed separately. By Theorem 3, if c ≥ 5r, then |ρ(ω*_i, ω*_j)| ≤ 0.22 when K = 4, and |ρ(ω*_i, ω*_j)| ≤ 0.18 when K = 12. This indicates small absolute correlation values across different K values, and the correlation values decrease slightly as K increases. It can also be shown that most correlations are negative and close or equal to zero. Readers are referred to the Appendix for more detailed information about ρ(ω*_i, ω*_j). All these observations are consistent with the results of Theorem 2.
We now compare σ0²FF^T and FΣF^T. Consider two typical situations: K = 4 represents a small neighborhood, and K = 12 represents a relatively large neighborhood. If K = 4, then |ρ(e*_i, e*_j)| ≤ 0.4 and |ρ(ω*_i, ω*_j)| ≤ 0.22. If K = 12, then |ρ(e*_i, e*_j)| ≤ 0.2 and |ρ(ω*_i, ω*_j)| ≤ 0.18. The impacts of these correlation values are shown in Figure 3.1. Although both |ρ(e*_i, e*_j)| and |ρ(ω*_i, ω*_j)| increase when the neighborhood size K decreases, the absolute correlation |ρ(e*_i, e*_j)| increases more drastically. Based on these results, we will approximate FΣF^T by σ²I for all settings of K, but will only approximate σ0²FF^T by σ0²I when K is relatively large, such as K ≥ 8.
Theorem 3 also indicates that when c is small (e.g., c < 5r), some correlations are relatively high (e.g., |ρ(ω*_i, ω*_j)| = 0.4 if K = 4, c = r, and d(s_j, s_i) = r). In this case, an important observation is that the correlation matrix of ω* exhibits a structure similar to that of e*. In particular, if c < r, these two correlation matrices become identical. In this situation, it is still reasonable to approximate the correlation matrix of ω* by the identity (unit) matrix, since the structural information lost through this approximation will be recovered while estimating the parameter σ0 for the vector e*, because of the similar structure of the covariance matrices Var(ω*) and Var(e*). For example, suppose c < r and the constant variance of each component of e is σ_e²; then Var(e) = Σ = σ_e²·I and Var(e*) = Var(Fe) = FΣF^T = σ_e²·FF^T. By Equation 3.5, the true distribution model is diff(Z) ~ N(FXβ, FΣF^T + σ0²FF^T) = N(FXβ, (σ0² + σ_e²)·FF^T). If we instead approximate FΣF^T by σ²I, then by Equation 3.6 the approximate model becomes diff(Z) ~ N(FXβ, σ²I + σ0²FF^T). With robust parameter estimation, the approximate model can still completely recover the true distribution, e.g., by setting the estimated parameters σ = 0 and σ0 = √(σ0² + σ_e²).
3.4 Estimation and Inferences
Spatial outlier detection (SOD) is usually coupled with a robust estimation process for the underlying statistical model. This section introduces ordinary estimation methods for the GLS model, then presents two robust estimation and outlier detection methods that reduce the masking and swamping effects, and finally discusses the connection between the proposed GLS-SOD methods and existing representative methods, such as the Kriging-based and Z-test SOD methods.
3.4.1 Generalized Least Squares Regression
Given a set of observations {Z(s_1), Z(s_2), …, Z(s_n)}, the objective is to estimate the parameters β, σ, and σ0 of the proposed GLS model. We use the mean squared error (MSE), the most popular error function in spatial statistics [63], as the score function. This leads to a generalized least squares problem, which can be formulated as:

minimize_{β, σ0, σ}  (FZ − FXβ)^T (σ²I + σ0²FF^T)^{−1} (FZ − FXβ)
subject to  σ0² + σ² = 1,  σ0, σ ≥ 0. (3.20)
Note that we scale σ0 and σ by a factor c, with σ*_0 = σ0/c and σ* = σ/c, such that σ*_0² + σ*² = 1. Without this constraint, the objective function in (3.20) would always be minimized by setting σ0 = σ = ∞ and β to any value. For simplicity, we directly use the original symbols σ0 and σ rather than σ*_0 and σ*. As shown in Theorem 4, problem (3.20) is a convex optimization problem, which can be solved efficiently by numerical optimization methods such as the interior point method [66]. Note that when the neighborhood size K is large, σ0²FF^T ≈ σ0²I (see Section 3.3.2); then (3.20) reduces to a regular least squares regression problem, and an explicit solution is available: β̂ = (X^T F^T FX)^{−1} X^T F^T FZ and (σ̂² + σ̂0²) = ‖FXβ̂ − FZ‖₂² / (n − p − 1), where p is the size of the vector β. For the purpose of outlier detection, it is unnecessary to further derive the explicit forms of σ and σ0.
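As a rough illustration of this closed-form estimate in the large-K regime, the sketch below uses a single covariate, for which β̂ = (X^T F^T FX)^{−1} X^T F^T FZ reduces to a scalar ratio. The 4×4 grid, the covariate, and the helper names are illustrative choices; on noise-free data the true coefficient is recovered exactly.

```python
# Minimal sketch of the explicit least squares solution, one covariate:
# with A = FX and d = FZ, beta_hat = (A^T A)^{-1} A^T d = sum(A*d)/sum(A*A).

def knn(points, i, K):
    order = sorted(range(len(points)),
                   key=lambda j: ((points[i][0] - points[j][0]) ** 2
                                  + (points[i][1] - points[j][1]) ** 2, j))
    return [j for j in order if j != i][:K]

points = [(x, y) for x in range(4) for y in range(4)]
n, K = len(points), 4
u = [(i * i) % 11 for i in range(n)]   # an arbitrary, non-smooth covariate
beta_true = 2.0
z = [beta_true * ui for ui in u]       # noise-free response: Z = X * beta

def local_diff(v):
    """F applied to a vector: v_i minus the mean of its K nearest neighbors."""
    return [v[i] - sum(v[j] for j in knn(points, i, K)) / K for i in range(n)]

A = local_diff(u)   # FX (single column)
d = local_diff(z)   # FZ
beta_hat = sum(a * b for a, b in zip(A, d)) / sum(a * a for a in A)
rss = sum((a * beta_hat - b) ** 2 for a, b in zip(A, d))
print(beta_hat)  # recovers 2.0 up to rounding
print(rss)       # residual sum of squares: ~0 for noise-free data
```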
Theorem 4 The problem (3.20) is a convex optimization problem.
Proof Suppose λi and qi are the eigenvalues and corresponding (orthonormal) eigenvectors of the
matrix FFT . It can be readily shown that the problem (3.20) is equivalent to
minimize_{β, σ0, σ}  Σ_{i=1}^n ((FZ − FXβ)^T q_i)² / (σ² + σ0²·λ_i)
subject to  σ0² + σ² = 1,  σ0, σ ≥ 0. (3.21)

Let f_i = ((FZ − FXβ)^T q_i)² / (σ² + σ0²·λ_i). It suffices to prove that each f_i is a convex function, or equivalently that ∂²f_i/∂θ² ⪰ 0, where θ = [β^T, σ², σ0²]^T. Writing u_i = q_i^T (FZ − FXβ) and s_i = σ² + σ0²·λ_i, direct computation gives the rank-one Hessian

∂²f_i/∂θ² = (2/s_i³)·v_i v_i^T ⪰ 0,  where v_i = [s_i·X^T F^T q_i; u_i; λ_i·u_i],

which is positive semidefinite since s_i > 0. Equivalently, f_i is the quadratic-over-linear function u_i²/s_i, which is jointly convex for s_i > 0, composed with affine maps of θ; hence each f_i, and therefore the objective of (3.21), is convex.
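The quadratic-over-linear argument just used can be spot-checked numerically: the scalar function f(u, s) = u²/s with s > 0 satisfies midpoint convexity, which is the core of each f_i after composing with affine maps. The test points below are arbitrary.

```python
# Midpoint-convexity check for f(u, s) = u^2 / s on s > 0.

def f(u, s):
    return u * u / s

pairs = [((1.0, 0.5), (-3.0, 2.0)),
         ((2.0, 1.0), (4.0, 0.25)),
         ((0.0, 3.0), (5.0, 1.5))]
for (u1, s1), (u2, s2) in pairs:
    mid = f((u1 + u2) / 2, (s1 + s2) / 2)
    avg = (f(u1, s1) + f(u2, s2)) / 2
    print(mid <= avg + 1e-12)  # True for every pair
```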
Once the parameters β, σ, and σ0 are estimated by generalized least squares, we can calculate the standardized residuals and use a standard statistical test procedure to identify the outliers. This method works well for sample data with small contamination, but is susceptible to the well-known masking and swamping effects when multiple outliers exist. For the GLS model, the masking and swamping effects originate from two phases of the estimation process:

1. Phase I contamination occurs in the process of calculating the local differences FZ. For example, suppose we define neighbors by the K-nearest-neighbor rule. Consider an outlier object Z*(s_1) = Z(s_1) + ζ_1, where Z(s_1) is the normal value contaminated by a large error ζ_1, and suppose exactly one of its neighbors is an outlier with Z*(s) = Z(s) + ζ, where ζ is the error. The local difference is diff(Z*(s_1)) = [Z(s_1) − (1/K)·Σ_{s_i ∈ N(s_1)} Z(s_i)] + ζ_1 − ζ/K. If ζ = K·ζ_1, then the error is canceled and we obtain a normal local difference for the outlier object Z*(s_1), which will be identified as a normal object (masking). If instead Z*(s_1) is a normal object with ζ_1 = 0, then its local difference is contaminated by the error −ζ/K; this leads to the swamping effect, where the normal object Z*(s_1) may be misclassified as an outlier. For a relatively large K (e.g., 8), it can be readily shown that Phase I contamination is more significant for a spatial sample with clusters of outliers than for one with isolated outliers. Another important observation is that the masking and swamping effects do not completely distort the ordering of true outliers: the top-ranked outliers are still usually a subset of the true outliers. This observation motivates the backward algorithm presented in Section 3.4.2.

2. Phase II contamination occurs in the generalized regression process, where we regard Z* = FZ as the pseudo-"observed" values. The masking and swamping effects in this phase are the same effects that occur in a general least squares regression process. They are a consequence of the biased estimates of the regression parameters (e.g., β, σ, and σ0) caused by abnormal observations in Z*.
Drawbacks of existing robust estimation techniques:
Most existing robust regression techniques are designed to reduce the effect of Phase II contamination. There are two major categories of estimators [65]. The first category (the M-estimators) replaces the MSE score function with a more robust score function, such as the L1 norm or the Huber penalty function. The second category estimates parameters based on a robustly selected subset of the data; examples include least median of squares (LMS), least trimmed squares (LTS), and the recently proposed forward search (FS) method. Unfortunately, none of these robust techniques can be directly applied to address both Phase I and Phase II contamination concurrently. For the M-estimators, the application of a robust penalty function (e.g., L1) leads to a non-convex optimization problem for which only a locally optimal solution may be found. For the estimators based on subset selection, the estimation results are highly sensitive to the selected objects, which can detrimentally impact neighborhood quality. The next sections adapt existing robust methods to the problem of concurrently handling Phase I and Phase II contamination.
3.4.2 GLS-Backward Search Algorithm
As discussed above, the existing methods only address the Phase II contamination. The motivation
for our proposed backward search algorithm is to address both Phase I and Phase II contaminations
concurrently. The algorithm is described as follows:
Backward Search Algorithm Given a spatial data set {Z(s_1), …, Z(s_n)}, the covariate vectors {x(s_1), …, x(s_n)}, the value of K for defining the K-nearest neighbors, and the significance level α ∈ (0, 1):

1. Set S_Z = {Z(s_1), …, Z(s_n)}, S_x = {x(s_1), …, x(s_n)}, and let S_output be an empty set.
2. Estimate the parameters β, σ, σ0 of the GLS model by solving the generalized least squares regression problem (3.20).
3. Calculate the absolute values of the standardized residuals e = [e_1, …, e_|S_Z|]^T = |(σ²I + σ0²FF^T)^{−1/2}(FZ − FXβ)|.
4. Set e_m = max_{i=1,…,|S_Z|} e_i. If e_m ≥ Φ^{−1}(1 − α/2), where Φ is the CDF of the standard normal distribution, then update S_Z = S_Z − {Z(s_m)}, S_x = S_x − {x(s_m)}, S_output = S_output ∪ {Z(s_m)}, and go to Step 2. Otherwise, stop the algorithm and return S_output as the ordered set of candidate outliers.
In the above algorithm, α can be set to 0.001, 0.01, or 0.05. In Step 2, we apply the interior point method [66] to solve the optimization problem (3.20). When the neighborhood size is large, we may approximate σ0²FF^T by σ0²I; the parameters β, σ, σ0 can then be efficiently estimated by least squares regression: β̂ = (X^T F^T FX)^{−1} X^T F^T FZ and (σ̂² + σ̂0²) = ‖FXβ̂ − FZ‖₂² / (n − p − 1), where p is the size of the vector β.
This backward search algorithm is designed based on the observation that the top-ranked outliers identified by the regular least squares method are still (in most cases) true outliers under both Phase I and Phase II contamination. Suppose a true outlier s is removed after the first iteration; then both Phase I and Phase II contamination in the next iteration will be reduced. To illustrate this process, we use the same example as in Section 3.4.1. Recall that an outlier object Z*(s) is decomposed into two additive components, Z*(s) = Z(s) + ζ, where Z(s) represents the normal value and ζ represents the contamination error. Suppose s is the only outlier neighbor of an object s_1 that also happens to be an outlier. Then the local difference diff(Z*(s_1)) = [Z(s_1) − (1/K)·Σ_{s_i ∈ N(s_1)} Z(s_i)] + ζ_1 − ζ/K will be marked as normal if ζ = K·ζ_1. Suppose now that the true outlier Z(s) is removed and the replacement neighbor for s_1 is normal; then diff(Z*(s_1)) = [Z(s_1) − (1/K)·Σ_{s_i ∈ N(s_1)} Z(s_i)] + ζ_1. This local difference becomes an abnormal value and the masking effect is removed. Similarly, suppose Z*(s_1) is a normal object; then its local difference is contaminated (swamped) by the error −ζ/K because of its outlier neighbor Z(s). The removal of s eliminates the −ζ/K term and thereby reduces the swamping effect. For Phase II contamination, the removal of Z(s) leads to the removal of an abnormal difference diff(Z*(s)). The set of remaining local differences therefore has less contamination: the center of the distribution is less attracted by outliers, and the distributional shape becomes less distorted. As a result, outliers tend to be more separated and normal objects tend to be closer together, so the masking and swamping effects are reduced.
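The remove-and-refit loop above can be sketched concretely. The simplified implementation below assumes a constant trend (so the FXβ term vanishes) and the σ0²FF^T ≈ σ0²I approximation, so the residuals reduce to standardized local differences; the helper names and the planted-spike example are illustrative, not this chapter's implementation.

```python
# Simplified backward search: iteratively remove the largest standardized
# local difference, recomputing neighborhoods over the remaining points.
from statistics import NormalDist, mean, stdev

def knn(points, i, pool, K):
    order = sorted((j for j in pool if j != i),
                   key=lambda j: ((points[i][0] - points[j][0]) ** 2
                                  + (points[i][1] - points[j][1]) ** 2, j))
    return order[:K]

def backward_search(points, z, K, alpha=0.05):
    remaining = list(range(len(points)))
    output = []
    thresh = NormalDist().inv_cdf(1 - alpha / 2)
    while len(remaining) > K + 1:
        diffs = [z[i] - mean(z[j] for j in knn(points, i, remaining, K))
                 for i in remaining]
        mu, sd = mean(diffs), stdev(diffs)
        scores = [abs(d - mu) / sd for d in diffs]
        m = max(range(len(remaining)), key=lambda t: scores[t])
        if scores[m] < thresh:
            break
        output.append(remaining.pop(m))   # remove the worst point, then re-fit
    return output

# 6x6 grid with a linear trend and one planted spike
pts = [(x, y) for x in range(6) for y in range(6)]
z = [2.0 * x + 3.0 * y for x, y in pts]
spike = 14                 # interior point (2, 2)
z[spike] += 50.0
out = backward_search(pts, z, K=4)
print(out[0])  # the planted spike is removed first
```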
3.4.3 GLS-Forward Search Algorithm
This section adapts the popular forward search (FS) algorithm [65] to the GLS parameter estimation problem. There are several obstacles to applying FS here. As discussed in Section 3.4.1, FS starts from a robustly selected subset of the sample, but GLS is a statistical model based on neighborhood aggregations: considering only a subset of the observations {Z(s_1), …, Z(s_n)} would significantly impact the quality of the calculated local differences. To apply the FS algorithm, we make the assumption that Phase I contamination is negligible compared to Phase II contamination. As discussed in Section 3.4.1, this is reasonable for the case of isolated outliers. Based on this assumption, we treat the local differences {diff(Z(s_1)), …, diff(Z(s_n))} as pseudo-"observations" and then apply the FS algorithm to estimate the model parameters. In simulations, we also noticed that in this case there is no significant difference between applying generalized least squares regression and regular least squares regression. For the sake of efficiency, we only apply regular least squares regression to estimate the parameters β, σ, and σ0. The FS algorithm is described as follows:
Forward Search Algorithm Given a spatial data set {Z(s_1), …, Z(s_n)}, the covariate vectors {x(s_1), …, x(s_n)}, and the value of K for defining the K-nearest neighbors:

1. Calculate the local differences diff(Z) = FZ, and let S_output be an empty set.
2. Set S = {s_1, …, s_n}; set Z*(S) = [Z*(s_1), …, Z*(s_n)] = diff(Z) and X*(S) = [x*(s_1), …, x*(s_n)] = FX as the vectors of pseudo-"observations" and pseudo-"covariates".
3. Apply least trimmed squares (LTS) [65] to find a robust subset S* of S, and set S*_test = S − S*. The size of the subset S* is ⌊(n + p + 1)/2⌋ by default.
4. Estimate the parameter β based on Z*(S*) and X*(S*). Then calculate the absolute standardized residuals of S*_test as e = √(n − p − 1)·|Z*(S*_test) − X*(S*_test)β| / ‖Z*(S) − X*(S)β‖₂.
5. Find the minimal residual of the test set S*_test: e_m = min_{e_i ∈ S*_test} e_i.
6. Update S_output = S_output ∪ {s_m}, S* = S* ∪ {s_m}, and S*_test = S*_test − {s_m}. If S*_test is not empty, go to Step 4; otherwise, output the ordered set S_output and terminate the algorithm.
The proposed FS algorithm provides an ordering of the objects based on their agreement with the GLS model. To identify outliers, one plots and monitors the change of the minimal residual as the size of the normal set S* increases; a drastic drop implies that an outlier was added to S*. This plot can also help identify masked or swamped objects; readers are referred to [65] for details. A direct method for robustifying the local differences would be to use robust mean functions such as the median or the trimmed mean. However, as indicated by our simulation study, this direct approach deteriorates the performance of GLS. Recall the statistical model of GLS: diff(Z) ~ N(FXβ, FΣF^T + σ0²FF^T). If we replace the left-hand side diff(Z) = FZ by medians or trimmed means, the right-hand side remains unchanged and thus still employs the averaging matrix F. The increased bias caused by this inconsistency is much larger than the reduction of contamination effects achieved through robust means.
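A drastically simplified version of this ordering procedure, with a single pseudo-covariate and with LTS replaced by a crude robust start (the half of the points with the smallest residuals under a median-slope fit), can be sketched as follows; all names and the choice of start set are illustrative assumptions, not this chapter's procedure.

```python
# Simplified forward search on pseudo-observations d = FZ against a = FX:
# grow the clean subset by repeatedly admitting the smallest-residual point.

def forward_search_order(a, d):
    """Order the points outside a robust start set by ascending residual."""
    n = len(a)
    beta0 = sorted(di / ai for ai, di in zip(a, d))[n // 2]   # median slope
    start = sorted(range(n), key=lambda i: (abs(d[i] - beta0 * a[i]), i))
    m = (n + 1 + 1) // 2                                      # floor((n+p+1)/2), p = 1
    subset, test = start[:m], sorted(start[m:])
    order = []
    while test:
        beta = sum(a[i] * d[i] for i in subset) / sum(a[i] ** 2 for i in subset)
        k = min(test, key=lambda i: (abs(d[i] - beta * a[i]), i))  # smallest residual enters
        subset.append(k)
        test.remove(k)
        order.append(k)
    return order

a = [float(i + 1) for i in range(10)]   # pseudo-covariates FX
d = [3.0 * ai for ai in a]              # pseudo-observations FZ
d[7] += 100.0                           # one contaminated difference
order = forward_search_order(a, d)
print(order)  # the contaminated point (index 7) enters last
```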
3.4.4 Connections with Existing Methods
This section studies the connections between the global (kriging) based methods [63–65], the local spatial statistics (LS) based methods [56–58, 60, 61, 286, 62], and the proposed GLS-based SOD approach. First, we review the first two approaches, Kriging-SOD and LS-SOD. The basic idea of Kriging-SOD is to first apply robust methods to estimate the parameters of a global kriging model. The method then uses the estimated statistical model to predict the Z attribute of each sample location s, denoted Ẑ(s), based on the Z values of the other locations. The standardized residual |Z(s) − Ẑ(s)|/σ_s follows a standard normal distribution, where σ_s is the estimated standard deviation of the prediction. If a residual is outside the range [−Φ^{−1}(1 − α/2), Φ^{−1}(1 − α/2)], the corresponding object is reported as an outlier, where Φ is the standard normal CDF and α is usually set to 0.05. The LS-SOD approach assumes that diff(Z) ~ N(µ1, σ²I); the components of diff(Z) can then be regarded as an i.i.d. sample from a univariate normal distribution N(µ, σ²). Robust techniques are used to estimate µ and σ, and the remaining steps are similar to Kriging-SOD.
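The LS-SOD recipe just described can be sketched in a few lines; the sketch below assumes the components of diff(Z) are an i.i.d. normal sample and estimates µ and σ robustly by the median and the scaled MAD (1.4826·MAD, consistent for normal data). Names, the threshold convention, and the toy data are illustrative.

```python
# A minimal LS-SOD-style detector on a vector of local differences.
from statistics import median, NormalDist

def ls_sod(diffs, alpha=0.05):
    mu = median(diffs)
    mad = median(abs(d - mu) for d in diffs)
    sigma = 1.4826 * mad                       # MAD scaled for a normal sample
    thresh = NormalDist().inv_cdf(1 - alpha / 2)
    return [i for i, d in enumerate(diffs) if abs(d - mu) / sigma > thresh]

diffs = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 8.0]  # one gross difference
print(ls_sod(diffs))  # [7]
```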
Theorem 5 Suppose that FΣFT = σ2I and the parameters of Kriging-SOD and GLS-SOD are
correctly calculated by robust estimation, then Kriging-SOD and GLS-SOD are equivalent.
Proof For Kriging-SOD, we consider a universal kriging model [53], since other kriging models (e.g., ordinary kriging) are its special cases. It suffices to prove that the standardized residuals calculated by Kriging-SOD and GLS-SOD are identical. Without loss of generality, we examine the standardized residual of one particular sample point Z(s_n). Let Z* = [Z(s_1), …, Z(s_{n−1})]^T and Z = [Z*^T, Z(s_n)]^T. By Section 3.3.1, Equation 3.3, Z ~ N(Xβ, D), where

D = Σ + σ0²I = [Σ*  σ; σ^T  σ_n²],

with Var(Z*) = Σ*, Cov(Z(s_n), Z*) = σ, and Var(Z(s_n)) = σ_n².

Then the standardized residual by Kriging-SOD is

StdRsd_Kriging-SOD(Z(s_n)) = [Z(s_n) − x_n^T β − σ^T Σ*^{−1}(Z* − X*β)] / (σ_n² − σ^T Σ*^{−1} σ)^{1/2}, (3.22)

and the standardized residual by GLS-SOD is

StdRsd_GLS-SOD(Z(s_n)) = [(σ²I + σ0²FF^T)^{−1/2}(FZ − FXβ)]_n. (3.23)

The condition FΣF^T = σ²I implies that σ²I + σ0²FF^T = FΣF^T + σ0²FF^T = FDF^T. Then (σ²I + σ0²FF^T)^{−1/2} = (FDF^T)^{−1/2} = (FD^{1/2})^{−1} = D^{−1/2}F^{−1}. It follows that (σ²I + σ0²FF^T)^{−1/2}(FZ − FXβ) = D^{−1/2}F^{−1}(FZ − FXβ) = D^{−1/2}(Z − Xβ).

Further, given that D = [Σ*  σ; σ^T  σ_n²], it can be readily shown that one valid inverse square root (in the sense that (D^{−1/2})^T D^{−1/2} = D^{−1}) is the block lower triangular matrix

D^{−1/2} = [Σ*^{−1/2}  0; −C_2^{−1/2} σ^T Σ*^{−1}  C_2^{−1/2}], (3.24)

where C_2 = σ_n² − σ^T Σ*^{−1} σ is the Schur complement of Σ* in D.

Then [(σ²I + σ0²FF^T)^{−1/2}(FZ − FXβ)]_n = [D^{−1/2}(Z − Xβ)]_n = C_2^{−1/2}·[Z(s_n) − x_n^T β − σ^T Σ*^{−1}(Z* − X*β)].

The above indicates that

StdRsd_Kriging-SOD(Z(s_n)) = StdRsd_GLS-SOD(Z(s_n)), (3.25)

and we conclude that Kriging-SOD and GLS-SOD are equivalent.
Theorem 6 If FΣF^T = σ²I, σ0²FF^T = σ0²I, the parameters of GLS-SOD and LS-SOD are correctly calculated by robust estimation, and one of the following conditions holds, then GLS-SOD becomes equivalent to LS-SOD.
1. Z(s) has a constant trend (mean): Xβ = c·1, where c is a constant value.
2. Z(s) has a linear trend in the spatial coordinates, and each point s is the geometric center (centroid) of its neighbors.

Proof For either condition (1) or (2), it can be readily derived that FXβ = 0. By the conditions FΣF^T = σ²I and σ0²FF^T = σ0²I, we have FZ ~ N(0, (σ² + σ0²)I), which is consistent with the i.i.d. assumption in LS-SOD. If we use the same robust methods to estimate the parameters, such as the median and the median absolute deviation (MAD) to estimate the mean and the standard deviation σ, then GLS-SOD becomes equivalent to LS-SOD.
Discussion: By Theorem 6, LS-SOD is a special form of GLS-SOD. LS-SOD assumes Var(diff(Z)) = σ²I for some constant σ, but no justification is given; from this perspective, GLS-SOD actually provides a theoretical foundation for LS-SOD. Section 3.3.1 discusses the situations where Var(diff(Z)) can be approximated by (σ² + σ0²)I. Furthermore, under the conditions of Theorem 6, LS-SOD is equivalent to GLS-SOD, and since those conditions also include FΣF^T = σ²I, by Theorem 5 GLS-SOD is also equivalent to Kriging-SOD. Therefore, LS-SOD becomes equivalent to Kriging-SOD in this situation. Hence, the proposed GLS framework can be parameterized to become an instance of LS-SOD or Kriging-SOD, and further study of various outlier detection methods can be greatly enhanced under the lens of this unifying framework.
As discussed in Section 3.3.1, FΣF^T can be reasonably approximated by σ²I. From Theorem 5, the major practical difference between Kriging-SOD and GLS-SOD lies in how accurately and efficiently the related model parameters can be estimated. From this perspective, GLS-SOD is superior to Kriging-SOD for three major reasons. First, GLS-SOD has less uncertainty than Kriging-SOD, since Kriging-SOD must additionally assume a semivariogram model; if the semivariogram model is not selected properly, the performance may be significantly impacted. Second, GLS-SOD solves a convex optimization problem, so a globally optimal solution exists, whereas Kriging-SOD solves a non-convex optimization problem and relies on an iteratively reweighted generalized least squares (IRWGLS) approach [64] that finds only a local solution. Finally, as shown in the simulations of Section 3.5, the runtime performance of GLS-SOD is superior to that of Kriging-SOD.
3.5 Simulations
This section conducts extensive simulations to compare the performance of the proposed GLS-based SOD methods with that of other related SOD methods. The experimental study follows the standard statistical approach for evaluating the performance of spatial outlier detection methods found in [63, 64, 53, 54].
3.5.1 Simulation Settings
Data set: The simulation data are generated based on the following statistical model:
Z(s) = xT (s)βββ + ω(s) + e(s) (3.26)
where ω(s) is a Gaussian random field with covariogram model C(h;θθθ).
We consider two popular covariogram models: the spherical model and the exponential model. See Equation 3.16 in Section 3.3.2 for the definition of the spherical model. The exponential model is defined as

C(h; θ = [b, c]^T) = b, if h = 0, (3.27)
C(h; θ = [b, c]^T) = b·exp(−h/c), if 0 < h ≤ c, (3.28)
C(h; θ = [b, c]^T) = 0, if h > c. (3.29)

The two models share the same parameters b and c. Recall that b is also the constant variance of each Z(s).
For the trend component x^T(s)β, we define x(s) = [1, x(s), y(s), x(s)·y(s), x(s)², y(s)²]^T, where x(s) and y(s) are the X and Y coordinates of the location s. This implies that the trend x^T(s)β is a polynomial of order two. The nonlinearity of the trend is determined by the regression parameters β. For example, if β = [1, 0, 0, 0, 0, 0]^T, then the trend is constant; if β = [1, 1, 1, 0, 0, 0]^T, then the trend is linear.
For the white noise component, we employ the following standard model [53]:
e(s) ∼
N (0, σ20), with probability 1 − α, (3.30)
N (0, σ2C), with probability α. (3.31)
There are three related parameters: σ0, σC, and α. σ0² is the variance of the normal white noise, σC² is the variance of the contaminated error that generates outliers, and α controls the number of outliers. Note that the distribution N(0, σC²) may also generate some normal white noise; the true outliers are therefore identified by a standard statistical test that calculates the conditional mean and standard deviation of each observation [54]. We also consider the case of clustered outliers, which is simulated by constraining the noises of a random cluster of n·α points to follow N(0, σC²). In the simulations, we tested several representative settings for each parameter, summarized in Table 3.2.
Outlier detection methods: We compared our methods with state-of-the-art local and global SOD methods, including the Z-test [56], Median Z-test [58], Iterative Z-test [57], Trimmed Z-test [60], SLOM-test [61], and universal kriging (UK) based forward search [11, 12] (denoted UK-forward). Our proposed methods are identified as GLS-backward-G, GLS-backward-R, and GLS-forward-R. GLS-backward-G refers to the GLS backward algorithm using generalized least squares regression, and GLS-backward-R refers to the GLS backward algorithm using regular least squares
Table 3.2: Combination of parameter settings

Variable          Settings
n                 n ∈ {100, 200}. Randomly generate n spatial locations {s_i}, i = 1, …, n, in the range [0, 25] × [0, 25].
b, c              b = 5; c ∈ {5, 15, 25}.
β                 For the constant trend, β1 ∼ N(0, 1) and βi = 0, i = 2, …, 6; for the linear trend, β1, β2, β3 ∼ N(0, 1) and βi = 0, i = 4, 5, 6; for the nonlinear trend, βi ∼ N(0, 1), i = 1, …, 6.
σ0, σC            σ0² ∈ {2, 10}; σC² = 20.
α                 α ∈ {0.05, 0.10, 0.15}
K                 K ∈ {4, 8}
Covariance model  Exponential, spherical
Outlier type      Isolated, clustered
regression (see Section 3.4.2), and GLS-forward-R refers to the GLS forward algorithm using regular least squares regression. The implementations of all existing methods are based on their published algorithm descriptions.
Performance metric: We tested all methods on every combination of parameter settings in Table 3.2. For each combination, we ran the experiments six times and calculated the mean and standard deviation of the accuracy of each method. To compare accuracies, we use standard ROC curves. We further collected the accuracies of the top 10, 15, and 20 ranked outlier candidates for each method; the counts of winners are shown in Table 3.3. To illustrate how these winning counts are calculated, consider the GLS-backward-R result in the top-left cell of Table 3.3, "47, 47, 45", which lies in the constant-trend column. If we consider only the accuracy of the top 10 candidate outliers, then GLS-backward-R has "won" 47 times over all combinations of parameters against all other methods; a win is credited to the method that exhibits the highest accuracy. Similarly, if we consider the accuracy of the top 20 candidate outliers, then GLS-backward-R has won 45 times.
All simulations were conducted on a PC with an Intel(R) Core(TM) Duo CPU at 2.80 GHz and 2.00 GB of memory. The development tool was MATLAB 2008.
3.5.2 Detection Accuracy
We compared the outlier detection accuracies of different methods based on different combinations
of parameter settings as shown in Table 3.2. Six representative results are displayed in Figure 3.4.
First, we compared the detection performance of the local methods. For a constant trend, our methods were competitive with existing techniques. For data sets exhibiting linear trends, our GLS algorithms achieved on average a 10% improvement over existing local methods. For data sets with nonlinear trends, the improvement over existing local methods was more significant (approximately a 50% increase). For the other combinations of parameter
settings in Table 3.2, the winning statistics for each method are displayed in Table 3.3. These results
further justify the preceding performance results.
We also compared our GLS algorithms against the global method UK-forward. Overall, our methods were comparable to UK-forward; in particular, GLS-backward-G attained better accuracy than UK-forward on about half of the data sets and remained competitive on the rest. Additionally, as shown in Section 3.5.3, UK-forward incurs a significantly higher computational cost than the GLS algorithms.
As discussed in Section 3.4.3, when K is small, the effect of σ0² must be considered and generalized least squares regression is necessary. The theorems indicate that GLS-backward-G should perform better than GLS-backward-R, which is confirmed in Figure 3.4(c).
Table 3.3: Competition statistics for different combinations of parameter settings

Algorithm         Constant Trend   Linear Trend    Nonlinear Trend
GLS-backward-R    47, 47, 45       79, 72, 82      76, 81, 77
GLS-backward-G    88, 86, 89       114, 102, 120   141, 144, 138
GLS-forward-R     13, 11, 14       22, 25, 27      40, 36, 47
Z-test            47, 35, 40       29, 30, 13      0, 0, 0
Iterative Z-test  35, 46, 63       16, 20, 21      0, 0, 0
Median Z-test     20, 23, 29       1, 7, 8         0, 0, 0
Trimmed Z-test    15, 23, 32       5, 13, 13       0, 0, 0
SLOM-test         0, 0, 0          0, 0, 0         0, 0, 0

Note: Each cell contains three values, representing the win times for the related method on the accuracies of the top 10, 15, and 20 ranked outlier candidates over all methods.
3.5.3 Computational Cost
The comparison of computational cost is shown in Figure 3.3. The results indicate that the time cost of UK-forward is much higher than that of the other methods; even the second slowest method, GLS-backward-G, is still three times faster than UK-forward. The remaining local methods have approximately equal costs and are hence much faster than UK-forward.
From the comparisons of both accuracy and computational cost, it can be seen that our proposed GLS-SOD algorithms (especially GLS-backward-G) are significantly more accurate than existing local algorithms when the spatial data exhibit either a linear or a nonlinear spatial trend. Our GLS algorithms are comparable in accuracy to the global method UK-forward, but significantly faster.
Figure 3.3: Comparison of computational cost (setting: linear trend, isolated outliers, α = 0.1, σ0² = 2, c = 15, K = 8, n = 200)
3.5.4 Conclusion
This chapter presents a generalized local statistical (GLS) framework for existing local methods. This framework not only provides theoretical foundations for local methods, but also significantly enhances spatial outlier detection. This is the first work to establish the theoretical connection between local and global SOD methods under the GLS framework.
(a) Constant trend, isolated outliers, α = 0.1, σ0² = 2, c = 15, K = 4
(b) Linear trend, isolated outliers, α = 0.1, σ0² = 2, c = 15, K = 8
(c) Nonlinear trend, isolated outliers, α = 0.15, σ0² = 10, c = 15, K = 4
(d) Constant trend, clustered outliers, α = 0.1, σ0² = 2, c = 25, K = 4
(e) Linear trend, clustered outliers, α = 0.15, σ0² = 2, c = 25, K = 8
(f) Nonlinear trend, clustered outliers, α = 0.15, σ0² = 10, c = 5, K = 8

Figure 3.4: Outlier ROC Curve Comparison (common settings: n = 200, b = 5, σC² = 20)
Chapter 4
A Generalized Approach to Non-Numerical Spatial Outlier Detection
4.1 Introduction
Spatial outlier (anomaly) detection is an important problem that has received much attention in recent years. Most existing methods focus on numerical data, but in real-world applications we are often faced with a variety of data types. For example, in disease surveillance, we monitor public health data sources such as medical sales (numerical attributes) and hospital visits (count attributes). In economic studies, living areas (numerical attributes) and indicators of whether a dwelling is located in a certain country (binary attributes) are measured to characterize house sale prices. In agriculture, the combinations (nominal attributes) of soils are measured to study the geographic distribution of different plant types.
Traditional outlier detection algorithms can be classified into the following categories: clustering-based, distribution-based, depth-based, density-based, and distance-based. Most of these approaches are designed for numerical attributes, whereas real-world datasets often contain non-numerical data types, such as binary, count, ordinal, and nominal attributes. Direct application of these approaches to non-numerical data loses significant correlations between data objects, and their extension to non-numerical data is technically challenging. For example, the distance-based approach relies on well-defined measures of the proximity between data observations, but there is no unified distance measure for non-numerical attributes. The statistical-model-based approach relies on modeling the correlations between attributes, but there is no
unified correlation measure available for non-numerical attributes.
There exists only one method designed for non-numerical spatial data, namely the pair correlation function (PCF) based method [303]. Its authors propose a new metric, the Pair Correlation Ratio (PCR), to measure the spatial correlations between spatial categorical observations. The PCR values are applied to calculate the weights of an object's neighbors, and the weighted average of the neighbors is used as the estimator of the probability that the object is an outlier. Note that a number of methods have been proposed for general categorical datasets, which can be grouped into four categories: rule based [1, 7, 15, 16, 26, 36], probability distribution based [6, 10, 25, 27], entropy based [13, 14], and similarity based [5, 28]. Because these general outlier methods do not take spatial correlations into consideration, they cannot be directly applied to spatial categorical data.
To the best of our knowledge, no existing work addresses the following challenges concurrently for spatial non-numerical outlier detection: 1) How can we develop a unified framework that models spatial correlations for a variety of data types, such as binary, nominal, ordinal, and count? 2) How can we model the large data variations caused by outliers? 3) How can we develop an efficient detection algorithm that scales to large spatial datasets? In this chapter, we present a statistical outlier detection model that addresses these three challenges. We begin by presenting a Bayesian generalized spatial linear model that captures spatial correlations for the variety of data types characterized by the exponential distribution family. We then incorporate an additional "error buffer" component based on the Student-t distribution to capture the large variations caused by outliers; the Student-t distribution has been widely used in robust statistics to minimize the effects of outliers in a variety of applications [10, 11]. After that, we integrate a latent reduced-rank spatial kriging model and present an approximate inference algorithm that conducts the outlier detection process in linear time.
The main contributions of our work can be summarized as follows:
• Design of a Robust and Reduced-Rank Bayesian SGLMM (3RB-SGLMM) model.
A new 3RB-SGLMM model is developed, which integrates the advantages of the SGLMM, robust SLM, reduced-rank GLM, and Bayesian hierarchical models. Readers are referred to Chapter 2 for these four traditional models. The model supports all data types characterized by the exponential family of distributions. Although it does not avoid the high dimensionality of the latent random variables, its special conditional independence structure makes it possible to develop efficient detection algorithms with linear time complexity.
• Development of an efficient algorithm for robust parameter estimation. The posterior distribution of the latent variables in the 3RB-SGLMM model is approximated by a Gaussian distribution computed by iteratively reweighted least squares (IRLS). The posterior distribution of the model parameters is then estimated by Laplace approximation. Efficient matrix manipulations are designed to guarantee that the whole estimation process can be done in linear time.
• Design of an efficient algorithm for non-numerical spatial outlier detection. Given the designed 3RB-SGLMM model, the outlier detection problem is addressed by estimating the posterior distribution of the "error buffer" random variables, which follow a Student-t prior distribution. An efficient algorithm based on Gaussian and Laplace approximation techniques is designed to estimate the mode and Hessian of the negative log posterior, which are then used to form an approximate Gaussian distribution for outlier detection.
• Comprehensive experiments to validate the effectiveness and efficiency of the proposed techniques. We conducted extensive experiments on both simulation and real-life datasets. The detection accuracy, time complexity, and impact of parameters were evaluated, and the results demonstrate the good performance of our proposed non-numerical spatial outlier detection approach.
The rest of the chapter is organized as follows. Section 4.2 presents theoretical preliminaries, including the reduced-rank spatial linear model and the spatial generalized linear mixed model (SGLMM). Section 4.3 formulates a new robust and reduced-rank Bayesian SGLMM model and discusses its connection with traditional spatial models. Section 4.4 designs efficient algorithms to infer latent variables, estimate model parameters, and detect non-numerical spatial outliers. Section 4.5 evaluates the effectiveness and efficiency of our proposed techniques using both simulation and real-life datasets. Section 4.6 concludes with a summary of our major work.
4.2 Theoretical Preliminaries
This section introduces two fundamental spatial statistical models, including Reduced-Rank Spatial
Linear Model (RR-SLM) and spatial generalized linear mixed Model (SGLMM).
4.2.1 Reduced-Rank Spatial Linear (Gaussian Process) Model
Spatial inferences (e.g., spatial prediction, outlier detection) based on the SLM model involve the inversion of the N × N correlation matrix R(φ), which has time complexity O(N³). This makes the SLM model unscalable to large datasets. To improve scalability, Banerjee et al. proposed a reduced-rank SLM model based on a set of knots {s*_1, …, s*_M}. The basic idea is to estimate the latent variables η(s_1), …, η(s_N) from η(s*_1), …, η(s*_M) by spatial kriging [45]:

η = c^T R*(φ)^{−1} η*,      (4.1)
η* ∼ N(0, σ² R*(φ)),       (4.2)
where η* = [η(s*_1), …, η(s*_M)]^T, R*_{ij}(φ) = C(η(s*_i), η(s*_j) | φ), and c_i = C(η(s), η(s*_i) | φ). The reduced-rank SLM model can be formalized as

Y = Xβ + η + ε,
η = c^T R*(φ)^{−1} η*,
η* ∼ N(0, σ² R*(φ)),
ε ∼ N(0, τ² I).      (4.3)
It is important to select a reasonable number of knots as well as their spatial locations. This is related to the problem of spatial design, for which a rich literature exists [45, 46]. There are two popular knot selection strategies: the first draws a uniform grid over the study region and treats each grid point as a knot; the second places knots so that each knot covers a local domain, with more knots in regions where data are dense. In practice, it is advisable to validate models using different numbers and placements of knots to obtain a reliable and robust configuration.
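Under the assumptions of an exponential correlation kernel and the grid-based knot strategy described above, the reduced-rank construction of Equations (4.1)-(4.2) can be sketched as follows; the region size, knot spacing, and parameter values are illustrative choices, not the chapter's exact settings.

```python
import numpy as np

# Sketch of the reduced-rank construction eta = C^T R*(phi)^{-1} eta*
# (Eqs. 4.1-4.2): the N latent values are interpolated from M << N
# knot values by simple kriging.

rng = np.random.default_rng(0)
N, phi, sigma2 = 500, 10.0, 1.0
S = rng.uniform(0, 25, size=(N, 2))                  # data locations

# Strategy 1 from the text: knots on a uniform grid over the region
g = np.linspace(2.5, 22.5, 5)
knots = np.array([(u, v) for u in g for v in g])     # M = 25 knots
M = len(knots)

def corr(A, B):
    """Exponential correlation kernel exp(-|s - t| / phi)."""
    H = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-H / phi)

R_star = corr(knots, knots) + 1e-8 * np.eye(M)       # M x M knot matrix
C = corr(knots, S)                                   # M x N cross matrix

eta_star = np.linalg.cholesky(sigma2 * R_star) @ rng.standard_normal(M)
eta = C.T @ np.linalg.solve(R_star, eta_star)        # N-vector
print(eta.shape)   # (500,)
```

Only the M × M matrix R* is ever factored, so the projection to all N locations costs O(NM) once the knot system is solved; this is the source of the scalability gains exploited later in the chapter.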
4.2.2 Spatial Generalized Linear Mixed Model (SGLMM)
The spatial generalized linear mixed model (SGLMM) can be described by a two-layer hierarchical
structure, including the observations and the latent Gaussian process layers.
• The Observations Layer
Let Y(s) be a response variable at location s ∈ D ⊂ R². It is assumed that Y(s) follows an exponential family distribution with probability density

f(Y(s) | θ(s), τ) = exp( [Y(s)θ(s) − a(θ(s))] / d(τ) + h(Y(s), τ) ),      (4.4)

where θ(s) and τ are model parameters: θ(s) is related to the mean of the distribution, which varies by location, and τ, called the dispersion parameter, is related to the variance of the distribution. The functions h(Y(s), τ), a(θ(s)), and d(τ) are known. Y(s) has mean and variance

E(Y(s)) := µ(s) = a′(θ(s)),      (4.5)
Var(Y(s)) := σ(s)² = a″(θ(s)) d(τ),      (4.6)

where a′(θ(s)) and a″(θ(s)) are the first and second derivatives of a(θ(s)). Many popular distributions belong to this family, such as the Gaussian, exponential, binomial, Poisson, gamma, inverse Gaussian, Dirichlet, chi-squared, and beta distributions.
• The Latent Spatial Gaussian Process Layer
Each random variable Y(s) in the observation layer is related to a latent random variable η(s) through its mean µ(s) and a link function

g(µ(s)) = x(s)^T β + η(s),      (4.7)

where x(s) is a vector of covariates and β is the vector of regression parameters. The component η(s) follows a zero-mean spatial Gaussian process, as introduced in Section 2.1:

η(s) ∼ GP(0, σ² C(η(s), η(s′) | φ)).
Given the observations Y = {Y(s_1), …, Y(s_N)}, a discretized form of the SGLMM model can be described as

Y(s_n) ∼ Exp(θ(s_n), τ), n = 1, …, N,
µ = a′(θ),
g(µ) = Xβ + η,
η ∼ N(0, σ² R(φ)),      (4.8)

where θ = [θ(s_1), …, θ(s_N)]^T, a′(θ) = [a′(θ(s_1)), …, a′(θ(s_N))]^T, and Exp(θ(s_n), τ) denotes an exponential family distribution with probability density

f(Y(s_n) | θ(s_n), τ) = exp( [Y(s_n)θ(s_n) − a(θ(s_n))] / d(τ) + h(Y(s_n), τ) ).      (4.9)
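The moment identities in Equations (4.5)-(4.6) can be checked numerically for a concrete member of the family. The sketch below does this for the Poisson distribution, for which θ = log λ, a(θ) = exp(θ), and d(τ) = 1, so that a′(θ) and a″(θ)d(τ) both equal λ; the value λ = 3.7 is arbitrary.

```python
import numpy as np

# Numerical check of Eqs. (4.5)-(4.6) for the Poisson member of the
# exponential family: theta = log(lambda), a(theta) = exp(theta),
# d(tau) = 1, so a'(theta) gives the mean and a''(theta) the variance.

rng = np.random.default_rng(1)
lam = 3.7
theta = np.log(lam)

a_prime = np.exp(theta)         # E[Y] from Eq. (4.5)
a_double_prime = np.exp(theta)  # Var(Y) from Eq. (4.6), since d(tau) = 1

sample = rng.poisson(lam, size=200_000)
print(a_prime, sample.mean())   # both close to 3.7
print(a_double_prime, sample.var())
```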
4.3 Robust and Reduced-Rank Bayesian SGLMM model
This section presents a Robust and Reduced-Rank Bayesian SGLMM model (3RB-SGLMM), which
integrates the advantages of SGLMM, robust SLM, reduced-rank GLM, and Bayesian hierarchical
model. The 3RB-SGLMM model can be formalized in the framework of Bayesian hierarchical model
with three layers, including the observations layer, the latent robust Gaussian process layer, and the
parameters layer. The graphical representation of the 3RB-SGLMM model is shown in Figure 4.1.
4.3.1 The Observations Layer
Given the observations Y = [Y(s_1), …, Y(s_N)], denote Y_n = Y(s_n). It is assumed that each Y_n follows an exponential family distribution

Y_n ∼ Exp(θ_n, τ), n = 1, …, N,
Figure 4.1: Graphical Model Representation of the 3RB-SGLMM Model
where θ_n and τ refer to the distribution parameters: θ_n is related to the mean of the distribution, which varies by location s_n, and τ, called the dispersion parameter, is related to the variance of the distribution. The probability density function f(Y_n | θ_n, τ) has the form

f(Y_n | θ_n, τ) = exp( [Y_n θ_n − a(θ_n)] / d(τ) + h(Y_n, τ) ),      (4.10)
in which the specific functions a(θ_n), d(τ), and h(Y_n, τ) are defined by the particular distribution considered, such as the Poisson, binomial, or gamma distribution. For example, the binomial distribution B(m_n, π_n) has the density

p(Y_n) = (m_n choose Y_n) π_n^{Y_n} (1 − π_n)^{m_n − Y_n}.      (4.11)

Taking logs, we can rewrite the density function as

log p(Y_n) = Y_n log(π_n / (1 − π_n)) + m_n log(1 − π_n) + log (m_n choose Y_n).      (4.12)

This shows that θ_n = log(π_n / (1 − π_n)), a(θ_n) = m_n log(1 + exp θ_n), and h(Y_n, τ) = log (m_n choose Y_n), where the second term of the density function is rewritten using log(1 − π_n) = −log(1 + exp θ_n).
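The rewriting above is easy to verify mechanically: for every admissible Y_n, the canonical form Y_nθ_n − a(θ_n) + h(Y_n, τ) must reproduce the binomial log density. The sketch below checks this for the arbitrary illustrative values m_n = 10 and π_n = 0.3.

```python
import math

# Check of the binomial rewriting (Eqs. 4.11-4.12): the log pmf equals
# Y*theta - a(theta) + h(Y, tau), with theta = log(pi/(1-pi)),
# a(theta) = m*log(1 + exp(theta)), h = log(m choose Y), d(tau) = 1.

m, pi = 10, 0.3
theta = math.log(pi / (1 - pi))

for y in range(m + 1):
    direct = (math.log(math.comb(m, y)) + y * math.log(pi)
              + (m - y) * math.log(1 - pi))
    canonical = (y * theta - m * math.log(1 + math.exp(theta))
                 + math.log(math.comb(m, y)))
    assert abs(direct - canonical) < 1e-12

print("canonical form matches the binomial log density for all y")
```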
4.3.2 The Latent Robust Gaussian process Layer
The observations Y_n are mapped to the latent robust Gaussian process random variables µ_n through the link function

g(a′(θ_n)) = µ_n,      (4.13)

where the link function g(·) is defined by the specific distribution of Y_n. For example, for the binomial and Poisson distributions, the link functions are g(x) = ln(x/(1 − x)) and g(x) = ln x, respectively.
Denote µ = [µ_1, µ_2, …, µ_N]. The vector of latent robust Gaussian process random variables µ has the additive form

µ = Xβ + η + ξ,
ξ_n ∼ Student-t(0, ν, σ_ξ), n = 1, …, N,      (4.14)
where Xβ is the large-scale trend component, β is the vector of generalized regression parameters, η is the micro-scale spatial Gaussian process component, and ξ is the "error buffer" component added to absorb the large variations caused by outliers. Each random variable ξ_n follows a Student-t distribution, which has a heavy tail in its probability density function.
The micro-scale spatial Gaussian process component η is characterized by a reduced-rank spatial linear model

η = C^T R*(φ)^{−1} η*,
η* ∼ N(0, σ² R*(φ)),

where η* = [η(s*_1), …, η(s*_M)]^T, R*_{ij}(φ) = C(η(s*_i), η(s*_j) | φ), C_{in} = C(η(s_n), η(s*_i) | φ), and C(·) is a kernel function, such as the exponential or Gaussian kernel. In this work, we used the popular exponential kernel, but our model supports other kernels as well.
4.3.3 The Parameters Layer
The proposed 3RB-SGLMM model has the major parameters β, σ², σ_ξ², φ, ν, and τ. We adopt a Bayesian framework that makes it convenient to integrate prior or domain knowledge: the parameters are themselves treated as random variables, and the second-level parameters are known as hyper-parameters. The prior distributions are defined as

β ∼ N(µ_β, Σ_β),
σ² ∼ Inv-Gamma(α_{σ²}, γ_{σ²}),
σ_ξ² ∼ Inv-Gamma(α_{σ_ξ²}, γ_{σ_ξ²}),
φ ∼ Uniform(a_φ, b_φ),
ν ∼ Uniform(a_ν, b_ν),
τ ∼ (α_τ, γ_τ),      (4.15)

where, in the line "τ ∼ (α_τ, γ_τ)", we do not state a specific prior distribution for the dispersion parameter τ, because it depends on the specific exponential family distribution used in the model (see Equation (4.10)). For the Gaussian or inverse Gaussian distribution, τ is assigned an inverse-gamma prior, i.e., τ ∼ Inv-Gamma(α_τ, γ_τ). For the Poisson or binomial distribution, τ is identically 1, a non-stochastic value, and no prior distribution is needed. For the gamma or exponential distribution, τ is assigned a gamma prior, i.e., τ ∼ Gamma(α_τ, γ_τ).
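A single draw from this prior layer might look as follows. All hyper-parameter values below are placeholders rather than the settings used in the experiments, and the inverse-gamma draws are obtained as reciprocals of gamma draws (if X ∼ Gamma(α, rate = γ), then 1/X ∼ Inv-Gamma(α, γ)).

```python
import numpy as np

# Illustrative draw from the prior layer of Eq. (4.15). Hyper-parameter
# values (mu_beta, Sigma_beta, the Inv-Gamma and Uniform bounds) are
# placeholders, not the chapter's settings.

rng = np.random.default_rng(7)
P = 3
mu_beta, Sigma_beta = np.zeros(P), np.eye(P)

beta = rng.multivariate_normal(mu_beta, Sigma_beta)
# Inv-Gamma(2, 1) via the reciprocal of a Gamma(2, rate=1) draw
sigma2 = 1.0 / rng.gamma(shape=2.0, scale=1.0)
sigma2_xi = 1.0 / rng.gamma(shape=2.0, scale=1.0)
phi = rng.uniform(1.0, 50.0)     # Uniform(a_phi, b_phi)
nu = rng.uniform(3.0, 30.0)      # Uniform(a_nu, b_nu)

print(beta.shape, sigma2 > 0, 1.0 <= phi <= 50.0)
```

For a Poisson or binomial observation layer, τ would simply be fixed to 1 rather than drawn, as stated above.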
4.3.4 Theoretical Interpretation
Our proposed 3RB-SGLMM model can be regarded as a general framework for robust spatial inference. For example, if the original sampled locations {s_1, …, s_N} are selected as knots, then the 3RB-SGLMM model reduces to a robust Bayesian SGLMM model. If we further set all prior distributions to uniform distributions, it reduces to a robust SGLMM model [256]. If the Gaussian distribution is then selected as the exponential family distribution, it reduces to a robust GLM model (Equation (2.24)). If we further set the degrees-of-freedom parameter ν to infinity, the variational component ξ_n follows a Gaussian distribution, and the 3RB-SGLMM model reduces to a regular GLM model (Equation (2.22)).
4.4 Robust Approximate Inference
This section presents efficient algorithms to estimate the posterior distributions of the latent robust Gaussian process variables {η*, ξ} and the model parameters {β, φ, σ², τ, ν}. Based on the estimated posteriors, we then present an efficient algorithm to detect non-numerical outliers. Lastly, we show that all of the preceding processes can be conducted in linear time.
4.4.1 Inference on Latent variables
For computational convenience, we treat the vector of regression parameters β as latent variables instead of model parameters. Denote ω = [η*, β, ξ]. The objective is to infer a Gaussian approximation of the posterior p(ω | Y, Θ; Ω), where Θ = [τ, φ, ν, σ_ξ², σ²] and Ω refers to the set of hyper-parameters. Given the mode ω and the Hessian Σ^{−1} of the negative log density function −log p(ω | Y, Θ; Ω), the posterior can be approximated as

p(ω | Y, Θ; Ω) ≈ q(ω | Y, Θ; Ω) = N(ω, Σ).      (4.16)

The mode ω can be calculated by solving the optimization problem ω = argmin_ω −log p(ω | Y, Θ; Ω), and the Hessian at the mode ω can be obtained by a second-order Taylor expansion of −log p(ω | Y, Θ; Ω):

Σ^{−1} = −∇² log p(ω | Y, Θ; Ω) |_{ω=ω} = H^T G H + diag(R*, Σ_β, Q),      (4.17)
where H = [C^T R*^{−1}, X, I]; R*_{ij} = C(|s*_i − s*_j|; φ) for two knot locations s*_i and s*_j; C_{tn} = C(|s_n − s*_t|; φ) for a knot location s*_t and an observation location s_n; G is a diagonal matrix whose entry G_n is the Hessian of the negative log observation density −log p(Y_n | [Xβ + C^T R*^{−1} η* + ξ]_n); and Q is a diagonal matrix with entries

Q_nn = (ν + 1)(ν σ_ξ² − ξ_n²) / (ξ_n² + ν σ_ξ²)².      (4.18)

The specific form of G_n is determined by the distribution of the observations. For example, if the observations follow a binomial distribution, then

G_n = m_n exp(x_n) / (1 + exp(x_n))²,      (4.19)

where x_n = [Xβ + C^T R*^{−1} η* + ξ]_n and m_n is the number of trials at location s_n. If the observations follow a Poisson distribution instead, then

G_n = exp(x_n).      (4.20)
The mode ω can be identified using general numerical optimization techniques, such as gradient descent, Newton's method, or interior point methods. In our work, we employed the popular iteratively reweighted least squares (IRLS) algorithm for generalized linear models, which optimizes the mode ω and the Hessian Σ^{−1} jointly; in practice, a good approximation is obtained within five iterations for the task of non-numerical outlier detection.
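As a simplified, self-contained illustration of this mode search, the sketch below runs IRLS/Newton steps for a plain Bernoulli GLM with a Gaussian prior on the coefficients. It is a stand-in for the full ω = [η*, β, ξ] problem, not the chapter's implementation; the per-observation Hessian G matches Equation (4.19) with m_n = 1, and the prior precision plays the role of the diag(·) term in Equation (4.17).

```python
import numpy as np

# Minimal IRLS/Newton sketch for mode finding in a Bernoulli GLM with a
# N(0, I) prior on the coefficients (an assumption for illustration).
# Newton step: w <- w + (X^T G X + P)^{-1} (X^T (y - mu) - P w), where
# G = diag(mu * (1 - mu)) is the per-observation Hessian (Eq. 4.19,
# m_n = 1) and P is the prior precision.

rng = np.random.default_rng(3)
N, P_dim = 400, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, P_dim - 1))])
w_true = np.array([0.5, -1.0, 2.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

Prec = np.eye(P_dim)            # prior precision (assumed N(0, I) prior)
w = np.zeros(P_dim)
for _ in range(5):              # ~5 iterations suffice in practice
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    G = mu * (1.0 - mu)                       # diagonal of G
    grad = X.T @ (y - mu) - Prec @ w          # gradient of log posterior
    H = X.T @ (X * G[:, None]) + Prec         # negative-log-posterior Hessian
    w = w + np.linalg.solve(H, grad)          # Newton update

print(np.round(w, 2))           # close to w_true, up to noise/shrinkage
```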
4.4.2 Inference on Parameters
This section presents an approximate algorithm based on Laplace approximation to infer the posterior of the model parameters p(Θ | Y; Ω), by marginalizing out the latent variables ω:

p(Θ | Y; Ω) = ∫ p(ω, Θ | Y; Ω) dω.      (4.21)

This integration is analytically intractable, and approximate inference techniques must be applied. Because the posterior p(Θ | Y; Ω) is skewed, a Gaussian approximation is inappropriate. We first reformulate the posterior in the form

p(Θ | Y; Ω) ∝ p(Y | ω, Θ; Ω) p(ω | Θ; Ω) p(Θ; Ω) / p(ω | Y, Θ; Ω).      (4.22)

As shown in Section 4.4.1, the denominator satisfies p(ω | Y, Θ; Ω) ≈ N(ω, Σ). A Laplace approximation of the right-hand side of Equation 4.22 is then obtained as

p(Θ | Y; Ω) ≈ q(Θ | Y; Ω) ∝ [ p(Y | ω, Θ; Ω) p(ω | Θ; Ω) p(Θ; Ω) / p(ω | Y, Θ; Ω) ] |_{ω=ω(Θ)}.      (4.23)
Because the posterior p(Θ | Y; Ω) is skewed, the mode Θ and Hessian Σ_Θ of the negative log density function −log q(Θ | Y; Ω) do not accurately characterize the distribution. A more appropriate strategy is to sample K contour points {Θ_1, …, Θ_K} around the mode Θ and calculate the corresponding posteriors q(Θ_1 | Y; Ω), …, q(Θ_K | Y; Ω). After normalization, we obtain the weights ∆_1, …, ∆_K, with Σ_k ∆_k = 1. The most challenging step is to calculate the mode Θ, which can be obtained by solving the optimization problem

argmin_Θ  −log p(Y | ω(Θ), Θ; Ω) − log p(ω(Θ) | Θ; Ω) − log p(Θ; Ω) + log q(ω(Θ) | Y, Θ; Ω),      (4.24)

where ω(Θ) is the mode of p(ω | Y, Θ; Ω), which is a function of Θ. The Hessian of the negative log density function −log q(ω(Θ) | Y, Θ; Ω) was estimated in Section 4.4.1, and those of the other components can be readily derived. In addition, this is a low-dimensional problem, since Θ has only five components; it can be solved efficiently using numerical optimization techniques such as scaled conjugate gradients or Newton's method.
4.4.3 Non-Numerical Spatial Outlier Detection
As shown in the proposed 3RB-SGLMM model, the "error buffer" variables ξ_1, …, ξ_N are designed to absorb the large variations caused by outliers. The anomaly degree of an observation Y_n is characterized by the anomaly degree of the corresponding variable ξ_n. The posterior p(ξ | Y; Ω)
can be calculated as

p(ξ | Y; Ω) = ∫ p(ξ, β, η* | Y; Ω) dβ dη*
            = ∫ p(ξ, β, η* | Y, Θ; Ω) p(Θ | Y; Ω) dΘ dβ dη*
            ≈ ∫ q(ξ, β, η* | Y, Θ; Ω) p(Θ | Y; Ω) dΘ dβ dη*
            ≈ Σ_{k=1}^{K} ∆_k ∫ q(ξ, β, η* | Y, Θ_k; Ω) dβ dη*
            = Σ_{k=1}^{K} ∆_k q(ξ | Y, Θ_k; Ω)
            = N(ω_ξ, Σ_{k=1}^{K} ∆_k² Σ_ξ),      (4.25)
where q(ξ, β, η* | Y, Θ; Ω) = N(ω, Σ) was obtained in Section 4.4.1, the contour sample points {Θ_1, …, Θ_K} were obtained in Section 4.4.2, and q(ξ | Y, Θ_k; Ω) = N(ω_ξ, Σ_ξ) is the marginal of q(ξ, β, η* | Y, Θ_k; Ω) over the ξ subspace. Based on the preceding result, the 3RB-SGLMM based non-numerical spatial outlier detection algorithm can be described as follows:

1. Estimate the approximate posterior p(ω | Y, Θ; Ω) ≈ N(ω, Σ) by Equations 4.16 and 4.17.

2. Estimate the contour sample points of the model parameters {Θ_1, …, Θ_K} and the corresponding weights {∆_1, …, ∆_K} by Equations 4.23 and 4.24.

3. Estimate the approximate posterior q(ξ | Y; Ω) = N(ω_ξ, Σ_{k=1}^{K} ∆_k² Σ_ξ) by Equation 4.25.

4. Compute the standardized variables ξ̃ with q(ξ̃ | Y; Ω) = N(0, I), where

ξ̃ = (Σ_{k=1}^{K} ∆_k² Σ_ξ)^{−1/2} (ξ − ω_ξ).      (4.26)

5. The absolute value |ξ̃_n| is returned as an estimate of the anomaly degree of the observation Y_n, and the set S_outliers of candidate outliers is obtained using the standard Z-test statistic:

S_outliers = {Y_n : |ξ̃_n| > 3}.      (4.27)
In step 4, computing the matrix square root (Σ_{k=1}^{K} ∆_k² Σ_ξ)^{−1/2} has time cost O(N³), which is inappropriate for large datasets. In our implementation, we further approximate the matrix Σ_ξ by a diagonal matrix, which reduces the time cost to O(N).
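With the diagonal approximation just mentioned, steps 3-5 reduce to an O(N) standardize-and-threshold pass. The sketch below applies them to synthetic stand-ins for the posterior quantities of Sections 4.4.1-4.4.2 (the names `Delta`, `omega_xi`, and `diag_Sigma_xi` are hypothetical placeholders): the posterior mean of each ξ_n is standardized by its own variance and flagged when the absolute score exceeds 3.

```python
import numpy as np

# O(N) sketch of detection steps 3-5 under the diagonal approximation
# of Sigma_xi. Inputs are synthetic stand-ins for the posteriors that
# the inference algorithms of Secs. 4.4.1-4.4.2 would produce.

rng = np.random.default_rng(5)
N, K = 1000, 4
Delta = np.full(K, 1.0 / K)                   # normalized contour weights
omega_xi = rng.standard_normal(N) * 0.05      # posterior means of xi_n
diag_Sigma_xi = np.full(N, 0.25)              # diagonal of Sigma_xi
omega_xi[:10] += 8.0                          # ten planted outliers

var = np.sum(Delta**2) * diag_Sigma_xi        # per-point variance (Eq. 4.25)
xi_std = omega_xi / np.sqrt(var)              # standardized scores
outliers = np.flatnonzero(np.abs(xi_std) > 3.0)   # Z-test threshold (Eq. 4.27)

print(len(outliers), outliers[:10])
```

Because only element-wise operations on length-N vectors are involved, the pass is linear in N, matching the complexity claim above.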
4.4.4 Time and Space Complexity Analysis
This section analyzes the space and time costs of the three inference procedures discussed in Sections 4.4.1 to 4.4.3. First, for the inference of the latent variables, we applied the popular IRLS algorithm to estimate the mode ω and Hessian Σ^{−1} jointly. Suppose the required number of iterations is L. In each iteration, the dominant time cost is the inversion of the matrix [H^T G H + diag(R*, Σ_β, Q)] in Equation 4.17, where H = [C^T R*^{−1}, X, I], R* ∈ R^{M×M}, C ∈ R^{M×N}, X ∈ R^{N×P}, and I, G, Q ∈ R^{N×N}. Here N, M, and P denote the numbers of observations, knots, and regression attributes (predictors), respectively. Without any matrix optimization, the time cost of this inversion is O(N³).
However, the special structure of the matrix can be exploited to reduce the cost of the inversion to O(N(M + P)³). Specifically, denote F = [C^T R*^{−1}, X], A = G F, and F* = F^T G F. Then the component H^T G H has the block form

[ F*   A^T ]
[ A    G   ],

and the matrix H^T G H + diag(R*, Σ_β, Q) has the block form

[ F* + diag(R*, Σ_β)   A^T   ]
[ A                    G + Q ].

By block matrix algebra, its inverse has the structure

[ C_1^{−1}   C_12     ]
[ C_12^T     C_2^{−1} ],

where

C_1 = F* + diag(R*, Σ_β) − A^T (G + Q)^{−1} A,
C_2 = G + Q − A (F* + diag(R*, Σ_β))^{−1} A^T,
C_12 = −C_1^{−1} A^T (G + Q)^{−1}.

By the Sherman-Morrison-Woodbury formula, the inverse C_2^{−1} has the decomposition

C_2^{−1} = (G + Q)^{−1} + (G + Q)^{−1} A (F* + diag(R*, Σ_β) − A^T (G + Q)^{−1} A)^{−1} A^T (G + Q)^{−1}.
Based on these matrix manipulations, the inversion of the matrix [H^T G H + diag(R*, Σ_β, Q)] can be calculated with time cost O(N(M + P)³). Note that the inversions of the diagonal matrices G and Q have linear time cost O(N). Therefore, the total time cost of inferring the latent variables ω is O(LN(M + P)³).
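The Woodbury step can be sanity-checked numerically on small random matrices: for a diagonal matrix D (playing the role of G + Q) and a low-rank correction A M^{−1} A^T, the expansion must match a direct dense inverse. The sizes and values below are arbitrary, and only the identity itself is verified, not the chapter's full block scheme.

```python
import numpy as np

# Numerical check of the Woodbury identity used above:
# (D - A M^{-1} A^T)^{-1}
#   = D^{-1} + D^{-1} A (M - A^T D^{-1} A)^{-1} A^T D^{-1},
# where D is diagonal (so D^{-1} costs O(N)) and the only dense inverse
# is of (M+P) x (M+P) size.

rng = np.random.default_rng(2)
N, MP = 200, 12
D = np.diag(rng.uniform(1.0, 2.0, N))          # diagonal, positive definite
A = rng.standard_normal((N, MP))
M = np.eye(MP) * 500.0                         # keeps C2 well-conditioned

C2 = D - A @ np.linalg.inv(M) @ A.T            # the N x N matrix to invert
Dinv = np.diag(1.0 / np.diag(D))               # O(N) diagonal inverse
inner = M - A.T @ Dinv @ A                     # small (M+P) x (M+P) system
C2_inv = Dinv + Dinv @ A @ np.linalg.solve(inner, A.T @ Dinv)

print(np.allclose(C2_inv, np.linalg.inv(C2)))  # True
```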
Second, for the inference of the parameters, we applied the Laplace approximation to estimate the posterior of the parameter vector Θ. The main time cost lies in the calculation of the mode Θ, for which we applied the trust-region-reflective algorithm, the default setting of the fmincon function in MATLAB R2011b. Suppose the required number of iterations is W. Each iteration is dominated by the inference of the latent variables for the current estimate of Θ and by the inversion of the variance-covariance matrix of the knots, with time cost O(LN(M + P)³) + O(M³) = O(LN(M + P)³). Therefore, the total time cost of this inference process is O(WLN(M + P)³).
Third, for the non-numerical outlier detection process, Step 1 has time cost O(LN(M + P)³), Step 2 has time cost O(WLN(M + P)³), and Steps 3 to 5 have time cost O(KN). The total cost of the outlier detection process is therefore O(WLN(M + P)³). In practice, the required number L of IRLS iterations is smaller than 5, and the required number W of iterations for inferring the mode Θ is of the same scale as the size of Θ, which is 5. In addition, the number M of knots and the number P of regression parameters are both negligible when the dataset size N is large. To conclude, for large datasets, all three inference procedures have linear time cost O(N). It can be readily shown that the total space cost is O(N) as well.
4.5 Experiments
This section evaluates the effectiveness and efficiency of our proposed techniques using four simulation and six real-life datasets. We focused on binary datasets as a case study, but our proposed techniques can also be applied to all data types that can be characterized by the exponential family of distributions, such as count, ordinal, and nominal attributes. All the experiments were conducted on a PC with an Intel(R) Core(TM) i7-Q740 CPU at 1.73 GHz and 8.00 GB of memory. The development tool was MATLAB 2011. Note that we re-implemented all the competing methods based on their original papers, because the original implementations are unavailable. Although we strictly followed the descriptions in those papers, we cannot guarantee that our implementations are fully faithful or that the related parameters are optimally tuned.
4.5.1 Experiment Settings
Simulation Datasets
The simulation datasets were generated based on the regular spatial generalized linear mixed model (SGLMM)

Y(s) ∼ Binomial(m, g(µ(s))),
g(µ(s)) = x(s)^T β + η(s),
η(s) ∼ GP(0, σ² C(|s − t|; φ)), (4.28)
Table 4.1: Simulation Model Settings

Dataset Label | N    | β                     | σ² | φ
Sim-500-1     | 500  | [-14.98, -0.86, 7.92] | 3  | 25
Sim-500-2     | 500  | [0.30, 1.98, -1.14]   | 3  | 25
Sim-1000      | 1000 | [-1.99, 0.19, 0.90]   | 3  | 25
Sim-1500      | 1500 | [-0.02, 2.50, -1.24]  | 1  | 25
where GP refers to a Gaussian process, in which the correlation between two locations s and t is decided by the kernel function C(·); we used the exponential kernel in our experiments. The binomial base m is set to 1 for every location s, and hence the observation Y(s) can only be 0 or 1. The parameters of the simulation model include β, σ², and φ. The number (N) of data observations and the number (P) of attributes also need to be decided.
The data generative process includes six major steps. 1) Generation of spatial locations. Sample N spatial locations s1, · · · , sN from a uniform distribution over a two-dimensional 100 by 100 region. 2) Generation of predictors and regression parameters. Sample the set of predictors x(s1), · · · , x(sN) from a P-dimensional space in a unit range, apply k-means to generate two clusters, and generate the vector of regression parameters β based on the bisecting hyperplane separating the two cluster centers. 3) Generation of a Gaussian process. Sample the variance parameter σ² from a uniform distribution over the range [1, 5], and sample the range parameter φ from a uniform distribution over the range [1, 50]. These two parameters determine a specific Gaussian process. 4) Generation of latent variables. Sample N latent variables η(s1), · · · , η(sN) from the Gaussian process based on the parameter φ. 5) Generation of observations. Sample N observations from the binomial distribution, whose parameters can be calculated from the spatial locations and latent variables generated in the previous steps. 6) Generation of outliers. Randomly select five percent of the observations and flip their values to the alternative values. For the other settings, P was fixed to 2, and N was set to 500, 1000, 1500, 2000, 2500, 3000, and 5000 to simulate different scenarios. Using the preceding generative procedure, we randomly generated a large number of simulation datasets to mimic a variety of scenarios. In this section, we present four representative simulation datasets to discuss the discovered patterns. The model settings of these four datasets are shown in Table 4.1, and the spatial distributions of observations are shown in Figure 4.2. For each model setting, we generated five realizations, and the following evaluations are based on the average accuracy and time costs (seconds), in order to avoid potential random effects.
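The six generative steps can be sketched compactly. Below is an illustrative re-implementation in Python (the dissertation's experiments used MATLAB); it assumes a logit link for g and fixes σ² and φ as in Table 4.1 rather than resampling them, and the helper names are hypothetical:

```python
import numpy as np

def simulate_sglmm(N=500, P=2, sigma2=3.0, phi=25.0, outlier_frac=0.05, seed=0):
    """Sketch of the six-step generative process (logit link assumed)."""
    rng = np.random.default_rng(seed)
    # 1) spatial locations, uniform over a 100 x 100 region
    S = rng.uniform(0, 100, size=(N, 2))
    # 2) predictors in the unit cube; beta from the bisecting hyperplane of 2-means centers
    X = rng.uniform(0, 1, size=(N, P))
    c = X[rng.choice(N, 2, replace=False)]
    for _ in range(10):                      # crude 2-means
        lab = np.argmin(((X[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(lab == k):
                c[k] = X[lab == k].mean(0)
    w = c[1] - c[0]                          # normal of the bisecting hyperplane
    b0 = -w @ (c[0] + c[1]) / 2              # intercept
    # 3-4) latent Gaussian process with exponential kernel
    D = np.linalg.norm(S[:, None] - S[None], axis=-1)
    K = sigma2 * np.exp(-D / phi)
    eta = rng.multivariate_normal(np.zeros(N), K + 1e-8 * np.eye(N))
    # 5) binary observations through the logit link
    mu = 1.0 / (1.0 + np.exp(-(b0 + X @ w + eta)))
    Y = rng.binomial(1, mu)
    # 6) flip five percent of the observations to create outliers
    idx = rng.choice(N, int(outlier_frac * N), replace=False)
    Y[idx] = 1 - Y[idx]
    return S, X, Y, idx
```

The flipped indices `idx` serve as ground truth when evaluating detection recall later.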
Real Life Datasets
The Lake dataset was originally published by Varin et al. [320]. It was used to model trout abundance in Norwegian lakes as a function of lake acidity. The predictor attributes include the intercept, X coordinate, Y coordinate, product of the X and Y coordinates, X coordinate squared, and Y coordinate squared. The MLST dataset came from multiple listings containing structural descriptors of houses, their sale prices, and their addresses for Baltimore, Maryland in 1978. Dubin [325] estimated a spatial autocorrelation model that calculated the portion of the price by multiplying the vectors of
[Four panels, each plotting observations over a 100 by 100 region with values in [0, 1]: (a) Sim-500-1, (b) Sim-1000, (c) Sim-1500, (d) Sim-500-2]
Figure 4.2: Spatial Distribution of Four Simulation Datasets
[Six panels, each plotting observations over a 100 by 100 region with values in [0, 1]: (a) MLST, (b) LoaLoa, (c) BEF, (d) BostonSMSA, (e) Lake, (f) House]
Figure 4.3: Spatial Distribution of Six Real Life Datasets
Table 4.2: Real Life Data Settings

Dataset Label | N     | Y                   | # of Predictors (Size of x)
Lake          | 371   | Trout abundance     | 6
MLST          | 211   | BE basal area       | 3
BEF           | 437   | BE basal area       | 5
LoaLoa        | 197   | Number of positives | 3
BostonSMSA    | 506   | House price         | 13
House         | 20640 | House price         | 8
attributes by their estimated coefficients. The explanatory attributes used were the X coordinate, Y coordinate, product of the X and Y coordinates, X coordinate squared, and Y coordinate squared. The BEF dataset is a forest inventory dataset from the U.S. Department of Agriculture Forest Service; it is included in the spBayes R package [321]. The House dataset contains information collected for a range of variables for all the block groups in California from the 1990 Census. The spatial regression model of House was analyzed by Pace and Barry [322]. The predictor variables include Median Income, Median Income², Median Income³, ln(Median Age), ln(Total Rooms/Population), ln(Bedrooms/Population), ln(Population/Households), and ln(Households). The Loa loa prevalence dataset was collected from 197 village surveys [323]; its predictor variables include longitude, latitude, and elevation. The response variable is the number of positives, and the base value is the number of people tested. The BostonSMSA dataset was used by Harrison and Rubinfeld to investigate various methodological issues related to the use of housing data to estimate the demand for clean air [324]. The predictor variables include levels of nitrogen oxides, particulate concentrations, average number of rooms, proportion of structures built before 1940, black population proportion, lower status population proportion, crime rate, proportion of area zoned with large lots, proportion of nonretail business area, property tax rate, pupil-teacher ratio, location contiguous to the Charles River, weighted distance to the employment centers, and an index of accessibility. These six datasets are abbreviated as Lake, MLST, BEF, LoaLoa, BostonSMSA, and House. The settings of these datasets are shown in Table 4.2. If the response variables are numerical, we discretized them into binary variables by setting the values above the median level to 1, and those below it to 0. The spatial distributions of observations of these six datasets are shown in Figure 4.3.
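The median-based discretization just described is simple enough to state in code. A sketch (under a strict reading of the rule, values equal to the median fall to 0):

```python
import numpy as np

def binarize_at_median(y):
    """Discretize a numeric response: strictly above the median -> 1, otherwise 0."""
    y = np.asarray(y, dtype=float)
    return (y > np.median(y)).astype(int)
```

For example, `binarize_at_median([1, 2, 3, 4, 5])` has median 3, so the last two values map to 1.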
Five Comparison Methods
We treated binary attributes as a special case of categorical attributes, and considered categorical outlier detection methods as competing methods.

For spatial outlier detection, the Z-test [56] is one of the most popular methods for identifying spatial outliers under the null hypothesis that the data follow a normal distribution. When applying it to categorical data, we integrated the Z-test with the Lin and OF measurements. As a result, there were two comparison methods, namely Lin-Z and OF-Z.
Several advanced general outlier detection methods have been proposed for categorical data, including the Bayes Net method, the Marginal method, LERAD, Conditional Test, Conditional Test-Combining Evidence, and Conditional Test-Partitioning. Among all these methods, experiments have shown that the Conditional Test and its two variants outperform all the others [326]. Therefore, we focused on comparing our method with the two best methods, denoted Conditional Test and Conditional Test-Combining Evidence. These methods were originally proposed for multivariate categorical data, and we made straightforward simplifications to make them applicable to binary data.
Performance Metric
To measure the effectiveness of our proposed techniques, we considered two popular metrics, precision and recall. To measure their efficiency, we considered the running time in seconds. For simplicity of interpretation, we focus on the detection recall among the top K objects returned, where K varies from 1 to N. By varying K, we can draw a detection rate curve for each method, and the performances of all the methods can then be straightforwardly compared.
4.5.2 Detection Effectiveness
Figures 4.4 and 4.5 show the detection results of our proposed method and the five competing methods on four simulation and six real-life datasets. The X axis refers to the sequence of objects, and the Y axis refers to the detection rate (recall). For a given sequence number (k) on the X axis, the detection rate of a specific method is the recall of true outliers among the top k objects returned by this method as candidate outliers. The corresponding detection rate curve is obtained by calculating the detection rates for k = 1, 2, 3, · · · , N. The comparison results in these figures indicate that our proposed method achieved the best detection accuracy on most of the simulation and real datasets. Specifically, for the real datasets MLST and BostonSMSA, our method outperformed the other methods by twenty and ten percent, respectively. For the simulation datasets Sim-500-1 and Sim-1000, our method outperformed the other methods by thirty and twenty percent, respectively. For the other datasets, our method still performed the best or comparably to the best of the other methods. Another observation is that among the five competing methods, the spatial outlier detection methods outperformed the general outlier detection methods by more than twenty percent on all the datasets. One potential interpretation is that the general outlier detection methods do not consider spatial correlations in their designs, whereas spatial correlations play an important role in the detection of spatial outliers.

We also observe an interesting pattern that potentially explains why our method performed similarly to the spatial detection methods Lin-Z and OF-Z on the simulation dataset Sim-500-1 and the real-life datasets Lake and House. The methods Lin-Z and OF-Z were designed based on the First Law of Geography: "Everything is related to everything else, but near things are more related than distant things." These methods basically use the weighted average of the nonspatial attribute values of an object's spatial neighbors to measure the outlier degree of the object. If the labels of its spatial neighbors are mostly consistent with the label of the object, then the object tends to be normal; otherwise,
[Four panels, each plotting the detection rate against the number of top-ranked data objects for Our Method, PCF, Cond-Test, Comb-Evid, Lin-Z, and OF-Z: (a) Sim-500-1, (b) Sim-1000, (c) Sim-1500, (d) Sim-500-2]
Figure 4.4: Detection Rate Comparison on Four Simulation Datasets
[Six panels, each plotting the detection rate against the number of top-ranked data objects for Our Method, PCF, Cond-Test, Comb-Evid, Lin-Z, and OF-Z: (a) MLST, (b) LoaLoa, (c) BEF, (d) BostonSMSA, (e) Lake, (f) House]
Figure 4.5: Detection Rate Comparison on Six Real-Life Datasets
it will be returned as a potential outlier. As a result, if the homogeneity of the spatial distribution of observations is strong, then these methods tend to perform well. This pattern can be clearly observed by comparing Figures 4.2 and 4.4. Figures 4.2 (a) to 4.2 (d) are ordered by homogeneity, from small to large, and we consistently observe that the detection difference between our method and the comparison methods Lin-Z and OF-Z becomes smaller from Figure 4.4 (a) to Figure 4.4 (d). This pattern can also be identified by comparing the real-life datasets MLST and LoaLoa on both their spatial distributions and detection rates.
4.5.3 Detection Efficiency
As analyzed in Section 4.4.4, the time complexity of our method is linearly scalable in the dataset size after the matrix optimization, but without any matrix optimization it is cubic in the dataset size. This feature was also validated in Figure 4.6, in which we generated simulation datasets of different sizes from 500 to 5000, shown on the X axis; the Y axis refers to the corresponding running time costs. First, it can be observed that the original version of our method without optimization has a clear nonlinear (approximately cubic) growth with the dataset size, whereas the time cost of the optimized version grows linearly with the dataset size. A similar pattern was observed in experiments on real-life datasets. Notice that the growth pattern of the unoptimized version has a deviation point when the dataset size is 3000. One potential interpretation is that we applied numerical optimization techniques in our approximate inference algorithms, in which the convergence rate is not decided by the dataset size alone, and the convergence rate at that point may be high due to other factors related to the data distribution.
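One simple way to make the linear-versus-cubic comparison quantitative is to fit the slope of log(time) against log(size): a slope near 1 indicates linear scaling and a slope near 3 indicates cubic. A sketch with synthetic timings (the constants are hypothetical, not the measured costs):

```python
import numpy as np

def growth_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(size) for a power-law fit."""
    return np.polyfit(np.log(sizes), np.log(times), 1)[0]

sizes = [500, 1000, 2000, 5000]
linear_times = [2e-3 * n for n in sizes]       # hypothetical O(N) timings
cubic_times = [1e-9 * n ** 3 for n in sizes]   # hypothetical O(N^3) timings
```

Applied to the measured running times of Figure 4.6, this fit gives a single number summarizing each curve's growth order.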
[Running time in seconds against dataset sizes 500 to 5000, for Our Method (Optimized) and Our Method (Not Optimized)]
Figure 4.6: Time Cost Analysis
4.5.4 Impact of Model Parameters
In our proposed detection algorithm, we need to predefine the hyperparameters of the proposed 3RB-SGLMM model, and also the number of knots. First, for all the hyperparameters, we used settings that lead to uniform prior distributions on the model parameters. This strategy is popular in probabilistic-model based applications, and the resulting solution becomes similar to the MLE solution of a non-Bayesian version of our proposed 3RB-SGLMM model. Second, for the number of knots, we used the value 100 by default in all our experiments. In practice, we observed that the outlier detection performance is not sensitive to the number of knots used, as shown in Figure 4.7. In addition to the knot size, the way the knots are generated may also matter. There are two popular strategies to generate knots. The first is to draw a uniform grid covering the study region and regard each grid point as a knot. The second is to place knots such that each knot covers a local domain and regions with dense data receive more knots. In our experiments, we used a k-means based clustering algorithm to identify high-density areas, which were then used to generate the knots. Our experiments show that this strategy is relatively better than the uniform grid based methods. A potential interpretation is that the spatial distribution of data observations is not uniform, such as in the situation where the center location of a county or a city is used to characterize the spatial location of an observation, and urban areas have higher densities than rural areas.
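A density-adaptive knot layout of the kind described above can be obtained by running k-means on the observation locations and using the m centroids as knots. A minimal Lloyd's-algorithm sketch (illustrative; any off-the-shelf k-means implementation would serve equally well):

```python
import numpy as np

def kmeans_knots(locations, m, iters=25, seed=0):
    """Place m knots at k-means centroids so that dense regions receive more knots."""
    rng = np.random.default_rng(seed)
    X = np.asarray(locations, dtype=float)
    centers = X[rng.choice(len(X), m, replace=False)]   # initialize at m data points
    for _ in range(iters):
        # assign each location to its nearest center, then recompute centroids
        lab = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(m):
            if np.any(lab == k):
                centers[k] = X[lab == k].mean(0)
    return centers
```

On two well-separated clusters of locations with m = 2, the returned knots settle at the two cluster centroids, which is exactly the density-following behavior the text describes.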
[Two panels, each plotting the detection rate against the number of top-ranked data objects for knot sizes 5, 10, 20, 50, and 100: (a) BEF, (b) Boston]
Figure 4.7: Detection Rate Comparison Using Different Knot Sizes
4.6 Conclusion
This chapter first presented a new 3RB-SGLMM model for the robust modeling of spatial non-numerical data, and then developed a generalized approach to detect non-numerical spatial outliers, such as in count, binary, ordinal, and nominal data. The results on both simulation and real-life datasets demonstrated that our proposed approach outperformed existing methods in detection accuracy while at the same time achieving a linear time complexity. To the best of our knowledge, this is the first work to present a generalized framework that is suitable for these different types of spatial datasets.
Chapter 5
Robust Prediction for Large Spatio-Temporal Data Sets
Efficient prediction for massive amounts of spatio-temporal data is an emerging challenge in the data mining field. Fixed rank spatio-temporal prediction (FR-STP) offers a promising dimension-reduced approach for predicting large spatio-temporal data in linear time, but is not applicable to the nonlinear dynamic environments common in many real applications. This deficiency can be systematically addressed by increasing the robustness of the FR-STP using heavy-tailed distributions, such as the Huber, Laplace, and Student's t distributions. This chapter presents a robust fixed rank spatio-temporal prediction (RFR-STP) approach that outperforms the FR-STP in nonlinear environments where the FR-STP's distribution assumptions are violated. The general RFR-STP algorithm utilizes the framework of Newton's methods for most popular heavy-tailed distributions, plus two optimization techniques for the special Huber and Laplace distributions. Extensive experimental evaluations based on both simulated and real-life datasets demonstrate the robustness and efficiency of the proposed RFR-STP approach.
The rest of this chapter is organized as follows. Section 2 reviews the STRE model and the FR-
STP approach. Section 3 presents the new R-STRE model and formalizes the RFR-STP problem. A
general approach to the RFR-STP problem is proposed in Section 4, and two optimization techniques
are discussed in Section 5. Experiments on both simulated and real life data sets are presented in
Section 6. The chapter concludes with a summary of the research presented in Section 7.
5.1 Introduction
Spatial and temporal information exists almost everywhere in the real world. Most physical and
biological processes involve some degree of spatial and temporal variability [163, 278, 229]. Recent
advances in remote sensing technology mean that massive amounts of spatio-temporal data are now
collected, and this volume will only increase. For example, the National Aeronautics and Space
Administration (NASA) has launched satellites (e.g., the Terra satellite) that have the ability to
collect data on the order of 100,000 observations per day [288].
As one of the major research issues, the prediction of spatio-temporal data has attracted significant attention in fields such as environmetrics, biology, epidemiology, geography, and economics. Illustrative applications include climate prediction [290, 170], tactics identification in battlefields [168], molecular dynamical pattern mining [295], medical imaging [226], periodic pattern detection for mobile phone users [167], the prediction of infectious disease outbreaks [269] and urban network traffic volumes [266], advertising budget allocation [178], and financial migration motif prediction [232].
Given the large volume of spatio-temporal data, it is computationally challenging to apply traditional spatial and spatio-temporal prediction methods within an allowable memory limit or an acceptable time limit, even in supercomputing environments [259]. Efficient prediction for large spatio-temporal data has therefore become one of the emerging challenges in the data mining field.
There are currently two paradigms for predicting spatio-temporal data, namely the Kriging based
and dynamic (mechanic or probabilistic) specification based approaches [228]. The Kriging based
paradigm extends the spatial dimensions (d) to include an extra time dimension and focuses on mod-
eling the variance-covariance structure between observations in the resulting (d + 1)-dimensional
space. Different joint time-space covariance structures have been proposed to model the hetero-
geneities between temporal and spatial dimensions based on different scenarios. The dynamic spec-
ification based paradigm considers spatio-temporal processes through a dynamical-statistical (or
state space based) framework. Observations in the current state are dependent on those in previous
states through their dynamic mechanical (or probabilistic) relationships. This chapter focuses on the
dynamic statistical paradigm, as it explicitly models the knowledge of the phenomenon under study,
always leads to a valid variance-covariance structure, and supports fast predictions [257], [183].
In recent years, a number of methods have been proposed for spatio-temporal prediction using different techniques, including the spatio-temporal Kalman filter and smoother [251, 270, 292, 227], multi-resolution dynamics [253], Bayesian inference [241], spatial dynamic factor analysis [267], sparse approximations [268], and Markov chain Monte Carlo (MCMC) methods [220]. In a recent advance, Cressie and Wikle [228] proposed a fixed rank spatio-temporal prediction (FR-STP) approach that reduces the STP problem to a fixed-dimension problem and thus allows predictions in linear time. The FR-STP assumes that 1) the spatial dependence can be captured by a predefined set of basis functions; 2) the temporal dependence can be modeled by a latent first-order Gaussian autoregressive process; and 3) the measurement error can be modeled by a Gaussian distribution. These assumptions make the FR-STP mainly applicable to linear dynamic environments.
However, the spatio-temporal dynamics of real applications are usually nonlinear, and some of
the FR-STP’s distribution assumptions are often violated. For example, the data may have a
number of outliers, such as random hardware failures in digital control systems [239, 271], sensor
faults in aerospace applications [200,201], co-channel fading and interference in wireless communica-
tions [202], and traffic incidents and malfunctioning detectors in urban traffic networks [250]. This
chapter presents a robust spatio-temporal prediction approach for applications in nonlinear dynamic
environments where some of the FR-STP assumptions are violated.
A number of robust methods have been proposed for different learning problems, including multi-
variate regression, Kalman filtering and smoothing, clustering, and independent component analysis
(e.g., [271, 234, 248, 221, 172, 171, 173]). The majority of these methods can be summarized using a
probabilistic framework [271] in which the measurement error is modeled by a heavy tailed distribu-
tion, such as the Huber, Laplace, Student's t, and Cauchy distributions, instead of the traditional Gaussian distribution. The prediction problem can then be reformulated as a maximum a posteriori (MAP) prediction problem conditional on the observations. However, employing heavy-tailed distributions makes the prediction process analytically intractable. Although stochastic simulation methods
have been applied to estimate an approximate posterior distribution, for example via MCMC or
particle filtering [234, 248, 254], these versatile methods are very computationally intensive. Jylänki et al. [255] presented an efficient expectation propagation algorithm for robust Gaussian process regression based on the Student's t distribution, while Svensén and Bishop [221] proposed a variational inference approach to robust Student's t mixture clustering. Gandhi and Mili [239] proposed a robust Kalman filter based on the Huber distribution and the iterative reweighted least squares (IRLS) method. An efficient Kalman smoother was presented by Aravkin et al. [205] based on the Laplace distribution and the convex composite extension of the Gauss-Newton method.
This chapter considers the same probabilistic framework as that used in existing robust methods.
Specifically, the Robust Fixed Rank Spatio-Temporal Prediction (RFR-STP) problem is first formulated; then efficient nonlinear optimization algorithms are designed to perform the MAP prediction, and the Laplace approximation is applied to calculate a measure of the uncertainty of the MAP prediction.
The main contributions can be summarized as follows:
• Formalization of the RFR-STP problem: A Robust Spatio-Temporal Random Effects (R-
STRE) model is proposed in which the measurement error follows a heavy tailed distribution,
in place of the traditional Gaussian distribution. The RFR-STP problem is then formalized
as a MAP prediction problem based on the R-STRE model.
• Design of a general RFR-STP algorithm: A general RFR-STP algorithm is proposed utilizing a
framework of Newton’s methods that can be applied to most existing heavy tailed distributions.
The RFR-STP outperformed the FR-STP in nonlinear environments, where some of the FR-
STP’s distribution assumptions are violated.
• Development of optimization techniques: For the special Huber and Laplace distributions, the corresponding RFR-STP problems with non-continuously differentiable objective functions were first reformulated as Quadratic Programming (QP) problems, and then primal-dual interior point methods were applied to achieve a near-linear-order time prediction efficiency.
• Comprehensive experiments to validate the new algorithm’s robustness and efficiency: The
RFR-STP was evaluated using an extensive simulation study and experiments on two real life
datasets. The results demonstrated that the RFR-STP outperformed the FR-STP when the
data were contaminated by a small portion of outliers.
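The IRLS idea mentioned above (and used, for example, in the Huber-based robust Kalman filter of Gandhi and Mili) can be illustrated on the simplest possible problem: a robust location estimate under the Huber loss. This toy sketch is not the RFR-STP algorithm itself; κ = 1.345 is a conventional tuning constant:

```python
import numpy as np

def huber_location(z, kappa=1.345, iters=50):
    """IRLS estimate of a location parameter under the Huber loss."""
    z = np.asarray(z, dtype=float)
    mu = np.median(z)                    # robust starting point
    for _ in range(iters):
        r = np.abs(z - mu)
        w = np.ones_like(r)
        big = r > kappa
        w[big] = kappa / r[big]          # downweight large residuals
        mu = np.sum(w * z) / np.sum(w)   # weighted-mean update
    return mu
```

With twenty observations at 1 and a single outlier at 100, the sample mean is pulled to roughly 5.7 while the Huber IRLS estimate stays close to 1, which is the robustness property the RFR-STP generalizes to the spatio-temporal setting.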
5.2 Theoretical Preliminaries
This section reviews the Spatio-Temporal Random Effects (STRE) model and the Fixed Rank Spatio-
Temporal prediction (FR-STP) approach based on the STRE model.
5.2.1 Spatio-Temporal Random Effects Model
Consider a real-valued spatio-temporal process, {Yt(s) : s ∈ D ⊂ R^d, t ∈ {1, 2, · · · }}, where D is the spatial domain under study, which can be finite or countably infinite. A discretized version of the process can be represented as

Y1, Y2, · · · , Yt, Yt+1, · · · , (5.1)

where Yt = [Yt(s1,t), Yt(s2,t), · · · , Yt(sMt,t)]^T, and St = {s1,t, s2,t, · · · , sMt,t} refers to the set of Mt study locations at time t. Observations and latent observations are related by the data process

Zt = Ot Yt + εt, t = 1, 2, · · · , (5.2)

where Zt = [Zt(s̃1,t), · · · , Zt(s̃Nt,t)]^T, εt = [εt(s̃1,t), · · · , εt(s̃Nt,t)]^T, and S̃t = {s̃1,t, s̃2,t, · · · , s̃Nt,t}. It is assumed that Nt ≤ Mt and S̃t ⊆ St, which means that only a subset of the locations in St have observations. The matrix Ot is an Nt × Mt incidence matrix (a matrix with solely zeros and ones) that is utilized to handle missing observations. The vector εt is a Gaussian random vector with mean zero and variance-covariance matrix σ²ε,t Vε,t, where Vε,t = diag(vε,t(sn,t), n = 1, · · · , Nt).
The vector Yt is given by the spatial process

Yt = Xt βt + νt, t = 1, 2, · · · , (5.3)

where Xt = [xt(s1,t), · · · , xt(sMt,t)]^T, xt(sn,t) ∈ ℜ^p refers to a vector of covariates, and the vector of coefficients βt = (β1,t, · · · , βp,t)^T is unknown. The random process νt captures the small-scale variation. For the traditional spatio-temporal Kalman filtering model, a large number of parameters need to be estimated and the time complexity is proportional to the cube of the number of observations. A key advantage of the STRE model is that the small-scale variation νt is given by a vector
of Spatial Random Effects (SRE) processes,

νt = St ηt + ξt, t = 1, 2, · · · , (5.4)

where St = [St(s1,t), · · · , St(sMt,t)]^T, and St(sn,t) = [S1,t(sn,t), · · · , Sr,t(sn,t)]^T, 1 ≤ n ≤ Mt, is a vector of r predefined spatial basis functions, such as wavelet and bisquare basis functions, and ηt is an r-dimensional zero-mean Gaussian random vector with an r × r covariance matrix Kt. The first component in Equation (5.4) denotes the smoothed small-scale spatial variation at time t, captured by the set of basis functions St.
The second component in Equation (5.4) captures the micro-scale variation in a similar way to
the nugget effect as defined in geostatistics [228]. It is assumed that ξt ∼ N(0, σ²ξ,t Vξ,t), where Vξ,t = diag(vξ,t(sn,t), n = 1, · · · , Mt), and vξ,t(·) describes the variance of the micro-scale variation and is typically considered to be known. The component ξt is indispensable, since it captures the extra uncertainty due to the dimension reduction in replacing νt by St ηt. The coefficient vector ηt is
given by a vector-autoregressive process of order one,
ηt = Htηt−1 + ζt, t = 1, 2, · · · , (5.5)
where Ht refers to the so-called propagator matrix, ζt ∼ N (0,Ut) refers to an r-dimensional inno-
vation vector, and Ut is known as the innovation matrix. The initial state η0 ∼ Nr(0,K0), where
K0 is in general unknown.
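Equation (5.5) is a standard vector AR(1) process, and simulating a sample path makes the temporal structure concrete. A sketch in which the propagator H and the Cholesky factors of the innovation and initial covariances are hypothetical, time-invariant inputs:

```python
import numpy as np

def simulate_latent_ar1(H, U_chol, K0_chol, T, seed=0):
    """Simulate eta_t = H eta_{t-1} + zeta_t, zeta_t ~ N(0, U), eta_0 ~ N(0, K0).

    U_chol and K0_chol are Cholesky factors of the innovation and initial
    covariance matrices (taken as time-invariant here for simplicity).
    """
    rng = np.random.default_rng(seed)
    r = H.shape[0]
    eta = K0_chol @ rng.standard_normal(r)      # draw eta_0
    path = np.empty((T, r))
    for t in range(T):
        eta = H @ eta + U_chol @ rng.standard_normal(r)
        path[t] = eta
    return path
```

Because r is small and fixed, each time step costs only O(r²), which is the source of the FR-STP's linear-time behavior in the number of observations.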
5.2.2 Fixed Rank Spatio-Temporal Prediction
Given a set of observations Z1, · · · , ZT, the spatio-temporal prediction problem is to predict the latent (or de-noised) values Y1, · · · , YT. As discussed in Subsection 5.2.1, the incidence matrix Ot allows for the specification of missing observations, which makes it possible to concurrently predict the latent Y values for both observed and unobserved locations. This is a smoothing problem if t < T, and a filtering problem if t = T. The Best Linear Unbiased Prediction (BLUP) based on the STRE model is referred to as Fixed Rank Spatio-Temporal Prediction (FR-STP) [228]. The computational complexity of the FR-STP is O(Σt Nt r³), where r refers to the number of basis functions used. In general, r is fixed with r ≪ Nt, and the time complexity equals O(Σt Nt). In contrast, the traditional spatio-temporal Kalman filter and smoother has a time complexity of O(Σt Nt³).
5.3 Problem Formulation
This section presents a robust version of the STRE model, namely the R-STRE model, and then
formalizes the Robust Fixed Rank Spatio-Temporal Prediction (RFR-STP) problem based on the
R-STRE model. Solutions to the RFR-STP problem will be presented in Section 5.4 and Section
5.5.
5.3.1 Robust Spatio-Temporal Random Effects Model
The proposed R-STRE model is defined as
Zt = OtYt + εt
Yt = Xtβ + Stηt + ξt,
ηt = Htηt−1 + ζt, t = 1, 2, · · · ,
in which most of the variables are defined as in the STRE model (See Subsection 5.2.1), except that
the measurement error εt(sn,t) now follows a heavy tailed distribution with the probability density function f(ε; µ, σ²) = (1/σ) h((ε − µ)/σ), where µ refers to the mean and σ refers to the dispersion parameter. Examples of the h function include 1) the Laplace distribution: h(x) = (1/2) e^(−|x|); 2) the Student's t distribution: h(x) = c (x + v)^(−(p+v)/2), where c is a normalization constant, the case v = 1 is the Cauchy density, and the limiting case v → ∞ yields the normal distribution; and 3) the Huber distribution: h(x) = c e^(−ϕ(x;κ)), with

ϕ(x;κ) = { κ|x| − κ²/2, for |x| > κ; x²/2, for |x| ≤ κ, (5.6)

where c is a normalization constant that ensures ∫ (c/σ) e^(−ϕ(x;κ)) dx = 1, and κ is a range parameter of the distribution. The probability density functions (pdf) of the Huber and Laplace distributions and the pdf of the Gaussian distribution are compared in Figure 5.1.
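To make the heavy-tail comparison in Figure 5.1 concrete, the sketch below (my own illustration; the value κ = 1.5 and the integration grid are arbitrary choices) normalizes the Huber density numerically and checks that both the Huber and Laplace densities place more mass at |x| = 4 than the standard Gaussian:

```python
import numpy as np

kappa = 1.5  # assumed range parameter, for illustration only

def phi_huber(x):
    # Huber loss phi(x; kappa): quadratic near zero, linear in the tails
    ax = np.abs(x)
    return np.where(ax <= kappa, 0.5 * x**2, kappa * ax - 0.5 * kappa**2)

# normalize h(x) = c * exp(-phi(x)) numerically on a wide grid
xs = np.linspace(-30.0, 30.0, 600001)
dx = xs[1] - xs[0]
c = 1.0 / (np.exp(-phi_huber(xs)).sum() * dx)

huber = lambda x: c * np.exp(-phi_huber(x))
laplace = lambda x: 0.5 * np.exp(-np.abs(x))
gauss = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# heavy tails: at |x| = 4 both robust densities dominate the Gaussian
print(huber(4.0), laplace(4.0), gauss(4.0))
assert huber(4.0) > gauss(4.0) and laplace(4.0) > gauss(4.0)
```

The heavier tails are precisely what lets gross outliers be absorbed by the measurement-error term rather than distorting the fitted field.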
5.3.2 Problem Formulation
Assume that the model parameters Ψ = {σ²ε, σ²ξ, β, H1:T, U1:T, K0} have been estimated [228, 259], and we observe Z = {Z1, · · · , ZT}. The RFR-STP is defined as the procedure used to infer the
posterior distribution p(Y1:T |Z1:T ;Ψ) based on the R-STRE model. Because the R-STRE model
employs a non-Gaussian distribution to model the measurement error, the inference of the posterior
distribution becomes analytically intractable. However, efficient numerical optimizations can be
applied to calculate a MAP estimate, which is a mode of the posterior distribution p(Y1:T |Z1:T ;Ψ),
and a Laplace approximation can then be applied to calculate an approximate measure of the
uncertainty (variance-covariance matrix) of the MAP estimate. The Laplace approximation is a
popular approximate inference method that identifies the Gaussian distribution that best fits a
given pdf, and the estimated mean is identical to the mode of the pdf function and is consistent
with a MAP estimate.
Figure 5.1: pdfs of Heavy Tailed Distributions: (a) Gaussian(0,1) vs. Huber(0,1) pdfs; (b) Gaussian(0,1) vs. Laplace(0,1) pdfs.
Let Yt|T and Σt|T be the MAP and variance-covariance matrix estimates of p(Yt|Z1:T; Ψ), and let ηt|T, ξt|T, and Gt|T be the MAP estimates and precision matrix of the joint posterior p(ηt, ξt|Z1:T). As with
the FR-STP, it can be derived that
Yt|T = Xtβ + St ηt|T + ξt|T, (5.7)
Σt|T = [St, I] G⁻¹t|T [St, I]^T, (5.8)
where I refers to an identity matrix.
The key step in the RFR-STP is to estimate the components ηt|T , ξt|T , and Gt|T . The first two
components η1:T |T and ξ1:T |T can be estimated by solving the following MAP optimization problem
minimize over η1:T, ξ1:T: − ln p(η1:T, ξ1:T | Z1:T; Ψ). (5.9)
A general approach to problem (5.9) will be presented in Section 5.4, followed by several optimization
techniques in Section 5.5. The precision matrix Gt|T will be estimated via Laplace approximation
in Subsection 5.4.2.
5.4 A General Approach
This section presents a general approach to the RFR-STP problem, without assuming any specific
distribution of the measurement error εt(sn,t). Here, a general form f(εt(sn,t); 0, σ²ε) is used to
denote the probability density function of εt(sn,t), where σε refers to the dispersion parameter.
As discussed in Subsection 5.3.2, the key step is to calculate the MAP estimate η1:T |T , ξ1:T |T
and the precision matrix G1:T |T . These will be discussed in Subsection 5.4.1 and Subsection 5.4.2,
respectively.
5.4.1 MAP Estimation of η1:T |T , ξ1:T |T
The MAP estimate η1:T |T , ξ1:T |T can be calculated by solving the following optimization problem
minimize over η1:T, ξ1:T: − ln p(η1:T, ξ1:T | Z1:T; Ψ). (5.10)
Considering only ηt and ξt as variables, the negative logarithm of the pdf can be rewritten as
− ln p(η1:T, ξ1:T | Z1:T; Ψ)
= ∑t 1^T ρ(σ⁻¹ε,t V^(−1/2)ε,t (Zt − OtXtβ − OtStηt − Otξt)) + (1/2) ∑t (ηt − Htηt−1)^T U⁻¹t (ηt − Htηt−1) + (1/2) ∑t σ⁻²ξ,t ξt^T V⁻¹ξ,t ξt + const
= 1^T ρ(Z − OSη − Oξ) + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + const, (5.11)

where ρ(·) = − ln h(·), and (with a slight abuse of notation, reusing Zt and Ot for their standardized versions) Zt = σ⁻¹ε,t V^(−1/2)ε,t (Zt − OtXtβ), Z = [Z1^T, · · · , ZT^T]^T, S = diag(S1, · · · , ST), Ot = σ⁻¹ε,t V^(−1/2)ε,t Ot, O = diag(O1, · · · , OT), Λξ,t = σ⁻²ξ,t V⁻¹ξ,t, Λξ = diag(Λξ,1, · · · , Λξ,T), η = [η1^T, · · · , ηT^T]^T, and ξ = [ξ1^T, · · · , ξT^T]^T. The definitions of the matrices M and E are given in Appendix A.2.
Problem (5.10) can then be simplified as
minimize over η, ξ: 1^T ρ(Z − OSη − Oξ) + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + const. (5.12)
The optimal solution to problem (5.12) must satisfy
−S^T O^T ψ(Z − OSη − Oξ) + Mη + E = 0, (5.13)
−O^T ψ(Z − OSη − Oξ) + Λξ ξ = 0, (5.14)
where ψ(·) = ∇ρ(·). In some special situations (e.g., for a Gaussian distribution), an analytical
solution may be obtained by solving the above system of equations. However, for heavy tailed
distributions such as the Huber, Laplace, and Student’s t distributions, problem (5.12) is analytically
intractable and efficient nonlinear optimization techniques need to be developed.
Theorem 1 Let φ(x) = d²ρ(x)/dx². If φ(x) is nonnegative everywhere, then problem (5.12) is a strictly convex optimization problem.
Proof To prove the convexity, it suffices to prove that the Hessian matrix Ω is positive definite. Let f = − ln p(Z1:T, η, ξ; Ψ) and Ω := [P, C; C^T, R], where P = ∂²f/∂η², C = ∂²f/∂η∂ξ, and R = ∂²f/∂ξ². By the property of Schur complements, the Hessian matrix Ω is positive definite if R and P − C R⁻¹ C^T are positive definite. These two positive definiteness conditions can be proved readily.
The condition required for Theorem 1 is satisfied by most heavy tailed distributions, including the Huber, Laplace, Student's t, and Cauchy distributions. This theorem ensures that problem (5.12) is convex and that a global optimum can be obtained using convex optimization techniques. Considering the high dimensionality of the variable ξ, we present an iterative optimization algorithm based on the framework of Newton's method, in which the variables η and ξ are optimized alternately until convergence. One interesting observation is that when η is fixed, the optimization of ξ can be separated into T independent sub-optimizations of ξt, t = 1, · · · , T, which further reduces the required memory space and time cost of the computation. Denote Φ(x) = ∇²ρ(x) = diag(φ(x1), · · · , φ(xm)), where m refers to the dimension of x. The general RFR-STP algorithm based on Newton's method is described in Algorithm 1.
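The alternating block scheme can be sketched on a single-time toy instance of problem (5.12). Everything below is hypothetical (Algorithm 1 itself is not reproduced in this chapter): dimensions and matrices are made up, Ot = I, and ρ is the γ = 1 log-cosh surrogate of |x|, so this illustrates the block structure rather than the dissertation's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 40, 4                          # hypothetical toy dimensions
S = rng.standard_normal((N, r))       # basis matrix; O_t = I, single time step
M, E = np.eye(r), np.zeros(r)         # stand-ins for the prior terms on eta
Lam = 10.0 * np.eye(N)                # Lambda_xi
Z = S @ rng.standard_normal(r) + 0.1 * rng.standard_normal(N)
Z[::10] += 5.0                        # inject a few gross outliers

rho = lambda x: np.log(np.cosh(x))    # smooth surrogate of |x| (gamma = 1)
psi = lambda x: np.tanh(x)            # rho'
phi = lambda x: 1.0 / np.cosh(x)**2   # rho''

def obj(eta, xi):
    res = Z - S @ eta - xi
    return rho(res).sum() + 0.5 * eta @ M @ eta + E @ eta + 0.5 * xi @ Lam @ xi

eta, xi = np.zeros(r), np.zeros(N)
for _ in range(100):
    # eta block: one damped Newton step with xi fixed
    res = Z - S @ eta - xi
    g = -S.T @ psi(res) + M @ eta + E
    d = np.linalg.solve(S.T @ np.diag(phi(res)) @ S + M, g)
    t = 1.0
    while obj(eta - t * d, xi) > obj(eta, xi) and t > 1e-8:
        t *= 0.5                      # backtracking keeps the objective decreasing
    eta = eta - t * d
    # xi block: in the full model this step separates over t = 1, ..., T
    res = Z - S @ eta - xi
    g = -psi(res) + Lam @ xi
    d = np.linalg.solve(np.diag(phi(res)) + Lam, g)
    t = 1.0
    while obj(eta, xi - t * d) > obj(eta, xi) and t > 1e-8:
        t *= 0.5
    xi = xi - t * d

res = Z - S @ eta - xi
grad_eta = np.abs(-S.T @ psi(res) + M @ eta + E).max()
grad_xi = np.abs(-psi(res) + Lam @ xi).max()
print(grad_eta, grad_xi)  # stationarity residuals of (5.13)-(5.14), near zero at the optimum
```

Because the Hessian blocks are positive definite (Theorem 1), each Newton direction is a descent direction, and the alternating iteration drives both stationarity conditions toward zero.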
5.4.2 LA Estimation of the Precision Matrix G1:T |T
The precision matrix Gt|T can be decomposed into four components, Gt|T = [Pt|T, Ct|T; Ct|T^T, Rt|T], which can be estimated via Laplace approximation (LA) as

Pt|T = ∂²[− ln p(η1:T, ξ1:T | Z1:T; Ψ)]/∂ηt² |(ηt = ηt|T, ξt = ξt|T) = St^T Ot^T Φ(Zt − OtStηt|T − Otξt|T) Ot St + Ht+1^T U⁻¹t+1 Ht+1 + U⁻¹t,

Rt|T = ∂²[− ln p(η1:T, ξ1:T | Z1:T; Ψ)]/∂ξt² |(ηt = ηt|T, ξt = ξt|T) = Ot^T Φ(Zt − OtStηt|T − Otξt|T) Ot + Λξ,t,

Ct|T = ∂²[− ln p(η1:T, ξ1:T | Z1:T; Ψ)]/∂ηt∂ξt |(ηt = ηt|T, ξt = ξt|T) = St^T Ot^T Φ(Zt − OtStηt|T − Otξt|T) Ot,
where Φ(Zt − OtStηt|T − Otξt|T) = diag(φ(Zt − OtStηt|T − Otξt|T)), φ(x) = [φ(x1), · · · , φ(xNt)]^T, and φ(xn) = d²ρ(xn)/dxn², n = 1, · · · , Nt. For example, φ(x) = 2 for a Gaussian distribution, and φ(x) = −(p + v)(x + v)⁻² for a Student's t distribution. For a Laplace distribution, the second order derivative φ(x) equals zero everywhere except at x = 0, where it does not exist. In order to make φ(x) well defined everywhere, the corresponding ρ function is often approximated by the smooth function

ρ(x) = ln(cosh(γx))/γ + (1/2) ϵ x², (5.15)

where cosh(s) = (e^s + e^(−s))/2. The parameter γ > 0 is fixed, and the first term converges to |x| as γ → ∞. The second, optional quadratic term in Equation (5.15) is used to stabilize the optimization algorithms, with ϵ set to a small positive value (e.g., 0.01). Based on this smoothing approximation,
the second order derivative can be calculated as

φ(x) = γ/cosh(γx)² + ϵ. (5.16)
Figure 5.2 (b) visualizes the approximate ρ function for the Laplace distribution with the default ϵ = 0.01 and γ = 0.5, 1, 2. It indicates that the higher the value of γ, the closer the approximate ρ and φ functions are to the true ρ and φ functions.
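The smoothing approximation (5.15) and its curvature can be checked numerically. The sketch below (γ = 2 and ϵ = 0.01 are illustrative settings) verifies by finite differences that the second derivative of (5.15) equals γ/cosh(γx)² + ϵ, and that the surrogate tracks |x|:

```python
import numpy as np

gamma, eps = 2.0, 0.01  # illustrative parameter choices

rho_s = lambda x: np.log(np.cosh(gamma * x)) / gamma + 0.5 * eps * x**2
phi_s = lambda x: gamma / np.cosh(gamma * x)**2 + eps   # analytic second derivative

xs = np.array([-4.0, -1.0, -0.1, 0.0, 0.1, 1.0, 4.0])
print(np.abs(rho_s(xs) - np.abs(xs)).max())  # approximation error vs. |x|

# finite-difference check of the second derivative
h = 1e-4
fd = (rho_s(xs + h) - 2 * rho_s(xs) + rho_s(xs - h)) / h**2
assert np.allclose(fd, phi_s(xs), atol=1e-4)
```

Raising γ tightens the fit to |x| at the cost of larger curvature near the origin, which is the trade-off the figure illustrates.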
Figure 5.2: Approximations of Heavy Tailed Distributions: (a) the Huber distribution, ρ(x) for Huber(0, 1; κ) with κ = 1.5, 2, 3; (b) the Laplace distribution, |x| and its smooth approximations with γ = 0.5, 1, 2.
For a Huber distribution, the second order derivative φ(x) exists everywhere except at the points |x| = κ:

φ(x) = { 0, for |x| > κ; 1, for |x| < κ; not existent, for |x| = κ.

In order to make φ(x) well defined everywhere, the ρ function is often approximated by the following smooth function:

ρ(x) = { κ|x| ln(|x|/κ) + κ²/2, for |x| > κ; x²/2, for |x| ≤ κ.

The approximate φ function for a Huber distribution is then

φ(x) = { κ/|x|, for |x| > κ; 1, for |x| ≤ κ.
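This surrogate can be sanity-checked numerically. Note that the |x| > κ branch written as κ|x| ln(|x|/κ) + κ²/2 is my own reconstruction (chosen so that ρ is continuously differentiable and has second derivative κ/|x| there), so treat the sketch as an illustration, not the dissertation's exact formula:

```python
import numpy as np

kappa = 1.5  # illustrative range parameter

def rho_smooth(x):
    # smooth Huber surrogate; the |x| > kappa branch is a reconstruction
    # chosen so that rho is C^1 with rho'' = kappa/|x| there
    ax = np.abs(x)
    outer = kappa * ax * np.log(np.maximum(ax, kappa) / kappa) + 0.5 * kappa**2
    return np.where(ax <= kappa, 0.5 * x**2, outer)

def phi_smooth(x):
    ax = np.abs(x)
    return np.where(ax <= kappa, 1.0, kappa / ax)

# continuity at the junction |x| = kappa
h = 1e-6
assert abs(float(rho_smooth(kappa - h)) - float(rho_smooth(kappa + h))) < 1e-4

# second derivative matches phi away from the junction (finite differences)
xs = np.array([-4.0, -2.0, -0.5, 0.5, 2.0, 4.0])
fd = (rho_smooth(xs + 1e-4) - 2 * rho_smooth(xs) + rho_smooth(xs - 1e-4)) / 1e-8
assert np.allclose(fd, phi_smooth(xs), atol=1e-3)
print("surrogate is continuous, with curvature 1 inside and kappa/|x| outside")
```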
Figure 5.2 (a) visualizes the ρ function for the Huber distribution with different parameter settings κ = 1.5, 2, 3, where the notation Huber(0, 1; 1.5) refers to a Huber distribution with a mean, variance, and κ equal to 0, 1, and 1.5, respectively.
5.5 Optimization Techniques
This section presents two primal-dual optimization techniques for the RFR-STP algorithm when the
Huber or Laplace distribution is selected to model the measurement error.
5.5.1 Primal-Dual Optimization for Huber Distribution
This subsection explores the special structure of the R-STRE model based on the Huber distribution,
and presents a primal-dual interior point algorithm to achieve a close-to-linear-order time efficiency.
The Huber distribution is one of the most popular heavy tailed distributions used in robust statistics
[271]. In the R-STRE model, the Huber distribution is used to model the measurement error: the
random variable εt(sn,t) ∼ Huber(0, σε,t √vt(sn,t), κ). The pdf of the Huber distribution is defined as p(ε; µ, σ, κ) = (1/σ) h((ε − µ)/σ; κ), where h(x;κ) = c e^(−ϕ(x;κ)) and

ϕ(x;κ) = { κ|x| − κ²/2, for |x| > κ; x²/2, for |x| ≤ κ,

where c is a normalization constant to ensure that the integral ∫ (c/σ) e^(−ϕ(x;κ)) dx = 1, and κ is a range parameter of the distribution. The MAP optimization problem to be addressed is
minimize over η, ξ: 1^T ϕ(Z − OSη − Oξ) + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + const. (5.17)
Dual Problem
To derive a Lagrange dual of the primal problem stated in Equation (5.17), we first introduce a new variable r and a new equality constraint r = Z − OSη − Oξ. The primal problem can be reformulated as

minimize over η, ξ, r: 1^T ϕ(r) + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + const
subject to r = Z − OSη − Oξ. (5.18)
Associating an auxiliary (dual) variable ω with the equality constraint, we can derive the Lagrangian as

L(η, ξ, r, ω) = 1^T ϕ(r) + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + ω^T (r − Z + OSη + Oξ) + const.
Theorem 2 The dual function is

inf over η, ξ, r of L(η, ξ, r, ω) = { −ω^T Z − (1/2) ω^T O(SM⁻¹S^T + Λ⁻¹ξ)O^T ω − ω^T OSM⁻¹E − (1/2) ω^T ω + const, if |ω| ≤ κ1; −∞, otherwise,

and the minimizers satisfy

η = −M⁻¹(S^T O^T ω + E), (5.19)
ξ = −Λ⁻¹ξ O^T ω. (5.20)
Proof See Appendix A.3.
Let G = O(SM⁻¹S^T + Λ⁻¹ξ)O^T + I. The dual problem can be reformulated as

minimize over ω: ω^T Z + ω^T OSM⁻¹E + (1/2) ω^T Gω + const
subject to ω − κ1 ≤ 0, −ω − κ1 ≤ 0. (5.21)
Here, the condition “ω − κ1 ≤ 0” means that ωi − κ ≤ 0 for all i. The above problem is a Quadratic Programming (QP) problem with the variable ω. A point ω satisfying ω − κ1 < 0 and −ω − κ1 < 0 is said to be strictly dual feasible. After we have obtained the solution ω of the dual
problem stated in Equation (5.21), the primal variables η and ξ can be recovered via Equations
(5.19) and (5.20).
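The recovery relations (5.19)-(5.20) can be exercised end-to-end on a toy instance. In the sketch below all matrices and sizes are made up, and a plain projected-gradient loop stands in for the interior point solver described next; it solves the box-constrained dual (5.21) and checks the Huber KKT condition linking ω to the residual:

```python
import numpy as np

rng = np.random.default_rng(2)
N, r, kappa = 30, 3, 1.5                                  # hypothetical sizes
S = rng.standard_normal((N, r)); O = np.eye(N)            # O_t = I for simplicity
M = np.eye(r) + 0.1 * np.ones((r, r)); E = rng.standard_normal(r)
Lam = 5.0 * np.eye(N)
Z = rng.standard_normal(N); Z[0] += 8.0                   # one gross outlier

A = O @ S
G = A @ np.linalg.solve(M, A.T) + O @ np.linalg.solve(Lam, O.T) + np.eye(N)
b = Z + A @ np.linalg.solve(M, E)                         # linear term of the dual QP

# projected gradient on the dual (5.21): the box constraint |omega| <= kappa
# makes the projection a simple clip
omega = np.zeros(N)
step = 1.0 / np.linalg.eigvalsh(G).max()
for _ in range(2000):
    omega = np.clip(omega - step * (G @ omega + b), -kappa, kappa)

# recover the primal variables via (5.19)-(5.20)
eta = -np.linalg.solve(M, A.T @ omega + E)
xi = -np.linalg.solve(Lam, O.T @ omega)

# KKT check: omega equals minus the clipped (Huber-score) residual
res = Z - A @ eta - O @ xi
assert np.allclose(omega, -np.clip(res, -kappa, kappa), atol=1e-4)
print("dual solution recovers a primal point satisfying the Huber KKT conditions")
```

The clip in the KKT check is exactly ϕ'(x) = min(max(x, −κ), κ) for the Huber loss, which is why dual feasibility is a box constraint.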
A Primal-Dual Interior-Point Method
The QP problem stated in Equation (5.21) can be solved using standard numerical optimization techniques such as steepest descent, Newton's method, or interior point methods. This subsection exploits the special structure of the QP problem (5.21) by proposing a primal-dual interior point method. In the worst case, primal-dual interior point methods require a number of iterations equal to O((∑t Nt)^(1/2)). However, in practice this class of methods usually solves QP problems in a number of steps that is independent of the data size. If the QP problem is processed correctly, the computation cost is usually dominated by the cost of computing the search directions, with time complexity O(∑t Nt). Therefore, the total complexity is O(∑t Nt), which is the same as the time cost of the traditional FR-STP approach.
The Lagrangian of problem (5.21) is

L(ω, λ1, λ2) = ω^T Z + ω^T OSM⁻¹E + (1/2) ω^T Gω + λ1^T (ω − κ1) + λ2^T (−ω − κ1). (5.22)

In order to select the search direction for the Newton step, we define the residual vector as

rt(ω, λ1, λ2) = [ G^T ω + Z + OSM⁻¹E + λ1 − λ2;
−diag(λ1)(ω − κ1) − (1/t)1;
−diag(λ2)(−ω − κ1) − (1/t)1 ],
where ω is primal feasible and λ1, λ2 are dual feasible, with a duality gap of m/t. If ω, λ1, λ2 satisfy rt(ω, λ1, λ2) = 0, then they form the optimal solution of problem (5.21). When t → ∞, the condition rt(ω, λ1, λ2) = 0 reduces to the standard Karush-Kuhn-Tucker (KKT) conditions for the dual problem (5.21). The first component, rdual, is called the dual residual, and the remaining two components are centrality residuals. The basic idea of the primal-dual interior point method is to iteratively apply a Newton step to solve the nonlinear equations rt(ω, λ1, λ2) = 0 with increasing values of t.
Denote the current estimate and the Newton update as g = (ω, λ1, λ2) and ∆g = (∆ω, ∆λ1, ∆λ2), respectively. The Newton step can now be represented by a system of linear equations

rt(g + ∆g) ≈ rt(g) + Drt(g)∆g = 0. (5.23)

In terms of ω, λ1, and λ2, we have

[ G, I, −I;
−diag(λ1), −diag(ω − κ1), 0;
diag(λ2), 0, diag(ω + κ1) ] [ ∆ω; ∆λ1; ∆λ2 ] = − [ G^T ω + Z + OSM⁻¹E + λ1 − λ2;
−diag(λ1)(ω − κ1) − (1/t)1;
−diag(λ2)(−ω − κ1) − (1/t)1 ].
The primal-dual search direction ∆gpd = (∆ω, ∆λ1, ∆λ2) is defined as the solution of the following set of linear equations (5.24):

∆λ1 = −λ1 − diag(ω − κ1)⁻¹ [(1/t)1 + diag(λ1)∆ω],
∆λ2 = −λ2 + diag(ω + κ1)⁻¹ [(1/t)1 − diag(λ2)∆ω],
∆ω = (B + D)⁻¹ (−Bω + F),
B = OSM⁻¹S^T O^T,
D = OΛ⁻¹ξ O^T + I − diag(ω − κ1)⁻¹ diag(λ1) + diag(ω + κ1)⁻¹ diag(λ2), (5.24)
F = −OΛ⁻¹ξ O^T ω − ω − Z − OSM⁻¹E − (1/t) diag(κ1 − ω)⁻¹ 1 + (1/t) diag(ω + κ1)⁻¹ 1.
The primal-dual interior point algorithm is summarized in Algorithm 2.
5.5.2 Primal-Dual Optimization for Laplace Distribution
In this subsection, we consider how the Laplace distribution can be used to model the measurement
error. The Laplace distribution is another popular heavy tailed distribution that is widely used in robust statistics [271]. In the R-STRE model, the measurement error εt(sn,t) ∼ Laplace(0, σε,t √vt(sn,t)), and the pdf is defined as p(ε; µ, σ) = (1/σ) h((ε − µ)/σ), with h(x) = (1/2) e^(−|x|).
The MAP optimization problem to be solved is

minimize over η, ξ: 1^T |Z − OSη − Oξ| + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + const. (5.25)
To derive a Lagrange dual of the primal problem stated in Equation (5.25), we first introduce a new variable r and a new equality constraint r = Z − OSη − Oξ. The primal problem (5.25) can be reformulated as

minimize over η, ξ, r: 1^T |r| + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ
subject to r = Z − OSη − Oξ.
Associating an auxiliary variable ω with the equality constraint, we derive the Lagrangian as

L(η, ξ, r, ω) = 1^T |r| + (1/2) η^T Mη + E^T η + (1/2) ξ^T Λξ ξ + ω^T (r − Z + OSη + Oξ).
Theorem 3 The dual function is

inf over η, ξ, r of L(η, ξ, r, ω) = { −(1/2) ω^T O(SM⁻¹S^T + Λ⁻¹ξ)O^T ω − ω^T Z − ω^T OSM⁻¹E + const, if −1 ≤ ω ≤ 1; (5.26) −∞, otherwise,

and the minimizers satisfy

η = −M⁻¹(S^T O^T ω + E), (5.27)
ξ = −Λ⁻¹ξ O^T ω. (5.28)
Proof See Appendix A.4.
By Theorem 3, the dual problem can be formalized as

minimize over ω: ω^T Z + (1/2) ω^T O(SM⁻¹S^T + Λ⁻¹ξ)O^T ω + ω^T OSM⁻¹E + const
subject to −1 ≤ ω ≤ 1. (5.29)
The above dual problem is a QP problem with the variable ω. A point ω satisfying −1 < ω < 1 is said to be strictly dual feasible. After we have obtained the solution ω of the dual
problem stated in Equation (5.29), the primal variable can be recovered via Equations (5.27) and
(5.28). An efficient primal-dual interior point algorithm can be designed, which has the same steps
as Algorithm 2 except for the formulas for calculating the components ω, λ1, and λ2. Readers are
referred to [236] for the detailed implementation.
5.5.3 Time and Space Complexity Analysis
This section evaluates the time and space complexity of the proposed RFR-STP-Huber algorithm,
which is designed based on interior point methods. Suppose the required number of interior point
iterations is L. As indicated in Algorithm 2, for each iteration the dominant time cost lies in the calculation of the component ∆ω, which has the decomposition form

∆ω = (B + D)⁻¹(−Bω + F).
The inversion of the matrix (B + D) can be calculated using the Sherman-Morrison-Woodbury formula

(OSM⁻¹S^T O^T + D)⁻¹ = D⁻¹ − D⁻¹OS(M + S^T O^T D⁻¹OS)⁻¹S^T O^T D⁻¹.
Hence, the cost of inverting a square matrix of a large size (∑t Nt) is reduced to the cost of inverting a square matrix of a much smaller size (Tr), and the time complexity is reduced from O((∑t Nt)³) to O(∑t Nt r³). Note that the component (M + S^T O^T D⁻¹OS) is sparse, and the inversion can be calculated using efficient solvers for sparse systems of linear equations. In practice, the actual time cost is close to O(∑t Nt r³). Therefore, the total time cost of the proposed RFR-STP-Huber algorithm is O(L ∑t Nt r³). The total space cost is dominated by the required space of the matrix (M + S^T O^T D⁻¹OS), which takes O((∑t Nt)²). However, because this matrix is sparse, the space cost of the compressed form reduces to O(∑t Nt · r).
5.6 Experiments
This section focuses on the Huber distribution as a case study, and evaluates the robustness and effi-
ciency of the proposed RFR-STP based on both simulated and real-life data sets. All the experiments
were conducted on a PC with an Intel(R) Core(TM) i7 Q740 CPU at 1.73 GHz and 8.00 GB of memory. The development tool was MATLAB 2011. Note that we re-implemented all the competing methods based on their original papers, because the original implementations are unavailable. Although we strictly followed these papers, we cannot guarantee that we fully and accurately implemented those methods and optimally tuned the related parameters.
Figure 5.3: Experiment Design
As shown in Figure 5.3, the experimental design consisted of the following steps: 1) Data Preprocessing, in which the raw data was preprocessed to obtain the clean data Z, a log-transformation was applied to Z so that its distribution was close to symmetric, and the study region was selected; 2) Parameter Estimation, in which the parameters of the STRE model were estimated from Z using the EM algorithm [259]; 3) Data Contamination, in which isolated or regional (clusters of) outliers were added to the clean data Z to obtain the contaminated data Z̃; 4) Prediction, in which the FR-STP was applied to the clean data Z to obtain the predicted values, taken as the “true” Y, and the FR-STP and RFR-STP were then applied to Z̃ to obtain the predictions Ŷ; 5) Results Evaluation, in which the mean absolute percentage error (MAPE) and Root Mean Square Error (RMSE) between Y and Ŷ were calculated:
MAPE = (1/∑t Nt) ∑t ∑n |Ŷtn − Ytn| / |Ytn|, (5.30)

RMSE = [ (1/∑t Nt) ∑t ∑n (Ŷtn − Ytn)² ]^(1/2). (5.31)
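The two error measures above translate directly into code; the toy arrays below are my own illustration:

```python
import numpy as np

def mape(y_hat, y):
    # mean absolute percentage error over all times and locations
    return np.mean(np.abs(y_hat - y) / np.abs(y))

def rmse(y_hat, y):
    # root mean square error over all times and locations
    return np.sqrt(np.mean((y_hat - y) ** 2))

y = np.array([1.0, 2.0, 4.0, 5.0])        # "true" values
y_hat = np.array([1.1, 1.8, 4.0, 5.5])    # predictions
print(round(mape(y_hat, y), 4), round(rmse(y_hat, y), 4))  # → 0.075 0.2739
```

In the experiments the averages run over every time t and location n, so the two functions would simply be applied to the flattened Y and Ŷ arrays.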
Subsection 5.6.1 presents a comprehensive simulation study, and Subsections 5.6.2 and 5.6.3 present
empirical evaluations based on two real-life datasets.
5.6.1 Simulation Study
This subsection presents a simulation study comparing the robustness of the proposed RFR-STP approach with that of the FR-STP approach. We considered the same simulation model as that used in the original FR-STP paper [228] to generate the simulated data.
1) Simulation Settings
The spatial domain was one-dimensional and consisted of the observation locations D = {s : s = 1, · · · , 256}. The temporal domain ranged from t = 1 to t = 100. The trend component µt(s) was assumed to be identically zero, and the values Yt and Zt were simulated according to Equations (5.2) and (5.3). A stationary process was used with the settings St = S, Ht = H, and Ut = U. The small-scale (autoregressive) process ηt was generated with the matrix parameters H and U. The spatial basis functions S were defined by 30 W-wavelets from the first four resolutions [276].
Two types of outliers were considered, including isolated and regional (clusters of) outliers. First,
for isolated outliers, we randomly picked locations from s = 1 to s = 256 and times from t = 1 to
t = 100, and then shifted the related observations to large values (e.g., ±5). The normal observations
Zt(s) were between -2.4 and 2.4. Five scenarios were generated with 5, 10, 15, 30, and 50 isolated
outliers, respectively. Second, for regional outliers, we randomly picked locations and times, and
shifted the related observations in the same way as for the isolated outliers generation. Three
scenarios with 2 regional outliers in each were generated, with outlier region sizes set at 5, 10, and
15, respectively. We also tested a variety of other scenarios for both isolated and regional outliers
and observed patterns that were consistent with the result we reported here.
2) Robustness of the RFR-STP
Figure 5.4 illustrates the impacts of isolated outliers on different prediction algorithms for four
different times and with different numbers of outliers. Each sub-figure depicts four curves that are
related to the original observations Zt, the contaminated observations Zt, the predicted values Yt by
the FR-STP, and the predicted values Yt by the proposed RFR-STP, respectively. Note that all the
prediction algorithms were conducted on the contaminated observations Zt. The X-axis refers to the
location index, with a total of 256 distinct locations. The Y-axis denotes the Z (or the predicted Y)
values. The symbol “t” refers to the time stamp. For ease of visualization, all outlier observations
were randomly set to 5 or -5. The results indicate that with increasing number of outliers, the Y
curve predicted by the FR-STP was clearly distorted to an increasing degree. In comparison, the
proposed RFR-STP demonstrated a high degree of resilience to outlier effects. Even for the case of
a high rate of contamination (e.g., 50 outliers, around 20% of the total), the proposed RFR-STP
still predicted the true Yt very accurately. This pattern is especially clear in predicting the Y values
at unobserved locations from s = 113 to s = 127, as shown in Figures 5.4 (a) to (d).
Figure 5.5 illustrates the impacts of regional outliers on different prediction algorithms at two times
and with different outlier region sizes (the number of adjacent outliers). When the outlier region
size is small (e.g., 5 adjacent outliers), the proposed RFR-STP had a high prediction accuracy at
all locations, whereas the FR-STP was very sensitive to regional outliers and had a much lower
prediction accuracy in regions around outliers and unobserved locations. At locations distant from
the outlier region, the predictions by the RFR-STP were almost the same as the predictions by the
FR-STP. This indicates a particular strength of the RFR-STP approach: although it performed as
well as the original FR-STP in nominal conditions, it was more accurate when outliers were involved.
However, we observed that large regional outliers had significant impacts on both the FR-STP and
RFR-STP approaches. When the outlier region size was increased to a large value, such as 10 or 15,
both the FR-STP and RFR-STP were adversely affected and their predictions around the outlier
region were close to the outlier values. This can be potentially interpreted by the STRE model
assumptions (see Subsection 5.2.1) that define the spatio-temporal dependence between Z(si; u) and Z(sj; t), with i ≠ j or u ≠ t. In particular, the STRE model assumes a Gaussian process to model the spatial dependence between Z(si; t) and Z(sj; t), i ≠ j, so observations will have a high spatial correlation if they are spatially close. For the temporal dependence, the STRE model assumes a first order autoregressive Markov process. That is, in addition to its dependence on observations over other locations at time t, Zt is also dependent on the previous observations Zt−1. Hence, the STRE model uses a spatial Gaussian process, lag-1 temporal autocorrelation, and white
noise (Gaussian distribution) to model the whole data variation. The proposed R-STRE model is
similar to the STRE model except that heavy tailed distributions such as the Huber and Laplace
distributions are utilized to model the white noise (the measurement error), instead of a Gaussian
distribution.
Spatio-temporal outliers can be interpreted as observations that have abnormally low correlations
with their spatio-temporal neighbors, considering normal deviations due to measurement error (white
noise). For the regular STRE model, when a data set has outliers the additional variation due to
those outliers will be captured by distorting the spatio-temporal dependence (or the sharpness of
the predicted Y curve). The white noise component is unable to handle large deviations due to
the light-tailed feature of the Gaussian distribution. This explains the distorted blue curves shown
in Figures 5.5 (a) and (b). A specific spatio-temporal autocorrelation pattern is associated with a
specific degree of sharpness of the resulting smoothed curves. In comparison, the proposed R-STRE
model uses heavy tailed distributions to model the measurement error. When outliers appear, the
proposed R-STRE model directly captures the additional large variation due to outliers as the
measurement error. When the outlier region becomes large, however, it becomes possible to use
normal spatio-temporal autocorrelations to directly capture the outlier variation. Intuitively, we are
able to use a smooth and unsharp curve to fit the observations well. This may explain why the
proposed RFR-STP failed to predict the correct Y values at locations close to the regional outliers
for large regional outliers.
3) Computational Efficiency of the RFR-STP
Table 5.1 compares the time cost for the RFR-STP and FR-STP approaches for different scenarios.
The results indicate that the optimized RFR-STP consistently achieved the same order of time
efficiency as FR-STP in all the scenarios tested. In contrast, the general RFR-STP had a time
efficiency that was around ten times lower than either the FR-STP or the optimized RFR-STP. One interpretation is that the optimized RFR-STP is a customized algorithm based on the special structure of the Huber distribution, whereas the general RFR-STP is a unified algorithm
designed for use with most existing heavy tailed distributions. Customized algorithms are usually
more efficient than non-customized algorithms. Considering that the FR-STP has a linear-order
time complexity, the general RFR-STP was still very fast, and should scale well with large datasets.
Note that the general RFR-STP was only compared with the optimized RFR-STP with regard to
time efficiency. As shown by Theorem 1, the RFR-STP is a strictly convex problem for most existing
heavy tailed distributions, which implies that there exists a unique local (and global) optimum for
the RFR-STP given a specific heavy tailed distribution. Both the general and optimized RFR-STP
algorithms will return the same prediction results, and hence have the same robustness.
5.6.2 Experiments on Aerosol Optical Depth Data
The Aerosol Optical Depth (AOD) data used for this study was collected by NASA’s Terra satellite
using an onboard MISR (Multi-angle Imaging SpectroRadiometer) to measure and monitor global
aerosol distributions, and provide information such as aerosol optical depth, aerosol shape, and size.
The spatial resolution of the AOD level-2 data collected by MISR is 17.6 km × 17.6 km. The level-2 data are then converted to level-3 data with lower spatial (0.5° × 0.5°) and temporal (1-day) resolution.
For this study, the level-3 data collected between July 1 and August 9, 2001, were subjected to the
same preprocessing procedure as that used in [228]. A total of 5 time units were considered, with
each time unit representing eight days. Time unit 1 relates to the period from July 1, 2001 to July
8, 2001; time unit 2 relates to July 9-16, · · · , and time unit 5 relates to the period from August 2 to
August 9, 2001.
We focused on the data collected in a rectangular region D located between longitudes 14°E and 46°E and between latitudes 14°N and 30°N, shown in Figure 5.6 (a). The study region therefore covers the northeastern part of Africa, the Red Sea, and parts of the Arabian Peninsula. The number
of level-3 observations (pixels) in the region is 32 × 64 = 2048. Other geographical regions were
also examined, including North and South America, and similar patterns were observed. In order to
evaluate the robustness of different prediction algorithms on the AOD data, 10 percent of the AOD
data were randomly selected and shifted to an abnormal value (e.g., ±5) that lies outside the normal range of the observations (−0.0843 ± 0.4958), and 10 percent of the AOD data were randomly
selected and set as missing values.
1) Robustness of the RFR-STP
Figures 5.6 (a) to (f) demonstrate the robustness of the proposed RFR-STP compared with that of
the FR-STP. Figure 5.6 (a) shows our study region, which is located within the white box area of the
map. Figure 5.6 (b) shows the heatmap of the detrended observations Zt=5. Figure 5.6 (c) displays
the heatmap of the contaminated observations Zt=5, in which the red dots are outliers. Figure 5.6 (d)
shows the FR-STP predicted heatmap based on the clean detrended observations Zt. Figure 5.6 (e)
displays the FR-STP predicted heatmap based on the contaminated observations Zt=5. Figure 5.6 (f)
shows the heatmap of the proposed RFR-STP predictions based on the contaminated observations
Zt=5. Comparing Figures 5.6 (d), (e), and (f), we observe that the FR-STP predictions were clearly
distorted by outliers. In contrast, the proposed RFR-STP predictions were almost the same as
the FR-STP predictions based on the original detrended observations. Similar patterns were also
observed in the predicted results for other times. Table 5.2 presents the MAPE and RMSE measures of the FR-STP and RFR-STP predictions in four different areas: unobserved locations, outlier locations, regular locations, and all locations. The results indicate that the proposed RFR-STP predictions were far more accurate than the FR-STP predictions in all areas, especially
the predictions at outlier locations.
2) Computational Efficiency of the RFR-STP
Table 1 compares the time cost of the RFR-STP and FR-STP approaches. The results indicate similar patterns to those observed in the preceding simulation study. Both the general and the optimized RFR-STP were computationally comparable to the FR-STP, and the optimized RFR-STP had a much higher computational efficiency than the general RFR-STP algorithm. Similar patterns were observed consistently on the traffic volume data set that will be discussed in the following subsection.
5.6.3 Experiments on Traffic Volume Data
The traffic volume data used here were collected in the downtown area of the city of Bellevue,
Washington (WA). A total of 105 detectors in this area were included in the modeling process. NE
8th Ave was selected as the test route because this is a major city corridor, with an annual average
weekday traffic of 37,700 veh/day. Data from 14 detectors, seven eastbound and seven westbound, on
NE 8th Ave were used to evaluate the robustness of different prediction algorithms. The evaluation
data were collected during the first week of June, 2007, and all data were aggregated into 5-minute
intervals to reduce the effect of random noise. Details of the preprocessing and model specification
are given in [294].
In Figure 5.7, the X-axis refers to the timestamps from 5 am to 9 pm, and the Y-axis refers to the traffic
volume, aggregated at 5-minute intervals. Figure 5.7 (a) shows the traffic volume from detector #75 with
one significant spike of 1,900 around 11 am, which was probably caused by a detector malfunction.
On this detector, the FR-STP predictions exhibited a spike of over 800 triggered by the outlier.
However, the RFR-STP predictions had only a minor spike of around 550, which is a very reasonable
value. Figure 5.7 (b) shows the results for detector #215, with oscillating volumes throughout the
day. Because this detector was located close to detector #75 on the same route, the outlier on
detector #75 also affected the FR-STP predictions on detector #215. As can be observed from the
figure, the FR-STP predictions had a significant spike at exactly the same time the outlier appeared
on detector #75. In contrast, the RFR-STP predictions successfully limited the impact from the
spatially neighboring outlier to a reasonable value.
Another interesting observation shown in Figure 5.7 (b) is that most of the FR-STP predictions on
detector #215 were over-estimated, and the resulting curve predicted by the FR-STP failed to follow
the observation Z curve well. In contrast, the RFR-STP predictions were more accurate in most
locations. Although some of the RFR-STP predictions around 11 am were slightly over-estimated,
they were still more accurate than those of the FR-STP predictions. In addition, the RFR-STP
clearly limited the impacts of the outlier in a local temporal region, which provides a very good
demonstration of its robustness. One potential interpretation is that the FR-STP predictions were
conducted based on spatio-temporal autocorrelations captured by the STRE model. The outlier
observation on detector #75 around 11 am consequently had an impact on the predictions of both
its spatial and temporal neighbors. The RFR-STP predictions were conducted based on the R-STRE
model, which is able to cope with large deviations caused by outliers as a part of the measurement
error by using a heavy tailed distribution. This feature limits the effect of outliers to a reasonable
value.
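The bounded-influence behavior described above can be illustrated with a Huber-type ψ function, which treats large residuals as heavy-tailed measurement error and clips their contribution to the fit (an illustrative sketch; the threshold κ and the residual values are hypothetical, not the fitted R-STRE quantities):

```python
def huber_psi(residual, kappa=1.345):
    """Huber influence function: linear for small residuals, clipped at +/- kappa."""
    return max(-kappa, min(kappa, residual))

# A gross outlier residual contributes no more than kappa,
# so one bad detector reading cannot dominate the prediction.
residuals = [0.2, -0.5, 8.0]              # 8.0 mimics a detector spike
print([huber_psi(r) for r in residuals])  # -> [0.2, -0.5, 1.345]
```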
ALGORITHM 1: A General RFR-STP Algorithm
input : Z_{1:T}, O_{1:T}, S_{1:T}, V_{ε,1:T}, V_{ξ,1:T}, Ψ
output: Y_{1:T|T}
Calculate Z, O, S, Λ_{ξ,1:T}, M, E by Equation (5.11);
Select initial values for η = [η_1^T, ..., η_T^T]^T and ξ_{1:T};
Select a tolerance ǫ > 0;
repeat
    repeat
        Calculate the gradient and Hessian matrix for η:
            b = −S^T O^T ψ(Z − O S η − O ξ) + M η + E;
            P = S^T O^T Φ(Z − O S η − O ξ) O S + M;
        Calculate the Newton step and decrement for η:
            Δη = −P^{−1} b;    λ_η^2 = b^T P^{−1} b;
        Choose a step size t by backtracking line search;
        Update η = η + t Δη;
    until λ_η^2 / 2 ≤ ǫ;
    for t = 1, ..., T do
        repeat
            Calculate the gradient and Hessian matrix for ξ_t by Equations (10) and (11):
                c = −O_t^T ψ(Z_t − O_t S_t η_t − O_t ξ_t) + Λ_{ξ,t} ξ_t;
                R = O_t^T Φ(Z_t − O_t S_t η_t − O_t ξ_t) O_t + Λ_{ξ,t};
            Compute the Newton step and decrement for ξ_t:
                Δξ_t = −R^{−1} c;    λ_{ξ_t}^2 = c^T R^{−1} c;
            Choose a step size t by backtracking line search;
            Update ξ_t = ξ_t + t Δξ_t;
        until λ_{ξ_t}^2 / 2 ≤ ǫ;
    end
    Update λ_η^2;
until λ_η^2 ≤ ǫ and λ_{ξ_t}^2 ≤ ǫ for all t = 1, ..., T;
Calculate Y_{1:T|T} by Equation (5.7);
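The inner loops of Algorithm 1 are damped Newton iterations. Their structure (gradient and Hessian, Newton step and squared decrement λ², backtracking line search, and the stopping criterion λ²/2 ≤ ǫ) can be sketched on a generic one-dimensional smooth objective (illustrative only; the quadratic objective below is a hypothetical stand-in for the η or ξ_t subproblem):

```python
def newton_minimize(grad, hess, x0, eps=1e-8, alpha=0.25, beta=0.5):
    """Damped Newton's method with backtracking line search (1-D sketch)."""
    x = x0
    while True:
        b, p = grad(x), hess(x)
        dx = -b / p                       # Newton step
        lam2 = b * b / p                  # Newton decrement squared
        if lam2 / 2 <= eps:               # stopping criterion, as in Algorithm 1
            return x
        t = 1.0
        # Backtracking: shrink t until the gradient magnitude decreases enough
        while abs(grad(x + t * dx)) > (1 - alpha * t) * abs(b):
            t *= beta
        x += t * dx

# Minimize f(x) = (x - 3)^2 + 1: grad = 2(x - 3), hess = 2
x_star = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
print(round(x_star, 6))  # -> 3.0
```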
ALGORITHM 2: An Optimized RFR-STP-Huber Algorithm
input : Z_{1:T}, O_{1:T}, S_{1:T}, V_{ε,1:T}, V_{ξ,1:T}, Ψ
output: Y_{1:T|T}
Calculate Z, O, S, Λ_ξ, M, E by Equation (5.11);
Set a tolerance ǫ > 0;
Find an initial ω such that ω − κ1 ≤ 0 and −ω − κ1 ≤ 0; set λ_1, λ_2 > 0, u > 0, m = 2;
repeat
    Calculate the surrogate gap η̂ := λ_1 − λ_2;
    Determine t: t = um/η̂;
    Compute the primal-dual search direction (Δω, Δλ_1, Δλ_2) by the equations stated in (5.24);
    Choose a step size s by backtracking line search;
    ω_new = ω + s Δω;   λ_{1,new} = λ_1 + s Δλ_1;   λ_{2,new} = λ_2 + s Δλ_2;
until ‖r_dual(ω_new, λ_{1,new}, λ_{2,new})‖ ≤ ǫ and ‖η̂‖ ≤ ǫ;
Calculate η and ξ by Equations (5.19) and (5.20);
Calculate Y_{1:T|T} by Equation (5.7);
Table 5.1: Comparison of Time Cost using the Simulated and AOD Data (Seconds)

Dataset           Outliers (#)   FR-STP   RFR-STP (General)   RFR-STP (Optimized)
Simulation Data
 Isolated Outliers      5          4.84         54.52                5.86
                       10          5.38         77.63                6.23
                       15          5.38         78.31                6.20
                       30          5.49         81.63                6.27
                       50          5.60        102.20                6.74
 Regional Outliers      5          5.76         64.40                6.14
                       10          5.38         40.70                6.20
                       15          5.74         40.89                5.85
AOD Data              10%         52.30         12.84                6.52

Note: The simulated data has 256 locations and 100 time units. The AOD data has 2048 locations and 5 time units.
Table 5.2: Comparison of Robustness using the AOD data

Approach   Measure   Unobserved   Outlier     Regular     All
                     Locations    Locations   Locations   Locations
FR-STP     MAPE        2.97         3.32        3.67        3.51
           RMSE        1.67         1.53        1.76        1.73
RFR-STP    MAPE        1.67         1.53        1.76        1.73
           RMSE        0.35         0.34        0.35        0.35
Figure 5.4: Comparison between the FR-STP and RFR-STP using the data observed at four
different times and with different numbers of isolated outliers (15 unobserved locations from s =
113 to s = 127). Each panel plots the observation Z, the contaminated Z, and the RFR-STP and
FR-STP predictions against location s: (a) t = 81, 10 outliers; (b) t = 17, 15 outliers; (c) t = 25,
30 outliers; (d) t = 63, 50 outliers.
Figure 5.5: Comparison between the FR-STP and RFR-STP using the data observed at two
different times and with different sizes of regional outliers (15 unobserved locations from s = 113 to
s = 127). Each panel plots the observation Z, the contaminated Z, and the RFR-STP and FR-STP
predictions against location s: (a) t = 8, 2 regional outliers of size 5; (b) t = 8, 2 regional outliers
of size 15.
Figure 5.6: Comparison between the FR-STP and RFR-STP on the contaminated AOD data sets
observed at time t = 5. Panels (longitude vs. latitude heatmaps): (a) Study Region; (b) Detrended
Observation Zt=5; (c) Contaminated Zt=5; (d) FR-STP Predictions on the clean Zt=5; (e) FR-STP
Predictions on the contaminated Zt=5; (f) RFR-STP Predictions on the contaminated Zt=5.
Figure 5.7: Comparison between the FR-STP and RFR-STP using the Traffic Volume Data on the
4th day (detectors #75 and #215 are spatial neighbors). Each panel plots the observation Z and the
RFR-STP and FR-STP predictions (5-minute volumes) from 5 AM to 9 PM: (a) detector #75;
(b) detector #215.
Chapter 6
Application 1: Activity Analysis Based on Low Sample Rate Smart Meters
Activity analysis disaggregates utility consumption from smart meters into the specific usage asso-
ciated with human activities. It can not only help residents better manage their consumption for a
sustainable lifestyle, but also allow utility managers to devise conservation programs. Existing
research efforts on disaggregating consumption focus on analyzing consumption features sampled at
high rates (mainly between 1 Hz and 1 MHz). However, many smart meter deployments support
sample rates of at most 1/900 Hz, which challenges activity analysis with occurrences of parallel activ-
ities, difficulty in aligning events, and a lack of consumption features. We propose a novel statistical
framework for disaggregation of coarse granular smart meter readings by modeling fixture char-
acteristics, household behavior, and activity correlations. This framework has been implemented
in two approaches for different application scenarios, and has been deployed to serve over 300 pilot
households in Dubuque, IA. Interesting activity-level consumption patterns have been identified, and
the evaluation on both real and synthetic datasets has shown high accuracy in discovering washer
and shower activities.
This chapter is organized as follows: Section 2 illustrates the application deployment for the pro-
posed approach, and introduces the related challenges. A novel general statistical framework for
disaggregation is proposed in Section 3. The detailed implementations for water consumption disag-
gregation are described in Section 4. Section 5 evaluates the performance of the proposed approaches
under different scenarios with real-world and synthetic datasets and demonstrates some interesting
findings from the pilot households. The related work is reviewed in Section 6. Finally, Section 7
concludes our work with future directions.
6.1 Introduction
Sustainability and the design of sustainable technologies have become an urgent and important priority
for cities, given the unprecedented level of demand for resources and services (water, energy, transit,
healthcare, public safety, and every other service that makes a city attractive and desirable). At the
same time, digital reification of the cyber-physical world has become possible with the widespread
penetration of sensing and monitoring technologies. These two important catalysts have fuelled
significant interest and cross-organizational collaboration among researchers, industries, urban
planners, and governments. Much recent technology and research has focused on leveraging
information from such digital reification of the cyber-physical world to help manage various services
more efficiently. Our work takes a step in that direction: it examines the feasibility of, and provides
innovative approaches toward, influencing people's consumption behavior. More precisely, we provide
activity analysis based on smart water meter readings.
Given these real-world constraints, we study the feasibility of activity analysis to identify activities
from smart utility meter readings. Our study is based on the hypothesis that consumption activities
disaggregated from meter readings will empower residents with appropriate insights to influence and
shape their behavior. This hypothesis has been validated through a city-wide survey [233] followed by
a four-month-long experiment in a real city [293]. In addition, from disaggregated consumption,
utility managers can design and assess conservation programs, and prioritize retrofits with energy-
saving potential.
Research on disaggregating electricity or water load has been conducted on smart meter readings with
fine granularity (mainly between 1 Hz and 1 MHz). Existing approaches identify appliances (fixtures)
by analyzing steady-state or transient-state changes in real-time consumption. However, they
are not suitable for many existing smart meter infrastructures.
Real-world deployments of smart meters are designed for utility billing and some basic analysis
requirements, but many of them are not suitable for consumption disaggregation. Smart meters
transmit consumption readings using wireless protocols, which consume battery power and depend
on the physical environment. Although the meters can sample at rates even higher than 1 MHz,
many existing deployments have chosen to accumulate readings into 15-minute or even longer intervals
to ensure reliable data transmission. Even so, the physical environment may still affect the data
transmission. This scenario brings the following challenges to consumption disaggregation: 1) parallel
usage activities, e.g., a toilet flush and a shower in the same 15-minute interval; 2) difficulty in aligning
usage events temporally, e.g., a shower may appear in one or two intervals; 3) lack of features, i.e., only
the aggregated consumption and start time of each interval can be used to identify usage activity. An
example of such water meter data and the expected disaggregated activities is illustrated in Figure 6.1.
To handle these challenges, we have designed a novel statistical framework for activity analysis
on coarse granular smart water meter readings, and deployed it as a component in Smarter Wa-
ter Service for Dubuque, IA. In this framework, fixture characteristics, household behavior, and
activity correlations are utilized to disaggregate consumption. To implement this framework, we
Figure 6.1: An Example of Data and Disaggregated Activities
propose two approaches to identify activities. The first approach applies a hidden Markov model to
capture the relationship among consumption events and hidden activities. The second approach
utilizes classification techniques to learn from labeled activities, and a Gaussian mixture model is
used for disaggregation. The proposed approaches have been validated using both real-world water
consumption and synthetic datasets. The experiments have demonstrated the capability of the pro-
posed disaggregation framework, illustrated the appropriate sample rate for disaggregation in various
applications, and revealed interesting usage insights from 300+ pilot households. In summary, the
major contributions of this work include:

• Providing activity-level consumption insights to residents and the city management team to
support decision making;

• Designing a general disaggregation framework with two implementations for different scenarios;

• Revealing interesting consumption patterns from the disaggregation results.
6.2 Background
Activity analysis is an important function provided in the Smarter Water Service based on smart
water meters. The deployed environment of our smart water meter infrastructure is shown in
Figure 6.2. Since August 2010, over 300 pilot households have volunteered to install Neptune R900
smart water meters [274] with UFR (Unmeasured Flow Reducer), which transmit a new aggregated
reading roughly every 15 minutes through a 900 MHz wireless connection. Each aggregated reading
is broadcast repeatedly within the entire interval to ensure successful transmission. Wireless
gateways have been deployed in the city to collect these readings, attach timestamps, and send them to
a data center through a 3G network every hour. In addition, 6 volunteer households applied
data loggers that record water consumption every 10 seconds, and kept water usage activity
journals accordingly for a week. All the meter readings have been anonymized and sent to the IBM
Computing Cloud for analytics.
Figure 6.2: Data Acquisition
The software architecture of the deployment is visualized in Figure 6.3. The smart meter data are
first cleaned and transformed by InfoSphere Information Server (IIS), and then stored in a Smart
Meter Database managed by DB2. On top of this database, Cognos is utilized to provide OLAP
functions such as consumption metric and pattern monitoring, and a Java-based module is developed
to perform advanced analytics functions such as disaggregation and prediction. IBM WebSphere
Application Server (WAS) hosts the service layer to allow users to interact with the services. In
addition, a community engagement component plays the role of motivating residents through
competition and collaboration via multiple media channels. The whole system, an $850K deployment
engagement with Dubuque, IA, has been deployed on the IBM Smarter Cities Sustainable Model
Cloud, and provides services to residents (300+ pilot households) and the city management team
(about 10 government employees) [293].
The main objective of this Smarter Water Service is to provide effective services that can help the
volunteers modify their behavior to be more sustainable; in other words, to let the residents know
what they need to know to change their behavior. To achieve that goal, one important process is
to reveal disaggregated water consumption, so that users can see where in their houses they
could conserve water, and sustainable operations or investments can be suggested. As a component
of the Smarter Water Service, activity analysis shares computing resources with the other custom
analytics. It works as a backend service that outputs monthly activity-level consumption distribution
reports from 15-minute aggregated consumption. This component will continue to provide
consumption insights as part of the Smarter Water Service, and will be updated with enhanced
learning ability and expanded to an expected 4,000 households with hourly readings by 2013.
A preliminary summary has shown a 6.6% normalized accumulative consumption reduction in the
8 weeks after the Smarter Water Service was launched in September 2010. In addition, a survey
conducted in December 2010 showed that since September, out of 64 respondents, 15 households had
fixed leaks, 13 respondents had shortened their showers, and 14 purchases of water-efficient
toilets/appliances had been made.

Figure 6.3: Smarter Water Service Architecture
6.2.1 Problem and Definition
The problem of disaggregation from coarse granular smart water meter readings can be informally
described as follows:
Definition 1 (Disaggregation) Given a sequence of aggregated interval water consumption Con(T) =
(Con1, ..., ConT), where Coni refers to the aggregated water consumption at the i-th time interval,
the proposed solution should return a set of activities ((A1, E1), ..., (Ak, Ek)) that are most likely to
cause the aggregated consumption Con(T), where Ai refers to an activity state (e.g., washer, shower,
or toilet use), and Ei refers to an observation (event) of water consumption for this activity state,
represented by a vector of event features, including the total water consumption and the start/end
time intervals.
The related terms and their definitions are summarized in Table 6.1, and will be used in the rest of the
chapter. We use capital letters to denote random variables and lowercase letters to denote observations.
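As a concrete illustration of Definition 1, an event and a disaggregated (activity, event) pair can be represented with small record types (a minimal sketch; the class and field names are illustrative, not those of the deployed system):

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Observed usage event E: feature vector of one (possibly parallel) activity."""
    consumption: float   # total water consumption Con, in gallons
    start_interval: int  # index of the first 15-minute interval
    end_interval: int    # index of the last 15-minute interval

@dataclass
class LabeledActivity:
    """A disaggregated pair (A_i, E_i)."""
    activity: str        # e.g., "sink", "toilet", "shower", "washer"
    event: Event

# One toilet flush spanning a single interval
a = LabeledActivity("toilet", Event(consumption=3.0, start_interval=42, end_interval=42))
print(a.activity, a.event.consumption)  # -> toilet 3.0
```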
6.2.2 Research Challenges
General challenges for usage disaggregation from a single main meter include the following: 1) ap-
pliances (fixtures) with similar consumption patterns, e.g., certain sink usage and a toilet flush; 2)
appliances/fixtures with multiple settings, e.g., the normal, delicate, and permanent press settings
of a washer; 3) load variation, e.g., low, medium, and full loads of a washer, or the length of showers;
4) multiple cycles, e.g., washer and dishwasher; 5) lack of real-world ground truth, i.e., it is hard to
collect sufficient labeled data from consumers. Disaggregation under the above challenges can be
treated as a real-world classification problem.

Table 6.1: Terms & Definitions

Term                           Symbol           Definition
Consumption                    Con              Amount of water used, in gallons
Interval                       Int              The time period between 2 consecutive meter readings
Activity                       A                Integer value that represents one of the following:
                                                sink, toilet, shower, and washer
Event                          E                A vector of features to represent an event. The event
                                                features include total consumption, start/end time, etc.
Event sequence                 (E1, ..., ET)    A sequence of events occurring in a time window (e.g.,
                                                24 hours), where T is the number of events
Parallel activities            (At1, ..., Ats)  s activities that occur together in an event
Events of parallel activities  P(E(T))          The set of events in (E1, ..., ET) generated by parallel
                                                activities
Parallel sub-events            (Et1, ..., Ets)  A set of parallel sub-events whose aggregation generates
                                                the event Et. Each sub-event Eti is generated by a single
                                                activity Ati
In addition, the specific application scenario introduced in the previous section brings more challenges
because of the coarse granularity and the unstable reading intervals caused by unreliable communication.
These limitations cause: 1) parallel usage activities, e.g., two toilet flushes and a shower in the same
15-minute interval; 2) difficulty in aligning usage events temporally, e.g., a shower may appear in
one or two intervals; 3) lack of features, i.e., only the aggregated consumption and start time of each
interval can be used to identify usage activity. These specific challenges make the task of water
usage disaggregation more than a classification problem and difficult to solve.

The existing disaggregation approaches focus on analyzing steady-state or transient-state changes.
They cannot handle the specific challenges in this scenario, because no steady state or transient
state can be detected at such a low sample rate.
6.2.3 Observations
Due to the challenges discussed, the aggregated consumption of each interval alone cannot
provide confident disaggregation results. We need to investigate the available ground truth to see
what other factors may help improve the disaggregation accuracy. After a study of the activity
journals from the volunteers, we found three useful characteristics of water usage activities: they are
fixture-dependent, household-dependent, and time-dependent.
Fixture-dependent Pattern
Each fixture category has its own usage pattern in terms of consumption and duration that can be
used to distinguish it from the others. Specifically, the amount of water consumed in a toilet flush
usually fell within several small ranges between 1.5 and 5 gallons, and was consistent for a specific
toilet. A washer load generally lasted between 30 and 60 minutes, and consisted of multiple cycles
with similar water usage. Showers had a consistent flow rate most of the time, and lasted from 5 to
15 minutes in most cases. Sink usage was usually short in time and low in consumption. These
patterns can help roughly categorize the usage events. For example, any interval with a flow rate
lower than 0.1 gallons per 15 minutes can be filtered out as sink usage. However, using a fixture
specification library is not enough to identify parallel activities, or to deliver customized models for
each household.
Household-dependent Pattern

Activity patterns depend heavily on the fixture models and occupants of a specific household. For
example, households with kids generally spent more time on showers every day; households with
open leaks showed continuous usage for a long time; some households had 3 toilets, each with a
different specification. Therefore, each household needs to be modeled separately to ensure accurate
disaggregation. These models can be learned from historical consumption records and household
profiles if available.
Time-dependent Pattern

Owing to human behavioral routines, some activities happen frequently during specific time periods,
which can be used to distinguish ambiguous water usage. One interesting example of such a pattern
is the shower. Most of the labeled showers happened either close to the first usage event in the morning
or close to the first event after work. Although toilet flushes occurred at almost any time of day, they
were less frequent during working hours and midnight than during the rest of the day. Not only the
time of day, but also the day of week has been found to affect activity patterns. An example is washer
usage, which happened mostly during weekends in some households. In addition, some activities were
found to be temporally associated. For instance, a toilet flush was in many cases followed by a short
sink usage for hand washing. Given these time-dependent activity patterns, the timestamps of usage
events should be able to improve disaggregation results significantly.
6.3 A NEW STATISTICAL DISAGGREGATION FRAMEWORK
Coarse granular smart meter readings cause a large portion of parallel activities, and disaggregation
of parallel activities has become a critical and important challenge. This section introduces a new
General Disaggregation Framework (GDF) to address the disaggregation problem. As illustrated in
Figure 6.4, the GDF framework applies six phases to disaggregate water consumption. The workflow
is described as follows:
Figure 6.4: Disaggregation Framework
Phase 1 Event extraction: Given a sequence of aggregated interval consumption Con(T ) =
(Con1, · · · , ConT ), the intervals with continuous consumption are grouped to generate events where
each represents one activity or parallel activities. The output of this phase is an event observation
sequence of a given time window: e(T ) = (e1, e2, · · · , eT ). Hence, e(T ) is regarded as one observation
of the event random variables E(T ) = (E1, E2, · · · , ET ). Each event Ei may be generated by a
hidden activity (Ai) or several parallel hidden activities (Ai1, · · · , Ais).
Phase 2 Model selection and training: Select an appropriate stochastic model D(E(T); θ), such
as an HMM or a GMM, and estimate its parameters θ based on historical labeled or unlabeled observations.
Phase 3 Parallel activity detection: Given the estimated stochastic model D(E(T); θ), the
events with parallel activities P(e(T)) can be identified from the anomalous events O(e(T)). Anomalous
events can be obtained using a leave-one-out test, i.e., O(e(T)) = {et | et ∈ R(E(−t) = e(−t), α)},
where E(−t) = (E1, ..., Et−1, Et+1, ..., ET) and e(−t) = (e1, ..., et−1, et+1, ..., eT). R(·) refers to
the outlying region of a normal event Et, defined based on the conditional distribution of
[Et | E(−t) = e(−t)] and a confidence level α (e.g., 0.99). The calculation of outlying regions based on
the HMM and GMM models will be discussed in Section 4. This phase assumes that all anomalous events
are generated by parallel activities. An anomalous event may also be generated by a truly abnormal
activity, such as a shower lasting more than an hour. However, it is difficult to differentiate these
based only on coarse granular meter readings. Hence, we only consider parallel activities.
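As a simplified instance of the leave-one-out test in Phase 3, consider events described by a single consumption feature and modeled by a Gaussian fitted to all other events; et is flagged as anomalous when it falls outside the central α-probability region (an illustrative sketch only; the actual framework uses the conditional distribution under the HMM or GMM, and the event values below are hypothetical):

```python
import statistics
from math import sqrt, erf

def gaussian_cdf(x, mu, sigma):
    """Standard closed-form Gaussian CDF via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def is_anomalous(events, t, alpha=0.99):
    """Leave event t out, fit a Gaussian to the rest, and test whether
    events[t] lies in the outlying region (outside the central
    alpha-probability interval)."""
    rest = events[:t] + events[t + 1:]
    mu = statistics.mean(rest)
    sigma = statistics.stdev(rest)
    p = gaussian_cdf(events[t], mu, sigma)
    tail = (1 - alpha) / 2
    return p < tail or p > 1 - tail

events = [3.0, 3.2, 2.9, 3.1, 3.0, 25.0]  # last event: likely parallel activities
print([is_anomalous(events, t) for t in range(len(events))])
# -> [False, False, False, False, False, True]
```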
Phase 4 Parallel size estimation: For each anomalous event observation et ∈ O(e(T)), the
number of parallel activities that generate et can be estimated by

    s = min{ s | et ∈ R_Agg(E(−t) = e(−t), Agg(Et1, ..., Ets), α) },          (6.1)

where Et1, ..., Ets refer to the parallel activities (random variables) whose aggregation generates
the event et, Agg(·) refers to the vector of aggregated features, and R_Agg(·) refers to the normal
region of the aggregated features Agg(Et1, ..., Ets). Agg(Et1, ..., Ets) returns aggregated features
such as the total water consumption, the earliest start time, and the latest end time of the sub-events
Et1, ..., Ets. The reason for selecting the minimal s is that a heavy consumption event (e.g., a washer
load) can always be decomposed into a large number of small activities (e.g., toilet flushes), which
is not reasonable.
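For a single consumption feature and one activity type with Gaussian consumption N(μ, σ²), the aggregate of s independent sub-events is N(sμ, sσ²), and the minimal-s rule of Equation (6.1) reduces to finding the smallest s whose aggregate normal region covers the observed total (a toy instance; the toilet-flush parameters and the z cutoff are hypothetical):

```python
from math import sqrt

def min_parallel_size(total, mu, sigma, z=2.576, s_max=10):
    """Smallest s such that `total` lies within the central normal region of
    the aggregate N(s*mu, s*sigma^2) at roughly 99% confidence (z = 2.576)."""
    for s in range(1, s_max + 1):
        lo = s * mu - z * sigma * sqrt(s)
        hi = s * mu + z * sigma * sqrt(s)
        if lo <= total <= hi:
            return s
    return None  # no plausible decomposition within s_max sub-events

# Hypothetical toilet flushes: mu = 3.0 gal, sigma = 0.3 gal
print(min_parallel_size(9.1, mu=3.0, sigma=0.3))  # -> 3
```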
Phase 5 Hidden activity identification: For each abnormal event et ∈ O(e(T)), given the
estimated size s of parallel activities, this phase estimates the disaggregated activities

    (at1, ..., ats) = argmax_{(at1, ..., ats) ∈ {1, ..., m}^s} Pr(At1 = at1, ..., Ats = ats |
                      E(−t) = e(−t), Agg(Et1, ..., Ets) = et),                (6.2)

where m is the number of activity states.
Phase 6 Consumption decomposition: Given the hidden parallel activities at1, ..., ats esti-
mated in Phase 5, the related water consumption of these hidden activities can be estimated as

    (Con(et1), ..., Con(ets)) = argmax L(Con(Et1) = Con(et1), ..., Con(Ets) = Con(ets) |
                                E(−t) = e(−t), At1 = at1, ..., Ats = ats,
                                Agg(Et1, ..., Ets) = et),                     (6.3)

where L is the likelihood function, the argmax is taken over (Con(et1), ..., Con(ets)), and Con(eti)
is the consumption feature of the sub-event observation eti, i = 1, ..., s.
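When each sub-event's consumption is modeled as an independent Gaussian N(μᵢ, σᵢ²), the likelihood in Equation (6.3), subject to the aggregation constraint that the parts sum to the observed total, is maximized in closed form by shifting each mean in proportion to its variance (a sketch under this simplifying Gaussian assumption; the μᵢ, σᵢ values below are hypothetical):

```python
def decompose(total, mus, sigmas):
    """Maximum-likelihood split of an aggregate total across independent
    Gaussian sub-events, from the Lagrangian condition (c_i - mu_i)/sigma_i^2
    = const:  c_i = mu_i + sigma_i^2 * (total - sum(mu)) / sum(sigma^2)."""
    var = [s * s for s in sigmas]
    slack = total - sum(mus)
    return [m + v * slack / sum(var) for m, v in zip(mus, var)]

# Hypothetical parallel activities: a toilet flush and a shower
parts = decompose(total=20.0, mus=[3.0, 15.0], sigmas=[0.3, 3.0])
print([round(p, 2) for p in parts])  # -> [3.02, 16.98], summing to the total
```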
Theorem 4 Given a sequence of aggregated consumption intervals Con(T) = (Con1, ..., ConT),
GDF is able to identify the true hidden activities ((A1, E1), ..., (Ak, Ek)) of Con(T) if the following
assumptions are satisfied: a) in Phase 1, the events are correctly identified and the features
extracted are sufficient; b) the distribution D(E(T); θ) is correctly selected and estimated; c) all
anomalous events are due to parallel activities; d) the minimal s selected in Phase 4 is correct.

Proof The four conditions stated above ensure that the statistical model built by GDF is consistent
with the true distribution of the hidden activities of Con(T). It follows that the activities identified
by GDF are the most probable results and are consistent with the true hidden activities.
6.4 DISAGGREGATION APPROACHES
This section presents two approaches based on GDF to handle different disaggregation scenarios.
When there is no sufficient training data available, which is true in many real-world scenarios, we
propose an approach to learn hidden relationship among consumption events and activities without
user input based on hidden Markov model (HMM). When labeled activities are available for training,
we design the second approach to construct statistical models using classification techniques and
disaggregate parallel activities using Gaussian mixture model (GMM).
6.4.1 HMM-based Approach
This section presents an implementation of GDF based on an HMM. It is trained on unlabeled
data and performs disaggregation without user input. For simplicity, each event Ei is represented
by a single feature, the total water consumption. Other features, such as start/end time intervals
and duration, can be incorporated into this approach in a straightforward manner.
Event Extraction (GDF Phase 1)
The key challenge of event extraction is the segmentation process. Without labeled historical data,
it is necessary to define a set of heuristic rules to generate meaningful events based on domain
knowledge. The basic criterion is to keep adjacent interval consumptions in a single event if they
possibly relate to one activity or to parallel activities. This avoids the situation where one activity is
divided into two separate events, which is not recoverable in our approach. If two nonparallel activities
are mistakenly grouped into one event, they can still be identified in the subsequent disaggregation
process.
Similar to the idea of hierarchical clustering, a bottom-up based segmentation algorithm is proposed
as follows:
1. Preprocessing. Remove leaking effects, and filter out all zero-consumption intervals.
2. Initialization. Regard each left interval as one event. Then we have the sequence of initial
events (e1, , ek), where k is the number of nonzero consumption intervals.
3. Merging heavy events. Define a water consumption threshold ϑ (e.g., 5.5 gallons for 15-minute-
size intervals). For each continuous event pair (ei, ei+1), if Con(ei) > ϑ andCon(ei) > ϑ, merge
ei and ei+1. Repeat until no such pair exists.
4. Merging light events. For each event ei with Con(ei) > ϑ, if Con(ei−1) > 0, then merge ei
and ei−1. Similarly, if Con(ei+1) > 0, then merge ei and ei+1. If there is an event ei with
Con(ei) > 0, and both Con(ei−1) and Con(ei+1) greater than ϑ, then ei is merged to the
segment with the smallest consumption.
5. Merging peak events. Merge two peak events e_i and e_j if dist(e_i, e_j) ≤ τ, where dist(e_i, e_j) = t_start(e_j) − t_end(e_i), and t_start(·) and t_end(·) refer to the start and end times of an event, respectively. We define an event as a peak if its total water consumption is greater than a threshold γ (e.g., 20 gallons). This step is specifically designed for fixtures like washers, whose usage consists of multiple peaks separated by empty cycles (no water consumption) of more than 15 minutes.
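As an illustration, Steps 2–3 above can be sketched in a few lines (a minimal sketch of ours, assuming a list of 15-minute interval consumptions in gallons as input; the function name is illustrative):

```python
from typing import List

THETA = 5.5  # heavy-event threshold in gallons (value suggested in the text)

def merge_heavy(interval_cons: List[float]) -> List[List[float]]:
    """Step 2: treat each nonzero interval as an initial event;
    Step 3: repeatedly merge consecutive events whose consumptions
    both exceed THETA, until no such pair remains."""
    events = [[c] for c in interval_cons if c > 0]
    merged = True
    while merged:
        merged = False
        for i in range(len(events) - 1):
            if sum(events[i]) > THETA and sum(events[i + 1]) > THETA:
                events[i] += events.pop(i + 1)
                merged = True
                break
    return events
```

Steps 4–5 (light and peak events) would then operate on the resulting event list in the same bottom-up fashion.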
HMM Parameter Estimation (GDF Phase 2)
A hidden Markov model is usually trained with the EM algorithm, which can only guarantee a local optimum. Given the large number of parameters to be estimated in an HMM, including the number of hidden states, the initial probabilities, the emission distribution of each state, and the transition matrix, it is critical to find appropriate initial settings for these parameters. Based on empirical evaluation, we decided on a mixture of three Gaussians for sink events, and single Gaussian models for other activity events. This section presents a heuristic-based approach to seek initial settings for each household based on generic domain knowledge:
1. Toilet identification. Hierarchical clustering is applied to the events to identify toilet clusters. Based on domain knowledge, toilet clusters can be identified by requiring the cluster size to be greater than 3 times the total number of days in the training data, and the consumption standard deviation to be smaller than 0.5 gallons.
2. Sink identification. Sink events can be identified as the events with consumption lower than (µ_i − 2σ_i), where µ_i and σ_i are the mean and standard deviation of the toilet cluster with the smallest mean consumption among all toilet clusters.
3. Frequent pattern identification. After removing the sink events and toilet clusters, hierarchical clustering is applied to the remaining events to identify other qualified clusters. In order to control the HMM complexity, we only keep the 12 clusters with the smallest standard deviations.
4. Cluster labeling. This step assigns labels to the qualified clusters based on predefined rules, such as that a shower usage should be within 5∼25 gallons. If some clusters are still not labeled, we label them as "others", which may relate to some unknown activity state or a frequent combination of parallel activities.
5. Anomaly removal. Anomalous events are identified based on a Gaussian mixture distribution estimated from the qualified clusters. Since these outliers would impact the training of the HMM, they are removed from the training data.
6. Probability estimation. Regarding each qualified cluster as a hidden state, we obtain the number of hidden states and the mean and standard deviation of each state. The transition matrix and initial probabilities can then be estimated from the labeled events.
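For example, the filtering rules of Step 1 can be expressed directly (a sketch of ours using only the standard library; the cluster extraction itself would come from a hierarchical clustering routine):

```python
from statistics import pstdev
from typing import List

def toilet_clusters(clusters: List[List[float]], n_days: int) -> List[List[float]]:
    """Keep clusters satisfying the two domain rules from Step 1:
    size greater than 3 times the number of training days, and
    consumption standard deviation smaller than 0.5 gallons."""
    return [c for c in clusters
            if len(c) > 3 * n_days and pstdev(c) < 0.5]
```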
Disaggregation and Labeling (GDF Phase 3-6)
First, several notations are defined as follows. The set of activity states is {1, …, m}, D is an m × m transition matrix, π is the vector of initial probabilities of the m states, p_i(e_t) = Pr(E_t = e_t | A_t = i), and u_i(t) = Pr(A_t = i). For simplicity, we assume that each event E_t conditioned on its activity state A_t follows a Gaussian distribution, [E_t | A_t = i] ∼ N(µ_i, σ_i²). Note that the following derivations can also be straightforwardly extended to Gaussian mixture distributions.
Let P(e) = diag(p_1(e), …, p_m(e)) ∈ R^{m×m}, α_t = Pr(e_1, …, e_t, A_t) ∈ R^m, α_t(a_t) = Pr(e_1, …, e_t, A_t = a_t) ∈ R, β_t = Pr(e_{t+1}, …, e_T | A_t) ∈ R^m, β_t(a_t) = Pr(e_{t+1}, …, e_T | A_t = a_t) ∈ R, and B_t = D P(e_t).
The HMM implementations of GDF Phases 3 to 6 are as follows:
GDF Phase 3: Parallel activity detection
The probability density function

    P(E_t = e | E_(−t) = e_(−t)) = α_{t−1}^T D P(e) β_t / (α_{t−1}^T D β_t) = Σ_i w_i(t) p_i(e),

where w_i(t) = d_i(t) / Σ_{j=1}^m d_j(t) and d_i(t) = [α_{t−1}^T D]_i [β_t]_i. It indicates that [E_t = e | E_(−t) = e_(−t)] follows a GMM:

    [E_t = e | E_(−t) = e_(−t)] ∼ Σ_i w_i(t) N(e | µ_i, σ_i²).
The outlying region of the GMM model can be calculated as

    R(e_(−t), α) = { e : |e − µ_{k∗}| > σ_{k∗} Φ^{−1}(1 − α/2) },
where k∗ is the Gaussian component closest to e, and Φ(·) is the cumulative distribution function (CDF) of the standard Gaussian distribution. Here, we assume that the statistics of outlying events are dominated by the component closest to the observation. This outlying region estimation has been justified in [281] using extreme value statistics.
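The weights w_i(t) and the outlying-region test can be sketched as follows (our own illustration in pure Python, not the authors' implementation; Φ^{−1} comes from the standard library's NormalDist):

```python
from statistics import NormalDist

def conditional_weights(alpha_prev, D, beta_t):
    """Mixture weights of P(E_t = e | E_(-t)) = sum_i w_i(t) p_i(e):
    d_i(t) = [alpha_{t-1}^T D]_i * [beta_t]_i, normalized over i."""
    m = len(beta_t)
    d = [sum(alpha_prev[j] * D[j][i] for j in range(m)) * beta_t[i]
         for i in range(m)]
    total = sum(d)
    return [x / total for x in d]

def is_outlying(e, means, stds, alpha=0.05):
    """Membership test for R(e_(-t), alpha): compare |e - mu_k*| with
    sigma_k* * Phi^{-1}(1 - alpha/2), k* being the component closest to e."""
    k = min(range(len(means)), key=lambda i: abs(e - means[i]))
    return abs(e - means[k]) > stds[k] * NormalDist().inv_cdf(1 - alpha / 2)
```

With α = 0.05 the threshold is roughly 1.96 standard deviations around the nearest component's mean.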
GDF Phase 4: Parallel size estimation
The probability density function

    P(E_{t1} = e_{t1}, …, E_{ts} = e_{ts} | e_(−t)) = α_{t−1}^T [∏_{i=1}^s D P(e_{ti})] β_t / (α_{t−1}^T D^s β_t)
        = Σ_{(l_1,…,l_s) ∈ {1,…,m}^s} w_{l_1,…,l_s} p_{l_1}(e_{t1}) ⋯ p_{l_s}(e_{ts}),

where w_{l_1,…,l_s} is the weight that can be calculated from the form α_{t−1}^T ∏_{i=1}^s D P(e_{ti}) β_t / (α_{t−1}^T D^s β_t).
It implies that

    [E_{t1}, …, E_{ts} | E_(−t) = e_(−t)] ∼ Σ_{(l_1,…,l_s) ∈ {1,…,m}^s} w_{l_1,…,l_s} N([µ_{l_1}, …, µ_{l_s}]^T, diag(σ_{l_1}², …, σ_{l_s}²)).
By linear transformation, we have that

    [E_{t1} + ⋯ + E_{ts} | E_(−t) = e_(−t)] ∼ Σ_{(l_1,…,l_s) ∈ {1,…,m}^s} w_{l_1,…,l_s} N(Σ_{k=1}^s µ_{l_k}, Σ_{k=1}^s σ_{l_k}²).

Note that here Agg(E_{t1}, …, E_{ts}) = E_{t1} + ⋯ + E_{ts}. Since [Agg(E_{t1}, …, E_{ts}) | E_(−t) = e_(−t)] follows a Gaussian mixture distribution, the normal region R_Agg(·) can be estimated similarly as in GDF Phase 3 above.
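The component enumeration behind this mixture can be made explicit (a sketch of ours; for simplicity the tuple weight is approximated by a product of per-state weights, which is an illustrative assumption rather than the exact w_{l_1,…,l_s} above):

```python
from itertools import product

def aggregate_mixture(state_weights, means, variances, s):
    """Enumerate the components of E_t1 + ... + E_ts: for each state tuple
    (l1,...,ls) in {1,...,m}^s the sum is Gaussian with mean sum_k mu_lk
    and variance sum_k sigma_lk^2 (sum of independent Gaussians)."""
    m = len(means)
    comps = []
    for combo in product(range(m), repeat=s):
        w = 1.0
        for l in combo:
            w *= state_weights[l]
        comps.append((w,
                      sum(means[l] for l in combo),
                      sum(variances[l] for l in combo)))
    return comps
```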
GDF Phase 5: Hidden activity identification
The probability density function

    Pr(A_{t1} = a_{t1}, …, A_{ts} = a_{ts} | E_(−t) = e_(−t), E_{t1} + ⋯ + E_{ts} = e_t)
        = α_{t1}(a_{t1}) [∏_{i=1}^{s−1} Pr(a_{t(i+1)} | a_{ti})] Pr(Σ_k E_{tk} = e_t | a_{t1}, …, a_{ts}) β_{ts}(a_{ts}) / L_T,

where L_T is the likelihood of the whole sequence and can be neglected when solving problem (6.2). Note that the random variables E_{t1}, …, E_{ts} are independent of each other given their hidden activity states A_{t1}, …, A_{ts}. The probability density function Pr(Σ_k E_{tk} = e_t | a_{t1}, …, a_{ts}) can therefore be calculated by a simple linear transformation of independent Gaussian random variables.
GDF Phase 6: Consumption decomposition
Given the hidden activity states a_{t1}, …, a_{ts}, we have that

    [E_{t1}, …, E_{ts} | a_{t1}, …, a_{ts}] ∼ N(µ, Σ),

where µ = [µ_{a_{t1}}, …, µ_{a_{ts}}]^T and Σ = diag(σ_{a_{t1}}², …, σ_{a_{ts}}²). The optimal solution of problem (6.3) can be obtained as [282]

    [e_{t1}, …, e_{ts}]^T = µ − Σ1 (1^T Σ 1)^{−1} (1^T µ − e_t),

where 1 denotes the all-ones vector.
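Because Σ is diagonal, the matrix expression reduces to a variance-proportional shift of each mean; a minimal sketch (ours):

```python
def decompose(mu, var, e_t):
    """Closed-form maximizer of the Gaussian density N(mu, diag(var))
    subject to the sum constraint 1^T e = e_t:
        e = mu - Sigma 1 (1^T Sigma 1)^(-1) (1^T mu - e_t)."""
    gap = sum(mu) - e_t          # 1^T mu - e_t
    total_var = sum(var)         # 1^T Sigma 1
    return [m - v * gap / total_var for m, v in zip(mu, var)]
```

By construction the returned per-activity consumptions sum exactly to the observed event total e_t.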
6.4.2 Classification-GMM-based Approach
Different from the HMM-based approach, this section presents a mixed-model approach to the disaggregation problem that requires labeled data for training. It first applies a classification model (e.g., a support vector machine, neural network, or k-nearest neighbor classifier) to classify each event as a single activity, a known frequent combination of parallel activities, or an unknown infrequent combination of parallel activities. For the events classified into the last category (unknown infrequent combinations), it applies a GMM-based implementation of the GDF framework to disaggregate the parallel activities.
Assume that we are given a sequence of aggregated interval consumptions Con(T_1) = (Con*_1, …, Con*_{T_1}) and the related hidden activities ((a*_1, e*_1), …, (a*_k, e*_k)) as the labeled training data. The objective is to build a model on Con(T_1) that can identify the unknown hidden activities ((a_1, e_1), …, (a_k, e_k)) of a new aggregated interval consumption sequence Con(T) = (Con_1, …, Con_T).
Event Extraction (GDF Phase 1)
This phase first applies the same procedure as in Section 3.2.1 to identify a sequence of events. Here
each ei has six features, which include the start time, duration, total consumption, minimal interval
consumption, maximal interval consumption, and number of peaks.
Classification (GDF Phase 2)
The event extraction phase returns an event sequence (e_1, …, e_k), where each e_i is represented by a vector of six features (e_i ∈ R^6). Note that all the features are mapped to real values in order to apply classification models such as SVM and neural networks.
Here, we neglect the dependencies between events and treat (e_1, …, e_k) as a set of independent training instances {e_1, …, e_k}. Based on the labels ((a*_1, e*_1), …, (a*_k, e*_k)), we can identify the hidden activities of each event e_i. To decide class labels, not only single activities (e.g., toilet, shower, and washer) are treated as distinct classes, but frequent combinations of parallel activities are also regarded as distinct classes. The current setting is that frequent parallel activities should occur at least once per week.
GMM-based Disaggregation (GDF Phase 3-6)
After the classification process, each event has been labeled as a single activity or a known/unknown combination of parallel activities. For parallel activities, a GMM-based implementation of the GDF framework is proposed to disaggregate them. The basic procedures are as follows:
Based on the labels of the training events e_1, …, e_k, we can collect training instances for each activity state, such as toilet, shower, and washer. For simplicity, in this disaggregation step we only consider a single feature, the total water consumption, for each event e_i. Each single-activity event E_t can then be modeled by a Gaussian mixture distribution E_t ∼ Σ_{i=1}^m π_i N(µ_i, σ_i²), where π_i is the prior probability of activity state i, and N(µ_i, σ_i²) is the event distribution of activity i.
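Under this model, the density of an event value is just a weighted sum of Gaussian densities (a small sketch of ours):

```python
import math

def gmm_pdf(e, priors, means, stds):
    """Density of a single-activity event under
    E_t ~ sum_i pi_i N(mu_i, sigma_i^2)."""
    return sum(p * math.exp(-0.5 * ((e - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for p, mu, s in zip(priors, means, stds))
```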
Given an event e_t that is classified as parallel activities, the objective is to identify the most probable hidden activities ((a_{t1}, e_{t1}), …, (a_{ts}, e_{ts})) with Agg(e_{t1}, …, e_{ts}) = e_t. Here the aggregation function Agg is the summation function Σ(·). The GDF disaggregation framework can be employed here, and it can be regarded as a simplified case of the HMM-based approach. Readers are referred to [261] for detailed specifications.
6.5 Evaluation & Findings
The framework has been implemented using JDK 1.5 and deployed in the Custom Analytics Layer of the Smarter Water Service (Figure 6.4). Pie charts of the activity consumption distribution are generated to illustrate how each fixture has been used on a monthly basis. Through the Smarter Water Service layer interface, residents can browse their own consumption distribution; meanwhile, the government agency and utility manager can explore how water has been consumed by each activity at the regional level.
Both the HMM-based and GMM-based approaches have been implemented and evaluated. Specifically, for the GMM-based approach, we have assessed three classification methods: k-Nearest Neighbor classification (kNN-GMM), Artificial Neural Network (ANN-GMM), and Support Vector Machine (SVM-GMM). Given the available labeled activities, the evaluation focused on identifying toilet flushes, showers, and washer loads.
To evaluate the effectiveness of consumption disaggregation in identifying these activities, we adopted three metrics: precision, recall, and F-measure. The major reason for using these metrics is that disaggregation evaluation resembles an information retrieval process, where subsets of intervals represent certain true activities and the testing results are also subsets of intervals labeled as activities. The metrics need to capture not only how many labels are matched, but also how many true activities are missed and how many false labels are placed. These metrics are defined as follows: precision refers to the portion of matched activities within the corresponding disaggregation results; recall refers to the portion of matched activities within the corresponding true activities; F-measure is the harmonic mean of precision and recall.
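These three definitions can be computed directly (sketch of ours):

```python
def prf(matched: int, n_results: int, n_true: int):
    """Precision = matched / disaggregation results,
    recall = matched / true activities,
    F-measure = harmonic mean of precision and recall."""
    p = matched / n_results if n_results else 0.0
    r = matched / n_true if n_true else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

For example, 8 matched activities out of 10 reported and 16 true ones give precision 0.8, recall 0.5, and an F-measure of about 0.615.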
To evaluate the proposed disaggregation solution, we have applied both HMM-based and GMM-
based approaches on the consumption of 6 volunteer households, as well as 50 simulation datasets
that were generated based on their labeled consumption. In addition, we varied the sample rate in
these datasets to investigate its impact on disaggregation results. The correlation between sample
rate and effectiveness can provide guidance to future planning and deployment of human activity
analysis applications.
Due to the lack of labeled activities from most of the pilot households, we only applied the HMM-
based model to analyze activities of the 300+ pilot households. Some interesting patterns discovered
can illustrate common human behavior characteristics.
6.5.1 Datasets
A real-world dataset was collected from 6 volunteer households. It consists of 1/10 Hz water readings and the corresponding usage journaling records for 7 days. The usage journaling was input manually by the volunteers, so it has approximate timestamps and missing activities, which introduce inaccuracies that need to be handled carefully. Note that these households came from various demographic categories and showed significantly different consumption patterns. A summary of
labeled activities from one volunteer is listed in Table 6.2 as an example.

Table 6.2: Water Journaling of One Household

Fixture                    Occurrences   Total Amount   Percentage
Shower 1                   5             71             7%
Shower 2                   5             57             6%
Washer                     9             366            38%
Toilet 1                   43            217            24%
Toilet 2                   33            68             7%
Other (sink & unlabeled)   N/A           186            19%
50 simulation datasets were generated by simulating occurrences and corresponding consumption
of activities according to their distributions in the labeled dataset from the 6 volunteer households.
First, from the labeled activities, the number of instances of each activity in a week was estimated using a Poisson distribution. Each instance was randomly assigned to a day and time according to the distributions of the labeled activities in the day-of-week and time-of-day domains. These distributions were captured by activity occurrence histograms generated from the labeled activities and smoothed by kernel density estimation. Once the date and start time of an instance were determined, its consumption and duration were randomly picked from a dictionary of the corresponding labeled activities. Finally, the consumption noise of each day was randomly picked from 42 (6 households * 7 days) samples, each containing the unlabeled consumption (<2 gallons) of a whole day. In this way, 6 months of simulated consumption data were generated in each dataset.
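The per-activity generation loop described above might be sketched as follows (an illustrative reconstruction of ours, not the authors' code; the Poisson draw uses Knuth's multiplication method, and all names are assumptions):

```python
import math
import random

def simulate_week(rate_per_week, day_hist, consumption_pool, seed=0):
    """Draw a weekly activity count from a Poisson distribution, assign
    each instance a day of week according to the labeled-activity
    histogram, and pick its consumption from the pool of labeled values."""
    rng = random.Random(seed)
    # Poisson(rate) sample via Knuth's multiplication method
    threshold, k, p = math.exp(-rate_per_week), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            break
        k += 1
    days = rng.choices(range(7), weights=day_hist, k=k)
    return [(d, rng.choice(consumption_pool)) for d in days]
```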
A live dataset was constructed from the 15-minute consumption of all the pilot households since August 2010. This dataset has inconsistent reading intervals, missing readings due to communication failures, and even water leaks that can impair the disaggregation results.
6.5.2 Parameter Settings & Baseline Methods
For the HMM-based approach, the major settings are as follows: 1) in GDF Phase 1 (event extraction) Step 3 (merging heavy events), the threshold ϑ was set to 5.5 gallons; 2) in GDF Phase 1 (event extraction) Step 5 (merging peak events), the thresholds τ and γ were set to 15 minutes and 20 gallons, respectively; 3) in GDF Phase 2 (HMM parameter estimation) Step 4 (cluster labeling), the clusters with mean consumption between 1.2 and 6 gallons and frequency greater than twice per day were labeled as toilets; the clusters with mean consumption between 8 and 30 gallons were labeled as showers; the clusters with mean consumption between 30 and 55 gallons were labeled as washers; the clusters with frequency smaller than once per day were disregarded; and the remaining clusters were labeled as "others"; 4) the number of states in the HMM was decided automatically (see GDF Phase 2 Step 3). Note that all the preceding parameters were decided based on domain experience.
For the kNN-GMM-based approach, the event extraction phase was the same as that in the HMM-based approach. Note that the same event extraction process was also used in all other compared approaches. The kNN classifier used in the experiments was provided by the MATLAB 2008a Bioinformatics Toolbox. One major parameter is the number of nearest neighbors used in the classification. We applied 10-fold cross-validation to select the best k from the candidate values 5 to 15.
For the ANN-GMM-based approach, the neural network classifier was provided by the MATLAB 2008a Neural Network Toolbox. We used one-per-class coding for multiclass classification: each output neuron is designated the task of identifying a given class, and the target output code is 1 at that neuron and 0 at the others. We used Levenberg-Marquardt backpropagation, the default training algorithm in MATLAB. 10-fold cross-validation was used to select the best value of the parameter "number of hidden layers" in the range from 2 to 8 layers; other parameters kept their default settings. Note that another popular training algorithm is gradient descent backpropagation, with two major parameters, the learning rate and the number of hidden layers. We also tried this training algorithm in our experiments, but the results indicate that the Levenberg-Marquardt method is more accurate and efficient. For the SVM-GMM-based approach, the SVM classifier was provided by LIBSVM [222]. We used the popular radial basis function as the kernel, which has two parameters, cost (c) and gamma (g). These were tuned by 10-fold cross-validation, with the best pair selected from combinations over the ranges log2(c) = 1 : 0.25 : 5 and log2(g) = −7 : 0.25 : −1. We used the "one-against-one" method for multiclass classification.
Two baseline approaches, named random-pick and knapsack-based, were applied to evaluate the effectiveness of the above four proposed methods. The random-pick method is described as follows: first, conduct the same event extraction as in the HMM-based method; second, label the events with consumption smaller than 2 gallons as sink uses; third, randomly label the remaining events as toilet, shower, or washer uses.
The knapsack-based method is described as follows: first, conduct the same event extraction as in the HMM-based method; second, match each segment to the best combination of the following activities: "Toilet-old (1.6 gallons)", "Toilet-new (4 gallons)", "Shower-Low-flow (15 gallons)", "Shower-Standard (30 gallons)", "Laundry (50 gallons)", and "Sink (<=1.6 gallons)".
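With only five nominal fixture sizes, the knapsack step can even be brute-forced (a sketch of ours; the per-fixture cap of 3 uses per segment is an assumption for tractability, not from the text):

```python
from itertools import product

FIXTURES = {"Toilet-old": 1.6, "Toilet-new": 4.0, "Shower-Low-flow": 15.0,
            "Shower-Standard": 30.0, "Laundry": 50.0}

def best_combination(total, max_count=3):
    """Choose the multiset of fixture uses whose nominal consumptions
    sum closest to the segment total; any small residual would be
    attributed to sink use (<= 1.6 gallons)."""
    names = list(FIXTURES)
    best, best_err = {}, float("inf")
    for counts in product(range(max_count + 1), repeat=len(names)):
        s = sum(c * FIXTURES[n] for c, n in zip(counts, names))
        if abs(total - s) < best_err:
            best, best_err = dict(zip(names, counts)), abs(total - s)
    return best
```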
6.5.3 Effectiveness Comparison
To demonstrate the effectiveness of proposed approaches, we used the labeled activities from water
journaling and the simulation datasets as ground truth, and compared the proposed approaches.
The comparison was conducted among 4 versions of disaggregation approaches, HMM, kNN-GMM,
ANN-GMM, and SVM-GMM; and the two baseline solutions, random pick and knapsack. Cross
validation was applied to find the best parameters for the corresponding classification methods.
As shown in Table 6.3, all the proposed approaches achieved about 95% precision on shower identification, while the recall was relatively low (77–81%). This is because the deviation of shower consumption is very high in real life. In many cases, the consumption of a shower may be similar to that of two toilet flushes, or to a front-load washer load. Therefore, some true showers could not be correctly
Table 6.3: Precision, Recall, and F-measure on Simulation Data, reported as Mean (Standard Deviation)

Method        Metric      Toilet          Shower          Washer
HMM           Precision   0.7704 (0.08)   0.9471 (0.04)   0.7839 (0.06)
              Recall      0.6651 (0.04)   0.7883 (0.04)   0.9610 (0.04)
              F-measure   0.7110 (0.04)   0.8594 (0.03)   0.8620 (0.04)
kNN-GMM       Precision   0.7291 (0.07)   0.9552 (0.02)   0.8536 (0.06)
              Recall      0.8552 (0.03)   0.7723 (0.05)   0.8937 (0.09)
              F-measure   0.7850 (0.04)   0.8530 (0.03)   0.8702 (0.06)
ANN-GMM       Precision   0.5982 (0.05)   0.9584 (0.03)   0.8554 (0.08)
              Recall      0.8709 (0.03)   0.7670 (0.06)   0.8994 (0.12)
              F-measure   0.7075 (0.04)   0.8505 (0.04)   0.8710 (0.09)
SVM-GMM       Precision   0.4669 (0.07)   0.9622 (0.02)   0.8613 (0.06)
              Recall      0.8873 (0.02)   0.8057 (0.05)   0.9329 (0.06)
              F-measure   0.6086 (0.06)   0.8761 (0.03)   0.8940 (0.04)
Random Pick   Precision   0.1022 (0.03)   0.1514 (0.03)   0.0737 (0.02)
              Recall      0.0531 (0.01)   0.1608 (0.04)   0.3237 (0.10)
              F-measure   0.0699 (0.02)   0.1560 (0.03)   0.1201 (0.07)
Knapsack      Precision   0.0655 (0.01)   0.4570 (0.05)   0.8619 (0.16)
              Recall      0.1534 (0.02)   0.3294 (0.05)   0.3516 (0.13)
              F-measure   0.0918 (0.02)   0.3828 (0.05)   0.4995 (0.19)
identified. But once an activity is labeled as a shower, it is very likely to be true. Although these four methods performed similarly on labeling showers, SVM-GMM achieved the highest scores.

Different from showers, washer loads were disaggregated with very high recall (89–96%) and relatively low precision (78–86%). Generally, the clothes washer is the heaviest and at the same time the least frequent water-consuming activity in a household. Based on the specifications and settings of a washer, its water consumption is usually consistent. That is why almost all of the washer instances could be learned and identified. On the other hand, a washer usage usually crosses multiple intervals. This usage pattern may be similar to certain combinations of other consumption; therefore, some other consumption was classified as washer by the disaggregation approaches. Overall, SVM-GMM achieved the best performance, and HMM obtained the highest recall.
Detecting toilet flushes is the most difficult task compared to showers and washers. Because toilet usage typically happens very frequently and consumes a small amount of water, it is hard to distinguish from sink usage within a 15-minute interval, or to identify when combined with heavy activities such as a shower or a washer load. All four approaches had F-measures between 61% and 78%. HMM was the only approach with precision higher than recall. kNN-GMM performed the best in terms of F-measure.
Due to the small amount of training data (<= 4 days per house), the GMM-based approaches failed to disaggregate consumption for the volunteer households. As shown in Table 6.4, HMM perfectly identified the washer usage and disaggregated showers with high scores. The F-measure for toilet disaggregation with HMM reached only 55%, although this is still much better than the baselines.
Table 6.4: Precision, Recall, and F-measure on Volunteers, reported as Mean (Standard Deviation)

Method        Metric      Toilet          Shower           Washer
HMM           Precision   0.516 (0.27)    0.831 (0.138)    1 (0)
              Recall      0.597 (0.17)    0.818 (0.144)    1 (0)
              F-measure   0.5536 (0.22)   0.8244 (0.14)    1 (0)
Random Pick   Precision   0.20 (0.18)     0.08 (0.09)      0.07 (0.09)
              Recall      0.19 (0.08)     0.19 (0.16)      0.29 (0.34)
              F-measure   0.1949 (0.13)   0.1126 (0.17)    0.1128 (0.27)
Knapsack      Precision   0.20 (0.10)     0.52 (0.34)      0.44 (0.52)
              Recall      0.904 (0.01)    0.47 (0.16)      0.23 (0.27)
              F-measure   0.3275 (0.05)   0.4937 (0.25)    0.3021 (0.39)
6.5.4 Impact of Sample Rate
Choosing an appropriate sample rate for smart meter deployment is an important decision that affects hardware and maintenance costs. This set of experiments provides practical suggestions from the perspective of activity analysis requirements. The reading intervals of the simulation datasets were varied from 15 minutes to 3 hours to evaluate their impact on the accuracy of the disaggregation results. Both the HMM and GMM methods were evaluated; SVM-GMM was selected to represent GMM because it had shown good accuracy and efficiency in the previous experiments. As suggested in Figure 6.5, both 15- and 30-minute intervals provide acceptable results. A 1-hour interval supports fair disaggregation of washer and shower uses, but cannot identify more than half of the toilet flushes.
Figure 6.5: Impact of Interval Length
6.5.5 Disaggregation for Pilot Households
The proposed HMM-based approach has been applied on 300+ pilot households with 15 minute meter
readings. Hidden Markov models were constructed for each household, and water consumption
since August 2010 was disaggregated into activities to provide insights to residents and the city
management team. Some interesting usage patterns discovered from the disaggregation results are
illustrated in the following paragraphs.
Figure 6.6: Distribution vs. Demographic Info
By combining the disaggregation results with demographic survey results, we first summarize the consumption distribution of different types of households in pie charts, as shown in Figure 6.6. Each pie chart shows the portion of water each activity used for a given group of households. The consumption that cannot be disaggregated is included in the category 'others'. The consumption distribution of all the pilot households is illustrated in Figure 6.6 a), where toilets and showers each used about 30%, and washers used about 25%. Households with a single occupant (Figure 6.6 b)) showed a different usage pattern, where showers consumed only 21% of the overall usage and washer usage was reduced to 22%. Figure 6.6 c) shows the pie chart for households with two adults only. Compared to the single-adult households, two-adult households consumed significantly more water on showers. On the other hand, kids in general caused more washer usage. As shown in Figure 6.6 d) and e), households with kids brought washer usage to 28%, and households with toddlers increased washer usage further to 30%. By comparison, a resident can easily figure out on which activity his or her household needs to put more effort to conserve water.
Temporal patterns of washer and shower usage have been identified from the disaggregation results. As shown in Figure 6.7, the pilot households preferred to use the washer on weekends, while on each weekday there was about 0.9 load per household on average. Not only the number of loads but also the size of each load increased on weekends. Figure 6.7 b) illustrates that each load on Saturday used 9% more water than a load on Tuesday or Wednesday. This is reasonable because heavy laundry is usually saved for the weekend.
(a) Daily Occurrences (b) Gallons per Load
Figure 6.7: Washer Usage vs. Day of Week
Similar to washer usage, as can be seen in Figure 6.8 a), more showers happened on weekend days. However, interestingly, an average shower on Sunday used the least water of the week, 10% less than one on Saturday. Furthermore, showers on Friday consumed the most water in the week. It seemed that people wanted to relax and enjoy longer showers on Friday, while the stress of the coming work week arrived early on Sunday.
(a) Daily Occurrences (b) Gallons per Load
Figure 6.8: Shower vs. Day of Week
Figure 6.9 shows the time-of-day distributions of shower and washer usage across the pilot households. As expected, the peaks of showers happened during 8–9 am and 6–7 pm, before and after work. Washer usage showed a similar distribution in b), although the pm peak was less significant. This consistency could be explained by many washer loads occurring right after a shower to wash the changed clothes.
(a) Shower (b) Washer Usage
Figure 6.9: Shower/Washer vs. Time of Day
6.6 Related Work
Non-intrusive load monitoring has been proposed based on analyzing steady-state and transient-state changes. So far, most research effort has focused on electricity load disaggregation at high sample rates [246,279,215,273,284,213,216]. A power meter with a high sample rate (>= 1 Hz) can identify most of the state changes of multiple metrics (e.g., power, reactive power, voltage, and harmonics) caused by individual appliances in a real-world home. Based on state changes of current and voltage, a non-intrusive load monitoring approach [246] was proposed to determine the power consumption of individual appliances. An electrical noise sensor has been used to disaggregate consumption by running an SVM on the transient noise of appliances turning on and off [279]. By measuring the voltage of each outlet in a house, one approach [213] applied kNN and SVM to classify appliances. This approach collected the peak, average, and RMS of voltage of a single target at a 4 kHz sample rate, and achieved its best results using an NN classifier.
proposed to identify appliances with 90% accuracy using only the main power meter [215,216]. The
features it used consist of power, reactive power, voltage RMS, and harmonics for state transition.
RECAP has recently been proposed using artificial neural network (ANN) to disaggregate electricity
usage [284]. Features including power factor, peak and RMS of voltage and current were aggregated
every minute and analyzed in a 3-layer ANN. To extract better features, the Matrix Pencil method [273] has been proposed to model each signal in the complex plane and use residues and poles as features for disaggregation. Improved disaggregation results have been demonstrated.
Compared with electricity disaggregation, residential water disaggregation has attracted much less research effort. To the best of our knowledge, there has not been any design that can disaggregate water consumption using a single water meter or at a sample rate lower than 500 Hz. Microphone-based sensors were applied to the major water pipes (cold inlet, hot inlet, and sewer) to recognize usage activities [237]. By combining the timestamps at which these microphones detect noise, the authors identified most of the water usages. However, this approach has difficulty disaggregating concurrent activities and cannot determine water volume. Integration of a water meter and a
network of accelerometers [261] has been proposed to estimate the flow rates based on pipe vibration.
This approach has been applied in laboratory environments to disaggregate water usage. To avoid
accessing water pipes, an approach using pressure sensor on main source [238] was proposed to
identify fixtures. This approach applies hierarchical classifiers to first detect valve open and close
events, and then label fixtures. Due to the 1 kHz sample rate, it can clearly capture on and off
signals of fixtures from water pressure.
Chapter 7
Application 2: Wireless Passive Device Fingerprinting using Infinite Hidden Markov Random Field
This chapter presents a new concept of device fingerprinting (or profiling) to enhance wireless security using an Infinite Hidden Markov Random Field (iHMRF). Wireless device fingerprinting is an emerging approach for detecting spoofing attacks in wireless networks. Existing methods utilize either time-independent features or time-dependent features, but not both concurrently, due to the complexity of their different dynamic patterns. In this chapter, we present a unified approach to fingerprinting based on iHMRF. The proposed approach is able to model both time-independent and time-dependent features, and to automatically detect the number of devices, which varies dynamically. We propose the first iHMRF-based online classification algorithm for wireless environments, using variational incremental inference, micro-clustering techniques, and batch updates. Extensive simulation evaluations demonstrate the effectiveness and efficiency of this new approach.
The rest of the chapter is organized as follows. Section 7.1 introduces the background of the problem. Section 7.2 formalizes the fingerprinting problem based on both time-dependent and time-independent features. Section 7.3 discusses theoretical preliminaries, including the Hidden Markov Random Field (HMRF) and the infinite Gaussian Mixture Model (iGMM). Section 7.4 formulates an infinite hidden Markov random field (iHMRF) model for the fingerprinting problem, and Section 7.5 presents a new incremental inference algorithm for wireless streaming environments. Empirical validations of our proposed fingerprinting framework are presented in Section 7.6. The chapter concludes and discusses future work in Section 7.7.
7.1 Introduction
Nowadays, the proliferation of mobile devices is moving wireless networks toward an "anytime-anywhere" mobile service model. However, the open nature of wireless networks renders them susceptible to various types of spoofing attacks. For example, adversaries can collect nodes' identity information by passively monitoring the network traffic, and then masquerade as legitimate nodes to disrupt network operations. Various attacks can be launched, such as packet injection [242], the Sybil attack [231], the masquerade attack [235], etc. These identity-based attacks may hinder normal communication and result in privacy leakage, which can lead to a huge outbreak of cybercrimes. As a result, detecting the presence of identity spoofing has become a critical issue.
Existing solutions for detecting identity spoofing attacks fall into two categories: active detection
and passive detection. Active detection injects additional messages into the network, such as the
challenges and responses used in cryptography-based schemes for user authentication. In the case
that an entire node is compromised and its cryptographic keys are exposed, location-related
information can be used to facilitate node authentication. For example, in [71], the specific
chipset, firmware, or driver of an 802.11 wireless device can be identified by watching its
responses to crafted malformed 802.11 frames. However, the downside of active detection methods
lies in their requirement for extra message exchanges, which accelerates energy usage and consumes
available bandwidth. In addition, the responses can also be spoofed if they are device dependent.
In contrast, passive detection methods extract device-specific features from message transmissions,
which can be categorized as time-independent and time-dependent features. Their main strength is
that these features are device dependent and hence can be used as a unique pattern to fingerprint
a specific device. In particular, time-independent features include clock skew (observed from message
time stamps), sequence number anomalies (in MAC frames), timing (of probe frames for channel
scanning), and various RF parameters (transient phases at the onset of transmissions, frequency
offsets, phase offsets, I/Q offsets, etc.) [275]. Time-dependent features include radio signal strength
(RSS), angle of arrival, time of arrival, differential received signal strength, frequency difference
of arrival, etc. Note that time-independent features refer to signal measurements that have constant
mean values and are only randomized by white noise over time, whereas time-dependent features
refer to signal measurements whose mean values vary over time due to their inherently dynamic
nature.
The fingerprinting methods based on time-independent features [73, 275, 235, 287, 225, 297], though
varied in implementation, basically assume that the features of each device form a cluster, which
can be regarded as a unique fingerprint pattern identifying the device. The two most recent works
were conducted by Brik et al. [73] and Nguyen et al. [275]. Brik et al. [73] proposed the Passive
RAdio-metric Device Identification System (PARADIS), utilizing modulation-domain radio-metrics
such as carrier frequency error and I/Q offset. Nguyen et al. [275] further proposed an unsupervised
clustering method based on a non-parametric Bayesian method and the infinite Gaussian mixture
model, which can automatically determine the number of clusters. To summarize, time-independent
features can be regarded as accurate and robust wireless signatures for particular devices. However,
the fingerprinting methods using time-independent features also have some limitations. For example,
these features are much harder to extract; usually, high-end measurement devices are required to
perform feature extraction. Moreover, the accuracy of these features relies on the precision of the
measurement devices. Therefore, although time-independent features are accurate wireless
signatures, the extracted features might include some errors due to the limitations of wireless
measurements.
For time-dependent features, the most popular family of device identification methods is RSS-based.
In [72], a geographic-location-based identification technique against masquerading threats was
employed, with two alternative approaches: the distance ratio test (DRT), which utilizes the
received signal strength (RSS) of a device, and the distance difference test (DDT), which relies
on the received signal's relative phase difference when the signal is received at different devices.
Zhao et al. [70] proposed a radio environment map (REM), a comprehensive database of geographical
features, available services, spectral regulations, locations, and the activities and policies of
radio devices. Identification of a cognitive radio (CR) node through an analysis of the transmitted
signal is investigated in [69], where the wavelet transform is utilized to identify the transmitter
fingerprint. However, RSS measurements are time varying and only provide coarse spatial resolution.
Therefore, due to their dynamic nature, time-dependent features such as RSS cannot be regarded as
accurate and reliable wireless signatures alone.
The goal of this chapter is to improve existing detection methods by considering additional features
that could potentially improve fingerprinting performance. Studies have shown that both
time-independent features (e.g., frequency difference and phase shift difference) and time-dependent
features (e.g., RSS and time difference of arrival) can be used for spoofing detection
[73, 235, 275, 287, 225, 297, 296, 298]. In this chapter, we propose to concurrently model all the
useful features in a unified statistical framework based on the infinite hidden Markov random field
(iHMRF). All the device-dependent features can be categorized into time-independent and
time-dependent features. The autocorrelation of time-dependent features is captured by the so-called
Markov property of the iHMRF, in which data points that are similar in their time-dependent features
tend to have consistent cluster labels. The time-independent features are captured through embedded
Gaussian mixtures in the iHMRF. The main contributions of this work can be summarized as follows:
1. Design of a unified fingerprinting framework. To the best of our knowledge, this is the
first statistical approach to model both time-dependent and time-independent features in a
systematic framework for device fingerprinting.
2. Formulation of the fingerprinting problem via iHMRF modeling. We propose a novel
application of the iHMRF model to the device fingerprinting problem that captures correlations
on time-dependent features using the Markov property, and correlations on time-independent
features using an embedded Gaussian mixture model.
3. Design of an online learning algorithm. We propose a new online classification algorithm
for the fingerprinting problem based on variational incremental inference, micro-clustering
techniques, and batch updates.
4. Comprehensive empirical validations. We conducted extensive simulations on a variety
of scenarios to validate the effectiveness and efficiency of our proposed techniques, comparing
against existing state-of-the-art methods.
7.2 Related Work
A large body of literature has been dedicated to the issue of wireless device identification for detecting
spoofing attacks. In this section, we review the most relevant work in the literature. Based on
the types of features utilized, we classify these methods into two categories: radio-metric based
methods and radio signal strength (RSS) based methods.
7.2.1 Radio-metric Based Device Fingerprinting
In [73], Brik et al. proposed the Passive RAdio-metric Device Identification System (PARADIS)
utilizing modulation domain radio-metrics, such as carrier frequency error, I/Q offset, etc. The
experimental results show that these device dependent radio-metrics can effectively differentiate
devices. However, this method requires a training phase to collect the fingerprints of legitimate
nodes. Nguyen et al. [275] further proposed an unsupervised clustering method based on non-
parametric Bayesian method and infinite Gaussian mixture model. Without knowing the number of
devices, this method can automatically identify different devices by clustering their emitted packets
into different clusters. Our method also builds upon a non-parametric Bayesian framework for
unsupervised clustering. However, our method not only considers device-dependent radio-metrics,
but also takes device-independent features into consideration, thereby greatly improving device
identification performance.
7.2.2 RSS Based Device Fingerprinting
Compared with radio-metric features, the RSS feature is much easier to obtain, which makes RSS a
popular feature for device fingerprinting. Faria et al. [235] demonstrated strong correlations
between RSS signals and the physical location of devices, and proposed to use a signalprint, a
vector of RSS values measured by surrounding Access Points (APs), to identify wireless devices for
detecting spoofing attacks. Sheng et al. [287] extended [235] and applied a Gaussian mixture model
to identify clusters of the RSS readings. Chen et al. [225] used RSS and K-means cluster analysis.
In both [287] and [225], the number of clusters needs to be predefined. Later, Yang et al. [297]
proposed two cluster-based mechanisms that can automatically determine the number of clusters.
However, the aforementioned methods [235, 287, 225, 297] only work in a static network (i.e., each
device is fixed at a specific location) and may raise a large number of false alarms in a mobile
network, since RSS profiles may change over time due to device mobility. To capture the
time-dependent property of RSS, Yang et al. [296] proposed the DEMOTE system, which partitions the
RSS trace of a node identity into two separate traces, one related to a genuine node and the other
to a potential attacker. If the correlation between the two traces is lower than a threshold, an
alarm is raised. They focused on two-class situations where one genuine node and one attacker share
a single identity (e.g., MAC address); this solution may not be applicable to situations with
multiple attackers sharing the same identity. Zeng et al. [298] proposed a reciprocal channel
variation-based identification (RCVI) technique to detect spoofing attacks in mobile wireless
networks. RCVI applies location de-correlation and reciprocal channel variation to detect the
originating devices of all packets. However, this method assumes bidirectional communication
between the genuine and the victim nodes. Therefore, it is not a completely passive detection
method and requires senders to send the RSS information, which may cause unnecessary network
overhead.
Our work also focuses on dynamic mobile networks. We observe that the above RSS-based solutions
for mobile networks share two further limitations. First, they implicitly assume that wireless
devices and access points (APs) communicate periodically, so that high-sample-rate location
features (e.g., RSS, TDOA) can be extracted. Second, they both incorporate device identity
information (e.g., the MAC address) into their fingerprinting process. The use of forgeable user
identity information may make these methods vulnerable to advanced spoofing attacks. For example,
an attacker may inject packets with randomly assigned MAC addresses into the wireless network;
this attack will be hard to detect if the victim devices related to these MAC addresses are
evaluated separately. In contrast, our method takes the low-sampling-rate case into consideration.
In addition, we disregard forgeable user identity information in our fingerprinting framework.
7.3 Features for Device Fingerprinting

Device fingerprinting utilizes a set of unique device features that, when exploited, can
differentiate wireless devices. Fingerprinting features can be classified in several ways. For
example, they can be categorized as time-dependent or time-independent: as the names suggest, some
features vary over time, whereas others remain unchanged. Features can also be device dependent or
device independent. Finally, one can distinguish transmitter fingerprinting from receiver
fingerprinting: transmitter fingerprints differ from the receiver's radio-metric parameters (such
as received power) in that they are unique to the transmitter and are not altered by the channel
condition or receiver structure.

In this section we briefly discuss notable features that can be exploited for iHMRF-based device
fingerprinting. Typical common features for signal measurement and classification are
angle-of-arrival (AOA), received signal strength (RSS), time-of-arrival (TOA), and
frequency-of-arrival (FOA). Sometimes, however, difference measurement features are better suited
for creating traces for particular applications, for example time-difference-of-arrival (TDOA),
frequency-difference-of-arrival (FDOA), differential received signal strength (DRSS), and phase
shift difference (PSD).
7.3.1 Time Measurement
The time required for a signal to travel from the transmitter (client or node) to the receiver
(anchor or access point) is directly proportional to the distance between them; time-of-arrival
(TOA) and time-difference-of-arrival (TDOA) follow this principle. Propagation time measurement
requires synchronization between transmitter and receiver, and knowledge of the transmission and
reception times at one position. Time difference measurements, on the other hand, eliminate the
need for the node to be synchronized to the anchors, but require synchronization between the
anchors and do not directly give the distance between transmitter and receiver. For trilateration,
observations are converted to distances via d = cτ, where d is the distance, τ is the observed
time of flight (receive time minus transmit time), and c is the propagation speed. The distances
relate to positions by

\[ d_m = \|(x, y) - (x_m, y_m)\|_2, \quad m = 1, 2, 3, \tag{7.1} \]

where (x, y) is the client position, (x_1, y_1), (x_2, y_2), and (x_3, y_3) are the anchor
positions, and \(\|(x, y)\|_2 = \sqrt{x^2 + y^2}\). Here we have three non-linear equations with
two unknowns, and it can be shown that there is a single solution. Solving the equations requires
a more advanced algorithm unless a linearization technique is applied. Using two observation
points, the TDOA observation can be converted by

\[ d = d_1 - d_2 = \|(x, y) - (x_1, y_1)\|_2 - \|(x, y) - (x_2, y_2)\|_2. \]

The key sources of time measurement errors are: 1) synchronization error due to an imperfect
reference clock; 2) measurement error, such as error in determining the exact time of arrival of
the signal and signal fading (i.e., multipath); and 3) environmental errors (e.g.,
non-line-of-sight propagation) that add delay unrelated to distance.
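The linearization alluded to above can be made concrete: subtracting the first circle equation in (7.1) from the other two cancels the quadratic terms and leaves a 2x2 linear system in (x, y). The NumPy sketch below is illustrative only; the anchor layout, propagation speed, and function names are assumptions, not part of the dissertation.

```python
import numpy as np

def toa_trilaterate(anchors, toa, c=3e8):
    """Estimate (x, y) from TOA at three anchors by linearizing d_m = |p - p_m|."""
    d = c * np.asarray(toa)                  # convert times of flight to distances
    (x1, y1), (x2, y2), (x3, y3) = anchors
    # Subtracting equation 1 from equation m cancels x^2 + y^2:
    # 2(x_m - x_1)x + 2(y_m - y_1)y = d_1^2 - d_m^2 + x_m^2 - x_1^2 + y_m^2 - y_1^2
    A = 2 * np.array([[x2 - x1, y2 - y1],
                      [x3 - x1, y3 - y1]])
    b = np.array([d[0]**2 - d[1]**2 + x2**2 - x1**2 + y2**2 - y1**2,
                  d[0]**2 - d[2]**2 + x3**2 - x1**2 + y3**2 - y1**2])
    return np.linalg.solve(A, b)

anchors = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
client = np.array([30.0, 40.0])
toa = np.linalg.norm(anchors - client, axis=1) / 3e8   # noiseless times of flight
print(np.round(toa_trilaterate(anchors, toa), 3))      # [30. 40.]
```

With noisy TOA measurements and more than three anchors, the same system becomes overdetermined and would be solved by least squares instead of an exact solve.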
7.3.2 Frequency Measurement
Measuring ∆f, the difference between the carrier frequency of the received signal and that of the
transmitted signal, can provide an estimate of the device's whereabouts. The frequency difference
is a strong feature, since each wireless transmitter has its own oscillator, and each oscillator
creates a unique carrier frequency. The frequency shift of the received signal is related to the
velocity vector of the transmitter relative to the receiver; this mobility of the transmitter
introduces a Doppler effect that smears the signal frequency, which can be measured.

Figure 7.1: Illustration of phase shift difference for the constellations of QPSK symbols of two
transmitters (Device 1 and Device 2)

Frequency differences are more commonly used and are obtained from the Cross Ambiguity Function

\[ C(\Delta f, \Delta t) = \int_0^T x(t)\, x^*(t + \Delta t)\, e^{-j\pi \Delta f t}\, dt. \tag{7.2} \]

This feature differs from time-dependent features in that the observation points for the
frequency/phase shift feature must be in relative motion with respect to each other and the source,
and FDOA can be calculated by

\[ f = f_1 - f_2 = \frac{v_1}{\lambda} \cos\theta_1 - \frac{v_2}{\lambda} \cos\theta_2. \tag{7.3} \]

A major drawback of this measurement feature is that large amounts of data must be moved between
observation points, or to a central position, to perform the cross-correlation necessary to
estimate the frequency shift. Other common sources of frequency measurement errors are: 1) an
imperfect frequency reference, 2) measurement errors such as noise and multipath, and 3) the
non-stationary nature of the frequency.
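As a toy illustration of extracting a frequency-offset feature, the sketch below locates the spectral peak of two simulated captures and differences them. The FFT peak search is a coarse stand-in for searching the full ambiguity surface of (7.2); the sample rate, tone frequencies, and noise level are hypothetical.

```python
import numpy as np

def estimate_freq_offset(x, fs):
    """Locate the dominant spectral peak of a complex baseband capture --
    a coarse, FFT-based stand-in for peak-picking on the ambiguity surface."""
    spectrum = np.fft.fft(x)
    freqs = np.fft.fftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(np.abs(spectrum))]

fs = 2048.0                          # sample rate (Hz); gives 1 Hz FFT resolution
t = np.arange(2048) / fs
rng = np.random.default_rng(0)
# two "devices" whose oscillators sit 250 Hz apart, observed in noise
x1 = np.exp(2j * np.pi * 100.0 * t) + 0.1 * rng.standard_normal(t.size)
x2 = np.exp(2j * np.pi * 350.0 * t) + 0.1 * rng.standard_normal(t.size)
df = estimate_freq_offset(x2, fs) - estimate_freq_offset(x1, fs)
print(df)   # 250.0
```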
7.3.3 Phase Shift Difference Measurement
In addition to the aforementioned methods, one can differentiate devices by looking into a device's
I/Q phase characteristics. Ideally, the phase shift from one constellation point to a neighboring
one is 180° for BPSK modulation and 90° for QPSK modulation. In practice, the I-phase and Q-phase
characteristics differ: the constellation may deviate from its original position due to hardware
variability, and different devices have different constellations. Therefore, this feature can be
measured and used as a classifier as well. Figure 7.1 shows an illustrative example of device
signal constellations.

In this example we used QPSK as the modulation of choice and considered features extracted from
the QPSK constellation. In QPSK, four symbols with different phases are transmitted, where each
symbol represents two bits. Mathematically, the transmitted symbol can be represented as

\[ s_i(t) = \sqrt{\frac{2E_s}{T}} \cos\!\left( 2\pi f_c t + \frac{(2n - 1)\pi}{4} \right), \tag{7.4} \]

where E_s is the transmission power, T is the symbol period, f_c is the carrier frequency, and n
is the index of the four possible constellation points. By changing n, we can vary the phase of
the signal, creating the four phases π/4, 3π/4, 5π/4, and 7π/4. In the ideal case, the phase shift
from one symbol to its neighbor is 90°. However, the transmitter amplifiers for the I-phase and
Q-phase might differ; consequently, the phase shift can have some variance.
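The deviation from the ideal 90° step can be computed directly from a distorted constellation. The sketch below models the I/Q amplifier mismatch as a simple gain imbalance; the gain values and function names are illustrative assumptions, not the dissertation's measurement procedure.

```python
import numpy as np

def constellation_phases(iq_gain_i, iq_gain_q):
    """Phases of the four QPSK constellation points for a transmitter whose
    I and Q amplifier gains differ (hypothetical hardware imperfection)."""
    ideal = np.exp(1j * (2 * np.arange(1, 5) - 1) * np.pi / 4)   # pi/4, 3pi/4, 5pi/4, 7pi/4
    distorted = iq_gain_i * ideal.real + 1j * iq_gain_q * ideal.imag
    return np.angle(distorted)

def phase_shift_feature(phases):
    """Deviation of the neighbor-to-neighbor phase steps from the ideal 90 degrees."""
    steps = np.degrees(np.diff(np.unwrap(phases)))
    return steps - 90.0

ideal = phase_shift_feature(constellation_phases(1.0, 1.0))     # perfect device
skewed = phase_shift_feature(constellation_phases(1.0, 0.8))    # Q amplifier 20% weak
print(np.round(ideal, 2))    # zero deviations for the ideal device
print(np.round(skewed, 2))   # nonzero deviations fingerprint this device
```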
7.3.4 Angle of Arrival Measurement
The direction of a node (client or device) relative to an access point (anchor) is given by the
observed angle-of-arrival (AOA, or DOA), which can be used to create a trace of the device by
calculating the node's position, or by determining the angle of the node's position relative to
the access point. This process is called 'triangulation'; it requires a minimum of two anchors and
a reference coordinate, and the position can be calculated from the two linear equations

\[ y = \tan\theta_1\, x + (y_1 - \tan\theta_1\, x_1), \]
\[ y = \tan\theta_2\, x + (y_2 - \tan\theta_2\, x_2), \tag{7.5} \]

where θ_1 and θ_2 are the angles between the device and the anchors, and (x_1, y_1) and
(x_2, y_2) are the locations of the two anchors.
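The two bearing lines in (7.5) can be intersected with a small linear solve. The anchor positions and bearings below are made-up test values for illustration.

```python
import numpy as np

def triangulate(theta1, theta2, a1, a2):
    """Position from two AOA observations (eq. 7.5): intersect the bearing
    lines y = tan(theta_m) * x + (y_m - tan(theta_m) * x_m)."""
    (x1, y1), (x2, y2) = a1, a2
    t1, t2 = np.tan(theta1), np.tan(theta2)
    A = np.array([[-t1, 1.0],
                  [-t2, 1.0]])
    b = np.array([y1 - t1 * x1, y2 - t2 * x2])
    return np.linalg.solve(A, b)

# client at (50, 50); anchors at the origin and (100, 0) measure its bearing
a1, a2 = (0.0, 0.0), (100.0, 0.0)
theta1 = np.arctan2(50.0 - a1[1], 50.0 - a1[0])   # 45 degrees
theta2 = np.arctan2(50.0 - a2[1], 50.0 - a2[0])   # 135 degrees
print(np.round(triangulate(theta1, theta2, a1, a2), 3))   # [50. 50.]
```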
Features            | Time Independent                       | Time Dependent
--------------------|----------------------------------------|----------------------------------------
Device Dependent    | Frequency-of-arrival (FOA),            | Radio Signal Strength (RSS),
                    | I/Q Offset,                            | Signal Noise Ratio (SNR)
                    | Phase Shift Difference (PSD),          |
                    | Carrier Frequency Offset (CFO)         |
Device Independent  |                                        | Time-Difference-Of-Arrival (TDOA),
                    |                                        | Time-Of-Arrival (TOA),
                    |                                        | Angle-Of-Arrival (AOA),
                    |                                        | Frequency-Difference-Of-Arrival (FDOA)

Table 7.1: Device Fingerprinting Features
Possible sources of AOA errors are reference error (what is east?), measurement error due to
thermal noise, and environmental error (non-line-of-sight propagation).
7.3.5 Radio Signal Strength (RSS) Measurement
In free space, signal power decays with distance according to a power law, so distance can be
roughly estimated from the received signal strength. Translating an RSS measurement into a
distance requires knowledge of the transmit power (i.e., a reference value) and knowledge of the
relationship between distance and power decay (a propagation model):

\[ P_r(d) = P_0 - 10\, n \log_{10}\!\left(\frac{d}{d_0}\right) + X_\sigma, \tag{7.6} \]

where P_0 is the received power at the reference distance d_0, P_r is the observed received power,
d is the distance, n is the path loss exponent, and X_σ is a shadowing term. Trilateration from
RSS is done in the same way as for time measurements, except that the observations are converted
to distances by

\[ d = d_0 \cdot 10^{\frac{P_0 - P_r}{10 n}}. \tag{7.7} \]

Differential RSS measurements eliminate the need for transmit-power knowledge and can provide
improved performance under correlated shadowing. The key limitations of this feature are:
1) imperfect knowledge of the transmit power or antenna gain; 2) measurement errors such as signal
fading (i.e., multipath), interference, and thermal noise; 3) environmental errors (e.g.,
non-line-of-sight propagation), such as shadowing, which bias the resulting distance estimate;
and 4) imperfect knowledge of the propagation exponent (model error).
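The round trip between the log-distance model and its inversion can be sketched as follows. The reference power, path-loss exponent, and shadowing standard deviation are hypothetical values chosen for illustration.

```python
import numpy as np

def distance_to_rss(d, p0=-40.0, d0=1.0, n=3.0, sigma=0.0, rng=None):
    """Forward log-distance model: Pr(d) = P0 - 10 n log10(d / d0) + X_sigma."""
    noise = 0.0 if rng is None else rng.normal(0.0, sigma)   # shadowing term
    return p0 - 10 * n * np.log10(d / d0) + noise

def rss_to_distance(pr, p0=-40.0, d0=1.0, n=3.0):
    """Invert the model: d = d0 * 10**((P0 - Pr) / (10 n))."""
    return d0 * 10 ** ((p0 - pr) / (10 * n))

d = 25.0
pr = distance_to_rss(d)                        # noiseless reading
print(round(rss_to_distance(pr), 6))           # 25.0: round trip recovers the distance
rng = np.random.default_rng(1)
pr_noisy = distance_to_rss(d, sigma=4.0, rng=rng)
print(round(rss_to_distance(pr_noisy), 1))     # estimate perturbed by shadowing
```

The second estimate illustrates limitation 3): a few dB of shadowing translates into a multiplicative distance error, which is why RSS alone gives only coarse spatial resolution.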
Interestingly, the channel gain can be used as a trait as well. The amplitude of the received
signal is proportional to the channel gain A_p. The general consensus is that signals transmitted
from the same device over a short duration tend to experience a similar amplitude or channel
effect, even though the absolute value of the amplitude is generally unknown. If the channel is a
Rayleigh-faded multipath channel, the channel gain can be expressed as

\[ A_p \cong d^{-\beta} h, \tag{7.8} \]

where h is the fading component, normally distributed as N(0, σ_h²), d is the distance from the
transmitting device to the sensing device, and β is the path loss exponent. Thus the received
signal gain A_p follows the distribution

\[ A_p \sim \mathcal{N}(0, d^{-2\beta} \sigma_h^2). \tag{7.9} \]

A notable difference is that channel characteristics alone do not directly reveal the locations of
devices; rather, A_p serves as one more feature for identification.
The aforementioned features are generic to most radio technologies. A few other features can be
used for specific technologies; for example, the second-order cyclostationary feature of an OFDM
signal can be used for identification.
7.4 Problem Formulation
Suppose we are given a sequence of N packet feature vectors (x_1, s_1, t_1), · · · , (x_N, s_N, t_N),
where x_i ∈ R^p, s_i ∈ R^d, p and d refer to the numbers of time-independent and time-dependent
features, respectively, and t_i refers to the arrival time of the i-th packet at an access point.
The goal is to identify the sequence of hidden states (device labels) z_1, · · · , z_N, where
z_i ∈ {1, 2, · · · , C} refers to the hidden state of the packet feature vector (x_i, s_i, t_i),
and C refers to the total number of hidden states. There may exist some t_i for which the gap
between t_i and t_{i+1} is large, so that the dependence between s_i and s_{i+1} is highly degraded
because of the low collection rate. The number C of hidden states is unknown and will be estimated
using nonparametric Bayesian techniques.
Figure 7.2: Feature extraction from packets

The process of feature extraction is shown in Figure 7.2. Suppose multiple access points (APs) are
deployed across the network environment, collecting and sending traffic information to a
centralized server called a wireless appliance (WA). Each AP reports the RSS measurement for each
packet received, as well as other device-dependent features such as frequency difference and phase
shift difference. The WA receives all the information and creates a fingerprint feature vector for
each packet. Note that there may be some duplicated features reported by the APs, such as the
frequency differences of repeated packets received by different APs. We randomly select and keep
one version, since for device-dependent features all versions should exhibit similar patterns.
Several assumptions and constraints are stated as follows:
1. There is no training data available about the fingerprints of legitimate devices. The problem
will be addressed in a completely unsupervised manner.
2. The collection rate of RSS measurements may be unstable. Sometimes the collection rate will be
low, e.g., when devices are in standby status and there is no communication between the devices
and access points. Sometimes the collection rate will be high, e.g., when device users are making
calls, sending text messages, or surfing the internet.
3. The number of clients (devices) is unknown and dynamic. Current clients may leave the network,
and new clients may join the network, at any time.
4. A wireless network may have a large number of concurrent clients. We will need to evaluate the
impact of the number of concurrent clients on fingerprinting performance.
5. It is not allowed to add any additional out-of-band message exchanges. The problem will be
addressed using passive detection strategies.
6. Attackers have the ability to adjust transmission powers to increase localization uncertainty.
7. Attackers have the ability to masquerade as a large number of clients. Hence, we will not trust
device identity information and will only consider device-dependent features for fingerprinting.
7.5 Theoretical Backgrounds
This section introduces two basic statistical models: the Hidden Markov Random Field (HMRF) and
the infinite Gaussian Mixture Model (iGMM). These two models provide the theoretical foundations
for the Infinite Hidden Markov Random Field (iHMRF) that will be applied to wireless device
fingerprinting.
7.5.1 Hidden Markov Random Field
Suppose we have a set of observations (x_1, s_1), · · · , (x_N, s_N), where each observation
(x_i, s_i) has p features (x_i ∈ R^p) and d spatial coordinates (s_i ∈ R^d). Denote
X = {x_1, · · · , x_N} and S = {s_1, · · · , s_N}. The objective is to infer the latent variables
Z = {z_1, · · · , z_N} based on X and S, where z_i ∈ C, and C = {1, · · · , C} denotes the set of
class labels.

A Hidden Markov Random Field (HMRF) can be described as a two-layer hierarchical model consisting
of the latent layer Z and the observation layer X. For the latent layer, the HMRF models spatial
dependencies between the latent variables Z: nearby variables have higher correlations than
distant variables. The neighborhood relationship is decided based on closeness in the spatial
coordinates s_1, · · · , s_N, such as by the K-nearest-neighbors rule. This so-called Markov
property can be formulated as

\[ p(z_i = c \mid N(z_i); \beta) = \frac{1}{Z(\beta)} \exp\!\Big( -\sum_{c' \in C_i} V_{c'}(z_i = c, N(z_i) \mid \beta) \Big), \tag{7.10} \]

where Z(β) refers to a normalization constant, β is called the inverse temperature of the model,
N(z_i) refers to the neighbors of z_i, and C_i refers to the set of cliques that contain z_i as a
member. A clique c' is defined as any set of variables such that all the variables in c' are
neighbors of each other. V_{c'}(·) is called a clique potential, which is a measure of the
consistency of the variables in c'. A clique potential V_{c'}(Z | β) can be defined as

\[ V_{c'}(Z \mid \beta) = \beta \prod_{i, j \in c'} \delta(z_i - z_j). \tag{7.11} \]

The joint distribution p(Z | β) of an HMRF model is

\[ p(Z) = \prod_i p(z_i \mid N(z_i); \beta) = \frac{1}{Z(\beta)} \exp\!\Big( -\sum_{c' \in \mathcal{C}} V_{c'}(Z \mid \beta) \Big), \tag{7.12} \]

where Z(β) is a normalization constant and \(\mathcal{C}\) is the set of all cliques.

For the observation layer, the HMRF defines the conditional distribution p(X | Z) as

\[ p(X \mid Z; \Theta) = \prod_{i=1}^{N} p(x_i \mid z_i; \Theta_{z_i}), \tag{7.13} \]
\[ p(x_i \mid z_i; \Theta_{z_i}) = \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}), \tag{7.14} \]

where each observation x_i follows a Gaussian distribution conditioned on the latent variable z_i.
Each class is associated with a distinct Gaussian distribution, so there are C Gaussian mixture
components in total. Denote the parameters Θ = {Θ_c}_{c=1}^{C}, with Θ_c = {µ_c, Σ_c}.
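As a toy illustration of the two layers, the sketch below scores a labeling by combining Gaussian emission log-likelihoods with a Potts-style smoothness penalty. The potential uses the common "penalize disagreeing neighbor pairs" convention, which is only one way to instantiate the clique potentials above; the data, graph, and names are all hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def potts_energy(z, neighbors, beta=1.0):
    """Sum of pairwise clique potentials: each neighboring pair that
    disagrees contributes beta (a smoothness-favoring sign convention)."""
    return beta * sum(z[i] != z[j] for i, j in neighbors)

def log_joint(X, z, means, covs, neighbors, beta=1.0):
    """Unnormalized log p(X, Z) for an HMRF: Potts prior + Gaussian emissions."""
    emit = sum(multivariate_normal.logpdf(x, means[c], covs[c])
               for x, c in zip(X, z))
    return emit - potts_energy(z, neighbors, beta)

# middle point is equidistant from both component means, so only the
# Markov property breaks the tie between the two labelings
X = np.array([[0.1, 0.0], [2.5, 2.5], [0.0, 0.2]])
neighbors = [(0, 1), (1, 2)]                  # chain graph from spatial closeness
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2)] * 2
smooth = log_joint(X, [0, 0, 0], means, covs, neighbors)
rough = log_joint(X, [0, 1, 0], means, covs, neighbors)
print(smooth > rough)   # True: consistent neighbor labels score higher
```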
7.5.2 Infinite Gaussian Mixture Model
The Infinite Gaussian Mixture Model (iGMM), also named the Dirichlet Process Gaussian Mixture
Model (DPGMM), is an extension of the traditional Gaussian Mixture Model (GMM) that supports an
infinite number of Gaussian mixture components. Denote X = {x_1, · · · , x_N} as the observations
and Z = {z_1, · · · , z_N} as the latent class labels, where z_i ∈ {1, · · · , C}. Note that,
unlike in the HMRF, spatial coordinates (attributes) are not considered here. The iGMM can be
defined as

\[ v_c \mid \alpha \sim \mathrm{Beta}(1, \alpha), \quad c = 1, \cdots, \infty, \tag{7.15} \]
\[ \Theta_c \mid G_0 \sim G_0, \quad c = 1, \cdots, \infty, \tag{7.16} \]
\[ x_i \mid z_i = c; \Theta_c \sim \mathcal{N}(\mu_c, \Sigma_c), \tag{7.17} \]
\[ z_i \mid \pi(\mathbf{v}) \sim \mathrm{Multi}(\pi(\mathbf{v})), \tag{7.18} \]

where \(\pi_c(\mathbf{v}) = v_c \prod_{i=1}^{c-1} (1 - v_i)\). To interpret this model, we can
look at its data generating process:

1. Draw v_c | α ∼ Beta(1, α), c = 1, 2, · · · ,
2. Draw Θ_c = {µ_c, Σ_c} | G_0 ∼ G_0, c = 1, 2, · · · ,
3. For the i-th data point:
(a) Draw z_i | v_1, v_2, · · · ∼ Multi(π(v)),
(b) Draw x_i | z_i = c ∼ N(µ_c, Σ_c).

In particular, step 1 samples a countably infinite set of random variables v from the beta
distribution Beta(1, α), where α is a hyper-parameter. The prior probabilities π(v) can then be
calculated as

\[ \pi_c(\mathbf{v}) = v_c \prod_{i=1}^{c-1} (1 - v_i), \quad c = 1, 2, \cdots. \tag{7.19} \]

Step 2 samples the model parameters Θ_c for each mixture component c from a base distribution
G_0, which is defined as

\[ \Sigma_c \sim \mathrm{InverseWishart}_{\upsilon_0}(\Lambda_0), \tag{7.20} \]
\[ \mu_c \sim \mathcal{N}(\mu_0, \Sigma_c / K_0), \tag{7.21} \]

where υ_0, µ_0, Λ_0, and K_0 are the hyper-parameters. Steps 1 and 2 are called the stick-breaking
construction of a Dirichlet process (DP). Given the prior probabilities π(v) and the Gaussian
distribution parameters Θ_1, Θ_2, · · · , the last step (Step 3) samples N observations
(x_i, z_i), i = 1, 2, · · · , N, i.i.d.: for each point i, step 3(a) samples its class label from
Multi(π(v)), and step 3(b) samples its features x_i from the corresponding Gaussian distribution
N(µ_c, Σ_c).
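The three generating steps above can be exercised directly. The sketch below samples from a truncated stick-breaking DP Gaussian mixture with NumPy; the truncation level, fixed spherical covariances (standing in for the Inverse-Wishart draw), and all constants are illustrative assumptions.

```python
import numpy as np

def sample_igmm(n, alpha=1.0, trunc=50, seed=0):
    """Draw n points from a (truncated) stick-breaking DP Gaussian mixture,
    following steps 1-3 above; 2-D, with fixed covariances for brevity."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=trunc)                    # step 1: stick-breaking betas
    pi = v * np.cumprod(np.concatenate(([1.0], 1 - v[:-1])))  # pi_c = v_c * prod(1 - v_i)
    pi /= pi.sum()                                          # renormalize the truncation
    mu = rng.normal(0.0, 5.0, size=(trunc, 2))              # step 2: component means ~ G0
    z = rng.choice(trunc, size=n, p=pi)                     # step 3(a): labels ~ Multi(pi)
    x = mu[z] + rng.normal(0.0, 0.5, size=(n, 2))           # step 3(b): x_i ~ N(mu_z, 0.25 I)
    return x, z

x, z = sample_igmm(500, alpha=2.0)
print(x.shape, len(np.unique(z)))   # (500, 2) and a data-driven number of clusters
```

Larger α spreads mass over more sticks, so more distinct clusters appear in a sample of fixed size; this is the sense in which the iGMM lets the data determine the number of components.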
Figure 7.3: Graphical Model Representation of iGMM
7.6 Infinite Hidden Markov Random Field (iHMRF)
We are given the data set X = {x_1, · · · , x_N}, S = {s_1, · · · , s_N}, and
T = {t_1, t_2, · · · , t_N}, with the unknown class labels Z = {z_1, · · · , z_N}. The iHMRF model
can be represented by the graphical model shown in Figure 7.4, in which each node represents a
random variable (or vector) and each dot represents a hyper-parameter; filled nodes refer to
observations and blank nodes refer to latent variables. Basically, we first use the
spatio-temporal features (s_1, t_1), · · · , (s_N, t_N) to build a neighborhood graph over the
latent state variables z_1, · · · , z_N, in which states z_i and z_j are connected by an
undirected edge if they are spatio-temporal neighbors. Each latent state variable z_i emits an
observation x_i. The iHMRF model is designed in this manner: according to the key property of a
hidden Markov random field, hidden states that are neighbors of each other should be consistent.
However, two neighboring nodes z_i and z_j may be assigned different cluster labels if their
emitted observations x_i and x_j belong to two different Gaussian distributions. The iHMRF model
can be defined as follows:
Definition 1 Infinite Hidden Markov Random Field (iHMRF)
\[ \alpha \mid \lambda_1, \lambda_2 \sim \mathrm{Gamma}(\lambda_1, \lambda_2), \tag{7.22} \]
\[ \beta_c \mid \alpha \sim \mathrm{Beta}(1, \alpha), \quad c = 1, \cdots, \infty, \tag{7.23} \]
\[ \Theta_c \mid G_0 \sim G_0, \quad c = 1, \cdots, \infty, \tag{7.24} \]
\[ x_i \mid z_i = c; \Theta_c \sim \mathcal{N}(\mu_c, \Sigma_c), \tag{7.25} \]
\[ z_i \mid \pi(\beta) \sim \mathrm{Multi}(\pi(\beta)), \tag{7.26} \]
\[ p(Z) = \prod_{i=1}^{N} p(z_i \mid \pi(\beta), z_{N(z_i)}), \tag{7.27} \]
\[ p(z_i \mid \pi(\beta), z_{N(z_i)}) = p(z_i = c \mid \pi(\beta)) \times p(z_i = c \mid z_{N(z_i)}; \gamma), \tag{7.28} \]

where Θ_c | G_0 stands for

\[ \Sigma_c \sim \mathrm{InverseWishart}_{\upsilon_0}(\Lambda_0), \tag{7.29} \]
\[ \mu_c \sim \mathcal{N}(g_0, \Sigma_c / \eta_0), \tag{7.30} \]

and

\[ p(z_i = c \mid N(z_i); \gamma) = \frac{1}{Z(\gamma)} \exp\!\Big( -\sum_{c' \in C_i} V_{c'}(z_i = c, N(z_i); \gamma) \Big), \tag{7.31} \]

where λ_1, λ_2, γ, υ_0, g_0, Λ_0, and η_0 are hyper-parameters.
Compared with the HMRF and the iGMM, the iHMRF model has three major advantages. First, the iHMRF
is able to capture the Gaussian mixture information and the spatial dependencies between the
latent variables {z_i}_{i=1}^{N} concurrently, through Equations (7.25) and (7.28). As a result,
the iHMRF tends to decide the value of z_i based on both its neighbors and its closest Gaussian
mixture component. When conflicts occur, that is, when the class labels of its spatial neighbors
are not consistent with its closest Gaussian mixture component, we can adjust the inverse
temperature parameter γ to decide the weight placed on each side. A smaller value of γ implies
that the model relies more on the Gaussian mixture information; in the extreme case γ = 0, the
model degenerates and becomes equivalent to the iGMM. Second, the iHMRF is able to automatically
estimate the number of class labels (clusters), since a Dirichlet Process (DP) is used as the
prior distribution for z_i and x_i. Third, the iHMRF is robust to transmission power changes.
Figure 7.4: Graphical Model Representation of iHMRF
When a device changes its transmission power, it tends to increase the spatial entropy and makes
its spatial trajectory more highlighted than those of other devices. We observe that iHMRF inherits
the advantages of both HMRF and iGMM.
Based on the above iHMRF model specification, the fingerprinting problem can be reformulated as a maximum-a-posteriori (MAP) problem: estimate the latent variables {z_1, · · · , z_N} such that their joint posterior probability given the observations {x_1, · · · , x_N} is maximized:

{z_1, · · · , z_N} = argmax_{z_1, · · · , z_N} p(z_1, · · · , z_N | x_1, · · · , x_N). (7.32)
Because the wireless device environment under study is a streaming environment, it is more appropriate to perform incremental inference (or classification). We introduce efficient incremental techniques in the next section (Section 7.7).
7.7 Incremental Variational Inference for the IHMRF Model
Inference for the iHMRF model can be conducted via variational inference, Markov chain Monte Carlo (MCMC), and other methods. In this paper, we focus on variational inference, because it is computationally more scalable than MCMC techniques, and hence more applicable to wireless streaming environments. Denote Φ = {Z, Θ, v} as the set of all latent random variables, and θ = {γ, λ_1, λ_2, υ_0, g_0, Λ_0} as the set of hyper-parameters. The objective is to infer the latent Φ given the observations X and the hyper-parameters θ. Because it is intractable to calculate the posterior p(Φ|X, θ), variational inference is applied to approximate the posterior with a parametric family of
factorized distributions q(Φ|X, θ) of the form

q(Φ|X, θ) = q(Z) q(α; λ_1, λ_2) ∏_{c=1}^{C−1} q(β_c; ζ_{c,1}, ζ_{c,2}) × ∏_{c=1}^{C} q(µ_c, Σ_c; υ_c, η_c, g_c, Λ_c). (7.33)
Denote the variational Free Energy functional as

F(q; X, θ) = ∫ q(Φ; θ) log [ p(X, Φ | θ) / q(Φ; θ) ] dΦ, (7.34)
which is a lower bound of the original log-evidence ln p(X|θ). The optimal solution within the parametric family can be obtained by maximizing the Free Energy functional:

maximize_q F(q; X, θ), (7.35)

where the variational parameters to be estimated are θ = {λ_1, λ_2, {ζ_{c,1}, ζ_{c,2}, υ_c, η_c, g_c, Λ_c}_{c=1}^{C}}. These parameters can be optimized iteratively by coordinate ascent until convergence to a local optimum. The results have been derived by Chatzis et al. [223].
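The coordinate-ascent loop implied by (7.35) has a simple generic shape: cycle through the per-factor updates and stop once the free-energy gain falls below a threshold. The sketch below shows only that loop structure with a toy stand-in objective; the actual per-factor updates for iHMRF are those derived in [223], not the placeholder lambdas used here:

```python
def coordinate_ascent(free_energy, update_steps, params, tol=1e-6, max_iter=100):
    """Generic coordinate-ascent skeleton: apply each factor's update in
    turn, and stop when the free-energy improvement drops below tol."""
    f_old = free_energy(params)
    for _ in range(max_iter):
        for step in update_steps:
            params = step(params)
        f_new = free_energy(params)
        if f_new - f_old < tol:
            break
        f_old = f_new
    return params

# Toy stand-in objective: maximize -(a-1)^2 - (b-2)^2 by coordinate updates.
f = lambda p: -(p['a'] - 1.0) ** 2 - (p['b'] - 2.0) ** 2
steps = [lambda p: {**p, 'a': 1.0}, lambda p: {**p, 'b': 2.0}]
opt = coordinate_ascent(f, steps, {'a': 0.0, 'b': 0.0})
```

Each `step` corresponds to optimally updating one factor of (7.33) with the others held fixed, which guarantees a monotone increase of the free energy.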
In this section, we focus on incremental inference, instead of the above offline inference (7.35). Incremental inference is more suitable for a streaming environment such as the one arising in our device fingerprinting problem. Assume that we have a buffer bucket of limited size (e.g., N) to store the streaming observations. When the bucket is full, it is processed and all the observations in the bucket are classified. Then the bucket is emptied and is ready to accept new incoming observations. We may consider multiple buckets in the processing line, such that while one bucket is being processed, other buckets are ready to store new incoming observations. Denote the data of a bucket as B^(i) = {(x^(i)_1, s^(i)_1, t^(i)_1), · · · , (x^(i)_N, s^(i)_N, t^(i)_N)}, where i refers to the bucket sequence number. The incremental inference problem is to process the incoming buckets B^(1), B^(2), · · · incrementally.
We consider a similar strategy as used in iGMM [265, 243], and propose an incremental inference framework for iHMRF. The key components are summarized as follows:

1. Compression Phase: After the observations have been classified into different clusters, each cluster is separated into a number of microclusters that tend to keep consistent cluster labels, even when the clusters are reformed due to the processing of new bucket data. For each microcluster, its sufficient statistics are stored and the data points inside are discarded to save memory space and improve computational efficiency.

2. Model Building Phase: The incremental inference is conducted based on microclusters, instead of individual data points. Some microclusters are allowed to be isolated data points.
3. Incremental Batch Update Phase: The incremental model updates based on the new bucket and previous buckets need not start from scratch. The model information estimated from previous buckets is reused to improve the update efficiency.

The technical details of the above three components are discussed in Sections 7.7.1, 7.7.2, and 7.7.3, respectively.
7.7.1 Model Building Phase
This phase assumes that the observations in the current bucket have already been grouped into a set of microclusters. When this phase is first run (as the initialization step), each observation is regarded as a microcluster. For later iterations, the microclusters are generated from the previous iterations (see Section 7.7.2). Denote A as a specific microcluster, n_A as its size, and x̄_A = (1/n_A) ∑_{x_i ∈ A} x_i. The model building phase solves the following constrained optimization problem:

maximize_{q(Φ;θ)} ∫ q(Φ; θ) log [ p(X, Φ | θ) / q(Φ; θ) ] dΦ
subject to q(z_i) = q(z_j), if ∃A s.t. z_i, z_j ∈ A, (7.36)

where q(Φ; θ) is a factorized parametric form as defined in (7.33). Notice the difference between the above problem (7.36) and the traditional offline problem (7.35): new constraints are introduced such that the data points in the same microcluster must have identical class labels. Because each microcluster is now summarized by its sufficient statistics, the computational efficiency is greatly improved. The above problem can be optimized iteratively by coordinate ascent until convergence to a local optimum.
The solution for each iteration can be obtained as

ζ_{c,1} = 1 + ∑_A n_A q(A = c), (7.37)
ζ_{c,2} = ⟨α⟩ + ∑_{k=c+1}^{C} ∑_A n_A q(A = k), (7.38)
w_c = ∑_A n_A q(A = c), (7.39)
x̄_c = [ ∑_A n_A q(A = c) x̄_A ] / w_c, (7.40)
Ξ_c = ∑_A n_A q(A = c) (x̄_A − x̄_c)(x̄_A − x̄_c)^T, (7.41)
q(A = c) ∝ p(A = c | N(A); γ) π_c(β) p(x̄_A | Θ_c), (7.42)
where N(A) refers to the neighbors of the microcluster A, defined similarly to those based on data points. Here, we use the spatial center of a microcluster to represent its spatial location, with s̄_A = (1/n_A) ∑_{s_i ∈ A} s_i, and the center time to represent its temporal location, with t̄_A = (1/n_A) ∑_{t_i ∈ A} t_i. Note that only the solution components that differ from the traditional offline solution are presented above. Readers are referred to [223] for the estimation of the other model parameters, which are the same as in the offline iHMRF model, including ζ_c, Λ_c, υ_c, η_c, and g_c.
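The microcluster-level updates (7.37)-(7.41) are pure sufficient-statistics computations and vectorize directly. The sketch below assumes responsibilities q(A = c) are given as a matrix; it computes one round of (7.37)-(7.41) and is illustrative rather than a full inference step (the responsibility update (7.42) is omitted):

```python
import numpy as np

def microcluster_updates(n_A, xbar_A, q_Ac, alpha_mean):
    """One round of (7.37)-(7.41): n_A is (M,) microcluster sizes,
    xbar_A is (M, d) microcluster means, q_Ac is (M, C) responsibilities,
    alpha_mean is <alpha>."""
    weighted = n_A[:, None] * q_Ac                       # n_A q(A=c), (M, C)
    w = weighted.sum(axis=0)                             # (7.39)
    zeta1 = 1.0 + w                                      # (7.37)
    # (7.38): <alpha> + total weight of all clusters k > c.
    tail = weighted.sum(axis=0)[::-1].cumsum()[::-1]
    zeta2 = alpha_mean + np.concatenate((tail[1:], [0.0]))
    xbar_c = (weighted.T @ xbar_A) / w[:, None]          # (7.40)
    diffs = xbar_A[:, None, :] - xbar_c[None, :, :]      # (M, C, d)
    Xi = np.einsum('mc,mci,mcj->cij', weighted, diffs, diffs)  # (7.41)
    return zeta1, zeta2, w, xbar_c, Xi

# Tiny example: two singleton-style microclusters, two clusters.
z1, z2, w, xc, Xi = microcluster_updates(
    np.array([2.0, 3.0]), np.array([[0.0], [1.0]]),
    np.array([[1.0, 0.0], [0.0, 1.0]]), alpha_mean=1.0)
```

Because only (n_A, x̄_A) per microcluster enter these formulas, the per-iteration cost scales with the number of microclusters M, not the number of raw observations.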
7.7.2 Compression Phase 135
7.7.2 Compression Phase
This phase focuses on the generation of microclusters. The microclusters are generated such that the data points in each microcluster tend to remain in the same cluster, even when the overall clusters are reformed due to the processing of new bucket data. To address this challenge, a straightforward strategy is to generate multiple candidate clusterings in different ways and then look for the microclusters, each of which never overlaps with more than one candidate cluster concurrently. However, this strategy has two potential deficiencies: first, it is computationally expensive, since the number of different groupings increases exponentially with the data size; second, it does not consider the behavior of future data points. An optimized strategy is to predict up to ∆ future points based on the empirical distribution estimated from the existing data (x_1, x_2, · · · , x_T):
p(x_{T+1}, · · · , x_{T+∆}) = ∏_{i=T+1}^{T+∆} (1/T) ∑_{t=1}^{T} δ(x_i − x_t). (7.43)
We define a modified Free Energy functional by taking the expectation over the ∆ unobserved future points:

F̃(q; X, θ) = ∫ dx_{T+1} · · · dx_{T+∆} F(q; {X, x_{T+1}, · · · , x_{T+∆}}, θ) · p(x_{T+1}, · · · , x_{T+∆}). (7.44)
The solution obtained by maximizing the above modified Free Energy functional is

ζ_{c,1} = 1 + (1 + ∆/T) ∑_A n_A q(A = c), (7.45)
ζ_{c,2} = ⟨α⟩ + (1 + ∆/T) ∑_{k=c+1}^{C} ∑_A n_A q(A = k), (7.46)
w_c = (1 + ∆/T) ∑_A n_A q(A = c), (7.47)
x̄_c = (1 + ∆/T) [ ∑_A n_A q(A = c) x̄_A ] / w_c, (7.48)
Ξ_c = (1 + ∆/T) ∑_A n_A q(A = c) (x̄_A − x̄_c)(x̄_A − x̄_c)^T, (7.49)
q(A = c) ∝ p(A = c | N(A); γ) π_c(β) p(x̄_A | Θ_c). (7.50)
To conduct the compression phase, we first apply the model building phase to generate clusters. Then, for each candidate cluster, we split it into two clusters along its principal component, and refine the clusters based on the above update rules (7.45)-(7.50) until convergence. The gain in the free energy functional is denoted as ∆F(q; X, θ). The cluster with the largest ∆F(q; X, θ) is selected as the final splitting cluster. The process iterates until convergence, e.g., until the gain ∆F(q; X, θ) is smaller than a predefined threshold or the consumed memory exceeds the memory space limit.
7.7.3 Incremental Batch Update Phase
This phase assumes that all previous bucket data have been processed, and that we have obtained the estimated variational parameters {η_c, Λ_c, υ_c, g_c, ζ_{c,1:2}, λ_{1:2}, w_c, x̄_c, Ξ_c}_{c=1}^{C}. Suppose a new bucket of data has arrived; it is necessary to classify the new data points and update all existing clusters. Denote the new bucket data as {(x^(n)_1, s^(n)_1, t^(n)_1), · · · , (x^(n)_N, s^(n)_N, t^(n)_N)}. The incremental batch update phase can be described as
ζ_{c,1} ← ζ_{c,1} + ∑_{i=1}^{N} q(z^(n)_i = c), (7.51)
ζ_{c,2} ← ζ_{c,2} + ∑_{k=c+1}^{C} ∑_{i=1}^{N} q(z^(n)_i = k), (7.52)
w_c ← w_c + ∑_{i=1}^{N} q(z^(n)_i = c), (7.53)
x̄_c ← [ x̄_c w_c + ∑_{i=1}^{N} q(z^(n)_i = c) x^(n)_i ] / w_c, (7.54)
Ξ_c ← Ξ_c + ∑_{i=1}^{N} q(z^(n)_i = c) (x^(n)_i − x̄_c)(x^(n)_i − x̄_c)^T, (7.55)
q(z^(n)_i = c) ∝ p(z^(n)_i = c | N(z_i); γ) π_c(β) p(x^(n)_i | Θ_c). (7.56)
The basic idea is to apply Equation (7.56) to estimate q(z^(n)_i), and then apply Equations (7.51) to (7.55) to update the variational parameters ζ_{c,1:2}, w_c, x̄_c, and Ξ_c. The other parameters that are consistent with the offline iHMRF model are then updated by the equations derived in [223].
7.8 Simulation Results

This section presents an extensive simulation study to validate the effectiveness and efficiency of our proposed techniques against existing solutions, namely the Gaussian Mixture Model (GMM) and the infinite Gaussian Mixture Model (iGMM) [275]. For our fingerprinting framework, we studied the performance of two inference algorithms: the offline variational inference algorithm [223] and our proposed online (incremental) inference algorithm.
7.8.1 Simulation Setup
The simulation data generator includes two components. The first component is the generation of time-independent features. The same simulator design as used in [275] was applied: a number of devices are placed randomly in a 40×40 area of the time-independent feature space, with cluster variances chosen randomly in the range from 0 to 1. We considered two time-independent features, so that the data can be easily visualized. The second component is the generation of time-dependent features. We considered RSS features and assumed that the collected RSS features have been triangulated to three-dimensional spatial coordinates. This is appropriate for mobile devices, because during different time periods users may travel to different spatial regions, and different Access Points (APs) will collect the related RSS traces. By converting the RSS features to spatial coordinates, we avoid the issue of missing values across access points. We used UdelModels, a widely used simulator for generating human trajectory data [260], to generate the mobile device traces. Changes of transmission power were simulated by shifting a trace segment by a randomly selected distance and direction.
We considered four major metrics to evaluate the effectiveness of our framework: precision, recall, F-measure, and Rand index (RI). These metrics are defined based on the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN), as interpreted in Table 7.2:

Precision = TP / (TP + FP); Recall = TP / (TP + FN); F-Measure = 2 × Precision × Recall / (Precision + Recall); Rand Index (RI) = (TP + TN) / (TP + TN + FN + FP).
Table 7.2: Definition of TP, FP, FN, and TN

                    Same Cluster   Different Clusters
Same Class          TP             FN
Different Classes   FP             TN
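The pair-counting convention of Table 7.2 can be computed directly by enumerating all point pairs; the sketch below follows the metric definitions above:

```python
from itertools import combinations

def pairwise_metrics(true_labels, pred_labels):
    """Precision, recall, F-measure, and Rand index over all point
    pairs, per the TP/FP/FN/TN definitions of Table 7.2."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = pred_labels[i] == pred_labels[j]
        if same_class and same_cluster:
            tp += 1
        elif same_class:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    rand_index = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, rand_index

p, r, f, ri = pairwise_metrics([0, 0, 1, 1], [0, 0, 0, 1])
```

In the small example, one same-class pair is merged correctly, one is split, and two different-class pairs are wrongly merged, giving precision 1/3, recall 1/2, F-measure 0.4, and RI 0.5.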
We used UdelModels to generate four simulation datasets covering a variety of scenarios, including indoor and outdoor environments. The basic features of these data sets are summarized in Table 7.3. For each setting, we generated five different versions, in order to calculate the uncertainty (standard deviation) of the classification performance.
Table 7.3: Simulation Data Settings

Description               # of Pedestrians (Peds)   # of Cars
1 Building 10 Floors      5, 10, 15                 5, 10, 15
Real City (Chicago9B1k)   5, 10, 15                 5, 10, 15
We compared our framework with two existing approaches, GMM and iGMM. For our framework, we employed two inference algorithms: the offline variational inference algorithm for iHMRF [223], abbreviated as iHMRF-VI, and our proposed incremental inference algorithm, abbreviated as Inc-iHMRF-VI.

(a) Chicago9B1k Data with Only Pedestrians (b) Chicago9B1k Data with Unstable RSS Rates
Figure 7.5: Spatial Distribution of Simulation Data

GMM requires the number of clusters to be predefined; in our simulation study, we set this value to the true number of clusters (devices), in order to study the best performance that a GMM model could achieve. iGMM is a nonparametric method: although an initial number of clusters must still be provided, iGMM automatically determines the final number of clusters, so we set the initial cluster number randomly. All the other hyperparameters were set such that the corresponding parameters are uniformly distributed. Similar strategies were used for the nonparametric methods iHMRF-VI and Inc-iHMRF-VI. One additional setting in both iHMRF-VI and Inc-iHMRF-VI is the definition of spatio-temporal neighborhood relationships: we defined neighbors as data points that are among each other's 5 nearest spatial neighbors and whose time stamp distance is smaller than 50. These settings can be chosen loosely; we observed that the resulting performance is not sensitive to them. We set the memory bound and the bucket size of Inc-iHMRF-VI to 2000 and 2000, respectively, and observed similar patterns for different settings of these two parameters.
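The spatio-temporal neighborhood construction just described can be sketched with a brute-force nearest-neighbor search (a KD-tree would be used at scale). The sketch builds directed k-nearest-neighbor lists restricted by the time-stamp threshold; the mutual-neighbor refinement mentioned above is left out for brevity:

```python
import numpy as np

def st_neighbors(coords, times, k=5, max_dt=50.0):
    """For each point, return up to k nearest spatial neighbors whose
    time-stamp distance is below max_dt (k=5, max_dt=50 follow the
    simulation setup in this section)."""
    coords = np.asarray(coords, dtype=float)
    times = np.asarray(times, dtype=float)
    n = len(coords)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # no self-neighbors
    d2[np.abs(times[:, None] - times[None, :]) >= max_dt] = np.inf
    nbrs = []
    for i in range(n):
        order = np.argsort(d2[i])[:k]
        nbrs.append([int(j) for j in order if np.isfinite(d2[i, j])])
    return nbrs

# Point 2 is temporally far from the others, so it has no neighbors.
nb = st_neighbors([[0, 0], [1, 0], [100, 0]], [0, 10, 200], k=5, max_dt=50)
```

The resulting adjacency lists are what the MRF term p(z_i = c | N(z_i); γ) operates over.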
For the simulation data, we considered two scenarios: indoors and outdoors. For indoors, we generated simulation data with 5, 10, and 15 devices, and a sample rate of one reading every 20 seconds. The results are shown in Table 7.4. For outdoors, we simulated mobile traces of a real downtown area in Chicago with 5, 10, and 15 pedestrians; the results are shown in Table 7.7. The results for the scenarios with 5, 10, and 15 cars are shown in Table 7.6, and Table 7.5 shows the results with concurrent pedestrians and cars. From all these results, we observe that our framework based on the iHMRF model outperformed GMM and iGMM in the majority of cases, especially iGMM. Recall that the GMM method used the true number of clusters (devices) as its initial setting; its performance should therefore represent the close-to-best performance of general clustering algorithms based on time-independent features.
Table 7.4: Simulation Results Based on UdelModels with 1 Building 10 Floors

Methods        # of Devices   Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5              0.97 (0.02)   0.91 (0.13)   0.93 (0.07)   0.96 (0.04)
               10             0.72 (0.13)   0.81 (0.13)   0.76 (0.11)   0.93 (0.04)
               15             0.73 (0.09)   0.82 (0.06)   0.77 (0.07)   0.96 (0.01)
Inc-iHMRF-VI   5              0.88 (0.10)   0.94 (0.05)   0.91 (0.05)   0.95 (0.02)
               10             0.65 (0.28)   0.85 (0.14)   0.72 (0.23)   0.90 (0.09)
               15             0.51 (0.13)   0.79 (0.08)   0.62 (0.12)   0.92 (0.02)
iGMM-VI        5              0.86 (0.09)   0.44 (0.15)   0.57 (0.15)   0.80 (0.09)
               10             0.73 (0.11)   0.43 (0.10)   0.54 (0.10)   0.91 (0.03)
               15             0.56 (0.01)   0.30 (0.07)   0.38 (0.06)   0.92 (0.01)
GMM-EM         5              0.91 (0.15)   0.85 (0.22)   0.86 (0.16)   0.90 (0.10)
               10             0.72 (0.14)   0.83 (0.13)   0.77 (0.11)   0.93 (0.04)
               15             0.64 (0.11)   0.77 (0.06)   0.70 (0.08)   0.94 (0.01)
Table 7.5: Simulation Results Based on UdelModels - Chicago9Blk - with Pedestrians and Cars

Methods        # of Devices       Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5 Peds, 5 Cars     0.99 (0.01)   0.98 (0.01)   0.99 (0.01)   0.99 (0.01)
               10 Peds, 10 Cars   0.91 (0.10)   0.99 (0.10)   0.95 (0.05)   0.99 (0.01)
               15 Peds, 15 Cars   0.90 (0.09)   0.97 (0.02)   0.94 (0.05)   0.99 (0.01)
Inc-iHMRF-VI   5 Peds, 5 Cars     0.98 (0.02)   1.00 (0.00)   0.99 (0.01)   0.99 (0.01)
               10 Peds, 10 Cars   0.80 (0.13)   0.97 (0.04)   0.87 (0.08)   0.96 (0.02)
               15 Peds, 15 Cars   0.57 (0.07)   0.92 (0.08)   0.70 (0.07)   0.93 (0.02)
iGMM-VI        5 Peds, 5 Cars     0.90 (0.12)   0.29 (0.05)   0.44 (0.07)   0.80 (0.06)
               10 Peds, 10 Cars   0.67 (0.08)   0.31 (0.06)   0.42 (0.06)   0.89 (0.02)
               15 Peds, 15 Cars   0.63 (0.06)   0.29 (0.06)   0.40 (0.06)   0.92 (0.01)
GMM-EM         5 Peds, 5 Cars     0.92 (0.13)   0.89 (0.06)   0.89 (0.07)   0.95 (0.03)
               10 Peds, 10 Cars   0.69 (0.08)   0.79 (0.11)   0.73 (0.09)   0.93 (0.03)
               15 Peds, 15 Cars   0.69 (0.12)   0.78 (0.06)   0.72 (0.08)   0.95 (0.02)
Table 7.6: Simulation Results Based on UdelModels - Chicago9Blk - with Only Cars

Methods        # of Devices   Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5 Cars         0.95 (0.03)   0.59 (0.08)   0.72 (0.06)   0.89 (0.02)
               10 Cars        0.83 (0.09)   0.55 (0.05)   0.66 (0.05)   0.93 (0.01)
               15 Cars        0.68 (0.08)   0.53 (0.09)   0.59 (0.08)   0.94 (0.01)
Inc-iHMRF-VI   5 Cars         0.89 (0.12)   0.98 (0.02)   0.93 (0.02)   0.97 (0.04)
               10 Cars        0.73 (0.11)   0.77 (0.09)   0.75 (0.06)   0.93 (0.02)
               15 Cars        0.56 (0.08)   0.83 (0.06)   0.66 (0.07)   0.93 (0.02)
iGMM-VI        5 Cars         0.82 (0.08)   0.30 (0.07)   0.44 (0.07)   0.82 (0.02)
               10 Cars        0.65 (0.10)   0.32 (0.07)   0.43 (0.08)   0.89 (0.01)
               15 Cars        0.55 (0.06)   0.29 (0.05)   0.38 (0.05)   0.92 (0.01)
GMM-EM         5 Cars         0.91 (0.12)   0.87 (0.13)   0.89 (0.12)   0.95 (0.05)
               10 Cars        0.79 (0.07)   0.81 (0.09)   0.89 (0.08)   0.95 (0.02)
               15 Cars        0.73 (0.04)   0.79 (0.09)   0.76 (0.06)   0.96 (0.01)
Table 7.7: Simulation Results Based on UdelModels - Chicago9Blk - with Only Pedestrians

Methods        # of Devices   Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5 Peds         0.98 (0.04)   0.83 (0.13)   0.90 (0.09)   0.96 (0.03)
               10 Peds        0.92 (0.08)   0.80 (0.13)   0.85 (0.10)   0.97 (0.02)
               15 Peds        0.91 (0.05)   0.86 (0.05)   0.88 (0.04)   0.98 (0.00)
Inc-iHMRF-VI   5 Peds         0.86 (0.10)   0.92 (0.08)   0.88 (0.07)   0.95 (0.03)
               10 Peds        0.71 (0.08)   0.89 (0.07)   0.79 (0.05)   0.95 (0.03)
               15 Peds        0.61 (0.08)   0.92 (0.02)   0.72 (0.06)   0.95 (0.01)
iGMM-VI        5 Peds         0.82 (0.12)   0.31 (0.05)   0.44 (0.06)   0.85 (0.01)
               10 Peds        0.73 (0.11)   0.36 (0.08)   0.48 (0.10)   0.92 (0.01)
               15 Peds        0.63 (0.05)   0.35 (0.06)   0.45 (0.05)   0.94 (0.00)
GMM-EM         5 Peds         0.73 (0.15)   0.90 (0.07)   0.80 (0.11)   0.91 (0.05)
               10 Peds        0.69 (0.12)   0.84 (0.09)   0.75 (0.11)   0.94 (0.03)
               15 Peds        0.68 (0.11)   0.86 (0.04)   0.75 (0.08)   0.96 (0.02)
However, we did notice that, as shown in Table 7.6, when the mobile devices are vehicles, GMM's performance was comparable to that of our methods, although our methods still outperformed iGMM. This pattern is potentially related to the assumption of the iHMRF model, namely that data points that are spatially and temporally close tend to have consistent class labels. Vehicles move much faster than pedestrians, tend to have lower sample rates, and have more overlaps in their spatial traces. When devices overlap more, spatially and temporally, the overlapping spatial trace features can no longer be used to distinguish different mobile devices. However, there still exist trace segments that do not overlap, which provide useful information for the classification process. This potentially explains why iHMRF's performance degraded in this situation while still remaining better than iGMM's.
Overall, both iHMRF-VI and Inc-iHMRF-VI achieved comparable accuracies, with iHMRF-VI performing slightly better. This can be attributed to the data compression via microclusters in Inc-iHMRF-VI. For all the simulation data sets, the average data size is around 8000 observations, and in our implementation we set the memory bound to 2000 observations. That is, we compressed 8000 observations into 2000 microclusters, which greatly reduced the computational cost and the required memory, at a slight sacrifice in accuracy.
7.8.2 Impacts of Unstable RSS Collection Rates

We evaluated the impacts of unstable RSS collection rates based on the Chicago9Blk pedestrians data set. We randomly selected 50 percent of the devices, segmented each selected device trace into eight segments, and then randomly removed 50 percent of the segments. This process yields discontinuous RSS trace data. The classification results based on the modified data are shown in Table 7.8, and a visualization of the generated simulation data is shown in Figure 7.5. We observe that iHMRF-VI and Inc-iHMRF-VI performed best in the majority of cases, which is consistent with our previous observations. However, by comparing Table 7.7 and Table 7.8, we observe that unstable RSS rates slightly degraded the accuracies. This is potentially due to the reduction of
Table 7.8: Unstable RSS Rates (UdelModels - Chicago9Blk - with Only Pedestrians)

Methods        # of Devices   Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5 Peds         0.91 (0.11)   0.77 (0.08)   0.83 (0.05)   0.93 (0.02)
               10 Peds        0.96 (0.05)   0.82 (0.11)   0.88 (0.08)   0.98 (0.02)
               15 Peds        0.84 (0.10)   0.83 (0.07)   0.83 (0.07)   0.98 (0.01)
Inc-iHMRF-VI   5 Peds         0.91 (0.17)   0.88 (0.15)   0.89 (0.15)   0.97 (0.04)
               10 Peds        0.77 (0.13)   0.86 (0.09)   0.81 (0.11)   0.95 (0.03)
               15 Peds        0.62 (0.10)   0.92 (0.02)   0.73 (0.07)   0.95 (0.02)
iGMM-VI        5 Peds         0.82 (0.13)   0.32 (0.07)   0.46 (0.09)   0.83 (0.02)
               10 Peds        0.71 (0.10)   0.31 (0.05)   0.43 (0.07)   0.91 (0.01)
               15 Peds        0.62 (0.09)   0.33 (0.06)   0.43 (0.06)   0.94 (0.01)
GMM-EM         5 Peds         0.75 (0.16)   0.90 (0.06)   0.81 (0.10)   0.90 (0.06)
               10 Peds        0.67 (0.08)   0.81 (0.07)   0.73 (0.04)   0.93 (0.02)
               15 Peds        0.71 (0.04)   0.82 (0.06)   0.76 (0.03)   0.96 (0.00)
sample size, since we removed 50 percent of the observations from 50 percent of the randomly selected devices. However, as long as each segment is still composed of spatially and temporally adjacent data points, the iHMRF model can be applied to capture the corresponding autocorrelations.
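The trace perturbation used in this experiment (segment each trace, drop half the segments) can be sketched as follows; the function name and defaults mirror the description above and are otherwise illustrative:

```python
import random

def drop_segments(trace, n_segments=8, drop_frac=0.5, rng=None):
    """Simulate unstable RSS collection: cut a device trace into
    n_segments equal-length pieces and randomly remove drop_frac of
    them, keeping the survivors in temporal order."""
    rng = rng or random.Random(0)
    seg_len = len(trace) // n_segments
    segments = [trace[i * seg_len:(i + 1) * seg_len]
                for i in range(n_segments)]
    n_keep = n_segments - int(n_segments * drop_frac)
    keep = rng.sample(range(n_segments), n_keep)
    return [pt for i in sorted(keep) for pt in segments[i]]

# An 80-reading trace loses 4 of its 8 ten-reading segments.
trace = list(range(80))
kept = drop_segments(trace)
```

Because whole contiguous segments survive, the kept readings remain spatially and temporally adjacent within each segment, which is exactly the property the iHMRF autocorrelation assumption relies on.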
7.8.3 Impacts of Transmission Power Changes
Table 7.9: Change of Transmission Power (UdelModels - Chicago9Blk - with Only Pedestrians)

Methods        # of Devices   Precision     Recall        F-Measure     Rand Index (RI)
iHMRF-VI       5 Peds         0.98 (0.02)   0.70 (0.07)   0.82 (0.06)   0.94 (0.02)
               10 Peds        0.95 (0.06)   0.77 (0.06)   0.85 (0.06)   0.97 (0.01)
               15 Peds        0.93 (0.04)   0.79 (0.05)   0.85 (0.02)   0.98 (0.00)
Inc-iHMRF-VI   5 Peds         0.76 (0.14)   0.98 (0.03)   0.85 (0.09)   0.93 (0.04)
               10 Peds        0.74 (0.12)   0.88 (0.08)   0.80 (0.09)   0.96 (0.02)
               15 Peds        0.58 (0.08)   0.86 (0.69)   0.69 (0.06)   0.95 (0.01)
iGMM-VI        5 Peds         0.83 (0.13)   0.31 (0.05)   0.45 (0.06)   0.85 (0.02)
               10 Peds        0.72 (0.11)   0.35 (0.07)   0.47 (0.09)   0.92 (0.01)
               15 Peds        0.65 (0.07)   0.35 (0.04)   0.45 (0.05)   0.94 (0.01)
GMM-EM         5 Peds         0.74 (0.11)   0.89 (0.04)   0.81 (0.07)   0.91 (0.04)
               10 Peds        0.63 (0.13)   0.83 (0.08)   0.71 (0.11)   0.93 (0.03)
               15 Peds        0.69 (0.04)   0.85 (0.04)   0.76 (0.02)   0.96 (0.00)
7.8.4 Comparisons on Precision, Recall, and F-Measure
Studies have shown that attackers may hide their actual locations by periodically changing the transmission power of their mobile devices [258]. To simulate this behavior, we used the Chicago9Blk pedestrians data set: fifty percent of the devices were selected, the trace of each selected device was segmented into eight equal-length pieces, and fifty percent of these pieces were shifted in random directions by random spatial distances. The corresponding classification results are shown
in Table 7.9. We observe that the changes of transmission power did not have significant impacts on the accuracies. One potential interpretation is that a change of transmission power increases the spatial entropy and hence makes the device's trace more separated from other traces. This reduces the potential overlaps between device traces, and could even help improve the accuracies of iHMRF-VI and Inc-iHMRF-VI.
7.8.5 Comparison on Time Costs
We evaluated the time costs of the four algorithms on three data sets: "1 Building 10 Floors" (7224 observations), "Chicago9B1k with 10 Pedestrians and 10 Cars" (4525 observations), and "Chicago9B1k with 10 Pedestrians" (6000 observations). We set the bucket size to 2000; that is, the data are processed bucket by bucket, 2000 observations at a time. The results are summarized in Figure 7.6, where the X axis refers to the three data sets and the Y axis refers to the running duration (seconds). We observe that our proposed incremental inference algorithm Inc-iHMRF-VI is much more efficient than the offline inference algorithm iHMRF-VI, and is even faster than iGMM. This indicates a significant improvement in computational efficiency, and the savings in time cost by Inc-iHMRF-VI grow as the data size increases. Note that GMM has the lowest time cost; since GMM does not need to automatically estimate the number of clusters, its time complexity is much smaller than that of iGMM and iHMRF.
Figure 7.6: Comparison on Time Costs (Seconds) for iHMRF-VI, iGMM, GMM, and Inc-iHMRF-VI on the three data sets
7.8.6 A Case Study on Detecting Masquerade Attacks
This section presents a case study on masquerade attack detection; masquerade attacks are among the most dangerous attack types. A masquerade attack refers to the behavior in which an attacker impersonates an authorized user of a system by using a fake identity (e.g., MAC address) in order to gain access to unauthorized personal resources. To simulate this attack behavior, we used the Chicago9Blk pedestrians data set and the 1-Building-10-Floors data set, randomly selected k clusters, and assigned them a single shared cluster identity. Using fingerprinting techniques, this type of attacker can be identified by discovering that multiple clusters share the same identity information. Here k refers to the number of masquerading devices. We considered different settings of k, from 3 to 6, and evaluated the resulting detection rates for the different detection methods. The results are summarized in Tables 7.10 and 7.11. They indicate that our framework (with either iHMRF-VI or Inc-iHMRF-VI) achieved the highest detection rate in most cases. The GMM method performed slightly worse than Inc-iHMRF-VI and iHMRF-VI; however, we again used the true number of clusters as the initial setting for GMM. In real applications, where the actual number is unknown, the GMM method would perform much worse.
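The detection rule described above — flag any claimed identity that the clustering assigns to more than one physical device — is simple to express in code; the data layout below is illustrative:

```python
from collections import defaultdict

def masquerade_identities(claimed_ids, cluster_labels):
    """Return the claimed identities (e.g., MAC addresses) that span
    more than one estimated device cluster: such sharing indicates a
    masquerade attack."""
    clusters_per_id = defaultdict(set)
    for dev_id, cluster in zip(claimed_ids, cluster_labels):
        clusters_per_id[dev_id].add(cluster)
    return {d for d, cs in clusters_per_id.items() if len(cs) > 1}

# 'aa' is claimed by observations in two distinct clusters -> flagged.
ids = ['aa', 'aa', 'aa', 'bb', 'bb']
clusters = [0, 0, 1, 2, 2]
suspects = masquerade_identities(ids, clusters)
```

The quality of this rule therefore depends entirely on how well the fingerprinting clustering separates physical devices, which is what the detection rates in Tables 7.10 and 7.11 measure.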
Table 7.10: Detection Rates for Masquerade Attacks Based on UdelModels - Chicago9B1k - Pedestrians

Peds   Cars   Att.   iHMRF-VI   Inc-iHMRF-VI   iGMM   GMM
10     10     3      0.98       0.81           0.46   0.76
10     10     4      0.97       0.84           0.55   0.81
10     10     5      0.97       0.87           0.62   0.84
10     10     6      0.97       0.88           0.67   0.86
15     15     3      0.97       0.86           0.62   0.85
15     15     4      0.97       0.85           0.62   0.85
15     15     5      0.97       0.86           0.64   0.86
15     15     6      0.97       0.87           0.66   0.87
Table 7.11: Detection Rates for Masquerade Attacks Based on UdelModels - 1 Building 10 Floors

Peds   Cars   Att.   iHMRF-VI   Inc-iHMRF-VI   iGMM   GMM
10     0      3      0.71       0.86           0.44   0.76
10     0      4      0.85       0.89           0.72   0.94
10     0      5      0.89       0.91           0.73   0.88
10     0      6      0.93       0.98           0.87   0.95
15     0      3      0.80       0.60           0.36   0.68
15     0      4      0.85       0.86           0.56   0.87
15     0      5      0.88       0.86           0.65   0.88
15     0      6      0.95       0.91           0.78   0.93
7.9 Conclusion
Device fingerprinting is a fundamental problem in wireless network security. Passive fingerprinting techniques are effective because they are designed around device-dependent features (e.g., RSS, AOD, and TOA) that attackers cannot manipulate. However, existing solutions support either time-dependent or time-independent features, but no method can handle both. This paper presents the first unified fingerprinting approach based on the infinite hidden Markov random field (iHMRF). It models time-independent and time-dependent features concurrently and automatically detects the number of devices. We present a novel incremental classification algorithm that is suitable for a streaming environment with limited memory and computational resources. Extensive numerical analysis further validated the effectiveness and efficiency of our proposed approach. For future work, we plan to evaluate the performance of our approach on real-life devices. We will also extend our approach to handle other related wireless security problems, such as the identification of primary and secondary users to prevent dynamic spectrum access abuse and malicious-behavior attacks in cognitive radio networks.
Figure 7.8: Visualization for the UdelModels - Chicago9B1k Data with Pedestrians and Cars
Chapter 8

Achievements and Future Work
8.1 Achievements
In this thesis, I presented a number of efficient algorithms for mining large spatio-temporal data across a variety of application domains, such as medical imaging, urban traffic prediction, weather forecasting, and social networks. First, we proposed a generalized local statistical model for spatial outlier detection that is more accurate and computationally efficient than existing methods (Chapter 3). Second, we developed a reduced-space dimension reduction model combined with an artificial Student-t based random buffering process for detecting outliers in non-numerical data (Chapter 4). Third, we presented a robust spatio-temporal random effects model and designed efficient algorithms that perform robust spatio-temporal prediction in near-linear time (Chapter 5). Fourth, we developed a generic hidden Markov model based approach to inferring the hidden human activities associated with residential energy consumption data collected from smart meters (Chapter 6). Finally, we presented a novel application of the infinite hidden Markov random field to the passive device fingerprinting problem in wireless security and developed a new online learning algorithm for the streaming environment in wireless networks (Chapter 7).
Spatial Outlier Detection (Chapter 3 and Chapter 4)
Spatial novelty patterns, or spatial outliers, are observations whose characteristics are markedly different from those of their spatial neighbors. There are two major branches of spatial outlier detection (SOD) methodologies: the global Kriging based and the local Laplacian smoothing based. The former approach was designed based on robust statistics and the popular Kriging framework. This approach is very effective, but has low efficiency, with a time complexity of O(N^4), where N is the data cardinality. The latter approach applies Laplacian smoothing to eliminate spatial dependencies between observations, and then converts the SOD problem into a general outlier detection problem. This approach has a time complexity of O(N^2), but it implicitly assumes that the observations modified by Laplacian smoothing are independently and identically distributed (i.i.d.) and follow a Gaussian distribution. In addition, these approaches were designed for numerical attributes, and large-scale SOD problems for other data types, such as count, binary, and categorical attributes, have not been well explored. We considered two open problems:

1. Will the Laplacian smoothing process generate i.i.d. Gaussian observations?

2. How can numerical SOD methods be generalized to non-numerical data types, such as count, binary, and categorical data?
To address the first problem, we theoretically and empirically validated the effectiveness of Laplacian smoothing for eliminating spatial autocorrelations under popular autocorrelation settings (e.g., Gaussian and exponential kernels). This work provides fundamental support for the family of local-based methods. However, we also discovered a side effect of Laplacian smoothing: the process introduces extra spatially autocorrelated variation into the data, due to convolution effects between measurement errors. To capture this extra variability, we proposed a Generalized Local Statistical (GLS) framework and designed two improved forward and backward SOD methods [308], which outperformed existing SOD methods on a number of simulation and real data sets.

We addressed the second problem by using generalized spatial linear models, which map observations of different data types to latent numerical variables via a link function. Existing SOD techniques can then be applied to the latent numerical variables. In our optimized design, we first applied a Bayesian generalized spatial linear model to capture spatial correlations for different data types, such as count, binary, ordinal, and nominal. We then integrated an additional "error buffer" component based on the Student-t distribution to capture the large variations caused by outliers. After that, we considered a latent reduced-rank spatial Kriging model and designed an approximate inference algorithm with linear time complexity. We have also proposed solutions to spatial categorical SOD [303], multivariate SOD [314, 300], local spatial outlier cluster detection [307], spatial anomaly trajectory detection in a transportation network [310], and an entropy based method for assessing the number of spatial outliers [312].
Robust Prediction for Large Spatio-Temporal Data (Chapter 5)
The spatio-temporal datasets being collected nowadays are usually at the gigabyte or even terabyte
scale. In existing related work, only a limited number of methods have the ability to conduct efficient
spatio-temporal prediction in linear time, and these methods are still limited to Gaussian data. It
is challenging to deal with massive spatio-temporal datasets that are noisy and non-Gaussian. One
effective direction to address this challenge is to generalize existing methods to make them
more robust when a small portion of data objects deviates from the distribution assumption. We
considered the following two open problems:
1. Is it possible to conduct robust offline spatio-temporal prediction in near linear time?
2. Is it possible to conduct robust online spatio-temporal prediction in near linear time?
We proposed a robust version of the Spatio-Temporal Random Effects (STRE) model, namely the
Robust STRE (R-STRE) model. The regular STRE model is a recently proposed statistical model
for large spatio-temporal data that has linear time complexity; however, it has been shown to be
sensitive to outliers and anomalous observations. Our R-STRE model is more resilient
to outliers and other small departures from model assumptions. Specifically, the R-STRE model
assumes that the measurement error follows a heavy-tailed distribution, such as the Huber or Laplace
distribution, instead of a traditional Gaussian distribution. This extension leads to non-analytical
inference solutions for smoothing, filtering, and forecasting. We proposed near-linear-time
primal-dual interior point algorithms to calculate the maximum a posteriori (MAP) estimates, and
applied Laplace approximations to calculate the uncertainty estimates (variance-covariance matrices)
for robust inference. The theoretical properties of the proposed R-STRE model and its connection
with the regular STRE model were also explored. We also developed a related robust prediction
framework for large spatial data using integrated Gaussian and Laplace approximation techniques
[318].
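The effect of a heavy-tailed measurement-error model can be illustrated with a one-dimensional sketch (this is not the R-STRE algorithm, which uses primal-dual interior point methods on the full model; the data and step sizes here are hypothetical): MAP estimation under a Huber error model amounts to minimizing a Huber loss, whose bounded influence keeps the estimate near the bulk of the data.

```python
def huber_grad(r, k=1.345):
    """Derivative of the Huber loss: quadratic near 0, linear in the tails,
    so large residuals (outliers) receive bounded influence."""
    return r if abs(r) <= k else k * (1 if r > 0 else -1)

def robust_mean(xs, k=1.345, lr=0.1, iters=500):
    """Gradient-ascent sketch of a MAP location estimate under Huber
    measurement error; only a 1-D illustration of the robustness effect."""
    m = sum(xs) / len(xs)          # start from the non-robust mean
    for _ in range(iters):
        g = sum(huber_grad(x - m, k) for x in xs)
        m += lr * g / len(xs)
    return m

data = [10.1, 9.9, 10.0, 10.2, 9.8, 50.0]   # one gross outlier
print(sum(data) / len(data))  # ordinary mean, dragged toward 50
print(robust_mean(data))      # stays near 10
```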
The preceding approach only provides a solution to the spatio-temporal smoothing problem, which
aims to predict missing values in historical data. It cannot be directly applied to conduct efficient
online filtering and forecasting. To address this problem, we proposed an alternative approach [317]
using backward and forward message passing to support incremental inference. This approach can
be efficiently implemented using a state-of-the-art approximate inference technique, namely
expectation propagation, combined with a Student-t distribution to model the measurement error.
One of the main challenges of using approximate inference here is high dimensionality: the
posterior distribution of a large number of latent variables needs to be approximated. We proposed
a novel approximate inference approach that approximates the model by a form (which we call the
approximate R-STRE model) that separates the high dimensional latent variables into groups, and
then estimates the posterior distributions of the different groups separately within the expectation
propagation framework. We presented theoretical evaluations showing that our solution based on
the approximate R-STRE model becomes equivalent to the traditional R-STRE model when the
degrees of freedom of the Student-t distribution tend to infinity.
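Two properties used above can be checked numerically with a small sketch (standard log-densities only; the evaluation points are illustrative): the Student-t model assigns far more mass to extreme observations than the Gaussian, which is why it absorbs outliers, and it converges to the Gaussian as the degrees of freedom grow.

```python
import math

def t_logpdf(x, df):
    """Log-density of the standard Student-t distribution."""
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi)
            - (df + 1) / 2 * math.log1p(x * x / df))

def normal_logpdf(x):
    """Log-density of the standard normal distribution."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x

# Heavy tails at moderate df: a 6-sigma point is far more plausible
# under the t model, so it exerts much less pull on the fit.
print(t_logpdf(6.0, df=4) - normal_logpdf(6.0))        # large positive gap

# As df grows the two models coincide, matching the equivalence result
# for the approximate R-STRE model in the infinite-df limit.
print(abs(t_logpdf(6.0, df=1e8) - normal_logpdf(6.0)))  # near zero
```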
Energy Disaggregation (Activity Analysis) using Smart Meter Data (Chapter 6)
Sustainability and the design of sustainable technologies have become an urgent priority for
cities, given the unprecedented demand for resources (water, energy, transit, healthcare, public
safety) and for every imaginable service that makes a city attractive and desirable. With the widespread
deployment of smart grids and smarter cities, smart meters have been installed in households and
industrial buildings to measure aggregated resource consumption (e.g., power, water, and gas).
Energy disaggregation aims to decompose smart meter data into the energy consumption of
individual appliances, such as the washer, refrigerator, laptop, and lighting. Studies have shown that
providing users with appliance-level energy information can lead to significant energy savings.
Smart meter data usually has a low sampling frequency (e.g., one reading per 15 minutes
or 1 hour). This special feature makes most existing disaggregation techniques inappropriate, such
as the Independent Component Analysis (ICA) used in audio source separation. Energy disaggregation
based on low frequency data is an emerging field that started in early 2010. As one of the pioneers,
we collaborated with a group at IBM Research and worked on two open problems as follows:
1. Is it possible to disaggregate low frequency water smart meter data?
2. Is it possible to disaggregate low frequency power smart meter data?
To address the first problem, we proposed a general statistical framework that disaggregates water
consumption from coarse-grained smart meter readings by modeling fixture characteristics, household
behavior, and activity correlations [304]. This framework is composed of six components: event
extraction, model selection and training, parallel activity detection, parallel size estimation,
hidden activity identification, and consumption decomposition. We showed that if the event extraction
is accurate, and the stochastic model is accurately selected and trained, our framework leads
to a maximum a posteriori solution for the disaggregation. Moreover, each component can be customized
for different application scenarios, which makes the framework flexible and widely applicable. This
framework has been used in the first smarter-city project in the United States, deployed by IBM in
the city of Dubuque in 2011. In a recent nine-week study based on a controlled group of 152 households
and a non-controlled group of 151 households, the smarter-city project achieved water savings of
89,090 gallons (6.6%) across the 151 households.
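A minimal sketch of the activity-detection step, assuming a hypothetical single fixture with Gaussian emissions and two states (this is a textbook Viterbi decoder, not the full six-component framework; the readings and parameters are invented):

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most likely hidden state sequence under an HMM.
    obs_loglik[t][s]: log-likelihood of reading t under state s."""
    T, S = len(obs_loglik), len(log_init)
    delta = [[0.0] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    for s in range(S):
        delta[0][s] = log_init[s] + obs_loglik[0][s]
    for t in range(1, T):
        for s in range(S):
            best = max(range(S), key=lambda p: delta[t - 1][p] + log_trans[p][s])
            back[t][s] = best
            delta[t][s] = delta[t - 1][best] + log_trans[best][s] + obs_loglik[t][s]
    path = [max(range(S), key=lambda s: delta[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def gauss_ll(x, mu, sd):
    """Gaussian log-likelihood up to an additive constant."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd)

# Hypothetical fixture: state 0 = off (mean 0 gal), state 1 = on (mean 5 gal).
readings = [0.1, 0.0, 4.8, 5.2, 4.9, 0.2]
ll = [[gauss_ll(x, 0.0, 1.0), gauss_ll(x, 5.0, 1.0)] for x in readings]
trans = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
init = [math.log(0.9), math.log(0.1)]
states_path = viterbi(ll, trans, init)
print(states_path)  # → [0, 0, 1, 1, 1, 0]
```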
To address the second problem (power energy disaggregation), we explored an alternative strategy
based on an energy disaggregation approach via discriminative sparse coding (DDSC) [306]. DDSC
aims to learn a single disaggregation model for all households. We observed that this strategy is
inappropriate, because different households may have different appliances and power usage habits. We
first reformulated DDSC as a generative statistical model, and then proposed an improved Bayesian
version of DDSC [319], in which we integrated household dependent information, such as appliance
information and power usage behaviors, into the disaggregation framework. Based on the improved
model, we proposed an efficient disaggregation algorithm using variational inference techniques.
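A heavily simplified sketch of the signature-decomposition idea behind sparse-coding disaggregation (not DDSC itself: here the appliance signatures are assumed known and only non-negative activations are fit by projected gradient, whereas DDSC learns the dictionaries under a discriminative objective; all signals are hypothetical):

```python
def disaggregate(aggregate, signatures, iters=2000, lr=0.01):
    """Decompose an aggregate load into non-negative activations of known
    appliance signatures by projected gradient on the squared error."""
    K, T = len(signatures), len(aggregate)
    a = [0.0] * K                      # activation per appliance
    for _ in range(iters):
        # residual of the current reconstruction
        recon = [sum(a[k] * signatures[k][t] for k in range(K)) for t in range(T)]
        resid = [aggregate[t] - recon[t] for t in range(T)]
        for k in range(K):
            g = sum(resid[t] * signatures[k][t] for t in range(T))
            a[k] = max(0.0, a[k] + lr * g)   # gradient step + projection to >= 0
    return a

# Two hypothetical appliance signatures; the aggregate mixes 2x the first
# and 1x the second.
sig = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
agg = [2.0, 2.0, 1.0, 1.0]
acts = disaggregate(agg, sig)
print(acts)  # ≈ [2.0, 1.0]
```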
Wireless Device Fingerprinting (Chapter 7)
Wireless device fingerprinting is a fundamental problem in wireless security. Existing solutions have
focused on either spatio-temporally independent features (e.g., phase shift difference, frequency
difference) or spatio-temporally dependent features (e.g., radio signal strength (RSS), time difference
of arrival (TDOA)). However, no prior work has considered all useful features concurrently.
We presented a unified framework for the fingerprinting problem based on the infinite hidden Markov
random field (iHMRF). Our framework is able to model both spatio-temporally independent and
spatio-temporally dependent features, and to automatically detect the number of devices. We proposed
the first incremental classification algorithm for the iHMRF model that is suitable for wireless
streaming environments with limited memory and computational resources.
8.2 Future Work
This section discusses important directions for future work on the following topics: spatial
and spatio-temporal outlier detection, spatio-temporal anomalous cluster detection, energy
disaggregation, and wireless device fingerprinting.
8.2.1 Spatial and Spatio-Temporal Outlier Detection
We have presented two generic solutions to the problems of numerical and non-numerical spatial
outlier detection. However, there are still only a limited number of methods that can effectively and
efficiently detect outliers from large scale multivariate mixed-type spatial datasets and spatio-temporal
datasets. For multivariate mixed-type spatial datasets, there are three main challenges: 1) how to
model spatial correlations between mixed-type attributes; 2) how to model large variations caused
by outliers; and 3) how to detect outliers at a near linear time cost.
To address the first challenge, the mixed-type attributes can be mapped to latent numerical random
variables that are multivariate Gaussian in nature. Each attribute is mapped to a corresponding
latent numerical variable via a specific link function, such as a logit function for binary attributes
and a log function for count attributes. Using link functions to model attributes of different types
is one of the most popular strategies for modeling non-numerical data. Under this strategy, the
dependency between mixed-type attributes is modeled by the dependencies between their latent
numerical random variables through a variance-covariance matrix. To address the second challenge,
we may employ an idea similar to that used in our approach for non-numerical outlier detection: an
additional error buffer component based on heavy-tailed distributions, such as the Student-t and
Laplace distributions, can be incorporated to capture large variations caused by anomalies. For the
third challenge, we can apply fixed rank or knot based dimension reduction techniques. Because the
inference of the resulting model is analytically intractable, approximate inference techniques need
to be used, such as interior point methods, Gaussian approximation, Laplace approximation,
variational inference, and expectation propagation.
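The link-function mapping can be sketched as follows (the clipping constant and the +1 offset for zero counts are illustrative choices, not part of the proposed model): each non-numerical observation is pushed through the appropriate link onto a latent numerical scale, on which existing numerical SOD methods can then operate.

```python
import math

def latent_value(y, kind):
    """Map an observation to a latent numerical scale via a GLM link.
    'count' uses a log link (+1 so zero counts stay finite, an
    illustrative choice); 'binary' pushes a probability estimate
    through the logit link."""
    if kind == 'count':
        return math.log(y + 1.0)
    if kind == 'binary':
        p = min(max(y, 1e-6), 1.0 - 1e-6)   # clip so the logit stays finite
        return math.log(p / (1.0 - p))
    raise ValueError(kind)

# A count attribute with one extreme site: the latent values can now be
# fed to any numerical SOD method.
counts = [3, 4, 2, 120, 5]
latent = [latent_value(c, 'count') for c in counts]
print(latent)
```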
For the problem of spatio-temporal outlier detection, one important subproblem is the detection of
non-numerical univariate outliers. Two spatio-temporal models can be used: the spatio-temporal
random effects (STRE) model and the spatio-temporal Kriging (STK) model. In our work, we have
demonstrated the effectiveness and efficiency of detection methods designed based on heavy-tailed
distributions. Here, we can consider a similar framework and add additional random variables that
follow a heavy-tailed distribution to absorb large variations caused by outliers. One additional
challenge is to consider both spatial and temporal correlations, which makes the design of efficient
approximate inference algorithms more difficult. Another important problem is the detection of
multivariate mixed-type outliers, for which we can use a strategy similar to the one discussed above,
applied to multivariate spatio-temporal data. Lastly, all the proposed algorithms currently focus
on the detection of two-sided outliers. However, one-sided outliers are also very important, and we
plan to further extend our proposed algorithms so that one-sided outliers can also be identified.
8.2.2 Spatio-Temporal Anomalous Cluster Detection
Anomalous spatial cluster detection is different from general spatial outlier detection: the latter
focuses on detecting isolated outliers, while the former focuses on detecting a group of outliers
that are spatial neighbors of each other. The additional constraint of spatial affinity between
outliers makes spatial cluster detection more challenging. Over the years, a number of approaches
have been proposed for the detection of spatial clusters. One of the major approaches is the
so-called spatial scan statistic. This approach is very effective, but its computational cost is very
high and it is unable to detect irregularly shaped spatial clusters. To address these two challenges,
Neill et al. [77-82] proposed fast subset scan and Bayesian scan statistics approaches.
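A naive sketch of a Poisson spatial scan over circular windows may clarify the idea (illustrative only: real scan statistics search over many window sizes and assess significance by randomization, and fast subset scan avoids the exhaustive enumeration used here; the toy grid and counts are hypothetical):

```python
import math

def poisson_llr(c_in, b_in, C, B):
    """Kulldorff-style Poisson log-likelihood ratio for a candidate region
    with observed count c_in and expected count b_in (totals C and B)."""
    c_out, b_out = C - c_in, B - b_in
    if c_in / b_in <= c_out / b_out:       # only score elevated regions
        return 0.0
    llr = c_in * math.log(c_in / b_in)
    if c_out > 0:
        llr += c_out * math.log(c_out / b_out)
    return llr

def circular_scan(points, counts, expected, radius):
    """Score the disk of fixed radius around every center and return the
    best-scoring region (score, tuple of member indices)."""
    C, B = sum(counts), sum(expected)
    best_score, best_region = 0.0, ()
    for xi, yi in points:
        region = tuple(j for j, (xj, yj) in enumerate(points)
                       if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2)
        b_in = sum(expected[j] for j in region)
        if not 0 < b_in < B:
            continue
        score = poisson_llr(sum(counts[j] for j in region), b_in, C, B)
        if score > best_score:
            best_score, best_region = score, region
    return best_score, best_region

# Toy data: a count hotspot around (0, 0) against uniform expectations.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
cnt = [9, 8, 9, 1, 2, 1]
exp = [2.0] * 6
score, region = circular_scan(pts, cnt, exp, radius=1.5)
print(region, round(score, 2))  # → (0, 1, 2) with its LLR
```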
Based on our previous work, several directions can be explored to further improve the performance
of existing methods. First, the family of scan statistics based approaches assumes that the trend of
the data has been eliminated in advance. However, this may not be practical in a dynamic
environment, in which the trend model needs to be fitted concurrently with the process of spatial
cluster detection. In this situation, scan statistics based approaches may be inappropriate. This
challenge can potentially be addressed by using our proposed models based on heavy-tailed
distributions. In our previous approaches, we assumed that the error-buffer random variables that
follow a heavy-tailed distribution are i.i.d. In order to identify a group of outliers that are spatial
neighbors of each other, we can add an additional latent layer to these error-buffer random
variables that forms a latent Markov random field, which uses the Markov property to enforce
consistent labels (normal or outlier) for neighboring observations.
The second direction is to consider the detection of anomalous clusters in heterogeneous data. For
example, in order to detect social unrest events from social media, which include spatial attributes,
temporal attributes, and many other types of attributes, it is necessary to extend traditional spatial
clustering techniques to support spatial, temporal, textual, and graph data attributes together.
Two potential strategies may be explored to address this challenge. The first is to integrate a
topic model and a spatio-temporal random effects (or Kriging) model into a unified framework,
which makes it possible to model both spatio-temporal and textual data. The second is to integrate
sliding window based techniques and scan statistics. For each sliding window, the potential spatial
clusters and graph clusters are identified. Then, for adjacent sliding windows, the clusters that
share the same locations, textual content, or graph nodes are connected, so that we can monitor
the evolution patterns of clusters and detect, or even forecast, significant spatio-temporal clusters.
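The window-linking step can be sketched as a greedy chaining of clusters that share location IDs across adjacent windows (a simplification: the strategy above would also link by textual content or graph nodes, and the window contents here are hypothetical):

```python
def link_clusters(windows):
    """Chain clusters across adjacent sliding windows when they share at
    least one location ID, so a cluster's evolution can be tracked.
    windows: list of window contents; each cluster is a set of location IDs.
    Greedy: a track absorbs at most one cluster per window."""
    tracks = []          # each track: list of (window_index, cluster)
    open_tracks = []     # tracks that reached the previous window
    for w, clusters in enumerate(windows):
        next_open = []
        for c in clusters:
            extended = False
            for tr in open_tracks:
                last_w, last_c = tr[-1]
                if last_w == w - 1 and last_c & c:   # shared locations
                    tr.append((w, c))
                    next_open.append(tr)
                    extended = True
                    break
            if not extended:
                tr = [(w, c)]
                tracks.append(tr)
                next_open.append(tr)
        open_tracks = next_open
    return tracks

# Three windows: one cluster drifts {1,2} -> {2,3} -> {3,4};
# another appears in a single window only.
wins = [[{1, 2}], [{2, 3}, {9}], [{3, 4}]]
trks = link_clusters(wins)
for tr in trks:
    print([c for _, c in tr])
```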
8.2.3 Energy Disaggregation
Energy disaggregation based on smart meter data is a relatively new research topic that emerged in
the last five years. Smart meter data usually has a low sampling frequency (e.g., one reading per 15
minutes or 1 hour). This special feature makes most traditional time series disaggregation techniques
inappropriate, such as the Independent Component Analysis (ICA) technique used in audio source
separation. In our previous work, we presented an effective HMM-based approach to the disaggregation
of water data. For our future work, we are interested in exploring two directions.
First, there are usually multiple types of energy consumption data that need to be disaggregated,
such as water, gas, and power, and these different types of data may be significantly correlated
with each other. For example, a shower activity may concurrently consume both water and gas (or
power), and a washer or dishwasher usage may concurrently consume both water and power. To
address this challenge, we can extend the traditional factorial hidden Markov model (FHMM) and
consider an improved model, namely the Parallel Factorial Hidden Markov Model (P-FHMM).
P-FHMM models each type of energy data via an individual FHMM. The multiple FHMMs are
coupled by considering dependencies between appliances of different energy types. For example, a
washer-dryer will concurrently consume power, water, and gas; a dishwasher will consume both power
and water; and shower and toilet uses are closely correlated with restroom lighting. In order to
consider household dependent features, a Bayesian version of the P-FHMM model (BP-FHMM) can
be designed, in which users can add household dependent features (e.g., appliance profiles and energy
usage habits) as priors on the state transition and emission distribution parameters. Structured
variational inference algorithms can be applied to perform multi-energy disaggregation based on the
P-FHMM and BP-FHMM models.
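Why the joint state space of an FHMM explodes, and what exact inference over it looks like, can be seen in a small sketch (max-product decoding over the 2^K joint states of K hypothetical on/off appliances with a Gaussian aggregate observation; P-FHMM/BP-FHMM would couple chains across energy types rather than enumerate joint states like this):

```python
import math
from itertools import product

def fhmm_map_state(readings, means, p_stay, sd=1.0):
    """Most likely final joint state of K binary appliances (max-product).
    The joint space has 2^K states, which is exactly why factorial HMMs
    need structured variational approximations once K grows.
    means[k]: consumption of appliance k when on; p_stay: self-transition."""
    K = len(means)
    states = list(product([0, 1], repeat=K))
    logp = {s: -K * math.log(2) for s in states}     # uniform prior
    for x in readings:
        new = {}
        for s2 in states:
            # best predecessor under independent per-appliance transitions
            best = -math.inf
            for s, lp in logp.items():
                step = sum(math.log(p_stay if a == b else 1.0 - p_stay)
                           for a, b in zip(s, s2))
                best = max(best, lp + step)
            # Gaussian log-likelihood of the aggregate reading (up to const)
            mu = sum(m for m, on in zip(means, s2) if on)
            new[s2] = best - 0.5 * ((x - mu) / sd) ** 2
        logp = new
    return max(logp, key=logp.get)

# Two hypothetical appliances drawing 3 and 5 units when on:
# the readings suggest off/off, then first on, then both on.
print(fhmm_map_state([0.1, 3.2, 8.1], means=[3.0, 5.0], p_stay=0.8))  # → (1, 1)
```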
Second, in order to learn an FHMM, a training set containing the disaggregated consumption data
of individual appliances must be collected in advance for each household, because different
households may have different appliances and consumption behaviors. However, this may be
impractical, since the collection of energy consumption data is not only label-intensive but may
also raise serious privacy concerns. Therefore, it is important to develop a semi-supervised model
that can be trained on data collected from a limited number of households and then applied directly
to other households. Specifically, the goal is to learn a model based on a small set of labeled data
and a large set of unlabeled data. An alternative approach is to consider active learning techniques:
we first learn a model based on a small set of training data, and then apply the trained model to
identify a small set of newly streamed data for which to ask users to provide labels, so that the
model can adapt to changes in appliances or in users' energy consumption behaviors. Lastly, the
proposed algorithms mostly focus on the temporal dimension, but smart meter data also has an
important spatial dimension. We are interested in studying the energy disaggregation problem in
both the spatial and temporal dimensions.
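The active-learning loop above can be sketched with plain uncertainty sampling (the probability model is a hypothetical stand-in; a trained disaggregation model would take its place):

```python
def select_queries(stream, predict_proba, budget=2):
    """Uncertainty sampling: ask users to label the readings whose
    predicted appliance-state probability is closest to 0.5, so the model
    adapts to appliance or behavior changes with few labels."""
    scored = sorted(stream, key=lambda x: abs(predict_proba(x) - 0.5))
    return scored[:budget]

def proba(x):
    """Hypothetical scorer: probability a reading comes from a given appliance."""
    return min(1.0, x / 10.0)

print(select_queries([0.2, 4.8, 5.1, 9.7, 1.0], proba, budget=2))  # → [5.1, 4.8]
```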
8.2.4 Wireless Device Fingerprinting
As discussed in Chapter 7, we proposed an infinite hidden Markov random field (iHMRF) model
to capture correlations between both time-independent and time-dependent features. The iHMRF
model encourages spatial and temporal neighbors to share a consistent cluster (device) label, but it
does not have the ability to assign different cluster labels to objects that are spatially far apart
but temporally close. For the problem of device fingerprinting, two objects in this situation tend
to originate from two different devices, because a device cannot be in two different places at the
same time. In order to address this challenge, the iHMRF model can be extended to incorporate
this additional constraint. In addition, the iHMRF model is unable to explicitly model dynamic
patterns (e.g., the spatial trajectory of RSS features), since it only has the spatial affinity
constraint. An alternative approach is to consider the Spatio-Temporal Kalman Filtering (STKF)
model, which has been widely used for predicting spatial trajectories. Another interesting research
problem is to apply the iHMRF or STKF model together with hypothesis testing based techniques
to detect whether a device changes locations within a specified time interval, for example, moving
from inside a building to outside.
8.3 Published Papers
1. Feng Chen, Jing Dai, Bingsheng Wang, Sambit Sahu, Milind Naphade, Chang-Tien Lu,“Activity
Analysis Based on Low Sample Rate Smart Meters,” Proceedings of the 17th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (ACM-KDD), pages
240-248, 2011 (Acceptance rate: 17.5%)
2. Feng Chen, Chang-Tien Lu, Arnold P. Boedihardjo,“GLS-SOD: A Generalized Local Statis-
tical Approach for Spatial Outlier Detection,” Proceedings of the 16th ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining (ACM-KDD), pages 1069-1078,
2010 (Acceptance rate: 13%)
3. Feng Chen, Chang-Tien Lu, Arnold P. Boedihardjo,“On Locally Linear Classification by Pair-
wise Coupling,” Proceedings of the IEEE International Conference on Data Mining (IEEE
ICDM), pages 749-754, 2008 (Acceptance rate: 19%)
4. Feng Chen and Chang-Tien Lu, “Nearest Neighbor Query,” Encyclopedia of Geographical
Information Science (1st Edition)
5. Feng Chen, Jaime Arredondo, Rupinder Paul Khandpur, Chang-Tien Lu, David Mares,
Dipak Gupta, and Naren Ramakrishnan, “Spatial Surrogates to Forecast Social Mobilization
and Civil Unrests,” Position Paper in CCC Workshop on “From GPS and Virtual Globes to
Spatial Computing-2012,” Washington, D.C., Sep 2012
6. Yang Chen, Feng Chen, Jing Dai, T. Charles Clancy, ”Student-t Based Robust Spatio-
Temporal Prediction,” the IEEE International Conference on Data Mining (IEEE ICDM),
2012 (Full paper, Acceptance rate 10.7%)
7. Xutong Liu, Feng Chen, Chang-Tien Lu, “Robust Inference and Outlier Detection for Large
Spatial Data Sets,” the IEEE International Conference on Data Mining (IEEE ICDM), 2012
(Full paper, Acceptance rate 10.7%)
8. Bingsheng Wang, Feng Chen, Haili Dong, Arnold Boedihardjo, and Chang-Tien Lu, “Low-
Sample-Rate Water Consumption Disaggregation via Sparse Coding with Extended Discriminative
Dictionary,” the IEEE International Conference on Data Mining (IEEE ICDM), 2012
(Short paper, Acceptance rate 20%)
9. Jing Dai, Feng Chen, Sambit Sahu, Milind Naphade,“Regional Behavior Change Detection
via Local Spatial Scan,” Proceedings of the 18th ACM SIGSPATIAL International Conference
on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS), 2010
10. Xutong Liu, Chang-Tien Lu, Feng Chen,“Spatial Outlier Detection: Random Walk Based
Approaches,” Proceedings of the 18th ACM SIGSPATIAL International Conference on Ad-
vances in Geographic Information Systems (ACM SIGSPATIAL GIS), 2010 (Acceptance
rate: 21%)
11. Xutong Liu, Chang-Tien Lu, Feng Chen,“Spatial Categorical Outlier Detection: Pair Corre-
lation Function Based Approach,” Proceedings of the 19th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS),
to appear 2012
12. Qiben Yan, Ming Li, Feng Chen, Tingting Jiang, Wenjing Lou, Chang-Tien Lu, ”Optimal
Network Traffic Surveillance in Cognitive Radio Networks,” The 32nd IEEE International
Conference on Computer Communications (IEEE INFOCOM), 2013 (Acceptance rate 17%)
13. Arnold P. Boedihardjo, Chang-Tien Lu, Feng Chen,“A Framework for Estimating Complex
Probability Density Structures in Data Streams,” Proceedings of the ACM 17th Conference on
Information and Knowledge Management (ACM CIKM), pages 619-628, 2008 (Acceptance
rate: 17%)
14. Chang-Tien Lu, Arnold P. Boedihardjo, David Dai, Feng Chen,“HOMES: Highway Opera-
tions and Monitoring and Evaluation System,” ACM 16th International Conference on Ad-
vances in Geographic Information Systems (ACM SIGSPATIAL GIS), Poster Paper, pages
529-530, 2008
15. Dechang Chen, Chang-Tien Lu, Yufeng Kou, Feng Chen, “On Detecting Spatial Outliers,”
GeoInformatica, vol. 12, pages 455-475, 2008
16. Qifeng Lu, Feng Chen, Kathleen Hancock,“On Path Anomaly Detection in a Large Trans-
portation Network,” Journal of Computers, Environment and Urban Systems, vol. 33, pages
448-462, 2009
17. Yao-Jan Wu, Feng Chen, Chang-Tien Lu, Brian Smith, Yang Chen,“Traffic Flow Prediction
for Urban Network using Spatial Temporal Random Effects Model,” the 91st Annual Meeting
of the Transportation Research Board (TRB), to appear 2012
18. Jing Dai, Ming Li, Sambit Sahu, Milind Naphade, Feng Chen,“Multi-granular Demand Fore-
casting in Smarter Water,” Proceedings of the 13th International Conference on Ubiquitous
Computing (Ubicomp), Poster Paper, 2011
19. Xutong Liu, Chang-Tien Lu, and Feng Chen, “An Entropy-Based Method for Assessing the
Number of Spatial Outliers,” IEEE International Conference on Information Reuse and Inte-
gration (IRI), pages 244-249, 2008
Appendix A
A.1 Estimated Bound
Theorem 3 presents an upper bound of the absolute correlation function $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$.
The properties of this upper bound function are demonstrated in Figures A.1-A.5, where we consider
five representative cases with $c = 6, 11, 15, 20, 40$, respectively. The X axis refers to the row
difference between $s_j$ and $s_i$: $\mathrm{row}(s_j) - \mathrm{row}(s_i)$. The Y axis refers to the
column difference between $s_j$ and $s_i$: $\mathrm{col}(s_j) - \mathrm{col}(s_i)$. The Z axis refers
to the absolute correlation value. Each figure includes two surfaces. The surface with the colored
(yellow to red) map is calculated from the estimated upper bound function. The surface in gray scale
is calculated from the true correlation function (see Equation 3.17). These results demonstrate that
the estimated upper bound function is a tight upper bound of the true absolute correlation function
$|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$.
A.2 Definition of Matrices M and E
The matrices $M$ and $E$ are defined as follows:
\[
M = C^T C, \qquad E = C^T a,
\]
where $a = \big[\,-U_1^{-1/2} H_1 \eta_0,\ 0,\ \cdots,\ 0\,\big]^T$, $\eta_0$ refers to the initial value and is predefined, and
\[
C =
\begin{bmatrix}
U_1^{-1/2} & 0 & \cdots & 0 & 0 \\
-U_2^{-1/2} H_2 & U_2^{-1/2} & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & U_{T-1}^{-1/2} & 0 \\
0 & 0 & \cdots & -U_T^{-1/2} H_T & U_T^{-1/2}
\end{bmatrix}.
\]
Figure A.1: The comparison between the true correlation $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$ and the estimated bound function. Here, $K = 12$, $c = 6$.

Figure A.2: The comparison between the true correlation $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$ and the estimated bound function. Here, $K = 12$, $c = 11$.

Figure A.3: The comparison between the true correlation $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$ and the estimated bound function. Here, $K = 12$, $c = 15$.

Figure A.4: The comparison between the true correlation $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$ and the estimated bound function. Here, $K = 12$, $c = 20$.

Figure A.5: The comparison between the true correlation $|\rho(\omega_i^*, \omega_j^*; \boldsymbol{\theta})|$ and the estimated bound function. Here, $K = 12$, $c = 40$.
It can be readily derived that
\[
\frac{1}{2} \sum_{t=1}^{T} (\eta_t - H_t \eta_{t-1})^T U_t^{-1} (\eta_t - H_t \eta_{t-1})
= \frac{1}{2} (a + C\eta)^T (a + C\eta)
= \frac{1}{2} \eta^T M \eta + E^T \eta + \frac{1}{2} a^T a.
\]
A.3 Proof of Theorem 2
The dual function $g(\omega)$ is defined as
\[
g(\omega) = \inf_{\eta, \xi, r} L(\eta, \xi, r, \omega)
= \inf_{\eta, \xi, r} \; \mathbf{1}^T \varphi(r) + \frac{1}{2} \eta^T M \eta + E^T \eta + \frac{1}{2} \xi^T \Lambda_\xi \xi + \omega^T (r + OS\eta + O\xi - Z).
\]
Fixing $r$ and solving the system of linear equations
\[
\frac{\partial L(\eta, \xi, r, \omega)}{\partial \eta} = M\eta + S^T O^T \omega + E = 0, \qquad
\frac{\partial L(\eta, \xi, r, \omega)}{\partial \xi} = \Lambda_\xi \xi + O^T \omega = 0,
\]
the optimal $\eta^*$ and $\xi^*$ have the closed forms
\[
\eta^* = -M^{-1} (S^T O^T \omega + E), \qquad
\xi^* = -\Lambda_\xi^{-1} O^T \omega.
\]
Substituting $\eta^*$ and $\xi^*$, the $L$ function becomes
\[
L^*(r, \omega) = \mathbf{1}^T \varphi(r) + \omega^T (r - Z) - \frac{1}{2} \omega^T O \Lambda_\xi^{-1} O^T \omega
- \frac{1}{2} (S^T O^T \omega + E)^T M^{-1} (S^T O^T \omega + E) + \text{const}.
\]
The dual function $g(\omega)$ can be reformulated as
\[
g(\omega) = \inf_r L^*(r, \omega)
= -\omega^T Z - \frac{1}{2} \omega^T \big( O S M^{-1} S^T O^T + O \Lambda_\xi^{-1} O^T \big) \omega
- \omega^T O S M^{-1} E + \sum_{t,n} \inf_{r_{tn}} \big( \varphi(r_{tn}) + \omega_{tn} r_{tn} \big) + \text{const}.
\]
Let $\inf_{r_{tn}} \big( \varphi(r_{tn}) + \omega_{tn} r_{tn} \big) = -\sup_{r_{tn}} \big( -\varphi(r_{tn}) - \omega_{tn} r_{tn} \big) = -\varphi^*(\omega_{tn})$, where $\varphi^*(\omega_{tn})$ is defined as
\[
\varphi^*(\omega_{tn}) = \sup_{r_{tn}} \big( -\omega_{tn} r_{tn} - \varphi(r_{tn}) \big)
= \begin{cases} \dfrac{\omega_{tn}^2}{2}, & |\omega_{tn}| \le \kappa \\[4pt] \infty, & \text{otherwise.} \end{cases}
\]
Case 1: If $r_{tn} > \kappa$,
\[
\varphi^*(\omega_{tn}) = \sup_{r_{tn} > \kappa} \Big( -\omega_{tn} r_{tn} - r_{tn}\kappa + \frac{1}{2}\kappa^2 \Big)
= \begin{cases} -\omega_{tn}\kappa - \dfrac{1}{2}\kappa^2, & \omega_{tn} > -\kappa \\[4pt] \infty, & \text{otherwise.} \end{cases}
\]
Case 2: If $r_{tn} < -\kappa$,
\[
\varphi^*(\omega_{tn}) = \sup_{r_{tn} < -\kappa} \Big( -\omega_{tn} r_{tn} + r_{tn}\kappa + \frac{1}{2}\kappa^2 \Big)
= \begin{cases} \omega_{tn}\kappa - \dfrac{1}{2}\kappa^2, & \omega_{tn} < \kappa \\[4pt] \infty, & \text{otherwise.} \end{cases}
\]
Case 3: If $|r_{tn}| \le \kappa$,
\[
\varphi^*(\omega_{tn}) = \sup_{|r_{tn}| \le \kappa} \Big( -\omega_{tn} r_{tn} - \frac{1}{2} r_{tn}^2 \Big)
= \begin{cases} \dfrac{\omega_{tn}^2}{2}, & |\omega_{tn}| \le \kappa \\[4pt] \infty, & \text{otherwise.} \end{cases}
\]
It is concluded that the dual function is
\[
g(\omega) = -\omega^T Z - \frac{1}{2} \omega^T O \big( S M^{-1} S^T + \Lambda_\xi^{-1} \big) O^T \omega
- \omega^T O S M^{-1} E - \frac{1}{2} \omega^T \omega + \text{const}
\]
when $|\omega| \le \kappa \mathbf{1}$; and $g(\omega) = -\infty$ otherwise.
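The case analysis can be double-checked numerically: a grid search for $\sup_r(-\omega r - \varphi(r))$ reproduces $\varphi^*(\omega) = \omega^2/2$ on $|\omega| \le \kappa$ and diverges outside (a verification sketch, not part of the proof; the grid bounds are arbitrary choices).

```python
def huber(r, k):
    """Huber penalty: quadratic on [-k, k], linear with slope k outside."""
    return 0.5 * r * r if abs(r) <= k else k * abs(r) - 0.5 * k * k

def conjugate(omega, k, lo=-20.0, hi=20.0, n=80001):
    """Grid approximation of sup_r (-omega * r - huber(r, k))."""
    best = float('-inf')
    for i in range(n):
        r = lo + (hi - lo) * i / (n - 1)
        best = max(best, -omega * r - huber(r, k))
    return best

k = 1.5
for w in (-1.2, 0.0, 0.7, 1.4):
    print(w, conjugate(w, k), w * w / 2)   # the two columns agree on |w| <= k
print(conjugate(2.0, k))  # finite only because the grid is; grows with hi
```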
A.4 Proof of Theorem 3
The dual function $g(\omega)$ is defined as
\[
g(\omega) = \inf_{\eta, \xi, r} L(\eta, \xi, r, \omega)
= \inf_{\eta, \xi, r} \; \|r\|_1 + \frac{1}{2} \eta^T M \eta + E^T \eta + \frac{1}{2} \xi^T \Lambda_\xi \xi + \omega^T (r + OS\eta + O\xi - Z).
\]
Fixing $r$, similarly to the Huber distribution case, the optimal $\eta^*$ and $\xi^*$ have the closed forms
\[
\eta^* = -M^{-1} (S^T O^T \omega + E), \qquad
\xi^* = -\Lambda_\xi^{-1} O^T \omega.
\]
Substituting $\eta^*$ and $\xi^*$, the dual function can be reformulated as
\[
g(\omega) = -\omega^T Z - \frac{1}{2} \omega^T \big( O S M^{-1} S^T O^T + O \Lambda_\xi^{-1} O^T \big) \omega
- \omega^T O S M^{-1} E + \inf_r \big( \|r\|_1 + \omega^T r \big) + \text{const}.
\]
It can be readily proved that
\[
\inf_r \big( \|r\|_1 + \omega^T r \big)
= \begin{cases} 0, & -\mathbf{1} \le \omega \le \mathbf{1} \\ -\infty, & \text{otherwise.} \end{cases}
\]
It is concluded that the dual function is
\[
g(\omega) = -\omega^T Z - \frac{1}{2} \omega^T O \big( S M^{-1} S^T + \Lambda_\xi^{-1} \big) O^T \omega
- \omega^T O S M^{-1} E + \text{const}
\]
when $-\mathbf{1} \le \omega \le \mathbf{1}$; $g(\omega) = -\infty$, otherwise.
A.5 Offline Inference Solution for iHMRF
The variational parameters are estimated as follows:
\begin{align}
\zeta_{c,1} &= 1 + \sum_{i=1}^{N} q(z_i = c) \tag{A.1}\\
\zeta_{c,2} &= \frac{\lambda_1}{\lambda_2} + \sum_{k=c+1}^{C} \sum_{i=1}^{N} q(z_i = k) \tag{A.2}\\
\hat{\lambda}_1 &= \lambda_1 + C - 1 \tag{A.3}\\
\hat{\lambda}_2 &= \lambda_2 - \sum_{c=1}^{C-1} \big( \psi(\zeta_{c,2}) - \psi(\zeta_{c,1} + \zeta_{c,2}) \big) \tag{A.4}\\
w_c &= \sum_{i=1}^{N} q(z_i = c) \tag{A.5}\\
\bar{x}_c &= \frac{\sum_{i=1}^{N} q(z_i = c)\, x_i}{w_c} \tag{A.6}\\
\Xi_c &= \sum_{i=1}^{N} q(z_i = c) (x_i - \bar{x}_c)(x_i - \bar{x}_c)^T \tag{A.7}\\
\hat{\upsilon}_c &= \upsilon_c + w_c \tag{A.8}\\
\hat{\eta}_c &= \eta_c + w_c \tag{A.9}\\
\hat{g}_c &= \frac{\upsilon_c g_c + w_c \bar{x}_c}{\hat{\upsilon}_c} \tag{A.10}\\
\hat{\Lambda}_c &= \frac{\upsilon_c w_c}{\upsilon_c + w_c} (g_c - \bar{x}_c)(g_c - \bar{x}_c)^T + \Lambda_c + \Xi_c \tag{A.11}\\
q(z_i = c) &\propto p\big( z_i = c \mid \mathcal{N}(x_i); \gamma \big)\, \pi_c(\beta)\, p(x_i \mid \Theta_c) \tag{A.12}\\
\pi_c(\beta) &= \exp\!\Big( \sum_{k=1}^{c-1} \psi(\zeta_{k,2}) - \psi(\zeta_{k,1} + \zeta_{k,2}) \Big)
\times \exp\big( \psi(\zeta_{c,1}) - \psi(\zeta_{c,1} + \zeta_{c,2}) \big) \tag{A.13}\\
p(x_i \mid \Theta_c) &= \exp\!\Big( -\frac{1}{2} \log \Big| \frac{\Lambda_c}{2} \Big| + \frac{1}{2} \sum_{k=1}^{d} \psi\Big( \frac{\upsilon_c + 1 - k}{2} \Big) \Big)
\times \exp\!\Big( -\frac{1}{2} \upsilon_c (x_i - g_c)^T \Lambda_c^{-1} (x_i - g_c) \Big)
\times \exp\!\Big( -\frac{d}{2} \log 2\pi - \frac{d}{2\eta_c} \Big) \tag{A.14}
\end{align}
Bibliography
[1] Shekhar, S. and Huang, Y. Co-location Rules Mining: A Summary of Results. In Proc. Spatio-
temporal Symposium on Databases, 2001.
[2] Chawla, S., Shekhar, S., Wu, W-L, and Ozesmi, U. Modelling spatial dependencies for mining
geospatial data: An introduction. In Harvey Miller and Jiawei Han, editors, Geographic data
mining and Knowledge Discovery (GKD), 1999.
[3] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. A survey on sensor networks. In
Communications Magazine, IEEE, vol. 40, issue 8, pages 101–114, 2002.
[4] Shekhar, S., Zhang, P., Huang, Y. and Vatsavai, R.R. Trends in spatial data mining. In:
Kargupta, H., Joshi, A. (Eds.), Data Mining: Next Generation Challenges and Future Directions,
AAAI/MIT Press. pp. 357-380, 2003.
[5] Culler, D., Estrin, D., and Srivastava, M. Overview of sensor networks. In IEEE Computer,
vol. 37, issue 8, pages 41–49, 2004.
[6] Zhao, F. and Guibas, L. Wireless sensor networks: an information processing approach. Morgan
Kaufmann Pub, 2004.
[7] Arora, A., Dutta, P., Bapat, S., Kulathumani, V., Zhang, H., Naik, V., Mittal, V., Cao, H.,
Demirbas, M., Gouda, M., Choi, Y., Herman, T., Kulkarni, S., Arumugam, U., Nesterenko, M.,
Vora, A., and Miyashita, M. A line in the sand: a wireless sensor network for target detection,
classification, and tracking. In Journal of Computer Network, vol. 46, issue 5, pages 605–634,
2004.
[8] Li, D., Wong, K., Hu, Y.H., and Sayeed, A. Detection, classification, and tracking of targets. In
Signal Processing Magazine, IEEE, vol. 19, issue 2, pages 17–29, 2002.
[9] Brennan, S.M., Mielke, A.M., Torney, D.C., and Maccabe, A.B. Radiation detection with
distributed sensor networks. In Computer, vol. 37, issue 8, pages 57–59, 2004.
[10] Cui, Y., Wei, Q., Park, H., and Lieber, C. Nanowire nanosensors for highly sensitive and
selective detection of biological and chemical species. In Science, vol. 293, issue 5533, pages
1289–1292, 2001.
[11] Hills, R. Sensing for danger. In Science and Technology Review, July/August 2001.
[12] Caron, Y., Makris, P., and Vincent, N. A method for detecting artificial objects in natural
environments. In Proceedings 16th International Conference on Pattern Recognition, vol. 1, pages
600–603, IEEE Comput. Soc., 2002.
[13] Geman, D. and Jedynak, B. An active testing model for tracking roads in satellite images. In
IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, pages 1–14, 1996.
[14] Pozo, D., Olmo, F., and Alados-Arboledas, L. Fire detection and growth monitoring using a
multitemporal technique on AVHRR mid-infrared and thermal channels. In IEEE Remote Sensing
of Environment, vol. 60, issue 2, pages 111–120, 1997.
[15] Strickland, R. and Hahn, H. Wavelet transform methods for object detection and recovery. In
IEEE Trans. Image Process, vol. 6, issue 5, pages 724–735, 1997.
[16] Tan, H. and Zhang, Y. An energy minimization process for extracting eye feature based on
deformable template. In Lecture Notes in Computer Science, vol. 3852, pages 663–672, 2006.
[17] Zhong, Y., Jain, A., and Dubuisson-Jolly, M.P. Object tracking using deformable templates.
In IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, issue 5, pages 544–549, 2000.
[18] Braams, J., Pruim, J., Freling, N., Nikkels, P., Roodenburg, J., Boering, G., Vaalburg, W., and
Vermey, A. Detection of lymph node metastases of squamous-cell cancer of the head and neck
with FDG-PET and MRI. In Journal of Nuclear Medicine, vol. 36, issue 2, pages 211–216, 1995.
[19] James, D., Clymer, B.D., and Schmalbrock, P. Texture detection of simulated microcalcification
susceptibility effects in magnetic resonance imaging of breasts. In Journal of Magnetic Resonance
Imaging, vol. 13, issue 6, pages 876–881, 2001.
[20] McInerney, T. and Terzopoulos, D. Deformable models in medical image analysis: a survey. In
Medical Image Analysis, vol. 1, issue 2, pages 91–108, 1996.
[21] Moon, N., Bullitt, E., van Leemput, K., and Gerig, G. Automatic brain and tumor segmen-
tation. In Proceedings of the 5th International Conference on Medical Image Computing and
Computer-Assisted Intervention-Part I, pages 372–379, 2002.
[22] Heffernan, R., Mostashari, F., Das, D., Karpati, A., Kulldorff, M., and Weiss, D. Syndromic
surveillance in public health practice, New York City. In Emerging Infectious Diseases, vol. 10,
issue 5, pages 858–864, 2004.
[23] Rotz, L. and Hughes, J. Advances in detecting and responding to threats from bioterrorism
and emerging infectious disease. In Nature Medicine, pages 130–136, 2004.
[24] Wagner, M., Tsui, F., Espino, J., Dato, V., Sittig, D., Caruana, R., Mcginnis, L., Deerfield, D.,
Druzdzel, M., and Fridsma, D. The emerging science of very early detection of disease outbreaks.
In Journal of Public Health Management and Practice, vol. 7, issue 6, pages 51–59, 2001.
[25] Szor, P. The art of computer virus research and defense. Addison-Wesley Professional, 2005.
[26] Szewczyk, R., Osterweil, E., Polastre, J., Hamilton, M., Mainwaring, A., and Estrin, D. Habitat
monitoring with sensor networks. In Communications of the ACM, vol. 47, issue 6, pages 34–40,
June 2004.
[27] Gilbert, R. Statistical methods for environmental pollution monitoring. Wiley, 1987.
[28] Marshall, C., Best, N., Bottle, A., and Aylin, P. Statistical issues in the prospective monitoring
of health outcomes across multiple units. In Journal of the Royal Statistical Society, vol. 167,
issue 3, pages 541–559, 2004.
[29] Zhang, Y., Meratnia, N., and Havinga, P. Outlier Detection Techniques for Wireless Sensor
Networks: A Survey. In IEEE Communications Surveys and Tutorials, vol. 12, issue 2, pages
159–170, 2010.
[30] Schuler, R.E. The Smart Grid: A Bridge between Emerging Technologies, Society, and the
Environment. National Academy of Engineering (NAE), vol. 40, 2010.
[31] IBM Smarter Planet. http://www.ibm.com/smarterplanet/us/en/
[32] Gilardi, N., Kanevski, M., Maignan, M., and Mayoraz, E. Environmental and Pollution Spatial
Data Classification with Support Vector Machines and Geostatistics. In ACAI'99, pages 43–51,
Greece, July 1999.
[33] Inan, H.I., Aydinoglu, A.C., and Yomralioglu, T. Spatial Classification of Land Parcels in Land
Administration Systems. In International Conference on Spatial Data Infrastructures, 2010.
[34] Koperski, K. and Han, J. Discovery of spatial association rules in geographic information
databases. In Advances in Spatial Databases, Proc. of 4th International Symposium, SSD’95,
pages 47–66, Portland, Maine, USA, 1995.
[35] Koperski, K., Adhikary, J., and Han, J. Spatial data mining: Progress and challenges. In
Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pages 1–10,
Montreal, Canada, 1996.
[36] Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J. Models and issues in data stream
systems. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on
Principles of Database Systems (PODS 2002), pages 1–16, Madison, Wisconsin, ACM Press, 2002.
[37] Arias-Castro, E., Candes, E.J., and Durand, A. Detection of an anomalous cluster in a network.
In The Annals of Applied Statistics, Jan 2010.
[38] Arias-Castro, E., Donoho, D., and Huo, X. Near-optimal detection of geometric objects by fast
multiscale methods. In IEEE Transactions on Information Theory, vol. 51, issue 7, pages 2402–2425,
2005.
[39] Hall, P. and Jin, J. Innovated higher criticism for detecting sparse signals in correlated noise.
In Annals of Statistics, vol. 38, 2010.
[40] Arias-Castro, E., Candès, E.J., Helgason, H., and Zeitouni, O. Searching for a trail of evidence
in a maze. In Annals of Statistics, vol. 36, issue 4, pages 1726–1757, 2008.
[41] Arias-Castro, E., Candès, E.J., and Durand, A. Detection of an abnormal cluster in a network.
In The Bulletin of the International Statistical Institute, Durban, South Africa, 2009.
[42] Babu, S., and Widom, J. Continuous queries over data streams. SIGMOD Rec., 30(3):109–120,
2001.
[43] Greenwald, M., and Khanna, S. Space-efficient online computation of quantile summaries. In
SIGMOD ’01: Proceedings of the 2001 ACM SIGMOD international conference on Management
of data, pages 58–66. ACM Press, 2001.
[44] Gao, L. and Wang, X.S. Continually evaluating similarity-based pattern queries on a streaming
time series. In SIGMOD ’02: Proceedings of the 2002 ACM SIGMOD international conference
on Management of data, pages 370–381. ACM Press, 2002.
[45] Banerjee, S., Gelfand, A.E., Finley, A.O., and Sang, H. Gaussian predictive process models
for large spatial data sets. In Journal of the Royal Statistical Society: Series B (Statistical
Methodology), vol. 70, issue 4, pages 825–848, 2008.
[46] Finley, A.O., Sang, H., Banerjee, S., and Gelfand, A.E. Improving the performance of predictive
process modeling for large datasets. In Computational Statistics and Data Analysis, vol. 53,
issue 8, pages 2873–2884, 2009.
[47] Hulten, G., Spencer, L., and Domingos, P. Mining time-changing data streams. In Proceedings
of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 97–106, 2001.
[48] Domingos, P., and Hulten, G. Mining high-speed data streams. In Knowledge Discovery and
Data Mining, pages 71–80, 2000.
[49] Street, W.N. and Kim, Y.S. A streaming ensemble algorithm (sea) for large-scale classification.
In KDD ’01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 377–382. ACM Press, 2001.
[50] Wang, H.X., Fan, W., Yu, P.S., and Han, H. Mining concept-drifting data streams using en-
semble classifiers. In Pedro Domingos, Christos Faloutsos, Ted SEnator, Hillol Kargupta, and Lise
Getoor, editors, Proceedings of the ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-03), pages 226–235, New York, August 24–27 2003. ACM
Press.
[51] Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. Clustering data streams. In Proceedings
of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 359–366,
2000.
[52] Aggarwal, C.C. A framework for diagnosing changes in evolving data streams. In Proceedings of
the 2003 ACM SIGMOD international conference on Management of data, pages 575–586. ACM
Press, 2003.
[53] Cressie, N.A. Statistics for Spatial Data. Wiley, 1993.
[54] Schabenberger, O. and Gotway, C. A. Statistical Methods for Spatial Data Analysis. Boca
Raton: Chapman and Hall-CRC, Boca Raton, Florida, 2005.
[55] Tobler, W. R. Cellular geography. In Philosophy in Geography, pages 379–386. Dordrecht,
Holland. Dordrecht Reidel Publishing Company, 1979.
[56] Shekhar, S., Lu, C.-T. and Zhang, P. A Unified Approach to Detecting Spatial Outliers. In
Journal of GeoInformatica, vol. 7, pages 139–166, 2003.
[57] Lu, C.-T., Chen, D. and Kou, Y. Algorithms for Spatial Outlier Detection. In Proceedings of
the 3rd IEEE International Conference on Data Mining, pages 597–600, 2003.
[58] Chen, D, Lu, C.-T., Kou, Y.F, and Chen, F. On Detecting Spatial Outliers. In Journal of
Geoinformatica, vol. 12, pages 455–475, 2008.
[59] Militino, A.F., Palacios, M.B., and Ugarte, M.D. Outliers detection in multivariate spatial
linear models. In Journal of Statistical Planning and Inference, vol. 136, issue 1, pages 125–146,
2006.
[60] Hu, T. and Sung, S.Y. A trimmed mean approach to finding spatial outliers. In Journal of
Intelligent Data Analysis, vol. 8, issue 1, pages 79–95, 2004.
[61] Sun, P. and Chawla, S. On Local Spatial Outliers. In Journal of Intelligent Data Analysis,
pages 209–216, 2004.
[62] Christensen, R., Johnson, W. and Pearson, L.M. Covariance function diagnostics for spatial
linear models. In Math. Geol., vol. 25, pages 145–160, 1993.
[63] Cerioli, A. and Riani, M. The ordering of spatial data and the detection of multiple outliers.
In Journal Computational Graphical Statistics, vol. 8, pages 239–258, 1999.
[64] Militino, A.F., Palacios, M.B. and Ugarte, M.D. Outlier detection in multivariate spatial linear
models. In Journal of Statistical Planning and Inference, vol. 136, pages 125–146, 2006.
[65] Atkinson, A.C. and Riani, M. Robust Diagnostic Regression Analysis. Springer Series in
Statistics, 2000.
[66] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge Univ. Press, 2004.
[67] Glaz, J., Naus, J.I., and Wallenstein, S. Scan Statistics. Springer, 2001.
[68] Iyengar, V. S. On detecting space-time clusters. In Proceedings ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 587-592, 2004.
[69] Zhao, C., Xie, L., Jiang, X., Huang, L., and Yao, Y. A PHY-layer Authentication Approach
for Transmitter Identification in Cognitive Radio Networks. In International Conference on
Communications and Mobile Computing, pages 154–158, 2012.
[70] Zhao, Y., Reed, J.H., Mao, S., and Bae, K.K. Overhead Analysis for Radio Environment
Map-enabled Cognitive Radio Networks. In 1st IEEE Workshop on Networking Technologies for
Software Defined Radio Networks, pages 18–25, 2012.
[71] Bratus, S., Cornelius, C., Kotz, D., and Peebles, D. Active behavioral fingerprinting of wireless
devices. In Proceedings of the First ACM Conference on Wireless Network Security, pages 56–61,
2008.
[72] Chen, R. and Park, J.M. Ensuring Trustworthy Spectrum Sensing in Cognitive Radio Networks.
In 1st IEEE Workshop on Networking Technologies for Software Defined Radio Networks, pages
110–119, 2009.
[73] Brik, V., Banerjee, S., Gruteser, M., and Oh, S. Wireless Device Identification with Radiometric
Signatures. In Proceedings of MobiCom, 2008.
[74] Kulldorff, M. A spatial scan statistic. In Communications in Statistics: Theory and Methods,
vol. 26, pages 1481-1496, 1997.
[75] Kulldorff, M. Prospective time period geographic disease surveillance using a scan statistic. In
Journal of the Royal Statistical Society, vol. A164, pages 61-72, 2001.
[76] Kulldorff, M., Heffernan, R., Hartman, J., Assuncao, R., and Mostashari, F. A space-time
permutation scan statistic for disease outbreak detection. In PLoS Medicine, vol. 2, pages 216-
224, 2005.
[77] Neill, D. B. and Moore, A. Rapid Detection of Significant Spatial Clusters. In Proceedings
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 256-
265, 2004.
[78] Neill, D. B. Detection of spatial and spatio-temporal clusters. Ph.D. thesis, Carnegie Mellon
University, Department of Computer Science, Technical Report CMU-CS-06-142, 2006.
[79] Neill, D. B., Moore, A, and Cooper, G.F. A Bayesian spatial scan statistic. In Y. Weiss, et al.,
eds. Advances in Neural Information Processing Systems, vol. 18, pages 1003-1010, 2006.
[80] Neill, D. B., Moore, A, Sabhnani, M., and Danel, K. Detection of emerging space-time clusters.
In Proceedings of the 11th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,
pages 218–227, 2005.
[81] Neill, D. B. and Cooper, G.F. A multivariate Bayesian scan statistic for early event detection
and characterization. In Machine Learning, vol. 79, pages 261–282, 2010.
[82] Neill, D. B. Fast subset scan for spatial pattern detection. In Journal of the Royal Statistical
Society (Series B: Statistical Methodology), vol. 74(2), pages 337–360, 2012.
[83] Barnett, V. and Lewis, T. Outliers in statistical data. 3rd ed. John Wiley and Sons, 1994.
[84] Agrawal, R., Gunopulos, D., and Raghavan, P. Automatic subspace clustering of high dimen-
sional data for data mining applications. In Proceedings ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 94-105, 1998.
[85] Ester, M., Kriegel, H.P., Sander, J., and Xu, X.W. A density-based algorithm for discovering
clusters in large spatial databases. In Proceedings ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 44–49, 1996.
[86] Harel, D. and Koren, Y. Clustering spatial data using random walks. In Proceedings ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 281-286,
2001.
[87] Wang, W., Yang, J., and Muntz, R.R. STING: a statistical information grid approach to spatial
data mining. In Proceedings 23rd Conference on Very Large Databases, pages 186-195, 1997.
[88] Kulldorff, M., Huang, L., and Konty, K. A scan statistic for continuous data based on the
normal probability model. In International Journal of Health Geographics, vol. 8, article 58, 2009.
[89] Wu, M.X., Song, X.Y., Jermaine, C., Ranka, S., and Gums, J. A LRT framework for fast
spatial anomaly detection. In Proceedings ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 887-896, 2009.
[90] Huang, L., Tiwari, R., Zuo, J., Kulldorff, M., and Feuer, E. Weighted normal spatial scan
statistic for heterogeneous population data. In Journal of the American Statistical Association,
vol. 104, pages 886–898, 2009.
[91] Janeja, V. P. and Atluri, V. Random walks to identify anomalous free-form spatial scan windows.
In IEEE Transactions on Knowledge and Data Engineering, vol. 20, pages 1378–1392, 2008.
[92] Agarwal, D., McGregor, A., Phillips, J.M., Venkatasubramanian, S., and Zhu, Z.Y. Spatial scan
statistics: approximations and performance study. In Proceedings ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 24-33, 2006.
[93] Beckmann, N., Kriegel, H.P., Schneider, R., and Seeger, B. The R∗-tree: an efficient and robust
access method for points and rectangles. In Proceedings ACM SIGMOD International Conference
on Management of Data, pages 322–331, 1990.
[94] Militino, A.F., Palacios, M.B., and Ugarte, M.D. Robust trend parameters in a multivariate
spatial linear model. Test, vol. 12, pages 101–113, 2003.
[95] Militino, A.F. and Ugarte, M.D. Assessing the covariance function in geostatistics. In Statistics
Probability Letter, vol. 52, pages 199–206, 2001.
[96] Hastie, T. and Tibshirani, R. Discriminant analysis by Gaussian mixtures. In Journal of the
Royal Statistical Society, Series B, vol. 58, pages 155–176, 1996.
[97] Schulmeister, B. and Wysotzki, F. Assessing the covariance function in geostatistics. In Machine
Learning and Statistics: The Interface, John Wiley and Sons, New York, pages 133–151, 1997.
[98] Lu, B.L. and Ito, M. Task decomposition and module combination based on class relations:
a modular neural network for pattern classification. In IEEE Transaction on Neural Networks,
10(5), 1999.
[99] Kim, T.K. and Kittler, J. Locally Linear Discriminant Analysis for Multimodally Distributed
Classes for Face Recognition with a Single Model Image. In IEEE Transaction on Pattern Analysis
and Machine Intelligence, vol. 27(3), pages 318–327, 2005.
[100] Zhu, M.L. and Martinez, A.M. Subclass Discriminant Analysis. In IEEE Transaction on
Pattern Analysis and Machine Intelligence, vol. 28(8), pages 1274–1286, 2006.
[101] Wu, J., Xiong, H., Wu, P., and Chen, J. Local Decomposition for Rare Class Analysis. In
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD), pages 814–823, 2007.
[102] Geibel, P., Brefeld, U., and Wysotzki, F. Perceptron and SVM learning with generalized cost
models. In Journal of Intelligent Data Analysis, vol. 8(5), pages 439-455, 2004.
[103] Lu, B.L., Wang, K.A., Utiyama, M., and Isahara, H. A part-versus-part method for massively
parallel training of support vector machines. In Proceedings of International Joint Conference on
Neural Networks (IJCNN), vol. 1, pages 735–740, 2004.
[104] Cheng, H.B., Tan, P.N., and Jin, R. Localized Support Vector Machine and Its Efficient
Algorithm. In Proceedings of the Seventh SIAM International Conference on Data Mining, 2007.
[105] Hastie, T. and Tibshirani, R. Classification by pairwise coupling. In The Annals of Statistics,
vol. 26(1), pages 451–471, 1998.
[106] Wu, T.F., Lin, C.J., and Weng, R.C. Probability Estimates for Multi-class Classification by
Pairwise Coupling. In The Journal of Machine Learning Research, vol. 5, pages 975–1005, 2004.
[107] Friedman, J.H. On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. In Journal of
Data Mining and Knowledge Discovery, vol. 1(1), pages 55–77, 1997.
[108] Fraley, C. and Raftery, A.E. Model-based clustering, discriminant analysis, and density
estimation. In Journal of the American Statistical Association, vol. 97, pages 611–631, 2002.
[109] Bashir, S. and Carter, E.M. High breakdown mixture discriminant analysis. In Journal of
Multivariate Analysis, vol. 93, pages 102–111, 2005.
[110] Theodoridis, S. and Mavroforakis, M. Reduced Convex Hulls: A Geometric Approach to
Support Vector Machines. In IEEE Signal Processing Magazine, vol. 24(3), pages 119–122, 2007.
[111] Vapnik, V.N. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[112] Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer, 2001.
[113] Aurenhammer, F. Voronoi Diagrams: A Survey of a Fundamental Geometric Data Structure.
In ACM Computing Surveys, vol. 23, pages 345–405, 1991.
[114] Newman, D., Hettich, S., Blake, C., and Merz, C. UCI repository of machine learning databases,
1998.
[115] Chang, C.C. and Lin, C.J. LIBSVM: a library for support vector machines, 2001.
[116] Brazdil, P. and Gama, J. Statlog datasets. http://www.liacc.up.pt/ML/statlog/datasets.html.
[117] Neal, R.M. Delve datasets. http://www.cs.utoronto.ca/ delve/data/datasets.html.
[118] Aggarwal, C.C. Redesigning distance functions and distance-based applications for high di-
mensional data. SIGMOD Record, 30(1), March 2001.
[119] Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., and Park, J.S. Fast algorithms for
projected clustering. In Proceedings of the 1999 ACM SIGMOD International Conference on
Management of Data, pages 61–72, Philadelphia, Pennsylvania, United States, June 1-3 1999.
[120] Aggarwal, C.C. and Yu, P.S. Outlier detection for high dimensional data. In Proceedings of
the 2001 ACM SIGMOD International Conference on Management of Data, pages 37–46, Santa
Barbara, California, United States, May 2001.
[121] Barnett, V. and Lewis, T. Outliers in Statistical Data. John Wiley, New York, 1994.
[122] Berchtold, S., Böhm, C., and Kriegel, H.-P. The pyramid-technique: Towards breaking the
curse of dimensionality. In Proceedings of the 1998 ACM SIGMOD International Conference on
Management of Data, pages 142–153, Seattle, Washington, United States, June 1998.
[123] Breunig, M.M., Kriegel, H.-P., Ng, R.T., and Sander, J. LOF: Identifying density-based local
outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of
Data, pages 93–104, Dallas, Texas, United States, May 14-19 2000.
[124] Cerioli, A. and Riani, M. The ordering of spatial data and the detection of multiple outliers.
Journal of Computational and Graphical Statistics, 8(2):239–258, June 1999.
[125] Chan, P.K., Fan, W., Prodromidis, A.L., and Stolfo, S.J. Distributed data mining in credit
card fraud detection. IEEE Intelligent Systems, 14(6):67–74, 1999.
[126] Chan, W.S. and Liu, W.N. Diagnosing shocks in stock markets of Southeast Asia, Australia,
and New Zealand. Mathematics and Computers in Simulation, 59(1-3):223–232, 2002.
[127] Conci, A. and Proenca, C.B. A system for real-time fabric inspection and industrial decision.
In Proceedings of the 14th International Conference on Software Engineering and Knowledge En-
gineering, pages 707–714, Ischia, Italy, July 15-19 2002.
[128] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. A density-based algorithm for discovering
clusters in large spatial databases with noise. In the Second International Conference on Knowledge
Discovery and Data Mining, pages 226–231, Portland, Oregon, United States, August 2-4 1996.
[129] Guttman, I. Linear Models: An Introduction. John Wiley, New York, 1982.
[130] Haining, R. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge
University Press, 1993.
[131] Haslett, J., Bradley, R., Craig, P., Unwin, A., and Wills, G. Dynamic Graphics for Ex-
ploring Spatial Data With Application to Locating Global and Local Anomalies. The American
Statistician, 45:234–242, 1991.
[132] Hinneburg, A., Aggarwal, C.C., and Keim, D.A. What is the nearest neighbor in high dimen-
sional spaces? In Proceedings of 26th International Conference on Very Large Data Bases, pages
506–515, Cairo, Egypt, September 10-14 2000.
[133] Jin, W., Tung, A.K.H., and Han, J. Mining top-n local outliers in large databases. In Proceed-
ings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 293–298, San Francisco, California, United States, August 26-29 2001.
[134] Knorr, E.M. and Ng, R.T. Algorithms for mining distance-based outliers in large datasets. In
Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403, New
York City, NY, United States, August 24-27 1998.
[135] Liu, H., Jezek, K.C., and O'Kelly, M.E. Detecting outliers in irregularly distributed spatial
data sets by locally adaptive and robust statistical analysis and GIS. International Journal of
Geographical Information Science, 15(8):721–741, 2001.
[136] Lu, C.T., Chen, D., and Kou, Y. Detecting spatial outliers with multiple attributes. In
Proceedings of the 15th International Conference on Tools with Artificial Intelligence, pages 122–
128, Sacramento, California, United States, November 3-5 2003.
[137] Lu, C.T., Chen, D., and Kou, Y. Algorithms for spatial outlier detection. In Proceedings of
the Third IEEE International Conference on Data Mining, pages 597–600, Melbourne, Florida,
United States, November 19-22 2003.
[138] Lu, C.T. and Liang, L. R. Wavelet fuzzy classification for detecting and tracking region
outliers in meteorological data. In Proceedings of the 12th Annual ACM International Workshop
on Geographic Information Systems, pages 258–265, Washington DC, United States, November
12-13 2004.
[139] Anselin, L. Local indicators of spatial association: LISA. Geographical Analysis, 27(2):93–115,
1995.
[140] Mkhadri, A. Shrinkage parameter for the modified linear discriminant analysis. Pattern
Recognition Letters, 16(3):267–275, 1995.
[141] Ng, R. T. and Han, J. Efficient and effective clustering methods for spatial data mining.
In Proceedings of the 20th International Conference on Very Large Data Bases, pages 144–155,
Santiago de Chile, Chile, September 12-15 1994.
[142] Pannatier, Y. VARIOWIN: Software for Spatial Data Analysis in 2D. Springer-Verlag, New
York, 1996.
[143] Prastawa, M., Bullitt, E., Ho, S., and Gerig, G. A brain tumor segmentation framework based
on outlier detection. Medical Image Analysis, 9(5):457–466, 2004.
[144] Preparata, F. P. and Shamos, M. I. Computational Geometry - An Introduction. Springer,
1985.
[145] Ramaswamy, S., Rastogi, R., and Shim, K. Efficient algorithms for mining outliers from large
data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data, volume 29, pages 427–438, Dallas, Texas, United States, May 16-18 2000.
[146] Ruts, I. and Rousseeuw, P. J. Computing depth contours of bivariate point clouds. Computa-
tional Statistics and Data Analysis, 23(1):153–168, 1996.
[147] Shekhar, S. and Chawla, S. A Tour of Spatial Databases. Prentice Hall, 2002.
[148] Shekhar, S., Lu, C., and Zhang, P. A unified approach to detecting spatial outliers. GeoInfor-
matica, 7(2):139–166, 2003.
[149] Shekhar, S., Lu, C.T., and Zhang, P. Detecting graph-based spatial outliers: algorithms and
applications (a summary of results). In Proceedings of the Seventh ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 371–376, San Francisco, California,
United States, August 26-29 2001.
[150] Tipping, M.E. and Bishop, C.M. Mixtures of probabilistic principal component analysers.
Neural Computation, 11(2):443–482, 1999.
[151] Tobler, W. Cellular geography. In Philosophy in Geography, pages 379–386, Dordrecht, Hol-
land, 1979. Dordrecht Reidel Publishing Company.
[152] Wong, W.-K., Moore, A., Cooper, G., and Wagner, M. Rule-based anomaly pattern detection
for detecting disease outbreaks. In the Eighteenth National Conference on Artificial Intelligence,
pages 217–223, Edmonton, Alberta, Canada, July 28 - August 1 2002.
[153] Xu, L. Bayesian ying-yang machine, clustering and number of clusters. Pattern Recognition
Letters, 18(11-13):1167–1178, 1997.
[154] Yamanishi, K., Takeuchi, J.-I., Williams, G., and Milne, P. On-line unsupervised outlier
detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge
Discovery, 8(3):275–300, 2004.
[155] Zanero, S. and Savaresi, S. M. Unsupervised learning techniques for an intrusion detection
system. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 412–419,
Nicosia, Cyprus, March 14-17 2004.
[156] Zhang, T., Ramakrishnan, R., and Livny, M. BIRCH: an efficient data clustering method for
very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on
Management of Data, pages 103–114, Montreal, Quebec, Canada, June 4-6 1996.
[157] Zhao, J., Lu, C.-T., and Kou, Y. Detecting region outliers in meteorological data. In Proceed-
ings of the 11th ACM international Symposium on Advances in Geographic Information Systems,
pages 49–55, New Orleans, Louisiana, United States, 2003.
[158] Hardin, J. and Rocke, D.M. The Distribution of Robust Distances. In Journal of Computa-
tional and Graphical Statistics, pages 928–946, 2005.
[159] Goldberg, Y., Zakai, A., Kushnir, D., and Ritov, Y. Manifold Learning: The Price of Normal-
ization. In Journal of Machine Learning Research, vol. 9, pages 1909–1939, 2008.
[160] Belkin, M., Niyogi, P., and Sindhwani, V. Manifold Regularization: a Geometric Framework
for Learning from Labeled and Unlabeled Examples. In Journal of Machine Learning Research,
vol. 7, pages 2399–2434, 2006.
[161] Belkin, M. and Niyogi, P. Laplacian Eigenmaps for Dimensionality Reduction and Data
Representation. In Neural Computation, vol. 15, issue 6, pages 1373–1396, 2001.
[162] Liu, X.T., Lu, C.T., and Chen, F. Spatial Outlier Detection: Random Walk Based Approaches.
In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic
Information Systems (ACM SIGSPATIAL GIS), San Jose, California, November 2-5, 2010.
[163] K. Arrigo, G. Dijken, and S. Bushinsky, “Primary production in the southern ocean, 1997-
2006,” Journal of Geophysical Research, no. 113:C08004, 2008.
[164] C. Park, W. Bridewell, and P. Langley, “Integrated systems for inducing spatio-temporal
process models,” in AAAI, M. Fox and D. Poole, Eds., AAAI Press, 2010.
[165] N. Cressie and C. Wikle, Statistics for Spatio-Temporal Data. Wiley, 2011. ISBN 978-
0471692744.
[166] T. Shi and N. Cressie, “Global statistical analysis of MISR aerosol data: A massive data
product from NASA’s Terra satellite,” Environmetrics, vol. 18, pp. 665–680, 2007.
[167] H.P. Cao, N. Mamoulis, and D.W. Cheung, “Discovery of periodic patterns in spatiotemporal
sequences,” IEEE Trans. on Know. and Data. Eng. (TKDE), vol. 19, no. 4, pp. 453–467, 2007.
[168] M. Celik, S. Shekhar, J.P. Rogers, and J.A. Shine, “Mixed-drove spatiotemporal co-occurrence
pattern mining,” IEEE Trans. on Know. and Data. Eng. (TKDE), vol. 20, no. 10, pp. 1322–1335,
2008.
[169] Y. Chen, K. Chen, and M. A. Nascimento, “Effective and efficient shape-based pattern detec-
tion over streaming time series,” IEEE Trans. on Know. and Data. Eng. (TKDE), vol. 24, no. 2,
pp. 265–278, Feb. 2012.
[170] Y. Huang, L. Zhang, and P.H. Zhang, “A Framework for Mining sequential patterns from
spatio-temporal event data sets,” IEEE Trans. on Know. and Data. Eng. (TKDE), vol. 20, no. 4,
pp. 433–448, 2008.
[171] J. Oh and K.D. Kang, “A Predictive-Reactive Method for improving the robustness of real-time
data services,” IEEE Trans. on Know. and Data. Eng. (TKDE), to appear, March 2012.
[172] C. Tang and A. Zhang, “Cluster analysis for gene expression data: a survey,” IEEE Trans. on
Know. and Data. Eng. (TKDE), vol. 6, no. 11, pp. 1370–1386, 2004.
[173] J. Abernethy, T. Evgeniou, O. Toubia, and J.P. Vert, “Eliciting consumer preferences using
robust adaptive choice questionnaires,” IEEE Trans. on Know. and Data. Eng. (TKDE), vol. 2,
no. 2, pp. 145–155, 2007.
[174] P.-N. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, and A. Torregrosa, “Finding
spatio-temporal patterns in earth science data,” Proc. KDD Workshop Temporal Data Mining,
2001.
[175] H. Yang, S. Parthasarathy, and S. Mehta, “A generalized framework for mining spatio-temporal
patterns in scientific data,” KDD, pp. 716–721, 2005.
[176] V. Malbasa and S. Vucetic, “Spatially regularized logistic regression for disease mapping on
large moving populations.” KDD, pp. 1352–1360, 2011.
[177] W. Liu, Y. Zheng, S. Chawla, J. Yuan, and X. Xing, “Discovering spatio-temporal causal
interactions in traffic data streams,” KDD, pp. 1010–1018, 2011.
[178] A. Aravindakshan, K. Peters, and P. A. Naik, “Spatiotemporal allocation of advertising bud-
gets,” Journal of Marketing Research, vol. 49, no. 1, pp. 1–14, 2012.
[179] X. Du, R. Jin, L. Ding, V. E. Lee, and J. H. T. Jr., “Migration motif: a spatial - temporal
pattern mining approach for financial markets,” KDD, 2009, pp. 1135–1144.
[180] M. Katzfuss and N. Cressie, “Spatio-temporal smoothing and EM estimation for massive
remote-sensing data sets,” Journal of Time Series Analysis, vol. 32, no. 4, pp. 430–446, 2010.
[181] N. Cressie and C. Wikle, “Fixed rank filtering for spatial-temporal data,” Journal of Compu-
tational and Graphical Statistics, vol. 19, no. 3, pp. 724–745, 2010.
[182] R.E. Kalman, “A new approach to linear filtering and prediction problems,” Trans. of the
ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[183] B. Anderson, Adaptive Control. Oxford: Pergamon Press, 1984.
[184] H. Huang and N. Cressie, “Spatio-temporal prediction of snow water equivalent using the
Kalman filter,” Computational Statistics and Data Analysis, vol. 22, pp. 159–175, 1996.
[185] K. Mardia, C. Goodall, E. Redfern, and F. Alonso, “The Kriged Kalman filter,” Environmental
and Ecological Statistics, vol. 14, pp. 5–25, 1998.
[186] C. Wikle and N. Cressie, “A dimension-reduced approach to space-time Kalman filtering,”
Biometrika, vol. 86, pp. 815–829, 1999.
[187] N. Cressie and C. Wikle, “Space-time Kalman filter,” Encyclopedia of Environmetrics, vol. 4,
pp. 2045–2049, 2002.
[188] G. Johannesson, N. Cressie, and H. Huang, “Dynamic multi-resolution spatial models,”
Environmental and Ecological Statistics, vol. 14, pp. 5–25, 2007.
[189] S. Ghosh, P. Bhave, J. Davis, and H. Lee, “Spatio-temporal analysis of total nitrate concen-
trations using dynamic statistical models,” Journal of the American Statistical Association, vol.
105, pp. 538–551, 2010.
[190] H. Lopes, E. Salazar, and D. Gamerman, “Spatial dynamic factor analysis,” Bayesian Analysis,
vol. 3, pp. 759–792, 2009.
[191] J. Luttinen and A. Ilin, “Variational Gaussian-process factor analysis for modeling spatio-
temporal data,” NIPS, pp. 1177–1185, 2009.
[192] V. Berrocal, A. Gelfand, and D. Holland, “A spatio-temporal downscaler for output from
numerical models,” Journal of Agricultural, Biological, and Environmental Statistics, vol. 15, pp.
176–197, 2010.
[193] V. J. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence
Review, vol. 22, no. 2, pp. 85–126, 2004.
[194] P. Jylanki, J. Vanhatalo, and A. Vehtari, “Gaussian process regression with a Student-t likeli-
hood,” Journal of Machine Learning Research, vol. 12, pp. 3227–3257, 2011.
[195] S. Rosset, “Robust boosting and its relation to bagging,” KDD, pp. 249–255, 2005.
[196] R. Maronna, R. Martin, and V. Yohai, Robust Statistics: Theory and Methods. John Wiley &
Sons, Ltd, 2006.
[197] J. Durbin and S. J. Koopman, “Monte Carlo maximum likelihood estimation for non-Gaussian
state space models,” Biometrika, vol. 84, pp. 669–684, 1997.
[198] W. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,”
Biometrika, vol. 57, pp. 97–109, 1970.
[199] B. Jungbacker and S. J. Koopman, “Monte Carlo estimation for nonlinear non-Gaussian state
space models,” Biometrika, vol. 94, pp. 827–839, 2007.
[200] Y. Ruan and P. Willett, “Practical fusion of quantized measurements via particle filtering,”
Proc. IEEE Aerosp. Conf., pp. 1967–1978, 2003.
[201] O. Bar-Shalom and A. J. Weiss, “DOA estimation using one-bit quantized measurements,”
IEEE Trans. Aerosp. Electron. Syst., vol. 38, no. 3, pp. 868–884, 2002.
[202] N. M. Blachman, Noise and its Effect on Communication. New York: McGraw-Hill, 1966.
[203] M. Svensén and C. M. Bishop, “Robust Bayesian mixture modelling,” Neurocomputing, vol. 64,
pp. 235–252, 2005.
[204] M. A. Gandhi and L. Mili, “Robust Kalman filter based on a generalized maximum-likelihood-
type estimator.” IEEE Trans. on Signal Processing, vol. 58, no. 5, pp. 2509–2520, 2010.
[205] A. Y. Aravkin, B. M. Bell, J. V. Burke, and G. Pillonetto, “An l1 -Laplace robust Kalman
smoother.” IEEE Trans. Automat. Contr., vol. 56, no. 12, pp. 2898–2911, 2011.
[206] F. Chen, Y. Chen, C.-T. Lu, and Y.-J. Wu, “Robust fixed rank prediction for large spatio-
temporal data,” Technical Report, 2012. http://filebox.vt.edu/users/chenf/rfrstp-techrpt.pdf
code: http://filebox.vt.edu/users/chenf/rfrstp-package.zip
[207] D. Nychka, C. Wikle, and J. Royle, “Multiresolution models for nonstationary spatial covari-
ance functions,” Statistical Modeling, vol. 2, pp. 315–331, 2002.
[208] Y.-J. Wu, F. Chen, C. Lu, B. Smith, and Y. Chen, “Traffic flow prediction for urban network
using spatio-temporal random effects model,” 91st Annual Meeting of the Transportation Research
Board (TRB), 2012.
[209] Charu C. Aggarwal. Redesigning Distance Functions and Distance-Based Applications for
High Dimensional Data. SIGMOD Record, 30(1), March 2001.
[210] Charu C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceed-
ings of the 2003 ACM SIGMOD international conference on Management of data, pages 575–586.
ACM Press, 2003.
[211] Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo
Park. Fast algorithms for projected clustering. In SIGMOD 1999, Proceedings ACM SIGMOD
International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania,
USA, pages 61–72. ACM Press, 1999.
[212] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. In Proceed-
ings of the 2001 ACM SIGMOD International Conference on Management of Data, volume 30.
ACM, 2001.
[213] Takeshi Saitoh, Tomoyuki Osaki, Ryosuke Konishi, and Kazunori Sugahara. Current sensor
based home appliance and state of appliance recognition. SICE Journal of Control, Measurement,
and System Integration, 3(2):086–093, 2010.
[214] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury. A survey on wireless multimedia sensor
networks. Computer Netw., 51(4):921–960, 2007.
[215] Mario Berges, Ethan Goldman, H Scott Matthews, and Lucio Soibelman. Learning systems
for electric consumption of buildings. Computing in Civil Engineering, 143(1):1–10, 2009.
[216] Mario E. Berges, Ethan Goldman, H. Scott Matthews, and Lucio Soibelman. Enhancing
electricity audits in residential buildings with nonintrusive load monitoring. Journal of Industrial
Ecology, 14(5):844–858, 2010.
[217] Havard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent
Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Sta-
tistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
[218] Havard Rue and Leonhard Held. Gaussian Markov Random Fields: Theory and Applications.
Monographs on Statistics and Applied Probability, 2005.
[219] Thomas P. Minka. Expectation Propagation for approximate Bayesian inference. UAI, pages
362–369, 2001.
[220] V. Berrocal, A.E. Gelfand, and D.M. Holland. A spatio-temporal downscaler for output from
numerical models. Journal of Agricultural, Biological, and Environmental Statistics, 15:176–197,
2010.
[221] Christopher M. Bishop and Markus Svensén. Robust Bayesian mixture modelling. Neurocom-
puting, 64:235–252, 2005.
[222] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. Online, 2001.
[223] Sotirios P. Chatzis and Gabriel Tsechpenakis. The infinite hidden Markov random field model.
Trans. Neur. Netw., 21:1004–1014, June 2010.
[224] S. Chawla, S. Shekhar, W-L Wu, and U. Ozesmi. Modelling spatial dependencies for mining
geospatial data: An introduction. In Harvey Miller and Jiawei Han, editors, Geographic data
mining and Knowledge Discovery (GKD), 1999.
[225] Yingying Chen, Wade Trappe, and Richard P. Martin. Detecting and localizing wireless
spoofing attacks. In Proceedings of the Fourth Annual IEEE Communications Society Conference
on Sensor, Mesh and Ad Hoc Communications and Networks, SECON 2007, Merged with IEEE
International Workshop on Wireless Ad-hoc and Sensor Networks (IWWAN), June 18-21, 2007,
San Diego, pages 193–202. IEEE, 2007.
[226] Yueguo Chen, Ke Chen, and Mario A. Nascimento. Effective and efficient shape-based pattern
detection over streaming time series. IEEE Trans. on Knowl. and Data Eng., 24(2):265–278,
February 2012.
[227] N. Cressie and C.K. Wikle. Space-time Kalman filter. Encyclopedia of Environmetrics, 4:2045–
2049, 2002.
[228] N. Cressie and C.K. Wikle. Fixed rank filtering for spatial-temporal data. Journal of Compu-
tational and Graphical Statistics, 19(3):724–745, 2010.
[229] N. Cressie and C.K. Wikle. Statistics for Spatio-Temporal Data. Wiley, 2011. ISBN 978-
0471692744.
[230] P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and
Data Mining, pages 71–80, 2000.
[231] John R. Douceur. The Sybil attack. In Revised Papers from the First International Workshop
on Peer-to-Peer Systems, IPTPS ’01, pages 251–260, London, UK, 2002. Springer-Verlag.
[232] Xiaoxi Du, Ruoming Jin, Liang Ding, Victor E. Lee, and John H. Thornton Jr. Migration
motif: a spatial - temporal pattern mining approach for financial markets. In KDD, pages 1135–
1144, 2009.
[233] Dubuque2.0. Inspiring sustainability, 2010.
[234] J. Durbin and S. J. Koopman. Monte Carlo maximum likelihood estimation for non-Gaussian
state space models. Biometrika, 84:669–684, 1997.
[235] Daniel B. Faria and David R. Cheriton. Detecting identity-based attacks in wireless networks
using signalprints. In Proceedings of the 2006 ACM Workshop on Wireless Security (WiSe ’06),
pages 43–52. ACM Press, September 2006.
[236] Feng Chen, Yang Chen, Chang-Tien Lu, and Yao-Jan Wu. Robust fixed rank prediction for
large spatio-temporal data. Technical Report, 2012.
[237] James Fogarty, Carolyn Au, and Scott E. Hudson. Sensing from the basement: a feasibility
study of unobtrusive and low-cost home activity recognition. In Proceedings of the 19th annual
ACM symposium on User interface software and technology, UIST ’06, pages 91–100, 2006.
[238] Jon E. Froehlich, Eric Larson, Tim Campbell, Conor Haggerty, James Fogarty, and Shwetak N.
Patel. Hydrosense: infrastructure-mediated single-point sensing of whole-home water activity. In
Proceedings of the 11th international conference on Ubiquitous computing, Ubicomp ’09, pages
235–244, 2009.
[239] M.A. Gandhi and L. Mili. Robust Kalman filter based on a generalized maximum-likelihood-
type estimator. IEEE Transactions on Signal Processing, 58:2509–2520, 2010.
[240] Like Gao and X. Sean Wang. Continually evaluating similarity-based pattern queries on a
streaming time series. In SIGMOD ’02: Proceedings of the 2002 ACM SIGMOD international
conference on Management of data, pages 370–381. ACM Press, 2002.
[241] S.K. Ghosh, P.V. Bhave, J.M. Davis, and H. Lee. Spatio-temporal analysis of total nitrate
concentrations using dynamic statistical models. Journal of the American Statistical Association,
105:538–551, 2010.
[242] Thomer M. Gil and Massimiliano Poletto. MULTOPS: a data-structure for bandwidth attack
detection. In Proceedings of the 10th conference on USENIX Security Symposium - Volume 10,
SSYM’01, pages 3–3, Berkeley, CA, USA, 2001. USENIX Association.
[243] Ryan Gomes, Max Welling, and Pietro Perona. Incremental learning of nonparametric Bayesian
mixture models. In CVPR. IEEE Computer Society, 2008.
[244] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile sum-
maries. In SIGMOD ’01: Proceedings of the 2001 ACM SIGMOD international conference on
Management of data, pages 58–66. ACM Press, 2001.
[245] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In FOCS,
pages 359–366, 2000.
[246] G. W. Hart. Nonintrusive appliance load monitoring. Proceedings of the IEEE, 80(12):1870–
1891, December 1992.
[247] J. Haslett, R. Brandley, P. Craig, A. Unwin, and G. Wills. Dynamic Graphics for Exploring
Spatial Data With Application to Locating Global and Local Anomalies. The American Statisti-
cian, 45:234–242, 1991.
[248] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57:97–109, 1970.
[249] Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. What is the nearest neighbor
in high dimensional spaces? In VLDB 2000, Proceedings of 26th International Conference on
Very Large Data Bases, pages 506–515, 2000.
[250] Victoria J. Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial
Intelligence Review, 22(2):85–126, 2004.
[251] H.C. Huang and N. Cressie. Spatio-temporal prediction of snow water equivalent using the
Kalman filter. Computational Statistics and Data Analysis, 22:159–175, 1996.
[252] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In
KDD, pages 97–106, 2001.
[253] G. Johannesson, N. Cressie, and H.C. Huang. Dynamic multi-resolution spatial models.
Environmental and Ecological Statistics, 14:5–25, 2007.
[254] B. Jungbacker and S. J. Koopman. Monte Carlo estimation for nonlinear non-Gaussian state
space models. Biometrika, 94:827–839, 2007.
[255] P. Jylanki, J. Vanhatalo, and A. Vehtari. Gaussian process regression with a Student-t likeli-
hood. Journal of Machine Learning Research, 12:3227–3257, 2011.
[256] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and
Statistics). Springer-Verlag New York, Inc., 2006.
[257] Rudolph Emil Kalman. A New Approach to Linear Filtering and Prediction Problems.
Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
[258] Chris Karlof and David Wagner. Secure routing in wireless sensor networks: attacks and
countermeasures. Elsevier: Ad Hoc Networks, 1:293–315, 2003.
[259] M. Katzfuss and N. Cressie. Spatio-temporal smoothing and EM estimation for massive
remote-sensing data sets. Journal of Time Series Analysis, 32(4):430–446, 2010.
[260] Jonghyun Kim, Vinay Sridhara, and Stephan Bohacek. Realistic mobility simulation of urban
mesh networks. Ad Hoc Netw., 7:411–430, March 2009.
[261] Younghun Kim, Thomas Schmid, Zainul M. Charbiwala, Jonathan Friedman, and Mani B.
Srivastava. Nawms: nonintrusive autonomous water monitoring system. In Proceedings of the 6th
ACM conference on Embedded network sensor systems, SenSys ’08, pages 309–322, 2008.
[262] E. Knorr and R. Ng. Algorithms for mining distance based outliers in large datasets. In
Proceedings of the 24th VLDB Conference, 1998.
[263] K. Koperski, J. Adhikary, and J. Han. Spatial data mining: Progress and challenges. In
Workshop on Research Issues on Data Mining and Knowledge Discovery(DMKD’96), pages 1–10,
Montreal, Canada, 1996.
[264] K. Koperski and J. Han. Discovery of spatial association rules in geographic information
databases. In Advances in Spatial Databases, Proc. of 4th International Symposium, SSD’95,
pages 47–66, Portland, Maine, USA, 1995.
[265] Kenichi Kurihara, Max Welling, and Nikos A. Vlassis. Accelerated variational dirichlet process
mixtures. In NIPS’06, pages 761–768, 2006.
[266] Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xie Xing. Discovering spatio-temporal
causal interactions in traffic data streams. In KDD, pages 1010–1018, 2011.
[267] H.F. Lopes, E. Salazar, and D. Gamerman. Spatial dynamic factor analysis. Bayesian Analysis,
3:759–792, 2009.
[268] Jaakko Luttinen and Alexander Ilin. Variational Gaussian-process factor analysis for modeling
spatio-temporal data. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta,
editors, Advances in Neural Information Processing Systems 22, pages 1177–1185, 2009.
[269] Vuk Malbasa and Slobodan Vucetic. Spatially regularized logistic regression for disease map-
ping on large moving populations. In KDD, pages 1352–1360, 2011.
[270] K.V. Mardia, C. Goodall, E.J. Redfern, and F.J. Alonso. The Kriged Kalman filter. Environ-
mental and Ecological Statistics, 14:5–25, 1998.
[271] R.A. Maronna, R.D. Martin, and V.J. Yohai. Robust Statistics: Theory and Methods. John
Wiley & Sons, Ltd, 2006.
[272] David Moore, Colleen Shannon, Douglas J. Brown, Geoffrey M. Voelker, and Stefan Savage.
Inferring internet denial-of-service activity. ACM Trans. Comput. Syst., 24:115–139, May 2006.
[273] Hala Najmeddine, Khalil El Khamlichi Drissi, Christophe Pasquier, Claire Faure, Kamal
Kerroum, Thierry Jouannet, Michel Michou, and Alioune Diop. Smart metering by using “Matrix
Pencil”. In Environment and Electrical Engineering (EEEIC), 2010 9th International Conference
on, pages 238–241, May 2010.
[274] Neptune Technology Group. R900 RF Wall or Pit MIU Product Sheet, 2009.
[275] Nam Tuan Nguyen, Guanbo Zheng, Zhu Han, and Rong Zheng. Device fingerprinting to
enhance wireless security using nonparametric Bayesian method. In INFOCOM, pages 1404–1412.
IEEE, 2011.
[276] D. Nychka, C. Wikle, and J.A. Royle. Multiresolution models for nonstationary spatial co-
variance functions. Statistical Modeling, 2:315–331, 2002.
[277] Y. Pannatier. Variowin: Software for Spatial Data Analysis in 2D. New York: Springer-Verlag,
1996.
[278] Chunki Park, Will Bridewell, and Pat Langley. Integrated systems for inducing spatio-temporal
process models. In Maria Fox and David Poole, editors, AAAI. AAAI Press, 2010.
[279] Shwetak N. Patel, Thomas Robertson, Julie A. Kientz, Matthew S. Reynolds, and Gregory D.
Abowd. At the flick of a switch: Detecting and classifying unique electrical events on the residential
power line. In UbiComp, volume 4717 of Lecture Notes in Computer Science, pages 271–288.
Springer, 2007.
[280] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient Algorithms for Mining Outliers from Large
Data Sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data, pages 427–438, 2000.
[281] S. J. Roberts. Novelty detection using extreme value statistics. IEE Proceedings-Vision Image
and Signal Processing, 146(3):124, 1999.
[282] Havard Rue. Fast sampling of Gaussian Markov random fields. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 63(2):325–338, 2001.
[283] I. Ruts and P. Rousseeuw. Computing Depth Contours of Bivariate Point Clouds. Computa-
tional Statistics and Data Analysis, 23:153–168, 1996.
[284] A.G. Ruzzelli, C. Nicolas, A. Schoofs, and G.M.P. O’Hare. Real-time recognition and profiling
of appliances through a single electricity sensor. In Sensor Mesh and Ad Hoc Communications
and Networks (SECON), 2010 7th Annual IEEE Communications Society Conference on, pages
1–9, June 2010.
[285] S. Shekhar and Y. Huang. Co-location Rules Mining: A Summary of Results. In Proc. Spatio-
temporal Symposium on Databases, 2001.
[286] S. Shekhar, C.T. Lu, and P. Zhang. Detecting Graph-Based Spatial Outlier: Algorithms and
Applications (A Summary of Results). In Proc. of the Seventh ACM-SIGKDD Int’l Conference on
Knowledge Discovery and Data Mining, Aug 2001.
[287] Yong Sheng, Keren Tan, Guanling Chen, David Kotz, and Andrew Campbell. Detecting 802.11
MAC layer spoofing using received signal strength. In INFOCOM, pages 1768–1776. IEEE, 2008.
[288] T. Shi and N. Cressie. Global statistical analysis of MISR aerosol data: a massive data product
from NASA’s Terra satellite. Environmetrics, 18:665–680, 2007.
[289] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale
classification. In KDD ’01: Proceedings of the seventh ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 377–382. ACM Press, 2001.
[290] Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Christopher Potter, Steven Klooster, and
Alicia Torregrosa. Finding spatio-temporal patterns in earth science data. Proc. KDD Workshop
Temporal Data Mining, 2001.
[291] Haixun Wang, Wei Fan, Philip S. Yu, and Jiawei Han. Mining concept-drifting data streams
using ensemble classifiers. In Pedro Domingos, Christos Faloutsos, Ted Senator, Hillol Kargupta,
and Lise Getoor, editors, Proceedings of the ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-03), pages 226–235, New York, August 24–27 2003.
ACM Press.
[292] C.K. Wikle and N. Cressie. A dimension-reduced approach to space-time Kalman filtering.
Biometrika, 86:815–829, 1999.
[293] T. Woody. Smart water meters catch on in Iowa. The New York Times, 2010.
[294] Y-J. Wu, F. Chen, C.T. Lu, B. Smith, and Y. Chen. Traffic flow prediction for urban net-
work using spatio-temporal random effects model. In 91st Annual Meeting of the Transportation
Research Board (TRB), 2012.
[295] Hui Yang, Srinivasan Parthasarathy, and Sameep Mehta. A generalized framework for mining
spatio-temporal patterns in scientific data. In KDD, pages 716–721, 2005.
[296] Jie Yang, Yingying Chen, and Wade Trappe. Detecting spoofing attacks in mobile wireless
environments. In SECON, pages 1–9. IEEE, 2009.
[297] Jie Yang, Yingying Chen, Wade Trappe, and Jay Cheng. Determining the number of attackers
and localizing multiple adversaries in wireless spoofing attacks. In INFOCOM, pages 666–674.
IEEE, 2009.
[298] Kai Zeng, Kannan Govindan, Daniel Wu, and Prasant Mohapatra. Identity-based attack
detection in mobile wireless networks. In INFOCOM, pages 1880–1888. IEEE, 2011.
[299] Yao-Jan Wu, Feng Chen, Chang-Tien Lu, Brian Smith, and Yang Chen. Traffic flow estimation
and prediction for urban network using spatial temporal random effects model. In the 91st Annual
Meeting of the Transportation Research Board (TRB). Accepted, 2012.
[300] Xutong Liu, Feng Chen, and Chang-Tien Lu. Fast multivariate spatial categorical outlier
detection based on pair correlations. GeoInformatica. Submitted, 2012.
[301] Xutong Liu, Feng Chen, and Chang-Tien Lu. Approximate inferences for large mix-type
spatio-temporal data. In IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE).
Submitted, 2012.
[302] Arnold P. Boedihardjo, Chang-Tien Lu, and Feng Chen. Fast adaptive kernel density estima-
tors for data streams. In ACM Transactions on Knowledge Discovery from Data (ACM-TKDD).
Submitted, 2012.
[303] Xutong Liu, Feng Chen, and Chang-Tien Lu. Spatial categorical outlier detection: pair cor-
relation function based approach. In Isabel Cruz and Divyakant Agrawal, editors, GIS, pages
465–468. ACM, 2011.
[304] Feng Chen, Jing Dai, Bingsheng Wang, Sambit Sahu, Milind R. Naphade, and Chang-Tien
Lu. Activity analysis based on low sample rate smart meters. In Chid Apte, Joydeep Ghosh, and
Padhraic Smyth, editors, KDD, pages 240–248. ACM, 2011.
[305] Xutong Liu, Chang-Tien Lu, and Feng Chen. Spatial outlier detection: random walk based
approaches. In Divyakant Agrawal, Pusheng Zhang, Amr El Abbadi, and Mohamed F. Mokbel,
editors, GIS, pages 370–379. ACM, 2010.
[306] J. Zico Kolter, Siddharth Batra, and Andrew Y. Ng. Energy disaggregation via discriminative
sparse coding. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S.
Zemel, and Aron Culotta, editors, NIPS, pages 1153–1161. Curran Associates, Inc., 2010.
[307] Jing Dai, Feng Chen, Sambit Sahu, and Milind R. Naphade. Regional behavior change de-
tection via local spatial scan. In Divyakant Agrawal, Pusheng Zhang, Amr El Abbadi, and
Mohamed F. Mokbel, editors, GIS, pages 490–493. ACM, 2010.
[308] Feng Chen, Chang-Tien Lu, and Arnold P. Boedihardjo. GLS-SOD: a generalized local statisti-
cal approach for spatial outlier detection. In Proceedings of the 16th ACM SIGKDD international
conference on Knowledge discovery and data mining, KDD ’10, pages 1069–1078, New York, NY,
USA, 2010.
[309] J. Van Gael, Y. W. Teh, and Z. Ghahramani. The infinite factorial hidden Markov model. In
Advances in Neural Information Processing Systems, volume 21, 2009.
[310] Qifeng Lu, Feng Chen, and Kathleen L. Hancock. On path anomaly detection in a large
transportation network. Journal of Computers, Environment and Urban Systems, 33(6):448–462,
2009.
[311] Chang-Tien Lu, Arnold P. Boedihardjo, Jing Dai, and Feng Chen. Homes: highway operation
monitoring and evaluation system. In Proceedings of the 16th ACM SIGSPATIAL international
conference on Advances in geographic information systems, GIS ’08, pages 85:1–85:2, New York,
NY, USA, 2008. ACM.
[312] Xutong Liu, Chang-Tien Lu, and Feng Chen. An entropy-based method for assessing the
number of spatial outliers. In IRI, pages 244–249. IEEE Systems, Man, and Cybernetics Society,
2008.
[313] Feng Chen, Chang-Tien Lu, and Arnold P. Boedihardjo. On locally linear classification by
pairwise coupling. In Proceedings of the 8th IEEE International Conference on Data Mining
(ICDM 2008), December 15-19, 2008, Pisa, Italy, pages 749–754. IEEE Computer Society, 2008.
[314] Dechang Chen, Chang-Tien Lu, Yufeng Kou, and Feng Chen. On detecting spatial outliers.
GeoInformatica, 12(4):455–475, 2008.
[315] Arnold P. Boedihardjo, Chang-Tien Lu, and Feng Chen. A framework for estimating complex
probability density structures in data streams. In James G. Shanahan, Sihem Amer-Yahia, Ioana
Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury,
editors, CIKM, pages 619–628. ACM, 2008.
[316] Jing Dai, Ming Li, Sambit Sahu, Milind Naphade, and Feng Chen. Multi-granular demand
forecasting in smarter water. In Proceedings of the 13th International Conference on Ubiquitous
Computing (Ubicomp), 2011. Poster Paper.
[317] Yang Chen, Feng Chen, Jing Dai, and T. Charles Clancy. Student-t Based Robust Spatio-
Temporal Prediction. To appear in the IEEE International Conference on Data Mining (IEEE
ICDM), 2012.
[318] Xutong Liu, Feng Chen, and Chang-Tien Lu. Robust Inference and Outlier Detection for Large
Spatial Data Sets. To appear in the IEEE International Conference on Data Mining (IEEE
ICDM), 2012.
[319] Bingsheng Wang, Feng Chen, Haili Dong, Arnold Boedihardjo, and Chang-Tien Lu. Low-
Sample-Rate Water Consumption Disaggregation via Sparse Coding with Extended Discriminative
Dictionary. To appear in the IEEE International Conference on Data Mining (IEEE ICDM), 2012.
[320] C. Varin, G. Høst, and O. Skare. Pairwise likelihood inference in spatial generalized linear
mixed models. Computational Statistics and Data Analysis, 49(4):1173–1191, 2005.
[321] Andrew O. Finley, Sudipto Banerjee, and Bradley P. Carlin. spBayes: An R Package for
Univariate and Multivariate Hierarchical Point-referenced Spatial Models. In Journal of Statistical
Software, 19(4), 2007.
[322] K. Pace and R. Barry. Sparse spatial autoregressions. In Statistics and Probability Letters,
33(3):291–297, 1997.
[323] P. J. Diggle, M. C. Thomson, O. F. Christensen, B. Rowlingson, V. Obsomer, J. Gardon,
et al. Spatial Modelling and Prediction of Loa Loa Risk: Decision Making Under Uncertainty. In
Annals of Tropical Medicine and Parasitology, 101(6):499–509, 2007.
[324] David Harrison and Daniel L. Rubinfeld. Hedonic Housing Prices and the Demand for Clean
Air. In Journal of Environmental Economics and Management, 31:403–405, 1996.
[325] R. A. Dubin. Spatial autocorrelation and neighborhood quality. In Regional Science and Urban
Economics, 22(3):433–452, 1992.
[326] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proceedings
of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,
KDD ’07, pages 220–229, 2007.