Pattern Recognition 108 (2020) 107355
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Abnormal event detection in surveillance videos based on low-rank and compact coefficient dictionary learning
Ang Li a,b, Zhenjiang Miao a,b, Yigang Cen a,b,∗, Xiao-Ping Zhang c, Linna Zhang d, Shiming Chen e
a Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
b Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China
c Department of Electrical, Computer, and Biomedical Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada
d College of Mechanical Engineering, Guizhou University, Guiyang 550025, China
e School of Electrical & Electronic Engineering, East China Jiaotong University, Nanchang 330013, China
Article info

Article history:
Received 30 July 2019
Revised 27 March 2020
Accepted 29 March 2020
Available online 11 July 2020

Keywords:
LRCCDL
Reconstruction cost
Abnormal event detection
Crowded scenes
Surveillance videos

Abstract
In this paper, a novel approach to abnormal event detection in crowded scenes is presented based on a new low-rank and compact coefficient dictionary learning (LRCCDL) algorithm. First, based on the background subtraction and binarization of surveillance videos, we construct a feature space by extracting the histogram of maximal optical flow projection (HMOFP) feature of the foreground from a normal training frame set. Second, in the training stage, a new joint optimization of the nuclear-norm and l2,1-norm is applied to obtain a compact coefficient low-rank dictionary. Third, in the detection stage, l2,1-norm optimization is utilized to obtain the reconstruction coefficient vectors of the testing samples. Note that the l2,1-norm forces the reconstruction coefficient vectors of all the testing samples to compactly surround the center obtained in the training stage, such that the reconstruction errors of abnormal testing samples differ from those of normal ones. Finally, a reconstruction cost (RC) is introduced to detect abnormal frames. Experimental results on both global and local abnormal event detection show the effectiveness of our algorithm. In comparisons with state-of-the-art methods under various criteria, the proposed algorithm achieves comparable detection results.

© 2020 Elsevier Ltd. All rights reserved.
1. Introduction

In recent years, abnormal event detection has become a research hotspot in the fields of computer vision (CV) and pattern recognition (PR). As a result of the reduction of surveillance equipment costs and the significant improvement of public safety awareness, it has become very common for surveillance cameras to be applied in public areas, such as train stations, airports, museums, stadiums, and markets. In most surveillance systems, cameras in public areas are monitored by human operators who closely observe the monitor screens to identify abnormal events. With the amount of surveillance equipment increasing, the demand for accurate, automated anomaly detection methods increases, since manual monitoring is very inefficient. Additionally, spending a long time watching monitors is tedious work. Moreover, the staff cannot always focus their attention on the monitors, so it is easy to miss some anomalous events [1].

Depending on the areas of occurrence of the abnormal behaviors of a crowd, the detection objects can be separated into two main classes, i.e., global abnormal events (GAE) and local abnormal events (LAE). GAE denotes that the whole detection scene is abnormal, and LAE denotes that the abnormal events occur in some local parts of the detection scene.

The crowd usually has a high density in crowded scenes, and traditional crowd analysis algorithms are usually confronted with difficult situations because of the serious overlapping of pedestrians. Depending on the different established models, the crowd video analysis methods include three main classes: (1) microscopic modeling, such as frameworks based on particle filters; (2) macroscopic modeling based on low-level features, such as the spatial-temporal gradient and optical flow; and (3) crowd event detection [2,3]. Based on the developments in related fields, such as data mining (DM), artificial intelligence (AI), computational intelligence (CI), soft computing (SC), image signal processing (ISP), mathematical modeling (MM), CV, and PR, the research on abnormal event detection has shown a positive evolution in the last decade [4]. In particular, researchers have proposed a substantial number of automated techniques for crowd analysis in the CV and PR fields, such as tracking and video concept detection models and models to estimate the density of people and to understand the behaviors of crowds [5-7].

Generally, based on treating the crowd as a single entirety in a specific scene, researchers analyze the motion of the crowd and update the status of the crowd as abnormal or normal depending on the dynamics emanating from the entire crowd. Nevertheless, in conditions where the motion of a crowd is random and the crowd motion pattern is unstructured, the methods proposed for structured crowded scenes, such as [8], show a lack of effectiveness [9]. In addition, despite the considerable developments achieved in the field of human activity analysis, the task of modeling and understanding the behaviors of a crowd remains immature.

Nowadays, the development of low-rank matrix theory is significant and attracts more and more researchers' attention [10-14], and this theory is utilized in our work. In this paper, we propose a novel solution to detect both global and local anomalies in surveillance systems. Our main contributions in this paper are summarized as follows:

(1) To remove the low variations and noise of objects in the background, we extract the motion descriptor of the foreground by integrating background subtraction with binarization of surveillance videos.

(2) In the training stage, to obtain a low-rank dictionary based on the similarity of normal training samples and, at the same time, a compact cluster of reconstruction coefficient vectors surrounding a center, we propose a new joint optimization of the nuclear-norm and l2,1-norm.

(3) In the detection stage, to obtain a large gap between the reconstruction errors of abnormal testing samples and those of normal testing samples, we force the reconstruction coefficient vectors of abnormal frames to distribute so that they resemble those of normal ones by solving an l2,1-norm optimization problem.

The work in this paper is an extension of our previous method published in [15]. The improvements over [15] are as follows: (1) To avoid the extra preprocessing for denoising and to enhance the robustness of our model, we improve the model by adding new terms representing noise in real situations and the related parameters. Moreover, we present the process to solve the optimization problems in detail and analyze the parameters in Algorithms 1 and 2, which is not presented in [15]. (2) Besides the UMN dataset [16], we add the PETS2009 dataset [17] in the experiments to detect anomalies at the global scale, and the UCSD [18] and CUHK Avenue [19] datasets to validate the effectiveness of our new method for local abnormal event detection and localization. (3) We provide a more comprehensive introduction. Also, a new section of related work to elaborate on the previous related works and another new section of problem formulation and motivation to introduce our solution and explain the rationality of our method are added to our paper.

We organize the rest of this paper as follows. Section 2 briefly reviews the related works. Section 3 describes the problem formulation. Section 4 presents the algorithm we propose for anomaly detection in detail. We provide the experimental results of abnormal event detection at both global and local scales and the comparisons with the state-of-the-art methods in Section 5. Finally, some conclusions are presented in Section 6.

∗ Corresponding author.
E-mail addresses: [email protected] (A. Li), [email protected] (Z. Miao), [email protected] (Y. Cen), [email protected] (X.-P. Zhang), [email protected] (L. Zhang), [email protected] (S. Chen).

https://doi.org/10.1016/j.patcog.2020.107355
0031-3203/© 2020 Elsevier Ltd. All rights reserved.
2. Related work

In recent years, many works have been undertaken and much progress has been achieved in the area of video surveillance. Kosmopoulos and Chatzis [20] described a pixel-level model by utilizing holistic visual behavior understanding methods. Mehran et al. [21] and Yen and Wang [22] introduced an anomaly detection method in crowded scenes, which was named the social force model. In the social force model, based on optical flow analysis, individuals were treated as moving particles, and the social force was the interaction force between every two particles. Furthermore, Zhang et al. [23,24] proposed an extended model named the social attribute-aware force model, and Chaker et al. [25] proposed an unsupervised approach for crowd scene anomaly detection and localization using a social network model. Lee et al. [26] devised a motion influence map algorithm to describe human activities and detect abnormal events. Amraee et al. [27] utilized the histogram of oriented gradients (HOG) descriptor and a Gaussian model to detect anomalies. Depending on the spatial pyramid matching kernel (SPM)-based BoW model, Hung et al. [28] extracted the SIFT feature to represent the motion of a crowd. Sandhan et al. [29] proposed an unsupervised learning algorithm for anomaly detection based on the fact that, in general human perception, normal events occur frequently while rarely occurring events are abnormal. By leveraging both labeled and unlabeled segments, Tziakos et al. [30] discovered the projection subspace associated with detectors to tackle the problem that information about abnormal events was not available and the labeled information about normal events was limited. Haque and Murshed [31] presented an algorithm to detect abnormal events without using any motion or tracking feature. Shi et al. [32] proposed a model for abnormal event detection utilizing the developed spatiotemporal co-occurrence Gaussian mixture models (STCOG). Based on the characteristics of the dynamics and density of a crowd, Yin et al. [33] introduced a method to increase the information content by increasing the dimension of the motion feature. Mahadevan et al. [34] leveraged the dynamic texture of the normal behaviors of a crowd to form a mixture model. Singh and Mohan [35] proposed an approach for abnormal activity recognition based on graph formulation of video activities and a graph kernel support vector machine.

Some models that detect abnormal events using the concept of entropy are emerging. Taking advantage of the Gaussian mixture model (GMM) and the particle entropy, Gu et al. [36] presented a method to represent the distribution of the crowd in crowded scenes. Due to the random motion patterns of a crowd in abnormal situations, Lee et al. [37] described a general-purpose human motion analysis (HMA) method based on statistics and entropy.

The low-level feature optical flow can reflect the relative distance of moving objects in a specific scene at two different moments at the pixel level, which is useful and important in anomaly detection for video surveillance. Wang and Snoussi [38] described a global optical flow orientation histogram-based model. Based on the motion feature denoted as the histogram of maximal optical flow projection (HMOFP), Li et al. [15,39-42] proposed models to describe the crowd motion status and detect anomalies in crowded scenes. Patil and Biswa [43] utilized the histogram of the magnitude and orientation of optical flow to capture the motion of a crowd. Furthermore, Colque et al. [44] proposed a similar spatiotemporal motion descriptor named the histogram of optical flow orientation, magnitude and entropy, based on the information of optical flow and entropy. Zhang et al. [45] presented an anomaly detection framework integrating the motion feature in terms of optical flow and appearance cues. Based on the divergence and curl of the optical flow field, Chen and Lai [46] proposed a divergence-curl-driven framework for the perception of crowd motion states.
Recently, redundant dictionary-based sparse representation has attracted ever-increasing attention and differs from most existing anomaly detection methods. By applying sparse subspace clustering, Ren and Moeslund [47] proposed a dictionary learning-based algorithm to detect anomalies in crowded scenes. Cong et al. [48] described a reconstruction model based on a dictionary and utilized the sparse reconstruction cost (SRC) to detect abnormal events. For abnormal event detection in crowded scenes, Yuan et al. [49] optimized a structured dictionary learning framework and sparse representation coefficients through an iterative updating strategy. Moreover, to accomplish the dictionary construction, some dictionary learning methods were presented, such as nonnegative matrix factorization (NMF) [50], the K-SVD algorithm [51,52], latent dictionary learning (LDL) [53], and Fisher discrimination dictionary learning (FDDL) [54].

In recent times, in addition to the hand-crafted feature-based methods above, some researchers have established new frameworks in the field of anomaly detection with the general application of deep learning-based methods. Huang et al. [55] presented a multimodal fusion scheme based on convolutional restricted Boltzmann machines. Ravanbakhsh et al. [56] proposed a plug-and-play convolutional neural network (CNN)-based method for crowd motion analysis. Autoencoder models based on CNNs for abnormal event detection were introduced in [57-60]. Furthermore, an end-to-end deep network based on an autoencoder framework, called a fully convolutional network (FCN), was presented in [61,62]. Liu et al. [63] proposed a new baseline for anomaly detection named future frame prediction. In addition, Wang et al. [64] utilized the extreme learning machine (ELM), a single-layer neural network, to detect and localize abnormal events. For anomaly detection, Sun et al. [65] introduced a neural network-based model called online growing neural gas (online GNG) to perform unsupervised learning.

In these previous works, the hand-crafted feature-based methods can be classified into two types. The first type addresses the representation of the motion descriptor of a crowd. The second type addresses the model to detect whether an event is normal. These methods focused solely on one part of the detection framework; in other words, in the process of modeling, the information of the motion descriptor was not fully exploited. In the deep learning-based methods, most anomaly detection is based on the reconstruction of regular training data. Even though these methods assume that abnormal events would correspond to larger reconstruction errors, due to the good capacity and generalization of a deep neural network this assumption does not necessarily hold. Therefore, the reconstruction errors of normal and abnormal events may be similar, resulting in less discrimination [63]. On the other hand, these models perform extremely well in domains with large amounts of training data. With limited training data, however, they are prone to overfitting. This limitation arises often in the abnormal event detection task, where scarcity of real-world training examples is a major constraint [56].

In this paper, in the process of constructing the detection model, we utilize the characteristics of the training samples, i.e., we propose an algorithm to conduct low-rank dictionary learning based on the similarity of the features of the training data. Note that the amount of training data is far less than that of deep learning-based methods. In addition, we add the l2,1-norm to constrain the reconstruction coefficient vectors to obtain a compact cluster in the training stage. Different from the detection models described in previous work, in the detection stage we force the reconstruction coefficient vectors of all the testing samples to have a distribution similar to that of the training samples. Thus, abnormal samples have large reconstruction errors, and we can detect anomalies by the value of the reconstruction cost.
3. Problem formulation and motivation

Considering that anomaly detection is applied in different scenes, we define the abnormal event detection problem as follows. Note that in this paper a sample denotes the motion feature of an original frame in global abnormal event detection, or of one patch of a frame in local abnormal event detection. To state the problem conveniently, a sample denotes the motion feature vector of a frame in this section.

Assume that we have a training frame set denoted as F = [f_1, f_2, ..., f_{N0}], where N0 denotes the number of training frames. The corresponding training sample set is denoted as H = [H_1, H_2, ..., H_{N0}], where H_i ∈ R^M denotes the motion feature vector describing a normal training sample and M is the dimension of the motion feature. Suppose that we have a testing frame f_t, where the subscript "t" denotes "testing". To obtain the detection result of f_t, we should design a discrimination function as follows:

f : f_t → {normal, abnormal}   (1)

Based on our previous works in [15,40-42], this can be realized by sparse representation with an overcomplete dictionary D ∈ R^{M×K}.

Suppose that the motion feature of f_t is H_t, and its sparse representation coefficient vector over D is z_t. Then, H_t can be reconstructed by Ĥ_t = D z_t. In general, the reconstruction cost can be used to determine whether a testing sample is abnormal or not. It is usually expressed by the reconstruction error, i.e., ‖Ĥ_t − H_t‖_2 = ‖H_t − D z_t‖_2. If the reconstruction cost is larger than a threshold value, f_t is detected as an abnormal frame; otherwise, it is a normal frame.

As we know, the distribution of the coefficient vectors of abnormal testing samples is different from that of normal testing samples, which have small reconstruction errors. To enlarge the gap between the reconstruction errors of abnormal testing samples and those of normal testing samples, i.e., to obtain larger reconstruction errors when testing samples are abnormal, we can force the reconstruction coefficient vectors of abnormal frames to distribute similarly to those of normal ones by solving an l2,1-norm optimization problem. Thus, we can adopt the value of the reconstruction cost as an abnormal event measurement to tackle the problem of binary classification.

More concretely, for an abnormal sample, when we calculate the reconstruction coefficient vectors over the dictionary trained on normal samples, we can force the obtained coefficient vectors to be closer to the center of the normal samples' coefficient vectors by some special constraint conditions. In fact, according to the nature of sparse representation, once a dictionary is trained, the dictionary will represent any sample (normal or abnormal) as well as it can, no matter what kind of sample is input. The difference is just the value of the reconstruction error. For an abnormal sample, the reconstruction error based on a dictionary trained only on normal samples will be large. If we now force the sparse representation coefficient vector of the abnormal sample to be closer to the center of the normal samples' coefficient vectors, this causes a severe distortion of the reconstructed sample. As a result, the reconstruction error of the abnormal sample will be even larger. In this way, the accuracy of abnormal event detection can be improved.
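This decision rule can be sketched in a few lines. The following is a minimal illustration rather than the paper's implementation: `reconstruction_coefficients` is a hypothetical stand-in for the l2,1-norm coding step of Section 4, approximated here by ridge-regularized least squares, and the threshold would in practice be tuned on held-out normal data.

```python
import numpy as np

def reconstruction_coefficients(D, h, reg=0.1):
    # Stand-in coding step: ridge-regularized least squares.
    # The paper instead solves an l2,1-norm optimization problem.
    K = D.shape[1]
    return np.linalg.solve(D.T @ D + reg * np.eye(K), D.T @ h)

def is_abnormal(D, h, threshold):
    # Reconstruction cost ||h - D z||_2 compared against a threshold.
    z = reconstruction_coefficients(D, h)
    rc = np.linalg.norm(h - D @ z)
    return rc > threshold
```

A frame whose motion feature is well reconstructed by the normal-event dictionary yields a small cost and is labeled normal; a poorly reconstructed frame is flagged as abnormal.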
4. Proposed method

4.1. Motion feature extraction

The motion information of any two consecutive frames can be reflected by the optical flow field, which describes the directions and amplitudes of the moving objects in a scene. In our paper, the Horn-Schunck (HS) method is adopted to obtain the optical flow
field of frame images. As shown in Fig. 1(a) and (b), we first obtain the optical flow field of the original frame. At the same time, background subtraction with the method of nonparametric kernel density estimation [66] and a binarization operation are applied to the original frame to obtain the corresponding binary frame. Then, the amplitude of the optical flow vector at each pixel is modified according to the binary frame, i.e., if the corresponding pixel's gray value in the binary frame is 255, the optical flow vector remains unchanged; otherwise, the optical flow vector is set to be a zero vector. This processing can eliminate the influence of low variations and noise from the background, such as the optical flow vectors caused by the change of illumination in the background areas. Based on the optical flow field, we use the optimized HMOFP [15,39-42] as the motion feature descriptor, which is computed from the binary frame. The whole process to obtain the HMOFP is illustrated in Fig. 1(c).

Fig. 1. (a) The original frame. (b) The corresponding binary frame after the process of background subtraction. (c) The process to compute the HMOFP feature.
As shown in Fig. 2, we introduce two spatial bases, Type A [48] and Type B, for the detection of global and local abnormal events, respectively. The relationship between Type A and Type B is also shown in Fig. 2. The spatial basis Type A is chosen to represent the global motion feature of a frame, and we obtain m1 × n1 patches of the frame. We compute the HMOFP feature of each patch and concatenate the m1 × n1 feature vectors to construct the total HMOFP of the whole frame. For LAE, the abnormal detection is based on the image patches of a frame. Similar to the detection of GAE, a patch is divided into m2 × n2 cells, and the way to extract the HMOFP feature based on this basis is the same as that for the Type A basis. In our framework, when we deal with local abnormal event detection, we treat a patch as a frame of small size. In other words, we convert the local abnormal event detection to a global-scale problem. Note that in the detection of LAE, we only consider the patches in the foreground, i.e., those corresponding to locations where the pixels' gray values are 255.
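The masking and patch-wise feature extraction above can be sketched as follows. This is an illustrative approximation, not the exact descriptor: the true HMOFP is defined in [15,39-42], and here a magnitude-weighted orientation histogram per patch stands in for it, with arbitrary grid sizes m1, n1.

```python
import numpy as np

def masked_flow(flow, binary_mask):
    # Zero out optical flow vectors on background pixels; keep flow
    # only where the binary frame's gray value is 255 (foreground).
    return flow * (binary_mask == 255)[..., None]

def patch_orientation_histogram(flow_patch, n_bins=8):
    # Stand-in for the HMOFP of one cell: a magnitude-weighted histogram
    # of flow orientations (the actual HMOFP uses maximal projections
    # onto bin directions; see [15,39-42]).
    u, v = flow_patch[..., 0].ravel(), flow_patch[..., 1].ravel()
    mag = np.hypot(u, v)
    ang = np.mod(np.arctan2(v, u), 2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    return np.bincount(bins, weights=mag, minlength=n_bins)

def frame_feature(flow, binary_mask, m1=4, n1=4, n_bins=8):
    # Split the masked flow field into an m1 x n1 grid of patches and
    # concatenate the per-patch histograms into one frame descriptor.
    flow = masked_flow(flow, binary_mask)
    H, W = flow.shape[:2]
    feats = []
    for rows in np.array_split(np.arange(H), m1):
        for cols in np.array_split(np.arange(W), n1):
            patch = flow[np.ix_(rows, cols)]
            feats.append(patch_orientation_histogram(patch, n_bins))
    return np.concatenate(feats)
```

For LAE, the same routine would simply be applied to each foreground patch (treated as a small frame) with an m2 × n2 cell grid.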
4.2. Anomaly detection based on LRCCDL

4.2.1. Training stage

Consider the initial training data set

TR = [tr_1, tr_2, ..., tr_{N0}]   (2)
where tr_i (1 ≤ i ≤ N0) denotes a single frame in global abnormal event detection, or the set of patches of the i-th frame in local abnormal event detection. Since local abnormal event detection can be treated as a special kind of global abnormal event detection, we introduce the anomaly detection of global abnormal events. The corresponding feature pool of TR is H. We leverage the method in [42] to obtain the optimized feature pool, denoted as H* ∈ R^{M×K0} (K0 < N0). H* is the set obtained by deleting from H the columns that are never utilized to represent the others. H* is a compact set and has a better ability to represent normal events. In the training process, since the original frames in the initial training data set have a similar visual appearance except for the background areas, the motion feature vectors of normal events are similar in the feature pool H*. Our dictionary learning is completed based on such a training sample set, and the output dictionary of dictionary learning has a low-rank characteristic. Furthermore, we control the distribution of the reconstruction coefficient vectors and make them compact. Based on the previous work in [15], we utilize the following model, i.e., the low-rank and compact coefficient dictionary learning (LRCCDL) method, to obtain a low-rank dictionary and a compact cluster of reconstruction coefficient vectors at the same time:

min_{D,Z,E} ‖D‖_* + α‖Z − C‖_{2,1} + β‖E‖_{2,1}   s.t.  H* = DZ + E   (3)

where D ∈ R^{M×K} is the reconstruction dictionary with a low-rank structure and K is the number of columns of D. Z ∈ R^{K×K0} is the reconstruction coefficient matrix. C ∈ R^{K×K0} is a cluster center matrix, and each column of C is the mean vector of the columns of Z, which is denoted as c. E ∈ R^{M×K0} is the reconstruction error matrix, and α and β are two regularization parameters. ‖·‖_* denotes the nuclear-norm of a matrix, i.e., the sum of the matrix's singular values, which approximates the rank of the matrix. ‖A‖_{2,1} = Σ_j ‖[A]_{:,j}‖_2 = Σ_j sqrt(Σ_i ([A]_{i,j})²) is defined as the l2,1-norm of matrix A; it encourages the columns of A to be zero [67]. The aim of ‖D‖_* is to restrict D to a low-rank structure. Each column in H* has a corresponding vector in Z and in E. ‖Z − C‖_{2,1} makes the reconstruction coefficient vectors of the columns in H* similar to each other and compactly surrounding the center c, and ‖E‖_{2,1} regularizes the reconstruction error of each column in H* so that it is as close to zero as possible.
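For concreteness, the l2,1-norm defined above is simply the sum of the column-wise Euclidean norms, as in this short numpy sketch:

```python
import numpy as np

def l21_norm(A):
    # ||A||_{2,1} = sum_j ||[A]_{:,j}||_2 : the sum of the Euclidean
    # norms of the columns; penalizing it drives entire columns to zero.
    return np.linalg.norm(A, axis=0).sum()

# A matrix whose second column is zero contributes nothing to the norm:
A = np.array([[3.0, 0.0],
              [4.0, 0.0]])
# l21_norm(A) = ||(3,4)||_2 + ||(0,0)||_2 = 5.0
```

This column-wise structure is what makes ‖Z − C‖_{2,1} pull whole coefficient vectors toward the center c and ‖E‖_{2,1} push whole error columns toward zero.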
To solve (3), we first convert it as follows:

min_{D,Z,E,J1,J2} ‖J1‖_* + α‖J2 − C‖_{2,1} + β‖E‖_{2,1}   s.t.  H* = DZ + E,  D = J1,  Z = J2   (4)

(4) can be solved by solving the following equivalent problem, i.e., an augmented Lagrange multiplier (ALM) problem:

min_{D,Z,E,J1,J2,Y1,Y2,Y3} ‖J1‖_* + α‖J2 − C‖_{2,1} + β‖E‖_{2,1} + tr[Y1^T(H* − DZ − E)] + tr[Y2^T(D − J1)] + tr[Y3^T(Z − J2)] + (μ/2)(‖H* − DZ − E‖²_F + ‖D − J1‖²_F + ‖Z − J2‖²_F)   (5)

where Y1, Y2 and Y3 are Lagrange multipliers, μ > 0 is a penalty parameter, and T denotes matrix transposition. (5) can be solved by the inexact ALM algorithm [68] as follows. We can rewrite (5) as

L = ‖J1‖_* + α‖J2 − C‖_{2,1} + β‖E‖_{2,1} + tr[Y1^T(H* − DZ − E)] + tr[Y2^T(D − J1)] + tr[Y3^T(Z − J2)] + (μ/2)(‖H* − DZ − E‖²_F + ‖D − J1‖²_F + ‖Z − J2‖²_F)   (6)

To solve (5), we differentiate (6) and update one variable at a time with the others fixed to their most recent values.
Step 1: update J1.

∂L/∂J1 = ∂‖J1‖_*/∂J1 + ∂tr[Y2^T(D − J1)]/∂J1 + (μ/2) ∂‖D − J1‖²_F/∂J1 = ∂‖J1‖_*/∂J1 + μ(J1 − (D + Y2/μ))
Fig. 2. The Type A basis corresponding to the detection of GAE and the Type B basis corresponding to the detection of LAE.
Algorithm 1 Solving Problem (3) by Inexact ALM.

Input: matrix H*, initial dictionary D = U (where U comes from [U, Σ, V] = svd(H*)), parameters α, β
Output: D, C
Initialize: Z = 0, E = 0, J1 = 0, J2 = 0, Y1 = 0, Y2 = 0, Y3 = 0, μ = 10^{−6}, μ̄ = 10^{30}, ρ = 1.1, ε = 10^{−6}
while not converged do
1. Fix the others and update J1 by Eq. (9)
2. Fix the others and update Z by Eq. (11)
3. Fix the others and update C using Z by Eq. (12)
4. Fix the others and update J2 by Eq. (15)
5. Fix the others and update E by Eq. (18)
6. Fix the others and update D by Eq. (20)
7. Update the multipliers:
Y1 = Y1 + μ(H* − DZ − E)
Y2 = Y2 + μ(D − J1)
Y3 = Y3 + μ(Z − J2)
8. Update the parameter μ by μ = min(ρμ, μ̄)
9. Check the convergence conditions:
‖H* − DZ − E‖_∞ < ε and ‖D − J1‖_∞ < ε and ‖Z − J2‖_∞ < ε
end while
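Algorithm 1 can be sketched in numpy as below. This is an illustrative sketch rather than the authors' released code: subproblem (9) is solved with the standard singular value thresholding operator [69], and subproblems (15) and (18) with the column-wise shrinkage operator used in the alternating minimization solvers of [70,71]; the parameter values follow the initialization above.

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: proximal operator of the nuclear
    # norm, used for the J1 update, Eq. (9).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def col_shrink(M, tau):
    # Column-wise shrinkage: proximal operator of the l2,1-norm,
    # used for the J2 and E updates, Eqs. (15) and (18).
    norms = np.linalg.norm(M, axis=0)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def lrccdl(H, K, alpha=1.0, beta=1.0, rho=1.1,
           mu=1e-6, mu_bar=1e30, eps=1e-6, max_iter=500):
    M, K0 = H.shape
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    D = U[:, :K]                      # initial dictionary from svd(H*)
    Z = np.zeros((K, K0)); E = np.zeros((M, K0))
    J1 = np.zeros((M, K)); J2 = np.zeros((K, K0))
    Y1 = np.zeros((M, K0)); Y2 = np.zeros((M, K)); Y3 = np.zeros((K, K0))
    for _ in range(max_iter):
        J1 = svt(D + Y2 / mu, 1.0 / mu)                      # Eq. (9)
        Z = np.linalg.solve(D.T @ D + np.eye(K),
                            D.T @ H - D.T @ E + J2
                            + (D.T @ Y1 - Y3) / mu)          # Eq. (11)
        C = np.tile(Z.mean(axis=1, keepdims=True), (1, K0))  # Eq. (12)
        J2 = C + col_shrink(Z + Y3 / mu - C, alpha / mu)     # Eq. (15)
        E = col_shrink(H - D @ Z + Y1 / mu, beta / mu)       # Eq. (18)
        D = (H @ Z.T + J1 - E @ Z.T + (Y1 @ Z.T - Y2) / mu) \
            @ np.linalg.inv(Z @ Z.T + np.eye(K))             # Eq. (20)
        Y1 += mu * (H - D @ Z - E)                           # multipliers
        Y2 += mu * (D - J1)
        Y3 += mu * (Z - J2)
        mu = min(rho * mu, mu_bar)
        if (np.abs(H - D @ Z - E).max() < eps
                and np.abs(D - J1).max() < eps
                and np.abs(Z - J2).max() < eps):
            break
    return D, C
```

Note that, by construction, all columns of the returned C are equal to the cluster center c.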
= 0   (7)

Collecting the terms in J1, minimizing L with respect to J1 is equivalent to minimizing

(1/μ)‖J1‖_* + (1/2)‖J1 − (D + Y2/μ)‖²_F   (8)

Therefore, we can obtain

J1 = argmin_{J1} (1/μ)‖J1‖_* + (1/2)‖J1 − (D + Y2/μ)‖²_F   (9)

Step 2: update Z.

∂L/∂Z = ∂tr[Y1^T(H* − DZ − E)]/∂Z + ∂tr[Y3^T(Z − J2)]/∂Z + (μ/2)(∂‖H* − DZ − E‖²_F/∂Z + ∂‖Z − J2‖²_F/∂Z) = −D^T Y1 + Y3 − μ(D^T H* − D^T E + J2) + μ(D^T D + I)Z = 0   (10)

Therefore, we can obtain

Z = (D^T D + I)^{−1}[D^T H* − D^T E + J2 + (D^T Y1 − Y3)/μ]   (11)

Step 3: update C.

We update each column of C as follows:

[C]_{:,j} = (1/K0) Σ_t [Z]_{:,t}   (12)

where (:, j) denotes the j-th column of matrix C. (12) implies that all the columns of matrix C are equal.

Step 4: update J2.

∂L/∂J2 = α ∂‖J2 − C‖_{2,1}/∂J2 + ∂tr[Y3^T(Z − J2)]/∂J2 + (μ/2) ∂‖Z − J2‖²_F/∂J2 = α ∂‖J2 − C‖_{2,1}/∂J2 + μ[J2 − (Z + Y3/μ)] = 0   (13)

Collecting the terms in J2, minimizing L with respect to J2 is equivalent to minimizing

(α/μ)‖J2 − C‖_{2,1} + (1/2)‖J2 − (Z + Y3/μ)‖²_F   (14)

Therefore, we can obtain

J2 = argmin_{J2} (α/μ)‖J2 − C‖_{2,1} + (1/2)‖J2 − (Z + Y3/μ)‖²_F   (15)

Step 5: update E.

∂L/∂E = β ∂‖E‖_{2,1}/∂E + ∂tr[Y1^T(H* − DZ − E)]/∂E + (μ/2) ∂‖H* − DZ − E‖²_F/∂E = β ∂‖E‖_{2,1}/∂E + μ[E − (H* − DZ + Y1/μ)] = 0   (16)

Collecting the terms in E, minimizing L with respect to E is equivalent to minimizing

(β/μ)‖E‖_{2,1} + (1/2)‖E − (H* − DZ + Y1/μ)‖²_F   (17)

Therefore, we can obtain

E = argmin_{E} (β/μ)‖E‖_{2,1} + (1/2)‖E − (H* − DZ + Y1/μ)‖²_F   (18)

Step 6: update D.

∂L/∂D = ∂tr[Y1^T(H* − DZ − E)]/∂D + ∂tr[Y2^T(D − J1)]/∂D + (μ/2)(∂‖H* − DZ − E‖²_F/∂D + ∂‖D − J1‖²_F/∂D) = μD(ZZ^T + I) − [Y1 Z^T − Y2 + μ(H* Z^T + J1 − E Z^T)] = 0   (19)

Therefore, we can obtain

D = [H* Z^T + J1 − E Z^T + (Y1 Z^T − Y2)/μ](ZZ^T + I)^{−1}   (20)

Our dictionary learning method, i.e., LRCCDL, is described in detail in Algorithm 1.
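As a quick numerical sanity check (with arbitrary random matrices, not data from the paper), one can verify that the closed-form updates (11) and (20) zero the corresponding gradients (10) and (19):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, K0, mu = 7, 4, 9, 0.5
D = rng.normal(size=(M, K)); H = rng.normal(size=(M, K0))
E = rng.normal(size=(M, K0)); J1 = rng.normal(size=(M, K))
J2 = rng.normal(size=(K, K0))
Y1 = rng.normal(size=(M, K0)); Y2 = rng.normal(size=(M, K))
Y3 = rng.normal(size=(K, K0))

# Eq. (11): Z = (D^T D + I)^{-1}[D^T H* - D^T E + J2 + (D^T Y1 - Y3)/mu]
Z = np.linalg.solve(D.T @ D + np.eye(K),
                    D.T @ H - D.T @ E + J2 + (D.T @ Y1 - Y3) / mu)
# Gradient (10) vanishes at this Z:
grad_Z = (-D.T @ Y1 + Y3 - mu * (D.T @ H - D.T @ E + J2)
          + mu * (D.T @ D + np.eye(K)) @ Z)
assert np.allclose(grad_Z, 0)

# Eq. (20): D = [H* Z^T + J1 - E Z^T + (Y1 Z^T - Y2)/mu](Z Z^T + I)^{-1}
D_new = (H @ Z.T + J1 - E @ Z.T + (Y1 @ Z.T - Y2) / mu) \
        @ np.linalg.inv(Z @ Z.T + np.eye(K))
# Gradient (19) vanishes at this D:
grad_D = (mu * D_new @ (Z @ Z.T + np.eye(K))
          - (Y1 @ Z.T - Y2 + mu * (H @ Z.T + J1 - E @ Z.T)))
assert np.allclose(grad_D, 0)
```

Both residuals are zero up to floating-point precision, confirming that (11) and (20) are the stationary points of the two quadratic subproblems.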
When this iterative process ends, we obtain the low-rank dictionary D and the mean vector of Z, which is denoted as c, i.e., any column of C.

We can solve optimization problem (3) via either inexact or exact ALM [68]. We choose the inexact ALM algorithm based on its efficiency, and we outline the method in Algorithm 1. We can utilize the singular value thresholding operator method to solve the
problem in Step 1 [69]. Step 4 and Step 5 are solved by the alternating minimization algorithm in [70,71].

Fig. 3. (a) The curves of Con1-Con3 along with the number of iterations. (b) The curves of Con4-Con6 along with the number of iterations.
The convergence of the exact ALM algorithm to solve the prob-
lem with a smooth objective function has been proven in [72] .
As a variation of exact ALM, inexact ALM is also widely used,
whose convergence has been well studied when the number of
blocks is at most two [68,73] . Up to now, to ensure the conver-
gence of inexact ALM with three or more blocks is still difficult
[12,70,73] . In Algorithm 1 , the objective function of problem (3)
is not smooth and there are six blocks, i.e., D, J 1 , Z, J 2 , E , and C
(Since [ C] : , j =
1 K 0
∑
t [ Z] : ,t , i.e., Eq. (12) in the paper, we mainly an-
alyze the convergence of the first five blocks). So it is difficult to
prove the convergence of Algorithm 1 in theory. However, there are
some guarantees to ensure the convergence of Algorithm 1. Based on the theoretical results in [74], two conditions are sufficient for the convergence of Algorithm 1: (a) the dictionary D is of full column rank; (b) the gap between the solution obtained in each iteration after a certain number of iterations, denoted as (D_k, J_{1k}, Z_k, J_{2k}) at the k-th iteration, and the ideal solution obtained by minimizing the Lagrange function, denoted as argmin_{D, J_1, Z, J_2} L, is monotonically decreasing. The gap can be described as η_k = ‖(D_k, J_{1k}, Z_k, J_{2k}) − argmin_{D, J_1, Z, J_2} L‖_F^2. Condition (a) is easy to satisfy. In Theorem 1 of [70], we have the result: for any optimal solution to problem (3), Z* ∈ span(D*^T), where Z* and D* are the optimal solutions of Z and D respectively. This theorem shows that the optimal solution Z* of problem (3) always lies within the subspace spanned by the rows of D*. This means that Z* can be expressed as Z* = P* Z̄*, where P* can be computed by orthogonalizing the columns of D*^T. So problem (3) can be converted into the following equivalent problem by replacing Z with P* Z̄:
min_{A, Z̄, E} ‖A‖_* + α‖Z̄ − C̄‖_{2,1} + β‖E‖_{2,1}   s.t.   H* = A Z̄ + E      (21)

where A = D P*. Based on the optimal solution (A*, Z̄*, E*), we can obtain the optimal solution of problem (3) by (A* P*^{−1}, P* Z̄*, E*).
The number of the rows of Z̄ is at most the rank of D , so the suffi-
cient condition (a) can be satisfied. For condition (b), although a strict proof is not easy, the convexity of the Lagrange function could guarantee its validity to some extent [74]. Based on the
sufficient conditions (a) and (b), the convergence properties could
be well expected. What’s more, as shown in [73] , the inexact ALM
algorithm generally performs well in reality.
Moreover, we illustrate the training stage on the UMN dataset, which is also used in the experiments in the next section. We choose one scene, i.e., the indoor scene, from the dataset to demonstrate the convergence behavior of Algorithm 1. For convenience, we define the following variables:
Con1 = (‖D_i‖_F − ‖D_{i−1}‖_F)/‖D_i‖_F, Con2 = (‖Z_i‖_F − ‖Z_{i−1}‖_F)/‖Z_i‖_F, Con3 = (‖E_i‖_F − ‖E_{i−1}‖_F)/‖E_i‖_F, Con4 = ‖H* − DZ − E‖_∞, Con5 = ‖D − J_1‖_∞, Con6 = ‖Z − J_2‖_∞. The two panels of Fig. 3 show the convergence analysis of Algorithm 1.

Fig. 3(a) shows that Con1, Con2, and Con3 monotonically decrease to zero after a certain number of iterations, which indicates that D, Z, and E have good convergence properties. From Fig. 3(b), based on convergence conditions Con4–Con6, especially Con5 and Con6, we can infer that J_1 = D and J_2 = Z when the values of Con5 and Con6 reach zero. So J_1 and J_2 also have good convergence properties. Moreover, [C]_{:,j} = (1/K_0) Σ_t [Z]_{:,t}, so the matrix C converges after a certain number of iterations. In summary, the six terms, i.e., D, J_1, Z, J_2, C, and E, converge, as shown in Algorithm 1.
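As an illustration, the convergence monitors above reduce to simple norm computations. Below is a minimal NumPy sketch; the function names and the toy matrices are hypothetical and only demonstrate the two kinds of quantities (relative Frobenius-norm change, infinity-norm constraint residual):

```python
import numpy as np

def rel_change(X_new, X_old):
    """Relative Frobenius-norm change, e.g. Con1 = (||D_i||_F - ||D_{i-1}||_F) / ||D_i||_F."""
    return (np.linalg.norm(X_new) - np.linalg.norm(X_old)) / np.linalg.norm(X_new)

def inf_residual(R):
    """Infinity-norm of a constraint residual, e.g. Con4 = ||H* - D Z - E||_inf."""
    return np.abs(R).max()

# toy iterates (hypothetical shapes, not the paper's data)
rng = np.random.default_rng(0)
D_prev = rng.normal(size=(6, 4))
Z = rng.normal(size=(4, 10))
E = np.zeros((6, 10))
D = D_prev + 1e-8 * rng.normal(size=D_prev.shape)  # dictionary has almost stopped moving
H = D @ Z + E                                       # data matrix consistent with the iterates

con1 = rel_change(D, D_prev)          # near zero once D converges
con4 = inf_residual(H - D @ Z - E)    # zero when the constraint holds exactly
print(con1, con4)
```

Monitoring both kinds of quantities matters: the relative changes track whether the iterates have settled, while the residuals track whether the equality constraints are actually satisfied.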
4.2.2. Detecting stage

In the detection stage, our aim is to distinguish normal and abnormal samples. Our solution is to force the reconstruction coefficient vectors of all the testing samples (including normal and abnormal samples) to distribute compactly around the center of Z in (3), i.e., all the reconstruction coefficient vectors are similar to those of normal samples. Therefore, the reconstruction error of a normal testing sample will be smaller and the reconstruction error of an abnormal sample will be larger. By solving (3), we obtain the low-rank dictionary D and the mean vector c. Given a testing sample set Y_t, the reconstruction coefficient vectors are obtained by solving the following optimization problem:

min_{Z_t, E_t} ‖Z_t − C_t‖_{2,1} + γ‖E_t‖_{2,1}   s.t.   Y_t = D Z_t + E_t      (22)
where Z_t is the reconstruction coefficient set. Each column of C_t is denoted as c_t, and c_t = c. E_t is the reconstruction error set of Y_t, and γ is a regularization parameter. The above problem can be converted to the equivalent problem:

min_{Z_t, E_t, W_t} ‖W_t − C_t‖_{2,1} + γ‖E_t‖_{2,1}   s.t.   Y_t = D Z_t + E_t, Z_t = W_t      (23)

(23) can be solved by solving the following ALM problem:

min_{Z_t, E_t, W_t, L_{1,t}, L_{2,t}} ‖W_t − C_t‖_{2,1} + γ‖E_t‖_{2,1} + tr[L_{1,t}^T (Y_t − D Z_t − E_t)] + tr[L_{2,t}^T (Z_t − W_t)] + (μ_t/2)(‖Y_t − D Z_t − E_t‖_F^2 + ‖Z_t − W_t‖_F^2)      (24)

where L_{1,t} and L_{2,t} are Lagrange multipliers and μ_t > 0 is a penalty parameter. We can solve (23) by the inexact ALM algorithm. The update steps are similar to those in the training stage and are omitted here; we only give the iteration steps, as shown in Algorithm 2.
Algorithm 2 Solving Problem (22) by Inexact ALM.

Input: matrix Y_t, dictionary D, parameter γ
Output: Z_t
Initialize: Z_t = 0, E_t = 0, W_t = 0, L_{1,t} = 0, L_{2,t} = 0, μ_t = 10^{−6}, μ̄_t = 10^{30}, ρ_t = 1.1, ε_t = 10^{−6}
while not converged do
  1. Fix the others and update W_t by
     W_t = argmin (1/μ_t)‖W_t − C_t‖_{2,1} + (1/2)‖W_t − (Z_t + L_{2,t}/μ_t)‖_F^2 + C_t
  2. Fix the others and update Z_t by
     Z_t = (D^T D + I)^{−1}[D^T Y_t − D^T E_t + W_t + (D^T L_{1,t} − L_{2,t})/μ_t]
  3. Fix the others and update E_t by
     E_t = argmin (γ/μ_t)‖E_t‖_{2,1} + (1/2)‖E_t − (Y_t − D Z_t + L_{1,t}/μ_t)‖_F^2
  4. Update the multipliers:
     L_{1,t} = L_{1,t} + μ_t(Y_t − D Z_t − E_t)
     L_{2,t} = L_{2,t} + μ_t(Z_t − W_t)
  5. Update the parameter μ_t by μ_t = min(ρ_t μ_t, μ̄_t)
  6. Check the convergence conditions:
     ‖Y_t − D Z_t − E_t‖_∞ < ε_t and ‖Z_t − W_t‖_∞ < ε_t
end while
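The detection-stage iteration can be sketched in NumPy. This is a rough reimplementation under stated assumptions, not the authors' released code: the W_t update is interpreted as an l_{2,1} shrinkage applied to (Z_t + L_{2,t}/μ_t − C_t) with C_t added back, and all problem sizes here are toy values:

```python
import numpy as np

def prox_l21(V, tau):
    """Column-wise l2,1 shrinkage: argmin_X tau*||X||_{2,1} + 0.5*||X - V||_F^2."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return V * scale

def detect_stage_alm(Y, D, c, gamma=0.02, rho=1.1, mu=1e-6, mu_bar=1e30,
                     eps=1e-6, max_iter=500):
    """Sketch of the inexact ALM iteration for problem (22); variable names
    follow the paper, implementation details are our interpretation."""
    n, m = Y.shape
    k = D.shape[1]
    C = np.tile(c.reshape(-1, 1), (1, m))      # every column of C_t equals c
    Z = np.zeros((k, m)); E = np.zeros((n, m)); W = np.zeros((k, m))
    L1 = np.zeros((n, m)); L2 = np.zeros((k, m))
    G = np.linalg.inv(D.T @ D + np.eye(k))     # cached factor for the Z update
    for _ in range(max_iter):
        # Step 1: W update = C + l2,1 shrinkage of (Z + L2/mu - C)
        W = C + prox_l21(Z + L2 / mu - C, 1.0 / mu)
        # Step 2: closed-form Z update
        Z = G @ (D.T @ Y - D.T @ E + W + (D.T @ L1 - L2) / mu)
        # Step 3: E update by l2,1 shrinkage with weight gamma/mu
        E = prox_l21(Y - D @ Z + L1 / mu, gamma / mu)
        # Step 4: multiplier updates
        R1 = Y - D @ Z - E
        R2 = Z - W
        L1 = L1 + mu * R1
        L2 = L2 + mu * R2
        # Step 5: penalty update
        mu = min(rho * mu, mu_bar)
        # Step 6: convergence check on the two constraint residuals
        if np.abs(R1).max() < eps and np.abs(R2).max() < eps:
            break
    return Z, E
```

On a synthetic set whose columns are exact dictionary reconstructions of c, the iteration drives both residuals to machine precision almost immediately, since Z = C, E = 0 is feasible and optimal.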
Fig. 4. (a) A clip of testing surveillance video, i.e., frame 850 to frame 1350 (the first abnormal frame is approximately frame 1295). (b) The RC values corresponding to
normal/abnormal frames.
In traditional abnormal event detection based on dictionary learning, whether the reconstruction coefficient vector of a testing sample is sparse is the key to judging whether the sample is normal, such as in [48]. Because the dictionary is learned from normal training samples, it lacks the ability to represent abnormal testing samples. Assume that H_{t1} is a normal testing sample and H_{t2} is an abnormal testing sample, and that their sparse representation coefficient vectors over D are z_{t1} and z_{t2} respectively. We can find that z_{t2} is denser than z_{t1}. In our LRCCDL method, the dictionary is also trained on normal training samples, so it has a strong ability to sparsely represent normal testing samples. Furthermore, the second term ‖Z − C‖_{2,1} of the objective function of (3) encourages the reconstruction coefficient vectors of the training samples to surround their mean vector compactly, so the reconstruction coefficient vectors of abnormal testing samples should be far away from the mean vector. In the process of anomaly detection, we utilize (22) to force the reconstruction coefficient vectors of all the testing samples to distribute around the mean vector compactly (c_t = c); such reconstruction coefficient vectors of abnormal testing samples are similar to those of normal testing samples, which leads to a bad distortion for the abnormal testing samples. Assume that H_{t3} is a normal testing sample and H_{t4} is an abnormal testing sample, and that their reconstruction coefficient vectors over D are z_{t3} and z_{t4} respectively. In the stage of anomaly detection, the gap between the normal testing sample's reconstruction error, i.e., ‖H_{t3} − D z_{t3}‖_2, and the abnormal testing sample's reconstruction error, i.e., ‖H_{t4} − D z_{t4}‖_2, will become large, which improves the distinguishing ability of our algorithm for normal and abnormal samples.

By solving (22), we can obtain the reconstruction coefficient set Z_t of Y_t over the low-rank dictionary D. Given a sample H_t in Y_t, we define the reconstruction cost (RC) as follows:

RC = ‖H_t − D z_t‖_2 + λ‖z_t‖_1      (25)

where z_t is the coefficient vector of H_t in Z_t, and λ is the multiplier parameter that balances ‖H_t − D z_t‖_2 and ‖z_t‖_1. H_t is determined to be normal if the value of RC satisfies the following criterion:

RC < τ      (26)

where τ is an artificially defined threshold to control the sensitivity of the algorithm to abnormal events.

Fig. 4 shows an example of abnormal event detection with the values of RC.
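The decision rule of (25)-(26) is straightforward to implement. Here is a hedged sketch with a hypothetical two-atom dictionary; the values of λ and τ below are illustrative only, not the paper's tuned parameters:

```python
import numpy as np

def reconstruction_cost(h, D, z, lam=0.1):
    """RC = ||h - D z||_2 + lam * ||z||_1, as in Eq. (25)."""
    return np.linalg.norm(h - D @ z) + lam * np.abs(z).sum()

def is_normal(h, D, z, tau, lam=0.1):
    """A frame is declared normal when RC < tau, as in Eq. (26)."""
    return reconstruction_cost(h, D, z, lam) < tau

# hypothetical example: a sample that D reconstructs well vs. one it does not
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
z = np.array([0.5, 0.5])
h_normal = D @ z                                     # perfectly reconstructed
h_abnormal = h_normal + np.array([0.0, 0.0, 2.0])    # energy outside span(D)

rc_n = reconstruction_cost(h_normal, D, z)           # only the sparsity term remains
rc_a = reconstruction_cost(h_abnormal, D, z)         # large residual dominates
print(rc_n, rc_a)
```

With these numbers, rc_n = 0.1 (pure l1 term) while rc_a = 2.1, so any threshold between the two separates the samples, which is exactly the gap the method tries to enlarge.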
5. Experimental results

To validate our proposed methods, we demonstrate experiments on four public datasets, i.e., PETS 2009, UMN, UCSD, and CUHK Avenue. Specifically, in the experiments on the PETS 2009 dataset, we validate the significance of our first contribution by comparing our method LRCCDL with the method extracting the HMOFP feature directly from the optical flow field of original images, and in the experiments on the UMN dataset, we validate the significance of our second and third contributions by com-
Fig. 5. (a) The normal scene of people walking toward all directions. (b) The abnormal scene of people moving toward one direction. (c) The classification results of sequence
Time 14-(55, 17) . Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.
paring our LRCCDL method with the traditional sparse reconstruc-
tion method.
5.1. Anomaly detection at the global scale on the PETS 2009 dataset
In this section, we choose the PETS 2009 dataset to evaluate our
algorithm by abnormal event detection at the global scale. In the
following experiments, some specific scenes are chosen as the de-
tection targets, i.e., abnormal events. In dataset PETS 2009, the res-
olution of a frame image is 576 × 768. We set the size of an image patch as 144 × 192, with no overlap between neighboring patches. We evenly divide 0°−360° into 36 bins (the number of bins is a parameter of the motion descriptor extraction process [15,39–42]). The length of the HMOFP feature vector
is 576 based on spatial basis Type A, as shown in Fig. 2 . In the ex-
periments, we compare our method LRCCDL with the method ex-
tracting the HMOFP feature directly from the optical flow field of
original images, which is denoted as LRCCDL_RAW.
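For illustration, the orientation-binning step described above (quantizing the optical flow of a patch into 36 bins over 0°−360°) might look as follows. This sketch only covers generic binning with a per-bin maximal magnitude; the full HMOFP descriptor and the LRCCDL pipeline involve further steps not reproduced here, and the flow fields below are invented:

```python
import numpy as np

def orientation_histogram(flow_u, flow_v, n_bins=36):
    """Quantize the optical-flow orientations of one patch into n_bins over
    [0, 360) and keep the maximal flow magnitude per bin. This is only an
    approximation of the binning stage, not the exact HMOFP computation."""
    angles = np.degrees(np.arctan2(flow_v, flow_u)) % 360.0  # orientation per pixel
    magnitudes = np.hypot(flow_u, flow_v)
    bins = (angles // (360.0 / n_bins)).astype(int)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        m = magnitudes[bins == b]
        hist[b] = m.max() if m.size else 0.0
    return hist

# a toy patch whose flow points uniformly to the right (0 degrees)
u = np.ones((4, 4))
v = np.zeros((4, 4))
h = orientation_histogram(u, v)
print(h[0], h.sum())
```

Concatenating such per-patch histograms over the spatial basis (Type A or Type B in Fig. 2) yields a fixed-length frame descriptor, which matches the feature lengths quoted in the text (bins × number of patches).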
5.1.1. Detection of crowd movement direction
In this part, the 0th frame to the 399th frame of Time 14–55
are chosen as the training set. The 400th frame to the 488th frame
of Time 14–55 are chosen as the normal testing set, including 89
frames. The abnormal testing set includes 89 frames, i.e., the 0th
frame to the 88th frame of Time 14–17 . In the training set and
the normal testing set, the people of the crowd are walking to-
ward several directions. In the abnormal testing set, the crowd is
moving only toward one direction. For convenience, the two test-
ing video sequences are denoted as Time 14-(55, 17) in this section.
Fig. 5 (a) and (b) show the normal and abnormal scenes. The ac-
curacy values of LRCCDL and LRCCDL_RAW are 93.21% and 91.97%
respectively. Fig. 5 (c) shows the detection results.
5.1.2. Detection of people running
In this part, the training set contains two parts: the 0th frame
to the 49th frame of Time 14–31 and the 0th frame to the 60th
frame of Time 14–17 . The 0th frame to the 37th frame and the
108th frame to the 173rd frame of Time 14–16 are chosen as the
normal testing set, including 104 frames. The abnormal testing set
includes the 38th frame to the 107th frame and the 174th frame to
the 222nd frame of Time 14–16 . In the training set and the normal
testing set, the people of the crowd are walking from right to left
and walking back toward the negative direction. In the abnormal
testing set, the crowd is running toward one direction. Fig. 6 (a)
and (b) show the normal and abnormal scenes. The accuracy val-
ues of LRCCDL and LRCCDL_RAW are 96.96% and 92.34% respec-
tively. Fig. 6 (c) shows the detection results.
5.1.3. Detection of people splitting
In this part, the 0th frame to the 40th frame of Time 14–16 are
chosen as the training set. The 0th frame to the 63rd frame of Time
14–31 are chosen as the normal testing set, including 64 frames.
The abnormal testing set includes 66 frames, i.e., the 64th frame to
the 129th frame of Time 14–31 . In the training set and the normal
testing set, the people of the crowd are walking toward the same direction. In the abnormal testing set, the people of the crowd are splitting in some directions. Fig. 7(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 95.59% and 94.65% respectively. Fig. 7(c) shows the detection results.
5.1.4. Detection of people scattering

In this part, the 0th frame to the 222nd frame of Time 14–16 are chosen as the training set. The 48th frame to the 93rd frame of Time 14–17 are chosen as the normal testing set, including 46 frames. The abnormal testing set includes 36 frames, i.e., the 342nd frame to the 377th frame of Time 14–33. In the training set and the normal testing set, the people of the crowd are running or walking toward one direction. In the abnormal testing set, the crowd is scattering in all directions. For convenience, the two testing video sequences are denoted as Time 14-(17, 33) in this section. Fig. 8(a) and (b) show the normal and abnormal scenes. The accuracy values of LRCCDL and LRCCDL_RAW are 99.75% and 98.75% respectively. Fig. 8(c) shows the detection results.

5.1.5. Performance comparison

The experimental results above show that the LRCCDL algorithm with the HMOFP descriptor based on background subtraction and binarization obtains better performance in general. Therefore, we can obtain a better motion descriptor based on our first contribution. Table 1 shows the detection results on the PETS 2009 dataset. Besides LRCCDL_RAW, our algorithm is also compared with the histogram of optical flow orientation (HOFO) method presented in [38]. As shown in the table, the detection accuracy of our proposed algorithm LRCCDL is better than that of the other methods.

5.2. Anomaly detection at the global scale on the UMN dataset

In this section, the UMN dataset is chosen to evaluate our algorithm LRCCDL by anomaly detection at the global scale. The UMN dataset includes three different crowded scenes, i.e., lawn, indoor, and plaza. The number of frame images is 7739 in total, and the resolution of a frame image is 240 × 320. In the dataset, the scenes where people are walking randomly are normal events, and the scenes where people are running away simultaneously are abnormal events. We set the size of an image patch as 60 × 80, with no overlap between neighboring patches. We evenly divide 0°−360° into 18 bins. The length of the HMOFP feature vector is 288 based on spatial basis Type A shown in Fig. 2. In each scene, we choose the first 400 normal frames to train the low-rank dictionary. Traditional anomaly detection methods based on the learned dictionary use sparse reconstruction. As a contrast to LRCCDL, we also use the sparse reconstruction method to obtain the reconstruction coefficient vectors of the testing samples. The reconstruction cost is the same as (25). This method for anomaly detection is denoted as LRCCDL_SR. LRCCDL_SR is based on the method described in [42], while there are two different points: one is that the motion feature is obtained based on binary
Fig. 6. (a) The normal scene of people walking toward one direction. (b) The abnormal scene of people running toward one direction. (c) The classification results of sequence
Time 14–16 . Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.
Fig. 7. (a) The normal scene of people walking toward the same direction. (b) The abnormal scene of people splitting in some directions. (c) The classification results of
sequence Time 14–31 . Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.
Fig. 8. (a) The normal scene of people walking or running toward one direction. (b) The abnormal scene of people scattering in all directions. (c) The classification results
of sequence Time 14-(17, 33) . Top: the detection result of LRCCDL; Middle: the detection result of LRCCDL_RAW. Bottom: the ground truth of the testing set.
Table 1
Different detection results on the PETS 2009 dataset.
Method Accuracy
Time 14-(55,17) Time 14–16 Time 14–31 Time 14-(17,33)
LRCCDL (Ours) 93.21% 96.96% 95.59% 99.75%
LRCCDL_RAW 91.97% 92.34% 94.65% 98.75%
HOFO [38] 90% 93.24% 94.61% 97.5%
frames; the other is that the reconstruction cost is replaced by (25). The detailed algorithm of LRCCDL_SR is presented in the Appendix. Our experimental results are as follows.
5.2.1. Abnormal event detection in the lawn scene

In the lawn scene, there are 1453 frames in the video sequence in total. Fig. 9(a) and (b) show two representative frames to exhibit the normal event and abnormal event. Fig. 9(c) and (d) show the detection results and the receiver operating characteristic (ROC) curves in the lawn scene. The area under the ROC curve (AUC) of the method LRCCDL is 99.94%, and the AUC of the method LRCCDL_SR is 98.07%.

5.2.2. Abnormal event detection in the indoor scene

In the indoor scene, there are 4144 frames in the video sequence in total. Fig. 10(a) and (b) show two representative frames to exhibit the normal event and abnormal event. Fig. 10(c) and (d) show the detection results and the ROC curves in the indoor scene. The AUC of the method LRCCDL is 99.55%, and the AUC of the method LRCCDL_SR is 94.69%.

5.2.3. Abnormal event detection in the plaza scene

In the plaza scene, there are 2142 frames in the video sequence in total. Fig. 11(a) and (b) show two representative frames to exhibit the normal event and abnormal event. Fig. 11(c) and (d) show the detection results and the ROC curves in the plaza scene. The AUC of the method LRCCDL is 99.93%, and the AUC of the method LRCCDL_SR is 97.65%.

5.2.4. Performance comparison

The experimental results above show that, compared with traditional sparse reconstruction, the LRCCDL algorithm based on our second and third contributions obtains better performance. Our solution based on the low-rank dictionary for anomaly detection, which enlarges the gap between the reconstruction errors of normal testing samples and those of abnormal ones, is effective and robust. In addition to LRCCDL_SR, our algorithm is also compared with several state-of-the-art methods. The performance comparison results are shown in Table 2. The hand-crafted feature-based methods are listed above the dotted line, and deep learning-based methods are listed below the dotted line. In the remaining tables, we also use the dotted line to distinguish these two types of methods. As shown in Table 2, for the lawn and plaza scenes, the AUC
Fig. 9. (a) The normal event in the lawn scene. (b) The abnormal event in the lawn scene. (c) The classification results of the lawn scene. Top: the detection result of
LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the lawn scene.
Fig. 10. (a) The normal event in the indoor scene. (b) The abnormal event in the indoor scene. (c) The classification results of the indoor scene. Top: the detection result of
LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the indoor scene.
Fig. 11. (a) The normal event in the plaza scene. (b) The abnormal event in the plaza scene. (c) The classification results of the plaza scene. Top: the detection result of
LRCCDL; Middle: the detection result of LRCCDL_SR. Bottom: the ground truth of the testing set. (d) The ROC curves in the plaza scene.
Fig. 12. The normal scene (the 1st frame) and the abnormal scenes (the 2nd frame to the 5th frame) in dataset UCSD Ped1.
Fig. 13. The normal scene (the 1st frame) and the abnormal scenes (the 2nd frame to the 4th frame) in dataset UCSD Ped2.
Table 2
Comparison of LRCCDL with other methods on the UMN dataset.
Method AUC
Lawn Indoor Plaza
LRCCDL (Ours) 99.94% 99.55% 99.93%
LRCCDL_SR 98.07% 94.69% 97.65%
HOFO [38] 98.45% 90.37% 98.15%
Sparse [48] 99.5% 97.5% 96.4%
STCOG [32] 93.62% 77.59% 96.61%
Lee et al. [26] 99.4% 90.9% 98.1%
HOIF [33] 99.94% 99.18% 99.88%
Patil et al. [43] 98.67% 93.68% 97.11%
Zhang et al. [45] 99.3% 96.9% 98.8%
NN [48] 84%
Optical Flow [21] 93%
SF [21] 96%
—————————————————————————————————————–
AVID [61] 99.6%
TCP [62] 98.8%
Wang et al. [64] 99.0%
Table 3
Different EER values and AUC values of the detection of LAE on
UCSD Ped1.
Method Criteria
EER AUC
LRCCDL (Ours) 17.05% 90.01%
Sparse [48] 19% 86%
Ren et al. [47] 46.44% 54.26%
Adam et al. [34] 38% 56.63%
MDT [34] 25% 81.8%
SF-MPPCA [34] 32% 67.25%
MPPCA [34] 40% 59%
SF [21] 30% 67.5%
Lee et al. [26] 24.1% 80%
HOFME [44] 33.1% 72.7%
—————————————————————————————————————–
AVID [61] 12.3% –
Conv-WTA + SVM [58] 14.8% 91.6%
ConvLSTM-AE [60] – 75.5%
Liu et al. [63] – 83.1%
TCP [56] 8% 95.7%
Wang et al. [64] 18% 88.5%
Conv-AE [59] 27.9% 81.0%
Huang et al. [55] 11.2% 92.6%
Xu et al. [57] 12% 95.7%
of our proposed LRCCDL based on the HMOFP feature outperforms the other methods. For the indoor scene, the method in [61] obtains the best result; however, our LRCCDL algorithm is comparable to it.
5.3. Anomaly detection at the local scale on the UCSD dataset

In this section, the UCSD dataset is chosen to evaluate our algorithm LRCCDL by anomaly detection at the local scale, which includes experiments of object localization. There are two sub-datasets in the UCSD dataset, i.e., UCSD Ped1 and UCSD Ped2. The first frames in Figs. 12 and 13 show that the frames only contain pedestrians in the normal scenes. The other frames in Figs. 12 and 13 show that cars, wheelchairs, skaters and bikes commonly occur in the abnormal scenes. The evaluation contains two parts: (1) frame-level groundtruth-based local anomaly detection, i.e., compared with the frame-level groundtruth annotation, a frame is determined to be abnormal if at least one pixel of the frame is abnormal; and (2) pixel-level groundtruth-based anomaly localization, i.e., compared with the groundtruth annotation at the pixel level, an abnormal frame is determined to be a correctly detected frame if at least 40% of the truly abnormal pixels are detected. Otherwise, the frame is considered to be a false positive.
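The two evaluation criteria above can be sketched directly on binary masks. This is a minimal illustration; the mask shapes and contents below are hypothetical:

```python
import numpy as np

def frame_level_abnormal(pred_mask):
    """Frame-level criterion: the frame is abnormal if any pixel is flagged."""
    return bool(pred_mask.any())

def pixel_level_hit(pred_mask, gt_mask, min_overlap=0.40):
    """Pixel-level criterion: an abnormal frame counts as correctly detected
    only if at least 40% of the truly abnormal pixels are covered."""
    gt_pixels = gt_mask.sum()
    if gt_pixels == 0:
        return False
    covered = np.logical_and(pred_mask, gt_mask).sum()
    return covered / gt_pixels >= min_overlap

gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True                        # 4 truly abnormal pixels

good = np.zeros_like(gt)
good[0, :2] = True                       # covers 2/4 = 50% of the abnormal pixels

bad = np.zeros_like(gt)
bad[0, 0] = True                         # covers only 1/4 = 25%

print(frame_level_abnormal(good), pixel_level_hit(good, gt), pixel_level_hit(bad, gt))
```

Both predictions pass the frame-level test, but only the first passes the stricter 40% pixel-level localization test, which is why pixel-level EDR/AUC numbers are systematically lower than frame-level ones in Tables 3-6.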
5.3.1. Anomaly detection

In the UCSD Ped1 dataset, the training set contains 34 short clips, and the testing set contains 36 short clips. The frame size is 158 × 238 and there are 200 frames in each clip. In the UCSD Ped2 dataset, the training set contains 16 short clips, and the testing set contains 12 short clips. There are 150 to 180 frames in each clip with a 240 × 360 resolution. In addition, a subset of 10 clips for UCSD Ped1 and 12 clips for UCSD Ped2 is provided with manually generated pixel-level binary masks, which identify the regions containing anomalies. In the experiments, we only choose the first short clip in the training set during the training stage. The size of a frame is reset as 240 × 320 and the image patch size is set as 10 × 10 without overlap between neighboring patches. In addition, 0°−360° is divided into 18 bins. The length of the HMOFP feature vector is 72 based on spatial basis Type B, as shown in Fig. 2.

Figs. 14(a) and 15(a) show the detection ROC curves for frame-level groundtruth-based local abnormal event detection. We compare the detection result of our LRCCDL algorithm with those of the state-of-the-art methods. Tables 3 and 5 show the AUC values and equal error rate (EER) values of our algorithm and other methods as a quantitative comparison. Figs. 14(b) and 15(b) show the detection ROC curves for pixel-level groundtruth-based local abnormal event localization. As another quantitative comparison, the AUC values and equal detected rate (EDR) values of our algorithm and the state-of-the-art methods are shown in Tables 4 and 6. Examples of localization are presented in Fig. 16.
Fig. 14. (a) The ROC curves of the local abnormal event detection using frame-level groundtruth on UCSD Ped1. (b) The ROC curves of the local abnormal event localization
using pixel-level groundtruth on UCSD Ped1.
Fig. 15. (a) The ROC curves of the local abnormal event detection using frame-level groundtruth on UCSD Ped2. (b) The ROC curves of the local abnormal event localization
using pixel-level groundtruth on UCSD Ped2.
Fig. 16. (a) The localization results on UCSD Ped1. (b) The localization results on UCSD Ped2.
5.3.2. Performance comparison

The quantitative comparisons are displayed in Tables 3 to 6. The compared state-of-the-art methods include both hand-crafted feature-based methods and deep learning-based methods. The criteria include two items: AUC and EER (EDR = 1 − EER). The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [75]. Depending on the threshold, the false reject rate, i.e., the rate of positive samples that are wrongly detected as negative, and the false accept rate, i.e., the rate of negative samples that are wrongly detected as positive, change along the ROC curve. When the false reject rate equals the false accept rate, the common value is denoted as the EER, which corresponds to the abscissa of the intersection point of the ROC curve and the back-diagonal dotted line in Figs. 14 and 15. Regarding the problem of binary classification, EER corresponds to the classification result under one specific threshold value, but AUC reflects the classification results under all of the threshold values. The classifier with a greater area has a better average performance. Based on this analysis of AUC and EER, AUC is the main baseline to evaluate the performance of binary classifiers.
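For reference, the EER described above can be approximated from sampled operating points of an ROC curve, as in this sketch; the sampled rates below are invented for illustration:

```python
import numpy as np

def eer_from_roc(fpr, fnr):
    """Equal error rate: the point where the false accept rate (FPR) equals
    the false reject rate (FNR), approximated on a sampled ROC curve by
    taking the operating point where the two rates are closest."""
    fpr, fnr = np.asarray(fpr), np.asarray(fnr)
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2.0

# hypothetical sampled operating points of a detector
fpr = np.array([0.0, 0.1, 0.2, 0.3, 0.5, 1.0])
fnr = np.array([1.0, 0.6, 0.3, 0.2, 0.1, 0.0])
print(eer_from_roc(fpr, fnr))
```

With a densely sampled curve one would instead interpolate between the two bracketing points, but the nearest-point approximation above already shows why EER summarizes a single threshold while AUC integrates over all of them.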
Table 4
Different EDR values and AUC values of the localization of LAE
on UCSD Ped1.
Method Criteria
EDR AUC
LRCCDL (Ours) 63.98% 76.09%
Sparse [48] 46% 46.1%
Ren et al. [47] 48.49% 48.23%
Adam et al. [34] 24% 13.3%
MDT [34] 45% 44.1%
SF-MPPCA [34] 28% 21.3%
MPPCA [34] 18% 20.5%
SF [21] 21% 17.9%
Lee et al. [26] 60% 64.9%
Zhang et al. [45] 62.5% 65%
—————————————————————————————————————–
AVID [61] 85.6% –
Conv-WTA + SVM [58] 64.2% 66.1%
TCP [56] 59.2% 64.5%
Wang et al. [64] 67% 68.9%
Huang et al. [55] 61.3% 69.71%
Xu et al. [57] – 69.9%
Table 5
Different EER values and AUC values of the detection of LAE on
UCSD Ped2.
Method Criteria
EER AUC
LRCCDL (Ours) 9.44% 95.20%
Adam et al. [34] 42% 64%
MDT [34] 25% 85%
SF-MPPCA [34] 36% 71%
MPPCA [34] 30% 77%
SF [21] 42% 63%
Lee et al. [26] 9.8% 92%
HOFME [44] 20% 87.5%
Zhang et al. [45] 16% 90%
Amraee et al. [27] 21% 85.5%
—————————————————————————————————————–
AVID [61] 14% -
Conv-WTA + SVM [58] 8.9% 96.6%
ConvLSTM-AE [60] – 88.1%
Liu et al. [63] – 95.4%
TCP [56] 18% 88.4%
Wang et al. [64] 12% 91.3%
Sabokrou et al. [62] 11% -
Conv-AE [59] 21.7% 90%
Xu et al. [57] 13% 92.3%
Table 6
Different EDR values and AUC values of the localization of LAE
on UCSD Ped2.
Method Criteria
EDR AUC
LRCCDL (Ours) 76.42% 82.72%
Adam et al. [34] 20% 22%
MDT [34] 45% 42%
SF-MPPCA [34] 25% 20%
MPPCA [34] 18% 22%
SF [21] 22% 28%
Lee et al. [26] 76% 81.5%
Zhang et al. [45] 68% 75%
Amraee et al. [27] 71% 80%
—————————————————————————————————————–
AVID [61] 85% -
Conv-WTA + SVM [58] 83.1% 89.3%
Wang et al. [64] 83% 80.1%
Sabokrou et al. [62] 85% -
Table 7
Different AUC values on CUHK Avenue.
Method Criteria
AUC
LRCCDL (Ours) 88.68%
Conv-WTA + SVM [58] 82.1%
Conv-AE [59] 70.2%
ConvLSTM-AE [60] 77%
DeepAppearance [63] 84.6%
Unmasking [63] 80.6%
Stacked RNN [63] 81.7%
Liu et al. [63] 85.1%
Huang et al. [55] 76.8%
From Tables 3 to 6, compared with the hand-crafted feature-based methods, the values of EER (or EDR) and AUC of our proposed method are the best. Compared with the deep learning-based methods, the AUC of our proposed method is the best in Table 4. Based on the values of AUC, the performance of our method is better than those of [59,60,63,64] in Table 3, and [56,57,59,60,64] in Table 5 and [64] in Table 6. Moreover, the AUC values of our method are slightly lower than those of the deep learning-based methods with the best performances. Taking the values of EER and AUC from the four tables together, we find that there is no deep learning-based method that is always better than ours under the criteria of EER and AUC on the whole UCSD dataset. In conclusion, our method achieves good performance on these two datasets.

5.4. Anomaly detection at the local scale on the CUHK Avenue dataset

5.4.1. Anomaly detection and performance comparison

In this section, we choose the CUHK Avenue dataset to evaluate our method. There are 16 training video clips and 21 testing video clips in the dataset. The resolution of the frames in each video is 360 × 640. The abnormal events include wrong direction, strange action, and abnormal objects, which are shown in Fig. 17.

In the experiment, we also only choose the first clip in the training set during the training stage. The HMOFP feature is extracted in the same manner as that for the UCSD dataset. AUC is chosen as the criterion for the performance evaluation. Our proposed method is compared with the state-of-the-art methods, including deep learning-based methods of recent years. The values of AUC are shown in Table 7. It can be seen that our proposed method obtains the highest AUC value.
5.4.2. Analysis of parameters in the algorithm

There are two parameters in Algorithm 1 and one parameter in Algorithm 2, i.e., α, β and γ. The relationship between γ and the other two parameters is γ = β/α. We fix β and change α to obtain different values of AUC. In addition, α is fixed and the value of β is changed. The AUC curves obtained in the abovementioned cases are shown in Fig. 18(a). It can be seen that when α = 5, β = 0.1, and γ = 0.02, a better AUC is achieved. Additionally, we illustrate the performance of AUC with the patch size set {4, 8, 10, 20} in Fig. 18(b), and we find that when the patch size is 10 × 10, a better AUC is achieved. We mainly present the results on the CUHK Avenue dataset, and our experimental results on the above datasets show that the conclusion regarding the parameters can be generalized to other datasets.
6. Conclusions

In this paper, we present a novel algorithm to detect abnormal events based on dictionary learning. Unlike the previous hand-
Fig. 17. The normal scene (the 1st frame) and the abnormal scenes (the 2nd frame to the 4th frame) in dataset CUHK Avenue.
Fig. 18. (a) The performance of AUC under the different parameters in the algorithm on dataset CUHK Avenue. (b) The performance of AUC under the different patch sizes.
crafted feature-based methods in the literature that focus only on
the representation of the motion descriptor of a crowd or the
model to detect whether an event is normal, our new method fully
uses the motion descriptor information to build the anomaly de-
tection framework.
Based on the background subtraction and binarization of the surveillance videos, we remove the low variations and noise coming from background objects and obtain the motion descriptor HMOFP, which describes the motion of the crowd in the foreground more precisely. In the training stage, to make full use of the low-rank structure of the training sample set and to restrict the reconstruction coefficient vectors, we propose the LRCCDL solution as a joint optimization of the nuclear norm and the l2,1-norm. The joint optimization achieves two results: one is the learned low-rank dictionary, and the other is the compact reconstruction coefficient vectors of the training samples, which surround a mean center. Both the dictionary and the center are used in the detection stage. There, we design a strategy that utilizes the l2,1-norm to force the reconstruction coefficient vectors of abnormal samples to follow the same distribution as those of normal samples, which produces a large gap between the reconstruction errors of abnormal and normal testing samples. As a result, abnormal testing samples obtain larger reconstruction errors than normal ones. Finally, a reconstruction cost (RC) function is developed to detect frame abnormality based on the combination of the reconstruction error and the sparsity of the reconstruction coefficient vector.
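Joint nuclear-norm/l2,1-norm objectives of this kind are usually minimized with an augmented Lagrangian (ALM/ADMM) loop whose two core steps are the proximal operators of the two norms: singular value thresholding for the nuclear norm, and group-wise shrinkage for the l2,1-norm. The sketch below shows only these two operators; the function names and the column-wise convention for the l2,1-norm are our assumptions for illustration, not the paper's exact solver.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of tau * ||X||_*
    (the nuclear norm). Shrinks every singular value by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def l21_shrink(X, tau):
    """Proximal operator of tau * ||X||_{2,1}, with the l2,1-norm taken as
    the sum of column l2-norms (a common convention; an assumption here).
    Columns with small norm are zeroed, others are shrunk toward zero."""
    out = np.zeros_like(X)
    for j in range(X.shape[1]):
        nrm = np.linalg.norm(X[:, j])
        if nrm > tau:
            out[:, j] = (1.0 - tau / nrm) * X[:, j]
    return out
```

Inside an ALM iteration, each operator would be applied to its own auxiliary variable, with the multipliers and penalty parameter updated between steps.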
In the experiments, compared with the deep learning-based methods, some results of our proposed method are not superior. However, the amount of training data required by our method is far smaller than that of the deep learning-based methods, especially on the UCSD and CUHK Avenue datasets. Moreover, the detection results of our method are comparable to those of the deep learning-based methods and superior to those of the handcrafted feature-based methods.
In future work, one important aspect is how to improve the performance of anomaly detection at the local scale. Moreover, optimizing our proposed method to decrease the running time is another important task of our forthcoming research. Considering the wide use of deep learning, anomaly detection based on deep learning is a very attractive and valuable area. Although some explorations have arisen in recent years, more robust and efficient algorithms for different scenes need to be further studied. In addition, combining deep learning with traditional methods such as sparse representation may produce a surprising result. All of these will be our future research directions.
Declaration of Competing Interest

None.
Acknowledgement

This work is supported by the National Key R&D Program of China (no. 2019YFB2204200), the NSFC (nos. 61572067, 61872034, 61672089, 61703436, and 61572064), CELFA, the Beijing Municipal Natural Science Foundation under Grant 4202055, the Natural Science Foundation of Guizhou Province ([2019]1064), and the Science and Technology Program of Guangzhou (201804010271). This work is also supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Grant No. RGPIN239031.
Appendix

The complete algorithm for the LRCCDL_SR method.

Step 1: Depending on the binary frames, we can get the optimized motion feature set of TR, which is denoted as H*.

Step 2: The optimized dictionary D_T can be obtained by the online dictionary learning algorithm.
The online dictionary learning algorithm

Require: H* ∈ R^(M×K0) (training sample set), λ ∈ R (regularization parameter), D_0 ∈ R^(M×K) (initial dictionary), T (number of iterations).
1: A_0 ∈ R^(K×K) ← 0, B_0 ∈ R^(M×K) ← 0 (reset the "past" information).
2: for t = 1 to T do
   (1) Draw H*_t from H*.
   (2) Sparse coding: use the Lasso algorithm to compute
       α_t = argmin_{α ∈ R^K} (1/2) ‖H*_t − D_{t−1} α‖_2^2 + λ‖α‖_1.
   (3) A_t ← A_{t−1} + α_t α_t^T = [a_1, a_2, ..., a_K]. The element in the n-th row and the n-th column of A_t is denoted A_t(n, n).
   (4) B_t ← B_{t−1} + H*_t α_t^T = [b_1, b_2, ..., b_K].
   (5) Compute the dictionary D_t:
       (a) for n = 1 to K do
           Update the n-th column of D_{t−1} = [d_1, d_2, ..., d_K]:
           if A_t(n, n) = 0 then
               u_n ← d_n
           else
               u_n ← (1/A_t(n, n)) (b_n − D_{t−1} a_n) + d_n
           end if
           d_{t,n} ← u_n / max(‖u_n‖_2, 1) (the n-th column of D_t).
       (b) end for
       (c) Return D_t = [d_{t,1}, d_{t,2}, ..., d_{t,K}].
3: end for
4: Return D_T (learned dictionary).
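As a rough illustration, the pseudocode above can be sketched in NumPy as follows. The ISTA-based Lasso solver, the function names, and the default parameters are our assumptions for this sketch; they stand in for whatever Lasso implementation is actually used.

```python
import numpy as np

def ista_lasso(D, h, lam, n_iter=200):
    """Solve min_a 0.5*||h - D a||_2^2 + lam*||a||_1 by ISTA
    (gradient step followed by soft thresholding)."""
    L = np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - (D.T @ (D @ a - h)) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def online_dictionary_learning(H, D0, lam=0.1, T=100, seed=0):
    """Online dictionary learning following the appendix pseudocode:
    accumulate A = sum a a^T and B = sum h a^T, then update each
    dictionary column and project it onto the unit l2-ball."""
    rng = np.random.default_rng(seed)
    M, K0 = H.shape
    D = D0.copy()
    K = D.shape[1]
    A = np.zeros((K, K))
    B = np.zeros((M, K))
    for t in range(T):
        h = H[:, rng.integers(K0)]       # (1) draw one training sample
        a = ista_lasso(D, h, lam)        # (2) sparse coding step
        A += np.outer(a, a)              # (3) update "past" information
        B += np.outer(h, a)              # (4)
        for n in range(K):               # (5) column-wise dictionary update
            if A[n, n] == 0:
                u = D[:, n]
            else:
                u = (B[:, n] - D @ A[:, n]) / A[n, n] + D[:, n]
            D[:, n] = u / max(np.linalg.norm(u), 1.0)
    return D
```

Each iteration touches a single sample, so the memory footprint stays at O(MK + K^2) regardless of how many training frames are streamed through.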
Step 3: Extract the HMOFP feature of the testing frame f_t, i.e., H_t, and calculate its sparse reconstruction coefficient vector z_t by

min ‖z_t‖_1 s.t. H_t = D_T z_t,

which can be solved by the OMP method.

Step 4: Then the RC value of f_t is computed by

RC = ‖H_t − D_T z_t‖_2 + λ‖z_t‖_1.

Step 5: The frame f_t is detected as normal if the following criterion is satisfied:

RC < τ,

where τ is a user-defined threshold that controls the sensitivity of the algorithm.

Note: Step 1 and Steps 4-5 are the same as in our proposed method LRCCDL.
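Steps 3-5 can be sketched as follows, using a simple greedy OMP for the sparse coding step. The function names, the stopping parameters, and the example threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def omp(D, h, n_nonzero=10, tol=1e-6):
    """Greedy Orthogonal Matching Pursuit (Step 3): pick the atom most
    correlated with the residual, refit on the support by least squares,
    and stop when the residual is small or the sparsity budget is spent."""
    residual = h.copy()
    support = []
    z = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        n = int(np.argmax(np.abs(D.T @ residual)))
        if n not in support:
            support.append(n)
        coef, *_ = np.linalg.lstsq(D[:, support], h, rcond=None)
        residual = h - D[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    z[support] = coef
    return z

def reconstruction_cost(D, h, lam=0.1, n_nonzero=10):
    """Step 4: RC = ||h - D z||_2 + lam * ||z||_1."""
    z = omp(D, h, n_nonzero)
    return np.linalg.norm(h - D @ z) + lam * np.abs(z).sum()

def is_normal(D, h, tau, lam=0.1):
    """Step 5: a frame is declared normal when its RC is below tau."""
    return reconstruction_cost(D, h, lam) < tau
```

In practice τ would be tuned on held-out normal frames, trading detection rate against false alarms exactly as the sensitivity discussion above describes.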
Ang Li received the Bachelor degree in 2011 from Harbin Institute of Technology. He is now a doctoral student at Beijing Jiaotong University. His research interests include compressed sensing, video processing, abnormal event detection, sparse reconstruction, low-rank matrix reconstruction, etc.

Zhenjiang Miao received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, China, in 1990 and 1994, respectively. From 1995 to 1998, he was a Post-Doctoral Fellow with the Ecole Nationale Superieure d'Electrotechnique, d'Electronique, d'Informatique, d'Hydraulique et des Telecommunications, Institut National Polytechnique de Toulouse, Toulouse, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, China, in 2004. He is currently a Professor and the Director of the Media Computing Center, Beijing Jiaotong University, and Director of the Institute for Digital Culture Research, Center for Ethnic and Folk Literature and Art Development, Ministry of Culture, Beijing, China. His current research interests include image and video processing, multimedia processing, and intelligent human/machine interaction.

Yigang Cen received the Ph.D. degree in control science and engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2006. In 2006, he joined the Signal Processing Centre, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow. From 2014 to 2015, he was a Visiting Scholar with the Department of Computer Science, University of Missouri, Columbia, MO, USA. He is currently a Professor and a Supervisor of doctoral students with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China. His research interests include compressed sensing, sparse representation, low-rank matrix reconstruction, and wavelet construction theory.

Xiao-Ping Zhang received the B.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, and the MBA degree (with honors) in finance, economics, and entrepreneurship from the University of Chicago Booth School of Business, Chicago, IL, USA. Since Fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, where he is currently a Professor and the Director of the Communication and Signal Processing Applications Laboratory. He has served as Program Director of Graduate Studies. He is cross-appointed to the Finance Department, Ted Rogers School of Management, Ryerson University. He is Cofounder and CEO of EidoSearch, Toronto, ON, Canada. His research interests include statistical signal processing, multimedia content analysis, sensor networks and electronic systems, computational intelligence, and applications in bioinformatics, finance, and marketing. Dr. Zhang is a Registered Professional Engineer in the Province of Ontario, Canada. He is a member of the Beta Gamma Sigma Honor Society.

Linna Zhang received the M.S. degree in mechanical engineering from Guizhou University, Guiyang, China, in 2010. She is currently a lecturer with the College of Mechanical Engineering, Guizhou University. Her research interests include signal processing, fault diagnosis, etc.

Shiming Chen received the Ph.D. degree in control science and engineering from the Huazhong University of Science and Technology in 2006. He is currently a Supervisor of doctoral students with the School of Electrical and Automation Engineering, East China Jiaotong University. His research interests include signal processing, multi-robot control, complex networks, etc.