
Crowd Scene Understanding with Coherent Recurrent Neural Networks

Hang Su, Yinpeng Dong, Jun Zhu, Haibin Ling, and Bo Zhang

Tsinghua National Lab for Information Science and Technology
State Key Lab of Intelligent Technology and Systems
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Department of Computer and Information Sciences, Temple University, USA

{suhangss, dongyp13, dcszj, dcszb}@mail.tsinghua.edu.cn, [email protected]

Abstract

Exploring crowd dynamics is essential for understanding crowd scenes, yet it remains a challenging task due to the nonlinear characteristics and coherent spatio-temporal motion patterns of crowd behaviors. To address these issues, we present a Coherent Long Short Term Memory (cLSTM) network that captures the nonlinear crowd dynamics by learning an informative representation of crowd motions, which facilitates the critical tasks in crowd scene analysis. Describing the crowd motion patterns with a cloud of keypoint tracklets, we explore the nonlinear crowd dynamics embedded in the tracklets with a stacked LSTM model, which we further improve to capture collective properties by introducing a coherent regularization term; finally, we adopt an unsupervised encoder-decoder framework to learn, for each input tracklet, a hidden feature that embeds its inherent dynamics. With the learnt features properly harnessed, crowd scene understanding is conducted effectively in predicting the future paths of agents, estimating group states, and classifying crowd events. Extensive experiments on hundreds of public crowd videos demonstrate that our method achieves state-of-the-art performance by exploiting the coherent spatio-temporal structures in crowd behaviors.

1 Introduction

Understanding collective behaviors in crowd scenes has a wide range of applications in video surveillance and crowd management [Sulman et al., 2008], especially in the present era of recurrent and tragic accidents in populous and diverse human activities. However, a crowd is more than the sum of its individuals, which makes vision-related tasks disproportionately harder as crowds scale up. The past decade has witnessed significant progress in crowd scene analysis: learning global motion patterns [Mehran et al., 2010; Wu et al., 2010], modeling local spatio-temporal variations [Kratz and Nishino, 2012; Su et al., 2013], analyzing interactions among individuals [Mehran et al., 2009], profiling group behaviors [Zhou et al., 2011; 2012b], and detecting abnormal crowd behaviors [Solmaz et al., 2012; Mahadevan et al., 2010; Li et al., 2014]. Recently, Li et al. [2015] gave a comprehensive review of the state-of-the-art techniques for crowd scene understanding.

*This research is partly supported by the National Natural Science Foundation of China (Nos. 61571261, 61322308, 61332007) and the China Postdoctoral Science Foundation (No. 2015M580099).

Although various methods have been developed, there is still no widely accepted framework for understanding crowd scenes, especially under extreme clutter or severe occlusion. One essential challenge is that crowd spatio-temporal behavior patterns exhibit rich nonlinear dynamics, such as limit cycles, quasi-periodicity, and even chaos. These nonlinear interactions between individuals often produce complex spatio-temporal motion patterns, e.g., the oscillations of pedestrian flow at bottlenecks [Helbing and Johansson, 2009]. The popular linear dynamic systems used in crowd modeling [Lin et al., 2009; Shao et al., 2014] may fail to capture such nonlinear characteristics. Although the nonlinear characteristics of crowd motions have been investigated in crowd simulation [Massink et al., 2011], few attempts have been made in vision-based crowd motion analysis.

Another challenge in crowd behavior analysis is the collective effect (or coherent motion) [Zhou et al., 2012a; 2014]: pedestrians in crowds tend to form coherent groups by aligning with their neighbors. In contrast to individual motion, a wide variety of self-organized spatio-temporal patterns arise even without external planning or organization, which has been well explained with the social force assumption [Helbing and Johansson, 2009]. Methods that do not leverage these coherent characteristics are therefore limited in capturing the inherent crowd dynamics. For instance, crowd features learnt from a multi-task deep architecture [Shao et al., 2015], although more effective than handcrafted features, do not account for the essential nonlinear temporal correlations and coherent motions in crowd behavior. More recently, coherent motions in crowd scenes have been detected with a thermal energy field so that predefined activities can be effectively recognized [Lin et al., 2016; Wang et al., 2014]; however, this approach still fails to model the nonlinear crowd dynamics, which hinders its performance on complex crowd behaviors.

1.1 Our Proposal

To address the aforementioned challenges, we propose to explore crowd dynamics with a coherent Long Short Term Memory (LSTM) architecture, which models the nonlinearities of crowd behaviors with a stacked LSTM and enforces the consistency of spatio-temporal structures with a coherent regularization.

Recently, deep learning [Schmidhuber, 2015] has achieved state-of-the-art performance in several vision tasks related to visual understanding [Simonyan and Zisserman, 2014] and scene analysis [Ramanathan et al., 2015], partly due to its capability to model nonlinearity. This inspires us to explore the nonlinear dynamics of crowd behaviors with a deep architecture. Specifically, we build our model on the Long Short Term Memory (LSTM) network [Hochreiter and Schmidhuber, 1997], an improved Recurrent Neural Network (RNN) for temporal data modeling. LSTM has proven successful in tasks such as image captioning [Vinyals et al., 2015] and video description [Donahue et al., 2015], since it can capture long-term dynamics while avoiding the vanishing and exploding gradients that afflict conventional RNNs [Lipton, 2015]. To cope with the nonlinear dynamics of long-term crowd behaviors, we use a multi-layer LSTM network to learn an informative representation of crowd tracklets¹, which are more conservative and less likely to drift than long trajectories.

¹A tracklet is a fragment of a trajectory obtained by a tracker, e.g., a KLT keypoint tracker [Baker and Matthews, 2004]; it starts when a novel keypoint is detected and terminates when ambiguities arise.

To capture the coherent spatio-temporal structures, we further improve the multi-layer LSTM network by introducing a coherent regularization term that models the local spatial and temporal dependencies between neighboring pedestrians within a coherent group. Each LSTM memory unit therefore stores not only the dynamic information embedded in its own tracklet but also the dynamics of its neighboring agents. We denote the resulting model as coherent LSTM (cLSTM).

Finally, we adopt an unsupervised LSTM auto-encoder framework [Srivastava et al., 2015] to learn a representation of the crowd dynamics, which significantly reduces the tedious effort of collecting labeled data. Specifically, a stacked cLSTM first encodes the input tracklets into a hidden feature, which is subsequently decoded to reproduce the input tracklets. By exploiting the hidden feature together with the coherent regularization, we can extrapolate the past dynamics into the future and forecast motion beyond what has been observed by recursively unrolling the feature forward in time. Further critical tasks in crowd scene analysis are also conducted with the learnt representation, including group state estimation and crowd event classification.

In summary, our work differs significantly from existing studies [Shao et al., 2014; 2015] in that the crowd dynamics are captured via an LSTM model, which is "deep in time" and can identify informative structures in the time domain. To the best of our knowledge, this study is the first attempt to investigate the nonlinear characteristics of crowd motion patterns with LSTM. Our main contributions are:

• We propose to investigate crowd dynamics with a stacked LSTM model, so that the complex, nonlinear crowd motion patterns are well captured;

• To account for the collective properties of crowd motion patterns, we improve the LSTM by introducing a coherent regularization that encourages consistent spatio-temporal hidden features;

• Finally, we apply the hidden features learnt from the coherent LSTM to critical tasks in crowd scene analysis, including future path prediction, group state estimation, and crowd behavior classification. Experiments demonstrate the state-of-the-art performance of our method.

2 Model Crowd Motions with cLSTM

We describe the crowd motion pattern with a set of tracklets $\{x_t\}$ owing to their explainable semantics in crowd behaviors [Li et al., 2015], as illustrated in Fig. 1. In this section, we aim to extract informative features from these tracklets to explore the crowd dynamics, which facilitates the subsequent tasks in crowd scene analysis. To this end, we first introduce the basic coherent LSTM unit, which updates its memory with its own tracklet together with those of its neighboring agents; afterwards, we detail the coherent regularization term and describe the LSTM models built by stacking the coherent LSTM unit.

Figure 1: Samples of tracklets for crowd motion patterns.
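For readers reproducing the input representation: tracklets of this kind can be extracted with a standard KLT pipeline, e.g., via OpenCV. The sketch below is a generic illustration with our own parameter choices; the paper does not specify its detector settings.

```python
import cv2
import numpy as np

def extract_tracklets(frames, max_corners=500):
    """Track keypoints across frames with KLT; return one tracklet per point.

    frames: list of grayscale images. Tracklets terminate when the tracker
    reports failure (an ambiguity), as described in footnote 1.
    """
    pts = cv2.goodFeaturesToTrack(frames[0], maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    tracklets = [[p.ravel()] for p in pts]
    alive = list(range(len(tracklets)))  # indices of still-tracked points
    for prev, cur in zip(frames, frames[1:]):
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        keep = []
        for k, (p, ok) in enumerate(zip(nxt, status.ravel())):
            if ok:  # point tracked successfully in this frame
                tracklets[alive[k]].append(p.ravel())
                keep.append(k)
        pts = nxt[keep].reshape(-1, 1, 2)
        alive = [alive[k] for k in keep]
        if len(alive) == 0:
            break
    return [np.array(t) for t in tracklets]
```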

2.1 Coherent Long Short Term Memory Unit

LSTM has a powerful capability for modeling the nonlinear dynamics of sequential data [Lipton, 2015; Greff et al., 2015]. As illustrated in Fig. 2, each LSTM unit has a cell as a memory, which maintains its state $c_t$ at time $t$ by regulating the information flow into and out of the unit with nonlinear gates [Greff et al., 2015].

Specifically, the state of a cell is controlled through sigmoidal gates, including an input gate $i_t$, which takes activations from the current data point $x_t$ and the hidden layer at the previous time step $h_{t-1}$ as

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \quad (1)$$

and a forget gate $f_t$, which enables the cell to reset its state, as

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \quad (2)$$

where the $W$'s are weight matrices of appropriate sizes and the $b$'s are bias vectors. Note that all the $W_{c\bullet}$ matrices corresponding to the cell states are diagonal, whereas the rest are full matrices.

Figure 2: A diagram of a coherent LSTM unit, with input, forget, and output gates regulating the cell.

The total input at the input terminal is passed through the tanh nonlinearity and multiplied by the activation of the input gate $i_t$; the result is then added to the cell state $c_{t-1}$ multiplied by the forget gate's activation $f_t$:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \quad (3)$$

where $\odot$ denotes element-wise multiplication. The final output of the LSTM unit, $h_t$, is computed by multiplying the updated cell state, passed through a tanh nonlinearity, by the output gate's activation:

$$h_t = o_t \odot \tanh(c_t), \quad (4)$$

where $o_t$ is the output gate, given by

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o). \quad (5)$$

The output gate $o_t$ controls how much of the memory cell is transferred to the hidden feature. Compared with a conventional RNN, the additional cells in an LSTM sum activities over time. This strategy mitigates the rapid vanishing of gradients and enables LSTMs to learn the extremely complex, long-term temporal dynamics involved in crowd behavior analysis.
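To make Eqs. (1)-(5) concrete, here is a minimal NumPy sketch of one peephole-LSTM step. The parameter dictionary layout and the storage of the diagonal peephole matrices as vectors (`w_ci`, `w_cf`, `w_co`) are our own conventions, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the peephole LSTM in Eqs. (1)-(5).

    `p` is a dict of parameters; the diagonal peephole matrices
    W_ci, W_cf, W_co are stored as vectors (element-wise products).
    """
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])  # Eq. (1)
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])  # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])  # Eq. (3)
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])     # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                                             # Eq. (4)
    return h_t, c_t
```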

Unlike individual behaviors, coherent motion phenomena are widespread in crowds [Zhou et al., 2012a], since individuals tend to join "seed" groups and form spatially coherent structures. We therefore improve the conventional LSTM by taking neighboring tracklets into account. The intuition is that if the dynamics of two tracklets are coherent in the spatial and temporal domains, i.e., if the neighboring relationship of the individuals remains invariant over time or the correlation of their velocities remains high, they should have similar hidden states. To this end, we update the memory unit by combining its own state with those of its neighboring agents through a coherent regularization:

$$c_t = f_t \odot c_{t-1} + \sum_{j \in \mathcal{N}} \gamma_j(t)\, f^j_t \odot c^j_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \quad (6)$$

where $\mathcal{N}$ denotes the set of neighboring tracklets within a coherent group; $f^j_t$ and $c^j_{t-1}$ are the forget gate and cell state of the LSTM for the $j$th tracklet in the coherent group; and $\gamma_j(t)$ weights the dependency between the tracklets, as detailed below.

2.2 Coherent Motion Modeling

In this section, we first investigate the dependency between agents in coherent groups, which are discovered using coherent filtering [Zhou et al., 2012a], as illustrated in Fig. 3. Coherent keypoints with similar motion patterns and tendencies are marked with the same color.

Figure 3: Group detection using coherent filtering [Zhou et al., 2012a], in which different groups are indicated with different colors (best viewed in color).

The dependency between two tracklets within the same group is measured by their pairwise velocity correlation:

$$\tau_j(t) = \frac{v_i(t) \cdot v_j(t)}{\|v_i(t)\|\,\|v_j(t)\|}, \quad (7)$$

where $v_i(t)$ and $v_j(t)$ are the velocities of the $i$th and $j$th tracklets, respectively. The dependency coefficient between the $i$th and $j$th tracklets in Eq. (6) is then defined as

$$\gamma_j(t) = \frac{1}{Z_i} \exp\!\left(\frac{\tau_j(t) - 1}{2\sigma^2}\right) \in (0, 1], \quad (8)$$

where $Z_i$ is the normalization constant for the $i$th tracklet. $\gamma_j(t)$ approaches 1 when tracklets $i$ and $j$ are similar, and decreases as the tracklets diverge. The coherent regularization thus encourages tracklets within a coherent group to learn similar feature distributions by sharing information across the group.
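A minimal sketch of the coherent update, combining Eqs. (6)-(8); the per-group normalization for $Z_i$ and all variable names are our assumptions:

```python
import numpy as np

def dependency_weights(v_i, neighbor_vels, sigma=0.5):
    """Eqs. (7)-(8): gamma_j(t) from pairwise velocity correlations.

    sigma is a bandwidth hyperparameter (this value is an assumption).
    """
    taus = np.array([
        v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j) + 1e-8)
        for v_j in neighbor_vels
    ])
    gammas = np.exp((taus - 1.0) / (2.0 * sigma ** 2))
    return gammas / gammas.sum()  # Z_i taken as the sum over the group

def coherent_cell_update(f_t, c_prev, i_t, g_t, neighbor_f, neighbor_c, gammas):
    """Eq. (6): cell state with coherent regularization.

    g_t = tanh(W_xc x_t + W_hc h_prev + b_c) is the candidate input;
    neighbor_f / neighbor_c hold the neighbors' forget gates and cells.
    """
    coherent = sum(g * f_j * c_j
                   for g, f_j, c_j in zip(gammas, neighbor_f, neighbor_c))
    return f_t * c_prev + coherent + i_t * g_t
```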

To model long-term crowd dynamics, additional depth is added by stacking LSTMs on top of each other, i.e., using the output of the LSTM in the $(l-1)$th layer as the input to the LSTM in the $l$th layer, as illustrated in Fig. 4 and in the sketch below.
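Stacking then amounts to feeding each layer's hidden sequence to the next layer; a schematic loop, reusing the hypothetical `lstm_step` from Section 2.1:

```python
import numpy as np

def stacked_lstm(track, params_per_layer, hidden=128):
    """Run a stack of (c)LSTM layers over one tracklet; the hidden
    sequence of layer l-1 becomes the input sequence of layer l."""
    seq = list(track)
    for p in params_per_layer:
        h = c = np.zeros(hidden)
        out = []
        for x_t in seq:
            h, c = lstm_step(x_t, h, c, p)  # from the sketch in Sec. 2.1
            out.append(h)
        seq = out
    return seq  # top-layer hidden features, one per time step
```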

3 Crowd Scene Analysis based on cLSTM

In this section, we describe an unsupervised encoder-decoder framework that uses the coherent LSTM to generate a hidden feature embedding the inherent characteristics of each tracklet. It shares the idea of auto-encoders [Vincent et al., 2010] in that the parameters are optimized by minimizing the difference between the reproduced and target sequences. Critical tasks in crowd scene analysis are then conducted on the hidden features, e.g., future path forecasting, group state estimation, and crowd behavior classification.

Figure 4: We stack a series of coherent LSTM units to capture the time-varying nonlinear crowd dynamics, mapping tracklets in a coherent group to similar hidden features via the coherent regularization.

To learn an informative representation, we take the "encoder-decoder" approach inspired by [Donahue et al., 2015], which consists of an encoder coherent LSTM and a decoder coherent LSTM, as shown in Fig. 5. The encoder runs through a tracklet to produce a hidden feature, which the decoder then unrolls into a target sequence. Note that our model can both reproduce the input tracklets and forecast unseen future paths by exploiting the hidden features.

Figure 5: The encoder-decoder framework with coherent LSTM. The encoder generates a hidden feature for each tracklet, and the decoders reproduce the input tracklets and predict the future paths.

At the encoding stage, we obtain representation vectors for the tracklets with the encoder coherent LSTM, following Eq. (4):

$$h_T = \mathrm{cLSTM}_e(x_T, h_{T-1}), \quad (9)$$

Figure 6: Prediction of future paths with coherent regularization, in which the blue paths are reliable tracklets obtained from KLT trackers and the green ones are the paths forecast by the cLSTM.

where $\mathrm{cLSTM}_e$ is an encoding operation that maps the input tracklet to a hidden feature; $x_T$ and $h_{T-1}$ are the input tracklet at time step $T$ and the hidden feature vector at the previous time step $T-1$, respectively.

The reconstruction decoder generates an estimated tracklet $\hat{x}$ for each input tracklet, reproducing the input in reverse order to avoid long-range correlations:

$$\hat{x}_t = \mathrm{cLSTM}_{dr}(h_t, \hat{x}_{t+1}), \quad t \in [1, T], \quad (10)$$

where $\mathrm{cLSTM}_{dr}$ decodes the representation recursively to reproduce the input tracklets, and $h_t$ is derived from the hidden feature $h_T$ in Eq. (9). During training, the parameters $\{W, b\}$ are optimized by minimizing the reconstruction error between the input tracklets and the reproduced ones.

The prediction decoder is similar to the reconstruction decoder, except that it extrapolates future paths. Specifically, prediction is implemented by unrolling the hidden feature:

$$\hat{x}_t = \mathrm{cLSTM}_{dp}(h_t, \hat{x}_{t-1}), \quad t > T, \quad (11)$$

where $\mathrm{cLSTM}_{dp}$ is a decoding operation that predicts the future path of an agent from the hidden feature while taking the coherent motion into account. During training, each tracklet $\{x_t\}_{t=1}^{T}$ is divided into two segments, $\{x_t\}_{t=1}^{T_0}$ and $\{x_t\}_{t=T_0+1}^{T}$. The former segment is fed to the encoder to learn a hidden feature, which is then used to predict the latter segment by minimizing the difference between the original and estimated tracklets. The parameter $T_0$ adjusts the length ratio between the segments so that the inherent evolutionary dynamics are well captured.
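The overall pass over one tracklet can be sketched as follows; the step-function signatures are hypothetical, and the coherent terms are omitted for brevity:

```python
import numpy as np

def encode_decode(tracklet, T0, enc_step, dec_rec_step, dec_pred_step):
    """Sketch of the encoder-decoder pass over one tracklet.

    tracklet: array of shape (T, 2) with (x, y) keypoint positions.
    enc_step / dec_rec_step / dec_pred_step: cLSTM step functions
    (assumed signature: state, input -> state, output).
    """
    past, future = tracklet[:T0], tracklet[T0:]

    # Encoder: run through the observed segment to get h_T0 (Eq. 9).
    h = c = np.zeros(128)  # 128 hidden units, as in Section 4
    for x_t in past:
        (h, c), _ = enc_step((h, c), x_t)

    # Reconstruction decoder: reproduce the input in reverse (Eq. 10).
    recon, x_hat = [], past[-1]
    hd, cd = h, c
    for _ in range(T0):
        (hd, cd), x_hat = dec_rec_step((hd, cd), x_hat)
        recon.append(x_hat)

    # Prediction decoder: unroll beyond T0 into the future (Eq. 11).
    preds, x_hat = [], past[-1]
    hp, cp = h, c
    for _ in range(len(future)):
        (hp, cp), x_hat = dec_pred_step((hp, cp), x_hat)
        preds.append(x_hat)

    # Training would minimize reconstruction + prediction error here.
    return np.array(recon)[::-1], np.array(preds)
```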

The success of path prediction lies in the fact that the hidden states generated by the encoder capture the dynamics of the tracklets within a coherent group, which is further enhanced by the coherent regularization, as shown schematically in Fig. 6.

In summary, we capture the inherent crowd dynamics by mapping the tracklets to hidden features with the cLSTM model. With these features properly harnessed, we both reconstruct the input tracklets and predict future paths. A reconstruction-only encoder would tend to memorize the inputs, while a pure future predictor would tend to ignore the initial frames because the last few frames have the greatest impact. The proposed method therefore lets the two objectives enhance each other, yielding features that retain the dynamics embedded in the whole sequence rather than merely memorizing the information.

3.1 Crowd Scene Profiling

With the crowd dynamics properly captured by our coherent LSTM model, critical tasks in crowd scene analysis can be conducted on the learnt features, e.g., understanding group states and recognizing crowd behaviors.

The states of groups with coherent spatio-temporal structures are generally categorized as Gas, Solid, Pure Fluid, and Impure Fluid [Shao et al., 2014], and they are related to multiple social-psychological and physical factors, e.g., crowd density, goals, and interactions between group members. In this section, we implement group state estimation with the learnt representation, since it embeds the dynamics of coherent groups. Specifically, we feed the features learnt by the unsupervised cLSTM into a softmax classifier, which infers the group states from the dynamics of the tracklets within a coherent group as

$$\mathcal{L} = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{c=1}^{C} \mathbf{1}\{y_i(t) = c\} \log \frac{e^{\eta_c^{\top} h_i(t)}}{\sum_{c'=1}^{C} e^{\eta_{c'}^{\top} h_i(t)}}, \quad (12)$$

where $\mathbf{1}(\cdot)$ is an indicator function that equals 1 if the predicate is true and 0 otherwise; $\eta_c$ is a parameter vector that weights the hidden feature for class $c$; $h_i(t)$ is the hidden feature of the $i$th tracklet in the group at time step $t$; $N$ is the total number of tracklets in a coherent group; and $T$ is the tracklet length. The term $\log \sum_{c'=1}^{C} e^{\eta_{c'}^{\top} h_i(t)}$ normalizes the distribution to guarantee a valid probability. Reliable inference is conducted by maximizing this softmax regression objective.
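For concreteness, a small sketch of the objective in Eq. (12) over the learnt features; the array shapes and names are our assumptions:

```python
import numpy as np

def group_state_log_likelihood(H, y, eta):
    """Eq. (12): averaged log-likelihood of group states.

    H:   array (N, T, D) of hidden features h_i(t) from the cLSTM.
    y:   array (N, T) of integer state labels y_i(t) in [0, C).
    eta: array (C, D) of class weight vectors.
    """
    N, T, _ = H.shape
    logits = H @ eta.T                            # (N, T, C): eta_c^T h_i(t)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class for every (i, t).
    ll = log_probs[np.arange(N)[:, None], np.arange(T)[None, :], y]
    return ll.sum() / (N * T)
```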

Besides estimating the state of each individual group, we also perform holistic crowd video classification by training another softmax classifier over the sequential hidden features of all tracklets in a crowd video, producing a distribution over the holistic crowd behavior categories. The success in recognizing complex crowd behaviors rests on the fact that composing deep layers of cLSTMs yields a powerful capability to explore the nonlinearities and coherent spatio-temporal structures of crowd motions.

4 Experimental Results

In this section, we demonstrate the effectiveness of the features learnt by our algorithm on three critical applications in crowd scene analysis: pedestrian future path prediction, group state estimation, and crowd behavior classification. Evaluations are conducted on the CUHK Crowd Dataset [Shao et al., 2014], which includes crowd videos with different densities and perspective scales in many environments, e.g., streets and airports. Ground truth for keypoint tracklets, group states, and crowd video classes is available for the dataset, which consists of more than 400 sequences and 200,000+ tracklets in total.

In each experiment, we construct a coherent LSTM with 128 hidden units, so that the input tracklets are mapped to 128-dimensional hidden features. Similar to [Shao et al., 2014], we randomly select half of the tracklets in the sequences for training and use the rest for testing. When optimizing the parameters for future path prediction, we divide each tracklet into two segments and use the hidden features learnt from the first segment (e.g., 2/3 of each tracklet) to predict the latter segment (e.g., the remaining 1/3).

4.1 Future Path Forecasting of Pedestrians

We first test the performance of our framework on path forecasting against a Kalman filter baseline, which predicts paths by combining a linear dynamic model with the uncertainty of the current state; it can neither capture the nonlinear characteristics of complex crowd motions nor leverage information from coherent groups. We also run path prediction with a variant of the proposed cLSTM that drops the coherent regularization between neighboring agents, denoted Un-coherent LSTM. In each experiment, we take a fragment of each tracklet as input (e.g., 2/3 of the tracklet in this paper), reconstruct it, and then generate the remaining predicted portion (e.g., 1/3 of the tracklet) to evaluate the performance.
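For reference, such a baseline can be realized as a textbook constant-velocity Kalman filter; the paper does not specify its exact state model, so the following is a generic sketch with illustrative noise values:

```python
import numpy as np

def kalman_predict(track, n_future, dt=1.0, q=1e-2, r=1e-1):
    """Constant-velocity Kalman filter: filter a 2D track, then extrapolate.

    track: array (T, 2) of observed positions; q, r: process/measurement
    noise scales (illustrative values, not from the paper).
    """
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                  [0, 0, 1, 0], [0, 0, 0, 1]])   # linear dynamics
    Hm = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])  # observe position only
    Q, R = q * np.eye(4), r * np.eye(2)
    x = np.array([track[0, 0], track[0, 1], 0.0, 0.0])
    P = np.eye(4)
    for z in track[1:]:
        x, P = F @ x, F @ P @ F.T + Q            # predict
        S = Hm @ P @ Hm.T + R
        K = P @ Hm.T @ np.linalg.inv(S)          # Kalman gain
        x = x + K @ (z - Hm @ x)                 # update with measurement
        P = (np.eye(4) - K @ Hm) @ P
    future = []
    for _ in range(n_future):                    # pure linear extrapolation
        x = F @ x
        future.append(x[:2])
    return np.array(future)
```

Because the dynamics matrix is fixed and linear, this baseline cannot follow the nonlinear motion patterns discussed above, which is exactly what the comparison is meant to expose.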

Figure 7: Prediction of future paths with coherent regular-ization, in which the red paths are reliable tracklets obtainedfrom KLT trackers and the green ones are the forecasted pathsfrom the cLSTM.

Sample results are shown in Fig. 7, in which the red tracklets are the paths obtained with the KLT tracker [Baker and Matthews, 2004], followed by the green curves of tracklets generated with our cLSTM prediction model. The results demonstrate that our algorithm is capable of capturing the inherent dynamics of each tracklet, reflecting the tendencies of neighboring agents through the coherent regularization. Since near-future paths are well predicted, the method offers an opportunity to prevent serious incidents before they emerge.

In Table 1, we report the quantitative performance of path forecasting in terms of prediction error, which measures the average distance, in pixels, between the ground-truth tracklets and the estimated paths. Our approach significantly outperforms the alternative methods. Compared with the Kalman filter baseline, our cLSTM captures the inherent nonlinearity of crowd behavior, so complex crowd motions are predicted with high precision; moreover, our method also benefits from the coherent regularization, owing to the collective properties of moving crowds.

Table 1: Error of Path Prediction (pixels)

Kalman Filter:      9.32 ± 1.99
Un-coherent LSTM:   6.64 ± 1.76
Coherent LSTM:      4.37 ± 0.93

4.2 Group State Estimation

Group states reflect the characteristics of a crowd and are useful in various applications. In the CUHK Crowd Dataset [Shao et al., 2014], groups are classified into four states: Gas², Solid³, Pure Fluid⁴, and Impure Fluid⁵. Samples of groups in different states are illustrated in Fig. 8.

Figure 8: Samples of groups undergoing different states: (a) Gas; (b) Solid; (c) Pure Fluid; (d) Impure Fluid.

In this section, we train a softmax classifier on the hidden features learnt by our cLSTM and use it to estimate group states. For comparison, we also estimate group states from the descriptors learnt by variants of the proposed cLSTM model: the prediction LSTM, which only targets predicting future paths; the reconstruction LSTM, which drops the prediction component; and the un-coherent LSTM, which drops the coherent regularization. In addition, we report results based on collective transition [Shao et al., 2014], the state-of-the-art algorithm for group state estimation.

²Gas: particles move in different directions without forming collective behaviors.
³Solid: particles move in the same direction with relative positions unchanged.
⁴Pure Fluid: particles move in the same direction with ever-changing relative positions.
⁵Impure Fluid: particles move in a pure fluid style with invasion of particles from other groups.

Figure 9: Confusion matrices of group state estimation using different methods: (a) collective transition [Shao et al., 2014]; (b) prediction LSTM; (c) reconstruction LSTM; (d) un-coherent LSTM; and (e) coherent LSTM. See text for the description of each method.

The quantitative evaluation in terms of confusion matrices is reported in Fig. 9. Our proposed coherent LSTM (Fig. 9(e)) clearly outperforms the alternative methods. As a baseline, group state estimation based on collective transition [Shao et al., 2014] (Fig. 9(a)) explores the crowd dynamics via a linear transition matrix, which is not valid for groups with complex motion patterns, e.g., groups in an impure fluid state. Performance also degrades when the reconstruction or prediction component is removed, as in the prediction LSTM (Fig. 9(b)) and the reconstruction LSTM (Fig. 9(c)): the prediction LSTM tends to retain information from the last few frames rather than the whole tracklet, and the reconstruction LSTM tends to memorize the tracklets rather than explore their inherent dynamics. When group state estimation ignores the collective properties, as in the un-coherent LSTM (Fig. 9(d)), the performance is significantly inferior to our cLSTM, especially for groups with organized structures, e.g., those in a solid or fluid state.

4.3 Crowd Video Classification

Finally, we demonstrate the effectiveness of our method in classifying crowd videos according to the holistic crowd behaviors in a scene. In the CUHK Crowd Dataset [Shao et al., 2014], all video clips are annotated with 8 classes that are commonly seen in crowd videos: 1) highly mixed pedestrian walking; 2) crowd walking following a mainstream, well organized; 3) crowd walking following a mainstream, poorly organized; 4) crowd merge; 5) crowd split; 6) crowd crossing in opposite directions; 7) intervened escalator traffic; and 8) smooth escalator traffic.

As in group state estimation, we perform crowd video classification by feeding the representations learnt by our coherent LSTM into a softmax classifier. For comparison, crowd video classification is also implemented with the variants of the proposed cLSTM, namely the prediction LSTM, reconstruction LSTM, and un-coherent LSTM (see Section 4.2 for details), as well as with collective transition [Shao et al., 2014]. Fig. 10 reports the confusion matrix of crowd video classification based on our cLSTM, showing that videos are assigned to their categories with high accuracy.

Figure 10: Confusion matrix of crowd video classification based on coherent LSTM. See text for the name of each class.

In Fig. 11, we compare the per-class accuracy of crowd video classification across the methods mentioned above. The coherent LSTM generally outperforms the alternatives, since it captures the nonlinear dynamics embedded in complex crowd motion while simultaneously taking coherent motion into account, especially for crowds with high nonlinearity, e.g., highly mixed pedestrian walking. Our method significantly outperforms the collective transition method [Shao et al., 2014], which fails to model the nonlinear characteristics of crowd motion and neglects the coherent spatio-temporal structures in crowd videos. Note that videos of crowd walking following a mainstream, well organized are recognized well by collective transition, since the crowd motions in these videos mostly satisfy its linear dynamics assumption.

Our cLSTM also benefits from extracting hidden features by combining the reconstruction and prediction tasks: neither the prediction LSTM nor the reconstruction LSTM alone captures the inherent dynamics of the overall crowd motion, which degrades their video classification performance. Moreover, the coherent regularization is essential in scenarios where the collectiveness within a crowd is significant, e.g., crowd merge or split.

Figure 11: Per-class accuracy comparison of crowd videoclassification using different methods. See text for the nameof each class.

5 Conclusions

We present a novel recurrent neural network with coherent long short term memory (cLSTM) units for understanding crowd scenes. To address the nonlinear dynamics of complex crowd scenes, we map a set of tracklets that describe the crowd motion patterns to hidden features with an LSTM, which keeps track of each input tracklet. To account for the collective properties of moving crowds, we introduce a coherent regularization that updates the memory units with the dynamics of coherently moving tracklets, encouraging the learnt hidden features to have consistent spatial and temporal structures. With the inherent dynamics properly harnessed, our algorithm can also predict possible future paths, which is of great importance in crowd management for preventing serious accidents before they emerge. Extensive experiments on a real-world dataset demonstrate that our method outperforms its variants and alternative methods in group state estimation and crowd video classification.

References

[Baker and Matthews, 2004] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.

[Donahue et al., 2015] Jeffrey Donahue, Anne Hendricks, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.

[Greff et al., 2015] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.

[Helbing and Johansson, 2009] Dirk Helbing and Anders Johansson. Pedestrian, crowd and evacuation dynamics. Springer, 2009.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Kratz and Nishino, 2012] L. Kratz and K. Nishino. Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 987–1002, May 2012.

[Li et al., 2014] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(1):18–32, 2014.

[Li et al., 2015] Teng Li, Huan Chang, Meng Wang, Bingbing Ni, Richang Hong, and Shuicheng Yan. Crowded scene analysis: A survey. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 25(3):367–386, 2015.

[Lin et al., 2009] Dahua Lin, E. Grimson, and J. Fisher. Learning visual flows: A Lie algebraic approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 747–754, June 2009.

[Lin et al., 2016] Weiyao Lin, Yang Mi, Weiyue Wang, Jianxin Wu, Jingdong Wang, and Tao Mei. A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes. IEEE Transactions on Image Processing, 25(4):1674–1687, 2016.

[Lipton, 2015] Zachary C. Lipton. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.

[Mahadevan et al., 2010] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1975–1981, 2010.

[Massink et al., 2011] Mieke Massink, Diego Latella, Andrea Bracciali, and Jane Hillston. Modelling non-linear crowd dynamics in Bio-PEPA. In Fundamental Approaches to Software Engineering, pages 96–110, 2011.

[Mehran et al., 2009] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 935–942, June 2009.

[Mehran et al., 2010] Ramin Mehran, Brian E. Moore, and Mubarak Shah. A streakline representation of flow in crowded scenes. In European Conference on Computer Vision (ECCV), pages 439–452, 2010.

[Ramanathan et al., 2015] Vignesh Ramanathan, Kevin Tang, Greg Mori, and Li Fei-Fei. Learning temporal embeddings for complex video analysis. arXiv preprint arXiv:1505.00315, 2015.

[Schmidhuber, 2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[Shao et al., 2014] Jing Shao, Chen Change Loy, and Xiaogang Wang. Scene-independent group profiling in crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2227–2234, 2014.

[Shao et al., 2015] Jing Shao, Kai Kang, Chen Change Loy, and Xiaogang Wang. Deeply learned attributes for crowded scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.

[Simonyan and Zisserman, 2014] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.

[Solmaz et al., 2012] Berkan Solmaz, Brian Moore, and Mubarak Shah. Identifying behaviors in crowd scenes using stability analysis for dynamical systems. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pages 2064–2070, Oct. 2012.

[Srivastava et al., 2015] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.

[Su et al., 2013] Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, and Sha Wei. The large-scale crowd behavior perception based on spatio-temporal viscous fluid field. IEEE Transactions on Information Forensics and Security, 8(10):1575–1589, 2013.

[Sulman et al., 2008] Noah Sulman, Thomas Sanocki, Dmitry Goldgof, and Rangachar Kasturi. How effective is human video surveillance performance? In 19th International Conference on Pattern Recognition (ICPR), pages 1–8, 2008.

[Vincent et al., 2010] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research (JMLR), 11:3371–3408, 2010.

[Vinyals et al., 2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164, 2015.

[Wang et al., 2014] Weiyue Wang, Weiyao Lin, Yuanzhe Chen, Jianxin Wu, Jingdong Wang, and Bin Sheng. Finding coherent motions and semantic regions in crowd scenes: A diffusion and clustering approach. In European Conference on Computer Vision (ECCV), pages 756–771, 2014.

[Wu et al., 2010] Shandong Wu, Brian E. Moore, and Mubarak Shah. Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2054–2060, June 2010.

[Zhou et al., 2011] Bolei Zhou, Xiaogang Wang, and Xiaoou Tang. Random field topic model for semantic region analysis in crowded scenes from tracklets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2011.

[Zhou et al., 2012a] Bolei Zhou, Xiaoou Tang, and Xiaogang Wang. Coherent filtering: Detecting coherent motions from crowd clutters. In European Conference on Computer Vision (ECCV), pages 857–871, 2012.

[Zhou et al., 2012b] Bolei Zhou, Xiaogang Wang, and Xiaoou Tang. Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2871–2878, June 2012.

[Zhou et al., 2014] Bolei Zhou, Xiaoou Tang, Hepeng Zhang, and Xiaogang Wang. Measuring crowd collectiveness. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(8):1586–1599, 2014.