
A Hierarchical Spatio-Temporal Model for Human Activity Recognition

Wanru Xu, Zhenjiang Miao, Member, IEEE, Xiao-Ping Zhang, Senior Member, IEEE, and Yi Tian

Abstract—There are two key issues in human activity recognition: spatial dependencies and temporal dependencies. Most recent methods focus on only one of them, and thus do not have sufficient descriptive power to recognize complex activity. In this paper, we propose a hierarchical spatio-temporal model (HSTM) to solve the problem by modeling spatial and temporal constraints simultaneously. The new HSTM is a two-layer hidden conditional random field (HCRF), where the bottom-layer HCRF aims at describing spatial relations in each frame and learning more discriminative representations, and the top-layer HCRF utilizes these high-level features to characterize temporal relations in the whole video sequence. The new HSTM takes advantage of the bottom layer as the building blocks for the top layer and it aggregates evidence from local to global level. A novel learning algorithm is derived to train all model parameters efficiently and its effectiveness is validated theoretically. Experimental results show that the HSTM can successfully classify human activities with higher accuracies on single-person actions (UCF) than other existing methods. More importantly, the HSTM also achieves superior performance on more practical interactions, including human–human interactional activities (UT-Interaction, BIT-Interaction, and CASIA) and human–object interactional activities (Gupta video dataset).

Index Terms—Activity recognition, hidden conditional random field (HCRF), hierarchical structure, spatio-temporal dependencies.

I. INTRODUCTION

WITH the explosive growth of multimedia data, especially videos, there is a heavy demand for automatic multimedia analysis. Among its tasks, it is particularly meaningful to understand "what is he doing" or "what does he want to do" for human beings. Under this background, vision-based human activity recognition has played an important role in multimedia

Manuscript received September 15, 2016; revised January 15, 2017; accepted February 16, 2017. Date of publication February 24, 2017; date of current version June 15, 2017. This work was supported in part by the Natural Science Foundation of China under Grant 61672089, Grant 61273274, Grant 61572064, Grant PXM2016_014219_000025, and (973 Program) Grant 2011CB302203, in part by the National Key Technology R&D Program of China under Grant 2012BAH01F03 and Grant NSFB4123104, and in part by the Engineering Research Council of Canada under Grant RGPIN239031. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Shu-Ching Chen.

W. Xu, Z. Miao, and Y. Tian are with Beijing Jiaotong University, Beijing 100044, China (e-mail: [email protected]; [email protected]; [email protected]).

X.-P. Zhang is with the Department of Electrical & Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2017.2674622

content analysis and it has also contributed to other multimedia applications. Human activity recognition is among the most active research areas in computer vision, with many applications [1], [2] such as human-computer interaction, video retrieval and video surveillance. There is a fine distinction between the meanings of action and activity according to the definition in [3]. An action is characterized by simple motion patterns typically executed by a single human, while an activity is more complex and involves coordinated actions or interactional objects among a small number of humans. Recently, there has been a considerable number of works focusing on action analysis of a single person, but it is desirable to recognize complex interactions in the real world. In this paper, we propose a unified framework to model human actions and activities, especially various interactional activities.

Based on traditional pattern recognition, some simple models [4] can be utilized to classify actions by analyzing certain statistical features over the whole action without considering their temporal and spatial distributions. Actions always occur in three-dimensional space during a period of time. In the temporal domain, there exist strong time-sequence relations between neighboring frames, and body parts also affect each other in the spatial domain. From this perspective, recent human activity recognition methods can be classified into two categories, each with its own strengths and weaknesses. The first category can be called the feature based method, which focuses on extracting or learning spatio-temporal features to incorporate spatial and temporal information and then summarizes all frames of a video sequence into a single representation for feature matching [4]–[8]. The second can be called the model based method, which directly models the spatio-temporal dynamics of human activity and then matches the learned model against the observation; it can be subdivided into two categories depending on which internal relation (spatial or temporal) is considered. The temporal relation modeling method pays special attention to temporal dependencies [9]–[11], while the spatial relation modeling method aims at modeling dependencies in the spatial domain [12]–[15]. In general, model based methods can provide good performance, but they lead to high computational cost. Feature based methods can be calculated simply and rapidly, while most of them are sensitive to noise and lack generalization ability, which is to say it is difficult to extend them to other datasets.

Feature based method: Some conceptually weaker models such as those applying space-time interest points (STIPs) [16] to bag-of-features (BOF) [4] have achieved promising performance on object detection and even human action recognition. In [4], an action is represented as a collection of STIPs and only their labels are retained to compute a histogram characteristic. These approaches completely disregard the spatio-temporal relations among these local features.


Fig. 1. Role of spatial dependencies and temporal dependencies in human activity recognition. Two sample frames from two video clips are shown in (a) and (b) separately. It is often hard to distinguish activities from each frame alone. However, if we look at the consecutive frames, we can easily recognize that the person in (c) is drinking and the person in (d) is pouring. We segment two individual persons from two interactional activities as shown in (e) and (f). It is often hard to distinguish activities from each individual person alone. However, if we look at the whole scene, we can easily recognize that the person in (g) is hugging and the person in (h) is pushing. It shows that spatial dependencies and temporal dependencies are two key elements in modeling human activity.

Although the BOF is easy to implement, such orderless and naive models lose too much information, since an interest point is not isolated but surrounded by its spatio-temporal context. Therefore, improved approaches [5], [6], [8], [17] have been proposed for integrating some hand-designed contextual information into the BOF. Recently, several mid-level features [18]–[21] have been extracted to improve discriminative power and enhance spatio-temporal descriptive capability.

Model based method: Probabilistic graphical models can provide a consistent representation for both spatial and temporal structures in image and video processing [22]–[28]. To learn the temporal dependencies between consecutive frames, numerous approaches have been proposed based on hidden Markov models (HMMs) [9], [29]. Conditional random fields (CRFs) and their variation, hidden conditional random fields (HCRFs) [14], [15], are widely used to build the structure of local patches, since they relax the assumption of conditional independence of the observed data. In recent years, there have been two common ways to classify activities with CRFs/HCRFs, which focus on modeling spatial and temporal structures respectively. On the one hand, for modeling temporal structures, relations between consecutive frames are captured by a discriminative graph [10], [11] whose input is human pose. Though such methods have semantic features and an intuitive construction, pose estimation is still a challenge. On the other hand, the HCRF [12], [13] is utilized to automatically discover spatial dependencies among local patches. These approaches model every frame of a video sequence independently and obtain the class label for the whole video by majority voting. In contrast with pose, local features can be easily extracted and deal with occlusion well. However, the correlation of neighboring frames is neglected and the global descriptive capacity is inadequate in these approaches.

Looking at the two persons in Fig. 1(a) and 1(b) and Fig. 1(e) and 1(f), it is difficult to tell that they are doing two different activities. After observing the consecutive frames Fig. 1(c) and 1(d) or the whole scene Fig. 1(g) and 1(h), we can easily recognize them. It shows that spatial dependencies and temporal dependencies are two key elements in modeling human activity, which play a decisive role in recognition. Due to its time-space attribute, a human activity cannot be completely represented by only spatial constraints or only temporal constraints.

To address these limitations, we propose a hierarchical spatio-temporal model (HSTM), where the bottom layer aims at describing spatial relations in each frame and the top layer is utilized to characterize temporal relations in the whole video sequence. In fact, it is a hierarchical model that utilizes the bottom layer as the building blocks for the top layer and aggregates evidence from local to global level. To our knowledge, this proposed new method is the first model based method to jointly model the spatio-temporal dependencies of human activity using a hierarchical probabilistic graphical model.

There are three main contributions in this paper. First, we integrate hierarchical and structural information into the interpretation process by constructing the HSTM on the spatial scale and the temporal scale, such that all levels of spatio-temporal relations are combined to enhance descriptive capability. Second, raw observations are converted to high-level semantic representations by the bottom layer to combine the flexibility of local features with the discriminative power of global features in a consistent multi-layer framework. Last, we derive a novel learning algorithm to train all parameters efficiently and effectively, and it improves classification ability by jointly measuring the spatio-temporal similarity of activities. It is a unified framework to model both one-person actions and multi-person activities. Its evaluations are conducted on three types of datasets. The first is a one-person action dataset (UCF), the second comprises complex human-human interactional activity datasets (UT-Interaction, BIT-Interaction and CASIA), and the last is a human-object interaction dataset (Gupta video dataset). Experimental results show that the HSTM can successfully recognize all of them with high accuracies.

The rest of this paper is organized as follows: The second section focuses on related works. The third section details the structure of the novel HSTM. The training and inference algorithms


of the HSTM are given in Section IV, together with some theoretical verification. Section V reports experiments using five human activity datasets with various tasks. Section VI concludes this paper.

II. RELATED WORKS

Hierarchical models can construct long-range and multi-resolution dependencies. Multiple layers can not only improve discriminative power, but also enhance descriptive capability. Therefore, various hierarchical feature learning algorithms [30]–[36] and hierarchical classification models [37]–[42] have been proposed. In this paper, the HSTM is also inspired by the concept of hierarchy and we integrate feature learning into our hierarchical classification model simultaneously.

Hierarchical feature learning: Recent works have shown that learning hierarchical features leads to a significant improvement in various computer vision fields via deep architectures with multiple hidden layers [30]–[35] or spatial pyramids of image patches [36]. An extension of the independent subspace analysis algorithm combined with deep learning techniques is presented in [30] to learn invariant and hierarchical spatio-temporal features from unlabeled data. A hierarchical sequence summarization method [33] is introduced for action recognition, which learns multiple layers of discriminative features at different temporal resolutions. High-level descriptions called interactive phrases [43], [44] are introduced to describe human interactions, which are learned by a novel hierarchical model based on the latent SVM framework. A new representation, hierarchical movemes [45], is proposed to represent human movements at multiple temporal levels of granularity, ranging from atomic movements to coarser movements. In addition, the popular convolutional neural network (CNN) [32], [34] is also utilized to automatically learn convolutional feature representations for human action recognition. Since hierarchical feature learning can replace hand-designed features with more robust and portable learned features, we adopt this main concept in the HSTM and learn hierarchical features to capture more global or higher-level representations.

Hierarchical classification model: Many existing works focus on recognition of atomic actions with simple single-layer models. However, human activity often involves various interactions among persons, objects and groups in real-world scenarios. Hierarchical classification models [37]–[42] can be used to classify complex interactional activities due to their strong constructive modeling capability. The hierarchical hidden Markov model (HHMM) [41] is introduced to exploit the natural hierarchical decomposition, which extends the traditional HMM in a hierarchical manner to contain multiple layers of hidden states. The coupled hidden Markov model (CHMM) [46] provides a way to model multiple interactional processes, coupling HMMs with temporal, asymmetric conditional probabilities. The coupled observation decomposed hidden Markov model (CODHMM) [47] is presented to model multi-person activities, and the two levels in human interactions, namely the individual level and the interaction level, are modeled by two hidden Markov chains. A novel Bayes network with multiple hidden nodes [48] is proposed to simultaneously estimate the scene object, scene type, human action, and manipulable object probabilities for human-object interaction. A graph [49] is used to represent the hierarchy of tasks that compose a workflow, and each task is modeled by a separate HMM. Such generative models have strict independence hypotheses and lack the ability to model complex observed data.

From this perspective, discriminative models should have more potential. A novel discriminative multi-scale model [50] based on the structured SVM is introduced for predicting the activity from a partially observed video. For jointly modeling the individual person actions, the group activity, and their interactions, a novel discriminative latent framework [51] is employed to explore group-person interaction and person-person interaction contextual information. In addition, several hierarchical CRFs/HCRFs [37], [38] have been proposed for image labeling, segmentation and classification. A novel deep discriminative structured model, called the convolutional neural random field (CNRF) [52], is proposed to combine the advantages of both the CRF and the CNN, where the CNN is developed for feature learning and the CRF is applied for capturing the temporal dependencies. Although such multi-layer structures strengthen the CRF/HCRF by addressing complex dependencies, there still exist two limitations. First, some local classifiers are required to be trained in advance and their probability outputs are treated as the unary and even the pairwise potentials. Second, the relationships between different layers are separated, such that only the top layer contributes to the final result and the intermediate layers just play a role of feature learning.

Compared with hierarchical generative models [41], [46]–[48], the proposed HSTM is a discriminative model with better classification ability and more relaxed independence assumptions. While previous works either use multiple layers on various spatial quantizations [37], [38], [48], [51] or construct hierarchical layers at different temporal resolutions [33], [50], [53], the proposed HSTM is "doubly hierarchical" and jointly models different spatio-temporal scales. Rather than relying on priors alone [45], [48], all parameters in the HSTM are trained jointly by a bottom-up learning strategy and no other classifier is required. This training strategy not only limits the computational complexity, but also enables all layers to contribute to the final classification decision. That is to say, different from the method in [52], where only the temporal similarity is measured, both the spatial and the temporal similarity of activities are jointly measured in the HSTM. Due to such hierarchical connection, the mutual influence of different levels can be completely represented and the two layers can complement each other as well. For example, when the pose of "stand" appears in one frame, we can correctly recognize it by combining the adjacent frames with the top HCRF. Conversely, when a specific pose is found by the bottom HCRF, it also provides strong evidence for the top layer.

III. A NOVEL FRAMEWORK OF HIERARCHICAL SPATIO-TEMPORAL MODEL

Though great performance has been achieved by the single-layer HCRF for action recognition, the disadvantages associated with its simple structure are also noticeable: it is difficult to capture discriminative and representative information with a model that has only one layer. Therefore, we propose a hierarchical spatio-temporal model to recognize complex spatio-temporal dynamics in activities.

The overview of our approach is illustrated in Fig. 2. Our two-layer HSTM model consists of a spatial layer and a temporal layer. Intuitively, the bottom layer is used as a spatial summary of each frame, while the top layer is a temporal description of the whole video. First, raw features are extracted as inputs of the bottom layer to train a Tree-HCRF for each frame in the spatial domain. Second, two kinds of transmitted information between layers are calculated, including the learned high-level features


Fig. 2. Overview of our approach. (a) Input video. (b) STIP. (c) Tree-HCRF. (d) Transferred message. (e) Chain-HCRF.

and the frame-level marginal posterior probabilities. Finally, a Chain-HCRF is constructed in the temporal domain, combining all levels of evidence (learned high-level semantic features, frame-level features and video-level features) to obtain the class label of the whole action sequence.

A. Problem Formulation

The task is to predict an activity label y from the observations X. For visual activity recognition, activities can be modeled on two levels:

1) The spatial level: The local patch-level features in each frame are extracted as inputs of the spatial layer to model the spatial dependencies among these local patches.

2) The temporal level: Some global features and learned high-level features are treated as inputs of the temporal layer to model the temporal dependencies in the whole video sequence.

In the spatial level, since numerous local patches are extracted in each frame, the spatial observations $X^S_t$ at time $t$ can be treated as a combination of

$$X^S_t = [X^S_{1,t}, X^S_{2,t}, \ldots, X^S_{N_t,t}] \tag{1}$$

where $N_t$ is the number of space-time interest points extracted in the $t$-th frame and $X^S_{n,t}$ represents the local observation of the $n$-th STIP at time $t$. There are many intra-variations in human activities, namely an activity may contain many different intermediate states. Therefore, a set of hidden variables is introduced to capture these intra-variations and model complex dependencies among observations. For the $t$-th frame, there exists a one-to-one correspondence between an observation and its hidden variable

$$h^S_t = \{[h^S_{1,t}, h^S_{2,t}, \ldots, h^S_{N_t,t}] \mid h^S_{n,t} \in \mathcal{H}^S\}. \tag{2}$$

Each patch-level observation $X^S_{n,t}$ is associated with a hidden variable $h^S_{n,t}$, whose value is a member of the spatial hidden state set $\mathcal{H}^S$. Intuitively, $h^S_{n,t}$ assigns a "body part label" to the patch $X^S_{n,t}$, where a "body part label" associates with certain local motion and appearance patterns distinctive of certain activities.

In the temporal level, the temporal features $X^T$ include the high-level semantic feature $X^L_t$ learned from the spatial layer, the trajectory-based [54] frame-level feature $X^F_t$ and the video-level feature $X^V$. They are considered as global observations to summarize each frame and the whole video, respectively. Similarly, each temporal observation $X^T_t$ is associated with a hidden variable $h^T_t$, which is assigned a state from the temporal hidden state set $\mathcal{H}^T$. Intuitively, $h^T_t$ assigns a "pose label" to the frame $X^T_t$, where a "pose label" corresponds to certain global motion and appearance patterns distinctive of certain activities.

For a video sequence with $K$ frames, based on the two levels, the observations $X$ can be decomposed as

$$X = \{[X^S_t, X^T_t]_{t=1}^{K}, X^V\} = \{[X^S_t, X^L_t, X^F_t]_{t=1}^{K}, X^V\} \tag{3}$$

and the hidden variables $h$ can be decomposed into two parts

$$h = [h^S_t, h^T_t]_{t=1}^{K}. \tag{4}$$

In fact, it is a fine-to-coarse procedure, where the spatial level details relations within a frame to infer each "pose label", while the temporal level modifies the inference and summarizes relations between frames to infer the "activity label". As such, it avoids making a classification decision when seeing only a small piece of the whole or when just looking at the whole and omitting details.
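To make the decomposition in (3) and (4) concrete, the following minimal sketch (not from the paper) shows one way the per-video observations could be laid out in code; the container names and the feature dimensionalities (e.g., a 162-dimensional HOG/HOF descriptor) are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class FrameObservation:
    """Observations attached to one frame t (hypothetical container)."""
    patches: np.ndarray                      # X^S_t: (N_t, D_S) local STIP descriptors
    frame_feat: np.ndarray                   # X^F_t: (D_F,) trajectory-based frame feature
    high_level: Optional[np.ndarray] = None  # X^L_t: filled in later by the spatial layer

@dataclass
class VideoObservation:
    """Observations X for a whole activity video, as decomposed in (3)."""
    frames: List[FrameObservation]           # one entry per frame, t = 1..K
    video_feat: np.ndarray                   # X^V: (D_V,) video-level feature

# Toy example: a 3-frame video with a varying number of STIPs per frame.
rng = np.random.default_rng(0)
video = VideoObservation(
    frames=[FrameObservation(patches=rng.normal(size=(int(rng.integers(4, 9)), 162)),
                             frame_feat=rng.normal(size=64))
            for _ in range(3)],
    video_feat=rng.normal(size=128),
)
print(len(video.frames), video.frames[0].patches.shape)
```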

The HSTM is a discriminative model which directly estimates the probability of the output conditioned on the whole observations. Our aim is to learn a mapping from X to y which matches the training data well. Given these definitions of the class label y, the observations X and the hidden variables h, the conditional probabilistic model can be formulated as

$$P(y \mid X; \theta) = \frac{1}{Z} \sum_{h \in \mathcal{H}} \exp\big(E(y, h, X; \theta)\big). \tag{5}$$

Here $E(y, h, X; \theta)$ represents the potential function parameterized by $\theta$, which can model various relations among variables depending on its definition and the model structure. The parameters $\theta = \{\theta^S_v, \theta^T_v, \theta^S_l, \theta^T_l, \theta^S_e, \theta^T_e, \theta^T_o\}$ represent the compatibilities between two variable sets, including the parameters of the root potential ($\theta^T_o$), the feature-related potentials ($\theta^S_v, \theta^T_v$), the structure-related potentials ($\theta^S_e, \theta^T_e$) and the activity-related potentials ($\theta^S_l, \theta^T_l$); the details are introduced in the following subsection. The partition function is defined as

$$Z = \sum_{h \in \mathcal{H},\, y \in \mathcal{Y}} \exp\big(E(y, h, X; \theta)\big) \tag{6}$$

which plays the role of normalization so that (5) is a valid probability measure.
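As a sanity check on (5) and (6), the following illustrative sketch computes P(y | X; θ) by brute-force enumeration over all hidden configurations for a toy energy function; the actual HSTM replaces this enumeration with belief propagation over its tree and chain structures, and the generic `energy` callable here simply stands in for E(y, h, X; θ).

```python
import itertools
import numpy as np

def conditional_prob(energy, labels, hidden_states, n_hidden):
    """P(y | X; theta) from (5)-(6) by exhaustive enumeration (toy sizes only).

    energy(y, h) -> float plays the role of E(y, h, X; theta); X and theta
    are assumed to be closed over by the caller.
    """
    scores = {}
    for y in labels:
        # Unnormalized score for each label: sum_h exp(E(y, h, X; theta))
        s = 0.0
        for h in itertools.product(hidden_states, repeat=n_hidden):
            s += np.exp(energy(y, h))
        scores[y] = s
    Z = sum(scores.values())              # partition function (6)
    return {y: s / Z for y, s in scores.items()}

# Toy example: 2 labels, 3 hidden variables, 2 hidden states, random energies.
rng = np.random.default_rng(1)
table = rng.normal(size=(2, 2, 2, 2))     # indexed by (y, h1, h2, h3)
energy = lambda y, h: table[(y,) + tuple(h)]
print(conditional_prob(energy, labels=[0, 1], hidden_states=[0, 1], n_hidden=3))
```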

In the following, a two-layer model is developed to represent the observation sequence, and we call it the hierarchical spatio-temporal model (HSTM). Our activity recognition framework focuses on three aspects: constructing the model structure and defining the corresponding potential functions, learning the high-level features from the bottom layer, and estimating the model parameters in a supervised way.

Important notations in this model are summarized as follows.


Fig. 3. (a) The graphical representation of the HSTM. The HSTM incorporates two layers of hidden nodes, denoted as the spatial layer and the temporal layer, which are connected by black lines and blue lines, respectively. In addition, there exists a feature learning process between the two layers, depicted by the red dashed lines. Gray nodes represent observed variables, white nodes denote unobservable variables, and shadowed nodes are the variables to be learned. The various potential functions with corresponding parameters are labeled; the details are given in Section III. The HSTM describes activities by all levels of features and models activities in the spatial and the temporal domains jointly. (b) An illustration of the HSTM. In this example, the human activity "pouring" is represented by the HSTM from spatial to temporal and from local to global.

$\mathcal{Y}$: the activity label set, whose size is $N_y$; $y, \hat{y} \in \mathcal{Y}$.
$K$: the number of frames in an activity video.
$N_t$: the number of local STIPs extracted from the $t$-th frame.
$N^T$: the size of the temporal hidden state set.
$N^S$: the size of the spatial hidden state set.
$\omega^T_n$: the $n$-th temporal hidden state in $\mathcal{H}^T$.
$\omega^S_n$: the $n$-th spatial hidden state in $\mathcal{H}^S$.
$\mathcal{H}^T$: the temporal hidden state set, $\mathcal{H}^T = \{\omega^T_1, \ldots, \omega^T_{N^T}\}$.
$\mathcal{H}^S$: the spatial hidden state set, $\mathcal{H}^S = \{\omega^S_1, \ldots, \omega^S_{N^S}\}$.
$\mathcal{H}$: the whole hidden state set, $\mathcal{H} = \{\mathcal{H}^S, \mathcal{H}^T\}$.
$X^V$: the video-level observation of the whole activity.
$X^L_t$: the learned high-level semantic observation at time $t$.
$X^F_t$: the frame-level observation at time $t$.
$X^T_t$: the temporal observation at time $t$, $X^T_t = \{X^L_t, X^F_t\}$.
$X^S_{i,t}$: the $i$-th observation in the spatial layer at time $t$.
$X^S_t$: the observations in the spatial layer at time $t$, see (1).
$X^T$: the observations in the temporal layer of an activity.
$X$: the observations of an activity video sequence, see (3).
$h^T_t$: the hidden variable in the temporal layer at time $t$.
$h^S_{n,t}$: the $n$-th hidden variable in the spatial layer at time $t$.
$h^S_t$: the spatial hidden variables at time $t$, see (2).
$h^T$: the temporal hidden variables of an activity video.
$h$: the hidden variables of an activity sequence, see (4).

B. Hierarchical Spatio-Temporal Model (HSTM)

The graphical representation of the HSTM proposed in this paper is depicted in Fig. 3. The spatial relations of local patches within a frame are captured by a tree-structured graph with nodes from the spatial layer. To capture the temporal relations between neighboring frames, hidden nodes in the temporal layer are connected as a chain. We show two HCRFs with different topologies in Fig. 2(c) and 2(e), utilized to encode spatial and temporal dependencies respectively. Due to such connection between the two layers, the bottom layer is treated as the building blocks for the top layer. Moreover, a feature learning process is adopted to convert raw observations to high-level semantic representations, equivalent to aggregating evidence from local to global level.

To avoid making a decision about which layer is more appropriate for activity recognition, both of them contribute to the final classification result. It can be noted that, besides the temporal layer, there also exists a connection between the spatial layer and the class label y. Therefore, the potential function of the HSTM is defined in the form of a sum

$$E(y, h, X; \theta) = E^T(y, h^T, X^T; \theta^T) + \sum_{t=1}^{K} E^S(y, h^S_t, X^S_t; \theta^S) \tag{7}$$

where we denote the parameters of the spatial and the temporal layers as $\theta^S$ and $\theta^T$, respectively. The definitions of the spatial and temporal potentials are given as follows:

$$E^S(y, h^S_t, X^S_t; \theta^S) = \sum_{i=1}^{N_t} E^S_v(h^S_{i,t}, X^S_{i,t}; \theta^S_v) + \sum_{(i,j) \in E_t} E^S_e(y, h^S_{i,t}, h^S_{j,t}; \theta^S_e) + \sum_{i=1}^{N_t} E^S_l(y, h^S_{i,t}; \theta^S_l) \tag{8}$$

$$E^T(y, h^T, X^T; \theta^T) = E^T_o(X^V, y; \theta^T_o) + \sum_{t=1}^{K} E^T_v(h^T_t, X^T_t; \theta^T_v) + \sum_{t=2}^{K} E^T_e(y, h^T_{t-1}, h^T_t; \theta^T_e) + \sum_{t=1}^{K} E^T_l(y, h^T_t; \theta^T_l) \tag{9}$$

where $i, j, t$ index the vertexes in the HSTM and $E_t$ denotes the edge set in the spatial layer at time $t$. The details of these potentials, including the root potential ($E_o$), the feature-related potentials ($E_v$), the structure-related potentials ($E_e$) and the activity-related potentials ($E_l$), are introduced in the following subsection.
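The sketch below illustrates how the decomposition in (7)–(9) could be evaluated for one candidate label and one full hidden assignment, treating each potential as a simple lookup or dot product in the spirit of (10)–(16); all dictionary keys, array shapes and toy parameters are assumptions made for illustration only.

```python
import numpy as np

def total_energy(y, hS, hT, X, theta, edges):
    """E(y, h, X; theta) from (7): temporal-layer energy plus per-frame spatial energy.

    hS[t]: list of spatial hidden states for frame t; hT[t]: temporal state of frame t.
    X holds 'patches' (per-frame arrays), 'frame' (temporal features X^T_t) and 'video' (X^V).
    theta holds the templates theta^S_v, theta^S_e, theta^S_l, theta^T_v, theta^T_e,
    theta^T_l, theta^T_o; edges[t] lists the tree edges of frame t.
    """
    K = len(hT)
    E = X["video"] @ theta["To"][:, y]                    # root potential, cf. (14)
    for t in range(K):
        E += X["frame"][t] @ theta["Tv"][:, hT[t]]        # temporal feature potential, cf. (11)
        E += theta["Tl"][y, hT[t]]                        # temporal activity potential, cf. (13)
        if t > 0:
            E += theta["Te"][y, hT[t - 1], hT[t]]         # temporal structure potential, cf. (16)
        for i, h in enumerate(hS[t]):
            E += X["patches"][t][i] @ theta["Sv"][:, h]   # spatial feature potential, cf. (10)
            E += theta["Sl"][y, h]                        # spatial activity potential, cf. (12)
        for i, j in edges[t]:
            E += theta["Se"][y, hS[t][i], hS[t][j]]       # spatial structure potential, cf. (15)
    return E

# Toy usage (shapes: D_S=5, D_T=4, D_V=3, N_S=2, N_T=2, N_y=2).
rng = np.random.default_rng(0)
theta = {"Sv": rng.normal(size=(5, 2)), "Se": rng.normal(size=(2, 2, 2)),
         "Sl": rng.normal(size=(2, 2)), "Tv": rng.normal(size=(4, 2)),
         "Te": rng.normal(size=(2, 2, 2)), "Tl": rng.normal(size=(2, 2)),
         "To": rng.normal(size=(3, 2))}
X = {"patches": [rng.normal(size=(3, 5)) for _ in range(2)],
     "frame": [rng.normal(size=4) for _ in range(2)], "video": rng.normal(size=3)}
print(total_energy(y=0, hS=[[0, 1, 0], [1, 1, 0]], hT=[0, 1], X=X, theta=theta,
                   edges=[[(0, 1), (1, 2)], [(0, 2), (1, 2)]]))
```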

In general, the advantage of the HSTM is that the two layers are relatively independent for training and integrated uniformly for modeling. First, the bottom layer not only plays a role of feature learning, but also provides evidence for classification. In other words, the information transmitted from bottom to top consists of two parts: the learned high-level features and the frame-level marginal posterior probabilities. Second, there is no connection between hidden nodes from different layers, so that the complexity can be kept linear in the number of layers by a bottom-up learning strategy.


1) Potential Functions of the HSTM: In the HSTM, there are four kinds of potential functions: the feature-related potential, the activity-related potential, the root potential and the structure-related potential, where the first three are unary potentials and the last one is a pairwise potential. The unary and pairwise potentials enable the HSTM to capture arbitrary dependencies and have different effects on modeling. In some cases, the discriminative power of the unary potentials is important for recognition. The goal of the pairwise potentials is to take some structural information into account, so that the complex spatio-temporal dynamics in activities can be integrated. In this paper, we assume that, given the features, each potential function is linear in its corresponding parameter.

Feature-related potential: models the semantic relationships between features and hidden vertexes. We model the agreement between features and hidden states by incorporating the spatial feature-related potential and the temporal feature-related potential into the HSTM.

– The spatial feature-related potential

$$E^S_v(h^S_{i,t}, X^S_{i,t}; \theta^S_v) = \sum_{n=1}^{N^S} X^S_{i,t} \cdot \mathbf{1}\{h^S_{i,t} = \omega^S_n\} \cdot \theta^S_{vn} \tag{10}$$

where $\mathbf{1}\{h^S_{i,t} = \omega^S_n\}$ denotes an indicator function, which is 1 if $h^S_{i,t} = \omega^S_n$ and 0 otherwise. We apply a template $\theta^S_v$ of size $D^S \times N^S$ to weigh the different importance of the elements in the feature for each hidden state, where $D^S$ is the spatial feature dimension, and $X^S_{i,t} \cdot \theta^S_{vn}$ can be interpreted as how likely the local patch $X^S_{i,t}$ is to be assigned the hidden state $\omega^S_n$.

– The temporal feature-related potential

$$E^T_v(h^T_t, X^T_t; \theta^T_v) = \sum_{n=1}^{N^T} X^T_t \cdot \mathbf{1}\{h^T_t = \omega^T_n\} \cdot \theta^T_{vn}. \tag{11}$$

The size of the template $\theta^T_v$ is $D^T \times N^T$, where $D^T$ is the temporal feature dimension, and $X^T_t \cdot \theta^T_{vn}$ describes how likely the frame $X^T_t$ is to be assigned the hidden state $\omega^T_n$.
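As a small illustration of (10) and (11) (not the authors' code), the feature-related potential for a single patch or frame reduces to a matrix-vector product with its template, yielding one score per candidate hidden state; the indicator function simply selects the entry of the assigned state.

```python
import numpy as np

def feature_potential_scores(x, theta_v):
    """Scores x . theta_vn for all hidden states n, as in (10)/(11).

    x: (D,) observation (a local patch X^S_{i,t} or a frame feature X^T_t).
    theta_v: (D, N) template; column n weighs the feature dimensions for state n.
    Returns a length-N vector; E_v(h = n, x) is simply its n-th entry.
    """
    return x @ theta_v

rng = np.random.default_rng(0)
x = rng.normal(size=6)                  # a 6-dimensional toy descriptor
theta_v = rng.normal(size=(6, 3))       # template for 3 hidden states
scores = feature_potential_scores(x, theta_v)
h = int(np.argmax(scores))              # most compatible "body part"/"pose" label
print(scores, h)
```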

Activity-related potential: is used to evaluate the compatibility between activities and hidden vertexes. The potential encodes class-specific discriminative hidden variable information, which is of great importance in recognition. Intuitively, this potential expresses, given an activity and without observing any low-level features, which state is more likely to occur for a patch or a frame.

– The spatial activity-related potential

$$E^S_l(y, h^S_{i,t}; \theta^S_l) = \sum_{n=1}^{N^S} \sum_{\hat{y} \in \mathcal{Y}} \mathbf{1}\{y = \hat{y}\} \cdot \mathbf{1}\{h^S_{i,t} = \omega^S_n\} \cdot \theta^S_{l\hat{y}n}. \tag{12}$$

Clearly, the state of a patch is related to the activity class. Therefore, the size of the template $\theta^S_l$ is $N_y \times N^S$, and its entry $\theta^S_{l\hat{y}n}$ measures how likely it is that the activity $\hat{y}$ contains a local patch with hidden state $\omega^S_n$.

– The temporal activity-related potential

$$E^T_l(y, h^T_t; \theta^T_l) = \sum_{n=1}^{N^T} \sum_{\hat{y} \in \mathcal{Y}} \mathbf{1}\{y = \hat{y}\} \cdot \mathbf{1}\{h^T_t = \omega^T_n\} \cdot \theta^T_{l\hat{y}n}. \tag{13}$$

Similarly, the state of a frame is also related to the activity class. Consequently, the size of the template $\theta^T_l$ is $N_y \times N^T$, and its entry $\theta^T_{l\hat{y}n}$ represents how likely it is that the activity $\hat{y}$ contains a frame with hidden state $\omega^T_n$.

Root potential: models the semantic relationship between the activity and the global video-level feature of the whole sequence

$$E^T_o(X^V, y; \theta^T_o) = \sum_{\hat{y} \in \mathcal{Y}} X^V \cdot \mathbf{1}\{y = \hat{y}\} \cdot \theta^T_{o\hat{y}} \tag{14}$$

where $\theta^T_o$ can be treated as a global template, and its entry $\theta^T_{o\hat{y}}$ represents how likely it is that this video contains the activity $\hat{y}$.

Structure-related potential: describes the compatibility of variables belonging to a three-wise connected clique and models the motion constraint of a pair of hidden states in a specific activity class. This higher-order potential improves the HSTM with the capability of expressing high-level dependencies.

– The spatial structure-related potential

$$E^S_e(y, h^S_{i,t}, h^S_{j,t}; \theta^S_e) = \sum_{\hat{y} \in \mathcal{Y}} \sum_{n=1}^{N^S} \sum_{m=1}^{N^S} \mathbf{1}\{h^S_{i,t} = \omega^S_n\} \cdot \mathbf{1}\{h^S_{j,t} = \omega^S_m\} \cdot \mathbf{1}\{y = \hat{y}\} \cdot \theta^S_{e\hat{y}nm} \tag{15}$$

where $\theta^S_e$ is an $N_y \times N^S \times N^S$ template, and its entry $\theta^S_{e\hat{y}nm}$ estimates how likely a frame is to contain a pair of local patches with hidden states $\omega^S_n$ and $\omega^S_m$, given the activity $\hat{y}$. The potential encodes the structural information of patches in a specific activity class, since the structures of body parts differ depending on the activity class. The HSTM is able to automatically discover all possible patch configurations in the spatial layer, and a minimum spanning tree algorithm is run for each frame to create the graphical structure of the local patches (a small sketch of such a per-frame tree construction is given after these potential definitions).

– The temporal structure-related potential

$$E^T_e(y, h^T_{t-1}, h^T_t; \theta^T_e) = \sum_{n=1}^{N^T} \sum_{m=1}^{N^T} \sum_{\hat{y} \in \mathcal{Y}} \mathbf{1}\{h^T_t = \omega^T_n\} \cdot \mathbf{1}\{h^T_{t-1} = \omega^T_m\} \cdot \mathbf{1}\{y = \hat{y}\} \cdot \theta^T_{e\hat{y}nm} \tag{16}$$

where $\theta^T_e$ is an $N_y \times N^T \times N^T$ template, and its entry $\theta^T_{e\hat{y}nm}$ measures how likely a video is to contain two consecutive frames with hidden states $\omega^T_n$ and $\omega^T_m$, conditioned on the activity $\hat{y}$. The potential encodes the structural information of frames in a specific activity class, since different activity classes have different pose structures. Obviously, the configuration in the temporal layer depends only on the temporal relation.
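The paper states that a minimum spanning tree is built for each frame to define the edge set $E_t$, but it does not spell out the edge weights; the sketch below is one plausible construction, assuming Euclidean distances between STIP coordinates as weights. Any connected acyclic structure keeps inference in the spatial layer efficient, which is what the tree assumption buys.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def frame_tree_edges(points):
    """Edge set E_t for one frame: MST over pairwise distances between STIPs.

    points: (N_t, d) array of interest-point coordinates (e.g., x, y, and scale).
    Returns a list of (i, j) index pairs forming a spanning tree.
    """
    dist = squareform(pdist(points))                 # dense (N_t, N_t) distance matrix
    mst = minimum_spanning_tree(dist).tocoo()        # sparse tree with N_t - 1 edges
    return list(zip(mst.row.tolist(), mst.col.tolist()))

rng = np.random.default_rng(0)
stips = rng.uniform(0, 100, size=(6, 2))             # 6 STIPs with (x, y) positions
print(frame_tree_edges(stips))
```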

2) Two High-Level Features: We could take any raw features into the model. In this paper, space-time interest points are extracted as observations of the spatial layer to represent activities. We denote $X^S_{i,t}$ as the feature vector describing the appearance of the $i$-th local patch at time $t$. Similar to [4], we utilize the Harris 3D detector to detect the local interest points and employ histograms of oriented gradients (HOG) and histograms of optical flow (HOF) to encode their local appearances.

If we consider the raw feature as feature knowledge, the high-level feature can be treated as model knowledge. The high-level feature $X^L_t$ is the overall characterization of the $t$-th frame learned from the spatial layer. The learned spatial hidden variable is used as the basic element to generate the high-level representation, due to its compact and semantic attribution.


Fig. 4. Visualization of the two learned high-level features. Intuitively, the individual features are obtained by clustering similar features into a group, and each circle in (b) contains patches with the same hidden state. The interaction features capture certain spatial constraints between pairs of individual features, and each polygon in (c) contains pairs of patches with the same spatial structure.

The high-level feature, which respectively characterizes high-level components and their spatial dependencies, can be decomposed as

$$X^L_t = \Gamma(h^S_t) = \{\Gamma^v(h^S_t), \Gamma^e(h^S_t)\} \tag{17}$$

where $\Gamma(\cdot)$ is a mapping from the spatial hidden variables $h^S_t$ to the learned high-level observation $X^L_t$, equivalent to converting local features to global features by accumulating semantic patterns and their relations in a structured way. The novel individual feature $\Gamma^v_{n,y}(h^S_t)$ for hidden state $\omega^S_n$ and label $y$, and the interaction feature $\Gamma^e_{n,m,y}(h^S_t)$ for the hidden state pair $\omega^S_n, \omega^S_m$ and label $y$ can be computed as follows:

$$\Gamma^v_{n,y}(h^S_t) = \sum_{i=1}^{N_t} p(h^S_{i,t} = \omega^S_n \mid y, X^S_t, \theta^S) = \sum_{i=1}^{N_t} \frac{\sum_{h^S_t : h^S_{i,t} = \omega^S_n} \exp\big(E^S(y, h^S_t, X^S_t; \theta^S)\big)}{\sum_{h^S_t \in \mathcal{H}^S,\, \hat{y} \in \mathcal{Y}} \exp\big(E^S(\hat{y}, h^S_t, X^S_t; \theta^S)\big)}, \tag{18}$$

$$\Gamma^e_{n,m,y}(h^S_t) = \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} p(h^S_{i,t} = \omega^S_n, h^S_{j,t} = \omega^S_m \mid y, X^S_t, \theta^S) = \sum_{i=1}^{N_t} \sum_{j=1}^{N_t} \frac{\sum_{h^S_t : h^S_{i,t} = \omega^S_n,\, h^S_{j,t} = \omega^S_m} \exp\big(E^S(y, h^S_t, X^S_t; \theta^S)\big)}{\sum_{h^S_t \in \mathcal{H}^S,\, \hat{y} \in \mathcal{Y}} \exp\big(E^S(\hat{y}, h^S_t, X^S_t; \theta^S)\big)} \tag{19}$$

where the two marginal probabilities $p(h^S_{i,t} = \omega^S_n \mid y, X^S_t, \theta^S)$ and $p(h^S_{i,t} = \omega^S_n, h^S_{j,t} = \omega^S_m \mid y, X^S_t, \theta^S)$ can be calculated by belief propagation. The high-level features are defined as the individual distribution and the structural distribution of hidden states, which accumulate the occurrence probability of each individual state and of their interactional edges within a frame, respectively. Finally, $X^L_t$ is obtained by concatenating the individual features $\Gamma^v(h^S_t) = \{\Gamma^v_{n,y}(h^S_t) \mid \omega^S_n \in \mathcal{H}^S, y \in \mathcal{Y}\}$ and the interaction features $\Gamma^e(h^S_t) = \{\Gamma^e_{n,m,y}(h^S_t) \mid \omega^S_n, \omega^S_m \in \mathcal{H}^S, y \in \mathcal{Y}\}$, spanning the whole hidden state space and the activity label space. From another perspective, it has some similarities to the deep learning paradigm. The spatial layer can be seen as applying a soft-max function over the potentials across all the labels at each local patch, and the high-level features are obtained by a process that resembles structured pooling. A visualization of the two learned high-level features is shown in Fig. 4(a), where each circle corresponds to an individual feature depending on its hidden state and each line corresponds to an interaction feature between two individual features. Intuitively, the individual features in Fig. 4(b) are obtained by clustering similar features into a group. The interaction features in Fig. 4(c) capture certain spatial constraints between pairs of individual features.
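To make (18) and (19) concrete, the following sketch assembles the individual and interaction features of one frame from its unary and pairwise marginals; in the paper these marginals come from belief propagation, whereas here they are simply passed in as precomputed arrays, and all names and shapes are illustrative.

```python
import numpy as np

def high_level_feature(unary_marginals, pairwise_marginals):
    """Build X^L_t from per-frame marginals, following (18)-(19).

    unary_marginals: (N_y, N_t, N_S) array, entry [y, i, n] = p(h_{i,t}=n | y, X_t).
    pairwise_marginals: (N_y, N_t, N_t, N_S, N_S) array of pairwise marginals.
    Returns the concatenation of the individual features Gamma^v (summed over
    patches) and the interaction features Gamma^e (summed over patch pairs).
    """
    gamma_v = unary_marginals.sum(axis=1)                # (N_y, N_S), eq. (18)
    gamma_e = pairwise_marginals.sum(axis=(1, 2))        # (N_y, N_S, N_S), eq. (19)
    return np.concatenate([gamma_v.ravel(), gamma_e.ravel()])

# Toy frame: 2 labels, 4 patches, 3 spatial hidden states, random marginals.
rng = np.random.default_rng(0)
unary = rng.dirichlet(np.ones(3), size=(2, 4))           # rows sum to 1 over states
pairwise = rng.dirichlet(np.ones(9), size=(2, 4, 4)).reshape(2, 4, 4, 3, 3)
print(high_level_feature(unary, pairwise).shape)         # (2*3 + 2*3*3,) = (24,)
```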

IV. TRAINING AND INFERENCE FOR THE HSTM

A. Parameter Estimation

In contrast with other hierarchical CRFs/HCRFs, we jointly optimize all the parameters of the HSTM. Given training videos $\{X_i, y_i\}_{i=1}^{M}$, the objective function is defined as an $L_2$-regularized likelihood function

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{M} L_1(i, \theta) = \arg\max_{\theta} \sum_{i=1}^{M} \log \frac{\sum_{h \in \mathcal{H}} \exp\big(E(y_i, h, X_i; \theta)\big)}{\sum_{h \in \mathcal{H},\, y \in \mathcal{Y}} \exp\big(E(y, h, X_i; \theta)\big)} - \frac{1}{2\sigma^2}\|\theta\|^2. \tag{20}$$

The first term denotes the log-likelihood on the training data. The last term can be treated as a Gaussian regularization term; it is a common regularization term in standard CRF/HCRF models [14], [15], used to avoid overfitting and simplify the model. It can also be seen as a Gaussian prior $P(\theta) \sim \exp\big(-\frac{1}{2\sigma^2}\|\theta\|^2\big)$ with variance $\sigma^2$, which is set to 1 in (20) and (23). The training process for a single-layer HCRF itself is time consuming, which


would result in exponential computational complexity. What is more, the HSTM is a two-layer model, so the traditional training algorithm is not tractable.

Considering the graphical structure of the HSTM, there is a way to find an efficient and effective algorithm to estimate all parameters. First, each Tree-HCRF in the bottom layer can be trained independently, since it only captures the spatial constraints within one frame. Second, the transmitted information (the high-level features and the frame-level marginal posterior probabilities) can be calculated in advance, since there is no potential function between hidden nodes from different layers. Therefore, a bottom-up strategy is adopted to learn the parameters layer by layer, which keeps the computational complexity linear in the number of layers.

In this paper, the parameter estimation method applies a space-to-time and layer-by-layer search scheme over the validation set. In the following, only the $i$-th training sample $\{X_i, y_i\}$ is considered for convenience of description, and it is easy to extend to the whole training set. Therefore, we can reformulate the objective function as in (21). For simplification, we denote the potential $E(y_i, h, X_i; \theta)$ as $E_{y_i, h, X_i; \theta}$, and the regularization term is also omitted here. Equation (21) is shown below, where $L_1(i, \theta \mid \theta^{S*})$ is an approximation of $L_1(i, \theta)$ when $\theta^{S*}$ is given. It no longer guarantees finding the global maximum, since a subset of the original parameters is fixed. Therefore, we can easily see that $\max_{\theta} L_1(i, \theta) \geq \max_{\theta^T} L_1(i, \theta \mid \theta^{S*})$, since $\theta^T \subseteq \theta$. The well-optimized marginal posterior probability of the $t$-th frame in the $i$-th training sample conditioned on the label $y$ is defined as

$$p^S_{i,y,t} = \sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S(y, h^S_t, X^S_t; \theta^{S*})\big) \tag{22}$$

and the parameters of the spatial layer can be estimated by

$$\theta^{S*} = \arg\max_{\theta^S} \sum_{t=1}^{K} L_3(i, t, \theta^S) = \arg\max_{\theta^S} \sum_{t=1}^{K} \log \frac{\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^S}\big)}{\sum_{h^S_t \in \mathcal{H}^S,\, y \in \mathcal{Y}} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^S}\big)} - \frac{1}{2\sigma^2}\|\theta^S\|^2 \tag{23}$$

where $L_3(i, t, \theta^S)$ is the log-likelihood of the spatial layer.

After obtaining the parameters of the spatial layer, the temporal layer can be trained depending on the high-level features and the frame-level posterior probabilities. This learning scheme no longer guarantees finding the globally optimal parameter setting for the HSTM. In fact, the parameters are optimized to maximize a lower bound of the likelihood function of the complete HSTM by splitting the model into local layers and integrating statistics over all of them.
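As an illustration of how the per-frame spatial objective (23) can be handed to an off-the-shelf quasi-Newton optimizer, the sketch below restricts the spatial energy to its unary feature- and activity-related terms so that the frame likelihood factorizes over patches and can be evaluated exactly; the full model additionally includes the pairwise tree terms, handled with belief propagation, so this is a simplified stand-in rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(flat_theta, frames, labels, n_states, n_labels, dim, sigma=1.0):
    """-sum_t L3(i, t, theta^S) for a unary-only spatial layer (tree edges omitted).

    With only the feature- and activity-related terms of (8), the sum over the
    hidden states of a frame factorizes over its patches, so the per-frame
    likelihood in (23) can be computed in closed form.
    """
    theta_v = flat_theta[: dim * n_states].reshape(dim, n_states)
    theta_l = flat_theta[dim * n_states:].reshape(n_labels, n_states)
    nll = 0.0
    for patches, y in zip(frames, labels):
        # log sum_h exp(E^S(y', h, X_t)) for every label y', via per-patch log-sum-exp
        scores = patches @ theta_v                                   # (N_t, N_S)
        per_label = np.array([
            np.sum(np.logaddexp.reduce(scores + theta_l[yp], axis=1))
            for yp in range(n_labels)])                              # log-partition per label
        nll -= per_label[y] - np.logaddexp.reduce(per_label)         # -L3(i, t, theta^S)
    return nll + np.sum(flat_theta ** 2) / (2 * sigma ** 2)          # Gaussian regularizer

# Toy data: 10 frames, 2 activity labels, 3-state spatial layer, 5-dim patch features.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(int(rng.integers(3, 7)), 5)) for _ in range(10)]
labels = rng.integers(0, 2, size=10).tolist()
theta0 = np.zeros(5 * 3 + 2 * 3)
res = minimize(neg_log_likelihood, theta0, method="L-BFGS-B",
               args=(frames, labels, 3, 2, 5))
print(res.success, res.fun)
```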

$$\begin{aligned}
\max_{\theta} L_1(i, \theta) &= \max_{\theta} \log \frac{\sum_{h \in \mathcal{H}} \exp\big(E_{y_i, h, X_i; \theta}\big)}{\sum_{h \in \mathcal{H},\, y \in \mathcal{Y}} \exp\big(E_{y, h, X_i; \theta}\big)} \\
&= \max_{\theta} \log \frac{\sum_{h^S \in \mathcal{H}^S,\, h^T \in \mathcal{H}^T} \exp\Big(E^T_{y_i, h^T, X^T; \theta^T} + \sum_{t=1}^{K} E^S_{y_i, h^S_t, X^S_t; \theta^S}\Big)}{\sum_{h^S \in \mathcal{H}^S,\, h^T \in \mathcal{H}^T,\, y \in \mathcal{Y}} \exp\Big(E^T_{y, h^T, X^T; \theta^T} + \sum_{t=1}^{K} E^S_{y, h^S_t, X^S_t; \theta^S}\Big)} \\
&= \max_{\theta} \log \frac{\prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^S}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^S}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&\geq \max_{\theta^T} \log \frac{\prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&= \max_{\theta^T} \log \frac{\prod_t p^S_{i, y_i, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t p^S_{i, y, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&= \max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) \qquad (21)
\end{aligned}$$

Different from the CRF, the objective function is not concave, due to the hidden variables, but we can still use a standard quasi-Newton method (LBFGS) to solve this optimization problem and find a $\theta$ that is locally optimal. The LBFGS is chosen because of its empirical success in the literature and its higher convergence rate compared with the gradient descent method. The gradient of the objective function with respect to the $i$-th training sample can be calculated as


$$\frac{\partial L_1(i, \theta \mid \theta^{S*})}{\partial \theta^T_l} = \sum_{h^T_t} p(h^T_t \mid y_i, X_i; \theta)\, \frac{\partial E^T_l(\cdot)}{\partial \theta^T_l} - \sum_{y, h^T_t} p(y, h^T_t \mid X_i; \theta)\, \frac{\partial E^T_l(\cdot)}{\partial \theta^T_l} \tag{24}$$

$$\frac{\partial L_1(i, \theta \mid \theta^{S*})}{\partial \theta^T_v} = \sum_{h^T_t} p(h^T_t \mid y_i, X_i; \theta)\, \frac{\partial E^T_v(\cdot)}{\partial \theta^T_v} - \sum_{y, h^T_t} p(y, h^T_t \mid X_i; \theta)\, \frac{\partial E^T_v(\cdot)}{\partial \theta^T_v} \tag{25}$$

$$\frac{\partial L_1(i, \theta \mid \theta^{S*})}{\partial \theta^T_e} = \sum_{h^T_t, h^T_{t+1}} p(h^T_t, h^T_{t+1} \mid y_i, X_i; \theta)\, \frac{\partial E^T_e(\cdot)}{\partial \theta^T_e} - \sum_{y, h^T_t, h^T_{t+1}} p(y, h^T_t, h^T_{t+1} \mid X_i; \theta)\, \frac{\partial E^T_e(\cdot)}{\partial \theta^T_e} \tag{26}$$

$$\frac{\partial L_1(i, \theta \mid \theta^{S*})}{\partial \theta^T_o} = \frac{\partial E^T_o(\cdot)}{\partial \theta^T_o} - \sum_{y} p(y \mid X_i; \theta)\, \frac{\partial E^T_o(\cdot)}{\partial \theta^T_o}. \tag{27}$$

It is easy to obtain the partial derivative of each potential function with respect to its corresponding parameter, due to their linear relationship. In addition, these marginal probabilities can be efficiently computed using the belief propagation (BP) algorithm. For the spatial layer, a similar calculation can be used and the forms of these partial derivatives are the same as those in [12], [13].
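Equations (24)–(27) all have the familiar CRF form of an empirical (label-clamped) expectation minus a model expectation. The snippet below spells this out for the temporal activity-related parameter $\theta^T_l$ of a single video, assuming the required posteriors have already been obtained (by belief propagation in the paper); it is an illustration, not the authors' code.

```python
import numpy as np

def grad_theta_Tl(post_given_label, post_joint, y_true):
    """Gradient (24) of L1 w.r.t. theta^T_l for one video.

    post_given_label: (K, N_T) array, entry [t, n] = p(h^T_t = n | y_i, X_i; theta).
    post_joint:       (N_y, K, N_T) array, entry [y, t, n] = p(y, h^T_t = n | X_i; theta).
    Returns an (N_y, N_T) array: expected counts under the clamped posterior minus
    expected counts under the full model posterior.
    """
    n_labels, _, n_states = post_joint.shape
    clamped = np.zeros((n_labels, n_states))
    clamped[y_true] = post_given_label.sum(axis=0)       # sum over frames t
    free = post_joint.sum(axis=1)                        # (N_y, N_T)
    return clamped - free

# Toy example: K = 4 frames, N_T = 3 temporal states, N_y = 2 labels.
rng = np.random.default_rng(0)
p_h_given_y = rng.dirichlet(np.ones(3), size=4)          # (4, 3)
p_y_h = rng.dirichlet(np.ones(2 * 3), size=4).reshape(4, 2, 3).transpose(1, 0, 2)
print(grad_theta_Tl(p_h_given_y, p_y_h, y_true=1))
```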

The training procedure is outlined in Algorithm 1. Lines 2–5 implement the spatial-layer learning step and Lines 6–13 implement the temporal-layer learning step. The loops in Lines 2–4 and Lines 6–12 accumulate the log-likelihood of the spatial layer and of the complete HSTM on the training set, respectively. After optimizing the spatial parameters in Line 5 by quasi-Newton, the high-level features are computed in Line 8 using the BP algorithm and the frame-level marginal posterior probabilities are calculated in Line 9. These two types of information transmitted from the bottom to the top are the key evidence for estimating the complete HSTM. The proposed training algorithm can reduce the complexity of the calculation without losing much information.

Now we discuss some properties associated with the learning algorithm. Before that, we need a few more definitions. For the $i$-th training sample, $L_1(i, \theta)$ is the log-likelihood of the complete HSTM and $L_1(i, \theta \mid \theta^{S*})$ is its approximation, or lower bound; $L_2(i, \theta^T)$ is the log-likelihood of the temporal layer; $L_3(i, t, \theta^S)$ is the log-likelihood of the spatial layer at time $t$.

Theorem 1: $\max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) \geq \max_{\theta^T} L_2(i, \theta^T)$.

Proof: Since the marginal posterior probability with the true label is higher than those with wrong labels in the spatial layer, that is, $p^S_{i,y_i,t} \geq p^S_{i,y,t}$, based on (21) we scale the denominator by replacing $p^S_{i,y,t}$ with $p^S_{i,y_i,t}$ and then reach the following conclusion:

$$\begin{aligned}
\max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) &= \max_{\theta^T} \log \frac{\prod_t p^S_{i, y_i, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t p^S_{i, y, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&\geq \max_{\theta^T} \log \frac{\prod_t p^S_{i, y_i, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t p^S_{i, y_i, t} \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&= \max_{\theta^T} \log \frac{\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)}{\sum_{y \in \mathcal{Y}} \sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)} \\
&= \max_{\theta^T} L_2(i, \theta^T). \qquad (28)
\end{aligned}$$
□

It shows that the two-layer HSTM trained by the novel learning algorithm can achieve a higher recognition rate than the single temporal layer alone.

Theorem 2: $\max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) \geq \max_{\theta^T} L_2(i, \theta^T) + \max_{\theta^S} \sum_{t=1}^{K} L_3(i, t, \theta^S)$.

Proof: Based on (21), we have the corollary in (29); the inequality in (29) holds because $\exp(\cdot) \geq 0$:

$$\begin{aligned}
\max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) &= \max_{\theta^T} \log \frac{\prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\sum_{y \in \mathcal{Y}} \prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&\geq \max_{\theta^T} \log \frac{\prod_t \Big(\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)\Big)}{\prod_t \Big(\sum_{y \in \mathcal{Y},\, h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^{S*}}\big)\Big) \cdot \Big(\sum_{y \in \mathcal{Y},\, h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)\Big)} \\
&= \sum_t \log \frac{\sum_{h^S_t \in \mathcal{H}^S} \exp\big(E^S_{y_i, h^S_t, X^S_t; \theta^{S*}}\big)}{\sum_{h^S_t \in \mathcal{H}^S,\, y \in \mathcal{Y}} \exp\big(E^S_{y, h^S_t, X^S_t; \theta^{S*}}\big)} + \max_{\theta^T} \log \frac{\sum_{h^T \in \mathcal{H}^T} \exp\big(E^T_{y_i, h^T, X^T; \theta^T}\big)}{\sum_{y \in \mathcal{Y},\, h^T \in \mathcal{H}^T} \exp\big(E^T_{y, h^T, X^T; \theta^T}\big)} \\
&= \max_{\theta^S} \sum_{t=1}^{K} L_3(i, t, \theta^S) + \max_{\theta^T} L_2(i, \theta^T) \qquad (29)
\end{aligned}$$

It demonstrates that the novel learning algorithm can obtain higher recognition accuracy than estimating the spatial and the temporal parameters independently, since the former achieves a larger log-likelihood than the latter. □

In addition, once Theorem 1 is given, there is a simple explanation for Theorem 2. Since $L_3(i, t, \theta^S) \leq 0$, Theorem 2 can be easily obtained:

$$\max_{\theta^T} L_1(i, \theta \mid \theta^{S*}) \geq \max_{\theta^T} L_2(i, \theta^T) \geq \max_{\theta^S} \sum_{t=1}^{K} L_3(i, t, \theta^S) + \max_{\theta^T} L_2(i, \theta^T). \tag{30}$$

B. Inference

Given the optimal model parameters $\theta^*$ learned from the training set, prediction for a new test video sequence X is to select the y that maximizes $y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid X; \theta^*)$, where the conditional probability is defined in (5). Since the marginal probabilities can be efficiently computed using the BP algorithm, the inference operation is also very efficient. Given


a sequence, the complexity of solving such an inference problem is $O(K N_y |E| N^S + N_y K N^T)$, including $O(N_y |E| N^S)$ for each frame in the spatial layer and $O(N_y K N^T)$ for the temporal layer, assuming the edges $E$ form a tree.
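Since the partition function Z in (5) is shared by all labels, prediction reduces to comparing the per-label log-scores $\log \sum_h \exp(E(y, h, X; \theta^*))$; the small sketch below (illustrative only) performs this final argmax, with the scores assumed to be supplied by belief propagation on the tree and chain structures.

```python
import numpy as np

def predict(label_log_scores):
    """y* = argmax_y P(y | X; theta*), given log sum_h exp(E(y, h, X; theta*)) per label.

    The normalizer Z is identical for every label, so the argmax of the conditional
    probability equals the argmax of the per-label log-scores.
    """
    scores = np.asarray(label_log_scores)
    probs = np.exp(scores - np.logaddexp.reduce(scores))   # softmax, for inspection only
    return int(np.argmax(scores)), probs

# Toy example: log-scores for 4 candidate activities.
print(predict([2.0, 3.5, 1.2, 3.4]))
```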

Algorithm 1: Training Procedure for the HSTM
Input: the set of training data, $\{X_i, y_i\}_{i=1}^{M}$;
Output: the optimal solution, $\theta^*$;
1: Initialize the parameters $\theta = \{\theta^S, \theta^T\}$ randomly;
2: for each $i \in [1, M]$, each $t \in [1, K]$ do
3:   Accumulate the log-likelihood of the spatial layer $L_3(\theta^S) = \sum_{i=1}^{M} \sum_{t=1}^{K} L_3(i, t, \theta^S)$;
4: end for
5: Estimate the spatial parameters $\theta^{S*} = \arg\max_{\theta^S} L_3(\theta^S)$;
6: for each $i \in [1, M]$ do
7:   for each $t \in [1, K]$ do
8:     Learn the high-level features for the temporal layer, as in (18) and (19);
9:     Calculate the marginal posterior probabilities, as in (22);
10:   end for
11:   Accumulate the log-likelihood of the complete HSTM $L_1(\theta \mid \theta^{S*}) = \sum_{i=1}^{M} L_1(i, \theta \mid \theta^{S*})$;
12: end for
13: Use the quasi-Newton algorithm to estimate the temporal parameters $\theta^{T*} = \arg\max_{\theta^T} L_1(\theta \mid \theta^{S*})$;
14: return $\theta^*$;
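The following compact Python rendering mirrors the control flow of Algorithm 1 only; `fit_spatial_layer`, `compute_high_level_features`, `frame_posteriors` and `fit_temporal_layer` are hypothetical placeholders for the routines described in Sections III and IV, not real APIs. The key point of the bottom-up strategy is visible in the structure: θ^S is fixed before θ^T is fit, so each layer is optimized once rather than alternating.

```python
def train_hstm(videos, labels, fit_spatial_layer, compute_high_level_features,
               frame_posteriors, fit_temporal_layer):
    """Bottom-up training of the HSTM, mirroring Algorithm 1.

    Lines 2-5: fit the spatial Tree-HCRFs jointly over all frames of all videos.
    Lines 6-12: pass the learned high-level features and frame-level posteriors up.
    Line 13: fit the temporal Chain-HCRF on the transmitted information.
    """
    # Lines 2-5: accumulate the spatial log-likelihood and optimize theta^S.
    all_frames = [(frame, y) for video, y in zip(videos, labels) for frame in video]
    theta_S = fit_spatial_layer(all_frames)

    # Lines 6-12: compute the information transmitted from the bottom to the top layer.
    transmitted = []
    for video, y in zip(videos, labels):
        high_level = [compute_high_level_features(f, theta_S) for f in video]  # (18)-(19)
        posteriors = [frame_posteriors(f, theta_S) for f in video]             # (22)
        transmitted.append((high_level, posteriors, y))

    # Line 13: optimize theta^T given theta^S* (quasi-Newton inside fit_temporal_layer).
    theta_T = fit_temporal_layer(transmitted)
    return {"spatial": theta_S, "temporal": theta_T}    # theta* (line 14)

# Toy usage with trivial placeholder routines.
videos = [["f1", "f2"], ["f3"]]
params = train_hstm(videos, labels=[0, 1],
                    fit_spatial_layer=lambda frames: "theta_S",
                    compute_high_level_features=lambda frame, th: [0.0],
                    frame_posteriors=lambda frame, th: [1.0],
                    fit_temporal_layer=lambda data: "theta_T")
print(params)
```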

V. EXPERIMENTAL RESULTS

Our proposed model is a unified framework to recognize hu-man activities, especially for various interactions. Therefore,

we evaluate performance of the HSTM on three tasks withfive standard benchmark datasets: one-person action recognition(UCF), human-human interactional activity recognition (UT-Interaction, BIT-Interaction and CASIA) and human-object in-teractional activity recognition (Gupta video dataset). In thefollowing subsections, we first compare the HSTM with otherrelated models, then discuss the hierarchical structure in theHSTM, and finally estimate the effectiveness of the learnedhigh-level features. All evaluations are conducted in a MicrosoftVisual Studio 2008 and OpenCV 2.3 environment on a 3.1 GHzCPU with 4G RAM.

UCF [55]: The dataset is used to recognize various one-person actions in realistic scenes. The videos in the UCF are taken under extremely challenging and uncontrolled conditions. The UCF includes 150 clips from various sport events with 10 categories of actions: diving, golf, kicking, lifting, horse riding, running, skate boarding, swing bench, swing side, and walking.

UT-Interaction [6]: The task in this benchmark dataset is to recognize complex human-human interactional activities. The segmented version contains a total of 120 videos of 6 classes of human interactions: shake hands, hug, push, kick, point, and punch. These video clips are taken in a parking lot with a mostly static background and little camera jitter.

CASIA [56]: The task in this two-person activity dataset is to recognize complex human-human interactional activities under different views. It contains 84 clips from three different views: angle view, horizontal view, and top-down view. There are 7 different activity categories: fight, followalways, followtogether, meetapart, meettogether, overtake, and rob, so each activity has 12 clips, four from each view.

BIT-Interaction dataset [44]: The task in this benchmark dataset is to recognize complex human-human interactional activities in realistic scenes with partial occlusion, moving objects, cluttered backgrounds, and variations in scale, appearance, illumination, and viewpoint. It contains a total of 400 videos of 8 classes of human interactions: bow, boxing, handshake, high-five, hug, kick, pat, and push.

Gupta video dataset [48]: The task in this dataset is to recognize complex human-object interactional activities. It contains

\[
\begin{aligned}
\max_{\theta^T} L_1(i,\theta\,|\,\theta^{S*})
&= \max_{\theta^T}\log\frac{\Big[\prod_t\Big(\sum_{h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y_i,h^S_t,X^S_t;\theta^{S*}\right)\Big)\Big]\cdot\Big(\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y_i,h^T,X^T;\theta^T\right)\Big)}{\sum_{y\in\mathcal{Y}}\Big[\prod_t\Big(\sum_{h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y,h^S_t,X^S_t;\theta^{S*}\right)\Big)\Big]\cdot\Big(\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^T;\theta^T\right)\Big)}\\
&\ge \max_{\theta^T}\log\frac{\Big[\prod_t\Big(\sum_{h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y_i,h^S_t,X^S_t;\theta^{S*}\right)\Big)\Big]\cdot\Big(\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y_i,h^T,X^T;\theta^T\right)\Big)}{\Big[\prod_t\Big(\sum_{y\in\mathcal{Y},\,h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y,h^S_t,X^S_t;\theta^{S*}\right)\Big)\Big]\cdot\Big(\sum_{y\in\mathcal{Y},\,h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^T;\theta^T\right)\Big)}\\
&= \sum_t\log\frac{\sum_{h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y_i,h^S_t,X^S_t;\theta^S\right)}{\sum_{h^S_t\in\mathcal{H}^S,\,y\in\mathcal{Y}}\exp E^S\!\left(y,h^S_t,X^S_t;\theta^S\right)}
+ \max_{\theta^T}\log\frac{\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y_i,h^T,X^T;\theta^T\right)}{\sum_{y\in\mathcal{Y},\,h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^T;\theta^T\right)}\\
&= \max_{\theta^S}\sum_{t=1}^{K} L_3(i,t,\theta^S) + \max_{\theta^T} L_2(i,\theta^T)
\end{aligned}
\qquad (29)
\]


TABLE I
ACCURACIES OF THE HSTM COMPARED WITH OTHER RELATED METHODS ON THE BIT-INTERACTION, UCF, UT-INTERACTION, CASIA, AND GUPTA

BIT:   Baseline [4] 70.31%; Kong et al. [44] 85.16%; Ke et al. [34] 85.2%; Kong et al. [50] 79.7%; Donahue et al. [53] 69.5%; Lan et al. [51] 82.03%; Kong et al. [57] 85.38%; The HSTM 88.50%
UCF:   Baseline [4] 68.67%; Quoc et al. [30] 86.5%; Georgia et al. [32] 75.8%; Kovashka et al. [17] 87.27%; Wang et al. [54] 88.2%; Wang et al. [59] 85.6%; Klaser et al. [60] 86.7%; The HSTM 90.67%
CASIA: Baseline [4] 50%; Guo et al. [47] 83.28%; Brand et al. [46] 76.14%; The HSTM 95.24%
UT:    Baseline [4] 73.33%; Lan et al. [45] 88.4%; Ke et al. [34] over 90%; Kong et al. [44] 88.33%; Kong et al. [50] 95%; Kong et al. [43] 91.67%; Kong et al. [57] 88.33%; The HSTM 94.17%
Gupta: Baseline [4] 50%; Prest et al. [61] 93%; Gupta et al. [48] 93.34%; The HSTM 96.30%

Fig. 5. Confusion matrices for activity classification on the UCF, UT-Interaction, BIT-Interaction, CASIA, and Gupta video dataset. (a) BIT-Interaction. (b) UCF. (c) UT-Interaction. (d) CASIA. (e) Gupta.

54 video clips with 10 actors performing 6 different activities: drinking from a cup, spraying from a bottle, answering a phone call, making a phone call, pouring from a cup, and lighting a flashlight. Videos in this dataset are recorded inside a laboratory with a static camera.

A. Comparisons of the HSTM With Other Related Models

We follow the experimental protocols used in published works on each dataset. A leave-one-out cross-validation (LOOCV) strategy is adopted on the UCF, UT-Interaction, CASIA, and Gupta video datasets. For the BIT-Interaction, we randomly choose 272 videos as training data and use the remaining 128 videos for testing, following the setup of other papers. The following experiments are conducted with 20 hidden states for both the spatial layer and the temporal layer, that is, N_T = N_S = 20.
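The two protocols can be summarized by the short Python sketch below, in which the classifier itself is abstracted behind hypothetical fit and predict callables; the function names, the data layout (a list of (video, label) pairs), and the fixed random seed are illustrative assumptions only.

import random

def random_split_accuracy(samples, fit, predict, n_train=272, seed=0):
    """BIT-Interaction protocol: a random 272/128 train/test split."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    train = [samples[i] for i in idx[:n_train]]
    test = [samples[i] for i in idx[n_train:]]
    model = fit(train)
    return sum(predict(model, v) == y for v, y in test) / len(test)

def loocv_accuracy(samples, fit, predict):
    """LOOCV protocol used for UCF, UT-Interaction, CASIA, and Gupta."""
    hits = 0
    for i, (video, label) in enumerate(samples):
        model = fit(samples[:i] + samples[i + 1:])   # train on all but one clip
        hits += predict(model, video) == label
    return hits / len(samples)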

In order to comprehensively evaluate the performance of the HSTM, we compare it with several related models on the five benchmark datasets in Table I. The recognition rates of other methods reported in the table are those published in their respective papers. The classic approach [4], applying an SVM based on space-time interest points (STIPs) with a bag-of-features (BOF) style representation, is used as the baseline. These comparable approaches can be divided into four groups:

1) The related graphical models: discriminative models (e.g., [43], [44], [51], [57]) and HMM based models (e.g., [46], [47]).
2) The direct overall modeling: hand-designed contextual-feature based methods (e.g., [5], [6], [17], [58]).
3) The hierarchical feature learning: unsupervised hierarchical feature learning (e.g., [30]) and deep learning models (e.g., [32], [34]).
4) The hierarchical classification models: a max-margin learning framework with hierarchical motion detectors (e.g., [45]), a long-term RNN model (e.g., [53]), a hierarchical latent SVM framework with interactive phrases (e.g., [44]),


TABLE II
ACCURACY OF THE HSTM WITH DIFFERENT LOCAL DESCRIPTORS ON THE CASIA AND GUPTA

        HOG      HOF      HOG3D    HOG-HOF
CASIA   78.57%   92.86%   85.71%   95.24%
Gupta   77.78%   94.44%   88.89%   96.30%

TABLE III
EVALUATION OF RECOGNITION ACCURACY AND TESTING SPEED OF THE HSTM WITH DIFFERENT NUMBERS OF HIDDEN STATES ON THE CASIA AND GUPTA

Hidden Size   CASIA Accuracy   CASIA Speed   Gupta Accuracy   Gupta Speed
5             86.90%           10.21 fps     85.19%           12.89 fps
10            90.47%           5.96 fps      87.04%           9.27 fps
20            95.24%           4.82 fps      96.30%           5.69 fps
25            94.05%           3.37 fps      94.44%           4.16 fps

and a multi-feature max-margin hierarchical Bayesian model (e.g., [40]).

It can be seen that the HSTM is comparable to or better than the state-of-the-art methods on the BIT-Interaction, UCF, UT-Interaction, CASIA, and Gupta video datasets, regardless of whether they perform direct overall modeling (e.g., [17], [58]), spatial relation modeling (e.g., [44], [51]), or temporal relation modeling (e.g., [46], [47]). In particular, our model achieves a large improvement on the CASIA (about 12%). Moreover, no manual annotation is needed for the HSTM, whereas the location of the person's hand and a pixel-wise segmentation of the object are required for training in [48]. This strongly demonstrates that the HSTM, which combines the advantages of both spatial relation modeling and temporal relation modeling, has more discriminative power. The confusion matrices obtained by our approach on the five datasets are shown in Fig. 5.

The space-time interest point with HOG-HOF is a popular local spatio-temporal descriptor for human activity recognition. We test it against other alternatives, namely HOG only, HOF only, and HOG3D [62]. For fairness of the comparison, the same local point detector, the Harris3D algorithm, is applied, and for each detected point the four types of descriptors are computed with their original implementations. The accuracy of the HSTM with these different local descriptors on the CASIA and Gupta is reported in Table II. As we can see from the results, HOG-HOF is the best descriptor and HOG is the worst; the ranking of performance is HOG-HOF > HOF > HOG3D > HOG. It shows that, compared with appearance information, motion information is more important for activity representation, and that the combination of the two appears to be a good choice. Since it is based on histograms of oriented 3D spatio-temporal gradients, which capture some motion changes, HOG3D obtains better performance than HOG.

We also evaluate the recognition accuracy and testing speed of the HSTM with different numbers of hidden states, i.e., 5, 10, 20, and 25, in Table III. It can be seen that the computational cost increases with the number of hidden states, while the recognition accuracy does not always improve with it. We achieve better performance with only 20 hidden states instead of 25. This is partly because there are more parameters to be learned when the model contains more

TABLE IV
ACCURACY OF THE HSTM COMPARED WITH THE SINGLE SPATIAL LAYER ON THE BIT-INTERACTION, UCF, UT, CASIA, AND GUPTA

Algorithm   BIT       UCF       UT-Interaction   CASIA     Gupta
PFSL        18.15%    17.71%    25.77%           58.98%    24.07%
PVSL        20.50%    18%       33.33%           84.52%    28.92%
HSTM        88.50%    90.67%    94.17%           95.24%    96.30%

hidden states, so the model may become too complex and overfit slightly.

B. Comparisons of Hierarchical Layers With Single Layer

For detailed analysis, we evaluate whether our HSTM is indeed better than one-layer models and study the strengths and weaknesses of different structures. While the performance of the HSTM shows a significant improvement over previously published results, it remains to be shown that the advantage comes from the hierarchical structure rather than from the optimal single layer inside the hierarchy. To this end, we compare the HSTM with the single spatial layer and the single temporal layer separately.

1) Comparisons of the HSTM With the Single Spatial Layer: For the spatial layer of the HSTM, we test its classification accuracy in two different ways, where only the raw features and the spatial HCRF are utilized to infer the activity label.

Per-frame in spatial layer (denoted as PFSL): we classify every frame independently by (31), which is actually the common HCRF based method for activity recognition.

\[
y^*_t = \arg\max_{y\in\mathcal{Y}} \frac{\sum_{h^S_t\in\mathcal{H}^S}\exp E^S\!\left(y,h^S_t,X^S_t;\theta^S\right)}{\sum_{h^S_t\in\mathcal{H}^S,\,y\in\mathcal{Y}}\exp E^S\!\left(y,h^S_t,X^S_t;\theta^S\right)} \qquad (31)
\]

where y*_t is the estimated activity label of the t-th frame.

Per-video in spatial layer (denoted as PVSL): we obtain the activity label for the whole video by majority voting

\[
y^* = \arg\max_{y\in\mathcal{Y}} \sum_{t=1}^{K} \mathbf{1}\{y^*_t = y\} \qquad (32)
\]

where y* is the estimated activity label of the whole video.
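For concreteness, a minimal Python sketch of the two decision rules is given below; it assumes the per-frame unnormalized scores ∑_{h^S_t} exp E^S(y, h^S_t, X^S_t; θ^S) have already been arranged in a K × |Y| array, and the array layout and helper names are assumptions made only for this illustration.

import numpy as np

def pfsl_labels(frame_scores):
    """PFSL, as in (31): classify every frame independently.

    frame_scores[t, y] is assumed to hold sum_h exp(E^S(y, h, X^S_t; theta_S));
    the normalizer in (31) is shared by all labels, so it does not change
    the per-frame arg max.
    """
    return np.argmax(frame_scores, axis=1)          # y*_t for t = 1..K

def pvsl_label(frame_scores):
    """PVSL, as in (32): majority vote over the per-frame decisions."""
    votes = np.bincount(pfsl_labels(frame_scores),
                        minlength=frame_scores.shape[1])
    return int(np.argmax(votes))

# toy usage: K = 30 frames, 6 activity classes, random scores
y_star = pvsl_label(np.random.rand(30, 6))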

The comparisons of the HSTM with these two contrasting methods on the BIT-Interaction, UCF, UT-Interaction, CASIA, and Gupta video dataset are listed in Table IV. From the results, the following points can be concluded:

1) The accuracy of the PFSL is lower than that of the PVSL in the spatial layer. Since human activity is a continuous process, it is not reasonable to label a whole video sequence from only one frame. For example, a frame with the pose of "stand" may occur in all kinds of activities.

2) The HSTM always outperforms the other two methods that use only the spatial layer. This shows that no one-layer spatial representation is as discriminative as the HSTM. That is to say, modeling human activity only in the spatial domain is far from enough, and the temporal relation is also necessary.

Although the discriminative power of the single spatial layer is much lower than that of the hierarchical structure, the function of the spatial layer is not to make a classification decision for


TABLE V
ACCURACY OF THE HSTM COMPARED WITH THE SINGLE TEMPORAL LAYER ON THE BIT-INTERACTION, UCF, UT, CASIA, AND GUPTA

Algorithm   BIT       UCF       UT-Interaction   CASIA     Gupta
RFTL        58.75%    58.67%    55%              51.19%    59.26%
LFTL        81.75%    86.67%    86.67%           92.86%    88.89%
LFPTL       88.25%    89.33%    91.67%           95.24%    96.30%
HSTM        88.50%    90.67%    94.17%           95.24%    96.30%

every frame. It is employed to provide classification evidence for the temporal layer and to summarize each frame into a compact, semantic representation, as discussed in the following.

2) Comparisons of the HSTM With the Single Temporal Layer: The temporal layer of the HSTM is built for modeling the correlation of neighboring frames. Its input consists of two parts: one is transmitted from the spatial layer (learned high-level features and frame-level marginal posterior probabilities) and the other is extracted directly from the whole frames (frame-level features and video-level features). We compare the HSTM with three different approaches in the temporal layer.

Raw features in temporal layer (denoted as RFTL): we employ neither of these two kinds of input information to train the temporal layer. The traditional BOF style representation with spatial raw features (STIPs) is treated as the observation X^B of the temporal HCRF

\[
y^* = \arg\max_{y\in\mathcal{Y}} \frac{\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^B;\theta^T\right)}{\sum_{y\in\mathcal{Y},\,h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^B;\theta^T\right)}. \qquad (33)
\]

Learned features in temporal layer (denoted as LFTL): we only apply the learned high-level features X^L as the observation of the temporal layer

\[
y^* = \arg\max_{y\in\mathcal{Y}} \frac{\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^L;\theta^T\right)}{\sum_{y\in\mathcal{Y},\,h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^L;\theta^T\right)}. \qquad (34)
\]

Learned features and posterior probabilities in temporal layer (denoted as LFPTL): we apply both the learned high-level features X^L and the frame-level marginal posterior probabilities p^S_{i,y,t} as the input of the temporal layer

\[
y^* = \arg\max_{y\in\mathcal{Y}} \frac{\Big[\prod_t p^S_{i,y,t}\Big]\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^L;\theta^T\right)}{\sum_{y\in\mathcal{Y}}\Big[\prod_t p^S_{i,y,t}\Big]\sum_{h^T\in\mathcal{H}^T}\exp E^T\!\left(y,h^T,X^L;\theta^T\right)}. \qquad (35)
\]
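Since the normalizer in (35) is common to all labels, the LFPTL decision can be evaluated safely in log space; the sketch below does exactly that. Its array layout (log frame posteriors of shape K × |Y| and one temporal-layer log score per label) is an assumption made for this illustration, not the original data structure.

import numpy as np

def lfptl_label(log_frame_post, log_temporal_score):
    """LFPTL decision rule (35) evaluated in log space.

    log_frame_post[t, y]  : assumed log p^S_{i,y,t} from the spatial layer
    log_temporal_score[y] : assumed log sum_h exp(E^T(y, h, X^L; theta_T))
    The denominator of (35) is shared by all labels and is dropped.
    """
    log_scores = log_frame_post.sum(axis=0) + log_temporal_score
    return int(np.argmax(log_scores))

# toy usage: K = 30 frames and 6 labels, with random but valid posteriors
lp = np.log(np.random.dirichlet(np.ones(6), size=30))
lt = np.random.randn(6)
y_star = lfptl_label(lp, lt)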

Table V shows the comparison results on the BIT-Interaction, UCF, UT-Interaction, CASIA, and Gupta video datasets. From the comparisons, the following points can be made:

1) The accuracy of the HSTM is much higher than that of the RFTL. This indicates that modeling human activity only in the temporal domain is far from enough, and the spatial dependency is also necessary.

2) The performance of the HSTM and the LFPTL is a bit better than that of the LFTL. That is to say, some spatial relations have been captured in these high-level features. It also confirms that, besides the temporal layer, the spatial layer contributes to the final recognition result. For instance, it can provide a strong cue for recognizing a

TABLE VI
COMPUTATIONAL SPEED OF DIFFERENT MODELS ON THE BIT-INTERACTION, UCF, UT-INTERACTION, CASIA, AND GUPTA

Algorithm              BIT         UCF         UT          CASIA       Gupta
CODHMM [47]            17.56 fps   21.75 fps   29.78 fps   13.12 fps   19.03 fps
PFSL (Spatial HCRF)    4.47 fps    4.13 fps    5.41 fps    5.23 fps    5.80 fps
RFTL (Temporal HCRF)   23.25 fps   17.05 fps   27.27 fps   19.62 fps   20.77 fps
LFTL                   4.27 fps    4.02 fps    5.1 fps     4.82 fps    5.70 fps
HSTM                   4.25 fps    4.01 fps    5.0 fps     4.74 fps    5.69 fps

whole video sequence when a particular frame is correctly classified in the spatial layer.

3) The HSTM achieves better performance than the LFPTL, which lacks the frame-level features and the video-level features. We find that combining evidence from more levels improves the model's discriminative power and descriptive capability.

To highlight the advantages of the HSTM, we give another strong case to demonstrate its superior performance. The comparisons of recognition rates and log-likelihoods for testing samples by cross validation between the HSTM and the LFTL are displayed in Fig. 6. We find that even for correctly classified instances, the HSTM has stronger discriminability, i.e., when achieving the same recognition rate, the red line still dominates the blue line in the log-likelihood curve. Although the accuracies of the LFTL are sometimes close to those of the HSTM, the log-likelihoods of the HSTM are mostly much higher, and the gap between the blue line and the red line in the right figure is larger than that in the left figure. To some extent, the results support the first theorem, showing that the objective function will not decrease when the spatial layer is added into the HSTM. However, it can be seen that the log-likelihood of the HSTM is lower than that of the LFTL in some cases, which seems to deviate from the theorem. The reason is that we assume p^S_{i,yi,t} ≥ p^S_{i,y,t} in an ideal case, while the marginal posterior probability of the true label is not necessarily higher than that of a wrong label during testing, due to the weak discriminability of the spatial layer. This further evidences that the spatial layer is an integral part of the HSTM.

The computational speed for testing, reported in frames per second (fps), is given in Table VI. The PFSL is actually a common HCRF based method for spatial modeling of activities, and the RFTL is a common HCRF based method for temporal modeling. The LFTL is a two-layer HCRF with an independent training strategy, and the CODHMM is an HMM based generative model. We can see that the testing time of the HSTM is longer than that of the other models, since the PFSL and the RFTL are one-layer HCRF models and the estimation of p^S_{i,y,t} in (22) is not required in the LFTL. In addition, much more time is required to test the PFSL compared with the RFTL and the CODHMM, because the number of observation nodes in the spatial layer is much larger than that in the temporal layer. From the table, we can conclude that the processing speed of the HSTM is not much slower than that of the PFSL and the LFTL, which illustrates the efficiency of our learning algorithm.

In general, the comparisons between the HSTM and its one-layer sub-models (the single spatial layer and the single temporal layer) show that the HSTM achieves the highest classification accuracy and that there is no one-layer representation that is as


Fig. 6. Comparisons of accuracies and log-likelihoods by cross validation between the HSTM and the LFTL on the CASIA. (a) The accuracy on the CASIA. (b) The log-likelihood on the CASIA.

Fig. 7. Comparisons of average precision (AP) for each activity between high-level features (LFBOF and LFTL) and raw features (RFBOF and RFTL) on the UCF, UT-Interaction, and Gupta video datasets. It indicates that the high-level features learned from the spatial layer can achieve higher accuracies. (a) UCF. (b) UT-Interaction. (c) Gupta.

discriminative as the hierarchical representation. Meanwhile, they show that the two layers are mutually dependent, i.e., neither can perform effectively without the other. Moreover, the experiments also verify that methods modeling only spatial dependencies (PFSL) or only temporal dependencies (RFTL) lack the capacity to recognize complex activities.

C. Comparisons of High-Level Features With Raw Features

The spatial layer plays an important role in the HSTM, and one of its primary functions is to convert raw features into high-level semantic representations. In this section, we evaluate the effectiveness of these learned high-level features by the following well-designed contrast experiments.

Raw features with BOF (denoted as RFBOF) vs. Learned features with BOF (denoted as LFBOF): we apply an SVM based on raw features (STIPs) and on learned features (individual features and interaction features) with a BOF style representation, respectively.

RFTL vs. LFTL: we employ raw features (STIPs) and learned features (individual features and interaction features) to train the temporal layer, respectively.

As illustrated in Fig. 7, we compare the average precision (AP) for each activity among these contrast methods. It is clear that:

1) The performance of the LFBOF is better than that of the RFBOF. This evidences that little information is lost when summarizing with the spatial layer, and that the learned high-level features are more compact and semantic for activity representation.

2) The classification accuracy of the RFTL is much lower than that of the LFTL. This further evidences that the spatial layer plays a feature learning role and that the learned high-level features are more discriminative for activity recognition.

The reason is that there are a number of difficulties when directly treating raw features as descriptors of activities, e.g., different scales, different ranges, and weak discriminative power and descriptive capability. In contrast, our proposed high-level features organize the spatial hidden variables, defined on the scale [0, 1], into the overall frame-level characterization, and they contain discriminative information learned via mathematical optimization. In summary, the contrast experiments between raw features and learned features indicate that the high-level features learned from the spatial layer are more representative and discriminative.
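To make the idea of organizing [0, 1]-valued hidden-variable quantities into a frame-level descriptor more tangible, a rough Python sketch is given below. It is only an illustration under assumed inputs (per-node and per-edge marginals from BP over the spatial layer) and an assumed mean pooling; the actual individual and interaction features are defined by (18) and (19), which are not reproduced here.

import numpy as np

def frame_level_feature(node_marginals, edge_marginals):
    """Stack spatial hidden-variable marginals into one frame descriptor.

    node_marginals: (num_nodes, N_S) array of per-node posteriors over the
        N_S spatial hidden states, each entry in [0, 1].
    edge_marginals: (num_edges, N_S, N_S) array of pairwise posteriors.
    Mean pooling is an illustrative choice; the result is a bounded,
    fixed-length summary of the frame.
    """
    individual = node_marginals.mean(axis=0)            # pooled unary evidence
    interaction = edge_marginals.mean(axis=0).ravel()   # pooled pairwise evidence
    return np.concatenate([individual, interaction])

# toy usage: 12 detected parts, 20 hidden states, 11 tree edges
feat = frame_level_feature(np.random.rand(12, 20), np.random.rand(11, 20, 20))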

VI. CONCLUSION

This paper proposes a new hierarchical spatio-temporal model (HSTM) for both one-person action recognition and interac-


tional activity recognition by modeling spatial constraints and temporal constraints simultaneously. In general, we represent activities from local to global and from space to time, such that descriptive capability is obtained not only by integrating all levels of spatio-temporal relations but also by combining the flexibility of local features with the discriminability of global features. A novel learning algorithm with a bottom-up strategy is derived to train all parameters efficiently and effectively. It enables both the spatial similarity and the temporal similarity of activities to be measured together to obtain superior classification ability. We evaluate the HSTM on the UCF, UT-Interaction, BIT-Interaction, CASIA, and Gupta video datasets and obtain better recognition results than other state-of-the-art methods. Experimental results also show that the HSTM leads to significant improvements in both descriptive capability and discriminative power.

REFERENCES

[1] Z. Zhou, F. Shi, and W. Wu, "Learning spatial and temporal extents of human actions for action detection," IEEE Trans. Multimedia, vol. 17, no. 4, pp. 512–525, Apr. 2015.
[2] G. Yu, N. A. Goussies, J. Yuan, and Z. Liu, "Fast action detection via discriminative random forest voting and top-k subvolume search," IEEE Trans. Multimedia, vol. 13, no. 3, pp. 507–517, Jun. 2011.
[3] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, "Machine recognition of human activities: A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1473–1488, Nov. 2008.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. 2nd Joint IEEE Int. Workshop Vis. Surveillance Perform. Eval. Tracking Surveillance, Oct. 2005, pp. 65–72.
[5] M. Bregonzio, S. Gong, and T. Xiang, "Recognising action as clouds of space-time interest points," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009, pp. 1948–1955.
[6] M. S. Ryoo and J. K. Aggarwal, "Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 1593–1600.
[7] D. Liu, Y. Yan, M.-L. Shyu, G. Zhao, and M. Chen, "Spatio-temporal analysis for human action detection and recognition in uncontrolled environments," Int. J. Multimedia Data Eng. Manage., vol. 6, no. 1, pp. 1–18, 2015.
[8] L. Ballan, M. Bertini, A. Del Bimbo, L. Seidenari, and G. Serra, "Effective codebooks for human action representation and classification in unconstrained videos," IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1234–1245, Aug. 2012.

[9] Y.-M. Liang, S.-W. Shih, A. C.-C. Shih, H.-Y. M. Liao, and C.-C. Lin, "Learning atomic human actions using variable-length Markov models," IEEE Trans. Syst. Man, Cybern. B, Cybern., vol. 39, no. 1, pp. 268–280, Feb. 2009.
[10] H. Ning, W. Xu, Y. Gong, and T. Huang, "Latent pose estimator for continuous action recognition," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 419–433.
[11] J. Zhang and S. Gong, "Action categorization with modified hidden conditional random field," Pattern Recog., vol. 43, no. 1, pp. 197–203, 2010.
[12] Y. Wang and G. Mori, "Max-margin hidden conditional random fields for human action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009, pp. 872–879.
[13] Y. Wang and G. Mori, "Learning a discriminative hidden part model for human action recognition," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1721–1728.
[14] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, "Hidden conditional random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1848–1853, Oct. 2007.
[15] J. Lafferty et al., "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th Int. Conf. Mach. Learn., 2001, vol. 1, pp. 282–289.
[16] I. Laptev, "On space-time interest points," Int. J. Comput. Vis., vol. 64, no. 2/3, pp. 107–123, 2005.
[17] A. Kovashka and K. Grauman, "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 2046–2053.

[18] J. Wen, Z. Lai, Y. Zhan, and J. Cui, "The l2,1-norm-based unsupervised optimal feature selection with applications to action recognition," Pattern Recog., vol. 60, pp. 515–530, 2016.
[19] G. Varol and A. A. Salah, "Efficient large-scale action recognition in videos using extreme learning machines," Expert Syst. Appl., vol. 42, no. 21, pp. 8274–8282, 2015.
[20] Y. Yuan, L. Qi, and X. Lu, "Action recognition by joint learning," Image Vis. Comput., vol. 55, pp. 77–85, 2016.
[21] W. Xu, Z. Miao, and X.-P. Zhang, "Structured feature-graph model for human activity recognition," in Proc. IEEE Int. Conf. Image Process., Sep. 2015, pp. 1245–1249.
[22] J. Zhou and X.-P. Zhang, "An ICA mixture hidden Markov model for video content analysis," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1576–1586, Nov. 2008.
[23] X. Wang and X.-P. Zhang, "A new localized superpixel Markov random field for image segmentation," in Proc. IEEE Int. Conf. Multimedia Expo., Jun.–Jul. 2009, pp. 642–645.
[24] M. Nematollahi and X.-P. Zhang, "A new robust context-based dense CRF model for image labeling," in Proc. IEEE Int. Conf. Image Process., Oct. 2014, pp. 5876–5880.
[25] X. Wang and X.-P. Zhang, "Ice hockey shooting event modeling with mixture hidden Markov model," Multimedia Tools Appl., vol. 57, no. 1, pp. 131–144, 2012.
[26] X. Wang and X.-P. Zhang, "Ice hockey shot event modeling with mixture hidden Markov model," in Proc. 1st ACM Int. Workshop Events Multimedia, 2009, pp. 25–32.
[27] X. Wang and X.-P. Zhang, "ICA mixture hidden conditional random field model for sports event classification," in Proc. IEEE 12th Int. Conf. Comput. Vis. Workshops, Sep.–Oct. 2009, pp. 562–569.
[28] J. Zhou and X.-P. Zhang, "Hidden Markov model framework using independent component analysis mixture model," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2006, vol. 5, pp. V:553–V:556.
[29] B. T. Morris and M. M. Trivedi, "Trajectory learning for activity understanding: Unsupervised, multilevel, and long-term adaptive approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2287–2301, Nov. 2011.

[30] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 3361–3368.
[31] M. Hasan and A. K. Roy-Chowdhury, "A continuous learning framework for activity recognition using deep hybrid feature models," IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1909–1922, Nov. 2015.
[32] G. Gkioxari and J. Malik, "Finding action tubes," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 759–768.
[33] Y. Song, L.-P. Morency, and R. Davis, "Action recognition by hierarchical sequence summarization," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2013, pp. 3562–3569.
[34] Q. Ke, M. Bennamoun, S. An, F. Bossaid, and F. Sohel, "Spatial, structural and temporal feature learning for human interaction prediction," CoRR, 2016. [Online]. Available: http://arxiv.org/abs/1608.05267
[35] L. Wang, T. Liu, G. Wang, K. L. Chan, and Q. Yang, "Video tracking using learned hierarchical features," IEEE Trans. Image Process., vol. 24, no. 4, pp. 1424–1435, Apr. 2015.
[36] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, vol. 2, pp. 2169–2178.
[37] C. Russell et al., "Associative hierarchical CRFs for object class image segmentation," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 739–746.
[38] Q. Huang, M. Han, B. Wu, and S. Ioffe, "A hierarchical conditional random field model for labeling and segmenting images of street scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 1953–1960.
[39] X. Wang and Q. Ji, "Video event recognition with deep hierarchical context model," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 4418–4427.
[40] S. Yang, C. Yuan, B. Wu, W. Hu, and F. Wang, "Multi-feature max-margin hierarchical Bayesian model for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 1610–1618.
[41] N. T. Nguyen, D. Q. Phung, S. Venkatesh, and H. Bui, "Learning and detecting activities from movement trajectories using the hierarchical hidden


Markov model," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2005, vol. 2, pp. 955–960.
[42] L. Zhang, Z. Zeng, and Q. Ji, "Probabilistic image modeling with an extended chain graph for human activity recognition and image segmentation," IEEE Trans. Image Process., vol. 20, no. 9, pp. 2401–2413, Sep. 2011.
[43] Y. Kong, Y. Jia, and Y. Fu, "Interactive phrases: Semantic descriptions for human interaction recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 9, pp. 1775–1788, Sep. 2014.
[44] Y. Kong, Y. Jia, and Y. Fu, "Learning human interaction by interactive phrases," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 300–313.
[45] T. Lan, T.-C. Chen, and S. Savarese, "A hierarchical representation for future action prediction," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 689–704.
[46] M. Brand, N. Oliver, and A. Pentland, "Coupled hidden Markov models for complex action recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 1997, pp. 994–999.
[47] P. Guo, Z. Miao, X.-P. Zhang, Y. Shen, and S. Wang, "Coupled observation decomposed hidden Markov model for multiperson activity recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 9, pp. 1306–1320, Sep. 2012.
[48] A. Gupta, A. Kembhavi, and L. S. Davis, "Observing human-object interactions: Using spatial and functional compatibility for recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp. 1775–1789, Oct. 2009.
[49] D. I. Kosmopoulos, N. D. Doulamis, and A. S. Voulodimos, "Bayesian filter based behavior recognition in workflows allowing for user feedback," Comput. Vis. Image Understand., vol. 116, no. 3, pp. 422–434, 2012.
[50] Y. Kong, D. Kit, and Y. Fu, "A discriminative model with multiple temporal scales for action prediction," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 596–611.
[51] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori, "Discriminative latent models for recognizing contextual group activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1549–1562, Aug. 2012.
[52] C. Liu et al., "Convolutional neural random fields for action recognition," Pattern Recog., vol. 59, pp. 213–224, 2016.
[53] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 2625–2634.
[54] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2011, pp. 3169–3176.
[55] M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A spatio-temporal maximum average correlation height filter for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2008, pp. 1–8.
[56] "CASIA Action Database," 2010. [Online]. Available: http://www.cbsr.ia.ac.cn/english/Action%20Databases%20EN.asp
[57] Y. Kong and Y. Fu, "Modeling supporting regions for close human interaction recognition," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 29–44.
[58] A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2008, pp. 1–8.
[59] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in Proc. Brit. Mach. Vis. Conf., 2009, pp. 124-1–124-11.
[60] A. Klaser, M. Marszałek, I. Laptev, and C. Schmid, "Will person detection help bag-of-features action recognition?" 2010.
[61] A. Prest, V. Ferrari, and C. Schmid, "Explicit modeling of human-object interactions in realistic videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp. 835–848, Apr. 2013.
[62] A. Klaser, M. Marszałek, and C. Schmid, "A spatio-temporal descriptor based on 3D-gradients," in Proc. 19th Brit. Mach. Vis. Conf., 2008, pp. 275-1–275-10.

Wanru Xu received the B.E. degree from Beijing Jiaotong University, Beijing, China, in 2011, and is currently working toward the Ph.D. degree at Beijing Jiaotong University, Beijing, China.
Her research interests include pattern recognition, machine learning, and computer vision. Her current research mainly focuses on human action recognition for video surveillance and content-based video retrieval.

Zhenjiang Miao (A'11–M'13) received the B.E. degree from Tsinghua University, Beijing, China, in 1987, and the M.E. and Ph.D. degrees from Northern Jiaotong University, Beijing, China, in 1990 and 1994, respectively.
From 1995 to 1998, he was a Postdoctoral Fellow with the Ecole Nationale Supérieure d'Electrotechnique, d'Electronique, d'Informatique, d'Hydraulique et des Télécommunications, Institut National Polytechnique de Toulouse, Toulouse, France, and a Researcher with the Institut National de la Recherche Agronomique, Sophia Antipolis, France. From 1998 to 2004, he was with the Institute of Information Technology, National Research Council Canada, Nortel Networks, Ottawa, ON, Canada. He joined Beijing Jiaotong University, Beijing, China, in 2004. He is currently a Professor, the Director of the Media Computing Center, Beijing Jiaotong University, and the Director of the Institute for Digital Culture Research, Center for Ethnic & Folk Literature & Art Development, Ministry of Culture, Beijing, China. His current research interests include image and video processing, multimedia processing, and intelligent human–machine interaction.

Xiao-Ping Zhang (M'95–SM'06) received the B.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1992 and 1996, respectively, and the MBA degree in finance, economics, and entrepreneurship (honors) from the University of Chicago Booth School of Business, Chicago, IL, USA, in 2008.
Since Fall 2000, he has been with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada, where he is currently a Professor and the Director of the Communication and Signal Processing Applications Laboratory. He previously served as the Program Director of Graduate Studies. He is cross-appointed to the Finance Department, Ted Rogers School of Management, Ryerson University, Toronto, ON, Canada. He was a Visiting Scientist with the Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA, in 2015. He is a frequent consultant for biotech companies and investment firms. He is the cofounder and the CEO of EidoSearch, an Ontario-based company offering a content-based search and analysis engine for financial big data. His research interests include image and multimedia content analysis, statistical signal processing, sensor networks and electronic systems, machine learning, and applications in big data, finance, and marketing.
Prof. Zhang is a registered Professional Engineer in ON, Canada, and a Member of the Beta Gamma Sigma Honor Society. He is the General Co-Chair for ICASSP 2021. He is an elected Member of the ICME Steering Committee. He is the General Chair for MMSP'15. He is the Publicity Chair for ICME'06 and Program Chair for ICIC'05 and ICIC'10. He served as a Guest Editor of Multimedia Tools and Applications and the International Journal of Semantic Computing. He is a Tutorial Speaker at ACMMM 2011, ISCAS 2013, ICIP 2013, and ICASSP 2014. He is/was an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE SIGNAL PROCESSING LETTERS, and the Journal of Multimedia.

Yi Tian received the B.E. degree in 2011 from Beijing Jiaotong University, Beijing, China, where she is currently working toward the Ph.D. degree.
Her research interests include pattern recognition, machine learning, and computer vision. Her current research mainly focuses on human action recognition for video surveillance and human identification.