
    Computer Vision and Image Understanding 117 (2013) 587–602

    Contents lists available at SciVerse ScienceDirect

    Computer Vision and Image Understanding

    journal homepage: www.elsevier.com/locate/cviu

    View invariant action recognition using weighted fundamental ratios ☆

    Nazim Ashraf a,*, Yuping Shen a,b, Xiaochun Cao a,c, Hasan Foroosh a,d,1

    a College of Engineering and Computer Science, Computational Imaging Lab, University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816, USA
    b Advanced Micro Devices, Quadrangle Blvd., Orlando, FL 32817, USA
    c State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
    d Department of EECS, Computational Imaging Laboratory, University of Central Florida, Orlando, FL 32816, USA

    Article info

    Article history:
    Received 10 February 2012
    Accepted 19 January 2013
    Available online 5 February 2013

    Keywords:
    View invariance
    Pose transition
    Action recognition
    Action alignment
    Fundamental ratios

    Abstract

    In this paper, we fully investigate the concept of fundamental ratios, demonstrate their application and significance in view-invariant action recognition, and explore the importance of different body parts in action recognition. A moving plane observed by a fixed camera induces a fundamental matrix F between two frames, where the ratios among the elements in the upper left 2 × 2 submatrix are herein referred to as the fundamental ratios. We show that fundamental ratios are invariant to camera internal parameters and orientation, and hence can be used to identify similar motions of line segments from varying viewpoints. By representing the human body as a set of points, we decompose a body posture into a set of line segments. The similarity between two actions is therefore measured by the motion of line segments and hence by their associated fundamental ratios. We further investigate to what extent a body part plays a role in recognition of different actions and propose a generic method of assigning weights to different body points. Experiments are performed on three categories of data: the controlled CMU MoCap dataset, the partially controlled IXMAS data, and the more challenging uncontrolled UCF-CIL dataset collected on the internet. Extensive experiments are reported on testing (i) view-invariance, (ii) robustness to noisy localization of body points, (iii) effect of assigning different weights to different body points, (iv) effect of partial occlusion on recognition accuracy, and (v) determining how soon our method recognizes an action correctly from the starting point of the query video.

    © 2013 Elsevier Inc. All rights reserved.

    ☆ This paper has been recommended for acceptance by J.K. Aggarwal.
    * Corresponding author.
    E-mail address: (N. Ashraf).
    1 Assistant Professor, Department of Computer Science, FC College (A Chartered University).

    1077-3142/$ - see front matter © 2013 Elsevier Inc. All rights reserved.

    1. Introduction

    The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, human–computer interaction (HCI), ergonomics, etc. In this paper, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed in action recognition include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies [1]. The literature in human action recognition has been extremely active in the past two decades and significant progress has been made in this area [2–5]. Action can be regarded as a collection of 4D space–time data observed by a perspective video camera. Due to image projection, the 3D Euclidean information is lost and projectively distorted, which makes action recognition rather challenging, especially for varying viewpoints and different camera parameters. Another source of challenge is the irregularities of human actions due to a variety of factors such as age, gender, circumstances, etc. The timeline of action is another important issue in action recognition. The execution rates of the same action in different videos may vary for different actors or due to different camera frame rates. Therefore, the mapping between same actions in different videos is usually highly non-linear.

    To tackle these issues, often simplifying assumptions are made by researchers on one or more of the following aspects: (1) camera model, such as scaled orthographic [6] or calibrated perspective camera [7]; (2) camera pose, i.e. little or no viewpoint variations; (3) anatomy, such as isometry [8], coplanarity of a subset of body points [8], etc. Human action recognition methods start by assuming a model of the human body, e.g. silhouette, body points, stick model, etc., and build algorithms that use the adopted model to recognize body pose and its motion over time. Space–time features are essentially the primitives that are used for recognizing actions, e.g. photometric features such as the optical flow [9–11] and the local space–time features [12,13]. These photometric features can be affected by luminance variations due to, for instance, camera zoom or pose changes, and often work better when the motion is small or incremental. On the other hand, salient geometric features such as silhouettes [14–18] and point sets [8,19] are less sensitive to photometric variations, but require reliable tracking. Silhouettes

    are usually stacked in time as 2D [16] or 3D objects [14,18], while point sets are tracked in time to form space–time curves. Some existing approaches are also more holistic and rely on machine learning techniques, e.g. HMM [20], SVM [12], etc. As in most exemplar-based methods, they rely on the completeness of the training data, and achieving view-invariance with them is usually expensive, since a model has to be learned from a large dataset.

    ettes of input videos into space–time objects, and extract features in

    1.2. Our approach
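As a rough illustration of the fundamental ratios defined in the abstract (the ratios among the elements of the upper left 2 × 2 submatrix of F), the sketch below extracts that block from a 3 × 3 fundamental matrix and rescales it. The normalization by the Frobenius norm of the absolute values, and the names `fundamental_ratios` and `ratio_distance`, are assumptions made for this example and are not taken from the paper. The sketch only shows that the ratios do not depend on the arbitrary scale and sign of F; it does not demonstrate the invariance to camera parameters claimed in the abstract.

```python
import math


def fundamental_ratios(F):
    """Return the upper-left 2x2 block of F as a unit-norm vector of 4 values.

    F is a 3x3 fundamental matrix given as nested lists. F is only defined
    up to a non-zero scale factor, so only ratios among its entries are
    meaningful. Dividing the absolute values of the block by the block's
    Frobenius norm is one assumed way to make those ratios comparable.
    """
    block = [abs(F[i][j]) for i in range(2) for j in range(2)]
    norm = math.sqrt(sum(v * v for v in block))
    if norm == 0.0:
        raise ValueError("upper-left 2x2 block of F is all zeros")
    return [v / norm for v in block]


def ratio_distance(F_a, F_b):
    """Euclidean distance between two ratio vectors (hypothetical score)."""
    r_a = fundamental_ratios(F_a)
    r_b = fundamental_ratios(F_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(r_a, r_b)))


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    F = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    # The same matrix up to scale and sign should give the same ratios.
    G = [[-3.0 * v for v in row] for row in F]
    print(fundamental_ratios(F))
    print(ratio_distance(F, G))
```

Working with a normalized vector rather than explicit quotients such as F[0][0] / F[0][1] avoids division by entries that may be close to zero.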

    1.1. Previous work on view-invariance

    Most action recognition methods adopt simplified camera models and assume a fixed viewpoint or simply ignore the effect of viewpoint changes. However, in practical applications such as HCI and surveillance, actions may be viewed from different angles by different perspective cameras. Therefore, a reliable action recognition system has to be invariant to the camera parameters or viewpoint changes. View-invariance is, thus, of great importance in action recognition, and has started receiving more attention in recent literature.

    One approach to tackle view-invariant action recognition has been based on using multiple cameras [20–22,7]. Campbell et al. [23] use stereo images to recover a 3D Euclidean model of the human subject, and extract view invariance for 3D gesture recognition; Weinland et al. [7] use multiple calibrated and background-subtracted cameras, obtain a visual hull for each pose from multi-view silhouettes, and stack them as a motion history volume, based on which Fourier descriptors are computed to represent actions. Ahmad et al. [20] build HMMs on optical flow and human body shape features from multiple views, and feed a test video sequence to all learned HMMs. These methods require the setup of multiple cameras, which is quite expensive and restrictive in many situations such as online video broadcast, HCI, or monocular surveillance.

    A second line of research is based on a single camera and is motivated by the idea of exploiting the invariants associated with a given camera model, e.g. affine or projective. For instance, Rao et al. [24] assume an affine camera model, and use dynamic instants, i.e. the maxima in the space–time curvature of the hand trajectory, to characterize hand actions. The limitation of this representation is that dynamic instants may not always exist or may not always be preserved from 3D to 2D due to perspective effects. Moreover, the affine camera model is restrictive in most practical scenarios. A more recent work reported by Parameswaran et al. [8] relaxes the restrictions on the camera model. They propose a quasi-view-invariant 2D approach for human action representation and recognition, which relies on the number of invariants in a given configuration of body points. Thus a set of projective invariants is extracted from the frames and used as the action representation. However, in order to make the problem tractable under variable dynamics of actions, they introduced heuristics and made simplifying assumptions such as isometry of human body parts. Moreover, they require that at least five body points form a 3D plane or that the limbs trace a planar area during the course of an action. Ref. [25] describes a method to improve discrimination by inferring and then using latent discriminative aspect parameters. Another interesting approach to tackle unknown views has been suggested in [26], which uses virtual views connecting the action descriptors extracted from the source view to those extracted from the target view. Another interesting approach is [27], which uses a bag of visual words to model an action and presents promising results.

    Another promising approach is based on exploiting the multi-view geometry. Two subjects in the same exact body posture viewed by two different cameras at different viewing angles can be regarded as related by the epipolar geometry. Therefore, corresponding poses in two videos of actions are constrained by the associated fundamental matrices, providing thus a way to match poses and actio