
Journal of the Institute of Industrial Applications Engineers Vol.5, No.2, pp.71–78, (2017.4.25) DOI: 10.12792/JIIAE.5.71 Online edition: ISSN 2187-8811 Print edition: ISSN 2188-1758

Paper

A Study on Recognition of Two-Person Interactions Based on Concatenated History Images

Tanmoy Paul∗† Non-member, Ummul Afia Shammi‡ Non-member
Seiichi Serikawa Member, Md Atiqur Rahman Ahad‡ Non-member

(Received December 28, 2016, revised March 2, 2017)

Abstract: Computer vision applications such as visual surveillance, video retrieval, and human-computer interaction are increasing day by day. Human action recognition is one such vital research area within computer vision. In recent years, many datasets dedicated to human action recognition have been created, accompanied by an equally large number of recognition techniques. In this paper, we propose a technique to recognize two-person interactions based on a benchmark dataset from the University of Texas at Austin called the UT interaction dataset. The task is challenging due to variations in motion, dynamic backgrounds, recording settings and inter-personal differences. Our technique builds concatenated Motion History Images (cMHI). The produced cMHI templates are combined with Local Binary Pattern (LBP) and Histogram of Oriented Gradient (HOG) descriptors to develop a feature vector for each class. Finally, a Support Vector Machine (SVM) is exploited to classify the actions.

1. Introduction

Human-computer interaction is one of the important fields in computer vision, and it needs broader study for developing more user-friendly human-computer systems [3]. The action recognition process involves extracting data from a video and feeding it to a classifier to detect the final action. Despite the large advances in this arena, it remains difficult to recognize actions in complicated datasets.

Recognizing human actions is complicated by many factors. Variation between people, and even within a single person's performances, can be a major factor in recognizing actions [4]. Recognition is also complicated by camera movement. So, for accurate action recognition, the background needs to be removed, either explicitly or implicitly, and background motion should be neglected [4].

Previous research mostly involves recognizing simple one-person actions like walking, jogging, hand clapping, etc. [5] [6]. In practice, such single-person activities rarely occur in isolation in public; human activities usually involve multiple persons. Recognition of multi-person activities (e.g., punching, pushing, and handshaking) is therefore necessary for several applications, such as automatic detection of violent activities in intelligent surveillance systems [7].

The objective of this research is to find a strategy to detect two-person interactions efficiently. To detect the existence and direction of motion in a frame, we have exploited a template matching approach called the Motion History Image (MHI).

∗ Corresponding author: [email protected]
‡ Department of Electrical and Electronic Engineering, University of Dhaka, Bangladesh

This technique has been used before in numerous cases for single-person action recognition [1] [2]. In non-periodic actions, the starting frame, interaction frame, and ending frame may not be clearly defined [7]. So, in this paper, we propose an approach to recognize the actual frame where the interaction between two persons takes place. Our proposal introduces a modified version of MHI, since the basic MHI fails to generate appreciable outcomes here [2].

Once the frame of interaction is detected, instead of taking the MHI of all the frames from start to end, only a chosen number of frames both before and after the interaction is taken. For accurate detection of the action, Local Binary Pattern (LBP) and Histogram of Oriented Gradient (HOG) features have been used. For training and testing, a Support Vector Machine (SVM) has been employed as the classifier. The rest of this paper is organized as follows: Section 2 covers previous work. Section 3 describes our proposed cMHI method. Experimental results and analysis are covered in Section 4. Section 5 concludes the paper with a few future research directions.

2. Related Works

Existing datasets cover various actions, most of them composed from 2005 onward [8]. It is very difficult to recognize actions in some of these datasets, so researchers have used many different methods and algorithms for this purpose. In order to recognize the actions, a satisfying feature descriptor and a stable classifier are needed.

Bobick and Davis [9] have developed the Motion Energy Image (MEI) and Motion History Image (MHI) templates, a template matching approach. It is a motion-based method that indicates where motion occurred and how the



object is moving. They present a recognition method which automatically performs temporal segmentation and runs in real time.

Blank et al. [7] have formed a space-time volume in (x, y, t) space. Using features from Poisson equations and geometric surface properties of the 3D volumes, the actions are recognized. Action and human poses were merged together in the approach by Yang et al. [11], who treated poses as latent variables to infer the action label in still images.

For recognizing actions in uncontrolled situations, Tan et al. [12] have proposed Local Ternary Patterns (LTP) for feature representation. Wang et al. [13] have used depth cameras for action recognition and introduced a new feature type called the Local Occupancy Pattern.

Weinland et al. [14] have proposed a 3D version of the temporal template, captured using multiple cameras; for classification, they utilized Fourier analysis in cylindrical coordinates. Dalal and Triggs [15] have proposed an algorithm which consists of overlapping spatial blocks, HOG feature descriptors and an SVM classifier. Fathi and Mori [16] have developed mid-level motion features built from optical flow vectors; using a variant of AdaBoost, local regions of the image sequence are selected to form these motion features.

In recent years, many researchers have approached two-person action recognition with a variety of methods. Motivated by the numerous successes of human activity recognition methods using bag-of-words, Slimani et al. [17] have introduced a framework based on the concept of co-occurring visual words that can describe interactions among several subjects. They build a 3D spatio-temporal volume (x, y, t) for each interacting person, and a set of visual words is extracted to represent the corresponding activity. Each interaction is then represented by its own unique frequency of co-occurring visual words between two persons.

Waltisberg et al. [18] have derived probabilistic action labels from the Hough-voting framework for action recognition in low-resolution video. They consider optical flow and gradients as low-level features; 3D feature patches are created and random trees are exploited for recognition.

Park et al. [19] have described a framework exploiting three levels of abstraction. In the low level, they use individual Bayesian networks for recognizing body parts (e.g., head, torso, arms). The overall body pose is then obtained by integrating these Bayesian networks. Actions of a subject are described by a dynamic Bayesian network (DBN) in the mid level. Finally, these descriptions are matched to recognize an interaction between two subjects.

Yun et al. [20] have proposed an interaction dataset. Based on this dataset, they considered various features for recognition, including geometric relational body-pose features based on the distances between all pairs of joints, and achieved good results. They used Multiple Instance Learning (MIL) for classification.

3. Method

The purpose of this paper is to find a method to recognize different actions taking place between two persons. One way of detecting the motion in a frame is the template matching approach named MHI, which generates unique features for different actions in different directions. LBP is then used to cancel out unnecessary information, and HOG is applied to extract features from the LBP-manipulated MHI. Finally, an SVM is exploited for classification. The accuracy of such experiments depends largely on the proper choice of feature extractor, classifier and, of course, the MHI upon which the preceding two work. For a better MHI, at first the frame where the interaction between the two persons takes place is sorted out. Then two MHI templates are formed, one from the first frame up to the frame of interaction and the other from the following frame up to the last frame. The frame that has the maximum number of motion pixels forming a cluster is taken as the frame where the interaction takes place. These two MHIs are then concatenated for further manipulation. For a better illustration of the whole process, a block diagram is given in Fig. 1.

3.1 Motion History Image (MHI) Concept In forming an MHI, a frame subtraction method is usually used. By subtracting two neighboring, consecutive images we obtain an image containing the moving parts of the object; this difference image is then converted into a binary image. By layering the successive binary images, the Motion History Image is produced. In an MHI, more recently moving pixels are brighter while older movement regions become darker, and the result is a scalar-valued image.

While computing the MHI, as shown in the equation below, we need to subtract two consecutive frames or perform background subtraction in order to determine any changes between frames. This is called the update function and is denoted by Ψ(x, y, t); it indicates the presence of a motion cue. The parameter τ and the decay parameter δ play a very significant role in representing actions: τ decides the temporal duration of the MHI, and δ is subtracted from a pixel's prior value at time t − 1 if there is no motion at that pixel at time t. So, we can compute the MHI using the following equation [2]:

Hτ(x, y, t) = τ if Ψ(x, y, t) = 1, and max(0, Hτ(x, y, t − 1) − δ) otherwise    (1)

Here (x, y) is the pixel position and t is the time index over consecutive frames. The parameter δ can have a value of 1 or more based on the speed of the action and can be set empirically. The final MHI image thus holds the motion history of all the frames and is computed as Hτ(x, y, t).
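As an illustration, the update in Eq. (1) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the helper names, the use of absolute frame differencing with a fixed threshold as the update function Ψ, and the default parameter values (τ = 255, δ = 2, threshold 30, taken from the settings reported in Section 4) are our assumptions.

```python
import cv2
import numpy as np

def update_mhi(prev_frame, curr_frame, mhi, tau=255.0, delta=2.0, diff_thresh=30):
    """One step of the MHI update of Eq. (1).

    Psi(x, y, t) = 1 where the absolute frame difference exceeds diff_thresh;
    those pixels are set to tau, all others decay by delta (floored at 0).
    """
    diff = cv2.absdiff(curr_frame, prev_frame)   # update function Psi via frame differencing
    motion_mask = diff > diff_thresh             # Psi(x, y, t) = 1
    return np.where(motion_mask, tau, np.maximum(mhi - delta, 0.0))

def compute_mhi(gray_frames, tau=255.0, delta=2.0, diff_thresh=30):
    """Accumulate the MHI over a sequence of grayscale frames."""
    mhi = np.zeros(gray_frames[0].shape, dtype=np.float32)
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        mhi = update_mhi(prev, curr, mhi, tau, delta, diff_thresh)
    return mhi
```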



Figure 1: Block diagram of our proposed method

Figure 2: MHI and concatenated-MHI of UT interaction dataset for the six actions

In our proposed method, however, we use a concatenated MHI (cMHI). In the beginning, the frame where the interaction between the persons takes place is detected. When there is no motion in a frame, the template contains no pixel with the value 255; with the start of motion, moving pixels are assigned this value. When the interaction takes place, a cluster is formed which contains the maximum number of 255-valued and nonzero pixels in the image template, and that particular frame is taken as the frame of interest. One MHI is then formed from the first frame up to this frame of interest, and another MHI from the following frame up to the last one. These two MHI templates are concatenated to form a single image. Fig. 2 shows the MHI and the concatenated MHI of the six actions (punch, push, kick, handshake, point, and hug) in the UT interaction dataset.
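A possible sketch of the interaction-frame detection and concatenation described above, reusing compute_mhi from the previous sketch. Treating the "single cluster of motion pixels" as the largest connected component of the thresholded frame difference is our interpretation of the criterion, and we assume the detected frame is not the last frame of the video.

```python
def find_interaction_frame(gray_frames, diff_thresh=30):
    """Return the index of the frame whose motion pixels form the largest single cluster."""
    best_idx, best_area = 1, 0
    for i in range(1, len(gray_frames)):
        diff = cv2.absdiff(gray_frames[i], gray_frames[i - 1])
        mask = (diff > diff_thresh).astype(np.uint8)
        # connected components of the motion mask; label 0 is the background
        num, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        if num > 1:
            area = stats[1:, cv2.CC_STAT_AREA].max()
            if area > best_area:
                best_idx, best_area = i, area
    return best_idx

def concatenated_mhi(gray_frames, tau=255.0, delta=2.0, diff_thresh=30):
    """cMHI: one MHI up to the interaction frame, one for the rest, joined side by side."""
    t = find_interaction_frame(gray_frames, diff_thresh)
    before = compute_mhi(gray_frames[:t + 1], tau, delta, diff_thresh)
    after = compute_mhi(gray_frames[t + 1:], tau, delta, diff_thresh)
    return np.hstack([before, after])
```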

3.2 Local Binary Pattern (LBP) LBP is a widely used mechanism that summarizes the local structure of an image. LBP has good tolerance to monotonic illumination changes, which is one of its most important properties [21]. The computational method of LBP is quite simple: it labels the pixels of an image with numbers known as LBP codes, so that the local structure around each pixel is encoded into a new value. A 3 × 3 neighborhood is formed around every pixel, and each neighbor is compared with the center pixel value. If the difference is negative it is encoded as 0, otherwise as 1. These binary codes are then concatenated in a clockwise direction to form a binary number [21]; the derived binary numbers are called Local Binary Patterns. The LBP code is computed from the joint distribution of pixel values in the neighborhood based on the following equation [22]∼[24]:

LBPP,R(xc, yc) = Σp=0…P−1 s(gp − gc) 2^p    (2)



Table 1: Confusion matrix of UT outdoor dataset#1 using MHI-LBP-HOG-SVM method

Pointing Handshaking Pushing Punching Kicking Hugging Unrecognized

Pointing 90% 10%

Handshaking 35% 15% 10% 40%
Pushing 35% 15% 10% 50%
Punching 5% 35% 30% 30%
Kicking 20% 10% 30% 40%
Hugging 30% 10% 40% 20%

Total accuracy 40.83%

Here, from a gray-scale image, we can compute an LBP image which is also a gray-scale image. gc denotes the gray level of the center pixel (xc, yc), i.e., gc = I(xc, yc). The signs of the differences in a neighborhood are represented as a P-bit binary number, giving 2^P possible values for the LBP code. gp denotes the gray value of a point in a circular neighborhood of P sampling points and radius R around the pixel (xc, yc), so gp = I(xp, yp), and s is the thresholding step function.
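For illustration, the LBP codes of Eq. (2) can be obtained with scikit-image. Applying it to the cMHI template from the earlier sketch, and the choice P = 8, R = 1, are our assumptions.

```python
from skimage.feature import local_binary_pattern

# Eq. (2): P sampling points on a circle of radius R around each pixel.
# The cMHI template is treated as an ordinary gray-scale image.
P, R = 8, 1
lbp_image = local_binary_pattern(cmhi_template, P, R, method='default')
```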

3.3 Histogram of Oriented Gradient (HOG) and Support Vector Machine (SVM) Histogram of Oriented Gradient (HOG) is basically a feature descriptor. It counts the occurrences of gradient orientations in small patches of an image [25]. To compute HOG, the image is first divided into small connected regions called cells, and a histogram of gradients within each cell is computed. A set of block histograms then represents the descriptor. According to Ref. [15], some recommended HOG parameters are: a 1D centered derivative mask [−1, 0, +1], a detection window size of 64 × 128, a cell size of 8 × 8 pixels, and a block size of 16 × 16 pixels (2 × 2 cells). These may vary in other settings.
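A sketch of HOG extraction with scikit-image, using the parameters quoted from Ref. [15] (8 × 8 cells, 2 × 2-cell blocks). Resizing the LBP image to the 64 × 128 detection window and the 9 orientation bins are assumptions on our part.

```python
import cv2
from skimage.feature import hog

resized = cv2.resize(lbp_image.astype('float32'), (64, 128))   # width x height = 64 x 128
hog_features = hog(resized,
                   orientations=9,             # 9 bins, as in Dalal and Triggs [15]
                   pixels_per_cell=(8, 8),     # 8 x 8 cells
                   cells_per_block=(2, 2),     # 16 x 16-pixel blocks
                   block_norm='L2-Hys')
```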

In this work, we exploited the Support Vector Machine (SVM) for classification. SVM is a widely used classifier. To separate different classes it constructs a hyperplane in a multidimensional space. SVM needs two types of data: a training set and a testing set. The training data are already labeled, and the SVM predicts the labels of the unlabeled testing set. Usually datasets have several features, so each sample is plotted as a point in space where the value of each feature is the coordinate along one dimension [26].
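A minimal scikit-learn sketch of this training/testing step. The linear kernel and the variable names (X_train, y_train, X_test) are illustrative assumptions, since the paper does not specify the kernel used.

```python
import numpy as np
from sklearn.svm import SVC

# X_train: HOG feature vectors of the training videos (one row per video)
# y_train: corresponding action labels; X_test: feature vectors of the test videos
clf = SVC(kernel='linear')          # kernel choice is an assumption
clf.fit(np.asarray(X_train), np.asarray(y_train))
predicted_actions = clf.predict(np.asarray(X_test))
```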

4. Results and Analysis

The University of Texas (UT) interaction dataset is a standard dataset for two-person actions. There are six types of actions. These are:

1. Pointing

2. Pushing

3. Punching

4. Kicking

5. Hugging

6. Handshaking

All of these actions involve two persons and take place in an outdoor environment. There are 54 training videos in total, which means 9 video sequences for each action, and 6 test videos. Fig. 3 shows some sample frames of the UT interaction dataset.

Each action is carried out by different persons in different videos. As a result, the inevitable challenge of variations in colors is encountered.

4.1 Experimental Result of UT Interaction Dataset

Method 1: Using MHI-LBP-HOG-SVM
An MHI template for each action sequence has been extracted with parameter values τ = 255 and δ = 2, and the threshold set at 30. The threshold value of 30 keeps the noise low while retaining the important action information. After creating the Motion History Image (MHI) templates, these have been converted into Local Binary Pattern (LBP) codes, and then Histogram of Oriented Gradient (HOG) features have been extracted. After that, a Support Vector Machine (SVM) is used for classification. Nine samples for each action have been used to train the classifier and six test samples for each class of action. A confusion matrix allows visualization of the performance of an algorithm and is shown in Table 1. The total accuracy of this method is only 40.83%. In 31.6% of the cases, this method failed to identify any class.
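Putting the pieces from Section 3 together, the feature-extraction pipeline of Method 1 (and, with the cMHI template substituted, Method 2) can be sketched as below. This composition reuses the helper functions from the earlier sketches and is our reading of the description, not the authors' implementation.

```python
def extract_feature_vector(gray_frames, use_cmhi=False):
    """MHI (or cMHI) -> LBP -> HOG feature vector for one video (tau=255, delta=2, threshold 30)."""
    template = concatenated_mhi(gray_frames) if use_cmhi else compute_mhi(gray_frames)
    lbp = local_binary_pattern(template, 8, 1, method='default')
    resized = cv2.resize(lbp.astype('float32'), (64, 128))
    return hog(resized, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# X_train = [extract_feature_vector(v) for v in training_videos]
# followed by the SVM training shown in Section 3.3.
```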

Method 2: Using Concatenated MHI-LBP-HOG-SVM
In this cMHI method, the frame where the interaction between the persons takes place is first sorted out. When there is no motion in a frame, it contains no pixel with the value 255; those pixels remain 0. With the start of motion, moving pixels are assigned the value 255. Although in the absence of motion pixel values are decreased by the decay parameter δ, any nonzero value indicates that motion existed at that pixel at some time up to the current frame. Before the interaction, the nonzero-valued pixels are scattered across the image. In the frame of interaction both persons reach each other from the two sides of the frame, and all the scattered nonzero-valued pixels form a single block. The formation of a single block of nonzero-valued pixels in the image, rather than scattered and discrete ones, therefore indicates an interaction, and that particular frame is considered the one where the interaction takes place.



Figure 3: Sample frames of UT interaction dataset

Figure 4: Variation in the action: pushing

Then one MHI is formed from the first frame up to this frame of interest, and another MHI is created from the following frame up to the last one. These two MHI templates are concatenated to form a single image. The concatenated MHI holds much more specific and clearer information than a conventional MHI. For example, with the conventional method the MHIs for punching and pushing are almost similar, which leads to ambiguous outcomes, whereas the concatenated MHI provides two different images, giving additional data to work on. As a result, even if the body movements after the interaction are similar in both actions, a clear distinction prior to the interaction is observed. Once we have the concatenated MHI templates, we encode the LBP codes from them. Afterwards, HOG features are extracted from the LBP codes. Finally, for classification, we employ the SVM classifier.



Table 2: Confusion matrix of UT outdoor dataset#1 using concatenated MHI-LBP-HOG-SVM method

Pointing Handshaking Pushing Punching Kicking Hugging Unrecognized

Pointing 100% 10%
Handshaking 60% 10% 10% 20%
Pushing 60% 20% 20%
Punching 10% 20% 40% 30%
Kicking 50% 50%
Hugging 20% 70% 10%

Total accuracy 63.33%

Figure 5: Demonstration of True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN). Here, the left column with the 'triangle' demonstrates the relevant classes, whereas the classes within the full circle depict the selected classes. Precision and recall are computed from these four parameters.

The confusion matrix of this method is shown in Table 2.

Here, we also compute the precision (positive predictive value) as well as the recall (sensitivity) for the results of the concatenated MHI-LBP-HOG-SVM method. Precision is the fraction of recognized actions that are relevant, while recall (sensitivity) is the fraction of relevant instances that are correctly retrieved. From Fig. 5, we can compute the precision and recall:

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

Based on the recognized action classes for different persons, we found the precision to be 0.87 and the corresponding recall to be 0.83. We now compute the F-score (also called the F-measure or F1 score), which is the harmonic mean of precision and recall:

F1 = 2 / (1/recall + 1/precision) = 2 · (precision · recall) / (precision + recall)    (5)

Table 3: Comparison of average accuracy on the UT outdoor dataset#1

Methods Average Accuracy

Bag-of-words (BoW) 58.20%
Co-occurrence of visual words 40.63%
Hough transform 88.00%
MHI-LBP-HOG-SVM method 40.83%

Proposed method 63.33%

Figure 6: Background motion in the action: pushing

The computed F1 score is 0.84 based on the recognized classes for this dataset. The best possible value is 1 and the worst is 0.
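For reference, precision, recall and the F1 score of Eqs. (3)–(5) can be computed directly from the true and predicted test labels with scikit-learn. Macro-averaging over the six classes is an assumption, as the paper does not state how the per-class values were aggregated.

```python
from sklearn.metrics import precision_recall_fscore_support

# y_true: ground-truth labels of the test videos, y_pred: SVM predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0)
```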

In Table 3, we present a comparison of the average accuracy of different methods with our proposed method on UT outdoor dataset#1. The Bag-of-words approach has an average accuracy of 58.20% [17], while the co-occurrence of visual words approach [17] yields 40.63%. The Hough-transform based method [18] has achieved 88% average accuracy, and our proposed method has an accuracy of 63.33%.

4.2 Discussion In order to recognize two-person interactions, we implemented two different types of MHI. In Method 1, the basic MHI of all the frames has been taken, and in Method 2, which we propose, we used a modified MHI based on the dominant frame (where the interaction takes place). The recognition rate of the second method is higher than that of the first because the second method provides a clearer and more detailed MHI, which is more convenient for the subsequent features and classifier to work on. The overall recognition accuracy, however, is not yet optimum: the first method yields 40.83% recognition accuracy, whereas the second yields 63.33%.

The accuracy is poor partly because there are a few videos in this dataset that contain background movements (e.g., Fig. 6). Those random motions in the background obstruct finding the exact frame where the interaction



Figure 7: No movement by the person receiving kick


Figure 8: Hand movement while pushing and punching lookalmost the same

takes place. In some videos, while one person is kicking or punching, the other person does not show any movement (for example, Fig. 7). This poses a problem in finding the actual frame of interaction. The hand movements of punching and pushing are quite similar with only a few subtle differences (Fig. 8), which is why most of the punching was recognized as pushing. In addition to all these issues, the number of training videos in this dataset is small, so even a few unrecognized or ambiguous cases make the recognition percentage fall sharply.

5. Conclusions and Future Work

We worked on the UT interaction dataset, which consists of 60 videos of six different actions. We used a template matching approach named Motion History Image (MHI), which captures motion information by subtracting consecutive frames. In the second approach, while forming the Motion History Image, our first target was to find the exact frame where the interaction takes place; in that particular frame, a cluster containing the maximum number of motion pixels is formed. Upon detecting that frame, we created two MHI templates, one from the first frame up to the frame of interaction and the other from the following frame to the last frame. These two images are then concatenated to form a single image. This final image, which holds two different MHIs, was then fed to the LBP, HOG and SVM stages. The proposed second method yields a better recognition percentage than the basic MHI-based method, although the recognition rate is reduced by some unwanted background motions.

Although our proposed method produced a better result than the first one, it has yet to reach an optimum accuracy level. In future work, we will try to improve the accuracy. We are currently working on finding the exact point where the interaction takes place; in that way, we will be able to locate the territory of movement of each person, which can help to form better motion templates. Reducing background and unwanted motion information will improve the recognition rate.

We will implement our proposed method on two-person interaction datasets where there is no random background movement. For a better recognition rate on the UT interaction dataset, we look forward to going through a few other existing strategies and coming up with a unique combination, with the necessary changes and modifications, for a satisfactory result.

References

[1] M. A. R. Ahad, “Computer Vision and Action Recognition:A Guide for Image Processing and Computer Vision Com-munity for Action Understanding”, Springer, 2011.

[2] M. A. R. Ahad, “Motion History Images for Action Recog-nition and Understanding”, Springer, 2013.

[3] Sonali, and A. K. Bathla, “Human Action Recognition usingSupport Vector Machine and K-Nearest Neighbor”, Interna-tional Journal of Engineering and Technical Research, 2015.

[4] Y. Lui, and J. Beveridge, “Tangent bundle for human actionrecognition”, IEEE Automatic Face and Gesture Recogni-tion, 2011.

[5] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R.Basri., “Actions as space-time shapes”, ICCV, 2005.

[6] J. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learningof human action categories using spatial-temporal words”,IJCV, 79(3):299–318, 2008.

[7] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning”, CVPR, 2012.

[8] J. M. Chaquet, E. J. Carmona, and A. F. Caballero, “A surveyof video datasets for human action and activity recognition”,2013.

[9] A. Bobick, and J. Davis, “An appearance-based representa-tion of action”, IEEE ICPR, 1996.

[10] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as Space-Time Shapes”, ICCV, 2005.

[11] W. Yang, Y. Wang, and G. Mori, “Recognizing Human Ac-tions from Still Images with Latent Poses”, CVPR, 2010.

[12] X. Tan and B. Triggs, “Enhanced Local Texture Feature Setsfor Face Recognition under Difficult Lighting Conditions”,Springer-Verlag Berlin Heidelberg, 2007.

[13] L. Wang, W. Hu, and T. Tan, “Recent developments in humanmotion analysis”, Pattern Recognition, 2003.

[14] D. Weinland, R. Ronfard, and E. Boyer, “Free Viewpoint Action Recognition Using Motion History Volumes”, Computer Vision and Image Understanding, 104, pp. 249-257, 2006.

[15] N. Dalal and B. Triggs, “Histograms of Oriented Gradientsfor Human Detection”, CVPR IEEE, pp. 886-893, 2005.

[16] A. Fathi, and G. Mori, “Action recognition by learning mid-level motion features”, CVPR IEEE, 2008.

[17] K. N. Slimani, Y. Benezeth, and F. Souami, “Human Inter-action Recognition Based on the Co-occurrence of VisualWords”, CVPR IEEE, 2014.



[18] D. Waltisberg, A. Yao, J. Gall, and L. V. Gool, “Variations of a Hough-Voting Action Recognition System”, Springer, 2010.

[19] S. Park, and J. K. Aggarwal, “Semantic-level Understand-ing Human Actions and Interactions using Event Hierarchy”,CVPR IEEE, 2004.

[20] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras, “Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning”, CVPR IEEE, 2012.

[21] D. Huang, C. Shan, M. Ardebilian, Y. Wang, and L. Chen,“Local Binary Patterns and Its Application to Facial ImageAnalysis: A Survey”, IEEE Transactions on Systems, Man,and Cybernetics, Part C (Applications and Reviews), vol. 41,no. 6, 2011.

[22] J. Han, and B. Bhanu, “Individual Recognition Using GaitEnergy Image”, IEEE Trans. PAMI, 2006.

[23] T. Maenpaa, “The Local Binary Pattern Approach to TextureAnalysis - Extensions and Applications”, IEEE Transactionson System, Man and Cybernetics, 2003.

[24] T. Maenpaa, and M. Pietikainen, “Texture Analysis with Lo-cal Binary Patterns”, Handbook of pattern recognition andComputer Vision, 3rd edition, World Scientific, Singapore,2005.

[25] HOG, https://en.wikipedia.org/wiki/Histogram_of_oriented_gradients, Access date: Nov., 2016.

[26] SVM, http://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/, Access date: Nov., 2016.

[27] M. A. R. Ahad, M. N. Islam, and I. Jahan, “Action recognition based on binary patterns of action-history and histogram of oriented gradient”, Journal on Multimodal User Interfaces, Springer, vol. 10, no. 4, pp. 335-344, 2016.

[28] R. M. Mueid, C. Ahmed, and M. A. R. Ahad, “Pedestrian Activity Classification Using Patterns of Motion and Histogram of Oriented Gradient”, Journal on Multimodal User Interfaces, Springer, pp. 1-7, 2015.

[29] M. A. R. Ahad, “Smart Approaches for Human Action Recognition”, Pattern Recognition Letters, Elsevier, Vol. 34, No. 15, pp. 1769-1770, 2013.

[30] M. A. R. Ahad, J. Tan, H. Kim, and S. Ishikawa, “Motion History Image: Its Variants and Applications”, Machine Vision and Applications, Springer, Vol. 23, No. 2, pp. 255-281, 2012.

Tanmoy Paul (Non-member) received his B.S. in Electrical and Electronic Engineering from the University of Dhaka in 2016. Currently, he is doing his M.S. in the same department. He is a member of a microcontroller and robotics enthusiast student group named “Electro-Surge”. He is interested in Computer Vision and Image Processing and looks forward to pursuing higher study in this field.

Ummul Afia Shammi (Non-member) received her B.S. in Electrical and Electronic Engineering from the University of Dhaka in 2016. She is a member of a microcontroller and robotics enthusiast student group named “Electro-Surge”. She is interested in Computer Vision and Image Processing and looks forward to pursuing higher study in this field. Currently, she is doing her M.S. in Communication and Signal Processing in the same department.

Seiichi Serikawa (Member) received his B.S. and M.S. degrees from Kumamoto University in 1984 and 1986, and his Dr. Eng. from Kyushu Institute of Technology, Japan, in 1994. Currently, he is a professor in the Department of Electrical and Electronic Engineering, Kyushu Institute of Technology. His research interests include human visual information and sensing systems. He is the president of the Institute of Industrial Applications Engineers (IIAE).

Md. Atiqur Rahman Ahad (Non-member) is a Senior Member of IEEE and an Associate Professor in EEE, University of Dhaka (DU). He works on computer/robot vision and imaging. He did his B.Sc. (Honors) and Masters at DU, a Masters at the University of New South Wales, and his PhD at Kyushu Institute of Technology, and has been a JSPS Postdoctoral Fellow and Visiting Researcher. He has published two books (Springer). Other roles: Editorial Board Member, Scientific Reports, Nature; Associate Technical Editor, IEEE ComSoc Magazine; Associate Editor, Frontiers journals; Editorial Board Member, Encyclopedia of Computer Graphics and Games, Springer; Editor-in-Chief, IJCVSP http://cennser.org/IJCVSP, IJEI, IJE; General Chair, 6th ICIEV http://cennser.org/ICIEV, icIVPR http://cennser.org/icIVPR, ICGET; Guest Editor: Pattern Recognition Letters, Elsevier; JMUI, Springer; J. Healthcare Engineering, Hindawi; IJICIC. Member: OSA, ACM, etc. He volunteers for several societies in Bangladesh and Japan. More: http://aa.binbd.com
