
3D Gesture-based Interaction for Immersive Experience in Mobile VR

Shahrouz Yousefi∗, Mhretab Kidane†, Yeray Delgado†, Julio Chana† and Nico Reski∗
∗Department of Media Technology, Linnaeus University, Vaxjo, Sweden
Email: shahrouz.yousefi@lnu.se, [email protected]
†ManoMotion AB, Stockholm, Sweden
Email: [email protected], [email protected], [email protected]

Abstract—In this paper we introduce a novel solution for real-time 3D hand gesture analysis using the embedded 2D camera of a mobile device. The presented framework is based on forming a large database of hand gestures, including ground truth information about hand poses and the details of finger joints in 3D. For a query frame captured by the mobile device's camera in real time, the gesture analysis system finds the best match from the database. Once the best match is found, the corresponding ground truth information is used for interaction in the designed interface. The presented framework performs extremely efficient gesture analysis (more than 30 fps) under flexible lighting conditions and complex backgrounds with dynamic movement of the mobile device. The introduced work is implemented in Android and tested on a Gear VR headset.

I. INTRODUCTION

The rapid development and wide adoption of mobile devices in recent years have been mainly driven by the introduction of novel interaction and visualization technologies. Although touchscreens have significantly enhanced human-device interaction, for the next generation of smart devices such as Virtual/Augmented Reality (VR/AR) headsets, smart watches, and future smartphones/tablets, users will no longer be satisfied with performing interaction over the limited space of a 2D touchscreen or with extra controllers. They will demand more natural interactions performed with the bare hands in the free space around the smart device [1]. Thus, the next generation of smart devices will require a gesture-based interface that lets the bare hands manipulate digital content directly. In general, 3D hand gesture recognition and tracking have been considered classical computer vision and pattern recognition problems. Although substantial research has been conducted in this area, the state-of-the-art results are mainly limited to global hand tracking and low-resolution gesture analysis [2]. However, in order to facilitate natural gesture-based interaction, a full analysis of the hand and fingers is required, which in total incorporates 27 degrees of freedom (DOF) for each hand [3]. The main objectives behind this work are to introduce new frameworks and methods for intuitive interaction with future smart devices. We aim to reproduce real-life experiences in the digital space with high-accuracy hand gesture analysis. Based on comprehensive studies [4], [5], [6], [7], and on feedback from top researchers in the field and major technology developers, it has been clearly verified that today's 3D gesture analysis methods are limited and will not satisfy users' needs in the near future.

The presented research results enable large-scale and real-time 3D gesture analysis. This can be used for user-device interaction in real-time applications on mobile and wearable devices, where intuitive and instant 3D interaction is important. VR, AR, and 3D gaming are among the areas that directly benefit from this 3D interaction technology. Specifically, with the presented research results we plan to observe how the integration of new frameworks, such as search methods, into existing computer vision solutions facilitates high-degree-of-freedom gesture analysis. In the presented work, the Gesture Search Engine, an innovative framework for 3D gesture analysis, is introduced and used on mobile platforms to facilitate AR/VR applications. Prototypes in simple application scenarios have been demonstrated based on the proposed technology.

II. RELATED WORK

One of the current enabling technologies for building gesture-based interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One of the existing solutions is to employ glove-based devices, which directly measure the finger positions and joint angles using a set of sensors (e.g., electromagnetic or fiber-optic sensors) [8], [9]. However, glove-based solutions are too intrusive and expensive for natural interaction with smart devices. To overcome these limitations, vision-based hand tracking solutions need to be developed and video sequences need to be analyzed. Capturing hand and finger motions in video sequences is a highly challenging task in computer vision due to the large number of DOF of the hand kinematics. Recently, Microsoft demonstrated how to capture full-body motions using Kinect [10], [11]. Substantial development of hand tracking and gesture recognition is based on using depth sensors. Sridhar et al. [12] use RGB and depth data for tracking articulated hand motion based on color information and a part-based hand model. Oikonomidis et al. [13] and Taylor et al. [14] track articulated hand motion using RGBD information from Kinect. Papoutsakis et al. [15] analyze a limited number of hand gestures with an RGBD sensor. Model-based hand tracking using a depth sensor is among the commonly proposed solutions [14], [16]. Oikonomidis et al. [13] introduce articulated hand tracking using a calibrated


multi-camera system and optimization methods. Ballan et al. [17] propose a model-based solution to estimate the pose of two hands using discriminative salient points. Here, the question is whether using 3D depth cameras can potentially solve the problem of 3D hand tracking and gesture recognition. This problem has been greatly simplified by the introduction of real-time depth cameras. However, technologies based on depth information for hand tracking and gesture recognition still face major challenges in mobile applications. In fact, mobile applications have at least two critical requirements: computational efficiency and robustness. Feedback and interaction in a timely fashion are assumed, and any latency should not be perceived as unnatural by the human participant. It is doubtful that most existing technical approaches, including the one used in the Kinect body tracking system, would lead the technical development for future smart devices, due to their inherent resource-intensive nature. Another issue is robustness. Solutions for mobile applications should always work, whether indoors or outdoors. This largely excludes the possibility of using Kinect-type sensors in uncontrolled environments. Therefore, the original problem is how to provide effective hand tracking and gesture recognition with video cameras. A critical question is whether we could develop alternative video-based solutions that may fit future mobile applications better.

1) Bare-hand Gesture Recognition and Tracking: Algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model-based approaches [3], [18], [19], [20], [21], [22]. Appearance-based approaches rely on a direct comparison of hand gestures with 2D image features. Popular image features used to detect human hands include hand colors and shapes, local hand features, optical flow, and so on. The early works on hand tracking belong to this type of approach [4], [5], [23]. The gesture analysis step usually includes feature extraction, gesture detection, motion analysis, and tracking. Pattern recognition methods for detecting and analyzing hand gestures are mainly based on local or global image features. Simple features such as edges, corners, and lines, as well as more complex features such as SIFT (scale-invariant feature transform), SURF (speeded-up robust features), and FAST (features from accelerated segment test), are widely used in computer vision applications [24], [25]. For dynamic hand gestures, a combination of local and global image features may be useful. In general, the drawback of feature-based approaches is that clean image segmentation is required in order to extract the hand features. This is not a trivial task when the background is cluttered. Furthermore, human hands are highly articulated. It is often difficult to find local hand features due to self-occlusion, and some kind of heuristic is needed to handle the large variety of hand gestures. Instead of employing 2D image features to represent the hand directly, 3D model-based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by

the 3D hand model with the observed image from the camera and minimizing the discrepancy between them. Generally, it is easier to achieve real-time performance with appearance-based approaches because they rely on simpler 2D image features [2]. However, this type of approach can only handle simple hand gestures, such as detection and tracking of fingertips. In contrast, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. The drawback is that the 3D hand model is a complex articulated deformable object with 27 DOF. To cover all the characteristic hand images under different views, a very large image database is required, and matching the query images from the video input with all hand images in the database is computationally expensive. This is why most existing model-based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions. To handle this challenging search problem in the high-dimensional space of human hands, efficient indexing technologies from the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text retrieval tools with computer vision techniques in order to improve the efficiency of image retrieval [26]. An Okapi-Chamfer matching algorithm based on the inverted index technique is used in their work. Athitsos et al. proposed a method that can generate a ranked list of 3D hand configurations that best match an input image [27]. Hand pose estimation is achieved by searching for the closest matches to an input hand image in a large database of synthetic hand images. The novelty of their system is the ability to handle the presence of clutter. Imai et al. proposed a 2D appearance-based method to estimate 3D hand posture [28]. In their method, the variations of possible hand contours around the registered typical appearances are trained on a number of graphical images generated from a 3D hand model. Although retrieval-based methods are very promising, they are too few to be visible in the field. The reason might be that the approach is still preliminary, or that the results are not impressive because they have only been tested on databases of very limited size. It might also be a consequence of the success of 3D sensors such as Kinect in real-time human gesture recognition and tracking. The statistical approaches adopted in Kinect (random forests, for example) have started to dominate mainstream gesture recognition. This effect is enhanced by the introduction of a new type of depth sensor from Leap Motion, which can run at interactive rates on consumer hardware and interact with moving objects in real time. Despite its attractive demos, the Leap Motion sensor cannot handle the full range of hand shapes. The main reason is that such sensors usually detect and track the presence of fingertips or points in free space when the user's hands enter the sensor's field of view; in effect, they can be used for general hand motion tracking. Regarding the special requirements of mobile applications, such as real-time processing, low complexity, and robustness, a promising approach to the problem of hand tracking and hand gesture recognition is to use retrieval technologies for search.


Fig. 1. Overview of the presented system

Fig. 2. Database: Gesture types, rotations, states, and annotation of the joints

In order to apply this technology to the next generation of smart devices, a systematic study is needed regarding how retrieval tools should be applied to gesture recognition and, in particular, how to integrate advanced image search technologies [29]. The key issue is how to relate vision-based gesture analysis to the large-scale search framework. The proposed solution, based on a search framework for gesture analysis, is explained in the following sections.

III. SYSTEM DESCRIPTION

The proposed interactive process consists of different components, as demonstrated in Figure 1. An ordinary smartphone with an embedded camera is used to provide the video sequence required for the analysis of hand gestures. The Pre-processing component receives the video frames, applies efficient image processing, provides a segmentation of the hand, and finally removes the background noise. Our database consists of a large collection of the required hand gestures, including global and local information about the hand pose and joints. The Matching component analyzes the similarity of the pre-processed query to the database entries and finds the best match in real time. The information annotated to the best match is then analyzed in Gesture Analysis, and the detected static/dynamic hand gestures are sent to the output, where they are used by the application.

A. Organization of the Database of Hand Gestures

Our database contains a large set of different hand gestures with all potential variations in rotation, positioning, scaling, and deformation. Besides matching the query input against the database, we aim to retrieve the 3D motion parameters of the query image. Since query inputs do not contain any pose information, the best solution is to associate the motion parameters of the best retrieved database match with the query. For this reason, we annotate the database images with their ground truth motion parameters, the global hand pose, the positions of the joints, and additional flags about each specific gesture. The flags represent the states of each specific gesture in detail. In the following we explain how the database was built. The current version of the database features four different gesture types: pinch (thumb and index finger), point (using the index finger), grab normal (back of the hand facing the user), and grab palm (palm facing the user). Each gesture type is recorded in five rotations and different states, i.e., grab strength (pinch, grab) or tilt (point), for both left and right hands. Figure 2 illustrates the recorded gesture types, including descriptions of the selected rotations and states. Gestures of more than ten participants were recorded in a laboratory setup using a Samsung Galaxy S6. A chroma key screen was used to ensure near-optimal conditions for the later image processing, in order to exclusively extract relevant gesture data from the images. The actual image processing and database creation were performed in MATLAB. Each gesture recording was cropped to square dimensions. Then, each of the 19 joints of the hand was manually marked (see Figure 2) in order to collect its individual x and y coordinates. The coordinate data are normalized from 0 to 1. Keeping track of the joint positions is important in order to build a joint skeleton of the hand. Once the joint annotation of the processed gestures was completed, binary and edge representations were created based on a resized (100-by-100 pixels) version of the cropped recording and stored as binary data. Consequently, one database entry contains 10000 elements with values of 0 or 1. Finally, a general flag database assists with the unique identification of the gesture type, rotation, state, and hand (left, right) for each database entry. In practice, the recorded hand poses were processed and database files were created. The output files contain the individual information, i.e., joint annotation, binary image, edge image, and flag matrix. Table I provides a statistical overview of the assembled gesture database.

Gesture type    Rotations   States   Hands   Sum
Pinch           5           5        2       650
Point           5           3        2       550
Grab (normal)   5           7        2       850
Grab (palm)     5           7        2       850

TABLE I
DATABASE: OVERVIEW OF GESTURE RECORDINGS (INCL. SUM OF DATABASE ENTRIES PER GESTURE TYPE)
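To make the organization above concrete, the following is a minimal C++ sketch of what a single database entry might look like. The struct and field names are illustrative assumptions, not taken from the authors' implementation; only the contents (flags, 19 normalized joints, 100x100 binary and edge representations) follow the description in the text.

```cpp
// Illustrative layout of one gesture database entry (Section III-A).
#include <opencv2/core.hpp>
#include <array>

struct GestureEntry {
    // Flags identifying the recording (see Table I).
    int  gestureType;                 // pinch, point, grab normal, grab palm
    int  rotation;                    // one of the five recorded rotations
    int  state;                       // grab strength (pinch, grab) or tilt (point)
    bool rightHand;                   // false = left hand

    // Ground truth annotation: 19 joints, x/y normalized to [0, 1].
    std::array<cv::Point2f, 19> joints;

    // 100x100 binary and edge representations (10000 elements of 0/1 each).
    cv::Mat binary;                   // CV_8UC1
    cv::Mat edge;                     // CV_8UC1
};
```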


B. Pre-processing

Pre-processing is an essential step to segment the hand from the background so that an efficient search and matching can be performed on the database of hand gestures. Pre-processing starts by capturing the RGB frame and converting it to the YCbCr color space. Since the Cr and Cb channels are less sensitive to illumination changes, they can be used to define a segmentation function and localize the hand area. The luminance channel (Y) is ignored, as it varies a lot between frames. To increase the contrast between hand and background, we define a weighted image from the Cb and Cr channels (see Equation 1). In order to analyze the hand color, maximize the contrast between hand and background in the weighted image, and segment a clean hand, different sets of samples from both the hand and the background have been analyzed. In the implemented system each sample is a small patch of 3-by-3 pixels, and the median value of each patch represents the color information of the sample. For segmentation, the following equations are considered:

$$z_i = \alpha \, HCr_i - \beta \, HCb_i \qquad (1)$$

$$t_i = \alpha \, BCr_i - \beta \, BCb_i, \quad i = 1, \ldots, n \qquad (2)$$

where $Z$ and $T$ denote the weighted images of the hand and background samples, respectively. $HCr$ and $HCb$ represent the Cr and Cb color information of the hand samples, and $BCr$ and $BCb$ that of the background. A large training set of different hands, lighting conditions, and backgrounds has been used to tune the parameters in the equations and reach a clean segmentation of the hand. Therefore, the following objective is maximized to tune $\alpha$ and $\beta$:

$$\underset{\alpha, \beta}{\arg\max} \; \| Z - T \| \qquad (3)$$

Based on the calculated parameters, a threshold on the pixel values is used to segment the hand from the background and form a binary image of the hand. The threshold value, alpha, and beta are first defined in an offline step for pre-sets of different environments. Therefore, if the segmentation function in the default mode does not provide a clean segmentation of the hand, the user can easily switch to other modes such as Bright, Dark, Complex background, etc. The implemented system also includes an adaptive analysis that re-samples the hand colors by taking the values from the last known gesture. This allows the system to automatically adapt to changes in lighting or background throughout its use. By sampling and re-calculating the Cb, Cr, and threshold values, the system becomes more tolerant to changes in the environment. After creating the binary image, we find the largest object. Since the thresholds and contrasts are in favour of the hand, the hand will be the largest object in the frame. This area of the frame is defined as the ROI (region of interest), as it is expected to contain the hand. The ROI is then cropped out of the frame and normalized to a size that matches the width and height of the database gestures.

Fig. 3. Analysis of the sample images for tuning the segmentation parameters.
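The pre-processing described above can be summarized in the following minimal C++/OpenCV sketch. Function and parameter names are illustrative assumptions; alpha, beta, and the threshold are assumed to come from the offline tuning and pre-set modes discussed in the text.

```cpp
// Minimal sketch of the pre-processing step (Section III-B): chrominance
// weighting, thresholding, largest-component selection, ROI normalization.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat segmentHand(const cv::Mat& frameBGR, double alpha, double beta, double thresh)
{
    // Convert to YCrCb and keep only the chrominance channels (Y is ignored).
    cv::Mat ycrcb;
    cv::cvtColor(frameBGR, ycrcb, cv::COLOR_BGR2YCrCb);
    std::vector<cv::Mat> ch;
    cv::split(ycrcb, ch);                       // ch[0]=Y, ch[1]=Cr, ch[2]=Cb

    // Weighted image z = alpha*Cr - beta*Cb (Equation 1), computed per pixel.
    cv::Mat cr, cb;
    ch[1].convertTo(cr, CV_32F);
    ch[2].convertTo(cb, CV_32F);
    cv::Mat weighted = alpha * cr - beta * cb;

    // Threshold to obtain a binary hand mask.
    cv::Mat mask;
    cv::threshold(weighted, mask, thresh, 255, cv::THRESH_BINARY);
    mask.convertTo(mask, CV_8U);

    // Keep the largest connected component, assumed to be the hand.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask.clone(), contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return cv::Mat();
    size_t largest = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;

    // Crop the ROI around the hand and normalize it to the database size (100x100).
    cv::Mat roi = mask(cv::boundingRect(contours[largest]));
    cv::Mat query;
    cv::resize(roi, query, cv::Size(100, 100), 0, 0, cv::INTER_NEAREST);
    return query;                               // binary query image for matching
}
```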

C. Real-time Matching and Similarity Analysis

The output of the Pre-processing component is a normalized binary vector of the region of interest in the query frame, containing a cleanly segmented hand with a minimum level of background noise. Ideally, the segmented hand would be identical to specific entries in the database if we provided an extremely large database of hand gestures. In practice, we perform a similarity analysis to find the closest database match for a captured query gesture. In our system we have experimented with the L1 and L2 norms to match the query with the closest database entry in real time. Based on the conducted experiments, the similarity level required for a query to be recognized as a specific gesture varies between gesture categories. For instance, the similarity of entries in the pointer category to a query gesture should be higher than 83% to maximize the probability of a correct recognition as pointer. The similarity criterion varies between different gesture categories from 80% to 86%. In order to significantly improve the efficiency of the search process, a selective search method is introduced. In selective search, the priority for searching the database for a given query is based on the likelihood of the gesture detected in previous frames. Therefore, the dimension of the search can be significantly reduced. This process is explained in the following section.
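A minimal C++ sketch of this similarity matching, reusing the GestureEntry layout sketched after Table I, is shown below. The per-category thresholds (0.80-0.86) follow the text; the function names and the specific L1-based similarity measure over binary images are illustrative assumptions.

```cpp
// Minimal sketch of the matching step (Section III-C), assuming the
// GestureEntry struct from the earlier sketch.
#include <opencv2/opencv.hpp>
#include <vector>

// Similarity in [0, 1]: fraction of pixels on which the two binary images agree
// (the value convention, 0/1 in the database vs 0/255 in the query, does not matter).
double similarity(const cv::Mat& query, const cv::Mat& entry)
{
    cv::Mat qBin = query > 0, eBin = entry > 0;      // both become 0/255 masks
    cv::Mat diff = (qBin != eBin);
    int mismatches = cv::countNonZero(diff);
    return 1.0 - static_cast<double>(mismatches) / (query.rows * query.cols);
}

// Return the index of the best match, or -1 if no entry passes its category threshold.
int bestMatch(const cv::Mat& query, const std::vector<GestureEntry>& db,
              const std::vector<double>& thresholdPerType)   // e.g. values in 0.80-0.86
{
    int best = -1;
    double bestScore = 0.0;
    for (size_t i = 0; i < db.size(); ++i) {
        double s = similarity(query, db[i].binary);
        if (s >= thresholdPerType[db[i].gestureType] && s > bestScore) {
            bestScore = s;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```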

1) Gesture Mapping and Selective Search: In order to perform smooth and fast retrieval, we analyze the database images with dimensionality reduction methods to cluster similar hand gestures. This approach indicates which gestures are close to each other and fall in the same neighborhood in the high-dimensional space.


Fig. 4. The designed interface based on the proposed approach.

For dimensionality reduction and gesture mapping, different algorithms have been tested. The best results, which properly map our database images to visually distinguishable patterns in 3D, were achieved with the Isomap method. Since hand movements follow a continuous motion, it is expected that for each frame the best match from the database falls in the neighborhood of the best matches from the previous frames. This significantly reduces the search domain and improves the efficiency of the retrieval. The implemented system uses this method: if the best match is detected with high certainty (a high score), the next frame starts the search from the neighborhood of the previous match; if none of the entries in the neighborhood meets the similarity criterion for the best match, the search domain is extended to the rest of the database until the best match is found.
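A minimal C++ sketch of this selective search is given below, reusing similarity() and bestMatch() from the previous sketch. The neighborhood lists are assumed to be precomputed offline from the Isomap embedding; their representation here is an illustrative assumption.

```cpp
// Minimal sketch of the selective search (Section III-C-1).
#include <opencv2/opencv.hpp>
#include <vector>

int selectiveSearch(const cv::Mat& query,
                    const std::vector<GestureEntry>& db,
                    const std::vector<double>& thresholdPerType,
                    const std::vector<std::vector<int>>& neighbors, // per-entry neighborhoods from the Isomap map
                    int previousMatch)                              // -1 if no prior confident match
{
    // 1) If the previous frame gave a confident match, search its neighborhood first.
    if (previousMatch >= 0) {
        int best = -1;
        double bestScore = 0.0;
        for (int i : neighbors[previousMatch]) {
            double s = similarity(query, db[i].binary);
            if (s >= thresholdPerType[db[i].gestureType] && s > bestScore) {
                bestScore = s;
                best = i;
            }
        }
        if (best >= 0) return best;     // found within the neighborhood
    }
    // 2) Otherwise fall back to scanning the whole database.
    return bestMatch(query, db, thresholdPerType);
}
```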

2) Motion Averaging: Suppose that for the query images $Q_{k-n}$ to $Q_k$ ($k > n$), the best database matches have been selected. In order to smooth the retrieved motion over a sequence, an averaging method is used. Thus, for the $(k+1)$-th query image, the position/orientation is computed from the estimated position/orientation of the $n$ previous frames as follows:

$$P_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} P_{Q_i}, \qquad O_{Q_{k+1}} = \frac{1}{n} \sum_{i=k-n+1}^{k} O_{Q_i} \qquad (4)$$

Here, $P_Q$ and $O_Q$ represent the estimated position and orientation of the query images, respectively. Position/orientation includes all 3D information (translation and rotation parameters with respect to the x, y, and z axes). According to the experiments, averaging performs properly for $3 \le n \le 5$.
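A minimal C++ sketch of this averaging step, assuming a simple Pose container for the six translation/rotation parameters (the struct and class names are illustrative, not from the authors' implementation):

```cpp
// Minimal sketch of the motion averaging in Equation (4).
#include <array>
#include <deque>

struct Pose {
    std::array<double, 3> position;      // translation along x, y, z
    std::array<double, 3> orientation;   // rotation about x, y, z
};

// Keeps the last n retrieved poses and returns their average as the estimate
// for the next query frame (3 <= n <= 5 worked well according to the paper).
class MotionAverager {
public:
    explicit MotionAverager(std::size_t n) : n_(n) {}

    Pose update(const Pose& retrieved) {
        history_.push_back(retrieved);
        if (history_.size() > n_) history_.pop_front();

        Pose avg{};                      // zero-initialized accumulator
        for (const Pose& p : history_) {
            for (int i = 0; i < 3; ++i) {
                avg.position[i]    += p.position[i]    / history_.size();
                avg.orientation[i] += p.orientation[i] / history_.size();
            }
        }
        return avg;
    }

private:
    std::size_t n_;
    std::deque<Pose> history_;
};
```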

IV. EXPERIMENTAL RESULTS

A. Static and Dynamic Gesture Recognition

The capability of the framework to recognize hand gestures highly depends on the size of the database. In other words, we can increase the tracking resolution and the number of recognized gestures by extending the database. In the framework we classify gestures into two types: static and dynamic. A static gesture refers to a hand pose recognized in one frame, such as open/closed hand (front and back view), open/closed pinch, and pointer (7 categories in total). However,

this approach is likely to fail on some occasions due to potential runtime noise. To reduce the noise and potential failures, the framework applies a noise reduction method that reads the previous static hand gestures and removes the outliers. It is unlikely that in a short period of time (e.g., 5 frames) the user performs different hand poses. If the framework detected 4 open-hand and 1 closed-hand poses in the previous frames, the output for the last 5 frames would be an open-hand gesture. The analysis of a frame results in a state that refers to a static gesture. A combination of states over the video sequence results in a dynamic gesture, which can be a click, double click, swipe right/left, grab, or hold up/down (7 in total). In order to detect a dynamic gesture, the framework analyzes the states in the last n frames. For instance, in order to detect a grab gesture, the states stored for the previous frames should contain the transition from the open-hand state (static gesture) to the closed-hand state (static gesture). Based on the conducted experiments, n varies for different dynamic hand gestures, and it should be selected so as to maximize the naturalness of the interaction and minimize wrong detections.
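A minimal C++ sketch of the majority-vote smoothing and the open-to-closed transition check for a grab is given below. The state names and any window size beyond the 5-frame example are illustrative assumptions; the paper only fixes the 5-frame smoothing example and a per-gesture n.

```cpp
// Minimal sketch of static-gesture smoothing and dynamic-gesture detection (Section IV-A).
#include <algorithm>
#include <deque>
#include <map>

enum class StaticGesture { OpenHand, ClosedHand, OpenPinch, ClosedPinch, Pointer, Unknown };

class GestureAnalyzer {
public:
    // Majority vote over the last 5 recognized static gestures (noise reduction).
    StaticGesture smooth(StaticGesture latest) {
        window_.push_back(latest);
        if (window_.size() > 5) window_.pop_front();
        std::map<StaticGesture, int> votes;
        for (StaticGesture g : window_) ++votes[g];
        return std::max_element(votes.begin(), votes.end(),
                                [](auto& a, auto& b) { return a.second < b.second; })->first;
    }

    // A grab is reported when the last n states contain an open-to-closed transition.
    bool detectGrab(StaticGesture smoothed, std::size_t n = 10) {   // n = 10 is an assumption
        states_.push_back(smoothed);
        if (states_.size() > n) states_.pop_front();
        for (std::size_t i = 1; i < states_.size(); ++i)
            if (states_[i - 1] == StaticGesture::OpenHand &&
                states_[i]     == StaticGesture::ClosedHand)
                return true;
        return false;
    }

private:
    std::deque<StaticGesture> window_;
    std::deque<StaticGesture> states_;
};
```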

B. Implementation for Android/HMD and Interface Design

The framework is developed in C++/OpenCV and has been tested by more than 10 users. The results show similar performance for all users, although they have different skin colors and hand shapes. The tests were conducted under different lighting conditions and backgrounds, with users wearing the VR headset or holding the smartphone in one hand and interacting with the content with the other hand. The algorithm can handle movements of the user (a moving camera) as long as the image quality is preserved. In order to improve the quality of detection, the camera's auto-focus is deactivated. As demonstrated in Figure 4, a simple interface was designed in Android to show the detected static and dynamic gestures on the camera view while the user performs different hand gestures. A small box is also rendered in the bottom corner to show the segmented hand in real time. We have designed two applications in Unity to showcase the technology in simple scenarios. The applications have been implemented and tested on Android phones (Samsung Galaxy Note 4 and S6) and a VR headset (Samsung Gear VR). In the applications, users can pick and place objects and select and activate different features, such as playing a set of songs in VR, using hand and finger motions. The rendered hand exactly follows the user's motion in 3D space. As shown, the system can handle complex and noisy backgrounds properly. The main requirement for segmentation of the hand is an acceptable level of contrast between hand and background. Under different lighting conditions, even in cases where the hand area is affected by background noise, our system can detect the gesture class reliably. Since the system works on normalized hand patterns, there is no restriction on the hand scale in the algorithm. The current database is stored in RAM due to its size (less than 3 MB) and for performance reasons. The average end-to-end processing time, including the designed interface in Unity, is less than 35 ms per frame.
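For completeness, the following sketch shows how the end-to-end per-frame cycle could be assembled from the previous sketches (segmentHand, selectiveSearch, GestureAnalyzer). Capturing frames via cv::VideoCapture and the toStatic mapping are assumptions made for illustration; on Android, frames come from the camera API and the mapping would use the flag matrix of the matched entry.

```cpp
// Minimal sketch of the per-frame cycle combining the earlier sketches.
#include <opencv2/opencv.hpp>
#include <functional>
#include <vector>

void runPipeline(const std::vector<GestureEntry>& db,
                 const std::vector<double>& thresholdPerType,
                 const std::vector<std::vector<int>>& neighbors,
                 const std::function<StaticGesture(const GestureEntry&)>& toStatic, // assumed mapping from flags
                 double alpha, double beta, double thresh)
{
    cv::VideoCapture cam(0);            // placeholder for the mobile camera feed
    GestureAnalyzer analyzer;
    int previousMatch = -1;

    cv::Mat frame;
    while (cam.read(frame)) {
        // 1) Pre-processing: segment the hand and normalize to 100x100.
        cv::Mat query = segmentHand(frame, alpha, beta, thresh);
        if (query.empty()) continue;

        // 2) Matching: selective search seeded by the previous frame's match.
        int match = selectiveSearch(query, db, thresholdPerType, neighbors, previousMatch);
        if (match < 0) { previousMatch = -1; continue; }
        previousMatch = match;

        // 3) Gesture analysis: smooth static gestures, then check for dynamic ones.
        StaticGesture s = analyzer.smooth(toStatic(db[match]));
        if (analyzer.detectGrab(s)) {
            // Dynamic gesture detected: hand it to the application layer (e.g. the Unity interface).
        }
    }
}
```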


The battery consumption of the application is similar to that of current 3D games or open-camera applications on Android. Based on our estimation, the presented system is able to handle a significantly larger number of database entries. Adding hand patterns from different users increases the database size, but the increase becomes insignificant once the diversity is covered. Based on our estimation, a database of 10-20K entries will support high-resolution 3D joint tracking. With the proposed method, the processing can be handled in real time for the extended database.

V. CONCLUSION AND FUTURE WORK

In comparison with existing gesture analysis systems, our proposed technology is unique in several respects. Clearly, the existing solutions based on a single RGB camera are not capable of handling the analysis of articulated hand motions; they can detect the global hand motion (translations) and track the fingertips [15], [16]. Moreover, they mainly work in stable lighting conditions with a static, plain background. Another group of solutions relies on extra hardware such as depth sensors. They usually perform global hand motion tracking, including the 3D rotation/translation parameters (6 degrees of freedom), and fingertip detection. On the other hand, they cannot be embedded in existing mobile devices due to their size and power limitations. Since our technology does not require complex sensors and, in addition, provides high-degree-of-freedom motion analysis (global motion and local joint movements), it can be used to recover up to 27 parameters of hand motion. Due to our innovative search-based method, this solution can handle large-scale gesture analysis in real time on both stationary and mobile platforms with minimal hardware requirements. A benchmark of the system is given in Table II, and a summary of the comparison between the existing solutions and our contribution in Table III. Performance of the proposed solution is highly related to the quality of the database. This requires more time and effort to build a comprehensive collection of hand gestures that can represent all possible cases in interactive applications. One effective solution is to use a 3D hand model to render all possible hand postures with computer graphics technology and convert the generated gestures into binary representations.

Average processing speed                       30-40 fps
DB and query frame size                        100x100 and 320x240
CPU usage on Galaxy S6                         9-14%
Number of static and dynamic hand gestures     7 and 7

TABLE II
BENCHMARK OF THE PROPOSED SYSTEM

Features/Approach      2D Camera     Depth Camera    Our approach
Stationary/Mobile      Yes/Limited   Yes/No          Yes/Yes
2D/3D Tracking         Yes/Limited   Yes/Yes         Yes/Yes
Joint analysis         No            Limited         Yes
Degrees of freedom     Low (2-6)     Medium (6-10)   High (10+)

TABLE III
COMPARISON OF THE PROPOSED SYSTEM AND CURRENT TECHNOLOGIES

The major problem with this approach is that the extracted features are not natural, which directly affects the search process. Recording hand gestures from real users and converting them into binary hand-shape images seems to be a reasonable approach for extending the database in future work.

REFERENCES

[1] K. Roebuck, Tangible User Interfaces: High-Impact Emerging Technology - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. Emereo Pty Limited, 2011.
[2] A. Sarkar, G. Sanyal, and S. Majumder, "Hand Gesture Recognition Systems: A Survey," 2013.
[3] B. Stenger, "Model-Based Hand Tracking Using A Hierarchical Bayesian Filter," 2006.
[4] S. Yousefi, F. A. Kondori, and H. Li, "Experiencing real 3D gestural interaction with mobile devices," Pattern Recognition Letters, 2013.
[5] S. Yousefi, F. A. Kondori, and H. B. Li, "Camera-Based Gesture Tracking for 3D Interaction Behind Mobile Devices," 2012.
[6] S. Yousefi, "3D photo browsing for future mobile devices," in Proceedings of the 20th ACM International Conference on Multimedia, 2012.
[7] S. Yousefi, H. Li, and L. Liu, "3D Gesture Analysis Using a Large-Scale Gesture Database," in Advances in Visual Computing. Springer, 2014.
[8] A. Kolahi, M. H, T. R, M. A, M. B, and H. M, "Design of a marker-based human motion tracking system," pp. 59–67, 2007.
[9] M. Knecht, A. Dunser, C. T, M. W, and R. G, "A Framework For Perceptual Studies In Photorealistic Augmented Reality," 2011.
[10] M. Tang, "Recognizing Hand Gestures with Microsoft's Kinect," pp. 303–313, 2011.
[11] E. Parvizi and Q. Wu, "Real-Time 3D Head Tracking Based on Time-of-Flight Depth Sensor," 2007.
[12] S. Sridhar, A. O, and C. T, "Interactive Markerless Articulated Hand Motion Tracking using RGB and Depth Data," in Proc. of the Int. Conf. on Computer Vision (ICCV), 2013.
[13] I. Oikonomidis, N. K, K. T, and A. A, "Tracking hand articulations: Relying on 3D visual hulls versus relying on multiple 2D cues," in Ubiquitous Virtual Reality (ISUVR), 2013.
[14] J. Taylor, R. S, and V. R, "User-Specific Hand Modeling from Monocular Depth Sequences," in Computer Vision and Pattern Recognition (CVPR), 2014.
[15] D. Michel, K. P., and A. A. Argyros, "Gesture Recognition Supporting the Interaction of Humans with Socially Assistive Robots," in Advances in Visual Computing. Springer International Publishing, 2014.
[16] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun, "Realtime and robust hand tracking from depth," in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1106–1113.
[17] L. Ballan, A. Taneja, J. Gall, L. V. Gool, and M. Pollefeys, "Motion Capture of Hands in Action using Discriminative Salient Points," in European Conference on Computer Vision (ECCV), Firenze, Oct. 2012.
[18] J. Song, G. Soros, F. Pece, S. Fanello, S. Izadi, C. Keskin, and O. Hilliges, "In-air Gestures Around Unmodified Mobile Devices," in Proceedings of UIST '14, 2014.
[19] R. Yang and S. Sarkar, "Gesture Recognition using Hidden Markov Models from Fragmented Observations," 2006.
[20] C. Hardenberg and F. B, "Bare-hand human-computer interaction," 2001.
[21] D. Iwai and K. Sato, "Heat Sensation in Image Creation with Thermal Vision," 2005.
[22] M. Kolsch and M. Turk, "Fast 2D Hand Tracking with Flocks of Features and Multi-Cue Integration," 2004.
[23] F. A. Kondori, S. Yousefi, and H. Li, "Real 3D interaction behind mobile phones for augmented environments," in Proceedings - IEEE International Conference on Multimedia and Expo, 2011.
[24] D. G. Lowe, "Distinctive Image Features from Scale-invariant Keypoints," 2004.
[25] H. Bay, A. Ess, T. Tuytelaars, and L. Gool, "SURF: Speeded Up Robust Features," pp. 346–359, 2008.
[26] T. Huang, "Okapi-Chamfer Matching for Articulated Object Recognition," pp. 1026–1033, 2005.
[27] V. Athitsos and S. Sclaroff, "Estimating 3D hand pose from a cluttered image," 2003.
[28] A. Imai, N. Shimada, and Y. Shirai, "3-D hand posture recognition by training contour variation," pp. 895–900, 2004.
[29] Y. Cao, C. Wang, L. Zhang, and L. Zhang, "Edgel Index for Large-Scale Sketch-based Image Search," 2011.
