


www.elsevier.com/locate/cviu

Computer Vision and Image Understanding 108 (2007) 188–195, doi:10.1016/j.cviu.2006.10.014

Vision-based motion estimation for interaction with mobile devices ⋆

Jari Hannuksela *, Pekka Sangi, Janne Heikkilä

Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering, University of Oulu, P.O. Box 4500, FIN 90014, Finland

Received 5 September 2005; accepted 13 October 2006; available online 19 January 2007

Communicated by Mathias Kölsch

Abstract

This paper introduces a novel interaction technique for handheld mobile devices which enables the user interface to be controlled by the motion of the user's hand. A feature-based approach is proposed for global motion estimation that exploits gradient measures for both feature selection and feature motion uncertainty analysis. A voting-based scheme is presented for outlier removal. A Kalman filter is applied for smoothing motion trajectories. A fixed-point implementation of the method was developed due to the lack of floating-point hardware. Experiments testify to the effectiveness of the approach on a camera-enabled mobile phone.
© 2007 Elsevier Inc. All rights reserved.

Keywords: User interfaces; Handheld devices; Global motion estimation

⋆ The financial support of the Academy of Finland (Project No. 110751) and the National Technology Agency of Finland is gratefully acknowledged.
* Corresponding author. E-mail address: [email protected] (J. Hannuksela).

1. Introduction

One of the major problems in the usability of handheld mobile devices is their limited display size. Large objects that do not fit into the display need to be scrolled and zoomed by the user. Also, interaction using the keypad is often cumbersome and slow, and therefore other approaches are needed. A natural way of interacting with a mobile device is to move it in the user's hand and to use the measured motion as a control input. Special motion sensors such as accelerometers [1] provide a straightforward solution but require extra hardware to be installed. Today, computer vision is a more natural choice, because current mobile phones are often equipped with cameras that can provide visual input for estimating motion.

In order to acquire useful motion data, the system must be able to determine its ego-motion from arbitrary scenes. Many vision-based algorithms already exist for such motion estimation, but these generally need too much computing power to be applied in mobile phones. This is also a rather new application field, and there are not many solutions available so far. Recently, Möhring et al. [2] presented a tracking system for estimating 3-D camera pose using special color-coded markers; the frame rate, including marker detection, barcode reading and rendering, is about 5 fps. Rohs [3] used a block-matching-based technique to measure the relative x, y, and rotational motion; the frame rate of the algorithm is 5 fps. Drab et al. [4] use a method called projection shift analysis to measure the x- and y-motion of the device.

In this paper, we describe a computationally efficient algorithm for estimating global motion. Global motion refers here to the apparent dominant 2-D motion between frames, which can be approximated by some parameterized flow field model [5]. In our method, we adopt a region-based matching approach where a sparse set of features is used for estimating the dominant motion. Unlike some other methods proposed in the literature, this solution allows the estimation of the camera rotation around the optical axis as well as estimation of the back and forth motion. As there are more degrees of freedom, the method provides a more flexible way for interaction (see Fig. 1a). For instance, lateral motion upwards scrolls the focus towards the upper part of the display, and back and forth motion can be interpreted as zooming in and out. Also, the rotation component can be used, for example, to change the orientation of the display if needed.

Fig. 2. Overview of the method. The first two blocks make up the local motion analysis phase, where some image blocks are selected and their motion is estimated using a block matching approach. The resulting motion features provide input for the global motion analysis phase, where inlier features are selected and used for estimating the dominant motion. The filtering provides control input for the end application.

To demonstrate the usefulness of our approach, the motion estimation method has been implemented on the Nokia 3650 mobile phone shown in Fig. 1b. Only fixed-point arithmetic is used due to the lack of a hardware floating-point unit in this device. The real-time performance and the measured motion estimation accuracy indicate that the proposed system can be applied for controlling the user interface. As an application example, the implementation allows scrolling a large digital map by moving the device in the hand so that only a small portion of the map is visible at a time (see Fig. 1c). Of course, camera-based control always has some limitations. For example, the approach cannot be used if the user is walking, as the apparent motion in the visual input does not depend only on hand motions in such a situation. Moreover, the estimated motion for the same translational movement depends on the scene depth.

This paper focuses on the motion estimation method behind this user interaction concept. In Section 2, we review the work on dominant motion estimation and then justify our approach. Its computational steps, which are shown in Fig. 2, are described in Sections 3 and 4. After a discussion of the fixed-point implementation in Section 5, experimental results are reported in Section 6.

2. Background

Motion-based control of a user interface requires a method for estimating global motion from successive frames. As the 2-D motion between frames may consist of multiple motions due to moving objects in a scene and motion parallax, one must consider solutions that estimate the dominant motion, which can be defined as the motion of the largest coherently moving region. Having the parameters of that motion without knowing its support region is sufficient for the application area considered here.

In order to reduce the effect which other motion regions have on the dominant motion estimate, some robust schemes are needed. Two kinds of solutions, direct and indirect ones, can be considered here. In direct solutions, some robust error norm is applied to the motion-compensated frame difference in order to evaluate a particular model. Such an approach is taken in gradient-based schemes (see [6,7], for example), where minimization of a cost function is performed iteratively using spatial and temporal gradient information in a multiresolution framework. Indirect solutions use some precomputed local motion information as input, and the need for robust processing is emphasized due to possibly invalid measurements. Voting schemes such as random sample consensus (RANSAC) [8] or the Hough transform can be used for extracting inliers from the data.

Various methods can be used for computing local motion information. As not all parts of the image can provide information about motion, it is useful to concentrate processing effort on some distinct features [9]. There are then two basic approaches for establishing correspondences between frames: matching of features and local motion estimation. In the former approach, features are detected from both frames first, and correspondences between them are determined. As a result, displacements of features describing the motion between frames can be obtained. For example, work related to invariant interest points [10], where a set of region descriptors for detected features is computed and used for matching, belongs to this category. Such descriptors can deal with large geometric and photometric transformations.

Fig. 1. (a) Navigation options and corresponding axes of motion, (b) Nokia 3650 smartphone, and (c) snapshot of the map.



Another solution is to apply some local motion estimation technique to the features detected from one frame. Several methods, such as gradient-based, energy-based, phase-based and matching-based techniques, have been proposed [5]. A widely used technique in feature motion estimation has been the KLT feature tracker [9], which iteratively minimizes the SSD criterion using a gradient-based multiresolution framework. Another approach for minimizing the SSD is to use block matching. Considering the target platform of our application, it is interesting to study such solutions, as these devices can be equipped with hybrid video encoding tools, where block matching is needed to provide block motion estimates. Thus, hardware accelerators may be available to perform such computation, and this capability could also be used for other purposes. Therefore, our choice has been to investigate the use of feature-based dominant motion estimation, where local motion estimates are based on block matching.

3. Local motion analysis

The method used for dominant motion estimation consists of local and global motion analysis phases, as discussed above. In local motion analysis, basic information about motion between image pairs is obtained. A small set of blocks, called feature blocks, is selected from one image using image gradient information. Then their displacement is estimated using a block matching approach, and those displacement estimates are complemented with confidence information. In the second phase, explained in Section 4, the results of local motion analysis are refined to provide information for controlling the end application.

3.1. Selection of feature blocks

The purpose of using a small set of feature blocks is to reduce the computational cost of motion estimation. Choices related to selection can be considered to deal with three issues: (1) the distribution of features over the image, (2) the criterion used for comparing feature candidates for local motion estimation, and (3) the exploitation of temporal continuity in the selection, which refers to the possibility of tracking features over consecutive frames. In the system described here, only image-based selection has been used, without tracking, as tracking does not provide much advantage and increases the complexity of the implementation.

As our goal is to estimate dominant motion, feature blocks should be distributed over the image so that the probability of sufficient representation of the overall image motion is high. Therefore, it is reasonable to require that features are not concentrated in some small image region. Two approaches may be considered here. First, one may specify some minimum value for the distance between two features, which requires evaluation of distances between features and control of the selection process based on that information. Another solution is to specify a set of non-overlapping image regions and select one or more features from each region. Compared to the distance-based scheme, the use of non-overlapping regions provides a computationally more straightforward solution, and therefore it is adopted in our case. As the SSD measure has to be evaluated in all directions in block matching, features are taken from the central part of the anchor frame I_0. It is subdivided into subregions, and one block is selected from each region, as illustrated in Fig. 3a.

Various criteria for selecting good features for matching have been proposed, and they typically analyze the richness of texture within a window [9]. One approach is to consider first-order image derivatives, which is also done in our method. The measure to be maximized within a subregion is defined as

G(B) \triangleq \sum_{x \in B} \left[ (I_0(x + u_h) - I_0(x))^2 + (I_0(x + u_v) - I_0(x))^2 \right],   (1)

where x denotes pixel coordinates, u_h = [1, 0]^T and u_v = [0, 1]^T. The strongest response for this criterion is obtained for blocks involving high contrast, such as corners or edges (see Fig. 3).

The main reason for using (1) as the criterion is its computational simplicity. As corners are typically most suitable for local motion estimation, measures such as the Harris detector [11] and eigenvalue analysis of 2 × 2 normal matrices [9] could provide better features. However, these methods require more computation, as the correlation of horizontal and vertical image gradients has to be evaluated, and they may also give a strong response for non-corner features. As the reliability of features is always questionable, we complement feature motion estimates with uncertainty measures, so that weaker features like edges can also be utilized in global motion estimation. The confidence analysis presented below can reuse the measure G(B) computed for a block, which also provides some computational advantage.
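To make the selection concrete, the following is a minimal floating-point sketch of criterion (1) and the per-subregion selection. The authors' implementation is fixed-point C/C++, so this NumPy version, its function names, and its grid and margin parameters are purely illustrative.

```python
import numpy as np

def gradient_measure(I0, x, y, b):
    """Criterion (1): sum of squared horizontal and vertical first-order
    differences over the b-by-b block whose top-left corner is (x, y)."""
    blk = I0[y:y + b + 1, x:x + b + 1].astype(np.int64)
    gh = blk[:b, 1:] - blk[:b, :b]   # I0(x + uh) - I0(x)
    gv = blk[1:, :b] - blk[:b, :b]   # I0(x + uv) - I0(x)
    return int((gh ** 2).sum() + (gv ** 2).sum())

def select_feature_blocks(I0, grid=(4, 4), b=6, margin=13):
    """Pick the block maximizing G(B) inside each of the grid[0] x grid[1]
    non-overlapping subregions of the central part of the anchor frame
    (the margin leaves room for the block matching search range)."""
    h, w = I0.shape
    ys = np.linspace(margin, h - margin - b, grid[0] + 1).astype(int)
    xs = np.linspace(margin, w - margin - b, grid[1] + 1).astype(int)
    features = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cands = ((gradient_measure(I0, x, y, b), x, y)
                     for y in range(ys[i], ys[i + 1])
                     for x in range(xs[j], xs[j + 1]))
            features.append(max(cands))   # (G(B), x, y) with largest G
    return features
```

An exhaustive scan inside each subregion is shown for clarity; in practice the measure could be evaluated incrementally or on a subsampled candidate grid.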

3.2. Estimating motion of feature blocks

In order to determine the displacement of a feature block, evaluation of the SSD measure is performed for each feature i over a suitable range of integer displacements in the x- and y-directions, and we call the SSD surface obtained in this way the motion profile. The SSD measure can be defined as

D(d, B_i) \triangleq \sum_{x \in B_i} (I_1(x + d) - I_0(x))^2,   (2)

where I_1 denotes the target frame, and d = [u, v]^T is the evaluated displacement. The displacement d that minimizes the criterion, d_i^min, is used as the feature motion estimate d_i. We consider that this provides sufficient accuracy for the target applications. If desired, quadratic interpolation of the SSD values in the vicinity of d_i^min can be used for obtaining sub-pixel accuracy estimates.



Fig. 3. Local motion analysis: (a) 16 subregions and selected feature blocks, (b) feature motion estimates and associated error covariances illustrated using arrows and ellipses, respectively.

Fig. 4. Principle of motion uncertainty analysis. Panels: analyzed block, motion profile, thresholded profile, and motion uncertainty. Axes in the top right and bottom figures correspond to horizontal and vertical displacements. In the bottom right image, the covariance matrix C_i is illustrated with an ellipse. The analyzed block here contains an edge, which is reflected by the final analysis result.


Minimization of the SSD in motion estimation is based on the assumption that the true motion is close to the displacement candidate giving the minimum. However, due to noise and the aperture problem, there might be other good matches. For example, one can easily get a minimum value for D(d, B_i) at any point along a straight edge. This means that there is information available only in the direction of the edge normal. In the case of homogeneous regions, a good match may be obtained for any displacement.

For these reasons, we base our confidence analysis on the determination of good matches, that is, the selection of those displacement candidates which possibly have the true motion in their vicinity. Once this set, denoted V_i, is available, it may be used for computing a 2 by 2 covariance matrix C_i, which summarizes the analysis results. These matrices are then used for inlier analysis and global motion estimation. To be more specific,

C_i = \frac{1}{M_i} \sum_{d \in V_i} (d - \bar{d}_i)(d - \bar{d}_i)^T + \frac{1}{12} I,   (3)

where M_i is the number of elements in V_i, \bar{d}_i is the mean displacement computed over V_i, and I denotes an identity matrix. The last term is included because a good match provides information of limited accuracy. As displacements giving small values for the SSD always take precedence over those giving larger values, the set V_i is defined as

V_i = \{ d \mid D(d, B_i) \le T_i \},   (4)

where T_i is a threshold which depends on the feature block content. So, the overall uncertainty analysis proceeds as shown in Fig. 4, and the problem is reduced to finding a reasonable way of selecting the threshold T_i.

It has been noted that displaced frame differences are driven by the spatial intensity gradient terms in addition to the residual motion [12]. As discussed in [13], simple block gradient measures can be used for predicting how large values the motion-compensated block difference measures can take for the displacement which is closest to the true displacement. The formula we adopt for computing the threshold is based on this idea and is specified as

T_i = D(d_i^{\min}, B_i) + k_1 G(B_i) + k_2,   (5)

where G(B_i) denotes the block gradient measure computed using (1), and k_1 and k_2 are specific constants. The first term in rule (5) is included for practical reasons, as it guarantees that the set V_i cannot be empty. The coefficient k_2 is a noise-related term. In our work, these coefficients were derived using some training images.
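The following sketch traces Eqs. (2)-(5) for a single feature block: a full-search motion profile, the content-dependent threshold, and the covariance summary of the good-match set. It reuses gradient_measure from the earlier sketch, works in floating point rather than the paper's fixed-point arithmetic, and assumes the block plus the search range stays inside both frames.

```python
import numpy as np

def block_ssd(I0, I1, x, y, b, u, v):
    """Criterion (2): SSD between block B_i in I0 and its displaced
    counterpart in the target frame I1."""
    a = I0[y:y + b, x:x + b].astype(np.int64)
    c = I1[y + v:y + v + b, x + u:x + u + b].astype(np.int64)
    return int(((c - a) ** 2).sum())

def motion_with_uncertainty(I0, I1, x, y, b=6, r=12, k1=0.25, k2=4.0):
    """Feature motion estimate d_i and covariance C_i per Eqs. (2)-(5)."""
    G = gradient_measure(I0, x, y, b)              # reused from Eq. (1)
    disp = np.array([(u, v) for v in range(-r, r + 1)
                            for u in range(-r, r + 1)])
    ssd = np.array([block_ssd(I0, I1, x, y, b, u, v) for u, v in disp])
    d_i = disp[ssd.argmin()]                       # minimizer of (2)
    T_i = ssd.min() + k1 * G + k2                  # threshold, Eq. (5)
    V = disp[ssd <= T_i]                           # good-match set, Eq. (4)
    d_bar = V.mean(axis=0)
    C_i = (V - d_bar).T @ (V - d_bar) / len(V) + np.eye(2) / 12.0  # Eq. (3)
    return d_i, C_i
```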

4. Global motion analysis

As a result of the local motion analysis phase, we have a set of local motion features represented by triplets F_i = (p_i, d_i, C_i), i = 1, ..., N. In the second phase, the goal is to estimate the global dominant motion using this information.



For our purpose, a similarity model (a four-parameter affine model) is considered to be sufficient for approximating the motion between frames, as it can represent 2-D motion consisting of translation, rotation, and scaling [14]. With this model, the displacement d of a feature located at p = [x, y]^T is represented using

d = d(h, p) = \begin{bmatrix} 1 & 0 & x & y \\ 0 & 1 & y & -x \end{bmatrix} h,   (6)

where h = [h_1, h_2, h_3, h_4]^T is a vector of model parameters. Here, h_1 and h_2 are related to common translational motion, and h_3 and h_4 encode information about 2-D rotation φ and scaling s, with h_3 = s cos φ − 1 and h_4 = s sin φ. Estimates of h for the dominant motion are used as input to filtering, the final step of global motion analysis, which provides the outputs for user interface control. Recalling Fig. 1a, the parameters obtained directly are the x-y translation (h_1, h_2) and the camera rotation around the z-axis (φ). Scaling s is related to the translation in the z-direction, ΔZ, as (s − 1) = ΔZ/Z, where Z is the scene depth.
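In code, the model amounts to a 2 × 4 measurement matrix per feature position. A small sketch (the function names are ours, not from the paper) that also recovers φ and s from h_3 and h_4:

```python
import numpy as np

def H_of_p(p):
    """Measurement matrix H[p] of the similarity model, Eq. (6)."""
    x, y = p
    return np.array([[1.0, 0.0, x,  y],
                     [0.0, 1.0, y, -x]])

def decompose(h):
    """Split h = [h1, h2, h3, h4] into translation, rotation phi and
    scale s, using h3 = s*cos(phi) - 1 and h4 = s*sin(phi)."""
    h1, h2, h3, h4 = h
    s = np.hypot(h3 + 1.0, h4)
    phi = np.arctan2(h4, h3 + 1.0)
    return (h1, h2), phi, s
```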

4.1. Outlier analysis

As discussed in Section 2, feature-based dominant motion estimation requires some scheme for discarding those feature motion measurements that would increase the motion estimation error. We assume that the majority of features are associated with the global motion we want to estimate. In order to select those inlier features, a method similar to RANSAC [8] is used. In our approach, pairs of features F_i are chosen for instantiating similarity motion model hypotheses, which are then voted for by the other features. The hypothesis that gets the most support is considered to be close to the dominant global motion and is used for selecting the inlier features. A motion hypothesis h_{k,l} for a feature pair (F_k, F_l) is generated by solving a system of equations based on (6). As the number of features can be relatively small, we generate and evaluate hypotheses for all feature combinations.

Let us denote with m_i(h_{k,l}) the number of votes that a feature F_i gives to the hypothesis h_{k,l}. The covariance matrix C_i associated with the feature F_i provides information about the feature motion uncertainty in different directions. It is therefore reasonable to base the calculation of m_i(h_{k,l}) on the Mahalanobis distance between the estimated displacement d_i and the hypothesized displacement d(h_{k,l}, p_i) given by (6). In order to simplify calculations, we use the squared Mahalanobis distance

d(F_i, h_{k,l}) = d_{i,k,l}^T C_i^{-1} d_{i,k,l},   (7)

where d_{i,k,l} = d_i − d(h_{k,l}, p_i). Using this distance, the number of votes is computed using

m_i(h_{k,l}) = \begin{cases} T_v - d(F_i, h_{k,l}) & \text{if } d(F_i, h_{k,l}) < T_v, \\ 0 & \text{otherwise,} \end{cases}   (8)

where T_v is a user-defined threshold (in our experiments, T_v = 4.0).

The hypothesis h_{k,l} that maximizes the sum \sum_{i=1}^{N} m_i(h_{k,l}) is used for selecting the features F'_i, which are passed to the global motion estimation step. These inlier features are those that give some support to the best hypothesis, that is, m_i(h_{k,l}) is non-zero for them.
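A sketch of this voting scheme, with the all-pairs hypothesis generation described above and H_of_p from the previous sketch; each feature is a (p_i, d_i, C_i) triplet of NumPy arrays, and the structure (not the fixed-point details) follows the text:

```python
import numpy as np
from itertools import combinations

def best_hypothesis(features, Tv=4.0):
    """Pairwise hypothesis generation and voting, Eqs. (6)-(8);
    returns the winning model and the supporting inlier features."""
    Cinv = [np.linalg.inv(C) for _, _, C in features]
    best_votes, best_h = -1.0, None
    for k, l in combinations(range(len(features)), 2):
        A = np.vstack([H_of_p(features[k][0]), H_of_p(features[l][0])])
        rhs = np.hstack([features[k][1], features[l][1]])
        try:
            h_kl = np.linalg.solve(A, rhs)        # hypothesis h_{k,l}
        except np.linalg.LinAlgError:
            continue                              # degenerate pair
        votes = 0.0
        for (p, d, _), Ci in zip(features, Cinv):
            res = d - H_of_p(p) @ h_kl            # d_{i,k,l}
            m = res @ Ci @ res                    # squared Mahalanobis, Eq. (7)
            votes += max(Tv - m, 0.0)             # truncated vote, Eq. (8)
        if votes > best_votes:
            best_votes, best_h = votes, h_kl
    inliers = []
    for (p, d, C), Ci in zip(features, Cinv):
        res = d - H_of_p(p) @ best_h
        if res @ Ci @ res < Tv:                   # gives a non-zero vote
            inliers.append((p, d, C))
    return best_h, inliers
```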

4.2. Global motion estimation

In order to compute the interframe motion, we assume that the motion of any trusted feature F'_i = (p_i, d_i, C_i) is a realization of the random vector

d_i = H[p_i] h + g_i,   (9)

where h is the true motion, and g_i is the observation noise with E(g_i) = 0 and E(g_i g_i^T) = C_i. Observations of feature motions are assumed to be independent. The best linear unbiased estimator for the motion is

\hat{h} = (H^T W H)^{-1} H^T W d,   (10)

where d is the vector of motion observations composed of the d_i, W is the inverse of the block-diagonal matrix composed of the C_i, and the observation matrix H is composed of the H[p_i]. Uncertainties associated with the feature motions can be represented with the covariance matrix

C_{\hat{h}} = (H^T W H)^{-1}.   (11)

Using the matrix decomposition discussed in Section 5, (10) and (11) are evaluated efficiently.
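A compact sketch of Eqs. (10)-(11): because each feature contributes independently, H^T W H and H^T W d can be accumulated feature by feature, and the symmetric positive definite 4 × 4 system is solved with the Cholesky factorization mentioned in Section 5 (the SciPy calls here stand in for the paper's fixed-point routine):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def estimate_global_motion(inliers):
    """Weighted least squares estimate of h, Eqs. (10)-(11), with
    W the inverse of the block-diagonal matrix of the C_i."""
    A = np.zeros((4, 4))                  # accumulates H^T W H
    b = np.zeros(4)                       # accumulates H^T W d
    for p, d, C in inliers:
        Hp = H_of_p(p)
        Wi = np.linalg.inv(C)             # per-feature weight C_i^{-1}
        A += Hp.T @ Wi @ Hp
        b += Hp.T @ Wi @ d
    cf = cho_factor(A)                    # symmetric positive definite
    h_hat = cho_solve(cf, b)              # Eq. (10)
    C_h = cho_solve(cf, np.eye(4))        # Eq. (11): (H^T W H)^{-1}
    return h_hat, C_h
```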

4.3. Motion filtering

The motion estimates are affected by noise, which is characterized by the covariance matrix C_{\hat{h}}. We apply Kalman filtering to these estimates in order to get smoother results. Filtering is done separately for each parameter, which reduces the computational complexity compared to simultaneous filtering of the four-element state vector, as matrix inversions are replaced by scalar divisions. So, there are four filters: for translation along the x-, y-, and z-axes, and for rotation around the z-axis. In each case, we assume that otherwise constant motion is subject to random acceleration errors. So, the state-variable models have the form

x_{k+1} = x_k + d w_k,   (12)

where x_k is the desired value of the motion parameter for the frame pair (k − 1, k), and d is the time interval between consecutive frames. The system noise w_k is Gaussian distributed with zero mean and variance q_k = E{w_k^2}. The observation models for the filters are of the form

z_k = x_k + v_k,   (13)

where z_k is the motion parameter estimated using (10), and v_k is Gaussian distributed measurement noise with zero mean and variance r_k = E{v_k^2}. For the x- and y-translations, the variances of the observation noise are obtained directly from the first two diagonal terms of C_{\hat{h}} given in (11).



The variances of the z-translation and rotation are derived from C_{\hat{h}} using the covariance mapping

C_v = J C_{\hat{h}} J^T,   (14)

where C_v is the covariance matrix for scaling s and rotation φ, and J is the Jacobian matrix.

We are also interested in the integration of motion to obtain position information. In order to avoid a large bias between the filtered and measured positions, we control the prediction as follows:

x_{k+1|k} = x_{k|k} + d α \sum_{i=1}^{k-1} (z_i - x_{i|i}),   (15)

where x_{k+1|k} is the predicted estimate and x_{k|k} is the previous estimate. The control parameter α is selected so as to suppress the integration error while still allowing filtering. In practice, a suitable value is between 0.1 and 0.3. The term Σ_{i=1}^{k−1} (z_i − x_{i|i}) is the cumulative error between measurement and estimate.
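Per parameter, the whole filter reduces to a handful of scalar operations. A sketch of Eqs. (12)-(15) follows, in which the process noise q, the frame interval dt and the initial variance are illustrative values, and the measurement variance r is supplied per frame from C_ĥ or from the mapped covariance (14):

```python
class ControlledScalarKalman:
    """One scalar Kalman filter, Eqs. (12)-(13), with the
    drift-controlled prediction of Eq. (15)."""
    def __init__(self, q=1e-2, alpha=0.2, dt=0.1):
        self.q, self.alpha, self.dt = q, alpha, dt
        self.x, self.P = 0.0, 1.0        # state estimate and its variance
        self.cum_err = 0.0               # running sum of (z_i - x_{i|i})

    def update(self, z, r):
        # Controlled prediction, Eq. (15): the cumulative measurement
        # minus estimate error pulls the integrated position back.
        x_pred = self.x + self.dt * self.alpha * self.cum_err
        P_pred = self.P + (self.dt ** 2) * self.q
        # Scalar measurement update for the observation model (13);
        # the matrix inversion is indeed just a division.
        K = P_pred / (P_pred + r)
        self.x = x_pred + K * (z - x_pred)
        self.P = (1.0 - K) * P_pred
        self.cum_err += z - self.x
        return self.x
```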

Fig. 5. Motion simulation. A base image with sliding window locations on the left and a frame (4 bpp) with ground truth motion (arrows) on the right.

5. Implementation

Mobile devices such as smartphones are typically small in size, which leads to limited hardware resources. Limitations such as low computational power and system memory impose many constraints on software development. Also, the quality of the visual input is usually poor.

The Nokia 3650 smartphone used in our study is based on Series 60 with the Symbian 6.1 OS, and contains a 104-MHz ARM-based 32-bit RISC CPU, 3.4 MB of shared memory and a VGA camera, which is located on the back of the phone. The camera server of the device provides low-resolution RGB images with a size of 160 by 120 pixels and a color depth of 12 bits per pixel. The maximum frame rate for image acquisition is 15 fps. Due to the lack of a floating-point unit, all floating-point arithmetic is emulated on this platform, resulting in very slow performance, and therefore the use of fixed-point arithmetic is preferred. In practice, this means that integers have to be used for operations in the inner loops in order to get good performance. Furthermore, the target processor does not have a divide instruction, so its use should be avoided.

The C/C++ software component that we implemented uses fixed-point computations and is a callable library that other applications can utilize to get control input for their tasks. Input frames are converted to luminance frames using 4 bits per pixel, because color information is not used in motion estimation. To balance computational complexity with performance, 16 feature blocks of 6 by 6 pixels are used for feature motion estimation. The maximum displacement is set to 12, which means a 25 by 25 pixel search area. Based on the experiments and the dynamic range available, the threshold parameters in (5) were chosen to be (k_1, k_2) = (0.25, 4).
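To illustrate the fixed-point style (our example, not the shipped code), the threshold rule (5) with (k_1, k_2) = (0.25, 4) becomes integer-only if the constants are premultiplied by a power of two, here an assumed Q8 format, and the final division is replaced by a shift:

```python
Q = 8                        # assumed number of fractional bits
K1 = int(0.25 * (1 << Q))    # k1 = 0.25 -> 64 in Q8
K2 = int(4 * (1 << Q))       # k2 = 4    -> 1024 in Q8

def threshold_fixed(ssd_min, G):
    """Integer-only version of Eq. (5); no divide instruction needed."""
    return ((ssd_min << Q) + K1 * G + K2) >> Q
```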

The global motion estimation described in Section 4.2 is probably the most critical part of the method when numerical accuracy is considered. We determine the motion parameters h by solving the equation defined in (10) with Cholesky decomposition. The covariance matrix is also easily computed from (11). Cholesky decomposition is very efficient when the square matrix to be decomposed is symmetric and positive definite, as in this case. This technique can be two times faster than other similar decompositions. Another advantage is that the numerical accuracy is easier to maintain compared to some other techniques such as QR factorization.

6. Experiments

In our experiments, the method was evaluated using synthetic and real image sequences. With the synthetic sequences, ground truth data for the global motion was obtained. An image sequence was generated by sliding a window over a high-resolution base image, as shown in Fig. 5. Each frame (160 by 120 pixels) was synthesized from the image region inside the window via interpolation, subsampling and the addition of Gaussian noise. Three kinds of camera motions were simulated: (1) 2-D translation along the x- and y-axes, (2) translation in the z-direction, and (3) translation with a slight rotation φ around the z-axis. Performance was evaluated quantitatively for each motion pattern and the three base images shown in Fig. 6. The performance measures used were based on the root mean square error (RMSE) of the motion vector field, that is, the square root of the average value of \| H[p] (\hat{h} - h) \|_2^2 over the image coordinates p, where \hat{h} is the interframe motion estimate without filtering and h is the ground truth motion.
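This measure takes only a few lines to compute; a sketch using H_of_p from the earlier sketches, where coords ranges over the image pixel coordinates:

```python
import numpy as np

def motion_field_rmse(h_hat, h_true, coords):
    """Square root of the mean of ||H[p](h_hat - h_true)||^2 over p."""
    diff = np.asarray(h_hat) - np.asarray(h_true)
    errs = [np.sum((H_of_p(p) @ diff) ** 2) for p in coords]
    return float(np.sqrt(np.mean(errs)))
```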

Mean and maximum RMSE values computed over sequences of 30 images are given in Table 1. Considering the estimation of rotation and scaling, an RMSE equal to one pixel corresponds approximately to a pure rotational error of 1° and a pure scaling error of 1.8%. It can be seen that the performance is sufficient for user interface control purposes. It is slightly better for base image A than for B and C. In general, performance depends on the texture content of the scene, and it is worst with scenes that are featureless or contain just some periodic texture.

To test the performance of the outlier analysis, an object having independent motion was added to the original synthesized sequences.


Fig. 6. Example frames from sequences, one for each base image.

Table 1
Results with synthesized sequences

Motion                         Base   RMSE (pixels)
                                      Mean    Max
x–y-translation                A      0.190   0.296
                               B      0.302   0.467
                               C      0.307   0.510
z-translation                  A      0.162   0.381
                               B      0.267   0.621
                               C      0.239   0.429
x–y-translation and rotation   A      0.210   0.349
                               B      0.282   0.495
                               C      0.255   0.442

Fig. 7. Experiment with outlier analysis. (Left) Example of a moving outlier object (image coverage 5%). (Right) The number of estimated motions having RMSE below the specified limits (RMSE below limit [%] vs. outlier coverage [%]); the dashed curve corresponds to the limit 0.5, and the solid curve to 1.0.


The object was a rectangular patch whose size was increased gradually in order to see when the method starts to break down. An example of a moving patch and the experimental result are shown in Fig. 7. As the illustration shows, larger errors (RMSE > 1) begin to occur when the object covers about 5–6% of the image. Without outlier analysis, estimation results contained large errors in about 40% of cases already with 0.5% coverage. So, performing the selection of inliers improves the behavior of the method.

Fig. 8. Trajectories for scene A: (a) translational motion (x vs. y coordinate [pixels]), (b) forward–backward motion (z translation vs. frame) and (c) translational motion with slight rotation (rotation [deg] vs. frame). Dashed line, measured motions; solid line, filtered motions.

We also verified the correct operation of our approach, including motion filtering, with real image sequences using the platform described in Section 5. In the real sequences, the user makes movements similar to those in the synthetic sequences: (a) translational movement (letter z), (b) forward–backward motion and (c) translational motion with slight rotation. We used the same scenes (A, B and C) as in the synthetic case. The length of the sequences was 30 frames. As an example, trajectories for scene A are shown in Fig. 8. For each test case a dominant motion component is illustrated; for translational movement, for example, only the trajectory in the x- and y-coordinates is shown. For the real sequences no ground truth data was available, but the trajectories seem to follow the movements that the user made. If filtering is used, there is a small lag between the measured and filtered motion during movement. Notice that this lag would be much larger without the control term in (15), which prevents the cumulative error between measured and estimated motion from increasing freely. With real image sequences, the frame rate achieved was about 10 fps. The frame rate varies slightly due to the asynchronous camera server and other processes running in the OS.

7. Conclusions

We have proposed a novel solution for controlling the user interfaces of mobile devices. The main idea of our approach is to use dominant motion estimated from image data. The method utilizes efficient techniques for feature selection, feature motion analysis and outlier removal. The method was implemented using fixed-point arithmetic for a camera-enabled mobile phone. Experiments with synthetic and real image sequences indicate that our motion estimation algorithm can estimate the three translation parameters and one rotation parameter with sufficient accuracy. One advantage of our method compared to some other methods developed for similar platforms is the ability to measure rotation and translation in the z-direction for scaling purposes. Despite the inherent limitations of the camera-based control scheme, we can say that it provides a viable alternative for operating the user interface of a mobile device. A camera is built into most mobile phones, and the existing platforms can be used simply by downloading the software component that produces the motion input for third-party applications.

References

[1] K. Hinckley, J. Pierce, M. Sinclair, E. Horvitz, Sensing techniques for mobile interaction, in: Proc. of the 13th Annual ACM Symposium on User Interface Software and Technology, 2000, pp. 91–100.

[2] M. Möhring, C. Lessig, O. Bimber, Optical tracking and video see-through AR on consumer cell phones, in: Proc. of Workshop on Virtual and Augmented Reality of the GI-Fachgruppe AR/VR, 2004, pp. 193–204.

[3] M. Rohs, Real-world interaction with camera-phones, in: 2nd International Symposium on Ubiquitous Computing Systems, 2004, pp. 39–48.

[4] S.A. Drab, N.M. Artner, Motion detection as interaction technique for games & applications on mobile devices, in: Proc. of Pervasive Mobile Interaction Devices, 2005.

[5] H. Haußecker, H. Spies, Motion, in: B. Jähne, H. Haußecker, P. Geißler (Eds.), Handbook of Computer Vision and Applications, vol. 2, Academic Press, San Diego, 1999, Ch. 13.

[6] M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Computer Vision and Image Understanding 63 (1) (1996) 75–104.

[7] J. Odobez, P. Bouthemy, Robust multiresolution estimation of parametric motion models, Journal of Visual Communication and Image Representation 6 (4) (1995) 348–365.

[8] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (1981) 381–395.

[9] C. Tomasi, T. Kanade, Detection and tracking of point features, Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, 1991.

[10] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, International Journal of Computer Vision 60 (1) (2004) 63–86.

[11] C. Harris, M. Stephens, A combined corner and edge detector, in: Proc. 4th Alvey Vision Conference, 1988, pp. 147–151.

[12] J.-M. Odobez, P. Bouthemy, Direct model-based image motion segmentation for dynamic scene analysis, in: Proc. Asian Conference on Computer Vision, 1995, pp. 306–310.

[13] P. Sangi, J. Heikkilä, O. Silvén, Motion analysis using frame differences with spatial gradient measures, in: Proc. International Conference on Pattern Recognition, vol. 4, 2004, pp. 733–736.

[14] Q. Zheng, R. Chellappa, A computational vision approach to image registration, IEEE Transactions on Image Processing 2 (1993) 311–326.