

Facial Position and Expression Based Human Computer Interface for Persons with Tetraplegia

Zhen-Peng Bian, Junhui Hou, Lap-Pui Chau, and Nadia Magnenat-Thalmann

Abstract—A human computer interface (namely Facial position and expression Mouse system, FM) for persons with tetraplegia based on a monocular infrared depth camera is presented in this paper. The nose position along with the mouth status (close/open) is detected by the proposed algorithm to control and navigate the cursor as computer user input. The algorithm is based on an improved Randomized Decision Tree (RDT) which is capable of detecting the facial information efficiently and accurately. A more comfortable user experience is achieved by mapping the nose motion to the cursor motion via a non-linear function. The infrared depth camera enables the system to be independent of illumination and colour changes both from the background and on the human face, which is a critical advantage over RGB camera based options. Extensive experimental results show that the proposed system outperforms existing Assistive Technologies (ATs) in terms of quantitative and qualitative assessments.

Index Terms—Camera mouse, hands-free control, assistive technology (AT), perceptual user interface, Fitts' law, human-computer interaction (HCI), severe disabilities, computer access.

I. INTRODUCTION

Nowadays, personal computers play a very important role in modern life. However, persons with tetraplegia, suffering from traumatic brain injury, cerebral palsy, neurological injury or stroke, find it very difficult to use personal computers' standard input interfaces, such as the keyboard and the mouse, during their rehabilitation and everyday life activities. Therefore, it is highly desirable to develop Assistive Technologies (ATs) for persons with tetraplegia.

A. Related Work

Some ATs have been developed to help persons with tetraplegia by using their limited voluntary signals and motions to control computers. Based on the different voluntary signals and motions, there are roughly four categories of ATs for persons with tetraplegia [1], [2]:

1) Physiological Signal Based ATs: Various types of physiological signals are employed to control computers, such as ElectroMyoGram (EMG) [3], [4], ElectroEncephaloGram (EEG) [5], [6] and ElectroOculoGram (EOG) [7], in which the signals are generated from muscles, the brain and the eyes, respectively. Some of these systems can be used for totally paralysed subjects [6]. One of the major drawbacks of these systems is the difficulty of extracting the interaction signal, since the signal-to-noise ratio (SNR) is very low. Other major drawbacks are low portability and the requirement for highly specialized hardware.

Z.-P. Bian, Junhui Hou and L.-P. Chau are with the School of Electrical and Electronics Engineering, Nanyang Technological University, 639798 Singapore (email: [email protected], [email protected], [email protected]).

N. Magnenat-Thalmann is with the Institute for Media Innovation, Nanyang Technological University, 639798, Singapore (email: [email protected]).

2) Voice Command Based ATs: Speech recognition and non-verbal vocalization recognition are used to control computers [8], [9]. However, they are unreliable in noisy environments. Moreover, using voice signals to navigate a cursor is not as flexible as using a motion tracking method [8]. These ATs could be better used in combination with other ATs [8].

3) Mechanical Motion Based ATs: With the mechanical motion based ATs, the user controls a computer via switches or analog devices, such as sip-and-puff [10], mouth stick/pad [11] and lip control systems [12]. One of the problems of sip-and-puff and mouth stick/pad interfaces is the hygienic issue. To address the hygienic issue, the authors of [12] proposed a Lip Control System (LCS). However, LCS requires wearing an accessory in front of the lip.

4) Motion Tracking Based ATs: These ATs track the motions of body parts, using eye [13], [14], tongue [1], [8], head [15], [16] or face [17], [18], [19] trackers. In [13], [14], eye gaze was directly used to select the target. Although the operation is fast, it needs calibration before use and the head of the user should be fixed during use. If the head moves, re-calibration is needed. In addition, methods based on eye motions or eye gaze require extra eye motions, and the motions for interaction and the user's normal visual tasks interfere with each other [20]. In [1], [8], the ATs tracked the motion of the tongue. However, the tongue interfaces in [1], [8] required an accessory embedded into the tongue for long-term use, which causes a hygienic issue.

Head tracking systems can provide alternative interfaces for persons with tetraplegia who retain voluntary motion of the head. The human head has multiple degrees of freedom [21]. Nevertheless, some existing ATs require the users to put on or wear accessories, such as head bands, glasses, caps and markers.

Camera-based methods can enable users to control computers effectively without wearing any accessory, achieving a non-contact experience. In [15], based on an RGB video, Camera Mouse tracked a manually selected and automatically updated online template, and a guardian was required to start the tracking for normal operation. While using Camera Mouse, the subject is required to look at the monitor when the tracking point is on the face. If the subject rotates his head at a large angle to look at other places, this will lead to the loss of tracking, and a manual re-selection is needed. To make the extraction of features compact, in [17], skin colour was used to detect the user's face and nose, which is vulnerable to complex illumination and similarly coloured objects. In [18], the user's face was tracked by a cost function of errors between 2D features in the RGB video and a 3D face model. However, that tracking method is also vulnerable to the colour and illumination environment, and requires an initialization step. In [19], the authors alleviated the "feature drift" and illumination-dependence problems of [15]. However, that system is not very robust yet, especially when the motion of the user is fast.

B. Overview of FM

The proposed human computer interface (namely Facial position and expression Mouse system, FM) employs a monocular commercial depth camera, such as SoftKinetic [22] and Kinect [23], and is based on the facial position and expression. Fig. 1 shows the overview of our proposed interface system. Several challenging issues are well addressed, i.e.,

1) FM is independent of colour and illumination influences since it is based on an infrared depth camera, whose main interference source is infrared light such as direct sunlight [24]. Moreover, the depth image simplifies the task of facial feature extraction, which makes it more robust than its rivals, i.e., RGB camera based methods.

2) A fast and robust Randomized Decision Tree (RDT) is proposed to automatically detect the position and the expression from a single image, which can avoid the "feature drift" problem that exists in most tracking algorithms.

3) The human computer interface combines the advantages of facial expression based interfaces and head motion based interfaces to address the problem of the small range of head motion caused by the low resolution of the camera. An efficient user experience is achieved by a non-linear function mapping the motion of the nose to the motion of the cursor. The mouth status enables the user to conveniently adjust the pose of the head, and to efficiently provide commands.

Compared with our preliminary work [25], there are three major contributions: (1) the novel feature of the RDT (see Subsection II-B) and the pyramid processing (see Subsection II-D) remarkably improve the accuracy and speed of the detection algorithm; (2) the non-linear mapping function improves the operation performance; (3) more comprehensive evaluations of the proposed interface are provided (see Section IV).

When compared with other interface devices that give people with severely impaired mobility access to computers, the strengths of the proposed system are its convenience (no guardian, no accessory on the body, no calibration, no initialization) and robustness (insensitive to illumination and colour). It allows persons with tetraplegia to use computers efficiently to enrich their lives.

This paper is organized as follows: Section II introduces the detection algorithm of facial position and expression. Section III introduces mouse events based on facial position and expression. Experimental results are presented in Section IV. Conclusions are drawn in Section V.

Fig. 1. Overview of our proposed interface system. (Training phase: depth images with labels (head, mouth-close, mouth-open, nose, body) are used to train the RDT. Test phase: the depth camera provides a depth image, the background is removed, the pixels are classified into labels by the RDT, the position is extracted by mean-shift and the mouth status by voting, and the nose position and mouth status control the cursor and trigger commands.)

II. DETECTION OF FACIAL POSITION AND EXPRESSION

As shown in the flow chart in Fig. 1, the first step of the interface is to remove the background. If the pixel's depth z(x, y) is larger than a maximum threshold T_max or smaller than a minimum threshold T_min, this pixel is considered as background for our application. The mask function M(x, y) is

M(x, y) = \begin{cases} 1 \ \text{(Foreground)}, & T_{\min} \le z(x, y) \le T_{\max} \\ 0 \ \text{(Background)}, & \text{otherwise}, \end{cases} \qquad (1)

We set T_min = 50 cm since the minimum reliable depth value of the camera used for the experiments is 50 cm, and set T_max = 150 cm by assuming that the distance between the subject and the computer is smaller than 150 cm. After removing the background, the detection algorithm of the facial position and expression is used.
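As a concrete illustration of (1), the background removal reduces to a per-pixel threshold test. The following sketch assumes the depth image is available as a NumPy array in centimetres; the array contents and function name are illustrative and not taken from the actual implementation.

```python
import numpy as np

def foreground_mask(depth_cm, t_min=50.0, t_max=150.0):
    """Mask M(x, y) of Eq. (1): 1 for pixels whose depth lies in
    [t_min, t_max] cm (foreground), 0 otherwise (background)."""
    depth = np.asarray(depth_cm, dtype=np.float32)
    return ((depth >= t_min) & (depth <= t_max)).astype(np.uint8)

# Example with a small synthetic depth map (cm); 0 marks invalid pixels.
depth = np.array([[0,  80, 120, 200],
                  [60, 90, 130, 155],
                  [30, 70, 110, 149],
                  [55, 95, 150, 170]], dtype=np.float32)
print(foreground_mask(depth))
```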

The nose position is used to navigate the cursor along with the commands triggered by the status of the mouth. Both the position and the status are detected from a single depth image based on the RDT framework, and no calibration is needed before use. The proposed detection approach for the position and the status is robust and has low computational complexity. In the following four subsections, we present the four parts of the detection approach: (1) the training data of the proposed RDT-based algorithm, (2) the novel feature of the proposed RDT-based algorithm, (3) the decision method for the nose position and the mouth status after RDT classification, and (4) the pyramid processing. For more details on RDT, we refer the interested reader to [26], [27].

A. Training Data

In [26], the RDT algorithm was successfully developed by the authors to detect joint positions. For training, a fixed predefined label was used to label the same body part in the human model no matter what the human pose. In our preliminary work [25], the RDT algorithm was further developed to detect two categories of information together, i.e., expressions and positions. In the training phase of our algorithm, the mouth is labelled according to the different statuses of the mouth, as shown in Fig. 2. In other words, the information of both the position and the expression is embedded into the labels. In total, there are five labels for the RDT algorithm, i.e., nose, mouth-open, mouth-close, head and body. The number of labels is small and the labelling is much easier than in [26]. The synthetic depth images simulated by the human model can be used for training as in [26]. In addition, real depth images can also be labelled for training, and they correspond more closely to the depth images in the test phase than the synthetic images. Real personal depth images used for training can improve the accuracy for a specific person with tetraplegia. Fig. 2 shows two examples of real depth images with labels.

Fig. 2. Labelled images for training. (a) Close the mouth. (b) Open the mouth. Green: mouth-close, red: mouth-open, white: nose, blue: head, brown: body.

As shown in Fig. 2, the nose is labelled exactly within a small circle. In contrast, the mouth is roughly labelled by a polygon, and the labels of the mouth in Fig. 2(a) and 2(b) differ according to the mouth status. The underlying reason is that the position of the nose is employed to navigate the cursor, so the accuracy of the nose position is important, whereas the accuracy of the mouth position is not as important as that of the nose position. In order to avoid overfitting, the pixels along the borders of the nose and the mouth are not labelled, shown as the black band in Fig. 2.

B. Feature of RDT

Fig. 3 shows the structure of our RDT. Each split node n has a parameter p_n = ((Δx_n, Δy_n), τ_n). (Δx_n, Δy_n) is the offset with respect to the location of the test pixel (x_0, y_0). τ_n is the scalar threshold to be compared with the feature value f_n of the test pixel. p_n is optimized during training. Each comparison decides whether the test pixel goes to the left or right child branch of node n. In the end, the test pixel reaches a leaf, which stores the classification information, i.e., a posterior distribution over the labels.

An efficient feature is very important to the RDT. We propose to use the depth difference between the test pixel and one of its neighbours as the feature of the RDT, i.e.,

f((x_0, y_0) \mid (\Delta x, \Delta y)) = z(x_0, y_0) - z\left( (x_0, y_0) + \frac{(\Delta x, \Delta y)}{z/R} \right), \qquad (2)

Fig. 3. Structure of our RDT. Each split node n contains a comparison between the feature value f_n and the threshold τ_n. Each leaf contains a posterior distribution over the labels.

where (x_0, y_0) represents the location of a test pixel in a given test image, (Δx, Δy) represents the offset with respect to (x_0, y_0), z(x, y) represents the depth of pixel (x, y), z is a normalization variable, and R is a constant. The offset value is normalized by 1/(z/R).

In the training phase, z is equal to the depth of the nose in the image. After training, the offset parameter (Δx, Δy) is normalized by 1/(z/R) and stored as RDT_z w.r.t. different values of z. Thus, for a given RDT_z, the feature becomes

f((x_0, y_0) \mid (\Delta x_z, \Delta y_z)) = z(x_0, y_0) - z((x_0, y_0) + (\Delta x_z, \Delta y_z)), \qquad (3)

where (Δx_z, Δy_z) has been normalized offline.

In the test phase, z is used as the index to look up the corresponding RDT_z. Before the test, the depth of the nose is unknown, so the depth of the test pixel is employed as z to look up the corresponding RDT_z. It is reasonable to use z(x_0, y_0) as z: (1) if the test pixel is on the nose, z(x_0, y_0) is equal to the depth of the nose; (2) if the test pixel is on the face, the error between z(x_0, y_0) and the depth of the nose is very small compared to the depth of the nose. When a rough position of the nose for the current frame is known, e.g., from the nose position of the previous frame, the rough depth of the nose is employed as z. In this case, the selection of RDT_z takes place only once for one test image. Furthermore, when the size of the image is fixed, the two-dimensional offset (Δx_z, Δy_z) can be converted to a one-dimensional value, resulting in less computational complexity.

In both the training and the test, the nose always serves as the important reference, which improves the detection accuracy of the nose position compared with the detection in [25].
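For readers who prefer code, the split test of (2)-(3) and the traversal of a single tree can be sketched as below. The tree layout (a dictionary of node parameters), the constant R and the border handling are illustrative assumptions rather than the authors' data structures, and sending f < τ to the left branch is likewise only a convention.

```python
import numpy as np

def depth_feature(depth, x0, y0, dx, dy, z_norm, R=100.0):
    """Feature of Eq. (2): depth difference between the test pixel (x0, y0)
    and a neighbour whose offset (dx, dy) is scaled by 1 / (z_norm / R)."""
    h, w = depth.shape
    xn = min(max(int(round(x0 + dx / (z_norm / R))), 0), w - 1)  # clamp to image borders
    yn = min(max(int(round(y0 + dy / (z_norm / R))), 0), h - 1)
    return depth[y0, x0] - depth[yn, xn]

def classify_pixel(depth, x0, y0, tree, R=100.0):
    """Walk one randomized decision tree and return the posterior stored at
    the reached leaf.  `tree` maps a node id to either
    ('split', (dx, dy), tau, left_id, right_id) or ('leaf', posterior)."""
    z_norm = depth[y0, x0]          # test-pixel depth as the normalization value
    node = tree[0]
    while node[0] == 'split':
        _, (dx, dy), tau, left, right = node
        f = depth_feature(depth, x0, y0, dx, dy, z_norm, R)
        node = tree[left] if f < tau else tree[right]
    return node[1]                  # distribution over {nose, mouth-open, ...}

# Tiny illustrative tree: one split node and two leaves.
toy_tree = {
    0: ('split', (4.0, 0.0), 1.5, 1, 2),
    1: ('leaf', {'nose': 0.7, 'head': 0.3}),
    2: ('leaf', {'nose': 0.1, 'head': 0.9}),
}
depth = 80.0 + np.random.rand(64, 64).astype(np.float32)
print(classify_pixel(depth, 32, 32, toy_tree))
```

In a full forest, the per-pixel posteriors of several such trees would be averaged before the mean-shift and voting steps described next.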

C. Decision Method of Nose Position and Mouth Status

As shown in Fig. 4, the output of the RDT classifier is a distribution P_il over the labels l for each test pixel (x_i, y_i). After the per-pixel classification, a mean-shift algorithm and a voting algorithm will be used to extract the positions and the mouth status, respectively.

Fig. 4. Results of RDT classification. (a)(c) Input depth images shown as mesh after removing pixels at near/far distances. (b)(d) Label maps after RDT classification; each colour stands for the highest-probability label, green: mouth-close, red: mouth-open, white: nose, blue: head, brown: body. (a)(b) Close the mouth. (c)(d) Open the mouth.

1) Position: To extract the positions, we use a mean-shift clustering approach based on the distribution. A weighted kernel is included in the mean-shift clustering approach. For the positions, the density estimator is defined as

P_l \propto \sum_{i=1}^{M} w_{il} \, k(L - L_i) \, z(L_i)^2, \qquad (4)

where the 3D position of label l is denoted as P_l, the total number of test pixels of a given test depth image is denoted as M, the weighting of the pixel (x_i, y_i) is denoted as w_{il}, a kernel function is denoted as k(·), a 3D location is denoted as L, the 3D location of the test pixel (x_i, y_i) is denoted as L_i, and z(L_i) is the depth of L_i. w_{il} is equal to the probability P_{il} from the RDT classifier. For the mouth position, the mouth label merges the labels of the different statuses of the mouth.
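A minimal sketch of the depth-weighted mean-shift of (4) is given below, assuming the per-pixel weights w_il and 3D locations L_i are already available from the classifier; the Gaussian kernel, its bandwidth and the iteration count are illustrative choices, not the authors' settings.

```python
import numpy as np

def mean_shift_position(locations, weights, depths, start, bandwidth=30.0, iters=20):
    """Estimate a label position as the mode of the density of Eq. (4).

    locations : (M, 3) array of 3D pixel locations L_i (e.g. in mm)
    weights   : (M,) array of per-pixel label probabilities w_il
    depths    : (M,) array of pixel depths z(L_i)
    start     : (3,) initial estimate of the mode
    """
    L = np.asarray(locations, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64) * np.asarray(depths, dtype=np.float64) ** 2
    mode = np.asarray(start, dtype=np.float64)
    for _ in range(iters):
        d2 = np.sum((L - mode) ** 2, axis=1)
        k = np.exp(-0.5 * d2 / bandwidth ** 2)      # Gaussian kernel k(L - L_i)
        wk = w * k
        if wk.sum() < 1e-12:
            break                                    # no support near the current mode
        mode = (L * wk[:, None]).sum(axis=0) / wk.sum()
    return mode

# Toy example: 200 noisy points around (0, 0, 800) mm.
rng = np.random.default_rng(0)
pts = rng.normal([0.0, 0.0, 800.0], 15.0, size=(200, 3))
print(np.round(mean_shift_position(pts, np.ones(200), pts[:, 2], start=pts[0]), 1))
```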

2) Mouth status: To detect the status of the mouth, a voting method is used. Its voting metric can be represented by

S_m^* = \arg\max_{S_m} \left( w_{S_m} \sum_{i=1}^{M} w_{iS_m} \, k(P_m - L_i) \, z(L_i)^2 \right), \qquad (5)

where S_m ∈ {mouth-close, mouth-open} is the label of the mouth status, w_{S_m} is the weighting of the mouth status, w_{iS_m} is the weighting of the test pixel (x_i, y_i), and P_m is the mouth position. The mouth status result S_m^* is the status with the larger voting value. The proposed method is free from the challenges of (1) facial landmark extraction, and (2) mouth status inference based on the extracted landmarks. Thus, it is more efficient than landmark-based methods, e.g., [18].
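The voting of (5) then reduces to comparing two kernel- and depth-weighted probability sums, as in the sketch below; again the Gaussian kernel and its bandwidth are illustrative assumptions.

```python
import numpy as np

def mouth_status(locations, depths, probs, mouth_pos, status_weights, bandwidth=30.0):
    """Vote for the mouth status as in Eq. (5).

    probs          : dict status -> (M,) per-pixel probabilities w_iSm from the RDT
    status_weights : dict status -> scalar status weight w_Sm
    """
    L = np.asarray(locations, dtype=np.float64)
    z2 = np.asarray(depths, dtype=np.float64) ** 2
    d2 = np.sum((L - np.asarray(mouth_pos, dtype=np.float64)) ** 2, axis=1)
    k = np.exp(-0.5 * d2 / bandwidth ** 2)          # kernel centred on the mouth position P_m
    scores = {s: status_weights[s] * np.sum(probs[s] * k * z2) for s in probs}
    return max(scores, key=scores.get)

# Toy example: 100 pixels around the mouth position, mostly classified as open.
rng = np.random.default_rng(1)
pts = rng.normal([0.0, -40.0, 800.0], 10.0, size=(100, 3))
p_open = rng.uniform(0.4, 0.9, 100)
print(mouth_status(pts, pts[:, 2],
                   probs={'mouth-open': p_open, 'mouth-close': 1.0 - p_open},
                   mouth_pos=[0.0, -40.0, 800.0],
                   status_weights={'mouth-open': 1.0, 'mouth-close': 1.0}))
```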

D. Speed Up by Pyramid Processing

As a computer interface, it should be implemented with only a small share of the CPU resources, reserving CPU resources for other applications. The above-mentioned detection algorithm is performed on all foreground pixels. In practice, the runtime of the proposed detection algorithm can be sped up by pyramid processing.

Fig. 5. Pyramid processing when the rough positions of the nose and the mouth are unknown.

The nose and the mouth only occupy small areas in the depth image, even when compared with the foreground. Thus, the nose position and the mouth status can be detected by classifying a small number of pixels via the RDT algorithm if the rough positions of the nose and the mouth are known. When the rough positions are unknown, the pyramid processing is implemented as in Fig. 5. The rough positions of the nose and the mouth can be quickly detected with down-sampling. After classifying the pixels, the mean-shift algorithm is used to extract the head position, merging the labels of head, mouth-close, mouth-open and nose. The head position is the starting point of the mean-shift for the mouth position, which merges the labels of mouth-close and mouth-open. Using the mouth position, the voting algorithm is employed to detect the mouth status, and the mouth position is the starting point of the mean-shift for the nose position. The accuracy of the nose position is important. Hence, the finer position of the nose is further detected by classifying the neighbourhood pixels around the rough position of the nose without down-sampling. The number of neighbourhood pixels is much smaller than that of the whole foreground. The speed of the detection operation can be improved several times by this pyramid processing.

The density estimators for the positions and the relationships, e.g., the 3D distances, among the positions of the head, mouth and nose are used to measure the confidence level of the detection results.

If the previous frame can provide reliable rough positions of the nose and the mouth for the current frame, only the neighbourhood pixels around these two positions are classified by the RDT algorithm. For the mouth, down-sampling is used since the size of the mouth is large enough. For the nose, the detection is done without down-sampling.

With the pyramid processing, the proposed detection algorithm has very low computational complexity, and has the potential for hardware implementation in a camera.
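A rough back-of-the-envelope count shows why the pyramid helps: only the down-sampled foreground plus a small full-resolution patch around the nose has to be classified. The step and patch sizes below are purely illustrative.

```python
def pixels_classified(n_foreground, step=4, patch=48):
    """Approximate pixel count for the pyramid processing of Fig. 5:
    a coarse pass on the foreground down-sampled by `step` in each axis,
    plus a full-resolution patch of `patch` x `patch` pixels around the
    rough nose position (illustrative numbers only)."""
    coarse = n_foreground // (step * step)
    fine = patch * patch
    return coarse + fine

n = 40000   # e.g. foreground pixels of a seated subject
print(pixels_classified(n), "pixels with pyramid processing vs", n, "without")
```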


III. MOUSE EVENTS BASED ON FACIAL POSITION AND EXPRESSION

The operation function of the proposed interface is presented in this section. The nose position and the mouth status are combined to navigate the computer cursor and trigger interactive commands.

A. Mapping Function of Motion

In [18], the authors tested three cursor control modes, i.e., joystick mode, direct mode and location mode. Their experimental results showed that the best-performing mode is the location mode, whose behaviour is very similar to the operation of the mouse. The location mode gives users the freedom to control the cursor acceleration and speed, with motion insensitive to direction. Therefore, the location mode based on the 3D position of the nose is chosen as the cursor control mode of FM.

A bottleneck of the interface is the depth image resolution, since the interface utilizes the small-amplitude motion of the head. If the function mapping the nose motion to the cursor motion is linear, e.g., as in [18], a small mapping gain will lead to a small cursor motion range, so that the cursor cannot cover the whole screen. However, a large mapping gain will convert noise into a large motion of the cursor, which will affect the cursor precision and the user experience. Therefore, the precision and range of the cursor motion become a trade-off problem.

To address the above-mentioned problem, we propose to use a non-linear mapping function, i.e.,

D = G \, d^{p_1} \left( 1 - \exp\left( -\left( \frac{d}{n} \right)^{p_2} \right) \right), \qquad (6)

where D is the spatial distance between the current cursor position and the last cursor position, G is the mapping gain, d is the spatial distance between the current nose position and the last nose position in the xy plane, p_1 and p_2 are powers, and n is a coefficient related to the noise level. The term (1 − exp(−(d/n)^{p_2})) is used to set ⌊D⌋ = 0 when d is a small value, which efficiently eliminates the jitter of the cursor and makes the interface more comfortable and more accurate. The total mapping gain increases with the speed of the nose motion. When the cursor is far from its target, the user tends to move his nose faster, and the resulting higher mapping gain helps speed up the cursor towards the target. Thus, compared to a linear mapping function, the user can move the cursor over a larger range without a larger amplitude of head motion. When the cursor is close to its target, the user can finely adjust the cursor location by moving his nose at low speed. The cursor location is controlled by the nose position and its motion speed, allowing more degrees of freedom to control the cursor than the simple linear mapping function.
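A sketch of the mapping (6) is given below; the values of G, p_1, p_2 and n are illustrative placeholders, since the actual settings depend on the camera resolution and noise level and are not listed here.

```python
import numpy as np

def cursor_step(d, G=2.0, p1=1.3, p2=2.0, n=1.5):
    """Non-linear mapping of Eq. (6): nose displacement d in the xy plane
    -> cursor displacement D.  G, p1, p2 and n are illustrative values,
    not the authors' settings."""
    d = np.asarray(d, dtype=np.float64)
    return G * d ** p1 * (1.0 - np.exp(-(d / n) ** p2))

# Small nose motions are suppressed (jitter removal); fast motions are amplified.
for d in (0.2, 0.5, 1.0, 3.0, 8.0):
    print(f"d = {d:4.1f}  ->  D = {float(cursor_step(d)):7.2f}")
```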

B. Commands Triggered by Mouth Status

The close and open statuses of the mouth are used to enable and disable the motion of the cursor, respectively. When the target is very far away from the cursor, the subject can open the mouth so that he can turn his head back to a comfortable position before continuing to move the cursor to the target. Hence, the mouth status enlarges the range of the cursor motion. When the subject opens his mouth to disable the cursor motion, the mouth status allows the subject to conveniently adjust his head to a more comfortable posture without moving the cursor. Furthermore, the mouth status can be used to quickly stop the cursor and fix its location to avoid overshooting, which enhances the pointing accuracy when the cursor is moving over the target.

Fig. 6. Examples from the Bosphorus 3D face database, shown as mesh. (a) Negative. (b) Positive.

Commands can be predefined based on the mouth status combined with temporal information. For example, the mouse double-click event is provided by an open-close-open-close mouth status sequence within a short time, and the single click is provided by open-close-open. None of the mouth actions interfere with users' normal visual tasks.
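One possible way to turn the status stream into click events is a small state machine over recent status transitions, as sketched below; the time window and the event names are illustrative, and the paper does not specify the exact timing logic.

```python
import time

class MouthClickDetector:
    """Map mouth status transitions to click events: open-close-open within
    `window` seconds is reported as a single click, open-close-open-close as
    a double click (illustrative thresholds)."""

    def __init__(self, window=0.8):
        self.window = window
        self.transitions = []          # (timestamp, status) at each status change
        self.last_status = None

    def update(self, status, now=None):
        now = time.monotonic() if now is None else now
        if status != self.last_status:
            self.transitions.append((now, status))
            self.last_status = status
        # Keep only the transitions inside the time window.
        self.transitions = [(t, s) for t, s in self.transitions if now - t <= self.window]
        seq = [s for _, s in self.transitions]
        if seq[-4:] == ['open', 'close', 'open', 'close']:
            self.transitions.clear()
            return 'double_click'
        if seq[-3:] == ['open', 'close', 'open']:
            return 'single_click'      # may still be upgraded to a double click
        return None

# Example: feed a simulated status stream sampled every 0.1 s.
det = MouthClickDetector()
for i, s in enumerate(['close', 'open', 'close', 'open', 'close', 'close']):
    event = det.update(s, now=i * 0.1)
    if event:
        print(f"t = {i * 0.1:.1f} s: {event}")
```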

IV. EXPERIMENTS

In this section, we experimentally evaluate the performance of both the detection method and the interface operation. The research was approved by the Institutional Review Board of Nanyang Technological University, Singapore.

A. Evaluation of Detection Method

To verify our extraction algorithm, we performed experiments using the Bosphorus 3D face database [28]. Twenty challenging poses, i.e., the pose type Lower-Face-Action-Units (LFAU), were selected as training and test samples. The negative label includes nineteen poses and the positive label includes the mouth stretch LFAU 27, as shown in Fig. 6. In total, there were 105 subjects, 1876 negative samples and 105 positive samples.

The experiments were performed using 10-fold cross-validation on the entire dataset. The database was randomly divided into ten subsets of approximately equal size. Each subject appeared in only one subset. The training and testing were done ten times, each time with nine subsets for training and the remaining one for testing. The classification accuracy reported is the average over the ten different random splits.
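A subject-exclusive split of this kind can be obtained, for instance, with scikit-learn's GroupKFold, as in the sketch below; the feature vectors, subject assignment and the use of scikit-learn are assumptions for illustration (GroupKFold also balances fold sizes deterministically rather than splitting at random).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 1981 samples (1876 negative + 105 positive) from 105 subjects.
rng = np.random.default_rng(0)
X = rng.normal(size=(1981, 16))                  # illustrative feature vectors
y = np.concatenate([np.zeros(1876), np.ones(105)])
subjects = rng.integers(0, 105, size=1981)       # subject id per sample

# GroupKFold keeps all samples of a subject in the same fold, so no subject
# appears in both the training and the test split of any fold.
gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
    # train on (X[train_idx], y[train_idx]) and evaluate on the held-out fold here
print("10 subject-exclusive folds prepared")
```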

TABLE I
COMPARISON OF DETECTION ALGORITHMS. "p" DENOTES THE USE OF PYRAMID PROCESSING.

                          [26]    [25]    Ours    Ours_p
FN of mouth               2       2       2       2
FP of mouth               6       5       5       5
Total errors of mouth     8       7       7       7
Error rate of mouth (%)   0.40    0.35    0.35    0.35
Mean error of nose (mm)   3.2     2.4     1.9     1.9
Time (ms)                 5.4     2.9     1.5     0.24
Frame rate (fps)          185     345     667     4167

Table I shows the accuracy and speed for different features of the RDT. In Table I, False Negative (FN) means the number of positive events detected as negative events, and False Positive (FP) means the number of negative events detected as positive events. We assume the faces in the Bosphorus database are 0.5 m from the camera. In Table I, "p" denotes the method that uses the pyramid processing to speed up the algorithm. For the interface application, the cursor is navigated in the xy plane; therefore, the mean error of the nose position is the error in the xy plane. The speed was obtained from a single-thread program implemented in C++. The number of pixels per image of the Bosphorus data approximates the number of foreground pixels from Kinect v2 when the subject is sitting 0.8 m from the camera. Thus, the speed in Table I is close to the speed of the detection using Kinect v2.

Comparing the first and second columns of data in Table I, the method of [25] has better results in all terms except FN than the method of [26]. Comparing the second and third columns, the proposed method is 1.9× faster than the method of [25], and the mean error of the nose position is reduced by 21% while the error results of the mouth status are the same as those of [25]. From the last two columns, we can note that the pyramid processing dramatically speeds up the algorithm, i.e., 6.2× faster, without any loss of accuracy. If only the neighbourhood pixels around the mouth and the nose are classified by the RDT algorithm, the frame rate can be higher than 1×10^4 fps.

Table II shows the error results of the mouth status compared with Support Vector Machine (SVM) classifiers [29], which are used to detect the mouth status using the ground truth landmarks of the mouth. In Table II, the best result of the SVM classifiers is 12 total errors, whereas the proposed algorithm produces only 7 total errors. Although the SVM classifiers use ground truth 3D landmarks of the mouth, the total error results of the proposed algorithm are better. If the landmarks were detected with error by an algorithm, the results of the SVM classifiers would be further degraded.

To further demonstrate the performance of the proposed facial position/expression extraction algorithm, a video demo can be downloaded.¹

¹ http://www.ntu.edu.sg/home/elpchau/FM.mp4

TABLE II
COMPARISON WITH SVM.

Error of mouth    Proposed    SVM-quadratic    SVM-polynomial    SVM-linear    SVM-rbf
FN                2           6                4                 1             35
FP                5           6                12                17            9
Total errors      7           12               16                18            44
Error rate (%)    0.35        0.61             0.81              0.91          2.22

B. Evaluation of Interface Operation

The interface operation was evaluated according to Fitts' law [30] and the ISO/TS 9241-411 standard [31].

1) Task: The ISO/TS 9241-411 standard establishes uniform guidelines and test procedures to evaluate computer pointing interfaces, in which the multi-directional tapping task is used, as shown in Fig. 7. The procedure of the tapping task was to point at and select the target indicated in blue as shown in Fig. 7(b). Each subject was instructed to move the cursor to the centre of the target as quickly and precisely as possible, and then to select the pointed target as quickly as possible (clicking the left button when using the mouse, or dwelling on the pointed target on the monitor for 500 ms, as suggested in [13], [12], when using FM or Camera Mouse). After the tapping task, each subject was interviewed and given a questionnaire, which is defined in the ISO/TS 9241-411 standard to evaluate the comfort of using interfaces and was adapted to involve the specific body parts used during the experiments.

The primary parameters of a target are its distance (D) and width (W). The distance of a target is defined as the distance between the centres of two consecutively appearing targets, as shown in Fig. 7(b). In the task, three different distances (D) and three different widths (W, the diameters shown in Fig. 7(b)) are used to generate six D-W pair conditions. Table III shows the design values, which include all the conditions for the multi-directional tasks of [1] and [12]. The D-W conditions appeared in random order during the experiments.

TABLE III
DESIGN VALUES OF THE TAPPING TASK.

D (pixels)    100     305     305     534     534     534
W (pixels)    57      76      57      76      57      30
ID (bits)     1.46    2.33    2.67    3.00    3.37    4.23
Our           X       X       X       X       X       X
[12]          X       X       X       X
[1]           X       X       X

2) Performance Metrics: According to the ISO/TS 9241-411 standard, the metric for comparison is throughput (TP), which includes both the accuracy and speed of user operation performance. We used one additional metric, i.e., Task Completion Time (TCT), as used in [1], [8], [12].

Throughput: TP indicates the amount of information that the users deliver to the computer via an interface. Higher TP is better. TP is defined as

TP = \frac{ID}{MT}, \qquad (7)

where MT is the mean movement time over all the trials for the same target condition, and ID is the index of difficulty of the target. The movement time refers to the time the subject spends moving the cursor to point at a target, starting from the centre of the last target. The movement time should not count the initiation delay before moving the cursor, nor the selecting time or dwelling time for the selection of the pointed target [1], [32]. Shannon's formula is used to define ID:

ID = \log_2\left( \frac{D}{W} + 1 \right). \qquad (8)

ID is measured in bits. All the ID values used in our experiments can be found in Table III.

Fig. 7. Graphical User Interface (GUI) screen for the tapping task in our experiments. (a) 90 possible targets and their order. (b) Targets turn blue one at a time.

The mean TP is calculated as

TP = \frac{1}{S} \sum_{i=1}^{S} \left( \frac{1}{C} \sum_{j=1}^{C} \frac{ID_{ij}}{MT_{ij}} \right), \qquad (9)

where S represents the number of subjects, and C represents the number of target conditions.
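A short worked example ties (7)-(9) together: the snippet below reproduces the ID values of Table III and shows how the mean TP of (9) is computed from a matrix of per-subject, per-condition movement times (the MT numbers are invented purely for illustration).

```python
import numpy as np

def index_of_difficulty(D, W):
    """Shannon formulation of Eq. (8), in bits."""
    return np.log2(np.asarray(D, dtype=float) / np.asarray(W, dtype=float) + 1.0)

# Reproduce the ID values of Table III.
D = np.array([100, 305, 305, 534, 534, 534])
W = np.array([57, 76, 57, 76, 57, 30])
ID = index_of_difficulty(D, W)
print(np.round(ID, 2))            # approximately [1.46 2.33 2.67 3.00 3.37 4.23]

def mean_throughput(ID, MT):
    """Mean TP of Eq. (9).  MT is an (S, C) array of mean movement times (s)
    per subject and per target condition; ID is the (C,) array of difficulties."""
    return np.mean(np.mean(ID[None, :] / MT, axis=1))

# Invented movement times for 3 subjects and the 6 conditions (illustration only).
MT = np.array([[0.5, 0.8, 0.9, 1.0, 1.1, 1.4],
               [0.6, 0.9, 1.0, 1.1, 1.2, 1.5],
               [0.4, 0.7, 0.8, 0.9, 1.0, 1.3]])
print(round(mean_throughput(ID, MT), 2), "bits/s")
```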

Task Completion Time: TCT represents the total time used for completing each test round. Lower TCT is better. TCT reflects the users' overall control over the movement of the cursor. TCT starts from the appearance of the first target and ends at the disappearance of the last target.

3) Interfaces: OpenCV was used to develop the GUI of the experiments. The size of the computer LCD monitor was 14.0", and the resolution was 1600×900 pixels. The pointing task window had a size of 610×610 pixels with a white background. The subjects sat 80 cm from the monitor.

Three interfaces were used, i.e., FM, Camera Mouse and a standard optical mouse. For FM, the commercial depth camera Kinect v2 with 512×424 pixels was used and mounted above the LCD monitor. Camera Mouse is a popular interface for people with disabilities, and has been downloaded more than 2.5 million times to date [33]; it was used for comparison with the proposed interface FM. The input RGB camera of Camera Mouse was a 1280×720 pixel HD webcam. For Camera Mouse, the subject had to keep the face still to freeze the cursor on the target for 500 ms to select it. In contrast, for FM, the subject could open the mouth or keep the nose still to freeze the cursor. The standard mouse is a good baseline to validate our data analyses and experimental methodologies, as in [1], [12]; for the mouse, the performance range has been well established as between 3.7 bits/s and 4.9 bits/s [32], [1], [8], [12]. For each subject, the test order of the different interfaces (FM, Camera Mouse and the mouse) was randomized. The subject had to finish all tests of one interface before starting to test another interface. Before the test, the subjects practised for more than one hour with FM or Camera Mouse, and more than ten minutes with the mouse. During the test, subjects always worked in a good illumination condition with a suitably coloured background to eliminate environmental effects on Camera Mouse.

Table IV shows the details of the test trials. There were 90 trials in each round. Each interface was tested for twenty rounds per subject. Sixteen able-bodied participants (10 males, 6 females, graduate students and staff from Nanyang Technological University, age range 23~36 years) took the tests. They had no prior experience with head or face interfaces, and they used a PC for more than 5 h daily. The total number of trials was 86400. The head and face of the target users of the proposed interface should be able to move voluntarily, as for able-bodied users. Furthermore, the participants of the experiments should also be able to operate the baseline interface, i.e., the mouse by hand, which is required for comparison. Therefore, the experiments were conducted with able-bodied participants, as in [4], [12].

TABLE IV
TEST TRIALS IN OUR EXPERIMENTS.

Trials    Rounds    Interfaces    Participants    Total trials
90        20        3             16              86400

TABLE V
RESULTS OF THE EXPERIMENT ON INTERFACE OPERATION. TP_n (TCT_n) IS THE VALUE OF TP (TCT) NORMALIZED BY THE CAMERA MOUSE TP (TCT).

              Camera Mouse    FM           Mouse
TP (bits/s)   1.52±0.12       3.26±0.26    4.39±0.36
TP_n (%)      100             215          289
TCT (min)     4.68±0.35       2.88±0.25    1.53±0.12
TCT_n (%)     100             61           33

Fig. 8. Benchmarking ATs in terms of normalized TP with respect to the mouse TP. "FM*" is our preliminary work. (Normalized TP: EMG (neck muscle) [4]: 16%; Head Orientation (head) [4]: 20%; LCS (lip) [12]: 21%; HeadTracker (head) [16]: 22%; TDS (tongue) [1]: 27%; dTDS (tongue+speech) [8]: 28%; Camera Mouse (face): 35%; FM* (face) [25]: 52%; SmartNav (head) [16]: 56%; EyeTracker (eye) [14]: 65%; FM (face): 74%; Mouse: 100%.)

4) Results: The operation results are shown in Table V. The mouse throughput is 4.39 bits/s, which is in line with many other experiments [32], [1], [8], [12] and indicates that our data analysis, GUI functionality and methodology are adequate. TP and TCT of FM are 115% and 39% better than those of Camera Mouse, respectively. In the test of Camera Mouse, there were a few instances of losing track of the selected point due to feature drift, and re-selection of the tracking point was required. In these cases, the unsuccessful rounds were discarded and the subjects were required to do the same rounds again. The proposed method avoids this tracking-loss problem. In short, Table V demonstrates that the proposed method is remarkably better than Camera Mouse. Four major reasons contribute to the improvement:

1) The proposed approach enables the subjects to conveniently adjust the head to comfortable poses by themselves. For Camera Mouse, the head pose relative to the cursor is adjusted only at the beginning of operation.

2) The proposed mapping function speeds up the cursor motion when the target is far from the cursor, and suppresses the noise when the target is close. For Camera Mouse, the mapping function is linear.

3) The proposed extraction algorithm can extract the position and the expression from a single image, avoiding Camera Mouse's "feature drift" problem during tracking, which is a common problem of tracking algorithms. For Camera Mouse, when the feature drifts, the subjects feel uncomfortable.

4) The proposed mouth status allows the subjects to stop the cursor quickly. Camera Mouse needs a longer time to stop the cursor when the cursor is moving fast.

To better understand the performance of FM, Fig. 8 benchmarks FM against some other ATs in terms of their normalized TP with respect to the mouse TP. In Fig. 8, EyeTracker [14] needs the head of the user to be fixed, and SmartNav [16] uses an infrared camera to track a reflector placed on the head of the user. FM has the best performance among all of the ATs. Compared with our preliminary work FM*, the performance of FM is 22% better. There are two underlying reasons for the improvement: (1) the accuracy of the nose position of FM is higher; (2) the non-linear mapping function improves the performance over the linear mapping function used in [25].

Fig. 10. TP for different target directions in the tapping task with FM, Camera Mouse and the mouse. Top: original TP. Bottom: TP normalized by the mean TP.

Fig. 9 shows some cursor trajectories of the three interfaces in the tapping task. The critical difference among the three trajectories is the precision while pointing at the selected locations. All three methods suffer from the overshooting problem. However, the proposed FM has performance similar to the mouse and outperforms Camera Mouse. The reason is that FM and the mouse can stop the cursor in time: for FM, the subject can open his mouth to stop the cursor, and the term (1 − exp(−(d/n)^{p_2})) of the mapping function effectively eliminates the jitter of the cursor. In contrast, Camera Mouse cannot stop the cursor in time, and it is easy for the cursor to overshoot the target. To avoid overshooting, the speed has to be reduced, resulting in a lower TP.

Fig. 10 shows TP for different target directions in the tapping task with FM, Camera Mouse and the mouse. The top plot shows how the original performance varies with the target direction, and the bottom plot shows the performance normalized by the mean performance. FM has approximately equal TP for different target directions, similar to the mouse and slightly better than Camera Mouse, which illustrates that the performance of FM is independent of the target direction.

Fig. 9. Cursor trajectories of the tapping task with D=534 and W=57. Red dots are the selected locations. (a) Camera Mouse. (b) FM. (c) Mouse.

Fig. 11. Comparison of the interfaces in terms of MT versus ID and their regression models based on Fitts' law. (Regression lines: mouse: MT = 0.17 ID + 0.15, R² = 0.99; FM: MT = 0.38 ID − 0.19, R² = 0.99; Camera Mouse: MT = 0.55 ID + 0.31, R² = 0.96.)

Fig. 11 shows the relationship between MT and ID. The more linear the relationship, the better the performance of the interface fits Fitts' law. From Fig. 11, we can see that the coefficient of determination R² of FM is the same as that of the mouse, i.e., R² = 0.99, which is higher than that of Camera Mouse, i.e., R² = 0.96. The intercepts for all three interfaces are inside the normal range of −0.20 ∼ 0.40 s [32].
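The regression models of Fig. 11 are linear fits of MT against ID, typically obtained by least squares; a sketch of such a fit and its R² is given below, with invented (ID, MT) pairs for illustration only.

```python
import numpy as np

def fitts_regression(ID, MT):
    """Least-squares fit of MT = intercept + slope * ID and its R^2."""
    slope, intercept = np.polyfit(ID, MT, deg=1)
    pred = intercept + slope * np.asarray(ID)
    ss_res = np.sum((np.asarray(MT) - pred) ** 2)
    ss_tot = np.sum((np.asarray(MT) - np.mean(MT)) ** 2)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Invented (ID, MT) pairs for illustration only.
ID = np.array([1.46, 2.33, 2.67, 3.00, 3.37, 4.23])
MT = np.array([0.40, 0.72, 0.85, 0.98, 1.10, 1.43])
slope, intercept, r2 = fitts_regression(ID, MT)
print(f"MT = {slope:.2f} * ID + {intercept:.2f},  R^2 = {r2:.2f}")
```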

Fig. 12. Assessment results of the questionnaire for FM and Camera Mouse. Higher scores are better: response 7 is the most favourable and response 1 the least favourable. Those comparisons that are significantly different (p < 0.05) are denoted with "*" on the horizontal axis. (Rated aspects: Like to use, General impression vs mouse, Neck fatigue, Mouth fatigue, General comfort, Operation speed, Target selection, Accurate pointing, Physical effort, Mental effort, Smoothness.)

Fig. 12 shows the qualitative results of the questionnaire for FM and Camera Mouse. The questions are rated from 1 to 7, where 1 and 7 are the least and most favourable responses, respectively. One-way analysis of variance (ANOVA) with pairwise comparison is used to analyse the differences between the results of FM and Camera Mouse. Those comparisons that are significantly different (p < 0.05) are denoted with "*" on the horizontal axis. Only "Neck fatigue" and "Accurate pointing" are not significantly different. FM receives higher scores in all terms except "Mouth fatigue", since Camera Mouse does not require mouth motion. The subjects prefer FM to Camera Mouse in terms of "Like to use" in Fig. 12, with responses of 6.1 vs 4.9, which is a significant difference. In addition, the average qualitative score over all the responses for FM is 5.7, which is significantly different from and 14% better than that for Camera Mouse, i.e., 5.0.

V. CONCLUSION

In this paper, a human computer interface is proposed for persons with tetraplegia based on facial position and expression information from a depth camera. The proposed RDT-based detection algorithm is able to detect the nose position and the mouth status from a single depth image accurately and efficiently, without the requirement for calibration or initialization. The proposed feature of the RDT dramatically improves the accuracy and speed of detection. The proposed control function improves the operation performance. The interface combines the advantages of facial expression based interfaces and head motion based interfaces. Furthermore, the interface is independent of colour and illumination changes, which enhances its robustness compared with interfaces based on RGB cameras. Experimental results show that the proposed detection algorithm is faster and more accurate than the state-of-the-art algorithms. The quantitative and qualitative results show that the proposed system outperforms the state-of-the-art ATs by 9∼58% and 14% in terms of TP and the average qualitative score, respectively.

ACKNOWLEDGMENT

The authors would like to acknowledge the Ph.D. grant from the Institute for Media Innovation, Nanyang Technological University, Singapore.

REFERENCES

[1] B. Yousefi, X. Huo, E. Veledar, and M. Ghovanloo, "Quantitative and comparative assessment of learning in a tongue-operated computer input device," IEEE Transactions on Information Technology in Biomedicine, vol. 15, no. 5, pp. 747–757, Sept 2011.
[2] J. Music, M. Cecic, and B. M., "Testing inertial sensor performance as hands-free human-computer interface," WSEAS Trans. Comput., vol. 8, pp. 715–724, Apr. 2009.
[3] P. McCool, G. Fraser, A. Chan, L. Petropoulakis, and J. Soraghan, "Identification of contaminant type in surface electromyography (EMG) signals," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 4, pp. 774–783, July 2014.
[4] M. Williams and R. Kirsch, "Evaluation of head orientation and neck muscle EMG signals as command inputs to a human-computer interface for individuals with high tetraplegia," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 16, no. 5, pp. 485–496, Oct 2008.
[5] V. Mihajlovic, B. Grundlehner, R. Vullers, and J. Penders, "Wearable, wireless EEG solutions in daily life applications: What are we missing?" IEEE Journal of Biomedical and Health Informatics, vol. PP, no. 99, pp. 1–1, 2014.
[6] J. Mak and J. Wolpaw, "Clinical applications of brain-computer interfaces: Current state and future prospects," IEEE Reviews in Biomedical Engineering, vol. 2, pp. 187–199, 2009.
[7] Y. Nam, B. Koo, A. Cichocki, and S. Choi, "GOM-Face: GKP, EOG, and EMG-based multimodal interface with application to humanoid robot control," IEEE Transactions on Biomedical Engineering, vol. 61, no. 2, pp. 453–462, Feb 2014.
[8] X. Huo, H. Park, J. Kim, and M. Ghovanloo, "A dual-mode human computer interface combining speech and tongue motion for people with severe disabilities," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 21, no. 6, pp. 979–991, Nov 2013.
[9] J.-S. Park, G.-J. Jang, J.-H. Kim, and S.-H. Kim, "Acoustic interference cancellation for a voice-driven interface in smart TVs," IEEE Transactions on Consumer Electronics, vol. 59, no. 1, pp. 244–249, February 2013.
[10] M. Mazo, "An integral system for assisted mobility automated wheelchair," IEEE Robotics Automation Magazine, vol. 8, no. 1, pp. 46–56, Mar 2001.
[11] C. Lau and S. O'Leary, "Comparison of computer interface devices for persons with severe physical disabilities," Amer. J. Occupat. Therapy, vol. 47, no. 11, pp. 1022–1030, Nov 1993.
[12] M. Jose and R. de Deus Lopes, "Human-computer interface controlled by the lip," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 302–308, Jan 2015.
[13] X. Zhang and I. S. MacKenzie, "Evaluating eye tracking with ISO 9241 part 9," in Proceedings of HCI International. Springer, 2007, pp. 779–788.
[14] I. S. MacKenzie, "Evaluating eye tracking systems for computer input," in Gaze Interaction and Applications of Eye Tracking: Advances in Assistive Technologies. IGI Global, 2012, pp. 205–225.
[15] M. Betke, J. Gips, and P. Fleming, "The Camera Mouse: visual tracking of body features to provide computer access for people with severe disabilities," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 10, no. 1, pp. 1–10, March 2002.
[16] S. Guness, F. Deravi, K. Sirlantzis, M. Pepper, and M. Sakel, "Evaluation of vision-based head-trackers for assistive devices," in 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug 2012, pp. 4804–4807.
[17] T. Morris and V. Chauhan, "Facial feature tracking for cursor control," Journal of Network and Computer Applications, vol. 29, no. 1, pp. 62–80, 2006.
[18] J. Tu, H. Tao, and T. Huang, "Face as mouse through visual face tracking," Computer Vision and Image Understanding, vol. 108, no. 1–2, pp. 35–40, 2007, Special Issue on Vision for Human-Computer Interaction.
[19] S. Epstein, E. Missimer, and M. Betke, "Using kernels for a video-based mouse-replacement interface," Personal and Ubiquitous Computing, vol. 18, no. 1, pp. 47–60, 2014.
[20] L. R. Hochberg et al., "Neuronal ensemble control of prosthetic devices by a human with tetraplegia," Nature, vol. 442, pp. 164–171, July 2006.
[21] E. Kandel, J. Schwartz, and T. Jessell, Principles of Neural Science. New York: McGraw-Hill, 2000.
[22] SoftKinetic, http://www.softkinetic.com/.
[23] Microsoft, http://www.microsoft.com/en-us/kinectforwindows/.
[24] A. Robledo, C. S., and J. Guivant, "Outdoor ride: Data fusion of a 3D Kinect camera installed in a bicycle," in Proceedings of the 2011 Australasian Conference on Robotics and Automation, 2011.
[25] Z.-P. Bian, J. Hou, L.-P. Chau, and N. Magnenat-Thalmann, "Human computer interface for quadriplegic people based on face position/gesture detection," in Proceedings of the ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 1221–1224.
[26] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2011, pp. 1297–1304.
[27] Z. Bian, J. Hou, L. Chau, and N. Magnenat-Thalmann, "Fall detection based on body part tracking using a depth camera," IEEE Journal of Biomedical and Health Informatics, vol. PP, no. 99, pp. 1–1, 2014.
[28] A. Savran, B. Sankur, and M. Taha Bilge, "Comparative evaluation of 3D vs. 2D modality for automatic detection of facial action units," Pattern Recogn., vol. 45, no. 2, pp. 767–782, Feb 2012.
[29] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[30] P. M. Fitts, "The information capacity of the human motor system in controlling the amplitude of movement," Journal of Experimental Psychology, vol. 47, no. 6, pp. 381–391, 1954.
[31] ISO/TS, "Ergonomics of human-system interaction – Part 411: Evaluation methods for the design of physical input devices, ISO/TS 9241-411:2012," ISO/TS, May 2012.
[32] R. W. Soukoreff and I. S. MacKenzie, "Towards a standard for pointing device evaluation, perspectives on 27 years of Fitts' law research in HCI," Int. J. Hum.-Comput. Stud., vol. 61, no. 6, pp. 751–789, Dec. 2004.
[33] Camera Mouse, http://www.cameramouse.org/.