

Simulating A Smartboard by Real-time Gesture Detection in Lecture Videos

Feng Wang, Chong-Wah Ngo, and Ting-Chuen Pong

Abstract— Gesture plays an important role for recognizing lecture activities in video content analysis. In this paper, we propose a real-time gesture detection algorithm by integrating cues from visual, speech and electronic slides. In contrast to conventional "complete gesture" recognition, we emphasize detection by prediction from "incomplete gestures". Specifically, intentional gestures are predicted by modified HMMs (Hidden Markov Models) which can recognize incomplete gestures before the whole gesture paths are observed. The multi-modal correspondence between speech and gesture is exploited to increase the accuracy and responsiveness of gesture detection. In lecture presentation, this algorithm enables the on-the-fly editing of lecture slides by simulating appropriate camera motion to highlight the intention and flow of lecturing. We develop a real-time application, namely a simulated smartboard, and demonstrate the feasibility of our prediction algorithm using hand gestures and a laser pen with a simple setup that does not involve expensive hardware.

Keywords: Lecture video, gesture detection, real-time simulated smartboard.

I. INTRODUCTION

In the past few years, multimedia in education has attracted considerable research attention. With the aid of hardware and software, researchers attempt to find innovative ways of teaching, and thus enhance the quality of learning. Typical demonstrated systems include Classroom 2000 [1], the BMRC lecture browser [20], and SCORM [22]. Due to the popularity of e-learning, lecture videos are widely available for online access. A lecture is captured by different devices equipped in the classroom, such as video cameras, microphones and electronic boards. The recorded data is then used for offline video editing or online broadcasting. To automate this process, video content analysis is usually essential to understand the activities during the lecture.

Numerous issues have been addressed for the content analysis of lecture or instructional videos. These issues include topical detection, synchronization, gesture detection, pose estimation, video summarization, and editing. In topical detection, a lecture video is structured according to the topics of discussion by audio [13], visual [10], [13], [16], text [25] or cinematic expressive cues [19]. The detected topics are synchronized (or linked) with external documents for effective indexing, retrieval and editing [4], [10], [16], [25]. To facilitate browsing and summarization, keyframes [9], mosaics [11] and statistical highlights [2], [7] are also extracted. Texts in the projected slides [25] and whiteboard [8] are detected

F. Wang is with the Department of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, HK. Email: [email protected]

C. W. Ngo is with the Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong. Tel: (852)2784-4390. Fax: (852)2788-8614. Email: [email protected]

T. C. Pong is with the Department of Computer Science, Hong Kong University of Science & Technology, Clear Water Bay, Kowloon, HK. Email: [email protected]

and recognized to identify the content under discussion. Gesture detection [3], [26] and pose estimation [18], [27] are employed to recognize the presenter's activities. Based on the results of content analysis, lecture videos are then edited in either offline [5], [29] or online [21], [32], [18] manners to better deliver the quality of teaching and learning.

Advances in video content analysis have brought us opportunities to create not only efficient ways of learning, but also convenient tools for presentation. In this paper, we first propose a gesture prediction algorithm by fusing multi-modality cues including visual, speech and slides. The algorithm is then applied to the real-time simulation of a smartboard, where the lecture slides projected on the screen are edited on-the-fly by simulated camera motion and gesture annotations. The editing aims to capture the flow of lecturing naturally without involving a heavy hardware setup.

Gesture is a popular kind of lecture activity. Most gestures used by presenters are deictic gestures that direct the audience's attention to the content under discussion. In this context, gesture has been addressed in [3], [12], [5], [18]. Most of these works employ simple features such as frame difference and skin color for gesture detection. In [3], frame difference is employed to detect gestures present in the slide region. Once a gesture is detected, appropriate commands are sent to a camera on the floor to zoom in/out. In [12], although no specific gesture detection is mentioned, some gestures can be noticed by the frame difference calculated by a wide-angle camera to control another camera for video capture. In [18], an algorithm is proposed to keep detecting three skin-color blocks in the whiteboard region, which are assumed to be the presenter's face and hands. Gestures are extracted according to the relative locations of the three skin-color blocks.

The challenge in gesture detection is that there is no salient feature to describe the hand in the video due to its small size and irregular shape. Any simple or single feature is not robust enough. Frame difference can be triggered by any moving object, while noisy skin-similar colors are usually introduced. In addition, intentional gestures are always intertwined with non-gesture movements of the hands, which makes existing approaches less tolerant to noise. For instance, a presenter can move freely in front of the class, and thus the hand may interact with the slide even when there is no deictic gesture pointing to the screen. This results in difficulty in determining the intention or the information to be conveyed by the gesture.

In [26], [29], we have proposed an algorithm for off-line gesture detection and recognition in lecture videos. To be robust, frame difference and skin color are combined to detect and track candidate gestures. Gesture recognition is then employed to recognize and extract intentional gestures from the gesture paths. Here we define a gesture as an intentional interaction between the presenter's hand and some object instances (e.g., a paragraph or a figure) in the slide. Three kinds


of gestures (lining, circling and pointing) are recognized by three HMMs (Hidden Markov Models). By gesture recognition, we extract intentional gestures and eliminate non-gesture movements. A gesture is detected only when it is verified by gesture recognition.

In this paper, we address the detection and recognition of gestures in lecture videos for real-time applications. The major difference between this work and [26] lies in two aspects. First, to cope with the efficiency requirements of real-time applications, the responsiveness and accuracy of gesture detection are jointly considered. For example, in an automatic camera management system, when a gesture is present, a camera is expected to focus on the region of interaction. The action from the camera should be as rapid as possible so that the interaction can be captured and used for editing in real-time. In [26], [29], for off-line video editing, a gesture is verified when the full gesture path is completed, which is too late for the camera to respond to the gesture. To tackle the responsiveness problem, our new algorithm predicts and reacts as early as possible before a gesture path is fully observed. Second, besides the visual cue, we combine speech and electronic slides to improve the accuracy and reduce the response time of gesture detection.

Gesture in lecture videos has been proven to be a very important and useful hint for recognizing lecture activities in offline [5], [29] or online [21], [32], [18] video editing, providing students with an efficient and effective way of learning. However, employing gestures and other lecture activities to create tools for teaching has seldom been attempted before. In [24], an experimental system called Magic Boards is proposed. Given a lecture video with some interactions around a chalkboard, the user who is familiar with the interactions is presented with a small set of images. Each image represents a single idea written on the board. The user then replaces each image with a picture, video or animation to better represent each idea. A new video is rendered using the replaced pictures and videos in place of the writing on the board. Due to the offline nature, the system cannot be used in the course of the presentation. The whole procedure is manual, which can be highly time-consuming.

Based on our initial work in [28] for gesture prediction, we propose and develop a simulated smartboard for real-time lecture video editing. With the smartboard, a presenter can interact, by hand or using a laser pen, with the projected slides to generate a novel view of the slide. The interactions include annotating the slides, controlling the slide show, and highlighting specific compound objects in slides. In response to the interactions, the projected slides are edited automatically by analyzing the video stream captured by a camera mounted at the back of the classroom. Technically, the incomplete gestures in videos are first detected ahead of completion through prediction. The appropriate actions, such as camera zoom and superimposition of gesture annotations, are then simulated and projected to the screen on-the-fly. This process is illustrated in Figure 1. In (a), a candidate gesture interacting with the slide is detected and tracked. In (b), the gesture path is superimposed on the electronic slide and recognized as a "circling" gesture. The Region-Of-Interest (ROI) pointed

by the gesture is localized. In the next few frames, the ROI is highlighted by a simulated zoom-in on-the-fly, as shown in (c). Due to the real-time capability, the smartboard can be used during the presentation and simulates most functions of a real smartboard. Compared with the high cost of a real smartboard, our setup is relatively simple and economical with the content-based solution.

The remainder of this paper is organized as follows. In Section II, we revise the HMMs for complete gesture recognition in our previous work [26] to recognize incomplete gestures. In Section III, speech and electronic slides are fused with the visual cue to improve the accuracy and responsiveness of gesture detection. Section IV describes the design and implementation of the simulated smartboard. Experimental results are shown in Section V. Finally, Section VI concludes this paper.

II. GESTURE DETECTION BY VISUAL CUE

To extract intentional gestures, three kinds of gestures frequently used by presenters are defined: lining, circling and pointing. Intentional gestures are then recognized and extracted by gesture recognition. Similar to our previous works [26], [29], skin color and frame difference are fused as evidence for candidate gesture detection and tracking. In [29], gesture recognition is activated after the full gesture path is observed, without considering the responsiveness constraint. In this section, we modify the HMMs (Hidden Markov Models) in [26], [29] to recognize incomplete gestures so that gestures can be detected and verified earlier to cope with the efficiency requirements of real-time applications.

A. HMM for Complete Gesture Recognition

HMM has been proven to be useful in sign gesture recognition [23]. A detailed tutorial on HMMs can be found in [31]. In [26], we employ HMMs to recognize the deictic gestures in lecture videos. Three HMM models are trained to recognize the three defined gestures.

Given an observation O = (o_1, o_2, ..., o_T), where each o_i (i = 1, 2, ..., T) is a sampled point on the gesture path, the gesture recognition problem can be regarded as that of computing

$$\arg\max_i \{ P(g_i \mid O) \} \qquad (1)$$

where g_i is the i-th gesture. By using Bayes' rule, we have

$$P(g_i \mid O) = \frac{P(O \mid g_i)\, P(g_i)}{P(O)} \qquad (2)$$

Thus, for a given set of prior probabilities P(g_i), the most probable gesture depends only on the likelihood P(O | g_i). Given the dimensionality of the observation sequence O, the direct estimation of the joint conditional probability P(o_1, o_2, ... | g_i) from examples of gestures is not practicable. However, if a parametric model of gesture production such as a Markov model is assumed, then estimation from data is possible, since the problem of estimating the class-conditional observation densities P(O | g_i) is replaced by the much simpler problem of estimating the Markov model parameters.

Fig. 1. Real-time simulated smartboard based on gesture prediction.

In HMM-based gesture recognition, it is assumed that the observation sequence corresponding to each gesture is generated by a Markov model, as shown in Figure 2.

A Markov model is a finite state machine which changes state once for every observation point. The joint probability that O is generated by the model M moving through the state sequence X = x_1, x_2, ..., x_T is calculated simply as the product of the transition probabilities and the output probabilities

$$P(O, X \mid M) = \prod_{t=1}^{T} b_{x_t}(o_t)\, a_{x_t x_{t+1}} \qquad (3)$$

where b_{x_t}(o_t) is the probability that o_t is observed in state x_t, and a_{x_t x_{t+1}} is the probability that the state transition from x_t to x_{t+1} is taken. However, in practice, only the observation sequence O is known and the underlying state sequence X is hidden. This is why it is called a Hidden Markov Model. Given that X is unknown, the required likelihood is computed by summing over all possible state sequences, that is

$$P(O \mid M) = \sum_{X} \prod_{t=1}^{T} b_{x_t}(o_t)\, a_{x_t x_{t+1}} \qquad (4)$$

In [31], by using Bayes' rule and approximation, Equation 1 is reduced to calculating the likelihood

$$\tilde{P}(O \mid M) = \max_{X} \left\{ \prod_{t=1}^{T} b_{x_t}(o_t)\, a_{x_t x_{t+1}} \right\} \qquad (5)$$

where X = x_1, x_2, ..., x_T is the state sequence through which O moves in the model M. Notice that if Equation 1 is computable, then the recognition problem is solved.

Fig. 2. The HMM model for complete gesture recognition.

Given a set of HMM models {M_i} corresponding to gestures {g_i}, Equation 1 is solved by using Equation 2 and assuming that

$$P(O \mid g_i) = P(O \mid M_i) \qquad (6)$$

In the HMM models, a_{ij} and b_i(·) are estimated by the Baum-Welch algorithm in the training phase. For gesture recognition, Equation 5 is calculated by employing the Viterbi algorithm. The details of the two algorithms can be found in [31].
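
As a minimal sketch (not the authors' implementation), the Viterbi scoring of Equation 5 for a left-to-right gesture HMM with discrete (quantized) path observations could look as follows; the parameter matrices and the observation quantization are assumed to come from the training stage described above.

```python
import numpy as np

def viterbi_log_likelihood(obs, log_A, log_B, log_pi):
    """Approximate log P~(O|M) of Eq. 5: the probability of the best state path.

    obs    : sequence of discrete observation symbols (quantized path points)
    log_A  : (N, N) log transition matrix, log a_ij
    log_B  : (N, K) log emission matrix,   log b_i(k)
    log_pi : (N,)   log initial state distribution
    """
    delta = log_pi + log_B[:, obs[0]]           # best log-prob ending in each state at t=0
    for o_t in obs[1:]:
        # extend every partial path by one transition and one emission
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o_t]
    return np.max(delta)                        # complete-gesture model: best terminal state

def recognize(obs, models):
    """Pick arg max_i P(O|M_i) over the three gesture models (Eq. 1 with Eq. 6)."""
    scores = {name: viterbi_log_likelihood(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get), scores
```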

B. Modified HMM for Incomplete Gesture Recognition

The HMM models introduced in Section II-A are used for gesture recognition in offline systems. Each gesture is recognized after the complete path is observed. However, this is too late in some real-time applications. For instance, in a system of automatic camera control for lecture capture, once a gesture is present, the camera is expected to zoom in on the corresponding region to highlight the interaction within a short period. For real-time processing, besides accuracy, the response time is an important criterion.

Fig. 3. Modified HMM model for incomplete gesture recognition.

To respond to the presenter's gestures as soon as possible, gesture recognition and verification need to be carried out before the gesture is finally completed. We modify the HMM models in Section II-A to predict and recognize incomplete gestures.

Different from complete gestures, an incomplete gesture usually cannot move through the HMM model to the state exit in Figure 2, but just reaches one of the intermediate states 1, 2, ..., N. Given an observation O which corresponds to an incomplete gesture, incomplete gesture recognition can be regarded as that of computing

$$\arg\max_{i,S} P(g_i, S \mid O) \qquad (7)$$

where S is the last intermediate state that O reaches when it goes through the HMM model corresponding to gesture g_i. In the topology of the HMM model shown in Figure 2, the final non-emitting state exit can be reached only through the last state N, which means a complete gesture must reach state N before it is completed. To recognize incomplete gestures that may stop at any intermediate state, we modify the HMM models by adding a forward state transition from each intermediate state to the state exit, as shown in Figure 3. The joint probability that O moves through an HMM model M and stops at an intermediate state S can be approximated by

$$\tilde{P}(O \mid M, S) = \max_{X} \left\{ \prod_{t=1}^{S} b_{x_t}(o_t)\, a_{x_t x_{t+1}} \right\} \qquad (8)$$

where x_{S+1} is the exit state.

Figure 4 shows the trajectory of a circling gesture: (c) is the complete gesture, and (a), (b) are two incomplete gestures stopping at different intermediate states. Comparing (a) and (b), we are more confident that the current observation will compose a gesture if it moves further through the HMM model, or stops at a state nearer to the state N. Based on this finding, we take the stopping state into account in the probability calculation and modify Equation 8 to be

$$\tilde{P}'(O \mid M, S) = \max_{X} \left\{ \left( \prod_{t=1}^{S} b_{x_t}(o_t)\, a_{x_t x_{t+1}} \right) \cdot e^{\frac{S}{N}} \right\} \qquad (9)$$

where $e^{\frac{S}{N}}$ is introduced to assign a higher probability to the gesture when the stopping state S is nearer to the final state N. Let $a'_{ij} = a_{ij} e^{\frac{j-i}{N}}$; then Equation 9 can be rewritten as

$$\tilde{P}'(O \mid M, S) = \max_{X} \left\{ \prod_{t=1}^{S} b_{x_t}(o_t)\, a'_{x_t x_{t+1}} \right\} \qquad (10)$$

Fig. 4. (a)(b): incomplete circling gestures; (c): complete circling gesture.

Thus, the Viterbi algorithm can still be used for the calculation of Equation 10. The probability that O is an incomplete gesture modeled by M is

$$\tilde{P}(O \mid M) = \max_{S} \tilde{P}'(O \mid M, S) = \max_{X,S} \left\{ \prod_{t=1}^{S} b_{x_t}(o_t)\, a'_{x_t x_{t+1}} \right\} \qquad (11)$$

Equation 1 can then be solved by Bayes' rule and assuming that

$$P(O \mid g_i) = \tilde{P}(O \mid M_i) \qquad (12)$$
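
Continuing the sketch above (the implementation details are assumptions, not given in the paper), the modification of Equations 9-11 amounts to reweighting the transition matrix and taking the best score over all candidate stopping states:

```python
import numpy as np

def incomplete_gesture_log_likelihood(obs, log_A, log_B, log_pi):
    """Sketch of Eq. 11: score an incomplete gesture path against one modified HMM.

    The transition weights are boosted as a'_ij = a_ij * exp((j - i) / N) (Eq. 10),
    and the best score over all possible stopping states S is returned.
    """
    N = log_A.shape[0]
    # log a'_ij = log a_ij + (j - i) / N
    i_idx, j_idx = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    log_A_mod = log_A + (j_idx - i_idx) / N

    delta = log_pi + log_B[:, obs[0]]
    for o_t in obs[1:]:
        delta = np.max(delta[:, None] + log_A_mod, axis=0) + log_B[:, o_t]
    # every intermediate state may now reach exit, so take the max over all of them
    return np.max(delta)
```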

C. Gesture Verification by Recognition

Given an observation sequence O (an incomplete or complete gesture path), for the three defined gestures, three confidence values are calculated based on the modified HMMs that indicate how likely O is to be a gesture g_i:

$$C_i = P(g_i, O) \qquad (13)$$

The C_i values are used to verify whether O is an intentional gesture or not. A confidence value C_visual on the presence of a gesture is calculated from the visual cue as

$$C_{visual} = \frac{C_{max}}{\sum_i C_i} \cdot C_{max} \qquad (14)$$

where C_max = max_i C_i. While an incomplete gesture is in progress, C_visual is calculated for each sample point. To determine the best point at which to predict, a gating parameter is required. Due to the tradeoff between responsiveness and accuracy, finding an optimal parameter is application tailored. We determine the gating parameter C_gate empirically, and start gesture verification whenever C_visual > C_gate. In general, the further O moves through the corresponding HMM model, the more confident we are that there is a gesture, but the longer the response time.
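
A rough sketch of the per-sample verification loop (Equation 14 plus the gate C_gate); the scoring function and the probability-valued C_i are stand-ins for the modified-HMM computation above, not the paper's exact code:

```python
def visual_confidence(scores):
    """Eq. 14: combine per-gesture confidences C_i into a single visual confidence.

    `scores` maps gesture name -> C_i = P(g_i, O); values are assumed non-negative.
    """
    c_max = max(scores.values())
    total = sum(scores.values())
    return (c_max / total) * c_max if total > 0 else 0.0

def verify_on_the_fly(path_points, models, score_fn, c_gate=0.45):
    """Re-score the growing gesture path at every new sample; fire once C_visual > C_gate.

    `score_fn(obs, model)` is a placeholder for the incomplete-gesture scoring above.
    """
    for t in range(2, len(path_points) + 1):
        obs = path_points[:t]                       # incomplete path observed so far
        scores = {g: score_fn(obs, m) for g, m in models.items()}
        if visual_confidence(scores) > c_gate:
            return max(scores, key=scores.get), t   # predicted gesture and trigger time
    return None, len(path_points)                   # never passed the gate
```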

III. GESTURE DETECTION BY FUSING VISUAL, SPEECH AND ELECTRONIC SLIDES

The proposed gesture prediction from incomplete observations relies mainly on the visual cue. In lecture videos, speech also provides complementary hints for detecting the presence of gestures. In this section, we first describe the multi-modal relations among visual, speech and slides. Then we present our approach to employing these relations to improve the accuracy and responsiveness of gesture detection by combining speech and semantic information extracted from electronic slides.


Fig. 5. Relations between multi-modal cues.

A. Relations Between Visual, Speech and Slides

Imagine the scenario where a presenter interacts with the projected slides while spelling out keywords. In a multi-modal environment, ideally we can sample the gesture and speech, then verify the gesture with the words found in the electronic slides to determine the region of focus. By jointly considering these cues, both the accuracy and the response time of gesture prediction can be improved. The scenario is illustrated in Figure 5. When a gesture is pointing to a paragraph in the slide, the keyword "construct" is found in the speech, and it lies in the paragraph interacting with the gesture. This case happens frequently during a lecture since the presenter tends to read out the keywords to be highlighted. We exploit such correspondence between gesture and speech to boost the performance of prediction.

Figure 5 models the relations between the three multi-modal cues. Gesture-speech correspondence plays an essential role in our task. It is worth noticing that this correspondence cannot be directly estimated, but rather relies on the semantic information from slides, which indirectly bridges the gesture-speech relationship. Visual-slide correspondence recovers the geometric alignment between video and slide, while speech-slide correspondence matches keywords found in the voice and text sources. By mapping both speech and gesture to slides, the correspondence between them is indirectly estimated. In visual-slide correspondence, the alignment is achieved through the synchronization of videos and slides with video text analysis based on our work in [25], [26]. Basically, texts from both video and slides are extracted for matching, and the spatial correspondence is recovered by computing a homography. In speech-slide correspondence, speech is recognized online and matched with words available in the current slide using edit distance.
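
A hedged sketch of the visual-slide alignment step: assuming matched text-block positions are already available from the text analysis of [25], [26], the homography and the mapping of a tracked gesture point into slide coordinates could be computed with OpenCV as below (the function names are illustrative, not from the paper).

```python
import cv2
import numpy as np

def estimate_slide_homography(video_pts, slide_pts):
    """Recover the video-to-slide mapping from matched text-block positions.

    video_pts, slide_pts: (K, 2) arrays of corresponding points, e.g. centroids of
    text lines matched between the captured frame and the electronic slide image.
    """
    H, _ = cv2.findHomography(np.float32(video_pts), np.float32(slide_pts), cv2.RANSAC, 5.0)
    return H

def map_gesture_point_to_slide(H, x, y):
    """Project a tracked gesture point from frame coordinates into slide coordinates."""
    pt = np.float32([[[x, y]]])
    sx, sy = cv2.perspectiveTransform(pt, H)[0, 0]
    return float(sx), float(sy)
```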

B. Gesture-Speech Correspondence for Gesture Detection

Two pieces of information are available: video text interacting with the gesture through visual-slide alignment, and the speech transcript generated by an ASR (Automatic Speech Recognition) engine. By matching both with the slides respectively, the region-of-interest where they coincide can be estimated. In our approach, the layout of each electronic slide is first structured into groups of semantic instances. Due to the visual-slide correspondence, the semantic region that a recognized gesture highlights can be located. The words found in the region are then matched against the speech transcript. A confidence value is computed to indicate the existence of gesture-speech correspondence. If such a correspondence is found, an intentional gesture is more likely present, which boosts the accuracy and responsiveness of gesture detection.

With the PowerPoint slide in Figure 5 as an example, the layout is semantically grouped into separate object instances. By constructing the one-to-one mapping between the instances in slides and videos through homography projection, we can easily organize and structure the layout of videos [29]. In Figure 5, the projected slide region is partitioned into different regions-of-interest (ROIs). Each ROI is a unit that a gesture may interact with at any instance of time. The video texts in the ROIs are known since the mapping between ROIs and semantic instances in the electronic slide is available. Whenever a candidate gesture is interacting with an ROI, we search for matches between the words in the ROI and the speech transcript. Since the significance of matched words can vary, we weight the importance of words with traditional information retrieval techniques. Initially, stemming and stop-word removal are applied to the texts in every slide. By treating each ROI as a document, TF (Term Frequency) and IDF (Inverse Document Frequency) are computed for the remaining keywords to distinguish different ROIs. A confidence value C_speech is calculated based on the TF and IDF values of the keywords as

$$C_{speech} = \sum_{w \in W} \frac{\log(1 + TF_w \cdot IDF_w)}{1 + \Delta t_w} \qquad (15)$$

where W is the set of matched keywords, and Δt_w is the time interval between the presence of the gesture and of the keyword w in speech. Because the gesture and speech are not always temporally aligned, we also consider speech at times before and after the gesture is predicted. For robustness, Δt_w is used to discount the importance of words which are not synchronized with the gesture.
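
For concreteness, a small sketch of Equation 15; the matched-keyword list with timestamps and the per-ROI TF/IDF tables are hypothetical inputs, assumed to be produced by the ASR matching and the ROI indexing described above.

```python
import math

def speech_confidence(matched_words, gesture_time, tf, idf):
    """Eq. 15 sketch: score gesture-speech correspondence for one ROI.

    matched_words : list of (word, spoken_time_sec) recognized near the gesture
                    and also present in the ROI's text (hypothetical input format)
    gesture_time  : time (sec) at which the candidate gesture is predicted
    tf, idf       : dicts of per-ROI term frequency and inverse document frequency
    """
    c = 0.0
    for word, t in matched_words:
        dt = abs(t - gesture_time)          # misalignment between speech and gesture
        c += math.log(1.0 + tf[word] * idf[word]) / (1.0 + dt)
    return c
```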

C. Gesture Detection by Fusing Speech and Visual

If the correspondence between visual and speech discussed in Section III-B can be detected, we are more confident that a gesture is present, and thus speech can be combined with visual for gesture verification. When a candidate gesture is detected by the visual cue, we keep tracking it and calculating the visual evidence C_visual in Equation 14. Meanwhile, the transcripts of speech are generated by ASR. We search for keywords in the transcripts that match the texts in the ROI interacting with the candidate gesture. Once a correspondence between speech and visual is detected, C_speech is calculated. The visual-speech confidence C is computed and fused as

$$C = \lambda_v C_{visual} + \lambda_s C_{speech} \qquad (16)$$

where λ_v + λ_s = 1 and both parameters are weights that linearly balance the significance of visual and speech. Since


the quality of speech is not high and the output of ASR could be noisy, a higher weight is expected for the visual evidence.
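
A minimal sketch of the fusion and gating, using the empirical weights and gate reported in Section V; rescaling C_speech to a range comparable with C_visual is an assumption not detailed in the paper.

```python
def fused_confidence(c_visual, c_speech, lam_v=0.7, lam_s=0.3):
    """Eq. 16: linear fusion of visual and speech evidence (lam_v + lam_s = 1)."""
    return lam_v * c_visual + lam_s * c_speech

def is_intentional_gesture(c_visual, c_speech, c_gate=0.45):
    """Verify the candidate gesture once the fused confidence passes the gate C_gate."""
    return fused_confidence(c_visual, c_speech) > c_gate
```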

IV. SIMULATED SMARTBOARD

With the advance of HCI (Human-Computer Interaction) technologies, smartboards have been installed in some conference rooms as a convenient tool for presentation. A presenter can write, make annotations and control the showing of the slides by using the touch screen and accessory tools such as markers and a wiper. In this section, we demonstrate that the smartboard can be simulated in software with the content-based solution presented in Sections II and III.

A. System Setup and Interface

The system is set up in a classroom with a normal slide screen, an LCD projector, and a stationary video camera facing the screen. Figure 6 shows the interface of the simulated smartboard. Prior to the presentation, the electronic slides are uploaded to the computer. When the application is started, the slides are projected and displayed on the screen as shown in Figure 6. Various tools, including virtual buttons for circling, lining, rubber, forward, backward, and a scrolling bar, are projected onto the screen for the presenter to use. Notice that it is optional to use the buttons when making gestures. Our system can automatically predict and recognize the gestures of circling, lining and pointing. The gesture is superimposed on top of the slide as soon as it is detected. The buttons, if pressed, can improve the response time of the system. To interact with the interface, a presenter can use either a hand or a laser pen, for instance, to press buttons and make gestures. The interactions are captured by the video camera and transferred to the computer through an IEEE 1394 cable for real-time processing. When an interaction is detected and recognized, the corresponding operation is performed and projected to the slide screen.

Fig. 6. Interface of the simulated smartboard.

B. Functions

The system provides three levels of interaction: (1) automatic gesture recognition and camera motion simulation, (2) detection of predefined gestures and actions, and (3) slide annotation with a laser pen. Some example video clips can be found online [30]. In (1), a presenter can interact with the slide by gesturing near the ROIs. The system predicts and recognizes three types of gestures: circling, lining and pointing, based on the algorithms proposed in Sections II and III. Naturally, the response can be quicker if the presenter reads out some of the words in the ROI when gesturing. Once a gesture is detected, the ROI under interaction is highlighted with simulated camera motion such as zoom-in, by employing editing rules similar to our previous work [29]. For instance, in Figure 7(c), a lining gesture pointing to a textline is detected. After a delay of several frames, the textline is shown in the center of the frame at a higher resolution. By highlighting the ROI, on one hand, the presenter can draw the audience's attention to the content under discussion; on the other hand, the texts at higher resolution are easier to read, as illustrated in Figure 7(c).

In (2), a presenter can predefine the intended gestures by pressing buttons in the interface shown in Figure 6. The buttons can be pressed with a finger or a laser pen. The functions of the buttons are briefly summarized below:

• Circle: to draw a circle around an object (a word, equation or figure) on the slide.
• Line: to draw a line under an object (text or equation) on the slide.
• Wiper: to delete a marker (a circle, a line or an annotation) on the screen.
• Forward/Backward: to go to the next/previous slide.
• Scroll bar: to scroll up or down the slide.

To press a button, the hand needs to stay on top of it for at least 0.5 second so that the system can distinguish whether the hand is pressing the button or just sliding over it. Figure 7(a) shows an example. When the button Forward is pressed by the presenter, the next slide will be shown on the screen.

In (3), the presenter may make simple annotations by drawing or writing on the screen with a laser pen, as illustrated in Figure 7(b). The path of the laser pen is tracked and displayed on the screen. Due to precision and efficiency considerations, we do not support annotations or handwriting at small scales. Practically, the laser pen should not be moved too fast. The presenter is expected to turn the laser pen on and off before and after making annotations. Therefore, the detected and tracked laser paths are assumed to be intentional when the pen is on.

C. Implementation Issues

To implement the three levels of functions described in the previous section, the system needs to be aware of three kinds of interactions: hand gesture highlighting, button pressing, and laser pen annotation. The hand gestures are identified jointly by skin-color and frame-difference cues [26], [29], while the trace of the laser pen is mainly detected by tracking the color of the projected spot. The color cue from the laser pen is not always reliable due to reflection and the varying background. Thus, if the hand holding the laser pen is not occluded, we also detect the hand around the laser spot to refine the detection of the laser pen.


Fig. 7. Functions of the simulated smartboard. (a) Button gesture for slide control; (b) annotation by using a laser pen; (c) deictic gesture for content highlighting: detected gesture (left); content highlighting by zoom-in (right).

If the laser pen is pointed from a distance from the screen and the hand cannot be detected along with the laser spot, the laser light color becomes the only useful cue.
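
The paper does not spell out the spot detector, so the following is only an assumed implementation: threshold a small, bright, saturated red blob in HSV and return its centroid; the color ranges are placeholders that would need tuning for the actual projector and camera.

```python
import cv2
import numpy as np

def detect_laser_spot(frame_bgr):
    """Illustrative laser-spot detector: find a small, bright, saturated red blob.

    Returns the spot centroid in frame coordinates, or None if no spot is visible.
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # bright red wraps around hue 0, so combine two hue ranges
    mask = cv2.inRange(hsv, (0, 120, 200), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 120, 200), (180, 255, 255))
    moments = cv2.moments(mask, binaryImage=True)
    if moments["m00"] < 3:              # too few pixels: no spot in this frame
        return None
    return (moments["m10"] / moments["m00"], moments["m01"] / moments["m00"])
```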

The detection of button pressing by hand or laser pen can be handled effectively. Since the buttons are fixed in position, only a few regions in the incoming video stream need to be continuously monitored. To avoid false alarms, such as the case when a hand slides over a button without pressing it, the trajectory of the hand or laser pen needs to be known before any potential detection. In addition, we require the button press to last for at least 0.5 second to validate the action. Whenever pressed, a button is displayed in the pressed state to indicate the activation of the command. After the gesture is finished and displayed on the slide, the button is deactivated.
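
As an illustration of the dwell-time rule (a sketch under the assumption that tracked points arrive in slide coordinates; the class and its names are hypothetical):

```python
import time

class ButtonMonitor:
    """Dwell-based press detection for one fixed on-screen button region (sketch).

    A press is registered only after the hand/laser point stays inside the button's
    bounding box for `dwell_sec` (0.5 s in the paper), which filters out hands that
    merely slide over the button.
    """
    def __init__(self, bbox, dwell_sec=0.5):
        self.bbox = bbox            # (x0, y0, x1, y1) in slide coordinates
        self.dwell_sec = dwell_sec
        self.enter_time = None

    def update(self, point, now=None):
        """Feed the latest tracked point; return True once a press is validated."""
        now = time.monotonic() if now is None else now
        x0, y0, x1, y1 = self.bbox
        inside = point is not None and x0 <= point[0] <= x1 and y0 <= point[1] <= y1
        if not inside:
            self.enter_time = None
            return False
        if self.enter_time is None:
            self.enter_time = now
        return (now - self.enter_time) >= self.dwell_sec
```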

In response to the interactions, appropriate actions such as scrolling, zooming, writing, and adding or removing annotations take place directly on the projected slides. To facilitate easy manipulation of the slides, the uploaded electronic slides are automatically converted to and saved as images. The semantic instances such as textlines, figures and tables are extracted and indexed with their content and spatial positions. With this organization, we can efficiently operate on the slides to project and display the editing decisions in real time.
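
One possible (hypothetical) organization of the converted slides and their indexed semantic instances, roughly matching the description above:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SlideInstance:
    """One semantic instance (text line, figure or table) extracted from a slide."""
    kind: str                        # "textline" | "figure" | "table"
    bbox: Tuple[int, int, int, int]  # position on the rendered slide image (pixels)
    words: List[str] = field(default_factory=list)  # stemmed keywords for speech matching

@dataclass
class SlidePage:
    image_path: str                  # rendered slide image used for projection/editing
    instances: List[SlideInstance] = field(default_factory=list)

    def instance_at(self, x, y):
        """Return the ROI under a gesture point mapped into slide coordinates, if any."""
        for inst in self.instances:
            x0, y0, x1, y1 = inst.bbox
            if x0 <= x <= x1 and y0 <= y <= y1:
                return inst
        return None
```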

The simulated smartboard can analyze 5-10 frames per second. From our experiments, this frame rate is enough for real-time tracking of hand gestures and laser pen movement. Currently we sample 8 frames per second for content analysis. The response time of the system depends mainly on the responsiveness of gesture detection and laser pen tracking, which can be considered acceptable according to our experiments in Section V. Once a gesture for highlighting is recognized, a gradual camera zoom-in is simulated, which takes about 2-4 seconds to avoid abrupt visual changes. The speed of the simulated camera motion can be adjusted according to the user's preference. The response to button-pressing gestures and laser pen annotation is usually more efficient, as neither gesture recognition nor simulated camera motion is involved.
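
A sketch of how the gradual zoom-in could be rendered from the slide image and the localized ROI; the linear interpolation of the crop window over 2-4 seconds is an assumption consistent with the description above, not the paper's exact procedure.

```python
import cv2

def simulate_zoom_frames(slide_img, roi, duration_sec=3.0, fps=8):
    """Simulated zoom-in: interpolate the view from the full slide to the ROI.

    roi: (x0, y0, x1, y1) of the highlighted region on the slide image. Each returned
    frame is the interpolated crop resized back to the slide resolution for projection.
    """
    h, w = slide_img.shape[:2]
    x0, y0, x1, y1 = roi
    n = max(1, int(duration_sec * fps))
    frames = []
    for k in range(1, n + 1):
        t = k / n                               # 0 -> 1 over the zoom duration
        cx0, cy0 = int(t * x0), int(t * y0)
        cx1 = int((1 - t) * w + t * x1)
        cy1 = int((1 - t) * h + t * y1)
        crop = slide_img[cy0:cy1, cx0:cx1]
        frames.append(cv2.resize(crop, (w, h)))
    return frames
```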

V. EXPERIMENTS

We conduct two separate experiments for performance evaluation. The first experiment (Section V-A) verifies the performance of gesture prediction on video-taped lectures. In this setup, the presenters are not aware of the smartboard or the algorithms for tracing their gestures. They present as usual in a normal classroom environment. Our aim is to investigate the accuracy and responsiveness of our prediction algorithm. In the second experiment (Section V-B), the presenters interact freely with the simulated smartboard, while they observe the editing effects projected on the slide screen during the presentation. We investigate, under this real scenario, the performance of our system with respect to different settings such as gesturing with the hand and with a laser pen.

A. Gesture Detection in Video-taped Lectures

We conduct experiments on 5 hours of video consisting of 15 presentations given by 10 lecturers and tutors. The presenters include 5 males and 5 females. The presentations are given in 1 seminar room and 3 classrooms of different sizes, layouts and lighting designs. A stationary camera is used to capture the projected slide and the presenter. Five videos are captured in the seminar room. Due to the reflection from the screen, performing text recognition and gesture detection in these videos is more difficult compared with those captured in the classrooms.

We compare two approaches: gesture prediction with the visual-only cue, and with multi-modal cues. In the dataset, there are 2060 gestures manually identified and marked. The average duration of a gesture is 4.62 seconds. For each gesture class, we use 200 samples to train the HMM. The training data is composed of gestures from another lecture video and some man-drawn figures such as ellipses and lines. Our previous experiments in [26] show that this training data works for recognizing real gestures in lecture videos.

We employ Microsoft Speech SDK 5.1 for speech recognition. The overall recognition rate is no more than 20% due to the environmental noise in the audio track. Considering the fact that the performance of speech recognition is not high in general, the visual cue is given a higher weight in multi-modality fusion. As a result, the parameters in Equation 16 are empirically set to λ_v = 0.7 and λ_s = 0.3. In addition, a parameter C_gate is also required to gate the confidence of detection in both Equations 14 and 16. Practically, gestures with lower detection confidence are regarded as noise. A larger value of C_gate usually means more competitive performance but at the expense of a longer response time. Empirically, we set C_gate = 0.45.

Table I and Figure 8 summarize and compare the performance by showing the recall and precision of gesture detection when varying the responsiveness constraints (delays).


Obviously, allowing a longer response time leads to better performance. With the visual-only cue, 65% of all gestures can be correctly predicted (precision = 0.75) with a delay of 2.5 seconds after the gestures start. In contrast, when multi-modal cues are jointly considered, 85% of gestures can be recognized (precision = 0.77) with only a 1.8-second delay. From Figure 8, to achieve the same accuracy, less response time is required when speech is combined with the visual cue. The results indicate the significance of multi-modal cues, where the majority of gestures can be correctly recognized halfway before the complete gesture paths are observed. An interesting observation from the experiments is that most presenters, in addition to gesturing, tend to highlight keywords verbally when explaining important concepts. This not only improves the recognition rate of speech, but also greatly signifies the benefit of gesture-speech correspondence, which boosts the recognition performance while shortening the response time.

TABLE I
RESULTS OF GESTURE DETECTION. (N_g: total number of gestures; N_d: number of gestures detected; N_f: number of gestures falsely detected; N_m: number of gestures missed; N_c: number of gestures correctly detected. Recall = N_c/N_g, Precision = N_c/N_d.)

N_g: 2060. Gesture duration: average 4.62 sec, standard deviation 1.83 sec.

Method                    Delay (sec)     0.6    1.2    1.8    2.5    4.0
Visual                    N_d             449    736   1255   1783   1966
                          N_f             217    320    412    435    308
                          N_m            1828   1644   1217    712    402
                          N_c             232    416    843   1348   1658
                          Recall (%)     11.3   20.2   40.9   65.4   80.5
                          Precision (%)  51.7   56.5   67.2   75.6   84.3
Visual + Speech + Slides  N_d             980   1391   2275   2252   2170
                          N_f             376    501    533    447    319
                          N_m            1456   1170    318    255    209
                          N_c             604    890   1742   1805   1851
                          Recall (%)     29.3   43.2   84.6   87.6   89.9
                          Precision (%)  61.6   64.0   76.6   80.2   85.3

The parameter C_gate can impact the performance if its value is not selected within a proper range. Figure 9 shows the sensitivity of the parameter C_gate with respect to the performance and the response time. When C_gate > 0.4, the algorithm performs satisfactorily in terms of both precision and recall values. A small gating confidence usually introduces many false alarms, while a very high value results in a small decline of the recall, which means some gestures are missed. When the gating confidence becomes larger, higher precision is achieved at the expense of a longer response time. As seen in Figure 9, setting the value of C_gate in the range of 0.4-0.6 offers a good compromise between accuracy and speed.

Fig. 8. Recall and precision of gesture detection. (The results are the same as in Table I, but presented to visualize the relationship between the performance of the single- and multi-modality approaches.)

Currently, the proposed gesture prediction algorithm can process approximately 13 frames per second. The frames are evenly sampled along the timeline, and the sampling rate is enough to depict the trajectories of gestures. Given the processing speed, our algorithm is efficient enough for real-time applications.

B. Performance of Simulated Smartboard

We evaluate the performance of the simulated smartboard in providing the three levels of interaction: hand-based gesture highlighting (deictic gesture), gesturing with button pressing (button gesture), and laser pen annotation. A total of 3 hours of presentation videos using the smartboard are captured for performance analysis in terms of gesture detection and laser pen tracking.

1) Gesture Detection: Table II shows the calculated recall and precision values of gesture detection. For the button gesture, a few gestures are missed due to occlusion caused by the presenter and other objects. Some false alarms are inserted when the hand stays near or is projected onto the buttons. For the deictic gesture, since a rather high C_gate (= 0.5) value is used to avoid false alarms in real scenarios, about 85% of all gestures are correctly detected.

TABLE II
ACCURACY OF GESTURE DETECTION. (N_g: total number of gestures; N_d: number of gestures detected; N_c: number of gestures correctly detected; N_m: number of gestures missed; N_i: number of gestures falsely inserted; Recall = N_c/N_g; Precision = N_c/N_d.)

                 N_g   N_d   N_c   N_m   N_i   Recall (%)   Precision (%)
Button gesture   108   113   101     7    12         93.5           89.4
Hand gesture     104    97    88    12     9         84.6           90.7


Fig. 9. Performance and response time of gesture detection with different gating confidence values.

Figure 10 summarizes the response time for gesture detection. The video rate is 24 frames per second, and the response time is expressed as the number of frames of delay after an action is taken. For the button gesture, we calculate the response time as the time interval from the hand interacting with the button to the activation of the command. To detect deictic gestures for content highlighting, we set C_gate = 0.5. A gesture is verified only after the calculated confidence value is larger than this gating evidence. The responsiveness is evaluated by the time interval from the beginning of a gesture to the gesture being detected. With the fixed C_gate, the response time is not under direct control and can be up to the lifetime of the gesture. As shown in Figure 10, the response time for more than 90% of all gestures is less than 3 seconds, which is considered acceptable, and most gestures can be detected halfway along the gesture paths.

Fig. 10. Response time of gesture detection.

2) Laser Pen Tracking: In tracking the laser spot, the projected annotation usually deviates slightly from the real trajectory of the spot. To investigate this effect, we manually mark the laser spot trajectories and calculate the distance between the real and projected annotations. The deviation is generally within 0-5 pixels, with a mean and standard deviation of 3.19 and 1.85 respectively. The deviation is considered acceptable and does not distort the intended annotations when assessing the captured videos subjectively.

We sample 8 frames per second for laser spot detection and tracking. The sampling rate is enough to depict the annotation, given that the laser pen is not moving too fast. To evaluate the responsiveness, for each point on the path of the annotation, we calculate the average delay before the laser point is displayed on the slide after it is projected by the pen. From the experiments, the response time is usually less than one second, with a mean and standard deviation of 17.43 and 4.44 frames respectively.

VI. CONCLUSION

We have presented a real-time gesture detection algorithm powered with prediction ability. Using the proposed modified HMM models, an intentional gesture can be efficiently predicted and verified. By further considering the gesture-speech multi-modal correspondence, most gestures can be effectively recognized halfway before the whole gesture path is completed. With this algorithm, we demonstrate the feasibility of building a simulated smartboard which supports on-the-fly presentation editing through three levels of interaction, particularly with hand gesturing and laser pen annotation. With the simulated smartboard as an example, the experimental results indicate that our algorithm can cope with the requirements of real-time applications with sufficiently high recall and precision.

For future work, the performance of speech recognition should be improved, and the fusion of different cues needs to be further studied. Other applications such as automatic camera management can be developed by employing the algorithm proposed in this work. Besides gesture, head posture proves to be another useful hint for lecture video content analysis [29]. In [27], we have proposed an efficient algorithm for head pose estimation, which could be combined with gesture for real-time lecture video editing in the future.


ACKNOWLEDGEMENT

The work described in this paper was partially supported by the grants DAG01/02.EG16, HIA01/02.EG04, SSRI99/00.EG11, and a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 118906).

REFERENCES

[1] G. D. Abowd, C. G. Atkeson, A. Feinstein, and C. Hmelo, "Teaching and Learning as Multimedia Authoring: The Classroom 2000 Project," ACM Multimedia Conf., pp. 187-198, 2000.

[2] M. Chen, "Visualizing the Pulse of a Classroom," ACM Multimedia Conf., 2003.

[3] M. Chen, "Achieving Effective Floor Control with a Low-Bandwidth Gesture-Sensitive Videoconferencing System," ACM Multimedia Conf., 2002.

[4] B. Erol, J. J. Hull, and D. S. Lee, "Linking Multimedia Presentations with their Symbolic Source Documents: Algorithm and Applications," ACM Multimedia Conf., 2003.

[5] M. Gleicher and J. Masanz, "Towards Virtual Videography," ACM Multimedia Conf., 2000.

[6] M. Gleicher, R. M. Heck, and M. N. Wallick, "A Framework for Virtual Videography," Int. Symp. on Smart Graphics, 2002.

[7] L. He, E. Sanocki, A. Gupta, and J. Grudin, "Auto-Summarization of Audio-Video Presentations," ACM Multimedia Conf., 1999.

[8] L. He and Z. Zhang, "Real-Time Whiteboard Capture and Processing Using a Video Camera for Remote Collaboration," IEEE Trans. on Multimedia, vol. 9, no. 1, pp. 198-206, 2007.

[9] S. X. Ju, M. J. Black, S. Minneman, and D. Kimber, "Summarization of Videotaped Presentations: Automatic Analysis of Motion and Gesture," IEEE Transactions on Circuits and Systems for Video Technology, 1998.

[10] T. Liu, R. Hjelsvold, and J. R. Kender, "Analysis and Enhancement of Videos of Electronic Slide Presentations," Int. Conf. on Multimedia & Expo, 2002.

[11] T. Liu and J. R. Kender, "Spatio-temporal Semantic Grouping of Instructional Video Content," Int. Conf. on Image and Video Retrieval, 2003.

[12] Q. Liu, Y. Rui, A. Gupta, and J. J. Cadiz, "Automatic Camera Management for Lecture Room Environment," Int. Conf. on Human Factors in Computing Systems, 2001.

[13] T. F. S.-Mahmood and S. Srinivasan, "Detecting Topical Events in Digital Video," ACM Multimedia Conf., 2000.

[14] E. Machnicki and L. Rowe, "Virtual Director: Automating a Webcast," Multimedia Computing and Networking, 2002.

[15] J. Martin and J. B. Durand, "Automatic Gesture Recognition Using Hidden Markov Models," Int. Conf. on Automatic Face and Gesture Recognition, 2000.

[16] S. Mukhopadhyay and B. Smith, "Passive Capture and Structuring of Lectures," ACM Multimedia Conf., 1999.

[17] C. W. Ngo, T. C. Pong, and T. S. Huang, "Detection of Slide Transition for Topic Indexing," Int. Conf. on Multimedia & Expo, 2002.

[18] M. Onishi and K. Fukunaga, "Shooting the Lecture Scene Using Computer-Controlled Cameras Based on Situation Understanding and Evaluation of Video Images," Int. Conf. on Pattern Recognition, 2004.

[19] D. Q. Phung, S. Venkatesh, and C. Dorai, "Hierarchical Topical Segmentation in Instructional Films Based on Cinematic Expressive Functions," ACM Multimedia Conf., 2003.

[20] L. A. Rowe and J. M. Gonzlez, "BMRC Lecture Browser," http://bmrc.berkeley.edu/frame/projects/lb/index.html

[21] Y. Rui, A. Gupta, and J. Grudin, "Videography for Telepresentations," Int. Conf. on Human Factors in Computing Systems, 2003.

[22] T. K. Shih, T. H. Wang, C. Y. Chang, T. C. Kao, and D. Hamilton, "Ubiquitous e-Learning With Multimodal Multimedia Devices," IEEE Trans. on Multimedia, vol. 9, no. 3, pp. 487-499, 2007.

[23] T. Starner et al., "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video," IEEE Trans. on Pattern Analysis and Machine Intelligence, 1998.

[24] M. N. Wallick and M. L. Gleicher, "Magic Boards," SIGGRAPH, 2005.

[25] F. Wang, C. W. Ngo, and T. C. Pong, "Synchronization of Lecture Videos and Electronic Slides by Video Text Analysis," ACM Multimedia Conf., 2003.

[26] F. Wang, C. W. Ngo, and T. C. Pong, "Gesture Tracking and Recognition for Lecture Video Editing," Int. Conf. on Pattern Recognition, 2004.

[27] F. Wang, C. W. Ngo, and T. C. Pong, "Exploiting Self-Adaptive Posture-Based Focus Estimation for Lecture Video Editing," ACM Multimedia Conf., 2005.

[28] F. Wang, C. W. Ngo, and T. C. Pong, "Prediction-Based Gesture Detection in Lecture Videos by Combining Visual, Speech and Electronic Slides," Int. Conf. on Multimedia & Expo, 2006.

[29] F. Wang, C. W. Ngo, and T. C. Pong, "Lecture Video Enhancement and Editing by Integrating Posture, Gesture, and Text," IEEE Trans. on Multimedia, vol. 9, no. 2, pp. 397-409, 2007.

[30] Video Content Analysis for Multimedia Authoring of Presentations, http://webproject.cse.ust.hk:8006/Smartboard.htm.

[31] S. Young, "HTK: Hidden Markov Model Toolkit," Cambridge Univ. Eng. Dept. Speech Group and Entropic Research Lab. Inc., Washington DC, 1993.

[32] IBM Auto Auditorium System, www.autoauditorium.com.

Feng Wang received his PhD in Computer Science from the Hong Kong University of Science and Technology in 2006. He received his BSc in Computer Science from Fudan University, Shanghai, China, in 2001. He is now a Research Fellow in the Dept. of Computer Science, City University of Hong Kong. Feng Wang's research interests include multimedia content analysis, pattern recognition, and IT in education.

Chong-Wah Ngo (M'02) received his Ph.D in Computer Science from the Hong Kong University of Science & Technology (HKUST) in 2000. He received his MSc and BSc, both in Computer Engineering, from Nanyang Technological University of Singapore in 1996 and 1994 respectively.

Before joining City University of Hong Kong as an assistant professor in the Computer Science department in 2002, he was a postdoctoral scholar at the Beckman Institute of the University of Illinois at Urbana-Champaign (UIUC). He was also a visiting researcher at Microsoft Research Asia in 2002. CW Ngo's research interests include video computing, multimedia information retrieval, data mining and pattern recognition.

Ting-Chuen Pong received his Ph.D. in Computer Science from Virginia Polytechnic Institute and State University, USA, in 1984. He joined the University of Minnesota - Minneapolis in the US as an Assistant Professor of Computer Science in 1984 and was promoted to Associate Professor in 1990. In 1991, he joined the Hong Kong University of Science & Technology, where he is currently a Professor of Computer Science and Associate Vice-President for Academic Affairs. He was an Associate Dean of Engineering at HKUST from 1999 to 2002, Director of the Sino Software Research Institute from 1995 to 2000, and Head of the W3C Office in Hong Kong from 2000 to 2003. Dr. Pong is a recipient of the HKUST Excellence in Teaching Innovation Award in 2001.

Dr. Pong's research interests include computer vision, image processing, pattern recognition, multimedia computing, and IT in Education. He is a recipient of the Annual Pattern Recognition Society Award in 1990 and an Honorable Mention Award in 1986. He has served as Program Co-Chair of the Web and Education Track of the Tenth International World Wide Web Conference in 2001, the Third Asian Conference on Computer Vision in 1998, and the Third International Computer Science Conference in 1995.

Dr. Pong’s research interests include computer vision, image processing,pattern recognition, multimedia computer, and IT in Education. He is arecipient of the Annual Pattern Recognition Society Award in 1990 andHonorable Mention Award in 1986. He has served as Program Co-Chair ofthe Web and Education Track of the Tenth International World Wide WebConference in 2001, the Third Asian Conference on Computer Vision in 1998,and the Third International Computer Science Conference in 1995.