Human-Centered Interaction with Documents

Andreas Dengel, Stefan Agne, Bertin Klein
Knowledge Management Lab, DFKI GmbH
Kaiserslautern, Germany
+49 631 205 3216
{dengel,agne,klein}@dfki.de

Achim Ebert, Matthias Deller
Intelligent Visualization Lab, DFKI GmbH
Kaiserslautern, Germany
+49 631 205 3424
{ebert,deller}@dfki.de

ABSTRACT
In this paper, we discuss a new user interface, a complementary environment for working with personal document archives, i.e., for document filing and retrieval. We introduce our implementation of a spatial medium for document interaction, explorative search, and active navigation, which exploits and further stimulates the human strengths of visual information processing. Our system achieves a high degree of immersion, so that users forget the artificiality of their environment. This is accomplished by a tripartite ensemble: allowing users to interact naturally with gestures and postures (which can optionally be taught to the system by individual users), exploiting 3D technology, and supporting users in maintaining the structures they discover, as well as providing computer-calculated semantic structures. Our ongoing evaluation shows that even non-expert users can work efficiently with the information in a document collection, and have fun.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors, Human information processing; H.5.2 [User Interfaces]: Graphical user interfaces (GUI), Haptic I/O, Input devices and strategies (e.g., mouse, touchscreen), Interaction styles (e.g., commands, menus, forms, direct manipulation), User-centered design; I.3.6 [Methodology and Techniques]: Interaction techniques

General Terms
Multimodal interaction, Interactive search, Human-Centered Design

Keywords
Immersion, 3D user interface, 3D displays, data glove, gesture recognition

1. INTRODUCTION
Visual processing and association is an important capacity in human communication and intellectual behavior. Visual information addresses patterns of understanding as well as spatial assemblies. This also holds for office environments, where specialists seek the best possible information assistance for improved processes and decision making. However, in the last decades the paradigm of document management and storage has changed radically. In daily work this has led to electronic, non-tangible document processing and virtual instead of physical storage. As a result, the spatial clues of document filing and storage are lost. Furthermore, documents are not only a means for information storage but an instrument of communication which has been adapted to human perception over the centuries. Reading order, logical objects, and presentation are combined in order to express the intentions of a document's author. Different combinations lead to individual document classes, such as business letters, newspapers, or scientific papers. Thus, it is not only the text which captures the message of a document but also the inherent meaning of the layout and the logical structure.

When documents are stored in a computer they become invisible to human beings. The only way to retrieve them is to use search engines, which give the user a keyhole perspective on the contents, where all the inherent strengths of document structure for reading and understanding are disregarded. Towards this end, we need better working environments which, while we work with documents, consider and stimulate our strengths in visual information processing.


Figure 1. Our demonstration setup: 2D display, stereoscopic display, and data glove



One of the main advantages of such a virtual environment is described with the term immersion, standing for the lowering of barriers between human and computer. The user gets the impression of being part of the virtual scene and can ideally manipulate it as he would his real surroundings, without devoting conscious attention to using an interface. One reason why virtual environments are not yet as common as one would expect, given the advantages they present, might be the lack of adequate hardware and interfaces to interact with immersive environments, as well as of methods and paradigms to intuitively interact in three-dimensional settings. 3D applications can be controlled by a mapping to a combination of mouse and keyboard, but the task of selecting and placing objects in a three-dimensional space with 2D interaction devices is cumbersome, requires effort, and demands the conscious attention of the user. More complex tasks, e.g., opening a document, require even more abstract mappings that have to be memorized by the user.

In the following, we would like to discuss a new perceptual user interface as a complementary working environment for document filing and retrieval. Using 3D display technology, our long-term goal is to provide a spatial medium for document interaction and, at the same time, to consider and further stimulate the strengths of visual document processing. The approach supports explorative search and active navigation in document collections.

2. STATE OF THE ART
Many researchers have already addressed this problem from different perspectives. In [1], Welch et al. proposed new desktops where people can spread papers out in order to look at them spatially. High-resolution projected imagery should be used as a ubiquitous aid to display documents not only on the desk but on walls or even on the floor. People at distant places would be able to collaborate on 3D displayed objects onto which graphics and text can be projected. Krohn [2] developed a method to structure and visualize large information collections, allowing the user to quickly recognize whether the found information meets his or her expectations or not. Furthermore, the user can give feedback through graphical modification of the query. Shaw et al. [3] describe an immersive 3D volumetric information visualization system for the management and analysis of document corpora. Based on glyph-based volume rendering, the system enables the 3D visualization of information attributes and complex relationships. The combination of two-handed interaction by three-space magnetic trackers and stereoscopic viewing enhances the user's 3D perception of the information space. The following two sections treat the two core aspects of visualization and interaction in more detail.

2.1 Visualization
The information cube introduced by Rekimoto et al. [4] can be used to visualize a file system hierarchy. The nested-box metaphor is a natural way of representing containment. One problem with this approach is the difficulty in gaining a global overview of the structure, since boxes contained in more than three parent boxes or placed behind boxes of the same tree level are hard to observe.

Card et al. [5] present a hierarchical workspace called Web Forager to organize documents with different degrees of interest at different distances to the user. One drawback of this system, however, is that the user gets (apart from the possibility of one search query) no computer assistance in organizing the documents in space or in mental categories.

3D NIRVE, the 3D information visualization presented by Sebrechts et al. [6], organizes documents that result from a previous search query depending on the categories they belong to. Nevertheless, placing the documents around a 3D sphere proved to be less intuitive than simple text output.

Robertson et al. [7] developed the Task Gallery, a 3D window manager that can be regarded as a simple conversion of the conventional 2D metaphors to 3D. The 3D space is used to attach tasks to the walls and to switch between different tasks by moving them onto a platform. The only advantage of this approach over the 2D windows metaphor is the possibility to quickly relocate tasks using the user's spatial memory.

The Tactile 3D [8] system is a commercial 3D user interface for the exploration and organization of documents; it is still in development. The file system tree structure is visualized in 3D space using semitransparent spheres that represent folders and that contain documents and other folders. The attributes of documents are at least partly visualized by different shapes and textures. Objects within a container can be placed in a sorting box and thus be sorted by various sorting keys in the conventional way, forming 3D configurations like a double helix, pyramid, or cylinder. The objects that are not in the sorting box can be organized by the user.

The (still ongoing) discussion on the usefulness of 3D visualization in Information Visualization is very controversial (e.g., [6], [9], [10], [11]). Nevertheless, Ware's results [9] show that building a mental model of general graph structures is improved by a factor of 3 compared to 2D visualization. The studies also show that 3D visualization supports the use of spatial memory and that user enjoyment is generally better with 3D than with 2D visualizations, which has recently been re-discovered as a decisive factor for efficient working.

2.2 Interaction
At the moment, research on interaction is focused mainly on the visual capturing and interpretation of gestures. Either the user or his hands are captured by cameras so that their position or the posture of the hands can be determined with appropriate methods. To achieve this goal, there are several different strategies. For the user, the most natural and most comfortable way to interact is the use of non-invasive techniques. Here, the user is not required to wear any special equipment or clothing. However, the application has to solve the problem of interpreting the cameras' live video streams in order to identify the user's hands and the gestures made. Some approaches aim to solve this segmentation problem by assuring a special uniform background to distinguish the user's hand from it [12,13], others don't need a specially prepared, but still static, background [14], and still others try to determine the hand's position and posture by feature recognition methods [15]. Newer approaches use a combination of these methods to enhance the segmentation process and find the user's fingers in front of varying backgrounds [17]. Other authors simplify the segmentation process by introducing restrictions, often by requiring the user to wear marked gloves [18,19], using specialized camera hardware [16], or by restricting the capturing process to a single, accordingly prepared setting [20].


Although promising, all of these approaches have the common drawback that they place special demands on the surroundings in which they are used. They require uniform, steady lighting conditions and high contrast in the captured pictures, and they have difficulties when the user's motions are so fast that his hands are blurred in the captured images. Apart from that, these procedures demand a lot of computing power as well as special and often costly hardware. In addition, the cameras for capturing the user have to be firmly installed and adjusted, so these devices are bound to one place and the user has to stay within a predefined area to allow gesture recognition. Often, a separate room has to be used to enable the recognition of the user's gestures.

Another possibility to capture gestures is the use of special interface devices, e.g., data gloves [21,22]. The handicap of professional data gloves, however, is the fact that they are not per se equipped with positioning sensors. This limits the range of detectable gestures to static postures, unless further hardware is applied. The user has to wear additional gear to enable determination of the position and orientation of his hand, often with electromagnetic tracking devices like the Ascension Flock of Birds [23]. These devices allow a relatively exact determination of the hand's position. The problem with electromagnetic tracking, however, is that it requires the user to wear at least one extra sensor attached to the system by cable, which makes this equipment uncomfortable to wear and restricts its use to the vicinity of the transmitter unit. Additionally, electromagnetic tracking devices have to be firmly installed and calibrated, and they are very prone to errors if there are metallic objects in the vicinity of the tracking system.

3. VISUALIZATION OF DOCUMENT SPACES
A collection of documents can be regarded as an information space. The documents as well as the relations between them carry information. Information is an asset that improves and grows when used. Information visualization techniques, together with the ability to generate realistic real-time, interactive applications, can be leveraged in order to create a new generation of document explorers. An application environment that resembles real environments can be used more intuitively by persons without any prior computer knowledge. The users perceive more information about the documents, such as their size, location, and relation to other documents, visually, using their natural capabilities to remember spatial layout and to navigate in 3D environments, which moreover frees cognitive capacities by shifting part of the information-finding load to the visual system. This leads to a more efficient combination of human and computer capabilities: computers have the ability to quickly search through documents, compute similarities, calculate and render document layouts, and provide other tools not available in paper document archives. On the other hand, humans can visually perceive irregularities and intuitively interact with 3D environments. Last but not least, a well-designed virtual-reality-like graphical document explorer is more fun for the average user than a conventional one and thus more motivating.

3.1 Visualization of Documents
The opening screen of our implemented prototype is a visualization of all stored documents in the form of books standing in a bookcase in the back of the room. The user can pre-select documents of this collection by typing a search query in the gray search query panel. The search panel is invoked by a gesture, which moves it to the bottom of the screen with the gray search field on the left side. The pre-selected documents can be thought of as belonging to a higher semantic zoom level than those in the bookcase and are therefore displayed with more detail. These documents are moved out of the bookcase (slow-in/slow-out animation), rotated so that their textured front page faces the user, and moved to their place in the start configuration. While the pre-selection search query has to be typed in the left search query panel, additional search queries typed in the color-coded right search query panels can be used for more detailed structuring requests.

Figure 2 shows documents in the so-called PlaneMode. The results of the color-coded search queries are visualized by small score bars in front of the documents and, moreover, documents which match more search queries are moved closer to the user than others, bringing the most relevant documents to the front and thus to the focus of the user.

Important documents show an animated pulsing behavior and thus instantly catch the user's eye. The color of the document representation is used to represent the document category, and the front page is a thumbnail of the first page. The thickness of the document indicates how many pages it contains. The yellowness of the documents encodes a scalar value such as date (yellowing pages). When the mouse pointer is moved over a document, a label with the file name of the document is displayed directly in front of the document and an enlarged preview texture of the first page is displayed in the upper right corner of the screen.

    Figure 2. PlaneMode

    Figure 3. ClusterMode


There is also the possibility to mark documents with red crosses in order to define user- or task-specific interests and not lose sight of them when they move to different places in different modes. To get a closer view of single documents, they can be moved to the front (the reading position). Our prototype allows the user to save interesting document configurations and restore them later.

ClusterMode, a variation of PlaneMode, is shown in Figure 3. Here, the most relevant documents are not only moved to the front but also to the center of the plane, which results in a pyramid shape. Depending on which of the four color-coded search queries match a document, the document is moved to the cluster with the colors of these search queries. E.g., documents which match the blue and white search queries, but not the black and brown search queries, are in the cluster with the blue-white label.
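To make the grouping rule concrete, the following Python sketch shows one way the four color-coded query results could drive both modes: the number of matched queries determines how close a document moves to the user (PlaneMode), and the set of matched colors determines its cluster (ClusterMode). This is only an illustration under assumed names; the color labels, threshold, and functions are not taken from our implementation.

    from collections import defaultdict

    QUERY_COLORS = ["blue", "white", "black", "brown"]  # illustrative color labels

    def matched_colors(scores, threshold=0.0):
        """Return the set of color-coded queries a document matches."""
        return frozenset(c for c in QUERY_COLORS if scores.get(c, 0.0) > threshold)

    def layout(documents):
        """documents: dict mapping document id -> per-query score dict."""
        clusters = defaultdict(list)
        depth = {}
        for doc_id, scores in documents.items():
            colors = matched_colors(scores)
            clusters[colors].append(doc_id)   # ClusterMode: one cluster per color combination
            depth[doc_id] = -len(colors)      # PlaneMode: more matches -> closer to the user
        return clusters, depth

    docs = {
        "report.pdf":  {"blue": 0.8, "white": 0.3},
        "invoice.pdf": {"black": 0.6},
        "notes.pdf":   {},
    }
    clusters, depth = layout(docs)
    # e.g. clusters[frozenset({"blue", "white"})] == ["report.pdf"]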

A variation of ClusterMode can be seen in Figure 4: the clusters are semitransparent rings in which the documents rotate. (In this figure the wall structure has been removed, the document space and thus also the clusters are rotated, and a different color pattern has been used.) There are two different semantics for connecting the clusters to each other: the first is to connect each cluster with only one color to all clusters that also contain that color by line segments in the same color. The second possibility is to connect clusters that have the same color codes except for one additional color by a strut in this dividing color. Another feature of ClusterMode is that the user can change between static mode (the position of the clusters is optimized in order to avoid occlusion) and dynamic mode (the position of the clusters reorganizes dynamically; when the user clicks on a cluster, this cluster is moved to the focus position).

3.2 Visualizations of Relations
Users enter into the information of a document space and work with it by discovering and maintaining relations between documents. E.g., a new scientific article might be discovered to be relevant to a project and interesting for a co-worker. Thus, users need to maintain relations that they personally discover. Beyond these relations, state-of-the-art technology, implemented in our system, also allows us to support users with a semantic engine, which calculates similarity relations between documents.

Relations between documents can intuitively be represented by connecting the document under the mouse pointer by curves to the documents to which it is related (see Figure 5). These curves create a mental model like that of thought flashes moving the user's attention from the current document to related documents. This mental model is supported not only by the way the curves are rendered with shining particles, but also by the fact that the curves are animated, starting at the document under the mouse pointer and propagating to the related documents. The advantage of this way of representing relations is that documents are visually strongly connected by curves which, moreover, create an impressive 3D structure. The disadvantage of literally connecting documents is that the thought flashes occlude the documents behind them, which sometimes makes it hard to understand which documents the curves are pointing to.

To overcome this disadvantage, documents which are related to the document under the mouse pointer can simply be visualized with semitransparent green boxes around them (see Figure 6). Thus, they can be perceived preattentively and do not occlude anything, as the boxes are not much larger than the documents themselves.

    Figure 4. ClusterMode (variation)

    Figure 5. Visualization of relations

    Figure 6. Relations (variation)


Also, the box representation can be used without complications in PlaneMode and in the bookcase. The connection between the document under the mouse pointer and the ones in green boxes is, however, not as intuitively clear as in the case of connecting thought flashes. One possible solution to this dilemma might be to combine both relation visualization types and their advantages in the visualization of a single relation type, i.e., to use thought flashes that end at documents in green boxes.

The visualization of relations shown in Figure 7 is different from the previous ones. When enabled, the user can trigger the display of the relation by clicking on a document. This will display the documents the selected document is related to in the form of ghost documents hovering in front of the main document space. The user now has the possibility to select one of these ghosts by clicking on it, which will trigger an animation that seems to collapse all of the ghost documents onto the real document the selected ghost stands for, and to further visually emphasize this document by displaying a semitransparent red box around it in the moment of collapse. With this visualization technique for relations, the user's attention is automatically moved from the selected document to all related documents and finally to the related document that seems most useful to him or her. From this document he or she can start the process again to move further to more related documents.

4. INTERACTING WITH DOCUMENTS
The most natural way for humans to manipulate their surroundings, including, e.g., the documents on their desktop, is of course by using their hands. Hands are used to grab and move objects or manipulate them in other ways. They are used to point at, indicate, or mark objects of interest. Finally, hands can be used to communicate with others and state intentions by making postures or gestures. In most cases, this is done without having to think about it, and so without interrupting other tasks the person may be involved with at the same time. Therefore, the most promising approach to minimize the cognitive load required for learning and using a user interface in a virtual environment is to employ a gesture recognition engine that lets the user interact with the application in a natural way by just using his hands in ways he is already used to.

Consequently, there is a need for gesture recognition that is flexible enough to be adapted to various conditions like alternating users or different hardware, possibly even transportable devices, yet fast and powerful enough to enable reliable recognition of a variety of gestures without hampering the performance of the actual application. Similar to the introduction of the mouse as an adequate interaction device for graphical user interfaces, gesture recognition interfaces should be easy to define and integrate, either for interaction in three-dimensional settings or as a means to interact with the computer without having to use an abstract interface. This might sound easy to achieve, but it required some effort on our part.

4.1 Applied Hardware
The glove hardware we used to realize our gesture recognition engine was a P5 Glove from Essential Reality [24], shown in Figure 8. The P5 is a consumer data glove originally designed as a game controller. It features five bend sensors to track the flexion of the wearer's fingers as well as an infrared-based optical tracking system, allowing computation of the glove's position and orientation without the need for additional hardware. The P5 consists of a stationary base station housing the infrared receptors enabling the spatial tracking. Position and orientation data are obtained with the help of reflectors mounted at prominent positions on the glove housing. Depending on how many of these reflectors are visible to the base station and at which positions the visible reflectors are registered, the glove's driver is able to calculate the orientation and position of the glove.

During our work with the P5, we learned that the calculated values for the flexion of the fingers were quite accurate, while the spatial tracking data was, as expected, much less reliable. The estimated position information was fairly dependable, whereas the values for yaw, pitch, and roll of the glove were, depending on lighting conditions, very unstable, with sudden jumps in the calculated data. Because of this, additional adequate filtering mechanisms had to be applied to obtain sufficiently reliable values. Of special note is the very low price of the P5. It costs about €50, compared to about €4000 for a professional data glove, which of course provides much more accurate data but on the other hand doesn't come with integrated and transportable position tracking. Indeed, the low price was one reason we chose the P5 for our gesture recognition, because it shows that serviceable interaction hardware for virtual environments can be realized at a cost that makes it an option for the normal consumer market.

    Figure 8. Essential Reality P5 Data Glove

    Figure 7. Relations (variation)


The other reason for our choice was to show that our recognition engine is powerful and flexible enough to enable reliable gesture recognition even when used with inexpensive gamer hardware.

4.2 Posture and Gesture Recognition and Learning
A major problem for the recognition of gestures, especially when using visual tracking, is the high amount of computational power required to determine the most likely gesture carried out by the user. This makes it very difficult to accomplish reliable recognition in real time. Especially when gesture recognition is to be integrated in running applications that at the same time have to render a virtual environment and manipulate this environment according to the recognized gestures, this is a task that cannot be accomplished on a single average consumer PC. We aim to achieve reliable real-time recognition that is capable of running on any fairly up-to-date workplace PC and can easily be integrated in normal applications without using too much processing power of the system. Like Bimber's fuzzy logic approach [25], we use a set of gestures that have been learned by performing them to determine the most likely match. However, for our system we do not define gestures as motion over a certain period of time, but as a sequence of postures made at specific positions with specific orientations of the user's hand.

Thus, the relevant data for each posture is mainly given by the flexions of the individual fingers. However, for some postures the orientation of the hand may be more or less significant. For example, for a pointing gesture with a stretched index finger, the orientation and position of the hand may be required to determine what the user is pointing at, but the gesture itself is the same, whether he is pointing at something to his near left or his far right. On the other hand, for some gestures the orientation data is much more relevant; for example, the meaning of a fist with outstretched thumb can differ significantly depending on whether the thumb points upward or downward. In other cases, the importance of orientation data can vary; for instance, a gesture for dropping an object may require the user to open his hand with the palm pointing downwards, but it is not necessary to hold his hand completely flat. Due to this, the postures for our recognition engine are composed of the flexion values of the fingers, the orientation data of the hand, and an additional value indicating the relevance of the orientation for the posture.
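A posture record as described above can be pictured as a small data structure; the sketch below is a minimal Python rendering with field names of our own choosing, since the paper does not specify the implementation.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Posture:
        name: str
        flexion: Tuple[float, float, float, float, float]  # bend value per finger: 0.0 = straight, 1.0 = fully bent
        orientation: Tuple[float, float, float]             # yaw, pitch, roll of the hand
        orientation_quota: float                             # 0.0 = orientation irrelevant, 1.0 = orientation decisive

    # Example: a pointing posture whose meaning barely depends on where the hand points.
    POINTING = Posture("pointing", (0.1, 0.1, 0.9, 0.9, 0.9), (0.0, 0.0, 0.0), orientation_quota=0.1)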

As mentioned before, the required postures are learned by the system simply by performing them. This approach makes it extremely easy to teach the system new postures that may be required for specific applications. The user performs the posture, captures the posture data by hitting a key, names the posture, and sets the orientation quota for the posture. Of course, the posture name can also be given by the application, enabling the user to define individual gestures to invoke specific functionality.

Alternatively, existing postures can be adapted to specific users. To do so, the posture in question is selected and performed several times by the user. The system captures the different variations of the posture and determines the resulting averaged posture definition. In this manner, it is possible to create a flexible collection of different postures, termed a posture library, with little expenditure of time. This library can be saved and loaded in the form of a gesture definition file, making it possible for the same application to have different posture definitions for different users, allowing an online change of the user context.
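The teaching and adaptation procedure can be sketched as averaging several captured samples into one posture definition and serializing the resulting library. The Python snippet below is an assumed implementation with a made-up JSON layout; the paper does not prescribe a file format.

    import json

    def average(vectors):
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    def learn_posture(name, samples, orientation_quota):
        """samples: list of (flexion[5], orientation[3]) pairs captured while the user holds the posture."""
        return {
            "name": name,
            "flexion": average([s[0] for s in samples]),
            "orientation": average([s[1] for s in samples]),
            "orientation_quota": orientation_quota,
        }

    def save_library(postures, path):
        with open(path, "w") as f:
            json.dump(postures, f, indent=2)

    def load_library(path):
        with open(path) as f:
            return json.load(f)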

4.3 Recognition Process
Our recognition engine consists of two components: the data acquisition and the gesture manager. The data acquisition runs as a separate thread and constantly checks the data received from the glove for possible matches from the gesture manager. As mentioned before, position and especially orientation data received from the P5 can be very noisy, so they have to be appropriately filtered and smoothed to enable sufficiently reliable matching to the known postures.

First, the tracking data is piped through a deadband filter to reduce the chance of jumping error values in the tracked data. Alterations in the position or orientation data that exceed a given deadband limit are discarded as improbable and replaced with their previous values, eliminating changes in position and orientation that can only be considered erroneous calculations of the glove's position. The resulting data is then straightened out by a dynamically adjusting average filter. Depending on the variation of the acquired data, the size of the averaging window is altered within a defined range. If the data is fluctuating within a small region, the size of the filter is increased to compensate for jittering data. If the values show larger changes, the filter size is reduced to lower the latency of the resulting position and orientation.
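The two filtering stages can be illustrated with a short sketch: a deadband check that rejects implausible jumps, followed by a moving average whose window grows while the signal only jitters and shrinks when it really moves. The thresholds and window sizes below are illustrative assumptions, not the values used in the actual system.

    from collections import deque

    class TrackingFilter:
        """Sketch of the deadband + adaptive averaging pipeline for one tracked value."""

        def __init__(self, deadband=0.5, min_window=2, max_window=10, jitter_limit=0.05):
            self.deadband = deadband          # largest change per sample still considered plausible
            self.min_window = min_window
            self.max_window = max_window
            self.jitter_limit = jitter_limit
            self.window = max_window
            self.history = deque(maxlen=max_window)
            self.last = None

        def update(self, value):
            # Stage 1: deadband - replace sudden large jumps with the previous value.
            if self.last is not None and abs(value - self.last) > self.deadband:
                value = self.last
            self.last = value

            # Stage 2: moving average with a window adapted to the observed variation.
            self.history.append(value)
            variation = max(self.history) - min(self.history)
            if variation < self.jitter_limit:
                self.window = min(self.window + 1, self.max_window)   # small jitter -> smooth more
            else:
                self.window = max(self.window - 1, self.min_window)   # real movement -> less latency
            recent = list(self.history)[-self.window:]
            return sum(recent) / len(recent)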

The resulting data is accurate enough to provide a good basis for the matching process of the gesture manager. Should the gesture manager find that the provided data matches a known posture, this posture is marked as a candidate. To lower the possibility of misrecognition, a posture is only accredited as recognized when held for an adjustable minimum time span. During our tests it turned out that values between 300 and 800 milliseconds are suitable to allow reliable recognition without forcing the user to hold the posture for too long. Once a posture is recognized, a PostureChanged event is sent to the application that started the acquisition thread. To enable the application to use the recognized posture for further processing, additional data is sent with the event. Apart from the timestamp, the string identifier of the recognized posture as well as the identifier of the previous posture is provided to facilitate the sequencing of postures into a more complex gesture. Furthermore, the position and orientation of the glove at the moment the posture was performed are provided.

    Figure 9. Our tool for training new postures



In addition to the polling of recognized postures from the gesture manager, the acquisition thread keeps track of the glove's movement. If the changes in the position or orientation data of the glove exceed an adjustable threshold, a GloveMove event is fired. This event is similar to common MouseMove events, providing both the start and end values of the position and orientation data of the movement. Finally, to take into account hardware that possesses additional buttons, as the P5 does, the data acquisition thread also monitors the state of these buttons and generates corresponding ButtonPressed and ButtonReleased events, providing the designated number of the button.
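Taken together, the event interface can be pictured as three small payload types. The sketch below is an assumption about their shape; the paper names the events and their contents but not the exact fields.

    from dataclasses import dataclass
    from typing import Tuple

    Vec3 = Tuple[float, float, float]

    @dataclass
    class PostureChangedEvent:
        timestamp: float
        posture: str             # string identifier of the recognized posture
        previous_posture: str    # identifier of the previously recognized posture
        position: Vec3           # glove position when the posture was performed
        orientation: Vec3        # glove orientation when the posture was performed

    @dataclass
    class GloveMoveEvent:
        start_position: Vec3
        end_position: Vec3
        start_orientation: Vec3
        end_orientation: Vec3

    @dataclass
    class ButtonEvent:
        button: int
        pressed: bool            # True for ButtonPressed, False for ButtonReleased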

It is important to note that although the data acquisition we implemented was fitted to the Essential Reality P5, it can easily be adapted to any other data glove, either for mere posture recognition or in combination with an additional 6-degrees-of-freedom tracking device like the Ascension Flock of Birds [23] to achieve full gestural interaction.

4.4 The Gesture Manager
The gesture manager is the principal part of the recognition engine, maintaining the list of known postures and providing multiple functions to manage the posture library. As soon as the first posture is added to the library or an existing library is loaded, the gesture manager begins matching the data received from the data acquisition thread against the stored datasets. This is done by first looking for the best matching finger constellation. In this first step, the bend values of the fingers are interpreted as five-dimensional vectors, and for each posture definition the distance to the current data is calculated. If this distance fails to be within an adjustable minimum recognition distance, the posture is discarded as a likely candidate. If a posture matches the data to a relevant degree, the orientation data is compared in a likewise manner to the actual values. Depending on whether this distance exceeds another adjustable limit, the likelihood of a match is lowered or raised according to the orientation quota associated with the corresponding posture dataset. This procedure has proved to be very reliable, providing both very fast matching of postures and very consistent recognition of the performed posture.
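The two-step matching can be sketched as follows: a Euclidean distance over the five bend values acts as a gate, and the orientation distance then raises or lowers the score in proportion to the posture's orientation quota. The scoring formula and thresholds below are assumptions for illustration; the paper describes the procedure only qualitatively.

    import math

    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def match(flexion, orientation, library, flexion_limit=0.6, orientation_limit=45.0):
        """library: list of dicts with keys name, flexion, orientation, orientation_quota."""
        best_name, best_score = None, 0.0
        for posture in library:
            d_flex = distance(flexion, posture["flexion"])
            if d_flex > flexion_limit:
                continue                              # finger constellation too far off: discard
            score = 1.0 - d_flex / flexion_limit      # closer fingers -> higher base likelihood
            d_orient = distance(orientation, posture["orientation"])
            quota = posture["orientation_quota"]
            if d_orient > orientation_limit:
                score *= 1.0 - quota                  # wrong orientation penalizes orientation-sensitive postures
            else:
                score *= (1.0 - quota) + quota * (1.0 - d_orient / orientation_limit)
            if score > best_score:
                best_name, best_score = posture["name"], score
        return best_name, best_score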

Apart from determining the most probable posture, the gesture manager provides several means to modulate parameters at runtime. New postures can be added, existing postures adapted, or new posture libraries loaded. In addition, the recognition boundaries can be adjusted on the fly, so it is possible to start with a wide recognition range to enable correct recognition of the user's postures before the posture definitions have been adapted to this specific person; the range can then be narrowed down as the postures are customized to the user.

4.5 Recognition of Gestures
As mentioned before, we see actual gestures as sequences of successive postures. With the help of the PostureChanged events, our recognition engine provides an extremely flexible way to track gestures performed by the user. The recognition of single postures, like letters of the American Sign Language (ASL), is as easily possible as the recognition of more complex, dynamic gestures. This is done by tracking the sequence of performed postures as a finite state machine. For example, let us consider the detection of a "click" on an object in a virtual environment. Tests with different users showed that an intuitive gesture for this task is pointing at the object and then tapping at it with the index finger. To accomplish the detection of this gesture, one defines a pointing posture with outstretched index finger and thumb and the other fingers flexed, then a tapping posture with a half-bent index finger. All that remains to do in the application is to check for a PostureChanged event indicating a change from the pointing to the tapping posture. If, within a certain amount of time, the recognized posture reverses from tapping back to pointing, a clicking gesture is registered at the position provided by the PostureChanged event. In this manner, almost any desired gesture can quickly be implemented and recognized.
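The click detection described above amounts to a small state machine driven by PostureChanged events: pointing, then tapping, then pointing again within a time limit. The sketch below assumes the posture names "pointing" and "tapping" and an illustrative timeout.

    class ClickDetector:
        """Recognizes a click as the posture sequence pointing -> tapping -> pointing."""

        def __init__(self, timeout=0.8):
            self.timeout = timeout      # seconds allowed between tap-down and tap-up
            self.tapping_since = None
            self.tap_position = None

        def on_posture_changed(self, posture, previous, timestamp, position):
            """Feed PostureChanged events; returns the click position when a click is recognized."""
            if previous == "pointing" and posture == "tapping":
                self.tapping_since = timestamp
                self.tap_position = position
            elif previous == "tapping" and posture == "pointing" and self.tapping_since is not None:
                held = timestamp - self.tapping_since
                self.tapping_since = None
                if held <= self.timeout:
                    return self.tap_position   # click registered at the tap position
            else:
                self.tapping_since = None
            return None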

4.6 Implementation and Results
We have evaluated our gesture recognition engine in several demo applications representing a virtual document space. Due to the thread-based architecture of our engine, it was easily integrated by adding the recognition thread and reacting to the received events, making it comparably straightforward to adding mouse functionality. In the implemented virtual environments, the user can manipulate various objects representing documents and trigger specific actions by performing a corresponding intuitive gesture. The next implementation step will be the integration of the developed functionalities into a single prototype.

In order to enhance the degree of immersion for the user, we used a particular demonstration setup, as shown in Figure 1. To allow the user a stereoscopic view of the scene, we used a specialized 3D display device, the SeeReal C-I [27]. This monitor creates a real three-dimensional impression of the scene by showing one perspective view for each eye and separating them through a prism layer. To compensate for the resulting loss in resolution, especially while displaying text, we used an additional TFT display to also show a high-resolution view of the scene. A testimony to the speed of our recognition engine is the fact that we were able to realize the application logic, including the rendering of three different perspectives (one for each eye, another for the non-stereoscopic display) and the tracking and recognition of gestures, on a normal consumer-grade computer in real time.

In our environment we used two different kinds of gestures [26], namely semiotic and ergotic gestures. Semiotic gestures are used to communicate information (in this case pointing out objects to mark them for manipulation), while ergotic gestures are used to manipulate a person's surroundings. Our demo scenario, shown in Figure 10, consists of a virtual desk on which different documents are arranged randomly.

Figure 10. Immersive drag'n'drop operation: picking up, moving, and placing documents / document stacks


In the background of the scene, a wall containing a pin board and a calendar can be seen. Additionally, the user's hand is represented by a hand avatar, showing its location in the scene as well as the hand's orientation and the flexion of the fingers, so that the user can get a better impression of whether the glove captures the data correctly.

The user was given multiple means to interact with this environment. First, he could rearrange the documents on the table. This was done by simply moving his hand avatar over a chosen document, then grabbing it by making a fist. He could then move the selected document around and drop it at the desired location by opening his fist, releasing his grip on the document. Another interaction possibility was to have a closer look at either the calendar or the pin board. To do this, the user had to move his hand in front of the object and point at it. In the implementation, this was realized by simply checking whether the hand was in a bounding box in front of the object when a PostureChanged event indicated a pointing posture. Once this happened, the calendar or the pin board, respectively, was brought to the front of the scene and remained there. The user could return them to their original location by making a dropping gesture, performed by making a fist, then spreading his fingers with his palm pointing downward. Additionally, there were several possibilities to interact with specific documents. For this, one of the documents first had to be selected. To select a document, the user had to move his hand over it and then tap on it in the way described earlier in this paper. Originally, the gesture for selecting a document had been defined as pointing at it with the thumb spread, then tapping the thumb to the side of the middle finger, but users felt that tapping on the document was the more intuitive way to do this. As a measure of the flexibility of our gesture recognition interface, replacing one gesture with the other took less than five minutes, because all we had to do was replace the thumb-at-middle-finger posture with the index-finger-half-flexed posture.

Once a document was selected, it moved to the front of the scene, allowing a closer look at the cover page. The user then had the choice between putting the document back in its place, performing the same dropping gesture used for returning the calendar and pin board, or opening the document. To open it, he had to grab it in the same way (by making a fist), then turn his hand around and open it, spreading his fingers with his palm facing upward. This replaced the desktop scene with a view of the document itself, represented as a cube with the sides of the cube displaying the pages of the document. Turning the cube to the left or right enabled the user to browse through the document. This could be done in two ways. The user could turn single pages by moving his hand to the left or right side of the cube and tapping on it, in the same manner as described before. In addition, it was possible to browse rapidly through the contents of the document. To do this, the user had to make a thumbs-up gesture with his thumb pointing straight up. He could then indicate the desired browsing direction by tilting his hand to the left or right, triggering an automatic, fast turning of the cube. To stop the browsing, he just had to either tilt his thumb back up to a vertical position or completely cancel the spread-thumb posture.

We had several users (e.g., knowledge workers, secretaries, students, and pupils) test our demonstration environment, moving documents and browsing through them. Apart from initial difficulties, especially due to unfamiliarity with the glove hardware, after a short while most users were able to use the different gestures in a natural way, with only a few adaptations of the posture definitions to the individual users. However, during the evaluation the browsing gesture turned out to be too cumbersome for leafing through larger documents. Therefore, we are currently extending our interaction concept with additional interaction possibilities. For example, a force feedback joystick seems to be a well-fitting additional device, supporting continuous browsing speed selection and haptic feedback.

5. CONCLUSION
Human thinking and knowledge work are heavily dependent on sensing the outside world. One important part of this perception-oriented sensing is the human visual system. It is well known that our visual knowledge disclosure - that is, our ability to think, abstract, remember, and understand visually - and our skills to visually organize are extremely powerful. Our overall vision is to realize an individually customizable virtual world which inspires the user's thinking, enables the economical usage of his perceptual power, and adheres to a multiplicity of personal details with respect to his thought process and knowledge work. We have made major steps towards this vision, created the necessary framework, created a couple of modules, and continue to get good feedback.

The logical conclusion is that by creating a framework that emphasizes the strengths of both humans and machines in an immersive virtual environment, we can achieve great improvements in the effectiveness of knowledge workers and analysts. We strive to complete our vision by further extending our methods to present and visualize data in a way that integrates the user into his artificial surroundings seamlessly and gives him/her the opportunity to interact with it in a natural way. In this connection, a holistic context- and content-sensitive approach for information retrieval, visualization, and navigation in manipulative virtual environments was introduced. We address this promising and comprehensive vision of efficient man-machine interaction in manipulative virtual environments with the term immersion: a frictionless sequence of operations and a smooth operational flow, integrated with multi-sensory interaction possibilities, which allows an integral interaction of human work activities and machine support. When implemented to perfection,

    Figure 11. Virtual desktop demo
