Towards a Linguistically Motivated Model for Selection in Virtual Reality

Thies Pfeiffer∗

A.I. Group, Faculty of Technology, Bielefeld University

ABSTRACT

Swiftness and robustness of natural communication are tied to the redundancy and complementarity found in our multimodal communication. Swiftness and robustness of human-computer interaction (HCI) are also key to the success of a virtual reality (VR) environment. The interpretation of multimodal interaction signals has therefore been considered a high goal in VR research, e.g., following the vision of Bolt's put-that-there in 1980 [1].

It is our impression that research on user interfaces for VR systems has focused primarily on finding and evaluating technical solutions and has thus followed a technology-oriented approach to HCI. In this article, we argue for complementing this with a human-oriented approach based on the observation of human-human interaction. The aim is to find models of human-human interaction that can be used to create user interfaces that feel natural. As the field of linguistics is dedicated to the observation and modeling of human-human communication, it could be worthwhile to approach natural user interfaces from a linguistic perspective.

We expect at least two benefits from following this approach. First, the human-oriented approach substantiates our understanding of natural human interactions. Second, it brings about a new perspective by taking into account the interaction capabilities of a human addressee, which are seldom explicitly considered or compared with those of the system. As a consequence of following both approaches to create user interfaces, we expect more general models of human interaction to emerge.

Index Terms: H.5.2 [Information Interfaces and Presentation]: User Interfaces—Natural Language; I.3.6 [Computer Graphics]: Methodology and Techniques—Interaction Techniques; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—Artificial, Augmented, and Virtual Realities

1 INTRODUCTION

In the following, we will concentrate on object selection as a very fundamental interaction task. Similar considerations can be made for other interaction tasks as well. This work has two contributions. First, we classify object selection from a linguistic perspective and thereby introduce the relevant terminology used in this scientific discipline. As we expect the reader to be familiar with the basic literature on 3D user interaction (see [2] for an introduction), we focus on providing references to relevant work in linguistics. Second, we provide examples from different studies to show similarities of findings between the fields, thus stressing the comparability of the areas, and to highlight insights from human-human interaction that could prove valuable for user interface design.

2 THE LINGUISTIC PERSPECTIVE ON OBJECT SELECTION

In VR, object selection is a common task, which requires the user to refer to a specific object (geometry) in the virtual environment with an appropriate interaction technique.

∗e-mail: [email protected]

Figure 1: Deictic expressions are used to refer to objects in the world. In the example depicted above, the interlocutor makes an underspecified deictic expression. The intended referent object is the bolt close to the center. The potential extension of the deictic expression in the speech alone covers a set of possible referent objects. The manual pointing gesture adds the required information to further restrict the potential extension to the intended object, the referent of the multimodal deictic expression.

In linguistics, the term reference can be found both in semantics and pragmatics (see [10] for an overview). It refers to the relation between an expression, for example a noun, and the entities that are named by such an expression. First, there is the potential meaning tied to the expression (semantics) and second, there is at least one entity that is linked to such a referential expression, the referent (pragmatics). Intuitively, the referent of an expression depends on the context. The term extension refers to the set of referents of an expression, given a specific context (see Figure 1). If an expression is underspecified, only a potential extension with a set of alternative potential referents might be found. Analogously to extension in pragmatics, denotation refers to the constant meaning of an expression in semantics.
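
To make these notions concrete, the following minimal sketch (in Python; all names such as `Entity` and `extension` are our own, chosen purely for illustration) models the denotation of an expression as a context-independent predicate and computes its extension as the set of entities in a given context that satisfy it.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Entity:
    """An entity in the context, e.g. an object in a VR scene."""
    name: str
    category: str

# Denotation: the constant, context-independent meaning of an expression,
# modeled here as a predicate over entities.
Denotation = Callable[[Entity], bool]

def extension(denotation: Denotation, context: List[Entity]) -> List[Entity]:
    """Extension: the set of referents of an expression in a given context,
    i.e. all entities of the context that satisfy its denotation."""
    return [e for e in context if denotation(e)]

# The noun "bolt" denotes all entities of category "bolt".
bolt: Denotation = lambda e: e.category == "bolt"

scene = [Entity("bolt-1", "bolt"), Entity("bolt-2", "bolt"), Entity("nut-1", "nut")]
print(extension(bolt, scene))
# Two candidates remain: the expression is underspecified, so additional
# information (e.g. a pointing gesture) is needed to single out the referent.
```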

Deixis subsumes referential expressions used to locate and identify concrete or abstract entities within a certain context. Different categories of deixis can be identified [6, 11]: place deixis, time deixis, person deixis, social deixis, and discourse deixis. The selection of objects is part of place deixis, i.e., referring to the location of objects in space.

In addition, one can distinguish symbolic and gestural usages of deixis. If general knowledge is sufficient to establish the reference, it is called symbolic usage. If an active sensory process is needed to understand the deictic expression, it is called gestural usage. A typical case is a deictic expression that comprises a pointing gesture using the index finger, accompanied by a verbal expression like “this X”. Besides such pointing gestures, the direction of gaze may also be part of a gestural usage of deixis. According to Bühler [3], place deixis is one of the most basic means of human communication. It establishes the link between internal symbols and the entities in the exterior world.

From a linguistic perspective it is thus not surprising to find that object selection is a basic interaction task in VR environments. The user communicates his concepts, which can be thought of as the symbols of his mind, by linking them via a pointing – or selection – act with the objects presented in the VR environment. The system then has to interpret the user's selection and will finally come up with a symbolic representation suitable for further processing in the system. Most interfaces implementing an object selection technique are based on such a gestural usage of deixis.
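
A minimal sketch of such an interpretation step, assuming the set-based reading of Figure 1 and using hypothetical names throughout, is to intersect the potential extension of the verbal expression with the set of objects covered by the pointing gesture:

```python
def resolve_multimodal(verbal_extension, gesture_extension):
    """Resolve a multimodal deictic expression by intersecting the
    potential extension of the verbal expression with the set of
    objects covered by the pointing gesture (cf. Figure 1)."""
    referents = set(verbal_extension) & set(gesture_extension)
    if len(referents) == 1:
        return referents.pop()  # the referent is uniquely singled out
    return None  # still underspecified; further restriction is needed

# "this bolt" alone covers two bolts; the gesture narrows it down to one.
print(resolve_multimodal({"bolt-1", "bolt-2"}, {"bolt-1", "nut-1"}))  # bolt-1
```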

3 EXAMPLES OF RELEVANT LINGUISTIC FINDINGS

3.1 Ray Casting vs. Flashlight

Typical object selection techniques are ray casting and flashlight [9]. The latter uses a cone for object selection, which performs better, e.g., for objects of small visual appearance (because of object distance or size). Butterworth and Itakura showed in a series of experiments [5] that the precision with which humans can differentiate between target objects when looking in the same direction as the pointing interlocutor is between 10° and 15°. The human addressee thus also seems to use a model for the interpretation of a pointing gesture similar to the flashlight model described by [9]. In a follow-up paper, Butterworth concludes that the precision of a vector-based interpretation of pointing is not sufficient to single out the referent and that additional cues are required [4]. Kranstedt et al. [8] substantiated this model of human interpretation of pointing when presenting their pointing cone. Thus cone-based models, such as flashlight or aperture-based selection [7], have a broader applicability beyond 3D user interfaces.
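
The flashlight model lends itself to a compact geometric formulation. The sketch below (hypothetical names; `opening_angle` denotes the full apex angle of the cone, echoing the 10° to 15° range reported in [5]) accepts an object if the angle between the pointing direction and the direction towards the object stays within the cone's half-angle; ray casting is recovered as the limit of a vanishing opening angle.

```python
import numpy as np

def angle_to_ray(origin, direction, point):
    """Angle in degrees between the pointing direction and the
    direction from the ray origin to the given point."""
    v = point - origin
    cos_a = np.dot(direction, v) / (np.linalg.norm(direction) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def flashlight_select(origin, direction, objects, opening_angle=15.0):
    """Flashlight selection: among all objects whose positions fall
    inside the selection cone, pick the one closest to the cone axis.
    Ray casting corresponds to the limit opening_angle -> 0."""
    candidates = []
    for name, position in objects.items():
        angle = angle_to_ray(origin, direction, position)
        if angle <= opening_angle / 2.0:  # inside the cone's half-angle
            candidates.append((angle, name))
    return min(candidates)[1] if candidates else None

# Example: pointing roughly along +z singles out the object nearest the axis.
objects = {"bolt": np.array([0.1, 0.0, 2.0]), "nut": np.array([0.5, 0.4, 2.0])}
print(flashlight_select(np.zeros(3), np.array([0.0, 0.0, 1.0]), objects))  # bolt
```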

3.2 Timing of Pointing Gestures

An example of a human-oriented approach is the work of Müller-Tomfelde [12] on dwell-based pointing. In two studies he investigated whether there is a natural dwell time in human pointing actions. In the first study, the participants were told to use dwell-based pointing without feedback in a way that they expected another person to understand the gesture (imagined human-human interaction). In the second study, participants were shown animated examples from the first study and confirmed the pointing gestures. The median dwell time used by the participants in the first study was about 1 s, while the median response time in the second study was about 0.43 s. In a subsequent technical experiment (without an imagined interaction partner as addressee), he confirmed the median of 0.43 s to be a reasonable average natural dwell time. Müller-Tomfelde especially stressed the different quality of these findings compared to previous work, emphasizing that they were based purely on empirical evidence without any technical constraints.
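
A dwell-based selector in the spirit of this finding can be sketched as a small state machine (hypothetical class and method names; the 0.43 s default merely echoes the median response time reported above): an object is selected once the pointing ray has rested on it for the full dwell time.

```python
import time

class DwellSelector:
    """Dwell-based selection: an object counts as selected once the
    pointing ray has rested on it for `dwell_time` seconds. The default
    of 0.43 s echoes the median response time reported in [12]."""

    def __init__(self, dwell_time=0.43):
        self.dwell_time = dwell_time
        self._target = None   # object currently pointed at
        self._since = None    # when pointing at it began

    def update(self, target):
        """Call once per frame with the currently pointed-at object
        (or None); returns the object once the dwell time has elapsed."""
        now = time.monotonic()
        if target != self._target:           # target changed: restart timer
            self._target, self._since = target, now
            return None
        if target is not None and now - self._since >= self.dwell_time:
            self._target = self._since = None    # reset after selection
            return target
        return None
```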

3.3 The Direction of Pointing and the Role of Gaze

In our own work (see [13]), we focused on manual pointing gestures in dyadic interactions. We found an interaction between speech and gesture which suggests that humans are aware of the loss of precision with increasing distance and have different strategies to compensate for it. We found an increased use of words in the verbal description (cross-modal compensation) or, for participants who were not allowed to speak, a set of strategies to adapt the gestures for higher precision (uni-modal compensation). We also found that the direction of pointing was best described by a ray originating at the dominant eye of the speaker and aiming over the extended index finger towards the target. We found an optimal opening angle of 14° for a pointing cone in distal pointing, while orthogonal distances to the pointing ray provided a better means of identifying targets in the proximal area (reaching space).
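
These findings translate into a compact geometric sketch (hypothetical function names, assuming unit-length directions): the pointing ray runs from the dominant eye over the index fingertip, distal candidates are tested against a 14° cone around this ray, and proximal candidates are ranked by their orthogonal distance to it.

```python
import numpy as np

def eye_finger_ray(dominant_eye, fingertip):
    """Pointing ray from the dominant eye over the extended index
    fingertip towards the target, as found in [13]."""
    direction = fingertip - dominant_eye
    return dominant_eye, direction / np.linalg.norm(direction)

def within_pointing_cone(origin, direction, point, opening_angle=14.0):
    """Distal test: is the point inside a cone of the given (full)
    opening angle around the pointing ray?"""
    v = point - origin
    cos_a = np.dot(direction, v) / np.linalg.norm(v)  # direction is unit length
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))) <= opening_angle / 2.0

def orthogonal_distance(origin, direction, point):
    """Proximal test: perpendicular distance of a point to the pointing
    ray, a better discriminator in reaching space."""
    v = point - origin
    return np.linalg.norm(v - np.dot(v, direction) * direction)
```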

4 CONCLUSION

Approaching typical user interface problems from both a technology-oriented and a human-oriented perspective could be beneficial. The examples show, e.g., that there has been some parallel research on the issues of object selection/pointing in the different fields. The technology-oriented community has the advantage of precise measuring instruments, whereas most linguists are still bound to the annotation of 2D video recordings. On the other hand, the linguistic perspective is more holistic, considering multimodality and the embedding of the interaction in the visual context (density of objects, etc.) and in the history of interaction, e.g., the dialog context in communication. The introduction of the relevant linguistic terms can serve as a starting point for further research in this field.

In technical applications for experts, the precision required for unimodal selection or manipulation tasks strongly motivates the design of specific tools, such as pointing devices, and the performance of the interactions designed for these devices exceeds the capabilities of a human interaction partner. However, with the increasing dissemination of VR and related technologies, more casual interfaces are of interest, where users just step in and start interacting – without a more or less steep learning curve and ideally without attaching or using any artificial tracking or interaction devices. Much could be learned for the design of such interfaces by studying human-human interactions, as is done in other fields, such as linguistics or psychology. In our framework for deictic reference in virtual reality (DRIVE) [13], we follow the proposed approach, aiming at a linguistically motivated framework for object selection/object deixis in VR.

REFERENCES

[1] R. Bolt. Put-That-There: Voice and gesture at the graphics interface. In ACM SIGGRAPH Computer Graphics, pages 262–270, New York, 1980. ACM Press.

[2] D. A. Bowman, E. Kruijff, J. J. LaViola Jr., and I. Poupyrev. 3D User Interfaces – Theory and Practice. Addison-Wesley, 2005.

[3] K. Bühler. Sprachtheorie: Die Darstellungsfunktion der Sprache. Gustav Fischer, Jena, 1934.

[4] G. Butterworth. Pointing is the royal road to language for babies. In S. Kita, editor, Pointing: Where Language, Culture, and Cognition Meet, chapter 2, pages 9–33. Lawrence Erlbaum Associates, Mahwah, New Jersey, 2003.

[5] G. Butterworth and S. Itakura. How the eyes, head and hand serve definite reference. British Journal of Developmental Psychology, 18:25–50, 2000.

[6] C. J. Fillmore. Santa Cruz Lectures on Deixis 1971. Indiana University Linguistics Club, University of California, Berkeley, November 1975.

[7] A. Forsberg, K. Herndon, and R. Zeleznik. Aperture based selection for immersive virtual environments. In Proceedings of the 9th Annual ACM Symposium on User Interface Software and Technology, UIST '96, pages 95–96, New York, NY, USA, 1996. ACM.

[8] A. Kranstedt, A. Lücking, T. Pfeiffer, H. Rieser, and I. Wachsmuth. Deixis: How to Determine Demonstrated Objects Using a Pointing Cone. In S. Gibet, N. Courty, and J.-F. Kamp, editors, Gesture Workshop 2005, LNAI 3881, pages 300–311, Berlin Heidelberg, 2006. Springer-Verlag GmbH.

[9] J. Liang and M. Green. JDCAD: A highly interactive 3D modeling system. Computers & Graphics, 18(4):499–506, 1994.

[10] J. Lyons. Introduction to Theoretical Linguistics. Cambridge University Press, 1968.

[11] J. Lyons. Semantics, volume 2, chapter Deixis, space and time, pages 636–724. Cambridge University Press, 1977.

[12] C. Müller-Tomfelde. Dwell-Based Pointing in Applications of Human Computer Interaction. In Human-Computer Interaction – INTERACT 2007, LNCS 4662, pages 560–573. Springer, Berlin, 2007.

[13] T. Pfeiffer. Understanding Multimodal Deixis with Gaze and Gesture in Conversational Interfaces. Berichte aus der Informatik. Shaker Verlag, Aachen, Germany, December 2011.