


J Multimodal User Interfaces
DOI 10.1007/s12193-012-0117-5

ORIGINAL PAPER

Influencing factors on multimodal interaction during selection tasks

Felix Schüssel · Frank Honold · Michael Weber

Received: 10 April 2012 / Accepted: 5 November 2012
© OpenInterface Association 2012

Abstract When developing multimodal interactive systems it is not clear which importance should be given to which modality. In order to study influencing factors on multimodal interaction, we conducted a Wizard of Oz study on a basic recurrent task: 53 subjects performed diverse selections of objects on a screen. The way and modality of interaction was not specified nor predefined by the system, and the users were free in how and what to select. Natural input modalities like speech, gestures, touch, and arbitrary multimodal combinations of these were recorded as dependent variables. As independent variables, subjects' gender, personality traits, and affinity towards technical devices were surveyed, as well as the system's varying presentation styles of the selection. Our statistical analyses reveal gender as a momentous influencing factor and point out the role of individuality for the way of interaction, while the influence of the system output seems to be quite limited. This knowledge about the prevalent task of selection will be useful for designing effective and efficient multimodal interactive systems across a wide range of applications and domains.

Keywords Natural human computer interaction · Multimodal interaction · User study · Gender differences · Selection task · Input · Output

F. Schüssel (B) · F. Honold · M. Weber
Institute of Media Informatics, Ulm University, Ulm, Germany
e-mail: [email protected]

F. Honold
e-mail: [email protected]

M. Weber
e-mail: [email protected]

1 Introduction

Advances in processing power, sensory technology, and recognition techniques in recent years gave rise to new forms of interaction. As a result, a multitude of input modalities is available to both users and developers of HCI systems. Nowadays, even handheld systems are able to perform real-time speech recognition, and static computing environments can be equipped with gesture sensors that are able to recognize 3D movements of individual body parts. Thus more and more effort is dedicated to systems allowing users to interact the way they find natural, freeing them (at least partially) from the necessity to learn the interaction language the system understands. Today the burden lies on the system's side to understand the user's language. This is not an easy task, and researchers have difficulties finding out why people interact with computing systems the way they do. Although there is empirical evidence for some kind of "rules" in different contexts, special domains, and special tasks at hand, little is known about general factors that influence people's way of interaction with multimodal systems. One basic question that comes to mind: which modality do users choose naturally, if they are not constrained by device limits or by an a priori defined interaction language, i.e. strictly limited interaction possibilities in each interaction step? The presented study should help to answer this question.

The paper is structured as follows. Related work on this topic is presented in the next section, which leads to the specific research questions about influencing factors (e.g. the role of gender) that we are going to answer with our user study. The experimental design section states the specific problem instance chosen for the study and the entire data acquisition process, and concludes with known limitations of the design. The results section then lists all findings relevant to the research questions. Within the discussion, the findings are critically examined before our conclusions are summed up and a course is set for future research.

2 Related work

There have been several studies investigating the use of multimodal inputs in different contexts. Some of them have relevance to our own study by already revealing different influencing factors on the use of or preference for certain modalities.

First of all, the application domain, and hence the involved tasks, can have a momentous influence. Several researchers report this fact on different levels of detail. Ren et al. [20] found out that for a CAD application, users prefer a pen + speech + mouse mode, while in an independent study with other users, a pen + speech mode is preferred for a map application. A more detailed influence of the task at hand is reported in [15], where results show that users prefer touch input for denoting positions while speech is preferred for abstract actions. Similarly, Ratzka [18] reported mainly touch input for selecting options on screen and speech use for text input.

Abstracting from the task itself, several researchers found the complexity of the task to be of importance for the use of multimodality. Oviatt et al. report in [17] that the likelihood of multimodality increases with the cognitive load of the task. Similarly, Jöst et al. [8] show that the more complex the task is, the higher are the usability ratings for multimodal interaction compared to unimodal interaction.

The presentation of a task is another influencing factor on the choice of input modalities users make when performing it. De Angeli et al. [4] proved this with a form-filling task. The presence of short labels encourages users to make redundant references, i.e. the co-occurrence of mouse pointing at an entry field and a fully determined verbal specification. They also found out that feedback on mouse pointing gestures (i.e. a visually detectable change of the pointed object in terms of highlighting) increases the likelihood of multimodal inputs. Beyond the presentation within a single modality, the very use of an output modality by the system changes the modalities users choose for input. In a smart home environment, Bellik et al. [1] show that verbal output modalities like text and speech lead to speech input, while graphics output (icons) leads to touch screen use. The authors assume some kind of "influence power" for each output modality that gets subsumed when multiple modalities are combined. Using all available modalities (in that case text, graphical icons, and speech) for system output, users still prefer speech over touch and remote control input.

Another important influencing factor is the situational context in terms of the privacy of the situation where the interaction takes place. In addition to the influence of task complexity, the study in [8] proves a declining willingness to use multimodal interaction with decreasing intimacy of the relationship between the user and other persons around. Going a little more into detail, Wasinger and Krüger report that users prefer to use non-observable modalities like handwriting and touch gestures in public environments where they are in the presence of strangers [21]. Likewise, Reis et al. report in [19] that speech is not used in public environments, and additionally emphasize the influence of the privacy of the information that is exchanged with the system.

The users themselves also play a decisive role for the interaction. De Angeli et al. identified computer literacy (expert vs. naive users) as deeply influencing the likelihood of multimodal occurrence in their form-filling task [4]. Expert users clearly preferred to interact multimodally, whereas naive users showed no preference between unimodal and multimodal inputs. This could be due to the fact that expert users are more curious about technology and innovation, as Reis et al. report the tendency of their users (who were all familiar with computer and mobile phone use) to use new and uncommon ways of interaction [19].

Surprisingly, of the aforementioned studies, only Wasinger and Krüger [21] have examined the role of gender for the interaction. Their usability test reveals gender differences in modality preferences. In their real world study in a public environment, women report lower ratings on feeling "comfortable using speech" than men, and higher ratings of being "hesitant" and "embarrassed" than men when using speech. Similar differences hold for the use of pointing gestures on real world objects.

Summing up, there has been much work on different factors that influence multimodal interaction, mostly derived from usability studies or Wizard of Oz experiments of complete systems with a complex mixture of tasks. All these studies provide interesting insights into the complex dependencies between the user, the system, and the interaction in between. However, their transferability to new systems from different domains remains difficult. Be it that new forms of interaction have replaced the employed modalities of these studies today (like pen input being replaced by direct touch), or simply that the tasks are too special for a reliable transfer of the findings to different domains. Above all, our understanding of the role of the user as the ultimate instance deciding the way of interaction still seems very limited.

The presented study answers some of the questions that are implied by this, as elaborated further in the next sections. What is not in the focus of our work are influencing factors on the performance of users (as in [10,11]), or underlying cognitive mechanisms (as in [2,5]) when interacting with multimodal systems.


3 Research questions

As apparent from the aforementioned related work, there are at least five factors that have been proven to influence the choice of input modalities: task, presentation, situational context, computer literacy, and gender. While it seems too complex to set up an experiment investigating all these factors altogether, we decided to constrain some, while others should be investigated more broadly.

As the task at hand strongly depends on the domain and application, we wanted to focus on a single task that is as domain independent as possible. For this reason, we investigate selection as a common and prevalent task in almost every system. With that in mind, the presentation should still be varied, because it seems obvious that the possibilities to select an item (e.g. by verbally describing it) strongly depend on the way it is presented. The study should take place in the controllable environment of our technical laboratory, to have a fixed situational context (participants are alone in a silent environment). Instead of using a subjective performance metric like computer literacy, we chose Karrer's TA-EG (technology affinity for electronic devices) [9] as a more differentiated consideration of the user's attitude towards technical systems. To characterize users in a non technology related way, in addition to gender, we decided to use personality traits by means of the NEO five-factor inventory by Costa and McCrae [3].

Summing up, the overall research questions we are trying to answer are as follows:

– How does the use of input modalities for selections depend on the user's gender?

– How does the use of input modalities for selections depend on the user's personality traits?

– How does the use of input modalities for selections depend on the user's affinity towards technical systems?

– How does the use of input modalities for selections depend on the way the task and the selectable items are presented by the system?

The specific operationalizations of these questions are given in the corresponding result sections of this paper.

With respect to state of the art technologies and the increasing success of natural ways of interaction offered by touch screens, gesture, and voice recognizers, we decided to include only those ways of interaction that get along without auxiliary devices the user has to operate. This means we investigate those input modalities that are natural in human communication in the form of speech, gestures, and touch. As well-known output modalities of the system, visual display and speech output are employed.

4 Experimental design

The study was performed on a stationary wall-mounted system, as these kinds of systems will presumably be among the first to be equipped with sensory technologies enabling the intended natural interaction. As an application scenario, one could think of public information displays for floor plans, where guests can read up on the different facilities of a building, or interactive advertising in shop windows. At home, there could be an interactive TV set or a home automation system. Basically, all scenarios involving an interactive wall-mounted display have selection as a fundamental task to be performed.

In order to minimize effort and to avoid the risk of error-prone sensory systems, a Wizard of Oz scenario was implemented, where an operator in a separate control room interpreted all user inputs utilizing audio and video transmission from the interaction scene (see Fig. 1). Using a remote desktop connection to the test system, an interactive Microsoft PowerPoint presentation was controlled by the operator as a response to observed user actions. Figure 2 shows a user in front of the system as he performs a selection via a pointing gesture. When asked after the study, none of the subjects reported suspicion of a human operator behind the system behavior, proving the technical feasibility of the experimental design.

4.1 Tasks

The selection tasks to be performed consisted of graphical icons of apples as a simple and neutral abstraction of selectable items. To allow for long distance interaction, the tasks were spread over two screens and the participants were free to move in space. There were 23 selection tasks in total, with varied amounts of items, arrangements, and presentation styles (see Table 1, independent variables). Figure 3 shows some examples of actual selection tasks. Note that the task description, whether textual, spoken, or mixed, only stated to select an apple (e.g. "Select one, please!"), not which one. This way, the participants were totally free in which apple to select (but exactly one) and how to perform the selection. This differs from studies that focus on the performance of visual search (e.g. [10]), as they usually let participants search and select a pre-specified object as fast as possible. In contrast to that, we wanted to identify the input modality (or combination) that was most naturally chosen by participants.

Fig. 1 An operator in the control room observing the user interaction to trigger the system's reaction

Fig. 2 A user in front of the test system performing a selection using a pointing gesture

In order to avoid the monotony of 23 selections in a row and to simultaneously record facial expressions and posture data for research colleagues, there were 27 other tasks not in the focus of this paper. These consisted of different riddles and puzzles like mathematical word problems, picture puzzles, or "find the hidden number" pictures, as shown in Fig. 4. Additionally, these other tasks should help to distract subjects from the selection tasks, so that they do not reflect too much on the way they perform them. All tasks were presented in randomized order to avoid sequence effects.

4.2 Participants

The participants were acquired through notices posted on campus and were paid for their attendance. Overall, 53 volunteers took part in the study, with an average age of 21.9 and a standard deviation of 2.4 years. There were 27 female and 26 male participants, mostly students with no background in computer science (47.2 % medicine, 13.2 % mathematics, 9.4 % biology/chemistry, 7.5 % physics, 7.5 % mathematical biometrics, and 15.2 % others). All of them were native speakers.

Table 1 Acquired variables used for the statistical analysis

Independent variables (task related)
  Task description for the user: Purely textual (graphic output); Purely linguistic (speech output); Mixed
  Labeling of objects: None; Numerical (e.g. "1", "2"); Alphabetical (e.g. "A", "B"); Textual (e.g. "Granny Smith")
  Diverseness of objects: Identical; Partially identical; Different
  Number of objects: 2; 3; 4; ≥ 5
  Arrangement of objects: Horizontal; Vertical; Diagonal; Like dice pips; Random
  Color mode of objects: Black-and-white; Colored
  Distinguishability of objects: Discriminable; Not discriminable

Subject variables (user related)
  Gender: Male; Female
  Age: 0–99
  Field of study: Medicine, biology, mathematics etc.
  Neuroticism: 0–48 (NEO-FFI)
  Extraversion: 0–48 (NEO-FFI)
  Openness: 0–48 (NEO-FFI)
  Agreeableness: 0–48 (NEO-FFI)
  Conscientiousness: 0–48 (NEO-FFI)
  Excitement: 1.0–5.0 (TA-EG)
  Competence: 1.0–5.0 (TA-EG)
  Negative attitude: 1.0–5.0 (TA-EG)
  Positive attitude: 1.0–5.0 (TA-EG)

Dependent variables (interaction related)
  Modality of selection: Speech; Gesture; Touch; Speech + gesture; Speech + touch; Gesture + touch; Speech + gesture + touch
  Multimodality of selection: Unimodal; Multimodal


Fig. 3 Eight out of the 23 selection tasks, differing in the number of items, arrangement, and presentation styles. The lower right is an example for a selection task spread over two screens

Fig. 4 Examples of the 27 other tasks that should distract subjects from the selection tasks: mathematical word problems, spot-the-difference pictures, a find-the-hidden-number task, and a picture puzzle

4.3 Experimental procedure

After being greeted by a staff member, the participants were informed about our research consortium and its general goals. It was stated that a novel system was to be tested that should be capable of understanding almost everything the user is doing, due to its large number of sensors (cameras, laser scanners, directional microphones, etc.). Then they were asked to sign a letter of agreement, and completed the standardized questionnaires on personality traits [6] and their technology affinity towards electronic devices [9].

After that, they were equipped with a radio headset microphone and got a brief introduction to the system. They were told that the system would present them with different self-explanatory tasks and that all of the natural interaction modalities were feasible, explicitly stating that we do not prescribe their way of interacting with the system.

The test itself began with a greeting by the system and a faked calibration sequence, which introduced the different kinds of possible interactions (speech, touch, pointing gestures, and combinations of these). The participants had to read written commands on the screen, touch on selected areas, and perform pointing gestures while reading written commands. Then, the overall 50 tasks (including the 23 selection tasks) as described in Sect. 4.1 were presented in randomized order. Regarding the selections, the operator had clear instructions on how to react upon user actions. Selections were triggered by a direct touch of the user on the object, by discriminating it verbally (e.g. "the red one", "the upper one", etc.), or by performing a pointing gesture on an apple (clearly remaining on one object or triggering the selection by an additional utterance like "this one"). When an apple was selected in such a way, the operator highlighted it on screen and a confirmative sound was played, before proceeding to the next task. In case the operator was not sure about the selected item or just did not understand the participant, several predefined sentences could be spoken to the user via a text-to-speech system, e.g. "Have I marked the right object?" or "I could not understand you. Please speak louder".

When the test was over, the staff member re-entered the room and instructed the participants to fill out additional questionnaires and to give their demographical data. Finally, they were informed about the Wizard of Oz procedure and asked to maintain confidentiality until the end of the entire study.

4.4 Data recording

For the purpose of this study the following data was recorded:

– Log files: the actions triggered by the wizard were written into a log file containing the resulting task order and all system events (as reactions to the observed subject behavior) in chronological order.

– Video recordings: all sessions were observed by two cameras, one in front of the object, and another from an overhead perspective behind the test subject.

– Audio recordings: two audio streams were recorded, one from the lightweight radio headset microphone, and another from two directional microphones covering the area in front of the system.

– Questionnaires: for the personality traits, participants completed the German version [6] of the NEO Five-Factor Inventory (NEO-FFI) by Costa and McCrae [3], resulting in scores for Neuroticism, Extraversion, Openness, Agreeableness, and Conscientiousness. For measuring their affinity towards technical devices, they filled out the TA-EG (technology affinity for electronic devices) questionnaire [9], resulting in scores for Excitement, Competence, as well as Negative and Positive Attitude.

4.5 Acquired variables and labeling of data

Table 1 gives a complete overview of all acquired variables together with their possible values. To obtain the values for the modality of a selection, each subject's recordings (audio and video) were labeled using the video annotation tool ANVIL [12]. As a result, each selection performed by a participant got one of the following labels:

– Speech, e.g. just saying: "The rightmost one"
– Gesture, e.g. just pointing at an object
– Touch, e.g. just touching the object of choice
– Any multimodal combination of the above, e.g. Speech + Gesture when using deictic references like "this one" together with a pointing gesture.

4.6 Statistical analysis

All statistical analysis was conducted with IBM SPSS® Statistics. Since the requirements for parametric tests were often not fulfilled, due to the fact that most of the recorded variables were measured on nominal or ordinal scales, and normal distribution was often not given in the data, solely non-parametric tests were applied. This not only avoided a mix-up of different statistical methods, but also simplified computations and minimized the risk of inappropriate use. Although the statistical power of non-parametric tests can be lower compared to their parametric counterparts, no significant differences in the results were to be expected. Regarding our research questions, three different tests were chosen. Dependencies between the used input modalities and the way selections are presented were calculated using the Wilcoxon matched-pairs signed-rank test. For correlations between the input modalities and personality traits or technology affinity, Spearman's rank correlation coefficient was used. Significance of gender differences was analyzed with Mann-Whitney U tests. The applied level of significance was 5 % across all tests.
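To make the choice of tests concrete, the following minimal Python sketch shows how the three non-parametric tests could be applied with SciPy. The data layout and the column names (a hypothetical per-subject table with questionnaire scores and aggregated modality usage) are assumptions for illustration, not the original SPSS analysis.

import pandas as pd
from scipy import stats

ALPHA = 0.05  # significance level applied across all tests

# Hypothetical per-subject table: one row per participant with gender,
# NEO-FFI / TA-EG scores and aggregated modality usage per subject.
subjects = pd.read_csv("subjects.csv")

# Gender differences (independent samples): Mann-Whitney U test
male = subjects.loc[subjects["gender"] == "m", "touch_pct"]
female = subjects.loc[subjects["gender"] == "f", "touch_pct"]
u, p = stats.mannwhitneyu(male, female, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}, significant: {p < ALPHA}")

# Personality trait or technology affinity vs. modality use:
# Spearman's rank correlation coefficient
rho, p = stats.spearmanr(subjects["extraversion"], subjects["speech_count"])
print(f"Spearman rs = {rho:.3f}, p = {p:.3f}")

# Presentation conditions (same subjects in both conditions):
# Wilcoxon matched-pairs signed-rank test on per-user percentages
stat, p = stats.wilcoxon(subjects["touch_pct_textual"], subjects["touch_pct_spoken"])
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")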

4.7 Limitations

The chosen experimental design described above has some limitations we are aware of, as described in the following paragraphs.

Study design issues: regarding the selection tasks in retrospect, their assortment was not optimal. Initially, the dependencies on the presentation style were to be examined in more detail, e.g. what is the influence of the number of selection objects and their arrangement. Due to the limited number of selection tasks, however, it sometimes lacked the necessary pairs that only differ in the aspect of interest. So only a few aspects of presentation style (task description, labeling, and distinguishability) could be examined.

Instead of recording dependent samples by having all subjects perform all selection tasks, the analysis would have been easier if subjects had been divided into separate groups. This would have enormously increased the necessary number of subjects, though.

Embedding the selection tasks into the others indeed prevented repetitive tasks and distracted subjects from the way of selection. On the other hand, some influence on the way of interaction cannot be excluded, although we tried to minimize this influence by an even ratio of tasks to be performed by speech and touch.

Issues of the participant demographics: with an average age of 21.9 years and a standard deviation of 2.4 years, the participants are quite young. Also, the educational background is very homogeneous, with nearly half of them being medical students. These demographics yield accurate results for this specific group with little danger of producing over-generalizations. Nevertheless, statistical inference on the parent population of users seems difficult, because interaction with technical systems certainly depends on the habits and daily experience of the observed population. Additionally, the validity period of such a user study's results is always limited, as technology progresses and the market penetration of standards changes. It is reasonable to assume, for instance, that nearly all participants of the study at hand have experience with touch screen devices like smartphones, whereas familiarity with natural voice interaction can still be regarded as quite limited.

5 Results

With 53 participants, data from 1,219 selections were available for analysis, although for the testing of specific conditions (regarding presentation style) only a subset of matched pairs was used.

The core issue of our analysis deals with the percentage use of input modalities for the selection of objects under different conditions. Overall, the subjects show a clear preference for touch input, followed by speech and pointing gesture inputs. Multimodal inputs are used very rarely. Figure 5 shows the percentages of all modalities used during the study.

The following sections describe our findings corresponding to the four research questions stated in Sect. 3.

5.1 The role of gender

Before studying the influence of gender on the use of input modalities, we first describe gender-specific differences in the NEO-FFI and TA-EG values.

Fig. 5 Overall percentages of the modalities used for all 1,219 selections performed during the study: touch 63.2 %, speech 21.6 %, gesture 11.2 %, speech + gesture 3.6 %, speech + touch 0.5 %

Fig. 6 NEO-FFI values (0–48) for male and female subjects

Differences in personality traits and technology affinity: Figs. 6 and 7 show the results of the NEO-FFI and TA-EG questionnaires for male and female participants. Throughout all personality traits of the NEO-FFI, females score higher values than males (similar to the German norm [13]), although only the difference in Agreeableness is significant (N = 26 male/27 female, Mann-Whitney U = 206, p = 0.01). Within the technology affinity scores of the TA-EG, the situation is contrary, with males scoring higher than females throughout all values. The male scores for Excitement and Competence are notably higher, with significant differences (N = 26 male/27 female, Mann-Whitney U = 70, p < 0.001 and U = 223, p = 0.02).

Fig. 7 TA-EG values (1.0–5.0) for male and female subjects

Fig. 8 Mean percentages of modality use for male and female subjects performing a selection

Differences in modality use: the percentage use of modalities differs greatly between male and female participants. Figure 8 illustrates these differences. While females highly favor touch interaction to select objects (82.7 % touch, 10.1 % speech, 4.7 % gesture), males tend to use the other modalities far more often (42.8 % touch, 33.4 % speech, 17.9 % gesture). Multimodal ways of input for selection tasks are quite rare for both genders. The mean numbers of touch, speech, and gesture selections differ highly significantly (N = 26 male/27 female, Mann-Whitney U = 152.5, p < 0.001; U = 184, p = 0.002; U = 225.5, p = 0.01).
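As an illustration of the data handling behind such a comparison, the following hypothetical pandas sketch turns a long-format selection log (one row per selection with subject id, gender, and the labeled modality) into per-user percentages and compares the genders with the Mann-Whitney U test. File name and column names are assumptions.

import pandas as pd
from scipy import stats

# Hypothetical long-format log: one row per selection
# columns: subject, gender ("m"/"f"), modality ("touch", "speech", ...)
log = pd.read_csv("selections.csv")

# Per-user percentage of selections made with each modality
per_user = (pd.crosstab(index=[log["subject"], log["gender"]],
                        columns=log["modality"],
                        normalize="index") * 100).reset_index()

# Compare e.g. the touch percentages of male and female subjects
male = per_user.loc[per_user["gender"] == "m", "touch"]
female = per_user.loc[per_user["gender"] == "f", "touch"]
u, p = stats.mannwhitneyu(male, female, alternative="two-sided")
print(f"touch: U = {u:.1f}, p = {p:.4f}")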

5.2 Influence of personality traits

Can personality traits serve as a predictor for modality usage? We tried to answer this question by performing correlation analyses on the results of the NEO-FFI questionnaire. We decided to perform separate analyses for each gender due to the differences in the mean NEO-FFI scores and in modality use. Prior to the correlation calculation, outliers within the NEO-FFI scores were removed from the data (two female subjects, cf. Fig. 6). Regarding the modalities, only those were taken into account that had high enough frequencies of usage, so that they exhibited a real distribution and did not merely consist of outliers. So for male subjects, the touch, speech, and gesture modalities were taken into account, while for females only touch remained in consideration.

Fig. 9 Extraversion as a predictor for the use of speech for male subjects. The regression line shows the positive relationship

Fig. 10 Openness as a predictor for the use of speech for male subjects. The regression line shows the positive relationship

For the all-female group of participants, there is no significant correlation between any of the personality traits and the number of selections via touch.

For the male subjects, there are positive correlations of both Extraversion and Openness with the number of speech uses, which are statistically significant by Spearman's rank correlation (N = 26, rs(24) = 0.429, p = 0.029 for Extraversion; rs(24) = 0.391, p = 0.048 for Openness). Figures 9 and 10 illustrate these ties.


Fig. 11 Competence in technology as a predictor for the use of touch input for female subjects. The regression line shows the positive relationship

5.3 Influence of technology affinity

Similar to the personality traits, we performed gender-specific correlation analyses on the results of the TA-EG questionnaire. Outliers within the TA-EG scores were removed (three male and two female subjects, cf. Fig. 7). For the same reasons as above, the modalities taken into account were touch, speech, and gesture for males, but only touch for females.

This time, no significant correlations exist for male subjects. However, female subjects show a significant positive correlation between Competence and the number of selections via touch (N = 25, rs(23) = 0.421, p = 0.036). At the same time, there is a significant negative correlation between Negative Attitude and the use of touch (N = 27, rs(25) = −0.461, p = 0.016). Figures 11 and 12 illustrate these ties.
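A minimal sketch of such a gender-specific correlation analysis could look as follows; the per-subject table of TA-EG scores and touch-selection counts, its file name, and its column names are assumed for illustration.

import pandas as pd
from scipy import stats

# Hypothetical table of female subjects after outlier removal:
# TA-EG subscale scores plus the number of selections made via touch
females = pd.read_csv("subjects_female.csv")

for scale in ("competence", "negative_attitude"):
    rho, p = stats.spearmanr(females[scale], females["touch_selections"])
    print(f"{scale}: rs = {rho:.3f}, p = {p:.3f}")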

5.4 Influence of presentation

Regarding the influence of the way the task and the selectable items are presented on the use of input modalities, three aspects of presentation are studied: the modality of the task description, the labeling of selection objects, and the distinguishability of the objects.

Task description: the task description was realized as text on the screen, e.g. "Select one, please!", as a verbal system output ("Which apple do you want?"), or in a mixed way. Four tasks (two matched pairs) were selected for the analysis: two of them had a purely spoken task description, the other two had a textual description.

Fig. 12 Negative Attitude towards technology as a predictor for the use of touch input for female subjects. The regression line shows the negative relationship

As illustrated in Fig. 13, touch interaction predominates over the other modalities under both conditions. Nevertheless, the mean percentage of selections per user via touch decreases from 61.3 % for the textual description to 52.8 % for the spoken one. At the same time, speech interaction percentages increase from 22.6 to 34.9 %. The Wilcoxon test reveals that these differences are significant (N = 53, speech: Z = −2.71, p = 0.007; touch: Z = −2.18, p = 0.029). The other modalities, gesture and multimodal ones, show no significant changes. A per-gender analysis shows no significances, due to the sample sizes becoming too small when only analyzing four tasks.
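A sketch of this paired comparison, under the assumption of a per-user table holding the percentages of touch and speech selections computed separately over the matched textual and spoken tasks (file and column names hypothetical), could look like this:

import pandas as pd
from scipy import stats

# Hypothetical per-user percentages computed only over the matched
# task pairs per condition (textual vs. spoken task description)
cond = pd.read_csv("task_description_conditions.csv")

for modality in ("touch", "speech"):
    stat, p = stats.wilcoxon(cond[f"{modality}_pct_textual"],
                             cond[f"{modality}_pct_spoken"])
    print(f"{modality}: Wilcoxon statistic = {stat:.1f}, p = {p:.3f}")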

Labeling of selection objects: to investigate whether the presence or absence of textual labels on the selection objects has an influence on the use of input modalities, four tasks (two matched pairs that offered the same apples) were selected for analysis. Two tasks contained visible labels (one with alphabetical labels and one with textual labels), while the other two showed no labels.

The average percentages of used selection modalities for both conditions do not significantly differ from each other. In both cases, touch interaction clearly predominates over the other modalities in usage percentages, with 63.2 % in the labeled and 65.1 % in the unlabeled condition, respectively. With 22.6 and 19.8 % usage, the values for the speech modality differ only slightly, likewise for the gesture modality with 11.3 and 10.4 %. Again, the usage of the other modalities and multimodal combinations is only marginal, and a per-gender analysis gives no further insights.

Fig. 13 Mean percentages of used selection modalities per user in textual and spoken task description conditions (textual: 61.3 % touch, 22.6 % speech, 13.2 % gesture, 2.8 % others; spoken: 52.8 % touch, 34.9 % speech, 9.4 % gesture, 2.8 % others)

Distinguishability of selection objects: to test whether the distinguishability of selection objects has any influence on the user's applied modality, we chose four tasks with discriminable objects, and four tasks with objects hard to differentiate. While the former contained (any kind of) labels and different apples, the latter consisted of unlabeled identical apples or different apples presented in black-and-white.

As with the labeling, subjects show no significant change in modality use. However, the use of speech increases slightly from 20.8 % (not discriminable) to 24.1 % (discriminable). The overall usage distributions are as follows: 63.2 % touch, 20.8 % speech, 12.7 % gesture, and 3.3 % others for the tasks with not discriminable objects. For the tasks with discriminable objects the results are: 61.8 % touch, 24.1 % speech, 11.3 % gesture, and 2.8 % others. A per-gender analysis does not reveal any significance, either.

6 Discussion

The results on the overall use of modalities show nearly no use of multimodal inputs and quite a strong predominance of touch input during selection tasks. This could have multiple reasons. First of all, the selections in our study are very simple, probably too simple to promote multimodal inputs. This is concordant with the findings in [8] and [17] that multimodality is promoted by the complexity and cognitive load of the task. Another factor leading to such high usage rates of touch input may be the participants' demographics. All participants were quite young and are used to touch screen technology, as some of the subjects reported after the study. Lastly, the (perceived) level of privacy during the study could be an influencing factor that perhaps prevented some users from using speech or pointing gestures (cf. [21]).

Although subjects were on their own during the interaction, we cannot rule out that the knowledge about the recordings and the clearly visible technical equipment led to some feeling of surveillance, similar to situations where strangers are present. It is worth noting that our findings strengthen some of Oviatt's ten myths of multimodal interaction, e.g. "Myth #1: if you build a multimodal system, users will interact multimodally." and "Myth #4: speech is the primary input mode in any multimodal system that includes it." [16].

Regarding the role of gender, a remarkable influence can be attested. Wasinger and Krüger's findings from retrospective interviews [21], that women feel less comfortable using speech, are confirmed by the real interactions in our study. While 63 % of the female subjects never used speech, this was the case for only 27 % of the males. Additionally, women's reliance on touch was much greater than that of males. While 83 % of all female selections were done by touch, males performed more selections by speech and gesture (51 % altogether) than by touch (43 %). This shows that males tend to use different input possibilities much more often than females do. This imposes special considerations when designing adaptive and user-centered interactive systems.

Personality traits can be a predictor of modality use, as the significant correlations for male subjects demonstrate. Both Extraversion and Openness promote the use of speech to select objects. The reason for this could be an increased willingness to test a system and the available input modalities, similar to the findings of Reis et al. [19]. Such an obvious tie could not be found for female subjects, as their speech usage was too rare.

Instead, only women show interesting correlations with the Competence and Negative Attitude scores of the TA-EG questionnaire. While Competence slightly increases the use of touch input, Negative Attitude clearly reduces it. The positive correlation of Competence and touch input could be a consequence of practice. It is comprehensible that women familiar with using state of the art technology like touch-enabled smartphones judge themselves as being more competent with electronic devices. The negative correlation for Negative Attitude could stem from a reduced willingness of women to directly interact with something they do not like by physically touching it.

Regarding the presentation, we could show that speech output (a verbal task description) promotes the use of speech by the user, too. This corresponds to Bellik et al.'s findings [1]. In contrast to the results of De Angeli et al. [4], we could not find a significant influence of the presence of labels on the interaction, although the tasks of both studies are not directly comparable. Generally speaking, the presentation played a surprisingly small role for the choice of input modalities during selection tasks. Especially the distinguishability of objects did not promote the use of speech in a significant way, even though discriminable objects make a verbal description very easy.

7 Summary and outlook

We presented a Wizard of Oz study on the use of multiple input modalities during selections, as a common and prevalent task. The 53 subjects were unconstrained in their use of input modalities and could interact by using a touch screen, speech, pointing gestures, and any multimodal combination of these. The selections to be performed consisted of different apples on a dual touch screen. The goal was to find out more about the dependencies between the use of input modalities and influencing factors like the users' gender, their personality traits (NEO-FFI), their affinity towards technical devices (TA-EG), and the way tasks and selectable items are presented.

The most significant findings are related to the influence of gender, an influencing factor on modality use that is, to our knowledge, not well studied so far. While females almost exclusively use the touch screen to perform selections, males are much more diverse. Not even half of their inputs are made by touch, about a third is done using speech, and nearly a fifth is performed using pointing gestures. For both genders, multimodal inputs are very rare for such easy selection tasks. Furthermore, we demonstrate that NEO-FFI personality traits (for males) and TA-EG technology affinity scores (for females) can be predictors for modality use. A surprisingly low influence is exerted by the way of selection presentation in our study. Only when the system uses speech output to tell the users that they have to select an object do they show a significant, but small, increase in using speech, too.

Our findings can not only be helpful during the design of interactive computing systems, they can also be applied to user-adaptive systems. While the assessment of user characteristics like personality traits is out of the scope of this paper, there are approaches for automatic recognition from verbal cues that could be used at runtime [14]. A fission process that arbitrates output modalities can be enhanced by knowledge about the gender or individual characteristics of the user to compose a more expedient system output. When a female user with a positive attitude operates a system, the system could avoid speech input by providing a graphical interface that facilitates touch input, to play to the preferred input mode of women. Considering multi-user scenarios in a well-equipped environment, another benefit may result from intelligent device allocation for different user types. Within a multi-party interaction, the system may try to address a male user via speech in order to offer an exclusively available touchpad to the female participant. Such knowledge can also be used in the fusion process of a system, of course. Knowing what type of user is currently performing inputs could help avoid misunderstandings and resolve conflicts between multiple input modalities. When there is a conflicting input between a pointing gesture and a verbal utterance, this conflict could be resolved by relying on the speech input if there is a male user with a high score in Openness, for example. The findings of this study could also be combined with machine learning techniques as described in [7] for multimodal integration patterns, to enhance robustness and reduce error rates of multimodal fusion. Furthermore, an intelligent scheduling of system resources could reduce processing time and memory consumption.

However, the presented study is just a small contribution to shedding light on the role that users and their individual characteristics play in influencing the way people interact with systems. Some details of the presented study are still subject to further analysis, e.g. task completion times for different modalities, and modality shifts during the course of the study. The studied task of selection is just one of multiple basic interactions which are offered by today's systems and of which they are composed. Further basic tasks that could be examined are confirmation, query, scrolling, entering values, and triggering actions, to name only some. Additionally, there are a lot more characteristics of a user not explicitly focused on in this study that could be of importance, e.g. age, expertise, level of education, and so on.

The long-term goal of this research is to empirically measure the essential influencing factors on multimodal interaction during basic tasks, to lay the foundations for predicting user behavior in more complex scenarios and across different domains.

Acknowledgments The authors would like to thank Miriam Schmidt for fruitful support during the setup of this study, Steffen Walter for assistance with data collection and scoring, as well as Thilo Hörnle and Peter Kurzok for technical assistance. This work originated in the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG).

References

1. Bellik Y, Rebaï I, Machrouh E, Barzaj Y, Jacquet C, Pruvost G, Sansonnet JP (2009) Multimodal interaction within ambient environments: an exploratory study. In: Proceedings of the 12th IFIP TC 13 international conference on human-computer interaction: part II, INTERACT '09. Springer, Berlin, pp 89–92. doi:10.1007/978-3-642-03658-3-13

2. Chelazzi L (1999) Serial attention mechanisms in visual search: a critical look at the evidence. Psychol Res 62:195–219

3. Costa PT, McCrae RR (1992) Professional manual: revised NEO personality inventory (NEO-PI-R) and NEO five-factor inventory (NEO-FFI). Psychological Assessment Resources

4. De Angeli A, Gerbino W, Cassano G, Petrelli D (1998) Visual display, pointing, and natural language: the power of multimodal interaction. In: Proceedings of the working conference on advanced visual interfaces, AVI '98. ACM, New York, NY, USA, pp 164–173. doi:10.1145/948496.948519

5. Frank AU (1998) Formal models for cognition—taxonomy of spatial location description and frames of reference. In: Freksa C, Habel C, Wender KF (eds) Spatial cognition. Springer, Berlin, pp 293–312

6. Gerhard U, Borkenau P, Ostendorf F (1993) NEO-Fünf-Faktoren Inventar (NEO-FFI) nach Costa und McCrae. Zeitschrift für Klinische Psychologie und Psychotherapie 28(2):145–146. doi:10.1026/0084-5345.28.2.145

7. Huang X, Oviatt SL, Lunsford R (2006) Combining user modeling and machine learning to predict users' multimodal integration patterns. In: Renals S, Bengio S, Fiscus JG (eds) Machine learning for multimodal interaction, 3rd international workshop, MLMI 2006, Bethesda, MD, USA, 1–4 May 2006, revised selected papers. Lecture notes in computer science, vol 4299/2006. Springer, Berlin, pp 50–62. doi:10.1007/11965152-5

8. Jöst M, Häußler J, Merdes M, Malaka R (2005) Multimodal interaction for pedestrians: an evaluation study. In: Proceedings of the 10th international conference on intelligent user interfaces, IUI '05. ACM, New York, NY, USA, pp 59–66. doi:10.1145/1040830.1040852

9. Karrer K, Glaser C, Clemens C, Bruder C (2009) Technikaffinität erfassen – der Fragebogen TA-EG. Der Mensch im Mittelpunkt technischer Systeme, 8. Berliner Werkstatt Mensch-Maschine-Systeme, 7. bis 9. Oktober 2009 (ZMMS Spektrum) 22(29):196–201

10. Kieffer S, Carbonell N (2007) How really effective are multimodal hints in enhancing visual target spotting? Some evidence from a usability study. CoRR abs/0708.3575. http://dblp.uni-trier.de/db/journals/corr/corr0708.html

11. Kieffer S, Carbonell N (2007) Oral messages improve visual search. CoRR abs/0709.0428

12. Kipp M (2001) Anvil—a generic annotation tool for multimodal dialogue. In: Proceedings of EUROSPEECH 2001, Aalborg, Denmark, pp 1367–1370

13. Körner A, Drapeau M, Albani C, Geyer M, Schmutzer G, Brähler E (2008) Deutsche Normierung des NEO-Fünf-Faktoren-Inventars (NEO-FFI). Zeitschrift für Medizinische Psychologie 17(2–3):133–144

14. Mairesse F, Walker MA, Mehl MR, Moore RK (2007) Using linguistic cues for the automatic recognition of personality in conversation and text. J Artif Intell Res 30:457–501

15. Mignot C, Valot C, Carbonell N (1993) An experimental study of future "natural" multimodal human-computer interaction. In: INTERACT '93 and CHI '93 conference companion on human factors in computing systems, CHI '93. ACM, New York, NY, USA, pp 67–68. doi:10.1145/259964.260075

16. Oviatt S (1999) Ten myths of multimodal interaction. Commun ACM 42(11):74–81

17. Oviatt S, Coulston R, Lunsford R (2004) When do we interact multimodally? Cognitive load and multimodal communication patterns. In: Proceedings of the 6th international conference on multimodal interfaces, ICMI '04. ACM, New York, NY, USA, pp 129–136. doi:10.1145/1027933.1027957

18. Ratzka A (2008) Explorative studies on multimodal interaction in a PDA- and desktop-based scenario. In: Proceedings of the 10th international conference on multimodal interfaces, ICMI '08. ACM, New York, NY, USA, pp 121–128. http://doi.acm.org/10.1145/1452392.1452416

19. Reis T, de Sá M, Carriço L (2008) Multimodal interaction: real context studies on mobile digital artefacts. In: Pirhonen A, Brewster S (eds) Haptic and audio interaction design. Lecture notes in computer science, vol 5270. Springer, Berlin, pp 60–69. doi:10.1007/978-3-540-87883-4-7

20. Ren X, Zhang G, Dai G (2000) An experimental study of input modes for multimodal human-computer interaction. In: Tan T, Shi Y, Gao W (eds) Advances in multimodal interfaces, ICMI 2000. Lecture notes in computer science, vol 1948. Springer, Berlin, pp 49–56. doi:10.1007/3-540-40063-X-7

21. Wasinger R, Krüger A (2006) Modality preferences in mobile and instrumented environments. In: Proceedings of the 11th international conference on intelligent user interfaces, IUI '06. ACM, New York, NY, USA, pp 336–338. doi:10.1145/1111449.1111529
